**This colab notebook contains scripts for ad-hoc analysis. Every analysis part is prefixed with a description**

**Analysis of ratios of data points (TN/TP/SP/....)**


This section reports how the data points are distributed across the categories (TP, TN, FP, FN, SP) and summarizes the resulting metrics (for example, how many data points are TP and what the average F1 score is).

In [None]:
import pandas as pd
from datetime import datetime, timedelta
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
extension = '/content'
selected_sig_df = pd.read_csv(extension + '/more_than_10_alert_summaries_speedometer3_tp6.csv', index_col=False)
stat_df = pd.read_csv(extension + '/alert_status_summary.csv', index_col=False)

In [None]:
display(stat_df.head(5))

In [None]:
stat_df['signature_id'] = stat_df['file_name'].str.extract(r'(\d+)')
stat_df['TN_SP'] = stat_df['TN'] + stat_df['SP']
stat_df['total_num'] = stat_df['TN'] + stat_df['SP'] + stat_df['FP'] + stat_df['FN'] + stat_df['TP']

In [None]:
# The False Positives are composed of the alerts with summaries that are false positive and the alerts with summaries that are still processing
stat_df['Precision'] = stat_df['TP'] / (stat_df['TP'] + (stat_df['FP'] + stat_df['SP']))
stat_df['Recall'] = stat_df['TP'] / (stat_df['TP'] + stat_df['FN'])
stat_df['F1_Score'] = 2 * (stat_df['Precision'] * stat_df['Recall']) / (stat_df['Precision'] + stat_df['Recall'])

stat_df['Precision_SP_is_TP'] = (stat_df['TP'] + stat_df['SP']) / ((stat_df['TP'] + stat_df['SP']) + stat_df['FP'])
stat_df['Recall_SP_is_TP'] = (stat_df['TP'] + stat_df['SP']) / ((stat_df['TP'] + stat_df['SP']) + stat_df['FN'])
stat_df['F1_Score_SP_is_TP'] = 2 * (stat_df['Precision_SP_is_TP'] * stat_df['Recall_SP_is_TP']) / (stat_df['Precision_SP_is_TP'] + stat_df['Recall_SP_is_TP'])
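
As a worked illustration of the two conventions used in the cell above, the following sketch computes both metric sets for a single made-up row of counts. The helper name `prf` and the counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the dataset.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) for the given counts; NaN when undefined."""
    nan = float('nan')
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan
    # NaN is the only value that is not equal to itself
    if precision != precision or recall != recall or (precision + recall) == 0:
        return precision, recall, nan
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical counts for one signature (illustrative only)
TP, FP, FN, SP = 6, 2, 2, 2

# Convention 1: still-processing (SP) summaries count as false positives
sp_as_fp = prf(TP, FP + SP, FN)

# Convention 2: still-processing (SP) summaries count as true positives
sp_as_tp = prf(TP + SP, FP, FN)

print(sp_as_fp)
print(sp_as_tp)
```

With these counts, treating SP as FP gives a precision of 0.6 and a recall of 0.75, while treating SP as TP raises both to 0.8, so the two conventions bracket the outcome once the pending summaries are triaged.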

In [None]:
# Calculate the sum of numerical columns for the full (unfiltered) stat_df
numerical_cols = stat_df.select_dtypes(include=['number']).columns
column_sums = stat_df[numerical_cols].sum()
# stat_df[numerical_cols] = stat_df[numerical_cols].map('{:.2f}'.format)
#pd.set_option('display.float_format', '{:.2f}'.format)
# Display the column sums
print("Sum of numerical columns in stat_df:")
display(column_sums)

The following graphs show the distribution of the number of alerts per alert summary, grouped by the status of the alert summary. The main observation is that alert summaries classified as FN are generally associated with a single alert. (By definition, every alert summary containing at least one manually created alert is classified as FN, and all of its alerts inherit that status.) The graphs also show the distribution of the metric values across the timeseries signatures.
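
The FN rule described above (a summary with at least one manually created alert is FN, and every alert in it inherits that status) can be sketched on toy data as follows. The column names `summary_id`, `status`, and `manually_created` are hypothetical and do not reflect the dataset's actual schema.

```python
import pandas as pd

# Toy data: column names here are hypothetical, not the dataset's schema
alerts = pd.DataFrame({
    'summary_id':       [1, 1, 2, 3, 3],
    'status':           ['TP', 'TP', 'FP', 'TP', 'TP'],
    'manually_created': [False, False, False, True, False],
})

# True for every alert whose summary contains at least one manual alert
in_manual_summary = alerts.groupby('summary_id')['manually_created'].transform('any')

# Those alerts inherit FN; all others keep their original status
alerts['status'] = alerts['status'].mask(in_manual_summary, 'FN')
print(alerts)
```

Here both alerts of summary 3 end up as FN, even though only one of them was created manually.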

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

cols = ['TP', 'TN', 'FP', 'FN', 'Precision', 'Recall', 'F1_Score', 'Precision_SP_is_TP', 'Recall_SP_is_TP', 'F1_Score_SP_is_TP']
for col in cols:
  sns.histplot(stat_df[col], kde=True, color='blue', bins=10)

  # Add labels and title
  plt.title('Distribution of values of column ' + col)
  plt.xlabel('Values')
  plt.ylabel('Count')

  # Show the plot
  plt.show()


In [None]:
selected_sig_list = selected_sig_df['test_series_signature_id'].unique()
selected_sig_list = list(map(str, selected_sig_list))
filtered_stat_df = stat_df[stat_df['signature_id'].isin(selected_sig_list)]
numerical_cols = filtered_stat_df.select_dtypes(include=['number']).columns
column_sums = filtered_stat_df[numerical_cols].sum()

# Display the column sums
print("Sum of numerical columns in filtered_stat_df:")
display(column_sums)

In [None]:
columns_to_average = ["Precision", "Recall", "F1_Score", 'Precision_SP_is_TP', 'Recall_SP_is_TP', 'F1_Score_SP_is_TP']
average_values = stat_df[columns_to_average].mean()
display(average_values)

In [None]:
selected_sig_list = selected_sig_df['test_series_signature_id'].unique()
selected_sig_list = list(map(str, selected_sig_list))

In [None]:
filtered_stat_df = stat_df[stat_df['signature_id'].isin(selected_sig_list)]
display(filtered_stat_df.head(5))

The following table shows the average metric values for the sample dataset used to generate the predictions.

In [None]:
filtered_average_values = stat_df[stat_df['signature_id'].isin(selected_sig_list)][columns_to_average].mean()
display(filtered_average_values)

**Visualizing one timeseries**


The following part generates the graphical visualizations of the Mozilla prediction, the baseline, and the CPD predictions for one given timeseries signature. The files needed for the visualization are as follows:
- *The CSV for the timeseries* : The CSV file containing the labeled data points (TN, FN, TP, ....). It can be found in the `data` directory of the GitHub project
- *The summary file* : After generating the predictions with TCPDBench, we get the summary file for the timeseries signature we would like to visualize. This file contains the change point locations predicted by the CPD methods used

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import matplotlib.pyplot as plt
import os

In [None]:
def display_sample(dataf, sig_id, only_true=True, custom_indices=None):
    sample_df = dataf.copy()

    #plt.figure(figsize=(12, 8))
    plt.figure(figsize=(20, 10))
    color_mapping = {
        'TP': 'green',
        'FP': 'red',
        'SP': 'grey',
        'TN': 'blue',
        'FN': 'yellow'
    }

    # Statuses that get a vertical marker line in addition to their data point
    if only_true:
        line_idx = ['TP', 'FN']
    else:
        line_idx = ['TP', 'FN', 'FP', 'SP']

    for idx, row in sample_df.iterrows():
        color = color_mapping.get(row['alert_status_general'])
        plt.plot(idx, row['value'], marker='o', markersize=8, color=color, alpha=0.6)

        # Add vertical line corresponding to each data point of interest
        if row['alert_status_general'] in line_idx:
            plt.axvline(x=idx, color=color, linestyle='--', alpha=0.4)

    # Draw the predicted change point locations once, outside the row loop
    # (inside the loop they were redrawn for every row and shadowed idx)
    if custom_indices:
        for cp_idx in custom_indices:
            plt.axvline(x=cp_idx, color='purple', linestyle='--', alpha=0.6)

    plt.title('Time Series Plot')
    plt.xlabel('Date')
    plt.ylabel(f'Test measurement values associated with signature ID {sig_id}')
    #plt.grid(True)
    plt.grid(axis='y')
    plt.xlim(sample_df.index.min(), sample_df.index.max())
    y_min = 0
    y_max = sample_df['value'].max() * 1.5
    plt.ylim(bottom=y_min, top=y_max)
    start_date = sample_df.index.min()
    end_date = sample_df.index.max()
    weekly_ticks = pd.date_range(start=start_date, end=end_date, freq='W-MON')
    plt.xticks(weekly_ticks, rotation=45)
    plt.show()

In [None]:
import json
with open('/content/summary_4361184.json', 'r') as file:
    data = json.load(file)['results']
df_dict = dict()
pred_data = []
for i in data:
  # Keep only the best-scoring successful run for each algorithm
  max_f1 = -1
  entry_dict = dict()
  for j in data[i]:
    if j['status'] == 'SUCCESS' and j['scores']['f1'] > max_f1:
      entry_dict = {
          'algorithm': i,
          'cplocations': j['cplocations'],
          'f1': j['scores']['f1'],
      }
      max_f1 = j['scores']['f1']
  # Append once per algorithm, and only if at least one run succeeded
  if entry_dict:
    pred_data.append(entry_dict)
pred_df = pd.DataFrame(pred_data)
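
The parsing cell above reads a `results` object keyed by algorithm, where each value is a list of runs carrying `status`, `scores` (with an `f1` entry), and `cplocations`. The sketch below shows one way to keep a single best successful run per algorithm on an in-memory stand-in for that structure. The algorithm names, the `FAIL` status value, and all numbers are made up for illustration.

```python
import pandas as pd

# Toy stand-in for json.load(file)['results']; all values are made up
results = {
    'algo_a': [
        {'status': 'SUCCESS', 'scores': {'f1': 0.40}, 'cplocations': [10, 55]},
        {'status': 'SUCCESS', 'scores': {'f1': 0.65}, 'cplocations': [12, 50]},
        {'status': 'FAIL'},
    ],
    'algo_b': [
        {'status': 'FAIL'},
    ],
}

rows = []
for algorithm, runs in results.items():
    successful = [run for run in runs if run['status'] == 'SUCCESS']
    if not successful:
        continue  # no usable run for this algorithm
    # max() returns the first run with the highest F1 score
    best = max(successful, key=lambda run: run['scores']['f1'])
    rows.append({
        'algorithm': algorithm,
        'cplocations': best['cplocations'],
        'f1': best['scores']['f1'],
    })

best_df = pd.DataFrame(rows)
print(best_df)
```

Algorithms without any successful run are skipped, so the resulting frame has at most one row per algorithm.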

In [None]:
main_dir = '/content'
#for csv_file in os.listdir(main_dir):
#for csv_file in ['3869261_timeseries_data.csv', '4361184_timeseries_data.csv']:
for csv_file in ['4361184_timeseries_data.csv']:
    if not csv_file.endswith('.csv'):
        continue
    csv_path = os.path.join(main_dir, csv_file)
    df = pd.read_csv(csv_path)
    sig_id = csv_file.split('_')[0]
    display_sample(df, sig_id, False)

In [None]:
for index, row in pred_df.iterrows():
    df = pd.read_csv(csv_path)
    sig_id = '4361184'
    print(row['algorithm'])
    print(row['f1'])
    display_sample(df, sig_id, True, row['cplocations'])

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df = pd.read_csv('/content/2_rectified_alerts_data.csv')
df.head(5)

In [None]:
alert_status_mapping = {
    0: "untriaged",
    1: "downstream",
    2: "reassigned",
    3: "invalid",
    4: "improvement",
    5: "investigating",
    6: "wontfix",
    7: "fixed",
    8: "backedout"
}
test_status_mapping = {
    0: "untriaged",
    1: "downstream",
    2: "reassigned",
    3: "invalid",
    4: "acknowledged"
}
category_mapping = {
    'investigating': 'SP',
    'reassigned': 'TP',
    'invalid': 'FP',
    'improvement': 'TP',
    'fixed': 'TP',
    'wontfix': 'TP',
    'untriaged': 'SP',
    'backedout': 'TP',
    'downstream': 'TP',
    'acknowledged': 'TP',
}
df['alert_status_general'] = df['alert_status'].map(alert_status_mapping)
df["alert_status_general"] = df["alert_status_general"].replace(category_mapping)
df['test_status_general'] = df['test_status'].map(test_status_mapping)
df["test_status_general"] = df["test_status_general"].replace(category_mapping)

In [None]:
df.loc[df['test_manually_created'] == True, 'test_status_general'] = "FN"
df.loc[df['test_manually_created'] == True, 'alert_status_general'] = "FN"

In [None]:
df_processed_1 = df[['test_id', 'test_status_general']].drop_duplicates()
df_processed_2 = df[['test_id', 'alert_status_general']].drop_duplicates()

The following stats show the number of data points per category, first by alert status (under test_status_general) and then by alert summary status (under alert_status_general).

In [None]:
print(df_processed_1['test_status_general'].value_counts())
print(df_processed_2['alert_status_general'].value_counts())

The following graphs show the distribution of the number of alerts per alert summary, grouped by alert summary status. The main observation is that alert summaries with at least one manually created alert tend to have exactly one alert, unlike alert summaries with a status of TP or FP.

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
def display_hist(dataf, arg):
  counts = dataf.groupby('alert_id')['test_id'].nunique()
  '''
  log_counts = np.log1p(counts)
  sns.histplot(log_counts, kde=True, bins=len(counts), color='blue')
  plt.title('Distribution of Unique Values of alert IDs per alert summary ID')
  plt.xlabel('Number of Unique Values of alert IDs')
  plt.ylabel('Density')
  plt.show()
  '''
  bins = [0, 1, 2, 3, 4, 5, 10, float('inf')]
  labels = ['1', '2', '3', '4', '5', '6-10', '11+']
  binned_counts = pd.cut(counts, bins=bins, labels=labels, right=True)
  binned_counts_distribution = binned_counts.value_counts(sort=False)
  sns.barplot(x=binned_counts_distribution.index, y=binned_counts_distribution.values, color='blue')
  plt.title('Distribution of Unique Values of alert IDs per alert summary ID (' + arg + ')')
  plt.xlabel('Number of Unique Values of alert IDs')
  plt.ylabel('Number of alert summaries')
  plt.show()

alert_alert_summary_distro = df[['test_id', 'alert_id']]
display_hist(alert_alert_summary_distro, 'general')
alert_alert_summary_distro_fp = df[df['alert_status_general'] == 'FP'][['test_id', 'alert_id']]
display_hist(alert_alert_summary_distro_fp, 'Only False Positive summaries')
alert_alert_summary_distro_tp = df[df['alert_status_general'] == 'TP'][['test_id', 'alert_id']]
display_hist(alert_alert_summary_distro_tp, 'Only True Positive summaries')
alert_alert_summary_distro_fn = df[df['alert_status_general'] == 'FN'][['test_id', 'alert_id']]
display_hist(alert_alert_summary_distro_fn, 'Only False Negative summaries')
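
To make the binning in `display_hist` concrete, the sketch below applies the same `bins` and `labels` to a small made-up series of per-summary alert counts. The values in `counts` are hypothetical; only the bin edges and labels come from the cell above.

```python
import pandas as pd

# Hypothetical number of alerts in each of seven alert summaries
counts = pd.Series([1, 1, 2, 5, 6, 10, 11])

bins = [0, 1, 2, 3, 4, 5, 10, float('inf')]
labels = ['1', '2', '3', '4', '5', '6-10', '11+']

# right=True makes each bin (lower, upper], so 5 -> '5', 6 and 10 -> '6-10'
binned = pd.cut(counts, bins=bins, labels=labels, right=True)
distribution = binned.value_counts(sort=False)
print(distribution)
```

Because the intervals are closed on the right, a summary with exactly 10 alerts falls into `6-10` and one with 11 alerts falls into `11+`.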

The following graph helps with understanding the tradeoff between Precision and Recall for the hyperparameter configurations that provide the best results on average. The required data is a CSV produced by TCPDBench after generating the stats (the CSVs are located under /TCPDBench/analysis/output).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_kcpa = pd.read_csv('/content/metrics_of_best_kcpa.csv')
df_mongodb = pd.read_csv('/content/metrics_of_best_mongodb.csv')
def plot_precision_recall_f1(df):
    # Identify the row with the highest F1 Score
    max_f1_row = df.loc[df['F1 Score'].idxmax()]

    # Plotting
    plt.figure(figsize=(8, 6))
    sns.scatterplot(data=df, x='Recall', y='Precision', s=100)

    # Highlight the point with the highest F1 Score
    plt.scatter(max_f1_row['Recall'], max_f1_row['Precision'], color='red', s=150, label='Max F1 Score')

    plt.xlim(0.15, 0.45)
    plt.ylim(0.15, 0.45)

    # Add plot details
    plt.title('Precision vs Recall')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.grid(True)

    # Remove legend
    plt.legend([], [], frameon=False)

    # Show the plot
    plt.tight_layout()
    plt.show()

    # Display the row with the highest F1 Score
    print("Row with the highest F1 Score:")
    print(max_f1_row)
plot_precision_recall_f1(df_kcpa)
plot_precision_recall_f1(df_mongodb)