data_extraction_transformation/notebooks/Ad_hoc_analysis.ipynb (590 lines of code) (raw):

{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "**This colab notebook contains scripts for ad-hoc analysis. Every analysis part is prefixed with a description**" ], "metadata": { "id": "2Fm3bz3BA6Qh" } }, { "cell_type": "markdown", "source": [ "**Analysis of ratios of data points (TN/TP/SP/....)**\n", "\n", "\n", "This section contains stats about the ratios of categories of the data points as well as the points about their metrics (How many are TP, what is the average F1 score, .....)" ], "metadata": { "id": "SLcu7vzfCT6X" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "from datetime import datetime, timedelta\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', None)\n", "extension = '/content'\n", "selected_sig_df = pd.read_csv(extension + '/more_than_10_alert_summaries_speedometer3_tp6.csv', index_col=False)\n", "stat_df = pd.read_csv(extension + '/alert_status_summary.csv', index_col=False)" ], "metadata": { "id": "IjOVE5uaevdP" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "display(stat_df.head(5))" ], "metadata": { "id": "JnbczUOYf0C9" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "stat_df['signature_id'] = stat_df['file_name'].str.extract(r'(\\d+)')\n", "stat_df['TN_SP'] = stat_df['TN'] + stat_df['SP']\n", "stat_df['total_num'] = stat_df['TN'] + stat_df['SP'] + stat_df['FP'] + stat_df['FN'] + stat_df['TP']" ], "metadata": { "id": "tkcGUtARlx08" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# The False Positives are composed of the alerts with summaries that are false positive and the alerts with summaries that are still processing\n", "stat_df['Precision'] = stat_df['TP'] / (stat_df['TP'] + (stat_df['FP'] + stat_df['SP']))\n", "stat_df['Recall'] = stat_df['TP'] / (stat_df['TP'] + stat_df['FN'])\n", "stat_df['F1_Score'] = 2 * (stat_df['Precision'] * stat_df['Recall']) / (stat_df['Precision'] + stat_df['Recall'])\n", "\n", "stat_df['Precision_SP_is_TP'] = (stat_df['TP'] + stat_df['SP']) / ((stat_df['TP'] + stat_df['SP']) + stat_df['FP'])\n", "stat_df['Recall_SP_is_TP'] = (stat_df['TP'] + stat_df['SP']) / ((stat_df['TP'] + stat_df['SP']) + stat_df['FN'])\n", "stat_df['F1_Score_SP_is_TP'] = 2 * (stat_df['Precision_SP_is_TP'] * stat_df['Recall_SP_is_TP']) / (stat_df['Precision_SP_is_TP'] + stat_df['Recall_SP_is_TP'])" ], "metadata": { "id": "tGz5ZKM2QqgB" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Calculate the sum of numerical columns for the filtered DataFrame\n", "numerical_cols = stat_df.select_dtypes(include=['number']).columns\n", "column_sums = stat_df[numerical_cols].sum()\n", "# stat_df[numerical_cols] = stat_df[numerical_cols].map('{:.2f}'.format)\n", "#pd.set_option('display.float_format', '{:.2f}'.format)\n", "# Display the column sums\n", "print(\"Sum of numerical columns in stat_df:\")\n", "display(column_sums)" ], "metadata": { "id": "-xtJe00d0lmq" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The following graphs showcase the distribution of number of alerts per alert summary according to the status of the alert summary. We mainly notice that alert summaries with FN are associated generally with one alert (For definition, we decided that every alert summary with at least one alert that is created manually should be classified as FN, subsequently making all its respective alerts inherit the same status). They also showcase the distribution of the metrics values across the timeseries signatures." ], "metadata": { "id": "cDvir6_6FwKc" } }, { "cell_type": "code", "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "cols = ['TP', 'TN', 'FP', 'FN', 'Precision', 'Recall', 'F1_Score', 'Precision_SP_is_TP', 'Recall_SP_is_TP', 'F1_Score_SP_is_TP']\n", "for col in cols:\n", " sns.histplot(stat_df[col], kde=True, color='blue', bins=10)\n", "\n", " # Add labels and title\n", " plt.title('Normal Distribution of Values of column '+ col)\n", " plt.xlabel('Values')\n", " plt.ylabel('Count')\n", "\n", " # Show the plot\n", " plt.show()\n" ], "metadata": { "id": "8mgDcC30dtlh" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "selected_sig_list = selected_sig_df['test_series_signature_id'].unique()\n", "selected_sig_list = list(map(str, selected_sig_list))\n", "filtered_stat_df = stat_df[stat_df['signature_id'].isin(selected_sig_list)]\n", "numerical_cols = filtered_stat_df.select_dtypes(include=['number']).columns\n", "column_sums = filtered_stat_df[numerical_cols].sum()\n", "\n", "# Display the column sums\n", "print(\"Sum of numerical columns in filtered_stat_df:\")\n", "display(column_sums)" ], "metadata": { "id": "nd3140fsDw2t" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "columns_to_average = [\"Precision\", \"Recall\", \"F1_Score\", 'Precision_SP_is_TP', 'Recall_SP_is_TP', 'F1_Score_SP_is_TP']\n", "average_values = stat_df[columns_to_average].mean()\n", "display(average_values)" ], "metadata": { "id": "gvWVoH1IgUpV" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "selected_sig_list = selected_sig_df['test_series_signature_id'].unique()\n", "selected_sig_list = list(map(str, selected_sig_list))" ], "metadata": { "id": "L3gSpNwvlchd" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "filtered_stat_df = stat_df[stat_df['signature_id'].isin(selected_sig_list)]\n", "display(filtered_stat_df.head(5))" ], "metadata": { "id": "N8tWYDc-ly2A" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The following table represents the stats of averages of the metrics for the sample dataset utilized to generate the predictions" ], "metadata": { "id": "d7KNx0X8HoxX" } }, { "cell_type": "code", "source": [ "filtered_average_values = stat_df[stat_df['signature_id'].isin(selected_sig_list)][columns_to_average].mean()\n", "display(filtered_average_values)" ], "metadata": { "id": "ZAfKOqiglHSt" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Visualizing one timeseries**\n", "\n", "\n", "The following part helps with generating the graphical visualizations of the Mozilla prediction, the baseline, and the CPDs predictions for one given timeseries signature. The needed files to conduct the visualization are as follows:\n", "- *The CSV for the timeseries* : The CSV file having the labeled data points (TN, FN, TP, ....). This could be found in the `data` directory in the Github project\n", "- *The summary file* : After generating the prediciton using TCPDBench, we get the summary file for the timeseries signature we would liek to visualize. This file contains the change point locations that are predicted by the utilized CPD methods" ], "metadata": { "id": "iTk-W3ZWIuPn" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', None)\n", "import matplotlib.pyplot as plt\n", "import os" ], "metadata": { "id": "4OEZVgaZNHce" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def display_sample(dataf, sig_id, only_true=True, custom_indices=None):\n", " sample_df = dataf.copy()\n", "\n", " #plt.figure(figsize=(12, 8))\n", " plt.figure(figsize=(20, 10))\n", " color_mapping = {\n", " 'TP': 'green',\n", " 'FP': 'red',\n", " 'SP': 'grey',\n", " 'TN': 'blue',\n", " 'FN': 'yellow'\n", " }\n", "\n", " for idx, row in sample_df.iterrows():\n", " plt.plot(idx, row['value'], marker='o', markersize=8, color=color_mapping.get(row['alert_status_general']), alpha=0.6)\n", " line_idx = []\n", " if only_true:\n", " line_idx = ['TP', 'FN']\n", " else:\n", " line_idx = ['TP', 'FN', 'FP', 'SP']\n", "\n", " # Add vertical line corresponding to each data point of interest\n", " if row['alert_status_general'] in line_idx:\n", " plt.axvline(x=idx, color=color_mapping.get(row['alert_status_general']), linestyle='--', alpha=0.4)\n", "\n", " if custom_indices:\n", " for idx in custom_indices:\n", " plt.axvline(x=idx, color='purple', linestyle='--', alpha=0.6)\n", "\n", " plt.title('Time Series Plot')\n", " plt.xlabel('Date')\n", " plt.ylabel(f'Test measurement values associated with signature ID {sig_id}')\n", " #plt.grid(True)\n", " plt.grid(axis='y')\n", " plt.xlim(sample_df.index.min(), sample_df.index.max())\n", " y_min = 0\n", " y_max = sample_df['value'].max() * 1.5\n", " plt.ylim(bottom=y_min, top=y_max)\n", " start_date = sample_df.index.min()\n", " end_date = sample_df.index.max()\n", " weekly_ticks = pd.date_range(start=start_date, end=end_date, freq='W-MON')\n", " plt.xticks(weekly_ticks, rotation=45)\n", " plt.show()" ], "metadata": { "id": "H7dRF-ld1Huw" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import json\n", "with open('/content/summary_4361184.json', 'r') as file:\n", " data = json.load(file)['results']\n", "df_dict = dict()\n", "pred_data = []\n", "for i in data:\n", " max_f1 = -1\n", " for j in data[i]:\n", " entry_dict = dict()\n", " if (j['status'] == 'SUCCESS'):\n", " if (j['scores']['f1'] > max_f1):\n", " entry_dict['algorithm'] = i\n", " entry_dict['cplocations'] = j['cplocations']\n", " entry_dict['f1'] = j['scores']['f1']\n", " max_f1 = j['scores']['f1']\n", " if (entry_dict):\n", " pred_data.append(entry_dict)\n", "pred_df = pd.DataFrame(pred_data)" ], "metadata": { "id": "3-5ThLbJ1fz1" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "main_dir = '/content'\n", "#for csv_file in os.listdir(main_dir):\n", "#for csv_file in ['3869261_timeseries_data.csv', '4361184_timeseries_data.csv']:\n", "for csv_file in ['4361184_timeseries_data.csv']:\n", " if not csv_file.endswith('.csv'):\n", " continue\n", " csv_path = os.path.join(main_dir, csv_file)\n", " df = pd.read_csv(csv_path)\n", " sig_id = csv_file.split('_')[0]\n", " display_sample(df, sig_id, False)" ], "metadata": { "id": "HMjhKVNEODKI" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "for index, row in pred_df.iterrows():\n", " df = pd.read_csv(csv_path)\n", " sig_id = '4361184'\n", " print(row['algorithm'])\n", " print(row['f1'])\n", " display_sample(df, sig_id, True, row['cplocations'])" ], "metadata": { "id": "VnvXvjLRBXDd" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import pandas as pd\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', None)\n", "df = pd.read_csv('/content/2_rectified_alerts_data.csv')\n", "df.head(5)" ], "metadata": { "id": "FPzatHp44OUk" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "alert_status_mapping = {\n", " 0: \"untriaged\",\n", " 1: \"downstream\",\n", " 2: \"reassigned\",\n", " 3: \"invalid\",\n", " 4: \"improvement\",\n", " 5: \"investigating\",\n", " 6: \"wontfix\",\n", " 7: \"fixed\",\n", " 8: \"backedout\"\n", "}\n", "test_status_mapping = {\n", " 0: \"untriaged\",\n", " 1: \"downstream\",\n", " 2: \"reassigned\",\n", " 3: \"invalid\",\n", " 4: \"acknowledged\"\n", "}\n", "category_mapping = {\n", " 'investigating': 'SP',\n", " 'reassigned': 'TP',\n", " 'invalid': 'FP',\n", " 'improvement': 'TP',\n", " 'fixed': 'TP',\n", " 'wontfix': 'TP',\n", " 'untriaged': 'SP',\n", " 'backedout': 'TP',\n", " 'downstream': 'TP',\n", " 'acknowledged': 'TP',\n", "}\n", "df['alert_status_general'] = df['alert_status'].map(alert_status_mapping)\n", "df[\"alert_status_general\"] = df[\"alert_status_general\"].replace(category_mapping)\n", "df['test_status_general'] = df['test_status'].map(alert_status_mapping)\n", "df[\"test_status_general\"] = df[\"test_status_general\"].replace(category_mapping)" ], "metadata": { "id": "XO7kmRkn4Zoi" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df.loc[df['test_manually_created'] == True, 'test_status_general'] = \"FN\"\n", "df.loc[df['test_manually_created'] == True, 'alert_status_general'] = \"FN\"" ], "metadata": { "id": "mfUhze4s9HID" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df_processed_1 = df[['test_id', 'test_status_general']].drop_duplicates()\n", "df_processed_2 = df[['test_id', 'alert_status_general']].drop_duplicates()" ], "metadata": { "id": "ow531hO56Tkv" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The following stats showcase the number of data points per category (ones with alert statuses, which are the stats under test_status_general, and the ones with alert summary statuses, under alert_status_general)" ], "metadata": { "id": "3kp-WfK2R8JK" } }, { "cell_type": "code", "source": [ "print(df_processed_1['test_status_general'].value_counts())\n", "print(df_processed_2['alert_status_general'].value_counts())" ], "metadata": { "id": "rszOKjde7ntT" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The following graphs showcase the distribution of number of alerts per alert summary according to the alert summary status. The main observation is that alert summaries with at least one alert created manually tend to have exactly one alert, unlike the alert summaries with status of TP or FP" ], "metadata": { "id": "0-xfZSkOSeeQ" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "def display_hist(dataf, arg):\n", " counts = dataf.groupby('alert_id')['test_id'].nunique()\n", " '''\n", " log_counts = np.log1p(counts)\n", " sns.histplot(log_counts, kde=True, bins=len(counts), color='blue')\n", " plt.title('Distribution of Unique Values of alert IDs per alert summary ID')\n", " plt.xlabel('Number of Unique Values of alert IDs')\n", " plt.ylabel('Density')\n", " plt.show()\n", " '''\n", " bins = [0, 1, 2, 3, 4, 5, 10, float('inf')]\n", " labels = ['1', '2', '3', '4', '5', '6-10', '11+']\n", " binned_counts = pd.cut(counts, bins=bins, labels=labels, right=True)\n", " binned_counts_distribution = binned_counts.value_counts(sort=False)\n", " sns.barplot(x=binned_counts_distribution.index, y=binned_counts_distribution.values, color='blue')\n", " plt.title('Distribution of Unique Values of alert IDs per alert summary ID (' + arg + ')')\n", " plt.xlabel('Number of Unique Values of alert IDs')\n", " plt.ylabel('Density')\n", " plt.show()\n", "\n", "alert_alert_summary_distro = df[['test_id', 'alert_id']]\n", "display_hist(alert_alert_summary_distro, 'general')\n", "alert_alert_summary_distro_fp = df[df['alert_status_general'] == 'FP'][['test_id', 'alert_id']]\n", "display_hist(alert_alert_summary_distro_fp, 'Only False Positive summaries')\n", "alert_alert_summary_distro_tp = df[df['alert_status_general'] == 'TP'][['test_id', 'alert_id']]\n", "display_hist(alert_alert_summary_distro_tp, 'Only True Positive summaries')\n", "alert_alert_summary_distro_fn = df[df['alert_status_general'] == 'FN'][['test_id', 'alert_id']]\n", "display_hist(alert_alert_summary_distro_fn, 'Only False Negative summaries')" ], "metadata": { "id": "YFkMGSOC_kQK" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The following graph helps with understanding the tradeoff between Precision and Recall for hyper parameter configurations providing the best results on average. The needed data is a CSV from the TCPDBench after generating the stats (the CSVs exist under /TCPDBench/analysis/output)" ], "metadata": { "id": "c5iqk8m5THdH" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "df_kcpa = pd.read_csv('/content/metrics_of_best_kcpa.csv')\n", "df_mongodb = pd.read_csv('/content/metrics_of_best_mongodb.csv')\n", "def plot_precision_recall_f1(df):\n", " # Identify the row with the highest F1 Score\n", " max_f1_row = df.loc[df['F1 Score'].idxmax()]\n", "\n", " # Plotting\n", " plt.figure(figsize=(8, 6))\n", " sns.scatterplot(data=df, x='Recall', y='Precision', s=100)\n", "\n", " # Highlight the point with the highest F1 Score\n", " plt.scatter(max_f1_row['Recall'], max_f1_row['Precision'], color='red', s=150, label='Max F1 Score')\n", "\n", " plt.xlim(0.15, 0.45)\n", " plt.ylim(0.15, 0.45)\n", "\n", " # Add plot details\n", " plt.title('Precision vs Recall')\n", " plt.xlabel('Recall')\n", " plt.ylabel('Precision')\n", " plt.grid(True)\n", "\n", " # Remove legend\n", " plt.legend([], [], frameon=False)\n", "\n", " # Show the plot\n", " plt.tight_layout()\n", " plt.show()\n", "\n", " # Display the row with the highest F1 Score\n", " print(\"Row with the highest F1 Score:\")\n", " print(max_f1_row)\n", "plot_precision_recall_f1(df_kcpa)\n", "plot_precision_recall_f1(df_mongodb)" ], "metadata": { "id": "7IcJO41yWGXh" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [], "metadata": { "id": "6j6pnXbPgKvk" }, "execution_count": null, "outputs": [] } ] }