vertex_ai/04_experimentation.ipynb (935 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2022 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "JAPoU8Sm5E6e" }, "source": [ "# Fraudfinder - XGBoost Model Experimentation\n", "\n", "<table align=\"left\">\n", " <td>\n", " <a href=\"https://console.cloud.google.com/ai-platform/notebooks/deploy-notebook?download_url=https://github.com/GoogleCloudPlatform/fraudfinder/blob/main/vertex_ai/04_experimentation.ipynb\">\n", " <img src=\"https://www.gstatic.com/cloud/images/navigation/vertex-ai.svg\" alt=\"Google Cloud Notebooks\">Open in Cloud Notebook\n", " </a>\n", " </td> \n", " <td>\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/fraudfinder/blob/main/vertex_ai/04_experimentation.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Open in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/fraudfinder/blob/main/vertex_ai/04_experimentation.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "[Fraudfinder](https://github.com/googlecloudplatform/fraudfinder) is a series of labs on how to build a real-time fraud detection system on Google Cloud. Throughout the Fraudfinder labs, you will learn how to read historical bank transaction data stored in data warehouse, read from a live stream of new transactions, perform exploratory data analysis (EDA), do feature engineering, ingest features into a feature store, train a model using feature store, register your model in a model registry, evaluate your model, deploy your model to an endpoint, do real-time inference on your model with feature store, and monitor your model." ] }, { "cell_type": "markdown", "metadata": { "id": "tvgnzT1CKxrO" }, "source": [ "### Objective\n", "\n", "This notebook shows how to pull features from Feature Store for training, run data exploratory analysis on features, build a machine learning model locally, experiment with various hyperparameters, evaluate the model and deploy it to a Vertex AI endpoint. \n", "\n", "This lab uses the following Google Cloud services and resources:\n", "\n", "- [Vertex AI](https://cloud.google.com/vertex-ai/)\n", "- [BigQuery](https://cloud.google.com/bigquery/)\n", "\n", "Steps performed in this notebook:\n", "\n", "- Use a Feature Store to pull training data\n", "- Do some exploratory analysis on the extracted data\n", "- Train the model and track the results using Vertex AI Experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Costs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* BigQuery\n", "\n", "Learn about [Vertex AI\n", "pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load configuration settings from the setup notebook\n", "\n", "Set the constants used in this notebook and load the config settings from the `00_environment_setup.ipynb` notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "GCP_PROJECTS = !gcloud config get-value project\n", "PROJECT_ID = GCP_PROJECTS[0]\n", "BUCKET_NAME = f\"{PROJECT_ID}-fraudfinder\"\n", "config = !gsutil cat gs://{BUCKET_NAME}/config/notebook_env.py\n", "print(config.n)\n", "exec(config.n)" ] }, { "cell_type": "markdown", "metadata": { "id": "XoEqT2Y4DJmf" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pRUOFELefqf1" }, "outputs": [], "source": [ "# General\n", "import os\n", "import sys\n", "from typing import Union, List\n", "import random\n", "from datetime import datetime, timedelta\n", "import time\n", "import json\n", "import logging\n", "\n", "# Feature Store\n", "from google.cloud import aiplatform as vertex_ai\n", "from google.cloud.aiplatform import Featurestore, EntityType, Feature\n", "\n", "# Data Preprocessing\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline\n", "\n", "# Model Training\n", "from google.cloud import bigquery\n", "from google.cloud import storage\n", "from sklearn import metrics\n", "from sklearn.metrics import precision_recall_fscore_support\n", "from sklearn.metrics import classification_report, f1_score, accuracy_score\n", "from sklearn.linear_model import LogisticRegression\n", "import xgboost as xgb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# General\n", "DATA_DIR = os.path.join(os.pardir, \"data\")\n", "DATA_URI = f\"gs://{BUCKET_NAME}/data\"\n", "TRAIN_DATA_URI = f\"{DATA_URI}/train\"\n", "\n", "# Feature Store\n", "START_DATE_TRAIN = (datetime.today() - timedelta(days=1)).strftime(\"%Y-%m-%d\")\n", "END_DATE_TRAIN = (datetime.today() - timedelta(days=1)).strftime(\"%Y-%m-%d\")\n", "CUSTOMER_ENTITY = \"customer\"\n", "TERMINAL_ENTITY = \"terminal\"\n", "SERVING_FEATURE_IDS = {CUSTOMER_ENTITY: [\"*\"], TERMINAL_ENTITY: [\"*\"]}\n", "READ_INSTANCES_TABLE = f\"ground_truth_{END_DATE_TRAIN}\"\n", "READ_INSTANCES_URI = f\"bq://{PROJECT_ID}.tx.{READ_INSTANCES_TABLE}\"\n", "\n", "# Training\n", "COLUMNS_IGNORE = [\n", " \"terminal_id\",\n", " \"customer_id\",\n", " \"entity_type_event\",\n", " \"entity_type_customer\",\n", " \"entity_type_terminal\",\n", "]\n", "TARGET = \"tx_fraud\"\n", "\n", "# Custom Training\n", "MODEL_NAME = f\"{MODEL_NAME}_xgb_exp_{ID}\"\n", "\n", "# Experiment\n", "EXPERIMENT_NAME = f\"ff-experiment-{ID}\"" ] }, { "cell_type": "markdown", "metadata": { "id": "XoEqT2Y4DJmf" }, "source": [ "### Initialize clients\n", "Next you have to initialize the [Vertex AI SDK](https://cloud.google.com/vertex-ai/docs/start/use-vertex-ai-python-sdk) and the Python BigQuery Client for your project, region and corresponding bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bq_client = bigquery.Client(project=PROJECT_ID, location=REGION)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vertex_ai.init(\n", " project=PROJECT_ID,\n", " location=REGION,\n", " staging_bucket=BUCKET_NAME,\n", " experiment=EXPERIMENT_NAME,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper Functions\n", "Next you will create a helper function for sending queries to BigQuery." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_bq_query(sql: str) -> Union[str, pd.DataFrame]:\n", " \"\"\"\n", " Run a BigQuery query and return the job ID or result as a DataFrame\n", " Args:\n", " sql: SQL query, as a string, to execute in BigQuery\n", " Returns:\n", " df: DataFrame of results from query, or error, if any\n", " \"\"\"\n", "\n", " bq_client = bigquery.Client()\n", "\n", " # Try dry run before executing query to catch any errors\n", " job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)\n", " bq_client.query(sql, job_config=job_config)\n", "\n", " # If dry run succeeds without errors, proceed to run query\n", " job_config = bigquery.QueryJobConfig()\n", " client_result = bq_client.query(sql, job_config=job_config)\n", "\n", " job_id = client_result.job_id\n", "\n", " # Wait for query/job to finish running. then get & return data frame\n", " df = client_result.result().to_arrow().to_pandas()\n", " print(f\"Finished job_id: {job_id}\")\n", " return df\n", "\n", "\n", "def create_gcs_dataset(client, display_name: str, gcs_source: Union[str, List[str]]):\n", " \"\"\"\n", " A function to create a Vertex AI Dataset resource\n", " Args:\n", " client: Vertex AI Client instance\n", " display_name: The name of Vertex AI Dataset resource\n", " gcs_source: The URI of data on the bucket\n", " Returns:\n", " Vertex AI Dataset resource\n", " \"\"\"\n", " dataset = client.TabularDataset.create(\n", " display_name=display_name,\n", " gcs_source=gcs_source,\n", " )\n", "\n", " dataset.wait()\n", " return dataset\n", "\n", "\n", "def preprocess(df: pd.DataFrame):\n", " \"\"\"\n", " Converts categorical features to numeric. Removes unused columns.\n", " Args:\n", " df: Pandas df with raw data\n", " Returns:\n", " df with preprocessed data\n", " \"\"\"\n", " df = df.drop(columns=UNUSED_COLUMNS)\n", "\n", " # Drop rows with NaN\"s\n", " df = df.dropna()\n", "\n", " # Convert integer valued (numeric) columns to floating point\n", " numeric_columns = df.select_dtypes([\"int32\", \"float32\", \"float64\"]).columns\n", " df[numeric_columns] = df[numeric_columns].astype(\"float32\")\n", "\n", " dummy_columns = list(df.dtypes[df.dtypes == \"category\"].index)\n", " df = pd.get_dummies(df, columns=dummy_columns)\n", "\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching feature values for model training\n", "\n", "To fetch the feature values from Feature Store as part of the training data, we have to specify the following inputs to define what we want to use to lookup the values in Feature Store:\n", "\n", "- a file containing a table, with the entities and timestamps for each label\n", "- a list of features to fetch values for\n", "- the destination location and format\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read-instance list\n", "\n", "In our case, we need a csv file with content formatted like the table below:\n", "\n", "|customer |terminal|timestamp |\n", "|-----------------------------|--------|---------------------------------------------|\n", "|xxx3859 |xxx8811 |2021-07-07 00:01:10 UTC |\n", "|xxx4165 |xxx8810 |2021-07-07 00:01:55 UTC |\n", "|xxx2289 |xxx2081 |2021-07-07 00:02:12 UTC |\n", "|xxx3227 |xxx3011 |2021-07-07 00:03:23 UTC |\n", "|xxx2819 |xxx6263 |2021-07-07 00:05:30 UTC |\n", "\n", "where the column names are the names of entities in Feature Store and the timestamps represents the time an event occurred." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "read_instances_query = f\"\"\"\n", " SELECT\n", " raw_tx.TX_TS AS timestamp,\n", " raw_tx.CUSTOMER_ID AS customer,\n", " raw_tx.TERMINAL_ID AS terminal,\n", " raw_tx.TX_AMOUNT AS tx_amount,\n", " raw_lb.TX_FRAUD AS tx_fraud,\n", " FROM \n", " tx.tx as raw_tx\n", " LEFT JOIN \n", " tx.txlabels as raw_lb\n", " ON raw_tx.TX_ID = raw_lb.TX_ID\n", " WHERE\n", " DATE(raw_tx.TX_TS) = \"{START_DATE_TRAIN}\";\n", "\"\"\"\n", "print(read_instances_query)\n", "\n", "query_df = run_bq_query(read_instances_query)\n", "query_df.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Instantiate Feature Store\n", "Now you can instantiate the feature store object using the ID." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ff_feature_store = Featurestore(FEATURESTORE_ID)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Export a training sample of data to a pandas dataframe.\n", "First you need to fetch a batch of data. We will use this data to train a custom model. We will fetch a batch of data and create a Pandas dataframe. The dataframe makes it easier to inspect the data. \n", "\n", "In the next cell, you will send the dataframe to lookup feature values to Feature Store and retrieve the results as pandas DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_df = ff_feature_store.batch_serve_to_df(\n", " serving_feature_ids=SERVING_FEATURE_IDS,\n", " read_instances_df=query_df,\n", " pass_through_fields=[\"tx_fraud\", \"tx_amount\"],\n", ")\n", "\n", "sample_df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data exploration\n", "Here you will use a subset of data for data exploration to get a better understanding of the data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Examine the label distribution\n", "Let's create a plot and do an actual count of the values (fraud vs. non-fraud)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.title(\"Count of transactions by fraud label (tx_fraud = 0 or 1)\")\n", "sns.countplot(x=\"tx_fraud\", data=sample_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(sample_df.tx_fraud.value_counts() / sample_df.shape[0]) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see the data is imbalanced. We will fix this later in this notebook before building the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get a better understanding of the feature distributions using histograms\n", "Now you can do the same for the input features. It's good to plot the distributions of our features to get a better understanding of them. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = [\n", " \"tx_amount\",\n", " \"customer_id_nb_tx_1day_window\",\n", " \"customer_id_nb_tx_7day_window\",\n", " \"customer_id_nb_tx_14day_window\",\n", " \"customer_id_avg_amount_1day_window\",\n", " \"customer_id_avg_amount_7day_window\",\n", " \"customer_id_avg_amount_14day_window\",\n", " \"terminal_id_risk_1day_window\",\n", " \"terminal_id_risk_7day_window\",\n", " \"terminal_id_risk_14day_window\",\n", " \"terminal_id_nb_tx_1day_window\",\n", " \"terminal_id_nb_tx_7day_window\",\n", " \"terminal_id_nb_tx_14day_window\",\n", "]\n", "\n", "sample_df[features].hist(figsize=(20, 10), grid=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Analyze the relationship between features and ```tx_fraud```\n", "Now you can also look at the relationship between our target and input features using a correlation heatmap. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(8, 5))\n", "sns.heatmap(sample_df[sample_df.columns.difference(COLUMNS_IGNORE)].corr())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Some observations\n", "\n", "Based on this simple exploratory data analysis you can conclude:\n", "\n", "- The sample data is unbalanced.\n", "- You can probably remove some of the features with little predictive value.\n", "- You might want to extract subsets of the timestamp into separate features such as day, week, night, etc. (i.e. calculate some time-based embeddings).\n", "- You may want to do some variable selection.\n", "- You might need to scale some variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a custom fraud detection model\n", "\n", "### Fixing an imbalanced dataset\n", "In the real world, we must deal with an imbalance in our dataset. For example, we might randomly delete some of the non-fraudulent transactions to approximately match the number of fraudulent transactions. This technique is called undersampling.\n", "\n", "For this workshop, we will skip the data balance process because our sample data is small already, and the further reduction will compromise the quality of our results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shuffled_df = sample_df.sample(frac=1, random_state=4)\n", "fraud_df = shuffled_df.loc[shuffled_df[\"tx_fraud\"] == 1]\n", "non_fraud_df = shuffled_df.loc[shuffled_df[\"tx_fraud\"] == 0].sample(\n", " n=fraud_df.shape[0], random_state=42\n", ")\n", "balanced_df = pd.concat([fraud_df, non_fraud_df])\n", "(balanced_df.tx_fraud.value_counts() / balanced_df.shape[0]) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training\n", "In this section, you will train a model using the XGBoost algorithm. As we are still experimenting, you will use the XGBoost package interactively to train your model in this notebook.\n", "\n", "#### Why XGBoost?\n", "The extreme gradient-boosted (XGBoost) algorithm is an ML algorithm based on ensembles of decision trees, which works well with imbalanced data, handling missing values, and can be parallelized." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preparing datasets\n", "We split our data into training, test, and validation sets in the following cell. Training data is the primary source of input for training the ML model. Validation data determines our progress after each epoch or iteration of our training loop. Finally, the test data is data the model has never seen before and is used to assess model quality at the end of the training process." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set up training variables\n", "LABEL_COLUMN = \"tx_fraud\"\n", "UNUSED_COLUMNS = [\"timestamp\", \"entity_type_customer\", \"entity_type_terminal\"]\n", "NA_VALUES = [\"NA\", \".\"]\n", "\n", "df_dataset = balanced_df\n", "df_train, df_test, df_val = np.split(\n", " df_dataset.sample(frac=1, random_state=42),\n", " [int(0.6 * len(df_dataset)), int(0.8 * len(df_dataset))],\n", ")\n", "\n", "# Training set\n", "preprocessed_train_data = preprocess(df_train)\n", "x_train = preprocessed_train_data[\n", " preprocessed_train_data.columns.drop(LABEL_COLUMN).to_list()\n", "].values\n", "y_train = preprocessed_train_data.loc[:, LABEL_COLUMN].astype(int)\n", "\n", "# Validation set\n", "preprocessed_val_data = preprocess(df_val)\n", "x_val = preprocessed_val_data[\n", " preprocessed_val_data.columns.drop(LABEL_COLUMN).to_list()\n", "].values\n", "y_val = preprocessed_val_data.loc[:, LABEL_COLUMN].astype(int)\n", "\n", "# Test set\n", "preprocessed_test_data = preprocess(df_test)\n", "x_test = preprocessed_test_data[\n", " preprocessed_test_data.columns.drop(LABEL_COLUMN).to_list()\n", "].values\n", "y_test = preprocessed_test_data.loc[:, LABEL_COLUMN].astype(int)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "#### Training the model\n", "Before training the XGBoost model, you can set some hyperparameters to help us improve the model's performance. We advise you to use [Vertex AI Vizier](https://cloud.google.com/vertex-ai/docs/vizier/overview), which automates the optimization of hyperparameters, to help with hyperparameter tuning. However, in this notebook, we specify these hyperparameters manually and randomly for the sake of simplicity and expedience:\n", "\n", "- `eta`: A regularization parameter to reduce feature weights in each boosting step.\n", "- `gamma`: A regularization parameter for tree pruning.\n", "- `max_depth`: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit.\n", "\n", "For more information about parameters, check [the XGBoost documentation](https://xgboost.readthedocs.io/en/stable/parameter.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parameters = [\n", " {\"eta\": 0.2, \"gamma\": 0.0, \"max_depth\": 4},\n", " {\"eta\": 0.2, \"gamma\": 0.0, \"max_depth\": 5},\n", " {\"eta\": 0.2, \"gamma\": 0.1, \"max_depth\": 4},\n", " {\"eta\": 0.2, \"gamma\": 0.1, \"max_depth\": 5},\n", " {\"eta\": 0.3, \"gamma\": 0.0, \"max_depth\": 4},\n", " {\"eta\": 0.3, \"gamma\": 0.0, \"max_depth\": 5},\n", " {\"eta\": 0.3, \"gamma\": 0.1, \"max_depth\": 4},\n", " {\"eta\": 0.3, \"gamma\": 0.1, \"max_depth\": 5},\n", "]\n", "\n", "models = {}\n", "for i, params in enumerate(parameters):\n", " run_name = f\"ff-xgboost-local-run-t-{i}\"\n", " print(run_name)\n", " vertex_ai.start_run(run=run_name)\n", " vertex_ai.log_params(params)\n", " model = xgb.XGBClassifier(\n", " objective=\"reg:logistic\",\n", " max_depth=params[\"max_depth\"],\n", " gamma=params[\"gamma\"],\n", " eta=params[\"eta\"],\n", " use_label_encoder=False,\n", " )\n", " model.fit(x_train, y_train)\n", " models[run_name] = model\n", " y_pred_proba = model.predict_proba(x_val)[:, 1]\n", " y_pred = model.predict(x_val)\n", " acc_score = accuracy_score(y_val, y_pred)\n", " val_f1_score = f1_score(y_val, y_pred, average=\"weighted\")\n", " vertex_ai.log_metrics({\"acc_score\": acc_score, \"f1score\": val_f1_score})\n", " vertex_ai.end_run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "XGBoost usually performs well on imbalanced datasets. You can also try another algorithm, such as logistic regression." ] }, { "cell_type": "markdown", "metadata": { "id": "A1PqKxlpOZa2" }, "source": [ "We can also extract all parameters and metrics associated with any experiment into a dataframe for further analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jbRf1WoH_vbY" }, "outputs": [], "source": [ "experiment_df = vertex_ai.get_experiment_df()\n", "experiment_df.sort_values([\"metric.f1score\"], ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "F19_5lw0MqXv" }, "source": [ "Also, you can visualize experiments using the Cloud Console on the [Vertex AI Experiments](https://console.cloud.google.com/ai/platform/experiments/experiments) page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Save model\n", "After running the experiments, you can choose one of the experiments and use XGBoost's save_model method to export the model to a local file named `model.bst`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_directory = \"../models\"\n", "!sudo mkdir -p -m 777 {model_directory}\n", "\n", "model = models[f\"ff-xgboost-local-run-t-{i}\"]\n", "artifact_filename = \"model.bst\"\n", "model_path = os.path.join(model_directory, artifact_filename)\n", "model.save_model(model_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Evaluation\n", "\n", "Let's first test the model locally on your test dataset, to get predicted labels and an F1 score, which is an aggregation of the model's precision and recall. The F1 score becomes especially valuable when working on classification models in which your data set is imbalanced. Learn more about the F1 score [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). You will `precision_recall_fscore_support` from sklearn.\n", "\n", "You will use test data to evaluate the model performance. First, you will take our trained XGBoost model to generate predictions on our test data. After that, you will use these predictions to calculate the F1 score and evaluate model performance. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bst = xgb.Booster()\n", "bst.load_model(model_path)\n", "xgtest = xgb.DMatrix(x_test)\n", "y_pred_prob = bst.predict(xgtest)\n", "y_pred = y_pred_prob.round().astype(int)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# you can quickly check top five y_pred and see if we are getting labels as expected\n", "y_pred[:5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import precision_recall_fscore_support\n", "\n", "precision_recall_fscore_support(y_test.values, y_pred, average=\"weighted\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The decision to convert a predicted probability into a fraudulent/non-fraudulent class label is determined by a discrimination threshold, which uses a default value of 0.5. A transaction is predicted as non-fraudulent (class 0) if the probability is under threshold (0.5) and fraudulent (class 1) if it is equal to or greater than the threshold (0.5). This threshold determines the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) results which are typically used in the confusion matrix, precision, recall, and F1-score, all of which are used as evaluation metrics for a classification model.\n", "\n", "You might get different TP and FP rates if you change this threshold, especially if your data needs to be balanced. By fine-tuning this threshold, you might find a value that leads to near-optimal model performance based on your business tolerance for accepting the cost of FP or FN cases. \n", "\n", "In the next cell, we calculate the confusion matrix for different discrimination thresholds:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2, 5, figsize=(15, 4))\n", "\n", "for n, ax in enumerate(axes.flat):\n", " threshold = (n + 1) / 10\n", " y_pred = (y_pred_prob > threshold).astype(int)\n", " cfm = metrics.confusion_matrix(y_test, y_pred)\n", " sns.heatmap(cfm, annot=True, cmap=\"Reds\", fmt=\"d\", ax=ax, cbar=False)\n", " ax.title.set_text(\"Threshold=%.2f\" % threshold)\n", "\n", "plt.subplots_adjust(hspace=0.4, wspace=0.8)\n", "plt.suptitle(\"Confusion Matrix for various thresholds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We might get more insight into the optimal threshold by examining a Receiver Operator Characteristic (ROC) Curve. This will plot the true positive (TP) vs. false positive (FP) rates at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve. Also, we graph the Area Under the Curve (AUC), which ranges in value from 0 to 1.0. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "precision_recall_fscore_support(y_test.values, y_pred, average=\"weighted\")\n", "fpr, tpr, thresholds = metrics.roc_curve(y_test.values, y_pred_prob)\n", "auc = metrics.auc(fpr, tpr)\n", "\n", "plt.figure(figsize=(8, 6))\n", "# plot the roc curve for the model\n", "plt.plot(fpr, tpr, marker=\".\", label=\"xgboost: AUC = %.2f\" % auc)\n", "\n", "# generate general prediction (majority class)\n", "ns_probs = [0 for _ in range(len(y_test))]\n", "ns_fpr, ns_tpr, _ = metrics.roc_curve(y_test, ns_probs)\n", "ns_auc = metrics.auc(ns_fpr, ns_tpr)\n", "plt.plot(ns_fpr, ns_tpr, linestyle=\"--\", label=\"No Skill: AUC = %.2f\" % ns_auc)\n", "\n", "plt.ylabel(\"TP rate\")\n", "plt.xlabel(\"FP rate\")\n", "\n", "plt.legend(loc=4)\n", "plt.title(\"ROC Curve\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, what is the optimal threshold? As mentioned, this depends on the business use case; for example, if we believe the optimal threshold for our use case offers the highest TPR and minimum FPR, then we can calculate the optimal threshold as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Optimal threshold value is: %.3f\" % thresholds[np.argmax(tpr - fpr)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please note that the metrics may be unexpectedly high (e.g., a perfect ROC AUC score of 1.0), and that this is because the underlying data was synthetically generated. \n", "\n", "In practice, the evaluation metrics are likely to be lower, so this notebook focuses on the overall architecture and how tools are used together rather than on the model's actual performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now understand our dataset more deeply, have achieved good results with the XGBoost algorithm, and have fine-tuned model hyperparameters. Now let's turn our attention to the following notebook to transition this training process from an manual approach to a more formalized method.\n", "\n", "You can continue with the next notebook: `05_model_training_xgboost_formalization.ipynb`." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "notebook_template.ipynb", "toc_visible": true }, "environment": { "kernel": "python3", "name": "common-cpu.m103", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cpu:m103" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 4 }