01_exploratory_data_analysis.ipynb (620 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "ur8xi4C7S06n", "tags": [] }, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "fsv4jGuU89rX" }, "source": [ "# FraudFinder - Exploratory Data Analysis\n", "\n", "<table align=\"left\">\n", " <td>\n", " <a href=\"https://console.cloud.google.com/ai-platform/notebooks/deploy-notebook?download_url=https://github.com/GoogleCloudPlatform/fraudfinder/raw/main/01_exploratory_data_analysis.ipynb\">\n", " <img src=\"https://www.gstatic.com/cloud/images/navigation/vertex-ai.svg\" alt=\"Google Cloud Notebooks\">Open in Cloud Notebook\n", " </a>\n", " </td> \n", " <td>\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/fraudfinder/blob/main/01_exploratory_data_analysis.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Open in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/fraudfinder/blob/main/01_exploratory_data_analysis.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "827c41ab1a12" }, "source": [ "## Overview\n", "\n", "[FraudFinder](https://github.com/googlecloudplatform/fraudfinder) is a series of labs on how to build a real-time fraud detection system on Google Cloud. Throughout the Fraudfinder labs, you will learn how to read historical bank transaction data stored in data warehouse, read from a live stream of new transactions, perform exploratory data analysis (EDA), do feature engineering, ingest features into a feature store, train a model using feature store, register your model in a model registry, evaluate your model, deploy your model to an endpoint, do real-time inference on your model with feature store, and monitor your model." ] }, { "cell_type": "markdown", "metadata": { "id": "45f6e923dc75" }, "source": [ "### Objective\n", "\n", "In this notebook, you will perform exploratory data analysis on the historical bank transactions stored in BigQuery. Please make sure you have completed the [environment setup notebook](00_environment_setup.ipynb) prior to running this notebook.\n", "\n", "This lab uses the following Google Cloud services and resources:\n", "\n", "- [Vertex AI](https://cloud.google.com/vertex-ai/)\n", "- [BigQuery](https://cloud.google.com/bigquery/)\n", "- [Google Cloud Storage](https://cloud.google.com/storage)\n", "- [Pub/Sub](https://cloud.google.com/pubsub/)\n", "\n", "Steps performed in this notebook:\n", "\n", "- Read data from BigQuery\n", "- Calculate summary statistics across historical transactions\n", "- Calculate fraud percentages\n", "- Plot distributions of transaction amounts\n", "- Analyze customer-level aggregates of transaction data" ] }, { "cell_type": "markdown", "metadata": { "id": "8b5e2e2a7bdb" }, "source": [ "### Costs" ] }, { "cell_type": "markdown", "metadata": { "id": "04c1dae4ca17" }, "source": [ "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* Cloud Storage\n", "* BigQuery\n", "\n", "Learn about [Vertex AI\n", "pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Setup based on notebook environment\n", "\n", "If using Colab, make sure to enter your project ID in the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Detect if current environment is Colab, authenticate if so\n", "# Enter in project directly if Colab, detect from environment otherwise\n", "# Also adjusts plotly renderer for Colab vs other (to display interactive plots)\n", "\n", "import sys\n", "import plotly.io as pio\n", "\n", "if \"google.colab\" in sys.modules:\n", " # Authenticate user to Google Cloud\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()\n", "\n", " # If Colab, enter in Cloud project here\n", " GCP_PROJECTS = ['YOUR-PROJECT-ID'] # @param {type:\"raw\"}\n", "\n", " pio.renderers.default = 'colab'\n", "\n", "else:\n", " # Detect Cloud project from environment\n", " GCP_PROJECTS = !gcloud config get-value project\n", "\n", " pio.renderers.default = 'iframe'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load configuration settings from the setup notebook\n", "\n", "Set the constants used in this notebook and load the config settings from the `00_environment_setup.ipynb` notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "PROJECT_ID = GCP_PROJECTS[0]\n", "BUCKET_NAME = f\"{PROJECT_ID}-fraudfinder\"\n", "config = !gsutil cat gs://{BUCKET_NAME}/config/notebook_env.py\n", "print(config.n)\n", "exec(config.n)" ] }, { "cell_type": "markdown", "metadata": { "id": "11d8e5a98a80" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0c4746a0c78c", "tags": [] }, "outputs": [], "source": [ "from typing import Union\n", "\n", "import pandas as pd\n", "import plotly.express as px\n", "from google.cloud import bigquery" ] }, { "cell_type": "markdown", "metadata": { "id": "7ab0ce87d99e" }, "source": [ "#### Initialize BigQuery SDK for Python " ] }, { "cell_type": "markdown", "metadata": { "id": "f7aec2f1e516" }, "source": [ "Use a helper function for sending queries to BigQuery." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "079accfc15e2", "tags": [] }, "outputs": [], "source": [ "# Wrapper to use BigQuery client to run query/job, return job ID or result as DF\n", "def run_bq_query(sql: str) -> Union[str, pd.DataFrame]:\n", " \"\"\"\n", " Run a BigQuery query and return the job ID or result as a DataFrame\n", " Args:\n", " sql: SQL query, as a string, to execute in BigQuery\n", " Returns:\n", " df: DataFrame of results from query, or error, if any\n", " \"\"\"\n", "\n", " bq_client = bigquery.Client(project = PROJECT_ID)\n", "\n", " # Try dry run before executing query to catch any errors\n", " job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)\n", " bq_client.query(sql, job_config=job_config)\n", "\n", " # If dry run succeeds without errors, proceed to run query\n", " job_config = bigquery.QueryJobConfig()\n", " client_result = bq_client.query(sql, job_config=job_config)\n", "\n", " job_id = client_result.job_id\n", "\n", " # Wait for query/job to finish running. then get & return data frame\n", " df = client_result.result().to_arrow().to_pandas()\n", " print(f\"Finished job_id: {job_id}\")\n", " return df" ] }, { "cell_type": "markdown", "metadata": { "id": "dde60686aa82" }, "source": [ "### Exploratory data analysis of transaction data in BigQuery\n", "\n", "In this section, you'll explore some of the FraudFinder data by running queries and creating a couple interactive plots." ] }, { "cell_type": "markdown", "metadata": { "id": "4f3ff222dea9" }, "source": [ "#### Get transaction data summary statistics\n", "First, you generate some summary statistics across the different fields in the transaction data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f0405eb973fb", "tags": [] }, "outputs": [], "source": [ "run_bq_query(\n", " \"\"\"\n", " SELECT\n", " COUNT(*) AS NUM_TX,\n", "\n", " MIN(TX_TS) AS MIN_TX_DATE,\n", " MAX(TX_TS) AS MAX_TX_DATE,\n", "\n", " COUNT(DISTINCT CUSTOMER_ID) AS NUM_CUSTOMERS,\n", " COUNT(DISTINCT TERMINAL_ID) AS NUM_TERMINALS,\n", "\n", " MIN(TX_AMOUNT) AS MIN_TX_AMOUNT,\n", " AVG(TX_AMOUNT) AS AVG_TX_AMOUNT,\n", " MAX(TX_AMOUNT) AS MAX_TX_AMOUNT\n", "\n", " FROM\n", " tx.tx\n", " \"\"\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "3408867266eb" }, "source": [ "The results show that there are over 148M transactions spanning the course of 2022 and 2025, taking place at 5000 different terminals with nearly 50K unique customers. The transaction amounts vary from 0 to over 1100, with the mean amount being about 55 dollars.\n", "\n", "As of 2024, the data has been simulated into 2025, so that it is easier to do demonstrate some of the machine learning capabilities like evaluation in notebooks that you will be using later." ] }, { "cell_type": "markdown", "metadata": { "id": "b7b3266f39a6" }, "source": [ "#### Get fraud classification counts and percentages\n", "Next, you look at the fraud label data to see how many (and what percentage of) transactions are classified as fraud or not." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "daf0467aaeb5", "tags": [] }, "outputs": [], "source": [ "run_bq_query(\n", " \"\"\"\n", " SELECT\n", " TX_FRAUD,\n", "\n", " COUNT(*) AS NUM_TX,\n", "\n", " /* Calculates number of total transactions (not grouped) */\n", " SUM(COUNT(*)) OVER () AS OVR_TOTAL_TX,\n", "\n", " /* Calculates number of transaction in this group as % of total */\n", " SAFE_DIVIDE(\n", " COUNT(*),\n", " SUM(COUNT(*)) OVER ()\n", " ) AS PCT_TOTAL_TX\n", "\n", " FROM\n", " tx.txlabels\n", "\n", " GROUP BY\n", " TX_FRAUD\n", "\n", " ORDER BY\n", " TX_FRAUD\n", " \"\"\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "a2ee6b7cca4a" }, "source": [ "You can see that about 2% of transactions are fraudulent, while the remaining 98% of transactions are not labeled as fraud." ] }, { "cell_type": "markdown", "metadata": { "id": "6552c3f71faa" }, "source": [ "#### Plot transaction amount distribution\n", "Next, you examine the distribution of transaction amounts more closely. After seeing above that the range is about 0 to 1100 dollars, you want to look at how often various transaction amounts across that range show up. Rather than bringing in all ~148M transactions amounts and plotting them, you can aggregate to the nearest dollar and count the number of observations in each dollar \"bucket\" using BigQuery." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c41e0815b2b3", "tags": [] }, "outputs": [], "source": [ "transaction_amount_dist = run_bq_query(\n", " \"\"\"\n", " SELECT\n", " /* Round transaction to nearest dollar, use to group */\n", " ROUND(TX_AMOUNT, 0) AS ROUNDED_TX_AMOUNT,\n", " COUNT(*) AS NUM_TX\n", "\n", " FROM\n", " tx.tx\n", "\n", " GROUP BY\n", " ROUNDED_TX_AMOUNT\n", "\n", " ORDER BY\n", " ROUNDED_TX_AMOUNT\n", " \"\"\"\n", ")\n", "\n", "transaction_amount_dist" ] }, { "cell_type": "markdown", "metadata": { "id": "81e3664b40a6" }, "source": [ "You can display the results as a histogram using the functionality to build interactive bar charts in [Plotly Express](https://plotly.com/python/bar-charts/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "54039166d146", "tags": [] }, "outputs": [], "source": [ "# Create histogram of transaction amounts\n", "transaction_amount_interactive_histogram = px.bar(\n", " transaction_amount_dist,\n", " x=\"ROUNDED_TX_AMOUNT\",\n", " y=\"NUM_TX\",\n", " labels={\n", " \"ROUNDED_TX_AMOUNT\": \"Transaction Amount (to nearest $)\",\n", " \"NUM_TX\": \"# of Transactions\",\n", " },\n", " title=\"Number of Transactions by Amount\",\n", " color_discrete_sequence=['blue']\n", ")\n", "\n", "transaction_amount_interactive_histogram.update_traces(\n", " marker_color='black',\n", " marker_line_color='white',\n", " marker_line_width=0.1\n", " )\n", "\n", "transaction_amount_interactive_histogram.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "9f4d4fe51a13" }, "source": [ "You can see that the vast majority of transaction amounts are 125 dollars or less, with a steady \"plateau\" of 700K-800K transactions at each dollar amount from about 10 to 80. One advantage of the interactive bar chart is that you can zoom in to different areas (e.g. the bars are hard to see past 100 or so on the default plot, so reducing the y-axis helps) and then hover over different bars to see the actual numbers." ] }, { "cell_type": "markdown", "metadata": { "id": "14113c7dbf4d" }, "source": [ "#### Analyze customer-level aggregates of transaction data\n", "After looking at the data on individual transactions in a couple different ways, you now look at the transaction data aggregated to the customer level. The query below gets three summary values for each customer in the dataset:\n", "* number of transactions\n", "* average transaction value\n", "* percentage of transactions that are fraudulent" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e3ae4a0f60f6", "tags": [] }, "outputs": [], "source": [ "customer_aggregates = run_bq_query(\n", " \"\"\"\n", " SELECT\n", " CUSTOMER_ID,\n", " COUNT(*) AS NUM_TX,\n", " AVG(TX_AMOUNT) AS AVG_TX_AMOUNT,\n", "\n", " /* Get % of transactions that are fraudulent, accounting for\n", " possibility of missing fraud labels */\n", " SAFE_DIVIDE(\n", " SUM(IF(TX_FRAUD IS NOT NULL, TX_FRAUD, 0)),\n", " SUM(IF(TX_FRAUD IS NOT NULL, 1, 0))\n", " ) AS PCT_TX_FRAUD\n", "\n", " FROM\n", " `tx.tx`\n", "\n", " /* Join to labels data to get fraud status of each transaction */\n", " LEFT JOIN\n", " `tx.txlabels` USING (TX_ID)\n", "\n", " GROUP BY\n", " CUSTOMER_ID\n", " \"\"\"\n", ")\n", "\n", "customer_aggregates.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "89211995e425" }, "source": [ "You can now explore this customer data further using [Plotly's scatter plot functionality](https://plotly.com/python/line-and-scatter/) to look at each customer's average transaction value vs number of transactions in an interactive plot. The code below does that, while also coloring points by the percentage of a customer's transactions that are fraudulent." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6460910cc6ec", "tags": [] }, "outputs": [], "source": [ "customer_aggregates_scatter_plot = px.scatter(\n", " customer_aggregates,\n", " x=\"NUM_TX\",\n", " y=\"AVG_TX_AMOUNT\",\n", " color=\"PCT_TX_FRAUD\",\n", " labels={\n", " \"CUSTOMER_ID\": \"Customer ID\",\n", " \"NUM_TX\": \"# of Transactions\",\n", " \"AVG_TX_AMOUNT\": \"Avg Transaction Amount ($)\",\n", " \"PCT_TX_FRAUD\": \"% of Transactions That Are Fraud\",\n", " },\n", " hover_data=[\"CUSTOMER_ID\"],\n", " title=(\n", " \"<b>Customer-Level Transaction Aggregates</b><br>\"\n", " + \"(Each Point is 1 Customer, Colored by % of Fraud Transactions)\"\n", " ),\n", ")\n", "\n", "customer_aggregates_scatter_plot.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "43e521da16bf" }, "source": [ "The plot above shows that the vast majority of customers have about 6000 or fewer transactions with an average value of 100 or less, and most of those appear to have very low fraud rates. But there are customers at the far right with a very large number of transactions (in the 8000-11000 range), some of whom have very high average transaction values (from 100 to more than 300 dollars). And the customers in this group with very high transaction counts also have an extremely high rate of fraud: most upwards of 60%, with some higher than 90% fraud.\n", "\n", "These trends found in the data are definitely worth keeping in mind as you think about how to construct a model to predict fraudulent transactions.\n", "\n", "Note that because this is simulated data, the graph is unlikely to be very realistic (!) so consider this more of an exercise in understanding the workflow you may encounter and the products on Google Cloud that can help you solve for real-time fraud classification along the way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### END\n", "\n", "Now you can go to the next notebook `02_feature_engineering_batch.ipynb`" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "01_exploratory_data_analysis.ipynb", "toc_visible": true }, "environment": { "kernel": "python3", "name": "common-cpu.m116", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cpu:m116" }, "kernelspec": { "display_name": "Python 3 (Local)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }