gaming/propensity-model/bqml/bqml_ga4_gaming_propensity_to

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "XM4xjzQNzHwz" }, "outputs": [], "source": [ "# Copyright 2020 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "tTt2Oe0szW0M" }, "source": [ "<table align=\"left\">\n", " <td>\n", " <a href=\"https://console.cloud.google.com/ai-platform/notebooks/deploy-notebook?name=Churn%20prediction%20for%20game%20developers%20using%20Google%20Analytics%204%20%28GA4%29%20and%20BigQuery%20ML%20Notebook&download_url=https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fanalytics-componentized-patterns%2Fmaster%2Fgaming%2Fpropensity-model%2Fbqml%2Fbqml_ga4_gaming_propensity_to_churn.ipynb\">\n", " <img src=\"https://cloud.google.com/images/products/ai/ai-solutions-icon.svg\" alt=\"AI Platform Notebooks\">Run on AI Platform Notebooks</a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/blob/master/gaming/propensity-model/bqml/bqml_ga4_gaming_propensity_to_churn.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "DWW_k7u_zaER" }, "source": [ "# Overview\n", "This notebook shows you how you can train, evaluate, and deploy a propensity model in BigQuery ML to predict user retention on a mobile game, based on app measurement data from Google Analytics 4.\n", "\n", "#### Propensity modeling in the mobile gaming industry\n", "According to a [2019 study](https://gameanalytics.com/reports/mobile-gaming-industry-analysis-h1-2019) on 100K mobile games by the Mobile Gaming Industry Analysis, most mobile games only see a 25% retention rate for users after the first 24 hours, and any game \"below 30% retention generally needs improvement\". In light of this, using machine learning -- to identify the propensity that a user churn after day 1 -- can allow app developers to incentivize users at higher risk of churning to return.\n", "\n", "To predict the propensity (a.k.a. likelihood) that a user will return vs churn, you can use classification algorithms, like logistic regression, XGBoost, neural networks, or AutoML Tables, all of which are available with BigQuery ML.\n", "\n", "#### Propensity modeling in BigQuery ML\n", "With BigQuery ML, you can train, evaluate and deploy our models directly within BigQuery using SQL, which saves time from needing to manually configure ML infrastructure. You can train and deploy ML models directly where the data is already stored, which also helps to avoid potential issues around data governance.\n", "\n", "Using classification models that you train and deploy in BigQuery ML, you can predict propensity using the output of the models. The models outputs provide a probability score between 0 and 1.0 -- how likely the model predicts that the user will churn (1) or not churn (0).\n", "\n", "Using the probability (propensity) scores, you can then, for example, target users who may not return on their own, but could potentially return if they are provided with an incentive or notification.\n", "\n", "#### Not just churn -- propensity modeling for any behavior\n", "Propensity modeling is not limited to predicting churn. In fact, you can calculate a propensity score for any behavior you may want to predict. For example, you may want to predict the likelihood a user will spend money on in-app purchases. Or, perhaps you can predict the likelihood of a user performing \"stickier\" behaviors such as adding and playing with friends, which could lead to longer-term retention and organic user growth. Whichever the case, you can easily modify this notebook to suit your needs, as the overall workflow will still be the same." ] }, { "cell_type": "markdown", "metadata": { "id": "t44p6IQUzrY4" }, "source": [ "## Scope of this notebook" ] }, { "cell_type": "markdown", "metadata": { "id": "nznL3qK4z8Jm" }, "source": [ "### Dataset\n", "\n", "This notebook uses [this public BigQuery dataset](https://console.cloud.google.com/bigquery?p=firebase-public-project&d=analytics_153293282&t=events_20181003&page=table), contains raw event data from a real mobile gaming app called Flood It! ([Android app](https://play.google.com/store/apps/details?id=com.labpixies.flood), [iOS app](https://itunes.apple.com/us/app/flood-it!/id476943146?mt=8)). The [data schema](https://support.google.com/analytics/answer/7029846) originates from Google Analytics for Firebase, but is the same schema as [Google Analytics 4](https://support.google.com/analytics/answer/9358801); this notebook applies to use cases that use either Google Analytics for Firebase or Google Analytics 4 data.\n", "\n", "Google Analytics 4 (GA4) uses an [event-based](https://support.google.com/analytics/answer/9322688) measurement model. Events provide insight on what is happening in an app or on a website, such as user actions, system events, or errors. Every row in the dataset is an event, with various characteristics relevant to that event stored in a nested format within the row. While Google Analytics logs many types of events already by default, developers can also customize the types of events they also wish to log.\n", "\n", "Note that as you cannot simply use the raw event data to train a machine learning model, in this notebook, you will also learn the important steps of how to pre-process the raw data into an appropriate format to use as training data for classification models." ] }, { "cell_type": "markdown", "metadata": { "id": "hxrwZlF2PYQ8" }, "source": [ "#### Using your own GA4 data?\n", "If you are already using a Google Analytics 4 property, follow [this guide]((https://support.google.com/analytics/answer/9823238) to learn how to export your GA4 data to BigQuery. Once the GA4 data is in BigQuery, there will be two tables:\n", "\n", "* `events_`\n", "* `events_intraday_`\n", "\n", "For this notebook, you can replace the table in the `FROM` clause in SQL queries with your `events_` table that is updated daily. The `events_intraday_` table contains streaming data for the current day.\n", "\n", "Note that if you use your own GA4 data, you may need to slightly modify some of the scripts in this notebook to predict a different output behavior or the types events in the training data that are specific to your use case. " ] }, { "cell_type": "markdown", "metadata": { "id": "KJB3rOtfrBIl" }, "source": [ "#### Using data from other non-Google Analytics data collection tools?" ] }, { "cell_type": "markdown", "metadata": { "id": "9corrB3IrPeH" }, "source": [ "While this notebook provides code based on a Google Analytics dataset, you can also use your own dataset from other non-Google Analytics data collection tools. The overall concepts and process of propensity modeling will be the same, but you may need to customize the code in order to prepare your dataset into the training data format described in this notebook.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "sH0CZGku0BPp" }, "source": [ "### Objective and Problem Statement\n", "\n", "The goal of this notebook is to provide an end-to-end solution for propensity modeling to predict user churn on GA4 data using BigQuery ML. Using the \"Flood It!\" dataset, based on a user's activity within the first 24 hrs of app installation, you will try various classification models to predict the propensity to churn (1) or not churn (0).\n", "\n", "By the end of this notebook, you will know how to:\n", "* Explore the export of Google Analytics 4 data on BigQuery\n", "* Prepare the training data using demographic, behavioral data, and the label (churn/not-churn)\n", "* Train classification models using BigQuery ML\n", "* Evaluate classification models using BigQuery ML\n", "* Make predictions on which users will churn using BigQuery ML\n", "* Activate on model predictions" ] }, { "cell_type": "markdown", "metadata": { "id": "IStlwydL0NWC" }, "source": [ "### Costs \n", "\n", "There is no cost associated with using the free version of Google Analytics and using the BigQuery Export feature. This tutorial uses billable components of Google Cloud Platform (GCP):\n", "\n", "* BigQuery\n", "* BigQuery ML\n", "\n", "Learn about [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [BigQuery ML\n", "pricing](https://cloud.google.com/bigquery-ml/pricing) and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "coMAAaOH0Tcl" }, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": { "id": "NUlqCGWz0VGL" }, "source": [ "### PIP Install Packages and dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MujO3LNG0W26" }, "outputs": [], "source": [ "!pip install google-cloud-bigquery" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oRab23dd0aeC" }, "outputs": [], "source": [ "# Automatically restart kernel after installs\n", "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True) " ] }, { "cell_type": "markdown", "metadata": { "id": "8PpWu5MR0bPf" }, "source": [ "### Set up your GCP project\n", "\n", "_The following steps are required, regardless of your notebook environment._\n", "\n", "1. [Select or create a GCP project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", "\n", "1. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", "\n", "1. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n", "\n", "1. Enter your project ID and region in the cell below. Then run the cell to make sure the\n", "Cloud SDK uses the right project for all the commands in this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PROJECT_ID = \"YOUR-PROJECT-ID\" #replace with your project id\n", "REGION = 'US'" ] }, { "cell_type": "markdown", "metadata": { "id": "mm5cGuC00mHI" }, "source": [ "### Import libraries and define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DRYImlXa0mld" }, "outputs": [], "source": [ "from google.cloud import bigquery\n", "import pandas as pd\n", "\n", "pd.set_option('display.float_format', lambda x: '%.3f' % x)" ] }, { "cell_type": "markdown", "metadata": { "id": "BlfwBj4P0rHD" }, "source": [ "### Create a BigQuery dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "X9TLkb8f0tCE" }, "source": [ "In this notebook, you will need to create a dataset in your project called `bqmlga4`. To create it, run the following cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d8LyhoGy0ttK" }, "outputs": [], "source": [ "DATASET_NAME = \"bqmlga4\"\n", "!bq mk --location=$REGION --dataset $PROJECT_ID:$DATASET_NAME" ] }, { "cell_type": "markdown", "metadata": { "id": "_sOVqTXM0-Wc" }, "source": [ "## The dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "4LMr6Q3MQt0I" }, "source": [ "### Using the sample gaming event data from Flood it!\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "wV7wi8X51eNw" }, "source": [ "The sample dataset contains raw event data, as shown in the next cell:\n", "\n", "_Note_: Jupyter runs cells starting with %%bigquery as SQL queries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 551 }, "id": "wwz_N3Kh1f9l", "outputId": "8b162c57-12a6-4262-e482-6e4d0ae3b47c" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT \n", " *\n", "FROM\n", " `firebase-public-project.analytics_153293282.events_*`\n", " \n", "TABLESAMPLE SYSTEM (1 PERCENT)" ] }, { "cell_type": "markdown", "metadata": { "id": "PYBMN963Qydl" }, "source": [ "It may be helpful to take a look at the overall schema used in Google Analytics 4. As mentioned earlier, Google Analytics 4 uses an event based measurement model and each row in this dataset is an event. [Click here](https://support.google.com/analytics/answer/7029846) to view the complete schema and details about each column. As you can see above, certain columns are nested records and contain detailed information:\n" ] }, { "cell_type": "markdown", "metadata": { "id": "IaM8co6eRsOp" }, "source": [ "\n", "* `app_info`\n", "* `device`\n", "* `ecommerce`\n", "* `event_params`\n", "* `geo`\n", "* `traffic_source`\n", "* `user_properties`\n", "* `items`*\n", "* `web_info`*\n", "\n", "_* present by default in GA4 datasets_" ] }, { "cell_type": "markdown", "metadata": { "id": "dLUv-7xNRhAj" }, "source": [ "As we can see below, there are 15K users and 5.7M events in this dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 77 }, "id": "MjqKMGVDRPyZ", "outputId": "6c0d76c7-ad92-40de-a689-365867b23281" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT \n", " COUNT(DISTINCT user_pseudo_id) as count_distinct_users,\n", " COUNT(event_timestamp) as count_events\n", "FROM\n", " `firebase-public-project.analytics_153293282.events_*`" ] }, { "cell_type": "markdown", "metadata": { "id": "3iHaV9-q1k1i" }, "source": [ "### Preparing the training data" ] }, { "cell_type": "markdown", "metadata": { "id": "P358Jo_s8WC0" }, "source": [ "You cannot simply use raw event data to train a machine learning model as it would not be in the right shape and format to use as training data. So in this section, you will learn how to pre-process the raw data into an appropriate format to use as training data for classification models.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "eXl1kSh1yPXk" }, "source": [ "To predict which user is going to _churn_ or _return_, the ideal training data format for classification should look like the following: \n" ] }, { "cell_type": "markdown", "metadata": { "id": "Xv8ibjMNy_bV" }, "source": [ "|User ID|User demographic data|User behavioral data|Churned|\n", "|-|-|-|-|\n", "|User1|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|1\n", "|User2|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|0\n", "|User3|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|1\n" ] }, { "cell_type": "markdown", "metadata": { "id": "HydYCB2jzrzn" }, "source": [ "Characteristics of the training data:\n", "- each row is a separate unique user ID\n", "- feature(s) for **demographic data**\n", "- feature(s) for **behavioral data**\n", "- the actual **label** that you want to train the model to predict (e.g., 1 = churned, 0 = returned)\n", "\n", "You can train a model with only demographic data or behavioral data, but having a combination of both will likely help you create a more predictive model. For this reason, in this section, you will learn how to pre-process the raw data to follow this training data format." ] }, { "cell_type": "markdown", "metadata": { "id": "ICpTsrfg2-Cw" }, "source": [ "The following sections will walk you through preparing the demographic data, behavioral data, and the label before joining them all together as the training data.\n", "\n", "1. Identifying the label for each user (churned or returned)\n", "1. Extracting demographic data for each user\n", "1. Extracting behavioral data for each user\n", "1. Combining the label, demographic and behavioral data together as training data" ] }, { "cell_type": "markdown", "metadata": { "id": "ZYHefnNx21lO" }, "source": [ "#### Step 1: Identifying the label for each user" ] }, { "cell_type": "markdown", "metadata": { "id": "Qt6a5Kv-25iq" }, "source": [ "The raw dataset doesn't have a feature that simply identifies users as \"churned\" or \"returned\", so in this section, you will need to create this label based on some of the existing columns." ] }, { "cell_type": "markdown", "metadata": { "id": "lgqxD30Hl6FM" }, "source": [ "There are many ways to define user churn, but for the purposes of this notebook, you will predict 1-day churn as users who do not come back and use the app again after 24 hr of the user's first engagement. \n", "\n", "In other words, after 24 hr of a user's first engagement with the app:\n", "- if the user _shows no event data thereafter_, the user is considered **churned**. \n", "- if the user _does have at least one event datapoint thereafter_, then the user is considered **returned**\n", "\n", "You may also want to remove users who were unlikely to have ever returned anyway after spending just a few minutes with the app, which is sometimes referred to as \"bouncing\". For example, we can say want to build our model only on users who spent at least 10 minutes with the app (users who didn't bounce).\n", "\n", "So your updated definition of a **churned user** for this notebook is:\n", "> \"any user who spent at least 10 minutes on the app, but after 24 hour from when they first engaged with the app, never used the app again\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_YyxwMQQ1uQW" }, "source": [ "In SQL, since the raw data contains all of the events for every user, from their first touch (app installation) to their last touch, you can use this information to create two columns: `churned` and `bounced`.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "d_JCCtuZVzne" }, "source": [ "Take a look at the following SQL query and the results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 531 }, "id": "_QQO3POV2EQ4", "outputId": "8369d0bd-8527-42bd-d1c6-aa8e4c380cb0" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID \n", "\n", "CREATE OR REPLACE VIEW bqmlga4.returningusers AS (\n", " WITH firstlasttouch AS (\n", " SELECT\n", " user_pseudo_id,\n", " MIN(event_timestamp) AS user_first_engagement,\n", " MAX(event_timestamp) AS user_last_engagement\n", " FROM\n", " `firebase-public-project.analytics_153293282.events_*`\n", " WHERE event_name=\"user_engagement\"\n", " GROUP BY\n", " user_pseudo_id\n", "\n", " )\n", " SELECT\n", " user_pseudo_id,\n", " user_first_engagement,\n", " user_last_engagement,\n", " EXTRACT(MONTH from TIMESTAMP_MICROS(user_first_engagement)) as month,\n", " EXTRACT(DAYOFYEAR from TIMESTAMP_MICROS(user_first_engagement)) as julianday,\n", " EXTRACT(DAYOFWEEK from TIMESTAMP_MICROS(user_first_engagement)) as dayofweek,\n", "\n", " #add 24 hr to user's first touch\n", " (user_first_engagement + 86400000000) AS ts_24hr_after_first_engagement,\n", "\n", "#churned = 1 if last_touch within 24 hr of app installation, else 0\n", "IF (user_last_engagement < (user_first_engagement + 86400000000),\n", " 1,\n", " 0 ) AS churned,\n", "\n", "#bounced = 1 if last_touch within 10 min, else 0\n", "IF (user_last_engagement <= (user_first_engagement + 600000000),\n", " 1,\n", " 0 ) AS bounced,\n", " FROM\n", " firstlasttouch\n", " GROUP BY\n", " 1,2,3\n", " );\n", "\n", "SELECT \n", " * \n", "FROM \n", " bqmlga4.returningusers \n", "LIMIT 100;" ] }, { "cell_type": "markdown", "metadata": { "id": "FOoqPb2J2Q5f" }, "source": [ "For the `churned` column, `churned=0` if the user performs an action after 24 hours since their first touch, otherwise if their last action was only within the first 24 hours, then `churned=1`.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "sC3sIc0C2a4Z" }, "source": [ "For the `bounced` column, `bounced=1` if the user's last action was within the first ten minutes since their first touch with the app, otherwise `bounced=0`. We can use this column to filter our training data later on, by conditionally querying for users where `bounced = 0`." ] }, { "cell_type": "markdown", "metadata": { "id": "ulbfb8SY2fSM" }, "source": [ "You might wonder how many of these 15k users bounced and returned? You can run the following query to check:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "id": "gC32zyIE2olw", "outputId": "6c51e523-0a4f-4a8d-a4c4-e0f7f0131925" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT\n", " bounced,\n", " churned, \n", " COUNT(churned) as count_users\n", "FROM\n", " bqmlga4.returningusers\n", "GROUP BY 1,2\n", "ORDER BY bounced" ] }, { "cell_type": "markdown", "metadata": { "id": "Z29RsKJi2uwO" }, "source": [ "For the training data, you will only end up using data where `bounced = 0`. Based on the 15k users, you can see that 5,557 (\\~41%) users bounced within the first ten minutes of their first engagement with the app, but of the remaining 8,031 users, 1,883 users (\\~23%) churned after 24 hours." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 77 }, "id": "ZStzVtgEIkzh", "outputId": "15b2abb5-f966-4363-f7f6-b676f6b521b8" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT\n", " COUNTIF(churned=1)/COUNT(churned) as churn_rate\n", "FROM\n", " bqmlga4.returningusers\n", "WHERE bounced = 0" ] }, { "cell_type": "markdown", "metadata": { "id": "daSQViux_XWR" }, "source": [ "#### Step 2. Extracting demographic data for each user" ] }, { "cell_type": "markdown", "metadata": { "id": "qcC2JJ6W_-sb" }, "source": [ "This section is focused on extracting the demographic information for each user. Different demographic information about the user is available in the dataset already, including `app_info`, `device`, `ecommerce`, `event_params`, `geo`. Demographic features can help the model predict whether users on certain devices or countries are more likely to churn.\n", "\n", "For this notebook, you can start just with `geo.country`, `device.operating_system`, and `device.language`. If you are using your own dataset and have joinable first-party data, this section is a good opportunity to add any additional attributes for each user that may not be readily available in Google Analytics 4.\n", "\n", "Note that a user's demographics may occasionally change (e.g. moving from one country to another). For simplicity, you will just use the demographic information that Google Analytics 4 provides when the user first engaged with the app as indicated by `MIN(event_timestamp)`. This enables every unique user to be represented by a single row." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 343 }, "id": "gc47WFyM_5nQ", "outputId": "5c545aef-eaa8-451e-8443-38461d2c9923" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "CREATE OR REPLACE VIEW bqmlga4.user_demographics AS (\n", "\n", " WITH first_values AS (\n", " SELECT\n", " user_pseudo_id,\n", " geo.country as country,\n", " device.operating_system as operating_system,\n", " device.language as language,\n", " ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS row_num\n", " FROM `firebase-public-project.analytics_153293282.events_*`\n", " WHERE event_name=\"user_engagement\"\n", " )\n", " SELECT * EXCEPT (row_num)\n", " FROM first_values\n", " WHERE row_num = 1\n", " );\n", "\n", "SELECT\n", " *\n", "FROM\n", " bqmlga4.user_demographics\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": { "id": "Nxv9yaTt2zD1" }, "source": [ "#### Step 3. Extracting behavioral data for each user" ] }, { "cell_type": "markdown", "metadata": { "id": "c_9qPkVAfmpY" }, "source": [ "Behavioral data in the raw event data spans across multiple events -- and thus rows -- per user. The goal of this section is to aggregate and extract behavioral data for each user, resulting in one row of behavioral data per unique user.\n", "\n", "But what kind of behavioral data will you need to prepare? Since the end goal of this notebook is to predict, based on a user's activity within the first 24 hrs since app installation, whether that user will churn or return thereafter, then you will want to use behavioral data from the first 24 hrs in your training data. Later on, we can also extract some extra time-related features from `user_first_engagement`, such as the month or day of the first engagement.\n", "\n", "Google Analytics automatically collects [certain events](https://support.google.com/analytics/answer/6317485) that you can use to analyze behavior. In addition, there are certain recommended [events for games](https://support.google.com/analytics/answer/6317494). \n", "\n", "\n", "As a first step, you can explore all the unique events that exist in this dataset, based on `event_name`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "XsXBNmeAf3fI", "outputId": "da0f4a32-83ba-42e5-c381-49178c44f5f1" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT\n", " event_name,\n", " COUNT(event_name) as event_count\n", "FROM\n", " `firebase-public-project.analytics_153293282.events_*`\n", "GROUP BY 1\n", "ORDER BY\n", " event_count DESC" ] }, { "cell_type": "markdown", "metadata": { "id": "smsCdPFpmJpT" }, "source": [ "For this notebook, to predict whether a user will churn or return, you can start by counting the number of times a user engages in the following event types:\n", "\n", "* `user_engagement`\n", "* `level_start_quickplay`\n", "* `level_end_quickplay`\n", "* `level_complete_quickplay`\n", "* `level_reset_quickplay`\n", "* `post_score`\n", "* `spend_virtual_currency`\n", "* `ad_reward`\n", "* `challenge_a_friend`\n", "* `completed_5_levels`\n", "* `use_extra_steps`\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "D9vIqhuj_qFW" }, "source": [ "In SQL, you can aggregate the behavioral data by calculating the total number of times when each of the above `event_names` occurred in the data set per user.\n", "\n", "If you are using your own dataset, you may have different event types that you can aggregate and extract. Your app may be sending very different `event_names` to Google Analytics so be sure to use events most suitable to your scenario." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 363 }, "id": "nzbVtI6G_Y9p", "outputId": "8ae82387-01ff-4bbe-b629-659e4754ac81" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "CREATE OR REPLACE VIEW bqmlga4.user_aggregate_behavior AS (\n", "WITH\n", " events_first24hr AS (\n", " #select user data only from first 24 hr of using the app\n", " SELECT\n", " e.*\n", " FROM\n", " `firebase-public-project.analytics_153293282.events_*` e\n", " JOIN\n", " bqmlga4.returningusers r\n", " ON\n", " e.user_pseudo_id = r.user_pseudo_id\n", " WHERE\n", " e.event_timestamp <= r.ts_24hr_after_first_engagement\n", " )\n", "SELECT\n", " user_pseudo_id,\n", " SUM(IF(event_name = 'user_engagement', 1, 0)) AS cnt_user_engagement,\n", " SUM(IF(event_name = 'level_start_quickplay', 1, 0)) AS cnt_level_start_quickplay,\n", " SUM(IF(event_name = 'level_end_quickplay', 1, 0)) AS cnt_level_end_quickplay,\n", " SUM(IF(event_name = 'level_complete_quickplay', 1, 0)) AS cnt_level_complete_quickplay,\n", " SUM(IF(event_name = 'level_reset_quickplay', 1, 0)) AS cnt_level_reset_quickplay,\n", " SUM(IF(event_name = 'post_score', 1, 0)) AS cnt_post_score,\n", " SUM(IF(event_name = 'spend_virtual_currency', 1, 0)) AS cnt_spend_virtual_currency,\n", " SUM(IF(event_name = 'ad_reward', 1, 0)) AS cnt_ad_reward,\n", " SUM(IF(event_name = 'challenge_a_friend', 1, 0)) AS cnt_challenge_a_friend,\n", " SUM(IF(event_name = 'completed_5_levels', 1, 0)) AS cnt_completed_5_levels,\n", " SUM(IF(event_name = 'use_extra_steps', 1, 0)) AS cnt_use_extra_steps,\n", "FROM\n", " events_first24hr\n", "GROUP BY\n", " 1\n", " );\n", "\n", "SELECT\n", " *\n", "FROM\n", " bqmlga4.user_aggregate_behavior\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": { "id": "EO1zcaiK_I4S" }, "source": [ "Note that in addition to frequency of performing an action, you can also include other behavioral features in this step such as the total amount of in-game currency they spent, or if they reached certain app-specifc milestones that may be more relevant to your app (e.g., gained a certain threshold amount of XP or leveled up at least 5 times). This is an opportunity for you to extend this notebook to suit your needs." ] }, { "cell_type": "markdown", "metadata": { "id": "PrWx_WQQBitA" }, "source": [ "#### Step 4: Combining the label, demographic and behavioral data together as training data" ] }, { "cell_type": "markdown", "metadata": { "id": "o54qf5U6ik2l" }, "source": [ "In this section, you can now combine these three intermediary views (label, demographic, and behavioral data) into the final training data. Here you can also specify `bounced = 0`, in order to limit the training data only to users who did not \"bounce\" within the first 10 minutes of using the app." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 531 }, "id": "2i4WeTqLB1mC", "outputId": "c6669996-83c4-4fb5-fcde-c3135ffd7705" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "CREATE OR REPLACE VIEW bqmlga4.train AS (\n", " \n", " SELECT\n", " dem.*,\n", " IFNULL(beh.cnt_user_engagement, 0) AS cnt_user_engagement,\n", " IFNULL(beh.cnt_level_start_quickplay, 0) AS cnt_level_start_quickplay,\n", " IFNULL(beh.cnt_level_end_quickplay, 0) AS cnt_level_end_quickplay,\n", " IFNULL(beh.cnt_level_complete_quickplay, 0) AS cnt_level_complete_quickplay,\n", " IFNULL(beh.cnt_level_reset_quickplay, 0) AS cnt_level_reset_quickplay,\n", " IFNULL(beh.cnt_post_score, 0) AS cnt_post_score,\n", " IFNULL(beh.cnt_spend_virtual_currency, 0) AS cnt_spend_virtual_currency,\n", " IFNULL(beh.cnt_ad_reward, 0) AS cnt_ad_reward,\n", " IFNULL(beh.cnt_challenge_a_friend, 0) AS cnt_challenge_a_friend,\n", " IFNULL(beh.cnt_completed_5_levels, 0) AS cnt_completed_5_levels,\n", " IFNULL(beh.cnt_use_extra_steps, 0) AS cnt_use_extra_steps,\n", " ret.user_first_engagement,\n", " ret.month,\n", " ret.julianday,\n", " ret.dayofweek,\n", " ret.churned\n", " FROM\n", " bqmlga4.returningusers ret\n", " LEFT OUTER JOIN\n", " bqmlga4.user_demographics dem\n", " ON \n", " ret.user_pseudo_id = dem.user_pseudo_id\n", " LEFT OUTER JOIN \n", " bqmlga4.user_aggregate_behavior beh\n", " ON\n", " ret.user_pseudo_id = beh.user_pseudo_id\n", " WHERE ret.bounced = 0\n", " );\n", "\n", "SELECT\n", " *\n", "FROM\n", " bqmlga4.train\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": { "id": "Co90TkTsCk9p" }, "source": [ "## Training the propensity model with BigQuery ML" ] }, { "cell_type": "markdown", "metadata": { "id": "stwX9Np9CyWG" }, "source": [ "In this section, using the training data you prepared, you will now train machine learning models in SQL using BigQuery ML. The remainder of the notebook will only use logistic regression, but you can also follow the optional code below to train other model types." ] }, { "cell_type": "markdown", "metadata": { "id": "FQTRryQ9Fr_l" }, "source": [ "**Choosing the model:**\n", "As this is a binary classification task, for simplicity, you can start with [logistic regression](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create), but you can also train other classification models like [XGBoost](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree), [deep neural networks](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-dnn-models) and [AutoML Tables](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl) in BigQuery ML to calculate propensity scores. Each of these models will output a probability score (propensity) between 0 and 1.0 of how likely the model prediction is based on the training data. In this notebook, the model predicts whether the user will churn (1) or return (0) after 24 hours of the user's first engagement with the app.\n", "\n", "\n", "|Model| model_type| Advantages | Disadvantages|\n", "|-|-|-|-|\n", "|**Logistic Regression**| `LOGISTIC_REG` ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create))| Fast to train vs. other model types | May not have the highest model performance |\n", "|**XGBoost**| `BOOSTED_TREE_CLASSIFIER` ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree))| Higher model performance. Can inspect feature importance. | Slower to train vs. `LOGISTIC_REG`.|\n", "|**Deep Neural Networks**| `DNN_CLASSIFIER` ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-dnn-models))| Higher model performance | Slower to train vs. `LOGISTIC_REG`.|\n", "|**AutoML Tables**| `AUTOML_CLASSIFIER` ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl))| Very high model performance | May take at least a few hours to train, not easy to explain how the model works. |\n" ] }, { "cell_type": "markdown", "metadata": { "id": "W326AA24FyPF" }, "source": [ "**There's no need to split your data into train/test:**\n", "- When you run the `CREATE MODEL` statement, BigQuery ML will automatically split your data into training and test, so you can evaluate your model immediately after training (see the [documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create#data_split_method) for more information or how to specify the split manually).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "zfNMGKb_GAPM" }, "source": [ "**Hyperparameter tuning:**\n", "Note that you can also tune hyperparameters for each model, although it is beyond the scope of this notebook. See the [BigQuery ML documentation for CREATE MODEL](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create) for further details on the available hyperparameters." ] }, { "cell_type": "markdown", "metadata": { "id": "X5Ei0HB_HTAT" }, "source": [ "**`TRANSFORM()`:** \n", "It may also be useful to extract features from datetimes/timestamps as one simple example of additional feature preprocessing before training. For example, we can extract the month, day of year, and day of week from `user_first_engagement`. [`TRANSFORM()`](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create#transform) allows the model to remember the extracted values so you won't need to extract them again when making predictions using the model later on." ] }, { "cell_type": "markdown", "metadata": { "id": "cu-xHARS90xN" }, "source": [ "#### Train a logistic regression model" ] }, { "cell_type": "markdown", "metadata": { "id": "5A_PGKG5Ob8u" }, "source": [ "The following code trains a logistic regression model. This should only take a minute or two to train.\n", "\n", "For more information on the default hyperparameters used, you can read the documentation: \n", "[CREATE MODEL statement](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 31 }, "id": "1WRGHLIIC-RL", "outputId": "32339cf6-5548-4239-e211-002d1c5743a7" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "CREATE OR REPLACE MODEL bqmlga4.churn_logreg\n", "\n", "OPTIONS(\n", " MODEL_TYPE=\"LOGISTIC_REG\",\n", " INPUT_LABEL_COLS=[\"churned\"]\n", ") AS\n", "\n", "SELECT\n", " *\n", "FROM\n", " bqmlga4.train" ] }, { "cell_type": "markdown", "metadata": { "id": "ZBX-46-Q94tI" }, "source": [ "#### Train an XGBoost model (optional)" ] }, { "cell_type": "markdown", "metadata": { "id": "uqewIJBoOnAZ" }, "source": [ "The following code trains an XGBoost model. This may take several minutes to train.\n", "\n", "For more information on the default hyperparameters used, you can read the documentation: \n", "[CREATE MODEL statement for Boosted Tree models using XGBoost](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3XNDx5YoEFmY" }, "outputs": [], "source": [ "# %%bigquery --project $PROJECT_ID\n", "\n", "# CREATE OR REPLACE MODEL bqmlga4.churn_xgb\n", "\n", "# OPTIONS(\n", "# MODEL_TYPE=\"BOOSTED_TREE_CLASSIFIER\",\n", "# INPUT_LABEL_COLS=[\"churned\"]\n", "# ) AS\n", "\n", "# SELECT\n", "# * EXCEPT(user_pseudo_id)\n", "# FROM\n", "# bqmlga4.train" ] }, { "cell_type": "markdown", "metadata": { "id": "0tSEDUcl98eE" }, "source": [ "#### Train a deep neural network (DNN) model (optional)" ] }, { "cell_type": "markdown", "metadata": { "id": "FPpfxT6ZOslN" }, "source": [ "The following code trains a deep neural network. This may take several minutes to train.\n", "\n", "For more information on the default hyperparameters used, you can read the documentation: \n", "[CREATE MODEL statement for Deep Neural Network (DNN) models](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-dnn-models)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1o99jDldEGbT" }, "outputs": [], "source": [ "# %%bigquery --project $PROJECT_ID\n", "\n", "# CREATE OR REPLACE MODEL bqmlga4.churn_dnn\n", "\n", "# OPTIONS(\n", "# MODEL_TYPE=\"DNN_CLASSIFIER\",\n", "# INPUT_LABEL_COLS=[\"churned\"]\n", "# ) AS\n", "\n", "# SELECT\n", "# * EXCEPT(user_pseudo_id)\n", "# FROM\n", "# bqmlga4.train" ] }, { "cell_type": "markdown", "metadata": { "id": "X-DJU4-IkTsP" }, "source": [ "### Train an AutoML Tables model (optional)\n", "\n", "[AutoML Tables](https://cloud.google.com/automl-tables) enables you to automatically build state-of-the-art machine learning models on structured data at massively increased speed and scale. AutoML Tables automatically searches through Google’s model zoo for structured data to find the best model for your needs, ranging from linear/logistic regression models for simpler datasets to advanced deep, ensemble, and architecture-search methods for larger, more complex ones.\n", "\n", "You can train an [AutoML model directly with BigQuery ML](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl), as in the code below.\n", "\n", "Note that the `BUDGET_HOURS` parameter is for AutoML Tables training, specified in hours. The default value is 1.0 hour and must be between 1.0 and 72.0. The total query processing time can be greater than the budgeted hours specified in the query.\n", "\n", "**Note:** This may take a few hours to train.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-tcfIVLZEHXU" }, "outputs": [], "source": [ "# %%bigquery --project $PROJECT_ID\n", "\n", "# CREATE OR REPLACE MODEL bqmlga4.churn_automl\n", "\n", "# OPTIONS(\n", "# MODEL_TYPE=\"AUTOML_CLASSIFIER\",\n", "# INPUT_LABEL_COLS=[\"churned\"],\n", "# BUDGET_HOURS=1.0\n", "# ) AS\n", "\n", "# SELECT\n", "# * EXCEPT(user_pseudo_id)\n", "# FROM\n", "# bqmlga4.train" ] }, { "cell_type": "markdown", "metadata": { "id": "s_t0zxoeE1d7" }, "source": [ "## Model Evaluation" ] }, { "cell_type": "markdown", "metadata": { "id": "b69YK1iX_ksq" }, "source": [ "To evaluate the model, you can run [`ML.EVALUATE`](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate) on a model that has finished training to inspect some of the metrics.\n", "\n", "The metrics are based on the test sample data that was automatically split during model creation ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create#data_split_method))." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 77 }, "id": "p16-00xjE3JQ", "outputId": "cd9db9bf-2211-4152-e84e-5c299354663d" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT\n", " *\n", "FROM\n", " ML.EVALUATE(MODEL bqmlga4.churn_logreg)" ] }, { "cell_type": "markdown", "metadata": { "id": "ejdRfRYeATL-" }, "source": [ "`ML.EVALUATE` generates the `precision`, `recall`, `accuracy` and `f1_score` using the default classification threshold of 0.5, which can be modified by using the optional [`THRESHOLD`](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate#eval_threshold) parameter.\n", "\n", "Generally speaking, you can use the `log_loss` and `roc_auc` metrics to compare model performance.\n", "\n", "The `log_loss` ranges between 0 and 1.0, and the closer the `log_loss` is the zero, the closer the predicted labels were to the actual labels.\n", "The `roc_auc` ranges between 0 and 1.0, and the closer the `roc_auc` is to 1.0, the better the model is at distinguishing between the classes.\n", "\n", "For more information on these metrics, you can read through the definitions on [precision and recall](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall), [accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy), [f1-score](https://en.wikipedia.org/wiki/F-score), [log_loss](https://en.wikipedia.org/wiki/Loss_functions_for_classification#Logistic_loss) and [roc_auc](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)." ] }, { "cell_type": "markdown", "metadata": { "id": "glkKlvoqP7Uf" }, "source": [ "#### Confusion matrix: predicted vs actual values" ] }, { "cell_type": "markdown", "metadata": { "id": "AYB1xZ9oQDOv" }, "source": [ "In addition to model evaluation metrics, you may also want to use a confusion matrix to inspect how well the model predicted the labels, compared to the actual labels.\n", "\n", "With the rows indicating the actual labels, and the columns as the predicted labels, the resulting format for ML.CONFUSION_MATRIX for binary classification looks like:\n", "\n", "| | Predicted_0 | Predicted_1|\n", "|-|-|-|\n", "|Actual_0| True Negatives | False Positives|\n", "|Actual_1| False Negatives | True Positives|\n", "\n", "For more information on confusion matrices, you can read through a detailed explanation [here](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 106 }, "id": "Db5M8U8QQgyi", "outputId": "2ea7e3c1-e411-43aa-e32c-6a29a2175c14" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT\n", " expected_label,\n", " _0 AS predicted_0,\n", " _1 AS predicted_1\n", "FROM\n", " ML.CONFUSION_MATRIX(MODEL bqmlga4.churn_logreg)" ] }, { "cell_type": "markdown", "metadata": { "id": "Ix5q8Onw1aYs" }, "source": [ "#### ROC Curve" ] }, { "cell_type": "markdown", "metadata": { "id": "XleHv73dIZx-" }, "source": [ "You can plot the AUC-ROC curve by using `ML.ROC_CURVE` to return the metrics for different threshold values for the model ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-roc))." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "p_iG9vOfbgX6" }, "outputs": [], "source": [ "%%bigquery df_roc --project $PROJECT_ID\n", "SELECT * FROM ML.ROC_CURVE(MODEL bqmlga4.churn_logreg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 422 }, "id": "GoHGUcC7bzvX", "outputId": "c809aec7-5153-4a8b-bae8-14103c996c61" }, "outputs": [], "source": [ "df_roc" ] }, { "cell_type": "markdown", "metadata": { "id": "s_zu9zZXJVNG" }, "source": [ "Plot the AUC-ROC curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 313 }, "id": "jSRTXr6ub2ty", "outputId": "46f6cf64-9108-4c8d-d148-ecbd2036ed2d" }, "outputs": [], "source": [ "df_roc.plot(x=\"false_positive_rate\", y=\"recall\", title=\"AUC-ROC curve\")" ] }, { "cell_type": "markdown", "metadata": { "id": "oyeAzheVFUxU" }, "source": [ "## Model prediction" ] }, { "cell_type": "markdown", "metadata": { "id": "02ZQPKIebo7y" }, "source": [ "You can run [`ML.PREDICT`](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-predict) to make predictions on the propensity to churn. The following code returns all the information from `ML.PREDICT`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 590 }, "id": "229TPkhUFe23", "outputId": "81495cbc-754a-4845-8e74-cd9ef0a49ede" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT\n", " *\n", "FROM\n", " ML.PREDICT(MODEL bqmlga4.churn_logreg,\n", " (SELECT * FROM bqmlga4.train)) #can be replaced with a test dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "4WinvkfYdE1q" }, "source": [ "For propensity modeling, the most important output is the probability of a behavior occuring. The following query returns the probability that the user will return after 24 hrs. The higher the probability and closer it is to 1, the more likely the user is predicted to churn, and the closer it is to 0, the more likely the user is predicted to return." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 422 }, "id": "92eOP8Sw7zO_", "outputId": "a98b4dd5-39d4-43ff-ed39-7d1cdbb9a33f" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "SELECT\n", " user_pseudo_id,\n", " churned,\n", " predicted_churned,\n", " predicted_churned_probs[OFFSET(0)].prob as probability_churned\n", " \n", "FROM\n", " ML.PREDICT(MODEL bqmlga4.churn_logreg,\n", " (SELECT * FROM bqmlga4.train)) #can be replaced with a proper test dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "TCGAfP9DKLDH" }, "source": [ "### Exporting the predictions out of Bigquery" ] }, { "cell_type": "markdown", "metadata": { "id": "EMETxq7CKtTQ" }, "source": [ "##### Reading the predictions directly from BigQuery" ] }, { "cell_type": "markdown", "metadata": { "id": "6_BUvnJvKQQP" }, "source": [ "With the predictions from `ML.PREDICT`, you can export the data into a Pandas dataframe using the BigQuery Storage API (see [documentation and code samples](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas#download_table_data_using_the_client_library)). You can also use other [BigQuery client libraries](https://cloud.google.com/bigquery/docs/reference/libraries).\n", "\n", "Alternatively you can also export directly into pandas in a notebook using the %%bigquery <variable name> as in:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0qB2DsQkKgEH" }, "outputs": [], "source": [ "%%bigquery df --project $PROJECT_ID\n", "\n", "SELECT\n", " user_pseudo_id,\n", " churned,\n", " predicted_churned,\n", " predicted_churned_probs[OFFSET(0)].prob as probability_churned\n", " \n", "FROM\n", " ML.PREDICT(MODEL bqmlga4.churn_logreg,\n", " (SELECT * FROM bqmlga4.train)) #can be replaced with a proper test dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-yJAFuUOKl1i" }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "wzEsO_oPK2vv" }, "source": [ "##### Export predictions table to Google Cloud Storage" ] }, { "cell_type": "markdown", "metadata": { "id": "xxgKXizcK5Cy" }, "source": [ "There are several ways to export the predictions table to Google Cloud Storage (GCS), so that you can use them in a separate service. Perhaps the easiest way is to export directly to GCS using SQL ([documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#export_data_statement))." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5iah7a0KK9lv" }, "outputs": [], "source": [ "%%bigquery --project $PROJECT_ID\n", "\n", "EXPORT DATA OPTIONS (\n", "uri=\"gs://mybucket/myfile/churnpredictions.csv\", \n", " format=CSV\n", ") AS \n", "SELECT\n", " user_pseudo_id,\n", " churned,\n", " predicted_churned,\n", " predicted_churned_probs[OFFSET(0)].prob as probability_churned\n", "FROM\n", " ML.PREDICT(MODEL bqmlga4.churn_logreg,\n", " (SELECT * FROM bqmlga4.train)) #can be replaced with a proper test dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "kDdj9jDZFl5z" }, "source": [ "## Activate on model predictions" ] }, { "cell_type": "markdown", "metadata": { "id": "WfWm7WFdmMN7" }, "source": [ "Once you have the model predictions, there are different steps you can take based on your business objective.\n", "\n", "In our analysis, we used `user_pseudo_id` as the user identifier. However, ideally, your app should send back the `user_id` from your app to Google Analytics. This will help you to:\n", "\n", "* join any first-party data you have for model predictions\n", "* joins the model predictions with your first-party data\n", "\n", "Once you have this join capability, you can:\n", "\n", "* Export the model predictions back into Google Analytics as user attribute. This can be done using [Data Import feature](https://support.google.com/analytics/answer/10071301) in Google Analytics 4.\n", " * Based on the prediction values you can [Create and edit audiences](https://support.google.com/analytics/answer/2611404) and also do [Audience targeting](https://support.google.com/optimize/answer/6283435). For example, an audience can be users with prediction probability between 0.4 and 0.7, to represent users who are predicted to be \"on the fence\" between churning and returning.\n", "* Adjust the user experience for targeted users within your app. For Firebase Apps, you can use the [Import segmentments](https://firebase.google.com/docs/projects/import-segments) feature. You can tailor user experience by targeting your identified users through Firebase services such as Remote Config, Cloud Messaging, and In-App Messaging. This will involve importing the segment data from BigQuery into Firebase. After that you can send notifications to the users, configure the app for them, or follow the user journeys across devices.\n", "* Run targeted marketing campaigns via CRMs like Salesforce, e.g. send out reminder emails.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ODjAEK2cmf9S" }, "source": [ "## Further resources: \n" ] }, { "cell_type": "markdown", "metadata": { "id": "u-K-dCfJKSpi" }, "source": [ "As you collect more data from your users, you may want to regularly evaluate your model on fresh data and re-train the model if you notice that the model quality is decaying.\n", "\n", "Continuous evaluation—the process of ensuring a production machine learning model is still performing well on new data—is an essential part in any ML workflow. Performing continuous evaluation can help you catch model drift, a phenomenon that occurs when the data used to train your model no longer reflects the current environment. \n", "\n", "To learn more about how to do continous model evaluation and re-train models, you can read the blogpost: [Continuous model evaluation with BigQuery ML, Stored Procedures, and Cloud Scheduler](https://cloud.google.com/blog/topics/developers-practitioners/continuous-model-evaluation-bigquery-ml-stored-procedures-and-cloud-scheduler)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "Pattern - Propensity modeling (churn) BigQuery ML using GA4 data.ipynb", "provenance": [] }, "environment": { "name": "tf2-gpu.2-3.m65", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-3:m65" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }

gaming/propensity-model/bqml/bqml_ga4_gaming_propensity_to_churn.ipynb (1,640 lines of code) (raw):