{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6a81007f6093"
},
"outputs": [],
"source": [
"# Copyright 2022 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nweBeREip93u"
},
"source": [
"# Forecasting retail demand with Vertex AI and BigQuery ML \n",
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/demand_forecasting/forecasting-retail-demand.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fworkbench%2Fdemand_forecasting%2Fforecasting-retail-demand.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/demand_forecasting/forecasting-retail-demand.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
"<a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/workbench/demand_forecasting/forecasting-retail-demand.ipynb\" target='_blank'>\n",
" <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
"</table>\n",
"<br/><br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dc1f01b84273"
},
"source": [
"## Overview \n",
"\n",
"This tutorial explores demand forecasting using a BigQuery public retail dataset. Measuring and forecasting customer demand helps retailers better understand their customers, stock shelves with the right products, offer targeted promotions, and better plan and manage their budgets. This notebook demonstrates how to train and evaluate an ARIMA (autoregressive integrated moving average) model in BigQuery ML on retail data and how to extract actionable insights about future demand.\n",
"\n",
"*Note: This notebook file was designed to run in a [Vertex AI Workbench managed notebooks](https://cloud.google.com/vertex-ai/docs/workbench/managed/create-instance) instance using the `Python (Local)` kernel. Some components of this notebook may not work in other notebook environments.*\n",
"\n",
"#### ARIMA Modeling with BigQuery ML \n",
"\n",
"The <a href='https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average'>ARIMA model</a> is designed to analyze historical data, spot patterns over time, and project them into the future; in other words, forecasting. ARIMA models are available in BigQuery ML, which lets users create and run machine learning models directly in BigQuery using SQL queries. Working with BigQuery ML is advantageous because it already has access to the data, it can handle most of the modeling details automatically if desired, and it stores both the model and any predictions inside BigQuery. \n",
"\n",
"Learn more about [Vertex AI Workbench](https://cloud.google.com/vertex-ai/docs/workbench/introduction) and [BigQuery ML](https://cloud.google.com/bigquery-ml/docs/managing-models-vertex)."
]
},
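{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what a BigQuery ML time-series model definition looks like (the dataset, table, and model names below are placeholders, not the ones used later in this notebook):\n",
"\n",
"```sql\n",
"CREATE OR REPLACE MODEL `my_dataset.my_arima_model`\n",
"OPTIONS(\n",
"  MODEL_TYPE = 'ARIMA_PLUS',\n",
"  TIME_SERIES_TIMESTAMP_COL = 'date',\n",
"  TIME_SERIES_DATA_COL = 'sales_count',\n",
"  TIME_SERIES_ID_COL = 'product_id'\n",
") AS\n",
"SELECT date, sales_count, product_id\n",
"FROM `my_dataset.my_sales_table`\n",
"```"
]
},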
{
"cell_type": "markdown",
"metadata": {
"id": "dbab58d4ae1a"
},
"source": [
"### Objective\n",
"In this tutorial, you learn how to build an ARIMA (autoregressive integrated moving average) model with BigQuery ML on retail data.\n",
"\n",
"This tutorial uses the following Google Cloud ML services:\n",
"- BigQuery\n",
"\n",
"The steps performed include:\n",
"\n",
"* Explore the data\n",
"* Train an ARIMA model with BigQuery ML\n",
"* Evaluate the model results using BigQuery ML (on training data)\n",
"* Evaluate the model results with MAE, MAPE, MSE, and RMSE (on test data)\n",
"* Use the executor feature"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "26a00a419045"
},
"source": [
"### Dataset \n",
"\n",
"This notebook uses the BigQuery public retail dataset.\n",
"The data covers 10 US stores and includes item-level, department, product-category, and store details. In addition, it includes explanatory variables such as price and gross margin. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "17e0532066d7"
},
"source": [
"### Costs\n",
"This tutorial uses the following billable components of Google Cloud:\n",
"\n",
"* Vertex AI\n",
"* BigQuery\n",
"\n",
"Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), \n",
"[BigQuery pricing](https://cloud.google.com/bigquery/pricing) and use the \n",
"[Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oH0bZDCmp930"
},
"source": [
"## Install additional packages\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "V3RUMIvBp932"
},
"outputs": [],
"source": [
"! pip3 install --quiet --upgrade pandas-gbq 'google-cloud-bigquery[bqstorage,pandas]' \\\n",
" scikit-learn \\\n",
" matplotlib"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e9255e3b156f"
},
"source": [
"### Restart runtime (Colab only)\n",
"To use the newly installed packages, you must restart the runtime on Google Colab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0c0b2427998a"
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
"\n",
" import IPython\n",
"\n",
" app = IPython.Application.instance()\n",
" app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4de1bd77992b"
},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "befa6ca14bc0"
},
"source": [
"### Authenticate your notebook environment (Colab only)\n",
"Authenticate your environment on Google Colab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7de6ef0fac42"
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
"\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bd0e79ceaea2"
},
"source": [
"### Set Google Cloud project information\n",
"Learn more about [setting up a project and a development environment.](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8940d70dfdef"
},
"outputs": [],
"source": [
"PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n",
"LOCATION = \"us-central1\" # @param {type:\"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e5755d1a554f"
},
"source": [
"### Create a Cloud Storage bucket\n",
"\n",
"Create a storage bucket to store intermediate artifacts such as datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d2de92accb67"
},
"outputs": [],
"source": [
"BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b72bfdf29dae"
},
"source": [
"**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a4453435d115"
},
"outputs": [],
"source": [
"! gsutil mb -l $LOCATION $BUCKET_URI"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b4edb2c269d4"
},
"source": [
"## Import libraries and define constants"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b81a2c71fa3a"
},
"source": [
"**Load the required libraries.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "c64de5cbad25"
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"from google.cloud import bigquery\n",
"from google.cloud.bigquery import Client\n",
"from sklearn.metrics import (mean_absolute_error,\n",
" mean_absolute_percentage_error,\n",
" mean_squared_error)\n",
"\n",
"%matplotlib inline\n",
"\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(\"ignore\")"
]
},
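{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of the error metrics used later in this tutorial (MAE, MAPE, MSE, and RMSE), compute them on made-up numbers; the arrays below are illustrative only, not data from this notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import (mean_absolute_error,\n",
"                             mean_absolute_percentage_error,\n",
"                             mean_squared_error)\n",
"\n",
"# Made-up actual and predicted values, for illustration only.\n",
"actual = np.array([10, 12, 9, 15])\n",
"predicted = np.array([11, 10, 9, 14])\n",
"\n",
"mae = mean_absolute_error(actual, predicted)  # mean of |error|\n",
"mape = mean_absolute_percentage_error(actual, predicted)  # mean of |error| / actual\n",
"mse = mean_squared_error(actual, predicted)  # mean of squared error\n",
"rmse = mse**0.5  # square root of MSE, in the original units\n",
"print(mae, mape, mse, rmse)"
]
},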
{
"cell_type": "markdown",
"metadata": {
"id": "40902aa0f1de"
},
"source": [
"**Set the table name and construct a BigQuery client**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7592011d7825"
},
"outputs": [],
"source": [
"SALES_TABLE = \"training_data_table\"\n",
"\n",
"# Construct a BigQuery client object.\n",
"\n",
"client = Client(project=PROJECT_ID)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "36d3a8aec700"
},
"source": [
"**Create a BigQuery dataset**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2f6037562f87"
},
"outputs": [],
"source": [
"dataset_id = \"demandforecasting\" + \"_\" + \"unique\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0wguBZeKp93_"
},
"source": [
"If you're using a ***Vertex AI Workbench managed notebooks instance***, cells starting with \"#@bigquery\" are recognized as SQL queries. If you're using a [Vertex AI Workbench user-managed notebooks instance](https://cloud.google.com/vertex-ai/docs/workbench/user-managed/migrate-to-instances) or Colab, they appear as markdown cells."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2ce9876ee882"
},
"source": [
"#@bigquery\n",
"CREATE SCHEMA [your-dataset-id]\n",
"OPTIONS(\n",
" location=\"us\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RxnaBh4sp93_"
},
"source": [
"(**Optional**) If you're using a Vertex AI Workbench managed notebooks instance, once the results from BigQuery are displayed in the cell below, click the **Query and load as DataFrame** button and execute the generated code stub to fetch the data into the current notebook as a dataframe.\n",
"\n",
"*Note: By default the data is loaded into a `df` variable, though this can be changed before executing the cell if required.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d1m0ixQ9p93_"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"CREATE SCHEMA `{PROJECT_ID}.{dataset_id}`\n",
"OPTIONS(\n",
" location=\"us\"\n",
" )\n",
"\"\"\".format(\n",
" PROJECT_ID=PROJECT_ID, dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"print(query_job.result())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7002b223b2b5"
},
"source": [
"## Explore the data\n",
"View the data that is stored in the public BigQuery dataset."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2b134d3155d5"
},
"source": [
"#@bigquery\n",
"SELECT * FROM `looker-private-demo.retail.transaction_detail`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "R8QkUxvzp94A"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT * FROM `looker-private-demo.retail.transaction_detail`\n",
"\"\"\"\n",
"query_job = client.query(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GhAK4ryBp94A"
},
"outputs": [],
"source": [
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6d52803e14f2"
},
"source": [
"**Create a view named `important_fields` using only the `transaction_timestamp` and `line_items` fields, where the store ID is 10.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dc37b15df336"
},
"source": [
"#@bigquery\n",
"CREATE OR REPLACE VIEW [your-dataset-id].important_fields AS\n",
"(\n",
" SELECT transaction_timestamp,line_items from `looker-private-demo.retail.transaction_detail` WHERE store_id = 10\n",
") "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uDgu3GWtp94B"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"CREATE OR REPLACE VIEW {dataset_id}.important_fields AS\n",
"(\n",
" SELECT transaction_timestamp,line_items from `looker-private-demo.retail.transaction_detail` WHERE store_id = 10\n",
") \n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"print(query_job.result())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a9ca8a8bbf82"
},
"source": [
"**Look at the data in the `important_fields` view.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1ffe2d8e816c"
},
"source": [
"#@bigquery\n",
"SELECT * FROM [your-dataset-id].important_fields"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "J4CN-N_op94B"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT * FROM {dataset_id}.important_fields\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"\n",
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5f03b43de905"
},
"source": [
"**Convert the `transaction_timestamp` field into a date.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "221986867dbd"
},
"source": [
"#@bigquery\n",
"CREATE OR REPLACE VIEW [your-dataset-id].data_after_converting_timestamp_to_date AS\n",
"(\n",
" SELECT EXTRACT(DATE FROM transaction_timestamp AT TIME ZONE \"UTC\") AS date,line_items from [your-dataset-id].important_fields\n",
") "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DGq3a0aUp94C"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"CREATE OR REPLACE VIEW {dataset_id}.data_after_converting_timestamp_to_date AS\n",
"(\n",
" SELECT EXTRACT(DATE FROM transaction_timestamp AT TIME ZONE \"UTC\") AS date,line_items from {dataset_id}.important_fields\n",
") \n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"print(query_job.result())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f0babc4a23fa"
},
"source": [
"**View the data and check the `date` field values.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2181af3cae88"
},
"source": [
"#@bigquery\n",
"SELECT * FROM [your-dataset-id].data_after_converting_timestamp_to_date"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_1mz0KD1p94C"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT * FROM {dataset_id}.data_after_converting_timestamp_to_date\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"\n",
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "990d04eab2e1"
},
"source": [
"**Load the data into a dataframe.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2d3a85f95f9d"
},
"outputs": [],
"source": [
"df_intermediary = query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f25cf5322fbc"
},
"source": [
"**Check the data types of your dataframe fields.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "40a78a81c379"
},
"outputs": [],
"source": [
"df_intermediary.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f2f685f15000"
},
"source": [
"The `line_items` field is an array of structs. Split the array into its component parts, `product_id`, `sale_price`, and `gross_margin`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "caf01494d657"
},
"source": [
"#@bigquery\n",
"CREATE OR REPLACE VIEW [your-dataset-id].split_array_of_structs AS\n",
" \n",
"(SELECT date,line_items\n",
"FROM [your-dataset-id].data_after_converting_timestamp_to_date, UNNEST(line_items) AS line_items)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "33zixGvkp94E"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"CREATE OR REPLACE VIEW {dataset_id}.split_array_of_structs AS\n",
" \n",
"(SELECT date,line_items\n",
"FROM {dataset_id}.data_after_converting_timestamp_to_date, UNNEST(line_items) AS line_items)\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"print(query_job.result())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6c677762b34a"
},
"source": [
"**View the data.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0cbbded0bf50"
},
"source": [
"#@bigquery\n",
"SELECT * FROM [your-dataset-id].split_array_of_structs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wRmYwn7Vp94F"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT * FROM {dataset_id}.split_array_of_structs\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"\n",
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7acc57476f1c"
},
"source": [
"**Remove the extra columns to keep only `date` and `product_id`.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "704e7c73ec0c"
},
"source": [
"#@bigquery\n",
"CREATE OR REPLACE VIEW [your-dataset-id].splitting_struct_columns AS\n",
" \n",
"(SELECT date,line_items.product_id as product_id\n",
"FROM [your-dataset-id].split_array_of_structs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2UamlX9Np94F"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"CREATE OR REPLACE VIEW {dataset_id}.splitting_struct_columns AS\n",
" \n",
"(SELECT date,line_items.product_id as product_id\n",
"FROM {dataset_id}.split_array_of_structs)\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"print(query_job.result())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4f9d72c483e8"
},
"source": [
"**View the data.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6103f7ff48ff"
},
"source": [
"#@bigquery\n",
"SELECT * FROM [your-dataset-id].splitting_struct_columns "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bnv0lr4-p94G"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT * FROM {dataset_id}.splitting_struct_columns \n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"\n",
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7c0c4245acb7"
},
"source": [
"**Count the sales of a product for each date.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "368f75a291d1"
},
"source": [
"#@bigquery\n",
"CREATE OR REPLACE VIEW [your-dataset-id].sales_count_per_date AS\n",
" \n",
"(SELECT date,product_id,COUNT(*) as sales_count\n",
"FROM [your-dataset-id].splitting_struct_columns GROUP BY date,product_id)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MBeBKffvp94G"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"CREATE OR REPLACE VIEW {dataset_id}.sales_count_per_date AS\n",
" \n",
"(SELECT date,product_id,COUNT(*) as sales_count\n",
"FROM {dataset_id}.splitting_struct_columns GROUP BY date,product_id)\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"print(query_job.result())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7291d3970a2c"
},
"source": [
"#@bigquery\n",
"SELECT * FROM [your-dataset-id].sales_count_per_date"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oN-5H8w3p94H"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT * FROM {dataset_id}.sales_count_per_date\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"\n",
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "722fd013c28c"
},
"source": [
"**Create a view for the five products that have sold the most units over the entire date range.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "97db6ee61087"
},
"source": [
"#@bigquery\n",
"CREATE OR REPLACE VIEW [your-dataset-id].top_five_products AS (\n",
" WITH topsellingitems AS(\n",
" SELECT \n",
" product_id,\n",
" sum(sales_count) sum_sales\n",
" FROM\n",
" `[your-dataset-id].sales_count_per_date` \n",
" GROUP BY \n",
" product_id\n",
" ORDER BY sum_sales DESC\n",
" LIMIT 5 #Top N\n",
" )\n",
" SELECT \n",
" date,\n",
" product_id,\n",
" sales_count\n",
" FROM\n",
" `[your-dataset-id].sales_count_per_date` \n",
" WHERE\n",
" product_id IN (SELECT product_id FROM topsellingitems)\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KgRwHUjrp94H"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"CREATE OR REPLACE VIEW {dataset_id}.top_five_products AS (\n",
" WITH topsellingitems AS(\n",
" SELECT \n",
" product_id,\n",
" sum(sales_count) sum_sales\n",
" FROM\n",
" `{dataset_id}.sales_count_per_date` \n",
" GROUP BY \n",
" product_id\n",
" ORDER BY sum_sales DESC\n",
" LIMIT 5 #Top N\n",
" )\n",
" SELECT \n",
" date,\n",
" product_id,\n",
" sales_count\n",
" FROM\n",
" `{dataset_id}.sales_count_per_date` \n",
" WHERE\n",
" product_id IN (SELECT product_id FROM topsellingitems)\n",
" )\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"print(query_job.result())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "182ce3346563"
},
"source": [
"#@bigquery\n",
"SELECT * FROM [your-dataset-id].top_five_products\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BXochSk_p94I"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT * FROM {dataset_id}.top_five_products\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"\n",
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "292855967806"
},
"source": [
"**Load the data into a dataframe and view the data.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "495c76e598a6"
},
"outputs": [],
"source": [
"df = query_job.to_dataframe()\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5c78641e2881"
},
"source": [
"**Check the data types of your dataframe fields.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bb63bbdcead4"
},
"outputs": [],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ef761dcd0109"
},
"source": [
"**Convert the `date` field's data type to `datetime`.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0534d564d09e"
},
"outputs": [],
"source": [
"df[\"date\"] = pd.to_datetime(df[\"date\"], format=\"%Y-%m-%d\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8ec5fa4b1ea0"
},
"source": [
"For this forecasting model, date values need to be present for all dates, for each product.\n",
"\n",
"To construct a dataframe with `0` values for the `sales_count` field on dates when products weren't sold, determine the minimum and maximum dates so that you know which dates need `0` values.\n",
"\n",
"**First, get the earliest (minimum) date.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dd6423ef739f"
},
"source": [
"#@bigquery\n",
"SELECT MIN(DATE) FROM [your-dataset-id].top_five_products"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iCCEozWrp94K"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT MIN(DATE) AS min_date FROM {dataset_id}.top_five_products\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"\n",
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fbb934199011"
},
"source": [
"**Get the latest (maximum) date.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6e09f3221d77"
},
"source": [
"#@bigquery\n",
"SELECT MAX(DATE) FROM [your-dataset-id].top_five_products"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uIJX3B8dp94K"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT MAX(DATE) AS max_date FROM {dataset_id}.top_five_products\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)\n",
"\n",
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "247a08d866ef"
},
"source": [
"**Add the full date range of values to a dataframe.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d41c099464ac"
},
"outputs": [],
"source": [
"dates = pd.date_range(start=\"2016-12-17\", end=\"2021-10-06\").to_frame()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b9d3afc9f7cd"
},
"source": [
"**Get a description of the `dates` dataframe.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1b1f67e375fc"
},
"outputs": [],
"source": [
"dates.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "49eb81d44b65"
},
"source": [
"**View the data for one of the products, sorted by date, to show that many dates aren't present in the dataset.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7194f59c71b5"
},
"outputs": [],
"source": [
"df.loc[df[\"product_id\"] == 20552].sort_values(by=[\"date\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "326f4f48f5cc"
},
"source": [
"To make a dataframe in which each product has no missing dates, merge each of the five products' data with the `dates` dataframe, starting with `product_id` `20552`."
]
},
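{
"cell_type": "markdown",
"metadata": {},
"source": [
"The per-product gap filling performed in the next cell can be sketched first on a toy frame. The data below is made up purely for illustration; `reindex` with `fill_value=0` inserts the missing dates with a zero `sales_count`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy example with made-up data: reindexing to a full date range\n",
"# inserts the missing dates with a sales_count of 0.\n",
"toy = pd.DataFrame(\n",
"    {\"sales_count\": [3, 5]},\n",
"    index=pd.to_datetime([\"2021-01-01\", \"2021-01-04\"]),\n",
")\n",
"toy_filled = toy.reindex(pd.date_range(\"2021-01-01\", \"2021-01-05\"), fill_value=0)\n",
"print(toy_filled)"
]
},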
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a0b443c8d160"
},
"outputs": [],
"source": [
"# The five per-product merges are identical except for the product ID,\n",
"# so run them in a loop and unpack the results into df1 through df5.\n",
"product_ids = [20552, 13596, 23641, 28305, 20547]\n",
"merged_frames = []\n",
"for pid in product_ids:\n",
"    # Merge the dates dataframe with this product's rows so that every\n",
"    # date in the range appears, even when no sale was recorded.\n",
"    merged = pd.merge(\n",
"        df.loc[df[\"product_id\"] == pid],\n",
"        dates,\n",
"        left_on=\"date\",\n",
"        right_on=0,\n",
"        how=\"outer\",\n",
"    )\n",
"    # For dates with no sales, the date column is null after the outer\n",
"    # merge; take the date from the dates frame (column 0) instead.\n",
"    merged[\"date\"] = merged[\"date\"].fillna(merged[0])\n",
"    merged = merged.sort_values(by=[\"date\"]).drop(columns=0)\n",
"    merged[\"product_id\"] = pid  # set the product ID on the filled-in rows\n",
"    merged.reset_index(inplace=True, drop=True)  # make the index start from 0\n",
"    merged = merged.fillna(0)  # replace null sales_count values with 0\n",
"    merged[\"sales_count\"] = merged[\"sales_count\"].astype(\n",
"        \"int\"\n",
"    )  # convert the sales_count column to integer\n",
"    print(f\"data after filling missing dates for product_id {pid}\")\n",
"    print(merged)\n",
"    merged_frames.append(merged)\n",
"\n",
"df1, df2, df3, df4, df5 = merged_frames"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6a2d033b733e"
},
"source": [
"**Merge all five dataframes into one new dataframe.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bf570c0c6913"
},
"outputs": [],
"source": [
"pdList = [df1, df2, df3, df4, df5] # List of your dataframes\n",
"new_df = pd.concat(pdList)\n",
"new_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4a75fa1dc8dc"
},
"source": [
"**Reset the index of the dataframe.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7d258952fc71"
},
"outputs": [],
"source": [
"new_df.reset_index(inplace=True, drop=True)\n",
"print(new_df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "97e1289b2106"
},
"source": [
"**View the five product IDs.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5891c2082a4b"
},
"source": [
"#@bigquery\n",
"SELECT DISTINCT product_id from [your-dataset-id].top_five_products"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5Gjad6uSp94O"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT DISTINCT product_id from {dataset_id}.top_five_products\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MUynE3dnp94O"
},
"outputs": [],
"source": [
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aa06b0e893cb"
},
"source": [
"**Plot `sales_count` over time, for each product.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9cdce0080aa1"
},
"outputs": [],
"source": [
"# Plot the daily sales counts for each of the five products\n",
"for pid in [20552, 20547, 28305, 23641, 13596]:\n",
"    product_df = new_df.loc[new_df[\"product_id\"] == pid]\n",
"    plt.figure()\n",
"    plt.plot(product_df[\"date\"], product_df[\"sales_count\"])\n",
"    plt.title(f\"Product {pid}\")\n",
"    plt.xticks(rotation=\"vertical\")\n",
"    plt.show()"
]
},
"cell_type": "markdown",
"metadata": {
"id": "755a811fbad5"
},
"source": [
"**List the data types for the `new_df` dataframe.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "220ec814875c"
},
"outputs": [],
"source": [
"new_df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ff06b56ffa7a"
},
"source": [
"**Create a new BigQuery table out of the `new_df` dataframe.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2d303f1cfd90"
},
"outputs": [],
"source": [
"job_config = bigquery.LoadJobConfig(\n",
" # Specify a (partial) schema. All columns are always written to the\n",
" # table. The schema is used to assist in data type definitions.\n",
" schema=[\n",
" bigquery.SchemaField(\"product_id\", bigquery.enums.SqlTypeNames.INTEGER),\n",
" bigquery.SchemaField(\"date\", bigquery.enums.SqlTypeNames.DATE),\n",
" bigquery.SchemaField(\"sales_count\", bigquery.enums.SqlTypeNames.INTEGER),\n",
" ],\n",
" # Optionally, set the write disposition. BigQuery appends loaded rows\n",
" # to an existing table by default, but with WRITE_TRUNCATE write\n",
" # disposition it replaces the table with the loaded data.\n",
" write_disposition=\"WRITE_TRUNCATE\",\n",
")\n",
"\n",
"# save the dataframe to a table in the created dataset\n",
"job = client.load_table_from_dataframe(\n",
" new_df,\n",
" \"{}.{}.{}\".format(PROJECT_ID, dataset_id, SALES_TABLE),\n",
" job_config=job_config,\n",
") # Make an API request.\n",
"job.result() # Wait for the job to complete."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c2e0e9aa67cd"
},
"source": [
"**Create a training dataset by setting a date range that limits the data being used.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "e1fffed34704"
},
"outputs": [],
"source": [
"# Select the date range for the training data (the sales table already contains\n",
"# only the top-five products) and create a table for it\n",
"TRAININGDATA_STARTDATE = \"2016-12-17\"\n",
"TRAININGDATA_ENDDATE = \"2021-06-01\"\n",
"query = \"\"\"\n",
"CREATE OR REPLACE TABLE {PROJECT_ID}.{DATASET}.training_data AS (\n",
" SELECT\n",
" *\n",
" FROM\n",
" `{DATASET}.{SALES_TABLE}`\n",
" WHERE\n",
" date BETWEEN '{STARTDATE}' AND '{ENDDATE}'\n",
" );\n",
"\"\"\".format(\n",
" STARTDATE=TRAININGDATA_STARTDATE,\n",
" ENDDATE=TRAININGDATA_ENDDATE,\n",
" DATASET=dataset_id,\n",
" SALES_TABLE=SALES_TABLE,\n",
" PROJECT_ID=PROJECT_ID,\n",
")\n",
"# execute the query (as it's a create query, there won't be any tabular output)\n",
"query_job = client.query(query)\n",
"print(query_job.result())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2f7d9d2d4229"
},
"source": [
"**Select the original data for plotting.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1b82fc7c7bf6"
},
"outputs": [],
"source": [
"df_historical = new_df[\n",
" (new_df[\"date\"] >= pd.to_datetime(TRAININGDATA_STARTDATE))\n",
" & (new_df[\"date\"] <= pd.to_datetime(TRAININGDATA_ENDDATE))\n",
"].copy()\n",
"df_historical"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c25b75dc957a"
},
"source": [
"## Modeling with BigQuery ML and the ARIMA model\n",
"\n",
"**Create an ARIMA model using the training data.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "69ac23f216a4"
},
"source": [
"#@bigquery\n",
"CREATE OR REPLACE MODEL [your-dataset-id].arima_model\n",
"\n",
"OPTIONS(\n",
" MODEL_TYPE='ARIMA',\n",
" TIME_SERIES_TIMESTAMP_COL='date', \n",
" TIME_SERIES_DATA_COL='sales_count',\n",
" TIME_SERIES_ID_COL='product_id',\n",
" HOLIDAY_REGION='US'\n",
" \n",
") AS\n",
"\n",
"SELECT \n",
" date,\n",
" product_id,\n",
" sales_count\n",
"FROM\n",
" [your-dataset-id].training_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c45e18a773ad"
},
"source": [
"**Train the ARIMA model.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "00638d9f7e55"
},
"outputs": [],
"source": [
"# Train an ARIMA model on the created dataset\n",
"query = \"\"\"\n",
"CREATE OR REPLACE MODEL `{PROJECT_ID}.{DATASET}.arima_model`\n",
"\n",
"OPTIONS(\n",
" MODEL_TYPE='ARIMA',\n",
" TIME_SERIES_TIMESTAMP_COL='date',\n",
" TIME_SERIES_DATA_COL='sales_count',\n",
" TIME_SERIES_ID_COL='product_id') AS\n",
"\n",
"SELECT\n",
" date,\n",
" product_id,\n",
" sales_count\n",
"FROM\n",
" `{DATASET}.training_data`\n",
"\"\"\".format(\n",
" PROJECT_ID=PROJECT_ID, DATASET=dataset_id\n",
")\n",
"# execute the query\n",
"job = client.query(query)\n",
"job.result()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a02551627598"
},
"source": [
"## Evaluate the model\n",
"\n",
"To evaluate the trained model, get forecasts for the 90 days after the last date of the training data. In BigQuery ML, provide the number of forecast days using the `HORIZON` argument. Specify the confidence interval for the forecast using the `CONFIDENCE_LEVEL` argument."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "87ef4ba64a3a"
},
"source": [
"#@bigquery dfforecast \n",
"\n",
"DECLARE HORIZON STRING DEFAULT \"90\";\n",
"DECLARE CONFIDENCE_LEVEL STRING DEFAULT \"0.90\";\n",
"\n",
"EXECUTE IMMEDIATE format('''\n",
" SELECT\n",
" *\n",
" FROM\n",
" ML.FORECAST(MODEL [your-dataset-id].arima_model,\n",
" STRUCT(%s AS horizon,\n",
" %s AS confidence_level)\n",
" )\n",
" ''',HORIZON,CONFIDENCE_LEVEL)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "801528e3e2c7"
},
"source": [
"**Load the data into a dataframe named `dfforecast`.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "f1277e846d10"
},
"outputs": [],
"source": [
"query = '''DECLARE HORIZON STRING DEFAULT \"90\"; # number of days to forecast\n",
"DECLARE CONFIDENCE_LEVEL STRING DEFAULT \"0.90\"; # required confidence level\n",
"\n",
"EXECUTE IMMEDIATE format(\"\"\"\n",
" SELECT\n",
" *\n",
" FROM\n",
" ML.FORECAST(MODEL {dataset_id}.arima_model,\n",
" STRUCT(%s AS horizon,\n",
" %s AS confidence_level)\n",
" )\n",
" \"\"\",HORIZON,CONFIDENCE_LEVEL)'''.format(\n",
" dataset_id=dataset_id\n",
")\n",
"job = client.query(query)\n",
"dfforecast = job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1e0549381849"
},
"source": [
"**View the first few rows.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "83b07f243b72"
},
"outputs": [],
"source": [
"dfforecast.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0e01aa3098f8"
},
"outputs": [],
"source": [
"print(f\"Number of rows: {dfforecast.shape[0]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6fa36b762a33"
},
"source": [
"**Clean the historical and forecasted values for plotting.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "073a2dd6a5d3"
},
"outputs": [],
"source": [
"df_historical.sort_values(by=[\"product_id\", \"date\"], inplace=True)\n",
"dfforecast.sort_values(by=[\"product_id\", \"forecast_timestamp\"], inplace=True)\n",
"\n",
"# Select the actual data to plot against the forecasted data\n",
"day_diff = (new_df[\"date\"] - pd.to_datetime(TRAININGDATA_ENDDATE)).dt.days\n",
"df_actual_90d = new_df[new_df[\"product_id\"].isin(dfforecast[\"product_id\"].unique())][\n",
" (day_diff > 0) & (day_diff <= 90)\n",
"].copy()\n",
"df_actual_90d.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d40a95ad0616"
},
"source": [
"**Plot the historical and forecast data.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "c6845ad425d3"
},
"outputs": [],
"source": [
"def plot_hist_forecast(\n",
" historical, forecast, actual, hist_start=\"\", hist_end=\"\", title=\"\"\n",
"):\n",
" if hist_start != \"\":\n",
" historical = historical[\n",
" historical[\"date\"] >= pd.to_datetime(hist_start, format=\"%Y-%m-%d\")\n",
" ].copy()\n",
" if hist_end != \"\":\n",
" historical = historical[\n",
" historical[\"date\"] <= pd.to_datetime(hist_end, format=\"%Y-%m-%d\")\n",
" ].copy()\n",
"\n",
" plt.figure(figsize=(15, 4))\n",
" plt.plot(historical[\"date\"], historical[\"sales_count\"], label=\"historical\")\n",
" # Plot the forecast data\n",
" plt.plot(\n",
" forecast[\"forecast_timestamp\"],\n",
" forecast[\"forecast_value\"],\n",
" label=\"forecast\",\n",
" linestyle=\"--\",\n",
" )\n",
" # Plot the actual data\n",
" plt.plot(actual[\"date\"], actual[\"sales_count\"], label=\"actual\")\n",
" # plot the confidence interval\n",
" confidence_level = forecast[\"confidence_level\"].iloc[0] * 100\n",
" low_CI = forecast[\"confidence_interval_lower_bound\"]\n",
" upper_CI = forecast[\"confidence_interval_upper_bound\"]\n",
"\n",
" # Shade the confidence interval\n",
" plt.fill_between(\n",
" forecast[\"forecast_timestamp\"],\n",
" low_CI,\n",
" upper_CI,\n",
" color=\"#539caf\",\n",
" alpha=0.4,\n",
"        label=f\"{confidence_level:.0f}% confidence interval\",\n",
" )\n",
" plt.legend()\n",
" plt.title(title)\n",
" plt.show()\n",
" return\n",
"\n",
"\n",
"product_id_list = dfforecast[\"product_id\"].unique()\n",
"for i in product_id_list:\n",
" print(\"Product_id : \", i)\n",
" plot_hist_forecast(\n",
" df_historical[df_historical[\"product_id\"] == i],\n",
" dfforecast[dfforecast[\"product_id\"] == i],\n",
" df_actual_90d[df_actual_90d[\"product_id\"] == i],\n",
" hist_start=\"2021-02-01\",\n",
" title=i,\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "76b5f6fab0b8"
},
"source": [
"While most of the predictions look reasonable, you can also see that the actual values largely fall within the 90% confidence interval suggested by the model. Under the hood, BigQuery ML performs many computationally expensive tasks, including accounting for seasonal and holiday information.\n",
"<img src=\"https://cloud.google.com/bigquery-ml/images/BQ_ARIMA_diagram.png\"> <a href=\"https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-time-series\"> Source</a> </img>\n",
"\n",
"The coefficients learned by BigQuery ML's ARIMA model can also be checked by querying for <a href=\"https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-arima-coefficients\">ARIMA_COEFFICIENTS</a> from the model."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "934fb81f6ad1"
},
"source": [
"#@bigquery\n",
"SELECT\n",
" *\n",
"FROM \n",
" ML.ARIMA_COEFFICIENTS(MODEL [your-dataset-id].arima_model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "owAuWLkAp94U"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT\n",
" *\n",
"FROM \n",
" ML.ARIMA_COEFFICIENTS(MODEL {dataset_id}.arima_model)\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "l3xmi8bOp94U"
},
"outputs": [],
"source": [
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f37aa9ddfb8f"
},
"source": [
"In the above results:\n",
"- The <b>product_id</b> column is the time-series ID column specified while training the ARIMA model.\n",
"- The <b>ar_coefficients</b> column contains the autoregressive coefficients of the ARIMA model (non-seasonal p).\n",
"- The <b>ma_coefficients</b> column contains the moving-average coefficients of the ARIMA model (non-seasonal q).\n",
"- The <b>intercept_or_drift</b> column is the constant term of the ARIMA model."
]
},
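{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of how these coefficients are used (with hypothetical values, not the ones returned by your model), an ARIMA(1, 0, 1) one-step-ahead forecast combines the constant term, the previous value scaled by the autoregressive coefficient, and the previous residual scaled by the moving-average coefficient."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical ARIMA(1, 0, 1) coefficients, for illustration only --\n",
"# substitute the ar_coefficients, ma_coefficients, and intercept_or_drift\n",
"# values returned for your own model.\n",
"ar_coef = 0.6  # autoregressive coefficient (phi_1)\n",
"ma_coef = 0.3  # moving-average coefficient (theta_1)\n",
"intercept = 5.0  # intercept_or_drift\n",
"\n",
"prev_value = 120.0  # y_{t-1}: last observed sales_count\n",
"prev_residual = -4.0  # e_{t-1}: last one-step forecast error\n",
"\n",
"# One-step-ahead forecast: y_t = c + phi_1 * y_{t-1} + theta_1 * e_{t-1}\n",
"forecast_next = intercept + ar_coef * prev_value + ma_coef * prev_residual\n",
"print(forecast_next)"
]
},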
{
"cell_type": "markdown",
"metadata": {
"id": "27bf1ef88ea7"
},
"source": [
"## Evaluating the model results using BigQuery ML \n",
"\n",
"BigQuery ML also provides the <a href=\"https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate\">ML.EVALUATE</a> function for checking the evaluation metrics of a trained model. For ARIMA models, the reported metrics include `log_likelihood`, `AIC`, and `variance`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f4f0ef9b1a04"
},
"source": [
"#@bigquery\n",
"SELECT\n",
" *\n",
"FROM\n",
" ML.EVALUATE(MODEL [your-dataset-id].arima_model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CtM-g5ixp94V"
},
"outputs": [],
"source": [
"query = \"\"\"\n",
"SELECT\n",
" *\n",
"FROM\n",
" ML.EVALUATE(MODEL {dataset_id}.arima_model)\n",
"\"\"\".format(\n",
" dataset_id=dataset_id\n",
")\n",
"query_job = client.query(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JQwOWcgMp94V"
},
"outputs": [],
"source": [
"query_job.to_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3be6d3e1607e"
},
"source": [
"## Evaluating the model results - MAE, MAPE, MSE, RMSE (on test data)\n",
"\n",
"Generally, to evaluate a forecast model, you can choose metrics depending on how you want to measure the error. For starters, you can choose from the following:\n",
"* <b>Mean Absolute Error (MAE)</b>: Average of the absolute differences between the actual and forecasted values.\n",
"* <b>Mean Absolute Percentage Error (MAPE)</b>: Average of the absolute differences between the actual and forecasted values, expressed as fractions of the actual values.\n",
"* <b>Mean Squared Error (MSE)</b>: Average of the squared differences between the actual and forecasted values.\n",
"* <b>Root Mean Squared Error (RMSE)</b>: Square root of the MSE."
]
},
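{
"cell_type": "markdown",
"metadata": {},
"source": [
"The definitions above can be sketched directly with NumPy on small illustrative arrays (hypothetical numbers, not this notebook's data); the next cells compute the same metrics on the forecast results using scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal NumPy sketch of the four error metrics defined above,\n",
"# using small hypothetical arrays for illustration.\n",
"import numpy as np\n",
"\n",
"actual = np.array([10.0, 20.0, 30.0])\n",
"forecast = np.array([12.0, 18.0, 33.0])\n",
"\n",
"mae = np.mean(np.abs(actual - forecast))  # (2 + 2 + 3) / 3\n",
"mape = np.mean(np.abs((actual - forecast) / actual))  # (0.2 + 0.1 + 0.1) / 3\n",
"mse = np.mean((actual - forecast) ** 2)  # (4 + 4 + 9) / 3\n",
"rmse = np.sqrt(mse)\n",
"\n",
"print(mae, mape, mse, rmse)"
]
},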
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d78abc6aee46"
},
"outputs": [],
"source": [
"df_actual_90d.sort_values(by=[\"product_id\", \"date\"], inplace=True)\n",
"df_actual_90d.reset_index(drop=True, inplace=True)\n",
"dfforecast.sort_values(by=[\"product_id\", \"forecast_timestamp\"], inplace=True)\n",
"dfforecast.reset_index(drop=True, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "132bb1acdd75"
},
"outputs": [],
"source": [
"errors = {\"product_id\": [], \"MAE\": [], \"MAPE\": [], \"MSE\": [], \"RMSE\": []}\n",
"for i in product_id_list:\n",
" mae = mean_absolute_error(\n",
" df_actual_90d[df_actual_90d[\"product_id\"] == i][\"sales_count\"],\n",
" dfforecast[dfforecast[\"product_id\"] == i][\"forecast_value\"],\n",
" )\n",
" mape = mean_absolute_percentage_error(\n",
" df_actual_90d[df_actual_90d[\"product_id\"] == i][\"sales_count\"],\n",
" dfforecast[dfforecast[\"product_id\"] == i][\"forecast_value\"],\n",
" )\n",
"\n",
" mse = mean_squared_error(\n",
" df_actual_90d[df_actual_90d[\"product_id\"] == i][\"sales_count\"],\n",
" dfforecast[dfforecast[\"product_id\"] == i][\"forecast_value\"],\n",
" squared=True,\n",
" )\n",
"\n",
" rmse = mean_squared_error(\n",
" df_actual_90d[df_actual_90d[\"product_id\"] == i][\"sales_count\"],\n",
" dfforecast[dfforecast[\"product_id\"] == i][\"forecast_value\"],\n",
" squared=False,\n",
" )\n",
"\n",
" errors[\"product_id\"].append(i)\n",
" errors[\"MAE\"].append(mae)\n",
" errors[\"MAPE\"].append(mape)\n",
" errors[\"MSE\"].append(mse)\n",
" errors[\"RMSE\"].append(rmse)\n",
"errors = pd.DataFrame(errors)\n",
"errors"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d84da670e5ae"
},
"source": [
"From the values obtained for these error measures, it looks like the errors for product IDs 20552 and 13596 are high. Note that these error measures aggregate all the individual forecasts made during the test period, and so reflect an overall picture of the model's performance over the selected period. The lower these error measures, the better the model is at forecasting."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "56105a7f9a46"
},
"source": [
"## Executor feature in managed instances\n",
"\n",
"If you're using managed instances, along the top toolbar, above your notebook, click the **Executor** button."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "942c4866b758"
},
"source": [
"<img src=\"images/navbar_exe.png\" ></img>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "31dba6ec2c0e"
},
"source": [
"Give the execution a name. Select the **Cloud Storage bucket**, **Machine type**, and **Accelerator type**. For the **Environment**, select **Python 3**."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4a56743fe76f"
},
"source": [
"<img src=\"images/exe_form.png\" style=\"width:500px\"></img>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "af652028084a"
},
"source": [
"<img src=\"images/python3_env_selection.png\" style=\"width:500px\"></img>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0d37a2baf139"
},
"source": [
"In the **Type** menu, you can select schedule-based recurring executions if you want to schedule the execution to run on an interval, such as one hour."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c1981e3e49db"
},
"source": [
"<img src=\"images/schedule.png\" style=\"height:200px;\"></img>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "791a0bfc0cb9"
},
"source": [
"You can see the history of your executions by clicking the **Notebook Executor** button, on the left sidebar. It looks like a calendar icon."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e898ff47f952"
},
"source": [
"<img src=\"images/side_nav.png\" style=\"height:500px;\"></img>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3eb4d9d73c57"
},
"source": [
"You can see execution history on the **Executions** tab."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "45447f8e1cc0"
},
"source": [
"<img src=\"images/list_execution.png\" style=\"width:500px\"></img>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cb08d55eaec1"
},
"source": [
"You can see your active schedules on the **Schedules** tab."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8c32707e57e4"
},
"source": [
"<img src=\"images/list_schedule.png\" style=\"width:500px\"></img>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "23ee85f65a0b"
},
"source": [
"## Clean up\n",
"\n",
"\n",
"To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n",
"project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
"\n",
"Otherwise, you can delete the individual resources you created in this tutorial. The following code deletes the entire dataset.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8b70ac0dc210"
},
"outputs": [],
"source": [
"# Set the dataset variable to the fully qualified ID of the dataset to delete.\n",
"dataset = f\"{PROJECT_ID}.{dataset_id}\"\n",
"\n",
"# Use the delete_contents parameter to delete a dataset and its contents.\n",
"# Use the not_found_ok parameter to avoid an error if the dataset has already been deleted.\n",
"client.delete_dataset(\n",
"    dataset, delete_contents=True, not_found_ok=True\n",
") # Make an API request.\n",
"\n",
"print(\"Deleted dataset '{}'.\".format(dataset))\n",
"\n",
"# Delete Cloud Storage objects that were created\n",
"delete_bucket = True # Set True for deletion\n",
"if delete_bucket:\n",
" ! gsutil -m rm -r $BUCKET_URI"
]
}
],
"metadata": {
"colab": {
"name": "forecasting-retail-demand.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}