notebooks/bqml-feature-transform-module-tutorial.ipynb (6,157 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "4f33bf0f-c2a6-4dd6-9307-d30c10854218",
"metadata": {},
"source": [
"# BigQuery ML (BQML) Feature Engineering - Reusable and Modular\n",
"\n",
"This tutorial accompanies the blog post [Making BigQuery ML Feature Preprocessing Reusable and Modular](https://cloud.google.com/blog/products/data-analytics/how-bigquery-ml-does-feature-preprocessing).\n",
"\n",
"Transforming raw data into meaningful features = preprocessing. The [`TRANSFORM` statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform) makes embedding feature engineering with BigQuery ML models easy. This workflow expands on this concept with two new ways of working with feature engineering in BigQuery ML:\n",
"- Reusable preprocessing with [`ML.TRANSFORM` table function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-transform)\n",
"- Modular preprocessing with [`TRANSFORM_ONLY` models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-transform)"
]
},
{
"cell_type": "markdown",
"id": "97e4d33a-ea63-4402-a379-cc85929afbb9",
"metadata": {},
"source": [
"---\n",
"## Find & Replace Project ID\n",
"\n",
"This workflow uses IPython Magics for readability and for making SQL queries easy to copy/paste to other editors, like the BigQuery console. It is recommended to first do a find & replace of the project id used in this notebook with the one in your environment. \n",
"- Replace all occurences of `statmike-mlops-349915` with the value of your project id."
]
},
{
"cell_type": "markdown",
"id": "35fce6f5-eef5-4989-848c-50430f6e4ca5",
"metadata": {
"id": "od_UkDpvRmgD"
},
"source": [
"---\n",
"## Colab Setup\n",
"\n",
"To run this notebook in Colab click [](https://colab.research.google.com/github/GoogleCloudPlatform/bigquery-ml-utils/blob/master/notebooks/bqml-feature-transform-module-tutorial.ipynb) and run the cells in this section. Otherwise, skip this section.\n",
"\n",
"This cell will authenticate to GCP (follow prompts in the popup)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5f318fef-5147-4063-8786-923390132365",
"metadata": {
"executionInfo": {
"elapsed": 195,
"status": "ok",
"timestamp": 1683726184843,
"user": {
"displayName": "Mike Henderson",
"userId": "07691629187611687318"
},
"user_tz": 240
},
"id": "8UO9FnqyKBlF",
"tags": []
},
"outputs": [],
"source": [
"PROJECT_ID = 'statmike-mlops-349915' # replace with project ID"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a6c7083f-df17-49c8-b74e-8e9b1bddf659",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 68869,
"status": "ok",
"timestamp": 1683726253709,
"user": {
"displayName": "Mike Henderson",
"userId": "07691629187611687318"
},
"user_tz": 240
},
"id": "N98-KK7LRkjm",
"outputId": "09ec5008-0def-4e1a-c349-c598ee752f78",
"tags": []
},
"outputs": [],
"source": [
"try:\n",
" import google.colab\n",
" from google.colab import auth\n",
" auth.authenticate_user()\n",
" !gcloud config set project {PROJECT_ID}\n",
"except Exception:\n",
" pass"
]
},
{
"cell_type": "markdown",
"id": "cee53189-e087-4b62-8cd2-adce75205abc",
"metadata": {},
"source": [
"## Installs (If Needed)\n",
"\n",
"The clients packages may need installing in this environment. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4d888df6-bce7-432d-9448-3c5c288a7439",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# tuples of (import name, install name)\n",
"packages = [\n",
" ('google.cloud.aiplatform', 'google-cloud-aiplatform'),\n",
" ('google.cloud.bigquery', 'google-cloud-bigquery'),\n",
" ('google.cloud.storage', 'google-cloud-storage'),\n",
" ('bigframes', 'bigframes'),\n",
" ('pandas', 'pandas')\n",
"]\n",
"\n",
"import importlib\n",
"install = False\n",
"for package in packages:\n",
" if not importlib.util.find_spec(package[0]):\n",
" print(f'installing package {package[1]}')\n",
" install = True\n",
" !pip install {package[1]} -U -q --user"
]
},
{
"cell_type": "markdown",
"id": "f61e1af9-08f1-4b3a-9040-6db34963a6c7",
"metadata": {},
"source": [
"### Restart Kernel (If Installs Occured)\n",
"\n",
"After a kernel restart the code submission can start with the next cell after this one."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f290a065-f28f-4802-bd71-f326ad7af1e8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"if install:\n",
" import IPython\n",
" app = IPython.Application.instance()\n",
" app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "361a4570-c1b1-4857-8b26-5fdb93bda135",
"metadata": {
"id": "appt8-yVRtJ1"
},
"source": [
"---\n",
"## Setup"
]
},
{
"cell_type": "markdown",
"id": "451ed609-398e-46d7-af29-855e0b7d4ca6",
"metadata": {
"id": "63mx2EozRxFP"
},
"source": [
"Inputs"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "80c4d683-2888-43c1-874f-9e7092ad5ea2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"executionInfo": {
"elapsed": 2124,
"status": "ok",
"timestamp": 1683726390544,
"user": {
"displayName": "Mike Henderson",
"userId": "07691629187611687318"
},
"user_tz": 240
},
"id": "xzcoXjM5Rky5",
"outputId": "b3bdcbc1-70d5-472e-aea2-42c74a42efde",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'statmike-mlops-349915'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"project = !gcloud config get-value project\n",
"PROJECT_ID = project[0]\n",
"PROJECT_ID"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e1dad47f-a2c3-431d-a11b-5e0b318b4924",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"REGION = 'us-central1'\n",
"\n",
"# specify a GCS Bucket\n",
"GCS_BUCKET = PROJECT_ID\n",
"\n",
"# BigQuery Source Table\n",
"BQ_SOURCE_TABLE = 'bigquery-public-data.ml_datasets.penguins'\n",
"\n",
"# BigQuery Environment Parameters\n",
"BQ_REGION = REGION[0:2] # use multi-region\n",
"BQ_PROJECT = PROJECT_ID\n",
"BQ_DATASET = 'bqml'\n",
"BQ_TABLE_PREFIX = 'feature-engineering'"
]
},
{
"cell_type": "markdown",
"id": "dc69bdfe-6a93-4893-999f-f16579ce0668",
"metadata": {
"id": "LuajVwCiO6Yg"
},
"source": [
"Packages"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f0e862ff-51a9-4d67-8675-5f584eb6b905",
"metadata": {
"executionInfo": {
"elapsed": 17761,
"status": "ok",
"timestamp": 1683726409304,
"user": {
"displayName": "Mike Henderson",
"userId": "07691629187611687318"
},
"user_tz": 240
},
"id": "LVC7zzSLRk2C",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The google.cloud.bigquery extension is already loaded. To reload it, use:\n",
" %reload_ext google.cloud.bigquery\n"
]
}
],
"source": [
"from google.cloud import bigquery\n",
"from google.cloud import aiplatform\n",
"from google.cloud import storage\n",
"import bigframes.pandas as bf\n",
"import bigframes.ml as bfml\n",
"import pandas as pd\n",
"\n",
"# load BigQuery IPython Magics (for Jupyter Notebooks)\n",
"%load_ext google.cloud.bigquery"
]
},
{
"cell_type": "markdown",
"id": "f4527c7d-3378-4a97-8819-bf6704cae1f5",
"metadata": {
"id": "EyAVFG9TO9H-"
},
"source": [
"Clients"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "037b8bb2-0617-45c1-b085-042a6f2306c7",
"metadata": {
"executionInfo": {
"elapsed": 19,
"status": "ok",
"timestamp": 1683726409306,
"user": {
"displayName": "Mike Henderson",
"userId": "07691629187611687318"
},
"user_tz": 240
},
"id": "L0RPE13LOZce",
"tags": []
},
"outputs": [],
"source": [
"# bigquery client\n",
"bq = bigquery.Client(project = PROJECT_ID)\n",
"\n",
"# vertex ai client\n",
"aiplatform.init(project = PROJECT_ID, location = REGION)\n",
"\n",
"# gcs client\n",
"gcs = storage.Client(project = PROJECT_ID)\n",
"\n",
"# setup BigFrames API\n",
"bf.reset_session()\n",
"bf.options.bigquery.project = BQ_PROJECT\n",
"bf.options.bigquery.location = BQ_REGION\n",
"bf_session = bf.get_global_session()"
]
},
{
"cell_type": "markdown",
"id": "4ab2eae1-f009-47b2-b788-07ed86dfbf8d",
"metadata": {},
"source": [
"---\n",
"## BigQuery Source Data\n",
"\n",
"The source table is a BigQuery Public Dataset table. The following cell uses the BigQuery IPython magic to retrieve 5 rows of the table for review. This data is known as [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/) data: \n",
"\n",
"```\n",
"@Manual{,\n",
" title = {palmerpenguins: Palmer Archipelago (Antarctica) penguin data},\n",
" author = {Allison Marie Horst and Alison Presmanes Hill and Kristen B Gorman},\n",
" year = {2020},\n",
" note = {R package version 0.1.0},\n",
" doi = {10.5281/zenodo.3960218},\n",
" url = {https://allisonhorst.github.io/palmerpenguins/},\n",
"}\n",
"```\n",
"\n",
"\n",
"There are 334 observations of 4 numerical features (culman length, culmen depth, flipper length, body mass) and 2 categorical features (island, sex) that represent 3 species of penguins."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "da471a56-f3f0-4a08-861f-5babadb0e2eb",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "e604cc47401b4b7fad41c9144b64fe7f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "03afa08aac4b491c9c2420c69059ccdc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>body_mass_g</th>\n",
" <th>sex</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>36.6</td>\n",
" <td>18.4</td>\n",
" <td>184.0</td>\n",
" <td>3475.0</td>\n",
" <td>FEMALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>39.8</td>\n",
" <td>19.1</td>\n",
" <td>184.0</td>\n",
" <td>4650.0</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>40.9</td>\n",
" <td>18.9</td>\n",
" <td>184.0</td>\n",
" <td>3900.0</td>\n",
" <td>MALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
" <td>Dream</td>\n",
" <td>46.5</td>\n",
" <td>17.9</td>\n",
" <td>192.0</td>\n",
" <td>3500.0</td>\n",
" <td>FEMALE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>37.3</td>\n",
" <td>16.8</td>\n",
" <td>192.0</td>\n",
" <td>3000.0</td>\n",
" <td>FEMALE</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species island culmen_length_mm \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream 36.6 \n",
"1 Adelie Penguin (Pygoscelis adeliae) Dream 39.8 \n",
"2 Adelie Penguin (Pygoscelis adeliae) Dream 40.9 \n",
"3 Chinstrap penguin (Pygoscelis antarctica) Dream 46.5 \n",
"4 Adelie Penguin (Pygoscelis adeliae) Dream 37.3 \n",
"\n",
" culmen_depth_mm flipper_length_mm body_mass_g sex \n",
"0 18.4 184.0 3475.0 FEMALE \n",
"1 19.1 184.0 4650.0 MALE \n",
"2 18.9 184.0 3900.0 MALE \n",
"3 17.9 192.0 3500.0 FEMALE \n",
"4 16.8 192.0 3000.0 FEMALE "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM `bigquery-public-data.ml_datasets.penguins`\n",
"LIMIT 5"
]
},
{
"cell_type": "markdown",
"id": "03441bd7-02d6-47be-8360-0af94fdce55c",
"metadata": {},
"source": [
"Review the mean values of each measurement within `species`:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7456a1d0-21d8-4591-b01e-3f96d9fe9cd2",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0ee56e95a301414d9f4225cdefaf3bfe",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1040ec1225084a3faf44c315b2933225",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>count</th>\n",
" <th>mean_culmen_length</th>\n",
" <th>mean_culment_depth</th>\n",
" <th>mean_flipper_length</th>\n",
" <th>mean_body_mass</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>152</td>\n",
" <td>38.791391</td>\n",
" <td>18.346358</td>\n",
" <td>189.953642</td>\n",
" <td>3700.662252</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
" <td>68</td>\n",
" <td>48.833824</td>\n",
" <td>18.420588</td>\n",
" <td>195.823529</td>\n",
" <td>3733.088235</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>124</td>\n",
" <td>47.504878</td>\n",
" <td>14.982114</td>\n",
" <td>217.186992</td>\n",
" <td>5076.016260</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species count mean_culmen_length \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) 152 38.791391 \n",
"1 Chinstrap penguin (Pygoscelis antarctica) 68 48.833824 \n",
"2 Gentoo penguin (Pygoscelis papua) 124 47.504878 \n",
"\n",
" mean_culment_depth mean_flipper_length mean_body_mass \n",
"0 18.346358 189.953642 3700.662252 \n",
"1 18.420588 195.823529 3733.088235 \n",
"2 14.982114 217.186992 5076.016260 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT species, count(*) as count,\n",
" AVG(culmen_length_mm) as mean_culmen_length,\n",
" AVG(culmen_depth_mm) as mean_culment_depth,\n",
" AVG(flipper_length_mm) as mean_flipper_length,\n",
" AVG(body_mass_g) as mean_body_mass\n",
"FROM `bigquery-public-data.ml_datasets.penguins`\n",
"GROUP BY species"
]
},
{
"cell_type": "markdown",
"id": "36668bbc-db31-407c-b9f0-529403f3a767",
"metadata": {},
"source": [
"Review the mean values of each measurement within `species` and `island`:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c7190fc3-dbb8-41d7-a31d-0c0e90ac4b8a",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "65c6d5b7c7924dc2af4a348725ca62b9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ea0c7a2923334001801a4fb03605c6e9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>count</th>\n",
" <th>mean_culmen_length</th>\n",
" <th>mean_culment_depth</th>\n",
" <th>mean_flipper_length</th>\n",
" <th>mean_body_mass</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>56</td>\n",
" <td>38.501786</td>\n",
" <td>18.251786</td>\n",
" <td>189.732143</td>\n",
" <td>3688.392857</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
" <td>Dream</td>\n",
" <td>68</td>\n",
" <td>48.833824</td>\n",
" <td>18.420588</td>\n",
" <td>195.823529</td>\n",
" <td>3733.088235</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>124</td>\n",
" <td>47.504878</td>\n",
" <td>14.982114</td>\n",
" <td>217.186992</td>\n",
" <td>5076.016260</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Biscoe</td>\n",
" <td>44</td>\n",
" <td>38.975000</td>\n",
" <td>18.370455</td>\n",
" <td>188.795455</td>\n",
" <td>3709.659091</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>52</td>\n",
" <td>38.950980</td>\n",
" <td>18.429412</td>\n",
" <td>191.196078</td>\n",
" <td>3706.372549</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species island count \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream 56 \n",
"1 Chinstrap penguin (Pygoscelis antarctica) Dream 68 \n",
"2 Gentoo penguin (Pygoscelis papua) Biscoe 124 \n",
"3 Adelie Penguin (Pygoscelis adeliae) Biscoe 44 \n",
"4 Adelie Penguin (Pygoscelis adeliae) Torgersen 52 \n",
"\n",
" mean_culmen_length mean_culment_depth mean_flipper_length mean_body_mass \n",
"0 38.501786 18.251786 189.732143 3688.392857 \n",
"1 48.833824 18.420588 195.823529 3733.088235 \n",
"2 47.504878 14.982114 217.186992 5076.016260 \n",
"3 38.975000 18.370455 188.795455 3709.659091 \n",
"4 38.950980 18.429412 191.196078 3706.372549 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT species, island, count(*) as count,\n",
" AVG(culmen_length_mm) as mean_culmen_length,\n",
" AVG(culmen_depth_mm) as mean_culment_depth,\n",
" AVG(flipper_length_mm) as mean_flipper_length,\n",
" AVG(body_mass_g) as mean_body_mass\n",
"FROM `bigquery-public-data.ml_datasets.penguins`\n",
"GROUP BY species, island"
]
},
{
"cell_type": "markdown",
"id": "d984e2ec-d6cd-4abb-aad0-6528da1252d3",
"metadata": {},
"source": [
"Review the mean values of each measurment within `species` and `sex`:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "f20f60ca-8600-4363-a9bd-099cf92a0f1e",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "630c5726983648e1940bd72f3b68365a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5763e453b68e4b0193bebbc35808ebf2",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>sex</th>\n",
" <th>count</th>\n",
" <th>mean_culmen_length</th>\n",
" <th>mean_culment_depth</th>\n",
" <th>mean_flipper_length</th>\n",
" <th>mean_body_mass</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>FEMALE</td>\n",
" <td>73</td>\n",
" <td>37.257534</td>\n",
" <td>17.621918</td>\n",
" <td>187.794521</td>\n",
" <td>3368.835616</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>MALE</td>\n",
" <td>73</td>\n",
" <td>40.390411</td>\n",
" <td>19.072603</td>\n",
" <td>192.410959</td>\n",
" <td>4043.493151</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
" <td>FEMALE</td>\n",
" <td>34</td>\n",
" <td>46.573529</td>\n",
" <td>17.588235</td>\n",
" <td>191.735294</td>\n",
" <td>3527.205882</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
" <td>MALE</td>\n",
" <td>34</td>\n",
" <td>51.094118</td>\n",
" <td>19.252941</td>\n",
" <td>199.911765</td>\n",
" <td>3938.970588</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>None</td>\n",
" <td>6</td>\n",
" <td>37.840000</td>\n",
" <td>18.320000</td>\n",
" <td>185.600000</td>\n",
" <td>3540.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>None</td>\n",
" <td>4</td>\n",
" <td>46.000000</td>\n",
" <td>14.166667</td>\n",
" <td>215.333333</td>\n",
" <td>4491.666667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>FEMALE</td>\n",
" <td>58</td>\n",
" <td>45.563793</td>\n",
" <td>14.237931</td>\n",
" <td>212.706897</td>\n",
" <td>4679.741379</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>MALE</td>\n",
" <td>61</td>\n",
" <td>49.473770</td>\n",
" <td>15.718033</td>\n",
" <td>221.540984</td>\n",
" <td>5484.836066</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>.</td>\n",
" <td>1</td>\n",
" <td>44.500000</td>\n",
" <td>15.700000</td>\n",
" <td>217.000000</td>\n",
" <td>4875.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species sex count \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) FEMALE 73 \n",
"1 Adelie Penguin (Pygoscelis adeliae) MALE 73 \n",
"2 Chinstrap penguin (Pygoscelis antarctica) FEMALE 34 \n",
"3 Chinstrap penguin (Pygoscelis antarctica) MALE 34 \n",
"4 Adelie Penguin (Pygoscelis adeliae) None 6 \n",
"5 Gentoo penguin (Pygoscelis papua) None 4 \n",
"6 Gentoo penguin (Pygoscelis papua) FEMALE 58 \n",
"7 Gentoo penguin (Pygoscelis papua) MALE 61 \n",
"8 Gentoo penguin (Pygoscelis papua) . 1 \n",
"\n",
" mean_culmen_length mean_culment_depth mean_flipper_length mean_body_mass \n",
"0 37.257534 17.621918 187.794521 3368.835616 \n",
"1 40.390411 19.072603 192.410959 4043.493151 \n",
"2 46.573529 17.588235 191.735294 3527.205882 \n",
"3 51.094118 19.252941 199.911765 3938.970588 \n",
"4 37.840000 18.320000 185.600000 3540.000000 \n",
"5 46.000000 14.166667 215.333333 4491.666667 \n",
"6 45.563793 14.237931 212.706897 4679.741379 \n",
"7 49.473770 15.718033 221.540984 5484.836066 \n",
"8 44.500000 15.700000 217.000000 4875.000000 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT species, sex, count(*) as count,\n",
" AVG(culmen_length_mm) as mean_culmen_length,\n",
" AVG(culmen_depth_mm) as mean_culment_depth,\n",
" AVG(flipper_length_mm) as mean_flipper_length,\n",
" AVG(body_mass_g) as mean_body_mass\n",
"FROM `bigquery-public-data.ml_datasets.penguins`\n",
"GROUP BY species, sex"
]
},
{
"cell_type": "markdown",
"id": "0fd958c6-160d-4c57-b1dc-cf11d9531e72",
"metadata": {},
"source": [
"Which observations have missing values?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "06a9852c-6ac0-4aa9-a971-a9d6ce6b4296",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b7fe3207774646289c537e0819c9066b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "92ff75e521254b4b8bd4865bd5201638",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>body_mass_g</th>\n",
" <th>sex</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>37.5</td>\n",
" <td>18.9</td>\n",
" <td>179.0</td>\n",
" <td>2975.0</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>47.3</td>\n",
" <td>13.8</td>\n",
" <td>216.0</td>\n",
" <td>4725.0</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>44.5</td>\n",
" <td>14.3</td>\n",
" <td>216.0</td>\n",
" <td>4100.0</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>44.5</td>\n",
" <td>15.7</td>\n",
" <td>217.0</td>\n",
" <td>4875.0</td>\n",
" <td>.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>46.2</td>\n",
" <td>14.4</td>\n",
" <td>214.0</td>\n",
" <td>4650.0</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>34.1</td>\n",
" <td>18.1</td>\n",
" <td>193.0</td>\n",
" <td>3475.0</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>37.8</td>\n",
" <td>17.1</td>\n",
" <td>186.0</td>\n",
" <td>3300.0</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>37.8</td>\n",
" <td>17.3</td>\n",
" <td>180.0</td>\n",
" <td>3700.0</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>42.0</td>\n",
" <td>20.2</td>\n",
" <td>190.0</td>\n",
" <td>4250.0</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species island culmen_length_mm \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream 37.5 \n",
"1 Gentoo penguin (Pygoscelis papua) Biscoe NaN \n",
"2 Gentoo penguin (Pygoscelis papua) Biscoe 47.3 \n",
"3 Gentoo penguin (Pygoscelis papua) Biscoe 44.5 \n",
"4 Gentoo penguin (Pygoscelis papua) Biscoe 44.5 \n",
"5 Gentoo penguin (Pygoscelis papua) Biscoe 46.2 \n",
"6 Adelie Penguin (Pygoscelis adeliae) Torgersen NaN \n",
"7 Adelie Penguin (Pygoscelis adeliae) Torgersen 34.1 \n",
"8 Adelie Penguin (Pygoscelis adeliae) Torgersen 37.8 \n",
"9 Adelie Penguin (Pygoscelis adeliae) Torgersen 37.8 \n",
"10 Adelie Penguin (Pygoscelis adeliae) Torgersen 42.0 \n",
"\n",
" culmen_depth_mm flipper_length_mm body_mass_g sex \n",
"0 18.9 179.0 2975.0 None \n",
"1 NaN NaN NaN None \n",
"2 13.8 216.0 4725.0 None \n",
"3 14.3 216.0 4100.0 None \n",
"4 15.7 217.0 4875.0 . \n",
"5 14.4 214.0 4650.0 None \n",
"6 NaN NaN NaN None \n",
"7 18.1 193.0 3475.0 None \n",
"8 17.1 186.0 3300.0 None \n",
"9 17.3 180.0 3700.0 None \n",
"10 20.2 190.0 4250.0 None "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM `bigquery-public-data.ml_datasets.penguins`\n",
"WHERE sex IS NULL OR sex = '.'\n",
" OR culmen_length_mm IS NULL\n",
" OR culmen_depth_mm IS NULL\n",
" OR flipper_length_mm IS NULL\n",
" OR body_mass_g IS NULL"
]
},
{
"cell_type": "markdown",
"id": "8a834c76-260f-4e2f-868b-12e3d1db4421",
"metadata": {},
"source": [
"### Processing As Dataframes Using BigFrames API\n",
"\n",
"It can be helpful to use the `.describe()` method from Pandas. The [BigFrames](https://cloud.google.com/python/docs/reference/bigframes/latest) API allows you to work in Python with dataframe like objects while the execution remains inside of BigQuery."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "c8b9ee2e-c343-4405-a1eb-475ac4fb6e63",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df = bf.read_gbq(BQ_SOURCE_TABLE)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "40ceca0a-e36d-4a89-8978-3ef34ecdb00e",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"Query job ab49247b-e12d-4b56-b680-ab6b1f2a051e is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:ab49247b-e12d-4b56-b680-ab6b1f2a051e&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Query job 44d9c49c-2226-498a-94b6-8cfd9b58d54a is DONE. 10.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:44d9c49c-2226-498a-94b6-8cfd9b58d54a&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>body_mass_g</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>342.0</td>\n",
" <td>342.0</td>\n",
" <td>342.0</td>\n",
" <td>342.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>43.92193</td>\n",
" <td>17.15117</td>\n",
" <td>200.915205</td>\n",
" <td>4201.754386</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>5.459584</td>\n",
" <td>1.974793</td>\n",
" <td>14.061714</td>\n",
" <td>801.954536</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>32.1</td>\n",
" <td>13.1</td>\n",
" <td>172.0</td>\n",
" <td>2700.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>39.2</td>\n",
" <td>15.5</td>\n",
" <td>190.0</td>\n",
" <td>3550.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>44.1</td>\n",
" <td>17.3</td>\n",
" <td>197.0</td>\n",
" <td>4000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>48.5</td>\n",
" <td>18.7</td>\n",
" <td>213.0</td>\n",
" <td>4750.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>59.6</td>\n",
" <td>21.5</td>\n",
" <td>231.0</td>\n",
" <td>6300.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 4 columns</p>\n",
"</div>[8 rows x 4 columns in total]"
],
"text/plain": [
" culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g\n",
"count 342.0 342.0 342.0 342.0\n",
"mean 43.92193 17.15117 200.915205 4201.754386\n",
"std 5.459584 1.974793 14.061714 801.954536\n",
"min 32.1 13.1 172.0 2700.0\n",
"25% 39.2 15.5 190.0 3550.0\n",
"50% 44.1 17.3 197.0 4000.0\n",
"75% 48.5 18.7 213.0 4750.0\n",
"max 59.6 21.5 231.0 6300.0\n",
"\n",
"[8 rows x 4 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"id": "d963149c-77ce-4765-bad2-53375750173b",
"metadata": {},
"source": [
"---\n",
"## BigQuery Setup\n",
"\n",
"This workflow uses a BigQuery Public Dataset table (reviewed above). This section creates (or links existing) dataset in the users BigQuery project. This dataset is used to store the model object created below."
]
},
{
"cell_type": "markdown",
"id": "6759be97-92e6-4e8e-9d2e-496398c70f8f",
"metadata": {},
"source": [
"Create the dataset if missing:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "229efdd8-7320-4cd2-81a2-c181614933f5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ds = bigquery.Dataset(f\"{BQ_PROJECT}.{BQ_DATASET}\")\n",
"ds.location = BQ_REGION\n",
"ds = bq.create_dataset(dataset = ds, exists_ok = True)"
]
},
{
"cell_type": "markdown",
"id": "25d9323a-e74c-4d4b-8d2a-53abc036be6b",
"metadata": {},
"source": [
"Review dataset attributes:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a6dbe25d-7b6f-43cd-88ac-7beccb1ce1ee",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'bqml'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.dataset_id"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "37d81560-22d8-42e3-aa78-0664d765c15e",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'statmike-mlops-349915'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.project"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "4528c5e2-5d64-4707-bd60-300cccfc3c24",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'statmike-mlops-349915:bqml'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.full_dataset_id"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "a798f7bf-9eaa-48e4-8682-2d1f738bd93b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'/projects/statmike-mlops-349915/datasets/bqml'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.path"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "10afc018-fbb7-463c-a016-8b862f777739",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'US'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.location"
]
},
{
"cell_type": "markdown",
"id": "606986f8-6686-457f-8e95-d280793b407a",
"metadata": {},
"source": [
"### Add Train/Eval Splits To Source Data\n",
"\n",
"Make a copy of the source data in the local project including a column `split` with values of 'TRAIN' and 'EVAL'. The code below shows how to do stratified sampling for balance across the categorical variable `island` within each `species`.\n",
"\n",
"> When using the [`CREATE MODEL` statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create) the method of data splitting can be specified with the `DATA_SPLIT_METHOD` option - see [boosted tree](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree#data_split_method) for example. The default value is `AUTO_SPLIT` which would not split the training data for this example because it has fewer than 500 rows. Specifying 'RANDOM' as the method will use the values of options `DATA_SPLIT_EVAL_FRACTION` and `DATA_SPLIT_TEST_FRACTION` to direct the data split. In the example here the `CUSTOM` method is used and the splitting is done ahead of time with the SQL query that follows. It does stratifed sampling to ensure balanced splits across both 'species' and 'island' values."
]
},
{
"cell_type": "markdown",
"id": "0b36206e-31a8-4606-afa4-a3978fcd9a59",
"metadata": {},
"source": [
"Use the Python Client for BigQuery to create the source table with `split` column. This uses an formated string in Python to construct the query string using parameters."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "551d347d-914a-40af-8a64-466b2ad60616",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"<google.cloud.bigquery.table._EmptyRowIterator at 0x7fe67cd81f60>"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_job = bq.query(\n",
" f'''\n",
" CREATE OR REPLACE TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE_PREFIX}-source` AS\n",
" SELECT * EXCEPT(seq, count),\n",
" CASE\n",
" WHEN seq <= CEIL(.1 * count) THEN 'EVAL'\n",
" WHEN species IS NULL THEN 'EVAL'\n",
" ELSE 'TRAIN'\n",
" END AS split\n",
" FROM (\n",
" SELECT * EXCEPT(sex),\n",
" CASE WHEN sex = '.' THEN NULL ELSE sex END AS sex,\n",
" ROW_NUMBER() OVER (PARTITION BY species, island ORDER BY RAND()) as seq\n",
" FROM `{BQ_SOURCE_TABLE}`\n",
" )\n",
" LEFT OUTER JOIN (\n",
" SELECT species, island, COUNT(*) as count\n",
" FROM `{BQ_SOURCE_TABLE}`\n",
" GROUP BY species, island\n",
" )\n",
" USING(species, island)\n",
" '''\n",
")\n",
"query_job.result()"
]
},
{
"cell_type": "markdown",
"id": "f523a611-45a0-4215-a759-bc9d448a6b32",
"metadata": {},
"source": [
"Print out the table name with full URI:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "a1337212-c51d-4cb0-8f52-7f2f6726d1ca",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"statmike-mlops-349915.bqml.feature-engineering-source\n"
]
}
],
"source": [
"print(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE_PREFIX}-source')"
]
},
{
"cell_type": "markdown",
"id": "47f6df51-e717-4ec2-a112-9a0a7ed70bf0",
"metadata": {},
"source": [
"Review the TRAIN/EVAL split:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "3a79e14f-04c9-4975-a324-417b263b8788",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0ebc55820f794592be30011c75cccd59",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5ca13be33f9b4525b323907f83ff0a31",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>count</th>\n",
" <th>TRAIN_PCT</th>\n",
" <th>EVAL_PCT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>56</td>\n",
" <td>89.285714</td>\n",
" <td>10.714286</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
" <td>Dream</td>\n",
" <td>68</td>\n",
" <td>89.705882</td>\n",
" <td>10.294118</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>124</td>\n",
" <td>89.516129</td>\n",
" <td>10.483871</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Biscoe</td>\n",
" <td>44</td>\n",
" <td>88.636364</td>\n",
" <td>11.363636</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>52</td>\n",
" <td>88.461538</td>\n",
" <td>11.538462</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species island count TRAIN_PCT \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream 56 89.285714 \n",
"1 Chinstrap penguin (Pygoscelis antarctica) Dream 68 89.705882 \n",
"2 Gentoo penguin (Pygoscelis papua) Biscoe 124 89.516129 \n",
"3 Adelie Penguin (Pygoscelis adeliae) Biscoe 44 88.636364 \n",
"4 Adelie Penguin (Pygoscelis adeliae) Torgersen 52 88.461538 \n",
"\n",
" EVAL_PCT \n",
"0 10.714286 \n",
"1 10.294118 \n",
"2 10.483871 \n",
"3 11.363636 \n",
"4 11.538462 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT species, island,\n",
" COUNT(*) as count,\n",
" 100 * COUNTIF(split = 'TRAIN')/COUNT(*) AS TRAIN_PCT,\n",
" 100 * COUNTIF(split = 'EVAL')/COUNT(*) AS EVAL_PCT\n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"GROUP BY species, island"
]
},
{
"cell_type": "markdown",
"id": "ba19b0be-5d5c-4fb1-81d6-f0575bc029c8",
"metadata": {},
"source": [
"Which observations have missing values? "
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "93eb0496-ec52-4b01-844b-7025082b24ca",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "13bcefe577eb40b09becb2488b379e86",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "31f27f9fc3594250b7213f1e697f89e2",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>body_mass_g</th>\n",
" <th>sex</th>\n",
" <th>split</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>37.5</td>\n",
" <td>18.9</td>\n",
" <td>179.0</td>\n",
" <td>2975.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>47.3</td>\n",
" <td>13.8</td>\n",
" <td>216.0</td>\n",
" <td>4725.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>44.5</td>\n",
" <td>14.3</td>\n",
" <td>216.0</td>\n",
" <td>4100.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>44.5</td>\n",
" <td>15.7</td>\n",
" <td>217.0</td>\n",
" <td>4875.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>46.2</td>\n",
" <td>14.4</td>\n",
" <td>214.0</td>\n",
" <td>4650.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>34.1</td>\n",
" <td>18.1</td>\n",
" <td>193.0</td>\n",
" <td>3475.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>37.8</td>\n",
" <td>17.1</td>\n",
" <td>186.0</td>\n",
" <td>3300.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>37.8</td>\n",
" <td>17.3</td>\n",
" <td>180.0</td>\n",
" <td>3700.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Torgersen</td>\n",
" <td>42.0</td>\n",
" <td>20.2</td>\n",
" <td>190.0</td>\n",
" <td>4250.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species island culmen_length_mm \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream 37.5 \n",
"1 Gentoo penguin (Pygoscelis papua) Biscoe NaN \n",
"2 Gentoo penguin (Pygoscelis papua) Biscoe 47.3 \n",
"3 Gentoo penguin (Pygoscelis papua) Biscoe 44.5 \n",
"4 Gentoo penguin (Pygoscelis papua) Biscoe 44.5 \n",
"5 Gentoo penguin (Pygoscelis papua) Biscoe 46.2 \n",
"6 Adelie Penguin (Pygoscelis adeliae) Torgersen NaN \n",
"7 Adelie Penguin (Pygoscelis adeliae) Torgersen 34.1 \n",
"8 Adelie Penguin (Pygoscelis adeliae) Torgersen 37.8 \n",
"9 Adelie Penguin (Pygoscelis adeliae) Torgersen 37.8 \n",
"10 Adelie Penguin (Pygoscelis adeliae) Torgersen 42.0 \n",
"\n",
" culmen_depth_mm flipper_length_mm body_mass_g sex split \n",
"0 18.9 179.0 2975.0 None TRAIN \n",
"1 NaN NaN NaN None TRAIN \n",
"2 13.8 216.0 4725.0 None TRAIN \n",
"3 14.3 216.0 4100.0 None TRAIN \n",
"4 15.7 217.0 4875.0 None TRAIN \n",
"5 14.4 214.0 4650.0 None TRAIN \n",
"6 NaN NaN NaN None TRAIN \n",
"7 18.1 193.0 3475.0 None TRAIN \n",
"8 17.1 186.0 3300.0 None TRAIN \n",
"9 17.3 180.0 3700.0 None TRAIN \n",
"10 20.2 190.0 4250.0 None TRAIN "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"WHERE sex IS NULL OR sex = '.'\n",
" OR culmen_length_mm IS NULL\n",
" OR culmen_depth_mm IS NULL\n",
" OR flipper_length_mm IS NULL\n",
" OR body_mass_g IS NULL"
]
},
{
"cell_type": "markdown",
"id": "f5971c6c-9b68-4eac-9e64-82fd79f67309",
"metadata": {},
"source": [
"---\n",
"## Embedded Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "23f96d54-6617-4d9a-8d27-e5d660d5fbed",
"metadata": {},
"source": [
"### Create Model Using `TRANSFORM` statement\n",
"\n",
"Using the [`TRANSFORM` clause](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform), you can specify the desired preprocessing of columns into features. In the case of this data there are several desired preprocessing steps based on the data review above:\n",
"- impute missing values with [`ML.IMPUTER`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-imputer)\n",
"- scale the `body_mass_g` column with [`ML.ROBUST_SCALER`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-robust-scaler)\n",
"- scale the other numerical columns with [`ML.STANDARD_SCALER`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-standard-scaler)\n",
"\n",
"The model specification below does the data imputation in the input query and the scaling is embedded in the model with a `TRANSFORM` clause."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "1c732743-b7bd-4d40-bbcb-723d888f636a",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "78521bcee0de410d878cbc20c2a63670",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`\n",
" TRANSFORM(\n",
" species, sex, island, split,\n",
" ML.ROBUST_SCALER(body_mass_g) OVER() AS body_mass_g,\n",
" ML.STANDARD_SCALER(culmen_length_mm) OVER() AS culmen_length_mm,\n",
" ML.STANDARD_SCALER(culmen_depth_mm) OVER() AS culmen_depth_mm,\n",
" ML.STANDARD_SCALER(flipper_length_mm) OVER() AS flipper_length_mm\n",
" )\n",
" OPTIONS(\n",
" model_type = 'BOOSTED_TREE_CLASSIFIER',\n",
" input_label_cols = ['species'],\n",
" data_split_method = 'CUSTOM',\n",
" data_split_col = 'split',\n",
" model_registry = 'VERTEX_AI',\n",
" VERTEX_AI_MODEL_ID = 'bqml_embedded_preprocessing'\n",
" )\n",
"AS\n",
"SELECT species, island,\n",
" CASE WHEN split = 'TRAIN' THEN FALSE ELSE TRUE END AS split,\n",
" ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,\n",
" ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,\n",
" ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,\n",
" ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,\n",
" ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm, \n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`"
]
},
{
"cell_type": "markdown",
"id": "1ac82fe0-9217-4265-a169-00b542d8a4cb",
"metadata": {},
"source": [
"The feature information for the model can be reviewed with [ML.FEATURE_INFO](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-feature). This shows summary statistics pre-transformation. Notice that the `null_count` is 0 for all features because the [`ML.IMPUTER`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-imputer) functions filled in the missing values as instructed on the query statement."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "3b1cad55-d67a-45d3-80b8-4d93ba375264",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "fc0564390cd54ce4b5c8178a7bfac20d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7226f2034d414ed5b9e62c51bda79e9a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>input</th>\n",
" <th>min</th>\n",
" <th>max</th>\n",
" <th>mean</th>\n",
" <th>median</th>\n",
" <th>stddev</th>\n",
" <th>category_count</th>\n",
" <th>null_count</th>\n",
" <th>dimension</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>culmen_length_mm</td>\n",
" <td>32.1</td>\n",
" <td>58.0</td>\n",
" <td>43.835322</td>\n",
" <td>44.4</td>\n",
" <td>5.378822</td>\n",
" <td><NA></td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>body_mass_g</td>\n",
" <td>2700.0</td>\n",
" <td>6300.0</td>\n",
" <td>4201.710098</td>\n",
" <td>4000.0</td>\n",
" <td>801.896704</td>\n",
" <td><NA></td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>flipper_length_mm</td>\n",
" <td>172.0</td>\n",
" <td>231.0</td>\n",
" <td>200.732347</td>\n",
" <td>197.0</td>\n",
" <td>14.054983</td>\n",
" <td><NA></td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>island</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>culmen_depth_mm</td>\n",
" <td>13.1</td>\n",
" <td>21.5</td>\n",
" <td>17.100333</td>\n",
" <td>17.3</td>\n",
" <td>1.936125</td>\n",
" <td><NA></td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>sex</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" input min max mean median stddev \\\n",
"0 culmen_length_mm 32.1 58.0 43.835322 44.4 5.378822 \n",
"1 body_mass_g 2700.0 6300.0 4201.710098 4000.0 801.896704 \n",
"2 flipper_length_mm 172.0 231.0 200.732347 197.0 14.054983 \n",
"3 island NaN NaN NaN NaN NaN \n",
"4 culmen_depth_mm 13.1 21.5 17.100333 17.3 1.936125 \n",
"5 sex NaN NaN NaN NaN NaN \n",
"\n",
" category_count null_count dimension \n",
"0 <NA> 0 <NA> \n",
"1 <NA> 0 <NA> \n",
"2 <NA> 0 <NA> \n",
"3 3 0 <NA> \n",
"4 <NA> 0 <NA> \n",
"5 2 0 <NA> "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM ML.FEATURE_INFO(MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`)"
]
},
{
"cell_type": "markdown",
"id": "593d1b75-3765-4e99-8a0a-26bdd3189b13",
"metadata": {},
"source": [
"The [`ML.EVALUATE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate) function can be used to review the evaluation metrics, here for both splits combined. Notice that the imputation with [`ML.IMPUTER`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-imputer) function needs to be repeated because it was not embedded in the model above."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "6b6cdfe3-b6f9-4c24-a37f-479643044d4d",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "556b3d255fd94cd3bdab18cb50043046",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "48900599908f4fd2a6329b5b4c26c179",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>precision</th>\n",
" <th>recall</th>\n",
" <th>accuracy</th>\n",
" <th>f1_score</th>\n",
" <th>log_loss</th>\n",
" <th>roc_auc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.012105</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" precision recall accuracy f1_score log_loss roc_auc\n",
"0 1.0 1.0 1.0 1.0 0.012105 1.0"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"WITH\n",
" imputed AS (\n",
" SELECT species, island, split,\n",
" ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,\n",
" ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,\n",
" ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,\n",
" ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,\n",
" ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm, \n",
" FROM `statmike-mlops-349915.bqml.feature-engineering-source` \n",
" )\n",
"SELECT *\n",
"FROM ML.EVALUATE(\n",
" MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`,\n",
" (SELECT * FROM imputed)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "46893141-6c73-4e6e-8809-f3b336a886ed",
"metadata": {},
"source": [
"---\n",
"## Reusable Preprocessing\n",
"\n",
"With the [`ML.TRANSFORM`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-transform) function you can transform the results of a query statement using the transformation of a previously created model. That makes any models `TRANSFORM` statement completely reusuable. This is helpful because the transform statement also remembers values that were calculated when they were created - like the mean and standard-deviation used with [`ML.STANDARD_SCALER`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-standard-scaler)."
]
},
{
"cell_type": "markdown",
"id": "6f66fa46-27a2-478c-9e8a-efbad37fa9c9",
"metadata": {},
"source": [
"### Using ML.TRANSFORM"
]
},
{
"cell_type": "markdown",
"id": "96489752-34a6-49f3-ba04-8dd2b585b8c1",
"metadata": {},
"source": [
"First, a sample of raw data:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "836f844e-e68b-4818-aacb-f236c075c90c",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "832f4ac52dee49e8a2208c9d35a5b10c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cfad7a9ed475452bac4f1942afeef45b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>body_mass_g</th>\n",
" <th>sex</th>\n",
" <th>split</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>47.3</td>\n",
" <td>13.8</td>\n",
" <td>216.0</td>\n",
" <td>4725.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>44.5</td>\n",
" <td>14.3</td>\n",
" <td>216.0</td>\n",
" <td>4100.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>44.5</td>\n",
" <td>15.7</td>\n",
" <td>217.0</td>\n",
" <td>4875.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>46.2</td>\n",
" <td>14.4</td>\n",
" <td>214.0</td>\n",
" <td>4650.0</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species island culmen_length_mm \\\n",
"0 Gentoo penguin (Pygoscelis papua) Biscoe NaN \n",
"1 Gentoo penguin (Pygoscelis papua) Biscoe 47.3 \n",
"2 Gentoo penguin (Pygoscelis papua) Biscoe 44.5 \n",
"3 Gentoo penguin (Pygoscelis papua) Biscoe 44.5 \n",
"4 Gentoo penguin (Pygoscelis papua) Biscoe 46.2 \n",
"\n",
" culmen_depth_mm flipper_length_mm body_mass_g sex split \n",
"0 NaN NaN NaN None TRAIN \n",
"1 13.8 216.0 4725.0 None TRAIN \n",
"2 14.3 216.0 4100.0 None TRAIN \n",
"3 15.7 217.0 4875.0 None TRAIN \n",
"4 14.4 214.0 4650.0 None TRAIN "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"WHERE sex IS NULL AND island = 'Biscoe'"
]
},
{
"cell_type": "markdown",
"id": "c6364755-cb79-4efe-b4ba-84496dfaa37b",
"metadata": {},
"source": [
"Next, the same raw sample processed with imputed missing values first:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "ed7f8d54-8c52-411a-a028-42b4a5853ce7",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "dd13a05abd814a7ba665903d4304df72",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0f875d0de78a4ae6a057b0f260f63d60",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: | |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species</th>\n",
" <th>split</th>\n",
" <th>island</th>\n",
" <th>sex_null</th>\n",
" <th>sex</th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [species, split, island, sex_null, sex, body_mass_g, culmen_length_mm, culmen_depth_mm, flipper_length_mm]\n",
"Index: []"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"WITH\n",
" imputed AS (\n",
" SELECT species, split, island,\n",
" CASE WHEN sex IS NULL THEN TRUE ELSE FALSE END AS sex_null,\n",
" ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,\n",
" ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,\n",
" ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,\n",
" ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,\n",
" ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm, \n",
" FROM `statmike-mlops-349915.bqml.feature-engineering-source` \n",
" )\n",
"SELECT *\n",
"FROM imputed\n",
"WHERE sex IS NULL AND island = 'Biscoe' "
]
},
{
"cell_type": "markdown",
"id": "5130cd45-82a8-456b-a714-171348332e4d",
"metadata": {},
"source": [
"Now, the raw sample with imputed values and then processed with the transformation from the model created above using `ML.TRANSFORM` to recall the reusable transformations:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "e0488365-ffe4-4ab2-b680-5ceb6219182f",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "01d329dcae2641f580cb3a74f798e64b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c4b9c915889e4aa79fa63b842fc112b9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sex</th>\n",
" <th>island</th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>species</th>\n",
" <th>split</th>\n",
" <th>sex_null</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>MALE</td>\n",
" <td>Biscoe</td>\n",
" <td>0.500000</td>\n",
" <td>0.440350</td>\n",
" <td>-1.396969</td>\n",
" <td>0.945546</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>TRAIN</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>MALE</td>\n",
" <td>Biscoe</td>\n",
" <td>0.000000</td>\n",
" <td>0.016132</td>\n",
" <td>0.026317</td>\n",
" <td>0.013056</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>TRAIN</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>MALE</td>\n",
" <td>Biscoe</td>\n",
" <td>0.041667</td>\n",
" <td>0.123779</td>\n",
" <td>-1.448703</td>\n",
" <td>1.088077</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>TRAIN</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>MALE</td>\n",
" <td>Biscoe</td>\n",
" <td>0.687500</td>\n",
" <td>0.123779</td>\n",
" <td>-0.724429</td>\n",
" <td>1.159342</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>TRAIN</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>MALE</td>\n",
" <td>Biscoe</td>\n",
" <td>0.562500</td>\n",
" <td>0.645190</td>\n",
" <td>-1.707373</td>\n",
" <td>1.088077</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>TRAIN</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sex island body_mass_g culmen_length_mm culmen_depth_mm \\\n",
"0 MALE Biscoe 0.500000 0.440350 -1.396969 \n",
"1 MALE Biscoe 0.000000 0.016132 0.026317 \n",
"2 MALE Biscoe 0.041667 0.123779 -1.448703 \n",
"3 MALE Biscoe 0.687500 0.123779 -0.724429 \n",
"4 MALE Biscoe 0.562500 0.645190 -1.707373 \n",
"\n",
" flipper_length_mm species split sex_null \n",
"0 0.945546 Gentoo penguin (Pygoscelis papua) TRAIN True \n",
"1 0.013056 Gentoo penguin (Pygoscelis papua) TRAIN True \n",
"2 1.088077 Gentoo penguin (Pygoscelis papua) TRAIN True \n",
"3 1.159342 Gentoo penguin (Pygoscelis papua) TRAIN True \n",
"4 1.088077 Gentoo penguin (Pygoscelis papua) TRAIN True "
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"WITH\n",
" imputed AS (\n",
" SELECT species, split, island,\n",
" CASE WHEN sex IS NULL THEN TRUE ELSE FALSE END AS sex_null,\n",
" ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,\n",
" ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,\n",
" ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,\n",
" ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,\n",
" ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm, \n",
" FROM `statmike-mlops-349915.bqml.feature-engineering-source` \n",
" )\n",
"SELECT *\n",
"FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`,\n",
" (SELECT *\n",
" FROM imputed\n",
" WHERE sex_null AND island = 'Biscoe'\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "24b20264-840a-4372-88bb-31659d52bd73",
"metadata": {},
"source": [
"---\n",
"## Modular Preprocessing\n",
"\n",
"What if you want to take adavantage of the `TRANSFORM` clauses results repeatedly across many models and other parts of your workflow? Also, what if you want to apply multiple `TRANSFORM` clauses in sequence: like imputation then scaling? A great option is [building a model with only transformations using the `model_type = 'TRANSFORM_ONLY'`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-transform) as follows:"
]
},
{
"cell_type": "markdown",
"id": "d8e383ff-1159-40cf-8f82-03305c0072bf",
"metadata": {},
"source": [
"### Create A `TRANSFORM_ONLY` Model - For Imputation"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "fee2a744-4c00-4d6b-bd0a-f325b6cfb06b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0ec418ee238c47bd888925e9bbff1e60",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`\n",
" TRANSFORM(\n",
" ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,\n",
" ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,\n",
" ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,\n",
" ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,\n",
" ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm\n",
" )\n",
" OPTIONS(\n",
" model_type = 'TRANSFORM_ONLY',\n",
" model_registry = 'VERTEX_AI',\n",
" VERTEX_AI_MODEL_ID = 'bqml_modular_preprocessing_impute'\n",
" )\n",
"AS\n",
"SELECT * \n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"WHERE split = 'TRAIN'"
]
},
{
"cell_type": "markdown",
"id": "ffcc0613-fb25-4033-87aa-a937cda78471",
"metadata": {},
"source": [
"Now apply the `TRANSFORM_ONLY` model using `ML.TRANSFORM`:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "11a6faad-6e23-468b-83c5-bd05edaafeff",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "42444be24ade480da1e5fe7b03053f48",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0c36d87d8cd447fa89000db7e4f27065",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sex</th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>split</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>MALE</td>\n",
" <td>4050.0</td>\n",
" <td>43.9489</td>\n",
" <td>17.1557</td>\n",
" <td>200.961</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>MALE</td>\n",
" <td>4725.0</td>\n",
" <td>47.3000</td>\n",
" <td>13.8000</td>\n",
" <td>216.000</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>MALE</td>\n",
" <td>4100.0</td>\n",
" <td>44.5000</td>\n",
" <td>14.3000</td>\n",
" <td>216.000</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>MALE</td>\n",
" <td>4875.0</td>\n",
" <td>44.5000</td>\n",
" <td>15.7000</td>\n",
" <td>217.000</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>MALE</td>\n",
" <td>4650.0</td>\n",
" <td>46.2000</td>\n",
" <td>14.4000</td>\n",
" <td>214.000</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sex body_mass_g culmen_length_mm culmen_depth_mm flipper_length_mm \\\n",
"0 MALE 4050.0 43.9489 17.1557 200.961 \n",
"1 MALE 4725.0 47.3000 13.8000 216.000 \n",
"2 MALE 4100.0 44.5000 14.3000 216.000 \n",
"3 MALE 4875.0 44.5000 15.7000 217.000 \n",
"4 MALE 4650.0 46.2000 14.4000 214.000 \n",
"\n",
" species island split \n",
"0 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN \n",
"1 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN \n",
"2 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN \n",
"3 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN \n",
"4 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN "
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,\n",
" (SELECT *\n",
" FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
" WHERE sex IS NULL and island = 'Biscoe')\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d19d9389-2ea8-4f8c-96dc-16b6a3deccd3",
"metadata": {},
"source": [
"### Create A `TRANSFORM_ONLY` Model - For Scaling"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "1654de4e-7424-40ad-bd3f-9f03e1448b8b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "429e02fd16b449258747060cee831875",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`\n",
" TRANSFORM(\n",
" ML.ROBUST_SCALER(body_mass_g) OVER() AS body_mass_g,\n",
" ML.STANDARD_SCALER(culmen_length_mm) OVER() AS culmen_length_mm,\n",
" ML.STANDARD_SCALER(culmen_depth_mm) OVER() AS culmen_depth_mm,\n",
" ML.STANDARD_SCALER(flipper_length_mm) OVER() AS flipper_length_mm\n",
" )\n",
" OPTIONS(\n",
" model_type = 'TRANSFORM_ONLY',\n",
" model_registry = 'VERTEX_AI',\n",
" VERTEX_AI_MODEL_ID = 'bqml_modular_preprocessing_scale'\n",
")\n",
"AS\n",
"SELECT * \n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"WHERE split = 'TRAIN'"
]
},
{
"cell_type": "markdown",
"id": "14b4f91d-c429-4001-a302-e5ca2c2c0b1f",
"metadata": {},
"source": [
"Now apply the `TRANSFORM_ONLY` model using `ML.TRANSFORM`:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "fca298ac-3a23-4d14-9917-7d28ead67ba4",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2c78c4db4ea74408af0ffa4d0bb3199c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "83499a0859a0437fbe034cc5081ac66a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>sex</th>\n",
" <th>split</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.562500</td>\n",
" <td>0.625818</td>\n",
" <td>-1.678211</td>\n",
" <td>1.080419</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.041667</td>\n",
" <td>0.102918</td>\n",
" <td>-1.428157</td>\n",
" <td>1.080419</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.687500</td>\n",
" <td>0.102918</td>\n",
" <td>-0.728007</td>\n",
" <td>1.152260</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.500000</td>\n",
" <td>0.420393</td>\n",
" <td>-1.378146</td>\n",
" <td>0.936737</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>None</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" body_mass_g culmen_length_mm culmen_depth_mm flipper_length_mm \\\n",
"0 NaN NaN NaN NaN \n",
"1 0.562500 0.625818 -1.678211 1.080419 \n",
"2 0.041667 0.102918 -1.428157 1.080419 \n",
"3 0.687500 0.102918 -0.728007 1.152260 \n",
"4 0.500000 0.420393 -1.378146 0.936737 \n",
"\n",
" species island sex split \n",
"0 Gentoo penguin (Pygoscelis papua) Biscoe None TRAIN \n",
"1 Gentoo penguin (Pygoscelis papua) Biscoe None TRAIN \n",
"2 Gentoo penguin (Pygoscelis papua) Biscoe None TRAIN \n",
"3 Gentoo penguin (Pygoscelis papua) Biscoe None TRAIN \n",
"4 Gentoo penguin (Pygoscelis papua) Biscoe None TRAIN "
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`,\n",
" (SELECT *\n",
" FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
" WHERE sex IS NULL and island = 'Biscoe')\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0aa12eed-5136-4038-936c-4e65e9a6e4ba",
"metadata": {},
"source": [
"## Feature Pipeline: With Multiple `TRANSFORM_ONLY` Models\n",
"\n",
"Use BigQuery [CTEs (WITH Clause)](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with_clause) for modular pipeline definition."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "b61249ab-6018-4ac3-991f-43652d8426c1",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3aa606b878524b05a7961028a3f401ca",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b4dc812720c943608ad85eb873e16722",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>sex</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>split</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>MALE</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.562500</td>\n",
" <td>0.625818</td>\n",
" <td>-1.678211</td>\n",
" <td>1.080419</td>\n",
" <td>MALE</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.041667</td>\n",
" <td>0.102918</td>\n",
" <td>-1.428157</td>\n",
" <td>1.080419</td>\n",
" <td>MALE</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.687500</td>\n",
" <td>0.102918</td>\n",
" <td>-0.728007</td>\n",
" <td>1.152260</td>\n",
" <td>MALE</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.500000</td>\n",
" <td>0.420393</td>\n",
" <td>-1.378146</td>\n",
" <td>0.936737</td>\n",
" <td>MALE</td>\n",
" <td>Gentoo penguin (Pygoscelis papua)</td>\n",
" <td>Biscoe</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" body_mass_g culmen_length_mm culmen_depth_mm flipper_length_mm sex \\\n",
"0 0.000000 0.000000 0.000000 0.000000 MALE \n",
"1 0.562500 0.625818 -1.678211 1.080419 MALE \n",
"2 0.041667 0.102918 -1.428157 1.080419 MALE \n",
"3 0.687500 0.102918 -0.728007 1.152260 MALE \n",
"4 0.500000 0.420393 -1.378146 0.936737 MALE \n",
"\n",
" species island split \n",
"0 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN \n",
"1 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN \n",
"2 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN \n",
"3 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN \n",
"4 Gentoo penguin (Pygoscelis papua) Biscoe TRAIN "
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"WITH\n",
" raw AS (\n",
" SELECT *\n",
" FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
" WHERE sex IS NULL and island = 'Biscoe'\n",
" ),\n",
" impute AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,\n",
" (SELECT * FROM raw)\n",
" )\n",
" ),\n",
" scale AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`,\n",
" (SELECT * FROM impute)\n",
" )\n",
" )\n",
"SELECT *\n",
"FROM scale"
]
},
{
"cell_type": "markdown",
"id": "32d49e7e-fd52-42fc-9b7a-187ec727ca86",
"metadata": {},
"source": [
"This approach allows increased modularity with ease, so there can actually be feature specific transform like what might be accomplished with a feature store. This section create a TRANSFORM_ONLY model for each feature:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "b6c5749c-f85b-49d3-9684-95c4a45bcb23",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6ca7434632ea47efbab6415d4cd11f76",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale_body_mass_g`\n",
" TRANSFORM(\n",
" ML.ROBUST_SCALER(body_mass_g) OVER() AS body_mass_g\n",
" )\n",
" OPTIONS(model_type = 'TRANSFORM_ONLY')\n",
"AS\n",
"SELECT * \n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"WHERE split = 'TRAIN'"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "e2cae737-62f1-44ec-a8a3-d5b1d6fa7798",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b1153d779d8b4ffc9f7b1472e145df7a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale_culmen_length_mm`\n",
" TRANSFORM(\n",
" ML.STANDARD_SCALER(culmen_length_mm) OVER() AS culmen_length_mm\n",
" )\n",
" OPTIONS(model_type = 'TRANSFORM_ONLY')\n",
"AS\n",
"SELECT * \n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"WHERE split = 'TRAIN'"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "6411a686-996e-48ec-a8b6-1d45339f1589",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d10266b781d04ede8ed5707984e5eefe",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale_culmen_depth_mm`\n",
" TRANSFORM(\n",
" ML.STANDARD_SCALER(culmen_depth_mm) OVER() AS culmen_depth_mm\n",
" )\n",
" OPTIONS(model_type = 'TRANSFORM_ONLY')\n",
"AS\n",
"SELECT * \n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"WHERE split = 'TRAIN'"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "fe5a58ca-9dc8-4e07-8900-ebd5d6f77e63",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "fc849091705c47f5b811daa03839e18b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale_flipper_length_mm`\n",
" TRANSFORM(\n",
" ML.STANDARD_SCALER(flipper_length_mm) OVER() AS flipper_length_mm\n",
" )\n",
" OPTIONS(model_type = 'TRANSFORM_ONLY')\n",
"AS\n",
"SELECT * \n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
"WHERE split = 'TRAIN'"
]
},
{
"cell_type": "markdown",
"id": "b99464c6-ea4c-44e6-8cca-9444d1c978fd",
"metadata": {},
"source": [
"### Feature Pipeline As A View\n",
"\n",
"The feature pipeline can be made if more reusuable by using it in a [view](https://cloud.google.com/bigquery/docs/views):"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "6aa6a32b-3935-4ee2-81c3-46ed65c69d8a",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "210a9d1315e24b5bad62e693f9880f38",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE VIEW `statmike-mlops-349915.bqml.feature-engineering-preprocessing` AS\n",
"WITH\n",
" raw AS (\n",
" SELECT *\n",
" FROM `statmike-mlops-349915.bqml.feature-engineering-source`\n",
" ),\n",
" impute AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,\n",
" (SELECT * FROM raw)\n",
" )\n",
" ),\n",
" feature_1 AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale_body_mass_g`,\n",
" (SELECT * FROM impute)\n",
" )\n",
" ),\n",
" feature_2 AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale_culmen_length_mm`,\n",
" (SELECT * FROM feature_1)\n",
" )\n",
" ),\n",
" feature_3 AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale_culmen_depth_mm`,\n",
" (SELECT * FROM feature_2)\n",
" )\n",
" ),\n",
" feature_4 AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale_flipper_length_mm`,\n",
" (SELECT * FROM feature_3)\n",
" )\n",
" )\n",
"SELECT *\n",
"FROM feature_4"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "35441915-1586-4be0-a4e2-770ab87b0880",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c796207c76f947b79d809854b14e23e2",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cd78983a6768474fa1c0960323f6d374",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>flipper_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>body_mass_g</th>\n",
" <th>sex</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" <th>split</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-1.218498</td>\n",
" <td>0.872338</td>\n",
" <td>-0.569382</td>\n",
" <td>-0.125000</td>\n",
" <td>MALE</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-1.218498</td>\n",
" <td>0.972359</td>\n",
" <td>-0.774807</td>\n",
" <td>0.500000</td>\n",
" <td>MALE</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-1.218498</td>\n",
" <td>0.622284</td>\n",
" <td>-1.372408</td>\n",
" <td>-0.479167</td>\n",
" <td>FEMALE</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-0.643768</td>\n",
" <td>-0.277910</td>\n",
" <td>0.551118</td>\n",
" <td>-1.125000</td>\n",
" <td>FEMALE</td>\n",
" <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
" <td>Dream</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-0.643768</td>\n",
" <td>0.672295</td>\n",
" <td>-0.139857</td>\n",
" <td>0.041667</td>\n",
" <td>MALE</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" <td>TRAIN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" flipper_length_mm culmen_depth_mm culmen_length_mm body_mass_g sex \\\n",
"0 -1.218498 0.872338 -0.569382 -0.125000 MALE \n",
"1 -1.218498 0.972359 -0.774807 0.500000 MALE \n",
"2 -1.218498 0.622284 -1.372408 -0.479167 FEMALE \n",
"3 -0.643768 -0.277910 0.551118 -1.125000 FEMALE \n",
"4 -0.643768 0.672295 -0.139857 0.041667 MALE \n",
"\n",
" species island split \n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream TRAIN \n",
"1 Adelie Penguin (Pygoscelis adeliae) Dream TRAIN \n",
"2 Adelie Penguin (Pygoscelis adeliae) Dream TRAIN \n",
"3 Chinstrap penguin (Pygoscelis antarctica) Dream TRAIN \n",
"4 Adelie Penguin (Pygoscelis adeliae) Dream TRAIN "
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-preprocessing`\n",
"LIMIT 5"
]
},
{
"cell_type": "markdown",
"id": "64d27b79-6e0e-4e5e-8d4d-611cba26723f",
"metadata": {},
"source": [
"### Create A Model Using `TRANSFORM_ONLY` Models As A Feature Pipeline\n",
"\n",
"Using the feature pipeline built form BigQuery CTEs as above can be the input to a `CREATE MODEL` statement:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "b2330f87-b8a7-4519-bd5d-364472747546",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "080c6aca20fb411d9c9dc29415ba51a9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing`\n",
" OPTIONS(\n",
" model_type = 'BOOSTED_TREE_CLASSIFIER',\n",
" input_label_cols = ['species'],\n",
" data_split_method = 'CUSTOM',\n",
" data_split_col = 'split',\n",
" model_registry = 'VERTEX_AI',\n",
" VERTEX_AI_MODEL_ID = 'bqml_modular_preprocessing'\n",
" )\n",
"AS\n",
"SELECT * EXCEPT(split),\n",
" CASE WHEN split = 'TRAIN' THEN FALSE ELSE TRUE END AS split\n",
"FROM `statmike-mlops-349915.bqml.feature-engineering-preprocessing`"
]
},
{
"cell_type": "markdown",
"id": "9ac7ea2a-18cf-4233-9a34-930626deb961",
"metadata": {},
"source": [
"The feature information for the model can be reviewed with [ML.FEATURE_INFO](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-feature). This shows summary statistics pre-transformation inside the model, but since the features were pre-processed using modular `TRANSFORM_ONLY` models, the input features are already imputed and scaled."
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "93f58752-3879-437a-bff2-d3f7f5aefe18",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b1d26b47e5ae4812a28dcc89e0c5ce22",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9fa3357e29684e59a560aee678015c89",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>input</th>\n",
" <th>min</th>\n",
" <th>max</th>\n",
" <th>mean</th>\n",
" <th>median</th>\n",
" <th>stddev</th>\n",
" <th>category_count</th>\n",
" <th>null_count</th>\n",
" <th>dimension</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>flipper_length_mm</td>\n",
" <td>-2.080591</td>\n",
" <td>2.158036</td>\n",
" <td>-0.000025</td>\n",
" <td>-0.284563</td>\n",
" <td>0.998366</td>\n",
" <td><NA></td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>culmen_depth_mm</td>\n",
" <td>-2.028286</td>\n",
" <td>2.172617</td>\n",
" <td>0.000019</td>\n",
" <td>0.072166</td>\n",
" <td>0.998363</td>\n",
" <td><NA></td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>culmen_length_mm</td>\n",
" <td>-2.212783</td>\n",
" <td>2.624044</td>\n",
" <td>-0.000009</td>\n",
" <td>0.102918</td>\n",
" <td>0.998365</td>\n",
" <td><NA></td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>body_mass_g</td>\n",
" <td>-1.125000</td>\n",
" <td>1.875000</td>\n",
" <td>0.125407</td>\n",
" <td>0.000000</td>\n",
" <td>0.660268</td>\n",
" <td><NA></td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>sex</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>island</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td><NA></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" input min max mean median stddev \\\n",
"0 flipper_length_mm -2.080591 2.158036 -0.000025 -0.284563 0.998366 \n",
"1 culmen_depth_mm -2.028286 2.172617 0.000019 0.072166 0.998363 \n",
"2 culmen_length_mm -2.212783 2.624044 -0.000009 0.102918 0.998365 \n",
"3 body_mass_g -1.125000 1.875000 0.125407 0.000000 0.660268 \n",
"4 sex NaN NaN NaN NaN NaN \n",
"5 island NaN NaN NaN NaN NaN \n",
"\n",
" category_count null_count dimension \n",
"0 <NA> 0 <NA> \n",
"1 <NA> 0 <NA> \n",
"2 <NA> 0 <NA> \n",
"3 <NA> 0 <NA> \n",
"4 2 0 <NA> \n",
"5 3 0 <NA> "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM ML.FEATURE_INFO(MODEL `statmike-mlops-349915.bqml.modular_preprocessing`)"
]
},
{
"cell_type": "markdown",
"id": "c0ba7707-9176-4f0e-8088-577e927c5231",
"metadata": {},
"source": [
"The [`ML.EVALUATE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate) function can be used to review the evaluation metrics, here for both splits combined. Notice that the feature pipeline needs to be repeated because it is not embedded in the model in this case."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "c6bed37b-541e-4cd2-89e0-13ef15a32631",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "11e9d55650294ea5a0c34faebcdfe776",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "88a238f8e4234b4c84738ca6e5d18e15",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>precision</th>\n",
" <th>recall</th>\n",
" <th>accuracy</th>\n",
" <th>f1_score</th>\n",
" <th>log_loss</th>\n",
" <th>roc_auc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.997821</td>\n",
" <td>0.995098</td>\n",
" <td>0.997093</td>\n",
" <td>0.996438</td>\n",
" <td>0.015599</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" precision recall accuracy f1_score log_loss roc_auc\n",
"0 0.997821 0.995098 0.997093 0.996438 0.015599 1.0"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"SELECT *\n",
"FROM ML.EVALUATE(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing`,\n",
" (SELECT * FROM `statmike-mlops-349915.bqml.feature-engineering-preprocessing`)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c70ada0c-ef75-4f0f-b9db-8dca878b5c2a",
"metadata": {},
"source": [
"---\n",
"## Review Models With Step-By-Step Processing\n",
"\n",
"This section takes a single row of data and traces it through `TRANSFORM_ONLY` models as well as the predictive models created (trained) above."
]
},
{
"cell_type": "markdown",
"id": "5df914ad-99bc-4cab-906c-6c15568fcb75",
"metadata": {},
"source": [
"### Example Row - Raw\n",
"\n",
"Create an example row and include a missing value to showcase the imputation `TRANSFORM_ONLY` model."
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "01e285de-459f-46d4-a757-63cf78a4661c",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'known_species': 'Adelie Penguin (Pygoscelis adeliae)',\n",
" 'island': 'Dream',\n",
" 'culmen_length_mm': None,\n",
" 'culmen_depth_mm': 18.9,\n",
" 'flipper_length_mm': 184.0,\n",
" 'body_mass_g': 3900.0,\n",
" 'sex': 'MALE'}"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_dict = dict(\n",
" known_species = 'Adelie Penguin (Pygoscelis adeliae)',\n",
" island = 'Dream',\n",
" culmen_length_mm = None, # 40.9\n",
" culmen_depth_mm = 18.9,\n",
" flipper_length_mm = 184.0,\n",
" body_mass_g = 3900.00,\n",
" sex = 'MALE'\n",
")\n",
"raw_dict"
]
},
{
"cell_type": "markdown",
"id": "8b2f166d-1571-4b38-b6fb-e5e611623956",
"metadata": {},
"source": [
"### Preprocessing - Imputation"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "8cf7ab40-bd33-4fa9-bcc0-d8b55edc7123",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a46881dda25440b4840e8b5f1094c0a5",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "42d58a8d3c3c4a76ae32200fa6ba5339",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%bigquery impute_dict --project $PROJECT_ID\n",
"WITH\n",
" raw AS (\n",
" SELECT\n",
" 'Adelie Penguin (Pygoscelis adeliae)' as species,\n",
" 'Dream' as island,\n",
" Null as culmen_length_mm,\n",
" 18.9 as culmen_depth_mm,\n",
" 184.0 as flipper_length_mm,\n",
" 3900.00 as body_mass_g,\n",
" 'MALE' as sex\n",
" )\n",
"SELECT *\n",
"FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,\n",
" (SELECT * FROM raw)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "46b3fae2-bd83-45ff-80ef-6cd13f33c699",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sex</th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>MALE</td>\n",
" <td>3900.0</td>\n",
" <td>43.8348</td>\n",
" <td>18.9</td>\n",
" <td>184.0</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sex body_mass_g culmen_length_mm culmen_depth_mm flipper_length_mm \\\n",
"0 MALE 3900.0 43.8348 18.9 184.0 \n",
"\n",
" species island \n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"impute_dict"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "f4a263d6-94ae-4735-9d19-8e004824a0b8",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'sex': 'MALE',\n",
" 'body_mass_g': 3900.0,\n",
" 'culmen_length_mm': 43.8348,\n",
" 'culmen_depth_mm': 18.9,\n",
" 'flipper_length_mm': 184.0,\n",
" 'species': 'Adelie Penguin (Pygoscelis adeliae)',\n",
" 'island': 'Dream'}"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"impute_dict = impute_dict.to_dict(orient = 'records')[0]\n",
"impute_dict"
]
},
{
"cell_type": "markdown",
"id": "89dd6da2-c1e1-4661-ba5a-968718dfe285",
"metadata": {},
"source": [
"### Preprocessing - Imputation and Scaling"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "15723e68-cb83-4f03-95f8-959bb99ebe50",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "21ded754f94e404a9b1b427ced861fc4",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b95062a088d94c12a0533f3594ca2595",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%bigquery scale_dict --project $PROJECT_ID\n",
"WITH\n",
" raw AS (\n",
" SELECT\n",
" 'Adelie Penguin (Pygoscelis adeliae)' as species,\n",
" 'Dream' as island,\n",
" Null as culmen_length_mm,\n",
" 18.9 as culmen_depth_mm,\n",
" 184.0 as flipper_length_mm,\n",
" 3900.00 as body_mass_g,\n",
" 'MALE' as sex\n",
" ),\n",
" impute AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,\n",
" (SELECT * FROM raw)\n",
" )\n",
" )\n",
"SELECT *\n",
"FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`,\n",
" (SELECT * FROM impute)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "23f0f3f0-1014-4d98-a4a1-f26e325ebd02",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>sex</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.083333</td>\n",
" <td>0.0</td>\n",
" <td>0.928175</td>\n",
" <td>-1.18845</td>\n",
" <td>MALE</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" body_mass_g culmen_length_mm culmen_depth_mm flipper_length_mm sex \\\n",
"0 -0.083333 0.0 0.928175 -1.18845 MALE \n",
"\n",
" species island \n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream "
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scale_dict"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "10db31cc-444a-44d5-949a-8c5e5413098d",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'body_mass_g': -0.08333333333333333,\n",
" 'culmen_length_mm': 0.0,\n",
" 'culmen_depth_mm': 0.9281747443652044,\n",
" 'flipper_length_mm': -1.188450063929535,\n",
" 'sex': 'MALE',\n",
" 'species': 'Adelie Penguin (Pygoscelis adeliae)',\n",
" 'island': 'Dream'}"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scale_dict = scale_dict.to_dict(orient = 'records')[0]\n",
"scale_dict"
]
},
{
"cell_type": "markdown",
"id": "522beaba-8e2c-45b7-9288-c3333c5cf593",
"metadata": {},
"source": [
"### Prediction With Embedded Preprocessing\n",
"\n",
"The model created above with embedded preprocessing has the scaling transformation built-in. Here the imputed data row is used with `ML.PREDICT` to request a prediction with the embedded preprocessing version of the model."
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "aa9f6169-9cc0-416d-aae4-ed6a7c6e8106",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1cba910efd97480ab8e25b77ce70c627",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9a2ea6893caf45eca665b74d171bace7",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>predicted_species</th>\n",
" <th>predicted_species_probs</th>\n",
" <th>sex</th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>[{'label': 'Gentoo penguin (Pygoscelis papua)'...</td>\n",
" <td>MALE</td>\n",
" <td>3900.0</td>\n",
" <td>43.8348</td>\n",
" <td>18.9</td>\n",
" <td>184.0</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" predicted_species \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) \n",
"\n",
" predicted_species_probs sex body_mass_g \\\n",
"0 [{'label': 'Gentoo penguin (Pygoscelis papua)'... MALE 3900.0 \n",
"\n",
" culmen_length_mm culmen_depth_mm flipper_length_mm \\\n",
"0 43.8348 18.9 184.0 \n",
"\n",
" species island \n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream "
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"WITH\n",
" raw AS (\n",
" SELECT\n",
" 'Adelie Penguin (Pygoscelis adeliae)' as species,\n",
" 'Dream' as island,\n",
" Null as culmen_length_mm,\n",
" 18.9 as culmen_depth_mm,\n",
" 184.0 as flipper_length_mm,\n",
" 3900.00 as body_mass_g,\n",
" 'MALE' as sex\n",
" ),\n",
" impute AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,\n",
" (SELECT * FROM raw)\n",
" )\n",
" )\n",
"SELECT *\n",
"FROM ML.PREDICT(\n",
" MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`,\n",
" (SELECT * FROM impute)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f70a5ed6-0626-47e1-b82f-30f8751147e6",
"metadata": {},
"source": [
"### Prediction With Modular Preprocessing\n",
"\n",
"The modular preprocessing model did all the preprocessing prior to the model with `TRANSFORM_ONLY` models. This example takes the row after it has been processed through imputation and scaling."
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "571cf882-2e31-4962-9b4e-8f8301d24186",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3df704c700f64c1fb94923ad55a885d7",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "584bb6046f974538b6d12be8f2b3f8e7",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>predicted_species</th>\n",
" <th>predicted_species_probs</th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>sex</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>[{'label': 'Gentoo penguin (Pygoscelis papua)'...</td>\n",
" <td>-0.083333</td>\n",
" <td>0.0</td>\n",
" <td>0.928175</td>\n",
" <td>-1.18845</td>\n",
" <td>MALE</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" predicted_species \\\n",
"0 Adelie Penguin (Pygoscelis adeliae) \n",
"\n",
" predicted_species_probs body_mass_g \\\n",
"0 [{'label': 'Gentoo penguin (Pygoscelis papua)'... -0.083333 \n",
"\n",
" culmen_length_mm culmen_depth_mm flipper_length_mm sex \\\n",
"0 0.0 0.928175 -1.18845 MALE \n",
"\n",
" species island \n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream "
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"WITH\n",
" raw AS (\n",
" SELECT\n",
" 'Adelie Penguin (Pygoscelis adeliae)' as species,\n",
" 'Dream' as island,\n",
" Null as culmen_length_mm,\n",
" 18.9 as culmen_depth_mm,\n",
" 184.0 as flipper_length_mm,\n",
" 3900.00 as body_mass_g,\n",
" 'MALE' as sex\n",
" ),\n",
" impute AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,\n",
" (SELECT * FROM raw)\n",
" )\n",
" ),\n",
" scale AS (\n",
" SELECT *\n",
" FROM ML.TRANSFORM(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`,\n",
" (SELECT * FROM impute)\n",
" )\n",
" )\n",
"SELECT *\n",
"FROM ML.PREDICT(\n",
" MODEL `statmike-mlops-349915.bqml.modular_preprocessing`,\n",
" (SELECT * FROM scale)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "a9431eb1-570c-41cb-81e1-2e220b89210a",
"metadata": {},
"source": [
"---\n",
"## Using Models With BigFrames API\n",
"\n",
"The model with modular preprocessing can be used directly with the [BigFrames API](https://cloud.google.com/python/docs/reference/bigframes/latest) as long as the inputs are first processed with the imputation and scaling preprocessors."
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "4ca40f92-bc3d-4caa-bb4e-55c2caffd172",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"model = bf.read_gbq_model(f'{BQ_PROJECT}.{BQ_DATASET}.modular_preprocessing')"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "c3b96359-ff61-4f27-aff4-cd5cb32a67a3",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'body_mass_g': -0.08333333333333333,\n",
" 'culmen_length_mm': 0.0,\n",
" 'culmen_depth_mm': 0.9281747443652044,\n",
" 'flipper_length_mm': -1.188450063929535,\n",
" 'sex': 'MALE',\n",
" 'species': 'Adelie Penguin (Pygoscelis adeliae)',\n",
" 'island': 'Dream'}"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scale_dict"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "f6611bfa-e3f8-4e9c-a81b-ce093e438fb4",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"Load job 8e9500f9-1d61-4938-9369-6931521f0e37 is DONE. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:8e9500f9-1d61-4938-9369-6931521f0e37&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Query job 1603d53e-dec5-4dbb-93a9-8edd2254b799 is DONE. 8 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:1603d53e-dec5-4dbb-93a9-8edd2254b799&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Query job ae131bb0-9597-4a7c-ad9b-3a5644467723 is DONE. 98 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:ae131bb0-9597-4a7c-ad9b-3a5644467723&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>body_mass_g</th>\n",
" <th>culmen_length_mm</th>\n",
" <th>culmen_depth_mm</th>\n",
" <th>flipper_length_mm</th>\n",
" <th>sex</th>\n",
" <th>species</th>\n",
" <th>island</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.083333</td>\n",
" <td>0.0</td>\n",
" <td>0.928175</td>\n",
" <td>-1.18845</td>\n",
" <td>MALE</td>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" <td>Dream</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 7 columns</p>\n",
"</div>[1 rows x 7 columns in total]"
],
"text/plain": [
" body_mass_g culmen_length_mm culmen_depth_mm flipper_length_mm sex \\\n",
"0 -0.083333 0.0 0.928175 -1.18845 MALE \n",
"\n",
" species island \n",
"0 Adelie Penguin (Pygoscelis adeliae) Dream \n",
"\n",
"[1 rows x 7 columns]"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_df = pd.Series(scale_dict).to_frame().T\n",
"test_bf = bf.read_pandas(test_df)\n",
"test_bf.head()"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "30598965-763d-4c19-96f5-4de709e58055",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"Query job 7ae74d28-11fc-48a6-ac80-036a9e036def is DONE. 31.6 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:7ae74d28-11fc-48a6-ac80-036a9e036def&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Query job 25e4677e-74cd-424c-884b-7ae37a649f24 is DONE. 8 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:25e4677e-74cd-424c-884b-7ae37a649f24&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Query job 3cb7a1a8-bfba-4683-bb88-543924854ad8 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:3cb7a1a8-bfba-4683-bb88-543924854ad8&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Query job 5a4aa2a7-43ce-49ff-894d-724173fd7967 is DONE. 45 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&j=bq:US:5a4aa2a7-43ce-49ff-894d-724173fd7967&page=queryresults\">Open Job</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>predicted_species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 1 columns</p>\n",
"</div>[1 rows x 1 columns in total]"
],
"text/plain": [
" predicted_species\n",
"0 Adelie Penguin (Pygoscelis adeliae)\n",
"\n",
"[1 rows x 1 columns]"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.predict(test_bf)"
]
},
{
"cell_type": "markdown",
"id": "dba3d1ad-3434-4ac4-8883-93afcce2c838",
"metadata": {},
"source": [
"---\n",
"## Export To GCS For Complete Portability\n",
"\n",
"Models trained in BigQuery ML with the `CREATE MODEL` statement can be [exported](https://cloud.google.com/bigquery/docs/exporting-models) for portability. This even includes a preprocessing model with the `TRANSFORM` statement is used during model creation."
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "03c8cf59-b6f6-4473-b3ad-f9d336e88f81",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"<Bucket: statmike-mlops-349915>"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gcs.lookup_bucket(GCS_BUCKET)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "36fe5e1c-4ddc-46ca-b73c-643b20904151",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1b3827828d3a4101bd1beb8190005459",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Query is running: 0%| |"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"EXPORT MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`\n",
" OPTIONS(URI = 'gs://statmike-mlops-349915/bqml/feature-engineering/models/embedded_preprocessing');\n",
"EXPORT MODEL `statmike-mlops-349915.bqml.modular_preprocessing`\n",
" OPTIONS(URI = 'gs://statmike-mlops-349915/bqml/feature-engineering/models/modular_preprocessing');\n",
"EXPORT MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`\n",
" OPTIONS(URI = 'gs://statmike-mlops-349915/bqml/feature-engineering/models/modular_preprocessing_impute');\n",
"EXPORT MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`\n",
" OPTIONS(URI = 'gs://statmike-mlops-349915/bqml/feature-engineering/models/modular_preprocessing_scale');"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "4cfb7ecd-077e-4b29-8e1f-f9c485dea86e",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bqml/feature-engineering/\n",
"bqml/feature-engineering/models/\n",
"bqml/feature-engineering/models/embedded_preprocessing/\n",
"bqml/feature-engineering/models/embedded_preprocessing/assets/0_categorical_label.txt\n",
"bqml/feature-engineering/models/embedded_preprocessing/assets/1_categorical_label.txt\n",
"bqml/feature-engineering/models/embedded_preprocessing/assets/model_metadata.json\n",
"bqml/feature-engineering/models/embedded_preprocessing/explanation_metadata.json\n",
"bqml/feature-engineering/models/embedded_preprocessing/main.py\n",
"bqml/feature-engineering/models/embedded_preprocessing/model.bst\n",
"bqml/feature-engineering/models/embedded_preprocessing/transform/\n",
"bqml/feature-engineering/models/embedded_preprocessing/transform/assets/\n",
"bqml/feature-engineering/models/embedded_preprocessing/transform/fingerprint.pb\n",
"bqml/feature-engineering/models/embedded_preprocessing/transform/saved_model.pb\n",
"bqml/feature-engineering/models/embedded_preprocessing/transform/variables/\n",
"bqml/feature-engineering/models/embedded_preprocessing/transform/variables/variables.data-00000-of-00001\n",
"bqml/feature-engineering/models/embedded_preprocessing/transform/variables/variables.index\n",
"bqml/feature-engineering/models/embedded_preprocessing/xgboost_predictor-0.1.tar.gz\n",
"bqml/feature-engineering/models/modular_preprocessing/assets/4_categorical_label.txt\n",
"bqml/feature-engineering/models/modular_preprocessing/assets/5_categorical_label.txt\n",
"bqml/feature-engineering/models/modular_preprocessing/assets/model_metadata.json\n",
"bqml/feature-engineering/models/modular_preprocessing/explanation_metadata.json\n",
"bqml/feature-engineering/models/modular_preprocessing/main.py\n",
"bqml/feature-engineering/models/modular_preprocessing/model.bst\n",
"bqml/feature-engineering/models/modular_preprocessing/xgboost_predictor-0.1.tar.gz\n",
"bqml/feature-engineering/models/modular_preprocessing_impute/\n",
"bqml/feature-engineering/models/modular_preprocessing_impute/transform/\n",
"bqml/feature-engineering/models/modular_preprocessing_impute/transform/assets/\n",
"bqml/feature-engineering/models/modular_preprocessing_impute/transform/fingerprint.pb\n",
"bqml/feature-engineering/models/modular_preprocessing_impute/transform/saved_model.pb\n",
"bqml/feature-engineering/models/modular_preprocessing_impute/transform/variables/\n",
"bqml/feature-engineering/models/modular_preprocessing_impute/transform/variables/variables.data-00000-of-00001\n",
"bqml/feature-engineering/models/modular_preprocessing_impute/transform/variables/variables.index\n",
"bqml/feature-engineering/models/modular_preprocessing_scale/\n",
"bqml/feature-engineering/models/modular_preprocessing_scale/transform/\n",
"bqml/feature-engineering/models/modular_preprocessing_scale/transform/assets/\n",
"bqml/feature-engineering/models/modular_preprocessing_scale/transform/fingerprint.pb\n",
"bqml/feature-engineering/models/modular_preprocessing_scale/transform/saved_model.pb\n",
"bqml/feature-engineering/models/modular_preprocessing_scale/transform/variables/\n",
"bqml/feature-engineering/models/modular_preprocessing_scale/transform/variables/variables.data-00000-of-00001\n",
"bqml/feature-engineering/models/modular_preprocessing_scale/transform/variables/variables.index\n"
]
}
],
"source": [
"for blob in list(gcs.bucket(GCS_BUCKET).list_blobs(prefix = 'bqml/feature-engineering/')):\n",
" print(blob.name)"
]
},
{
"cell_type": "markdown",
"id": "df68e3fb-1f82-4d12-a89a-a9a3df4f41e7",
"metadata": {},
"source": [
"---\n",
"## Online Serving With Vertex AI\n",
"\n",
"For online serving the export to GCS can be bypassed and instead [register the BigQuery ML model](https://cloud.google.com/bigquery/docs/create_vertex) directly to [Vertex AI Model Registry](https://cloud.google.com/vertex-ai/docs/model-registry/introduction) using the training option `MODEL_REGISTRY = 'VERTEX_AI'`. This was already included in model created above and the cells that follow us the Python SDK for Vertex AI to review the model registry and [deploy the models](https://cloud.google.com/vertex-ai/docs/general/deployment) to a Vertex AI Prediction Endpoint for online serving."
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "8518a840-fb98-4530-811b-e3ed3fef3a16",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"## vertex ai client\n",
"aiplatform.init(project = PROJECT_ID, location = REGION)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "5ec2bc65-45d3-4efc-af92-2f965f13df13",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bqml_modular_preprocessing\n",
"bqml_modular_preprocessing_scale\n",
"bqml_modular_preprocessing_impute\n",
"bqml_embedded_preprocessing\n"
]
}
],
"source": [
"for model in aiplatform.Model.list():\n",
" if model.name.startswith('bqml_modular'): print(model.name)\n",
" if model.name.startswith('bqml_embedded'): print(model.name)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "40ff3f46-ccbf-4651-b3b7-b2c23f5b59c1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"vertex_model = aiplatform.Model(model_name = 'bqml_embedded_preprocessing')"
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "08dfb717-b679-42c8-9fcb-510b8eaf8bbc",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Creating Endpoint\n",
"Create Endpoint backing LRO: projects/1026793852137/locations/us-central1/endpoints/1328405759423152128/operations/8441581267635404800\n",
"Endpoint created. Resource name: projects/1026793852137/locations/us-central1/endpoints/1328405759423152128\n",
"To use this Endpoint in another session:\n",
"endpoint = aiplatform.Endpoint('projects/1026793852137/locations/us-central1/endpoints/1328405759423152128')\n"
]
}
],
"source": [
"endpoint = aiplatform.Endpoint.create(display_name = 'bqml_embedded_preprocessing')"
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "8d41094f-009c-4ba5-9f4f-ce6c8686e9f1",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Deploying model to Endpoint : projects/1026793852137/locations/us-central1/endpoints/1328405759423152128\n",
"Using default machine_type: n1-standard-2\n",
"Deploy Endpoint model backing LRO: projects/1026793852137/locations/us-central1/endpoints/1328405759423152128/operations/3032758115163439104\n",
"Endpoint model deployed. Resource name: projects/1026793852137/locations/us-central1/endpoints/1328405759423152128\n"
]
}
],
"source": [
"endpoint = vertex_model.deploy(endpoint = endpoint)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "af3641f4-1443-4fad-9e20-d3cf63996fed",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'sex': 'MALE',\n",
" 'body_mass_g': 3900.0,\n",
" 'culmen_length_mm': 43.8348,\n",
" 'culmen_depth_mm': 18.9,\n",
" 'flipper_length_mm': 184.0,\n",
" 'species': 'Adelie Penguin (Pygoscelis adeliae)',\n",
" 'island': 'Dream'}"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"impute_dict"
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "301ab3bd-f5bd-4dac-8225-b2a7a5924abe",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# remove the known value that the model will predict:\n",
"del impute_dict['species']"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "b94d2453-2a20-4eea-bbc8-7a4b635e9d2c",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[{'predicted_species': 'Adelie Penguin (Pygoscelis adeliae)',\n",
" 'species_probs': [0.03061437793076038,\n",
" 0.3899675905704498,\n",
" 0.5794180035591125],\n",
" 'species_values': ['Gentoo penguin (Pygoscelis papua)',\n",
" 'Chinstrap penguin (Pygoscelis antarctica)',\n",
" 'Adelie Penguin (Pygoscelis adeliae)']}]"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"endpoint.predict(instances = [impute_dict]).predictions"
]
},
{
"cell_type": "markdown",
"id": "27c64404-e00a-4dc8-b3c8-a2f2c773e9dc",
"metadata": {},
"source": [
"---\n",
"## Review Objects in GCP Console: BigQuery Models, GCS Exports, Vertex AI Models, Vertex AI Endpoints\n",
"\n",
"While the models above were created using SQL that ran in BigQuery via coding methods from this notebook, it is possible to also still review each object directly in the GCP console. This section includes direct links to the object created above as well as screen captures of what the object look like."
]
},
{
"cell_type": "markdown",
"id": "cc4596fb-265e-4834-abfa-714002dff5a1",
"metadata": {},
"source": [
"### Console: BigQuery Models"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "0054e862-0602-4d23-b658-3e5111ddb7a2",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Direct link to the model in BigQuery:\n",
"https://console.cloud.google.com/bigquery?project=statmike-mlops-349915&ws=!1m5!1m4!5m3!1sstatmike-mlops-349915!2sbqml!3sembedded_preprocessing\n"
]
}
],
"source": [
"print(f'Direct link to the model in BigQuery:\\nhttps://console.cloud.google.com/bigquery?project={PROJECT_ID}&ws=!1m5!1m4!5m3!1s{BQ_PROJECT}!2s{BQ_DATASET}!3sembedded_preprocessing')"
]
},
{
"cell_type": "markdown",
"id": "eb359b14-ed57-4a93-8322-4d8fbbafa368",
"metadata": {},
"source": [
"### Console: GCS Model Files"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "2b462613-7f27-41dc-a293-163acea6e659",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Direct link to exported model files in GCS:\n",
"https://console.cloud.google.com/storage/browser/statmike-mlops-349915/bqml/feature-engineering/models?project=statmike-mlops-349915\n"
]
}
],
"source": [
"print(f'Direct link to exported model files in GCS:\\nhttps://console.cloud.google.com/storage/browser/{GCS_BUCKET}/bqml/feature-engineering/models?project={PROJECT_ID}')"
]
},
{
"cell_type": "markdown",
"id": "7e1e6899-3bb9-469e-9deb-2a229e6683cc",
"metadata": {},
"source": [
"### Console: Vertex AI Model Registry"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "5da5395b-166e-428f-9053-31740daba24b",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Direct link to Vertex AI Model Registry:\n",
"https://console.cloud.google.com/vertex-ai/models?project=statmike-mlops-349915\n"
]
}
],
"source": [
"print(f'Direct link to Vertex AI Model Registry:\\nhttps://console.cloud.google.com/vertex-ai/models?project={PROJECT_ID}')"
]
},
{
"cell_type": "markdown",
"id": "f4de0b66-c4ae-4799-9193-f444ec87e103",
"metadata": {},
"source": [
"### Console: Vertex AI Endpoint"
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "34d7309c-8edb-4649-a8ae-beee82ad9fa0",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Direct link to Vertex AI Online Prediction Endpoints:\n",
"https://console.cloud.google.com/vertex-ai/locations/us-central1/endpoints/1328405759423152128?project=statmike-mlops-349915\n"
]
}
],
"source": [
"print(f'Direct link to Vertex AI Online Prediction Endpoints:\\nhttps://console.cloud.google.com/vertex-ai/locations/{REGION}/endpoints/{endpoint.name}?project={PROJECT_ID}')"
]
},
{
"cell_type": "markdown",
"id": "d6a5ffbf-d5fd-47f7-bd24-d8952a353014",
"metadata": {},
"source": [
"---\n",
"## Remove Resources Created In This Notebook\n",
"\n",
"To avoid any ongoing cost form storage or compute (Vertex AI Prediction Endpoint), the follow cells are provided to remove all resources created in this notebook.\n",
"\n",
"- Dataset In BigQuery\n",
"- Model Objects In BigQuery\n",
"- Model Exports in GCS\n",
"- Endpoints In Vertex AI"
]
},
{
"cell_type": "code",
"execution_count": 74,
"id": "2321a560-6607-46d1-bb56-7c951f9fb989",
"metadata": {},
"outputs": [],
"source": [
"# to remove resources change this flag to True:\n",
"remove = False"
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "596d7531-d0df-41fd-b84e-191e4bae4feb",
"metadata": {},
"outputs": [],
"source": [
"# remove Vertex AI Online Prediction Endpoint\n",
"if remove:\n",
" endpoint.delete(force = True)"
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "ab741d94-e3d0-41a4-9ef2-ea90c0dc45e7",
"metadata": {},
"outputs": [],
"source": [
"# delete models created in BigQuery with BQML - note this also removes them from Vertex AI Model Registry\n",
"if remove:\n",
" for model in list(bq.list_models(dataset = ds)):\n",
" if model.model_id in ['embedded_preprocessing', 'modular_preprocessing_impute', 'modular_preprocessing_scale', 'modular_preprocessing']:\n",
" print('deleting model: ', model.model_id)\n",
" bq.delete_model(model)"
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "590e1fcf-4979-43bb-b7f3-40664e9810e5",
"metadata": {},
"outputs": [],
"source": [
"# delete BigQuery table(s) created above\n",
"if remove:\n",
" bq.delete_table(f'{BQ_PROJECT}.{BQ_DATASET}.feature-engineering-source')"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "8b15e94a-1ae8-459f-bbbe-72eb221ab859",
"metadata": {},
"outputs": [],
"source": [
"# delete BigQuery dataset - fails if other content was in the dataset than what this notebook created\n",
"if remove:\n",
" bq.delete_dataset(dataset = ds, delete_contents = False)"
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "080bb6c2-8705-444e-89fe-f21b30dbf75d",
"metadata": {},
"outputs": [],
"source": [
"# delete model files exported to GCS\n",
"if remove:\n",
" bucket = gcs.bucket(GCS_BUCKET)\n",
" bucket.delete_blobs(blobs = list(bucket.list_blobs(prefix = 'bqml/feature-engineering/')))"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-root-py",
"name": "workbench-notebooks.m113",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/workbench-notebooks:m113"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}