modules/data_warehouse/templates/notebooks/getting_started_bq_dataframes.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "JAPoU8Sm5E6e" }, "source": [ "# Get started with BigQuery DataFrames\n", "\n", "<table align=\"left\">\n", "\n", " <td>\n", " <a href=\"https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n", " Open in Vertex AI Workbench\n", " </a>\n", " </td> \n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "24743cf4a1e1" }, "source": [ "**_NOTE_**: This notebook has been tested in the following environment:\n", "\n", "* Python version = 3.10" ] }, { "cell_type": "markdown", "metadata": { "id": "tvgnzT1CKxrO" }, "source": [ "## Overview\n", "\n", "Use this notebook to get started with BigQuery DataFrames, including setup, installation, and basic tutorials.\n", "\n", "BigQuery DataFrames provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine.\n", "\n", "* `bigframes.pandas` provides a pandas-like API for analytics.\n", "* `bigframes.ml` provides a scikit-learn-like API for ML.\n", "\n", "Learn more about [BigQuery DataFrames](https://cloud.google.com/python/docs/reference/bigframes/latest)." 
] }, { "cell_type": "markdown", "metadata": { "id": "d975e698c9a4" }, "source": [ "### Objective\n", "\n", "In this tutorial, you learn how to install BigQuery DataFrames, load data into a BigQuery DataFrames DataFrame, and inspect and manipulate the data using pandas and a custom Python function, running at BigQuery scale.\n", "\n", "The steps include:\n", "\n", "- Creating a BigQuery DataFrames DataFrame: Access data from a Parquet file stored in GCS to create a BigQuery DataFrames DataFrame.\n", "- Inspecting and manipulating data: Use pandas to perform data cleaning and preparation on the DataFrame.\n", "- Deploying a custom function: Deploy a [remote function ](https://cloud.google.com/bigquery/docs/remote-functions)that runs a scalar Python function at BigQuery scale." ] }, { "cell_type": "markdown", "metadata": { "id": "08d289fa873f" }, "source": [ "### Dataset\n", "\n", "This tutorial uses the Jump Start Solution dataset (```thelook```), which contains syntheic retail sales data for a fictional eCommerce clothing retailer.\n", "\n", "The same dataset is also deployed to a Cloud Storage bucket in your project as Parquet files so that you can use it to try ingesting data from a local environment." ] }, { "cell_type": "markdown", "metadata": { "id": "aed92deeb4a0" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* BigQuery (storage and compute)\n", "* Cloud Functions\n", "\n", "Learn about [BigQuery storage pricing](https://cloud.google.com/bigquery/pricing#storage),\n", "[BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),\n", "and [Cloud Functions pricing](https://cloud.google.com/functions/pricing),\n", "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "BF1j6f9HApxa" }, "source": [ "## Before you begin\n", "\n", "Complete the tasks in this section to set up your environment." ] }, { "cell_type": "markdown", "metadata": { "id": "WReHDGG5g0XY" }, "source": [ "#### Set your project ID\n", "\n", "If you don't know your project ID, try the following:\n", "* Run `gcloud config list`.\n", "* Run `gcloud projects list`.\n", "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "executionInfo": { "elapsed": 19, "status": "ok", "timestamp": 1705950462905, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "oM1iC_MfAts1" }, "outputs": [], "source": [ "PROJECT_ID = ${PROJECT_ID}" ] }, { "cell_type": "markdown", "metadata": { "id": "region" }, "source": [ "#### Set the region\n", "\n", "You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "executionInfo": { "elapsed": 1632, "status": "ok", "timestamp": 1705950681580, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "eF-Twtc4XGem" }, "outputs": [], "source": [ "REGION = ${REGION}" ] }, { "cell_type": "markdown", "metadata": { "id": "960505627ddf" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "executionInfo": { "elapsed": 17, "status": "ok", "timestamp": 1705950696793, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "PyQmSRbKA8r-" }, "outputs": [], "source": [ "import bigframes.pandas as bf" ] }, { "cell_type": "markdown", "metadata": { "id": "init_aip:mbsdk,all" }, "source": [ "\n", "### Set BigQuery DataFrames options" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "executionInfo": { "elapsed": 16, "status": "ok", "timestamp": 1705950704127, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "NPPMuw2PXGeo" }, "outputs": [], "source": [ "bf.options.bigquery.project = PROJECT_ID\n", "bf.options.bigquery.location = REGION" ] }, { "cell_type": "markdown", "metadata": { "id": "pDfrKwMKE_dK" }, "source": [ "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bf.close_session()`. After that, you can reuse `bf.options.bigquery.location` to specify another location." ] }, { "cell_type": "markdown", "metadata": { "id": "-19Uiwoo9pP4" }, "source": [ "## See the power of BigQuery DataFrames first-hand\n", "\n", "BigQuery DataFrames enables you to interact with datasets of any size, so that you can explore, transform, and understand even your biggest datasets using familiar tools like pandas and scikit-learn." ] }, { "cell_type": "markdown", "metadata": { "id": "KMX4D2uoBwM0" }, "source": [ "For example, take the BigQuery sample table `bigquery-samples.wikipedia_pageviews.200809h`, which is ~60 GB is size. This is not a dataset you'd likely be able process in pandas without extra infrastructure.\n", "\n", "With BigQuery DataFrames, however, computation is handled by BigQuery's highly scalable compute engine, meaning you can focus on doing data science without hitting size limitations." ] }, { "cell_type": "markdown", "metadata": { "id": "i98c46p1CXoV" }, "source": [ "If you'd like to try creating a BigQuery DataFrames DataFrame from this table, uncomment and run the next cell to load the table using the `read_gbq` method.\n", "\n", "> Note: Keep in mind that running these operations will count against your monthly [free tier allowance in BigQuery](https://cloud.google.com/bigquery/pricing#free-tier)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Vyex9BQI-BNa" }, "outputs": [], "source": [ "# bq_df_sample = bf.read_gbq(\"bigquery-samples.wikipedia_pageviews.200809h\")" ] }, { "cell_type": "markdown", "metadata": { "id": "gE6CEALjDZZV" }, "source": [ "No problem! BigQuery DataFrames makes a DataFrame, `bq_df_sample`, containing the entirety of the source table of data." ] }, { "cell_type": "markdown", "metadata": { "id": "T6lAIeelDwLz" }, "source": [ "Uncomment and run the following cell to see pandas in action over your new BigQuery DataFrames DataFrame.\n", "\n", "This code uses regex to filter the DataFrame to include only rows with Wikipedia page titles containing the word \"Google\", sums the total views by page title, and then returns the top 100 results." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XfGq5apK-D_e" }, "outputs": [], "source": [ "# bq_df_sample[bq_df_sample.title.str.contains(r\"[Gg]oogle\")]\\\n", "# .groupby(['title'], as_index=False)['views'].sum(numeric_only=True)\\\n", "# .sort_values('views', ascending=False)\\\n", "# .head(100)" ] }, { "cell_type": "markdown", "metadata": { "id": "i6XV-HTN-IFF" }, "source": [ "In addition to giving you access to pandas, BigQuery DataFrames also enables you to build ML models, run inference, and deploy and run your own Python functions at scale. You'll see examples throughout this notebook.\n", "\n", "Now you'll move to the JSS dataset (`thelook`) for the remainder of this getting started guide." ] }, { "cell_type": "markdown", "metadata": { "id": "9EMAqR37AfLS" }, "source": [ "## Create a BigQuery DataFrames DataFrame\n", "\n", "You can create a BigQuery DataFrames DataFrame by reading data from any of the following locations:\n", "\n", "* A data file stored in Cloud Storage\n", "* Data stored in a BigQuery table\n", "* A local data file\n", "* An in-memory pandas DataFrame\n", "\n", "The following sections show how to use the first two options." ] }, { "cell_type": "markdown", "metadata": { "id": "iZDjzglh9eWZ" }, "source": [ "### Create a DataFrame from a GCS file\n", "\n", "Use these instructions in the following sections to create a BigQuery DataFrames DataFrame from a file stored in Google Cloud Storage.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8Jry3NoFv3Wm" }, "source": [ "#### Define the file path\n", "\n", "First, follow the steps below to define the Parquet file URI." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1705951374283, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "SvyXzkRl783u" }, "outputs": [], "source": [ "fn_order_items = \"${GCS_BUCKET_URI}/thelook-ecommerce/order_items.parquet\"" ] }, { "cell_type": "markdown", "metadata": { "id": "sJsrwAQY_H6g" }, "source": [ "#### Create a DataFrame\n", "\n", "Create a BigQuery DataFrames DataFrame from the parquet file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EDAaIwHpQCDZ" }, "outputs": [], "source": [ "df_from_gcs = bf.read_parquet(path=fn_order_items)" ] }, { "cell_type": "markdown", "metadata": { "id": "U-RVfNCu_h_h" }, "source": [ "Take a look at the first few rows of the `orders` DataFrame that was just created:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_gPD0Zn1Stdb" }, "outputs": [], "source": [ "df_from_gcs.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "rK0lNJmz_xkA" }, "source": [ "### Ingest data from a DataFrame to a BigQuery table\n", "\n", "BigQuery DataFrames lets you create a BigQuery table from a BigQuery DataFrames DataFrame on-the-fly." ] }, { "cell_type": "markdown", "metadata": { "id": "V1DWpmSCAEql" }, "source": [ "First, create a BigQuery dataset to house the table." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2753, "status": "ok", "timestamp": 1705951974404, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "ZSP7gt13QrQt", "outputId": "f77acb4b-6c1b-4790-990f-2278821a52b4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset thelook_bigframes created.\n" ] } ], "source": [ "from google.cloud import bigquery\n", "DATASET_ID = \"thelook_bigframes\"\n", "\n", "client = bigquery.Client(project=PROJECT_ID)\n", "dataset = bigquery.Dataset(PROJECT_ID + \".\" + DATASET_ID)\n", "client.delete_dataset(DATASET_ID, delete_contents=True, not_found_ok=True)\n", "dataset.location = REGION\n", "dataset = client.create_dataset(dataset)\n", "print(f\"Dataset {dataset.dataset_id} created.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Jd0dFISwAPPa" }, "source": [ "Next, use the `to_gbq` method to create a BigQuery table from the DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oP1NIAmUBjop" }, "outputs": [], "source": [ "df_from_gcs.to_gbq(PROJECT_ID + \".\" + DATASET_ID + \".order_items\")" ] }, { "cell_type": "markdown", "metadata": { "id": "kfF6fnmmAZEK" }, "source": [ "### Create a DataFrame from BigQuery data\n", "You can create a BigQuery DataFrames DataFrame from a BigQuery table by using the `read_gbq` method and referencing either an entire table or a SQL query." ] }, { "cell_type": "markdown", "metadata": { "id": "TEy5jHJDD6hx" }, "source": [ "Create a BigQuery DataFrames DataFrame from the BigQuery table you created in the previous section, and view a few rows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IBuo-d6dWfsA" }, "outputs": [], "source": [ "query_or_table = f\"{PROJECT_ID}.{DATASET_ID}.order_items\"\n", "bq_df = bf.read_gbq(query_or_table)\n", "bq_df[\"order_id\"] = bq_df[\"order_id\"].astype(\"string\")\n", "bq_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "rwPLjqW2Ajzh" }, "source": [ "## Inspect and manipulate data in BigQuery DataFrames" ] }, { "cell_type": "markdown", "metadata": { "id": "bExmYlL_ELtV" }, "source": [ "### Using pandas\n", "\n", "You can use pandas as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment. There are 150+ pandas functions supported in BigQuery DataFrames. You can view the list in [the documentation](https://cloud.google.com/python/docs/reference/bigframes/latest)." 
] }, { "cell_type": "markdown", "metadata": { "id": "ZHFUc3Q_FHc1" }, "source": [ "To see this in action, inspect one of the columns (or series) of the BigQuery DataFrames DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6i6HkFJZa8na" }, "outputs": [], "source": [ "bq_df[\"sale_price\"].head(10)" ] }, { "cell_type": "markdown", "metadata": { "id": "EJIZJaNXFQzh" }, "source": [ "Compute the sum of this series to find the total reveneue from all items sold:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YKwCW7Nsavap" }, "outputs": [], "source": [ "total_sales = bq_df[\"sale_price\"].sum()\n", "print(f\"total_sales: {total_sales}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "DSs1cnca-MOU" }, "source": [ "Calculate the mean `sales_price` of all items in an order using the `groupby` operation to group by `order_id`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4PyKMR61-Mjy" }, "outputs": [], "source": [ "bq_df[[\"order_id\", \"sale_price\"]].groupby(\n", " by=bq_df[\"order_id\"]).mean(numeric_only=True).head()" ] }, { "cell_type": "markdown", "metadata": { "id": "6sf9kZ2C9Ixe" }, "source": [ "You can confirm that the calculations were run in BigQuery by clicking \"Open job\" from the previous cells' output. This takes you to the BigQuery console to view the SQL statement and job details." ] }, { "cell_type": "markdown", "metadata": { "id": "cWVNZ8D_FUtT" }, "source": [ "### Using custom functions\n", "\n", "Running your own Python functions (or being able to bring your packages) and using them at scale is a challenge many data scientists face. BigQuery DataFrames makes it easy to deploy [remote functions](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function) that run scalar Python functions at BigQuery scale. These functions are persisted as [BigQuery remote functions](https://cloud.google.com/bigquery/docs/remote-functions) that you can then re-use." ] }, { "cell_type": "markdown", "metadata": { "id": "zjw8toUbHuRD" }, "source": [ "Running the cell below creates a custom function using the `remote_function` method. This function categorizes a value into one of two buckets: >= 50 or <50.\n", "\n", "> Note: Creating a function requires a [BigQuery connection](https://cloud.google.com/bigquery/docs/remote-functions#create_a_remote_function). This code assumes a pre-created connection named `bigframes-default-connection`. If\n", "the connection is not already created, BigQuery DataFrames attempts to create one assuming the [necessary APIs\n", "and IAM permissions](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function) are set up in the project.\n", "\n", "This cell takes a few minutes to run because it creates the BigQuery connection (if applicable) and deploys the Cloud Function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rSWTOG-vb2Fc" }, "outputs": [], "source": [ "@bf.remote_function([float], str)\n", "def get_bucket(num):\n", " if not num:\n", " return \"NA\"\n", " boundary = 50\n", " return \"at_or_above_50\" if num >= boundary else \"below_50\"" ] }, { "cell_type": "markdown", "metadata": { "id": "N7JH0BI5IOpK" }, "source": [ "The custom function is deployed as a Cloud Function, and is then integrated with BigQuery as a remote function.\n", "\n", "Save both of the function names so that you can clean them up at the end of this notebook." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6ejPXoyEQpWE" }, "outputs": [], "source": [ "CLOUD_FUNCTION_NAME = format(get_bucket.bigframes_cloud_function)\n", "print(\"Cloud Function Name \" + CLOUD_FUNCTION_NAME)\n", "REMOTE_FUNCTION_NAME = format(get_bucket.bigframes_remote_function)\n", "print(\"Remote Function Name \" + REMOTE_FUNCTION_NAME)" ] }, { "cell_type": "markdown", "metadata": { "id": "vHV3JqKjJHsH" }, "source": [ "Apply the custom function to the BigQuery DataFrames DataFrame to bucketize the `sale_price` value of each item sold:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NxSd9WZFcIji" }, "outputs": [], "source": [ "bq_df = bq_df.assign(order_price_bucket=bq_df[\"sale_price\"].groupby(\n", " by=bq_df[\"order_id\"]).sum().apply(get_bucket))\n", "bq_df[[\"order_id\", \"order_price_bucket\"]].head(10)" ] }, { "cell_type": "markdown", "metadata": { "id": "wCsmt0IwFkDy" }, "source": [ "## Summary and next steps\n", "\n", "You've created BigQuery DataFrames DataFrames, and inspected and manipulated data with pandas and custom remote functions at BigQuery scale and speed.\n", "\n", "Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks), including an introductory notebook for `bigframes.ml`." ] }, { "cell_type": "markdown", "metadata": { "id": "TpV-iwP9qw9c" }, "source": [ "### Cleaning up\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sx_vKniMq9ZX" }, "outputs": [], "source": [ "# # Delete the BigQuery dataset\n", "# from google.cloud import bigquery\n", "# client = bigquery.Client(project=PROJECT_ID)\n", "# client.delete_dataset(\n", "# DATASET_ID, delete_contents=True, not_found_ok=True\n", "# )\n", "# print(\"Deleted dataset '{}'.\".format(DATASET_ID))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_dTCXvCxtPw9" }, "outputs": [], "source": [ "# # Delete the BigQuery Connection\n", "# from google.cloud import bigquery_connection_v1 as bq_connection\n", "# client = bq_connection.ConnectionServiceClient()\n", "# CONNECTION_ID = f\"projects/{PROJECT_ID}/locations/{REGION}/connections/bigframes-default-connection\"\n", "# client.delete_connection(name=CONNECTION_ID)\n", "# print(\"Deleted connection '{}'.\".format(CONNECTION_ID))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EDAIIfcpwNOF" }, "outputs": [], "source": [ "# # Delete the Cloud Function\n", "# ! gcloud functions delete {CLOUD_FUNCTION_NAME} --quiet" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QwumLUKmVpuH" }, "outputs": [], "source": [ "# # Delete the Remote Function\n", "# REMOTE_FUNCTION_NAME = REMOTE_FUNCTION_NAME.replace(PROJECT_ID + \".\", \"\")\n", "# ! bq rm --routine --force=true {REMOTE_FUNCTION_NAME}" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }