notebooks/quickstart/iceberg/iceberg_quickstart.ipynb (365 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# BigQuery tables for Apache Iceberg\n", "\n", "<table align=\"left\">\n", "\n", " <td>\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/quickstart/iceberg/iceberg_quickstart.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/quickstart/iceberg/iceberg_quickstart.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/ai-ml-recipes/main/notebooks/quickstart/iceberg/iceberg_quickstart.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n", " Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/quickstart/iceberg/iceberg_quickstart.ipynb\">\n", " <img src=\"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s\" alt=\"BQ logo\" width=\"35\">\n", " Open in BQ Studio\n", " </a>\n", " </td>\n", "\n", "</table>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "This notebook shows how to create a BigQuery managed table for Apache Iceberg and load data from a dataset in .parquet format into this table.\n", "\n", "#### **Steps**\n", "Using BigQuery SQL,\n", "1) It creates a BigQuery dataset\n", "2) It creates a BigQuery table for Apache Iceberg with a specific schema\n", "3) It loads data from any format to this managed table\n", "4) It queries the table and shows the table ddl\n", "\n", "#### **References**\n", "- https://cloud.google.com/bigquery/docs/iceberg-tables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install --upgrade google-cloud-bigquery google-cloud-bigquery-connection bigframes tqdm bigquery-magics -q" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext bigquery_magics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import bigframes.pandas as bpd\n", "import bigframes.bigquery as bbq\n", "bpd.options.display.progress_bar = None\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PROJECT_ID = \"<PROJECT_ID>\"\n", "REGION = \"<REGION>\" \n", "\n", "PARQUET_FILES_GCS_URI = \"gs://dataproc-metastore-public-binaries/real_estate_sales/*\" # Input dataset (public)\n", "ICEBERG_BQ_MANAGED_TABLE_GCS_URI = \"gs://GCS_PATH\" # Path to store the iceberg dataset\n", "\n", "DATASET_ID = \"<DATASET_ID>\"\n", "TABLE_NAME = \"<TABLE_NAME>\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bpd.options.bigquery.project = PROJECT_ID\n", "bpd.options.bigquery.location = REGION\n", "bpd.options.display.progress_bar = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets first take a look at the original dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = bpd.read_parquet(f\"{PARQUET_FILES_GCS_URI[:-1]}\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create BigQuery dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = f\"\"\"\n", "CREATE SCHEMA IF NOT EXISTS `{PROJECT_ID}`.{DATASET_ID} OPTIONS ( location = '{REGION}' );\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project {PROJECT_ID}\n", "$query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create a BigQuery managed Iceberg table" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = f\"\"\"\n", "CREATE OR REPLACE TABLE `{PROJECT_ID}`.{DATASET_ID}.{TABLE_NAME} (\n", " serial_number INTEGER,\n", " list_year INTEGER,\n", " date_recorded STRING,\n", " town STRING,\n", " address STRING,\n", " assessed_value FLOAT64,\n", " sale_amount FLOAT64,\n", " sales_ratio FLOAT64,\n", " property_type STRING,\n", " residential_type STRING,\n", " non_use_code STRING,\n", " assessor_remarks STRING,\n", " opm_remarks STRING,\n", " longitude FLOAT64,\n", " latitude FLOAT64\n", ")\n", "CLUSTER BY town\n", "WITH CONNECTION DEFAULT\n", "OPTIONS (\n", "file_format = 'PARQUET',\n", "table_format = 'ICEBERG',\n", "storage_uri = '{ICEBERG_BQ_MANAGED_TABLE_GCS_URI}');\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project {PROJECT_ID}\n", "$query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import data into Iceberg table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now lets import it to this managed Iceberg table in BigQuery" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = f\"\"\"\n", "LOAD DATA INTO `{PROJECT_ID}`.{DATASET_ID}.{TABLE_NAME}\n", "FROM FILES (\n", " uris=['{PARQUET_FILES_GCS_URI}'],\n", " format='PARQUET');\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project {PROJECT_ID}\n", "$query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Query the Iceberg table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now query the data from this BigQuery table" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = f\"\"\"\n", "SELECT * FROM `{PROJECT_ID}`.{DATASET_ID}.{TABLE_NAME} LIMIT 5;\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project {PROJECT_ID}\n", "$query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we take a look at the BigQuery Information Schema, we can see the Data Definition Language for this table" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = f\"\"\"\n", "SELECT\n", " ddl\n", "FROM\n", " {DATASET_ID}.INFORMATION_SCHEMA.TABLES;\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery ddl --project {PROJECT_ID} \n", "$query" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(ddl.values[0][0])" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }