notebooks/quickstart/iceberg/iceberg_quickstart.ipynb (365 lines of code) (raw):
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# BigQuery tables for Apache Iceberg\n",
"\n",
"<table align=\"left\">\n",
"\n",
" <td>\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/quickstart/iceberg/iceberg_quickstart.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
" </a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/quickstart/iceberg/iceberg_quickstart.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
" View on GitHub\n",
" </a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/ai-ml-recipes/main/notebooks/quickstart/iceberg/iceberg_quickstart.ipynb\">\n",
" <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n",
" Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/quickstart/iceberg/iceberg_quickstart.ipynb\">\n",
" <img src=\"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s\" alt=\"BQ logo\" width=\"35\">\n",
" Open in BQ Studio\n",
" </a>\n",
" </td>\n",
"\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"This notebook shows how to create a BigQuery managed table for Apache Iceberg and load data from a dataset in .parquet format into this table.\n",
"\n",
"#### **Steps**\n",
"Using BigQuery SQL,\n",
"1) It creates a BigQuery dataset\n",
"2) It creates a BigQuery table for Apache Iceberg with a specific schema\n",
"3) It loads data from any format to this managed table\n",
"4) It queries the table and shows the table ddl\n",
"\n",
"#### **References**\n",
"- https://cloud.google.com/bigquery/docs/iceberg-tables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade google-cloud-bigquery google-cloud-bigquery-connection bigframes tqdm bigquery-magics -q"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext bigquery_magics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import bigframes.pandas as bpd\n",
"import bigframes.bigquery as bbq\n",
"bpd.options.display.progress_bar = None\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROJECT_ID = \"<PROJECT_ID>\"\n",
"REGION = \"<REGION>\" \n",
"\n",
"PARQUET_FILES_GCS_URI = \"gs://dataproc-metastore-public-binaries/real_estate_sales/*\" # Input dataset (public)\n",
"ICEBERG_BQ_MANAGED_TABLE_GCS_URI = \"gs://GCS_PATH\" # Path to store the iceberg dataset\n",
"\n",
"DATASET_ID = \"<DATASET_ID>\"\n",
"TABLE_NAME = \"<TABLE_NAME>\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bpd.options.bigquery.project = PROJECT_ID\n",
"bpd.options.bigquery.location = REGION\n",
"bpd.options.display.progress_bar = None"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets first take a look at the original dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = bpd.read_parquet(f\"{PARQUET_FILES_GCS_URI[:-1]}\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create BigQuery dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = f\"\"\"\n",
"CREATE SCHEMA IF NOT EXISTS `{PROJECT_ID}`.{DATASET_ID} OPTIONS ( location = '{REGION}' );\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bigquery --project {PROJECT_ID}\n",
"$query"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a BigQuery managed Iceberg table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = f\"\"\"\n",
"CREATE OR REPLACE TABLE `{PROJECT_ID}`.{DATASET_ID}.{TABLE_NAME} (\n",
" serial_number INTEGER,\n",
" list_year INTEGER,\n",
" date_recorded STRING,\n",
" town STRING,\n",
" address STRING,\n",
" assessed_value FLOAT64,\n",
" sale_amount FLOAT64,\n",
" sales_ratio FLOAT64,\n",
" property_type STRING,\n",
" residential_type STRING,\n",
" non_use_code STRING,\n",
" assessor_remarks STRING,\n",
" opm_remarks STRING,\n",
" longitude FLOAT64,\n",
" latitude FLOAT64\n",
")\n",
"CLUSTER BY town\n",
"WITH CONNECTION DEFAULT\n",
"OPTIONS (\n",
"file_format = 'PARQUET',\n",
"table_format = 'ICEBERG',\n",
"storage_uri = '{ICEBERG_BQ_MANAGED_TABLE_GCS_URI}');\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bigquery --project {PROJECT_ID}\n",
"$query"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Import data into Iceberg table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now lets import it to this managed Iceberg table in BigQuery"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = f\"\"\"\n",
"LOAD DATA INTO `{PROJECT_ID}`.{DATASET_ID}.{TABLE_NAME}\n",
"FROM FILES (\n",
" uris=['{PARQUET_FILES_GCS_URI}'],\n",
" format='PARQUET');\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bigquery --project {PROJECT_ID}\n",
"$query"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Query the Iceberg table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now query the data from this BigQuery table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = f\"\"\"\n",
"SELECT * FROM `{PROJECT_ID}`.{DATASET_ID}.{TABLE_NAME} LIMIT 5;\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bigquery --project {PROJECT_ID}\n",
"$query"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we take a look at the BigQuery Information Schema, we can see the Data Definition Language for this table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = f\"\"\"\n",
"SELECT\n",
" ddl\n",
"FROM\n",
" {DATASET_ID}.INFORMATION_SCHEMA.TABLES;\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bigquery ddl --project {PROJECT_ID} \n",
"$query"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddl.values[0][0])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}