{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"display: flex; align-items: left;\">\n",
" <a href=\"https://sites.google.com/corp/google.com/genai-solutions/home?authuser=0\">\n",
" <img src=\"../utilities/imgs/aaie.png\" style=\"margin-right\">\n",
" </a>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "copyright"
},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DRyGcAepAPJ5"
},
"source": [
"\n",
"# **Open Data QnA: Set up BigQuery Source**\n",
"\n",
"---\n",
"\n",
"This notebook shows how to copy a BigQuery public dataset to your GCP project.\n",
"\n",
"\n",
"This is accomplished through the following three steps:\n",
"> i. Create a BigQuery dataset in your GCP project\n",
"\n",
"> ii. Create tables in the above dataset\n",
"\n",
"> iii. Copy data from the public dataset into the dataset in your project\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Change your kernel to the `.venv` environment created with Poetry (see README.md)**\n",
"\n",
"Make sure the notebook kernel is set to that virtual environment before you proceed.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🔗 **1. Connect Your Google Cloud Project**\n",
"Time to connect your Google Cloud Project to this notebook. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Updated property [core/project].\n",
"Project has been set to three-p-o\n"
]
}
],
"source": [
"#@markdown Run this cell and enter your GCP project ID when prompted.\n",
"PROJECT_ID = input(\"Enter the project ID (same as your setup project) to copy source data in BigQuery for this solution: \")\n",
"\n",
"# Quick input validation\n",
"assert PROJECT_ID, \"⚠️ Please provide your Google Cloud Project ID\"\n",
"\n",
"# Configure gcloud.\n",
"!gcloud config set project {PROJECT_ID}\n",
"print(f'Project has been set to {PROJECT_ID}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yygMe6rPWxHS"
},
"source": [
"## 🔐 **2. Authenticate to Google Cloud**\n",
"Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.\n",
"\n",
"You can do this within Google Colab or using the Application Default Credentials in the Google Cloud CLI."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "PTXN1_DSXj2b"
},
"outputs": [],
"source": [
"# Authentication step\n",
"\n",
"\"\"\"Colab Auth\"\"\"\n",
"# from google.colab import auth\n",
"# auth.authenticate_user()\n",
"\n",
"\n",
"\"\"\"Jupyter Notebook Auth\"\"\"\n",
"import google.auth\n",
"import os\n",
"\n",
"credentials, project_id = google.auth.default()\n",
"\n",
"os.environ['GOOGLE_CLOUD_QUOTA_PROJECT']=PROJECT_ID\n",
"os.environ['GOOGLE_CLOUD_PROJECT']=PROJECT_ID"
]
},
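{
"cell_type": "markdown",
"metadata": {},
"source": [
"If `google.auth.default()` raises a `DefaultCredentialsError`, you have no Application Default Credentials on this machine yet. The optional cell below is a minimal sketch: uncomment the command and run it once to create them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: create Application Default Credentials if google.auth.default() fails.\n",
"# Uncomment the line below and follow the browser prompt.\n",
"# !gcloud auth application-default login"
]
},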
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Enable all the required APIs for the copy\n",
"\n",
"!gcloud services enable \\\n",
" cloudapis.googleapis.com \\\n",
" compute.googleapis.com \\\n",
" iam.googleapis.com \\\n",
" bigquery.googleapis.com --project {PROJECT_ID}"
]
},
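{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, confirm the APIs were enabled. The sketch below simply greps the enabled-services list for `bigquery`; adjust the pattern to check the other APIs if needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: the BigQuery API should appear among the enabled services\n",
"!gcloud services list --enabled --project {PROJECT_ID} | grep bigquery"
]
},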
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ☁️ **3. Copy a Public Dataset to your GCP Project**\n",
"\n",
"Copy tables from the public dataset to ask questions against. Sample tables are chosen below; feel free to choose different ones.\n",
"\n",
"Note: BigQuery does not allow tables to be copied across regions in certain cases. Therefore, BQ_DST_REGIONS is set to BQ_SRC_REGIONS by default. Change this parameter to check whether copying into your region of interest is allowed; an optional location check follows the configuration cell below.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Details of the source datasets\n",
"BQ_SRC_PROJECT = \"bigquery-public-data\"\n",
"BQ_SRC_DATASETS = [\"imdb\", \"imdb\"]\n",
"BQ_SRC_TABLES_LIST = [[\"title_principals\", \"title_crew\", \"title_basics\", \"name_basics\"], [\"reviews\", \"title_ratings\"]]  # Specify an empty list [] to copy all tables, or a specific list, e.g. [\"table1\", \"table3\", \"table10\"]\n",
"BQ_SRC_REGIONS = [\"us\", \"us\"]\n",
"\n",
"# Details of the destination datasets\n",
"BQ_DST_PROJECT = PROJECT_ID\n",
"BQ_DST_DATASETS = [\"imdb_people\", \"imdb_ratings\"]  # List of destination dataset names\n",
"BQ_DST_REGIONS = BQ_SRC_REGIONS  # Change if needed"
]
},
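{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional: because cross-region copies can fail (see the note above), the sketch below compares each source dataset's location against the configured destination region before any copy runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: confirm each source dataset's location matches the destination region\n",
"from google.cloud import bigquery\n",
"\n",
"_client = bigquery.Client(project=PROJECT_ID)\n",
"for src_dataset, dst_region in zip(BQ_SRC_DATASETS, BQ_DST_REGIONS):\n",
"    location = _client.get_dataset(f\"{BQ_SRC_PROJECT}.{src_dataset}\").location\n",
"    status = \"OK\" if location.lower() == dst_region.lower() else \"MISMATCH - copy may fail\"\n",
"    print(f\"{src_dataset}: source location={location}, destination region={dst_region} -> {status}\")"
]
},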
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import bigquery\n",
"from google.api_core.exceptions import NotFound\n",
"\n",
"\n",
"def createBQDataset(bq_project_id, dataset_name, dataset_region):\n",
"    \"\"\"Return the dataset reference, creating the dataset if it does not exist.\"\"\"\n",
"    client = bigquery.Client(project=bq_project_id)\n",
"\n",
"    dataset_ref = f\"{bq_project_id}.{dataset_name}\"\n",
"\n",
"    try:\n",
"        client.get_dataset(dataset_ref)\n",
"        print(\"Destination dataset exists\")\n",
"    except NotFound:\n",
"        print(\"Cannot find the dataset, hence creating.......\")\n",
"        dataset = bigquery.Dataset(dataset_ref)\n",
"        dataset.location = dataset_region\n",
"        client.create_dataset(dataset)\n",
"\n",
"    return dataset_ref\n",
"\n",
"\n",
"def createBQTable(bq_project_id, dataset_name, table_name):\n",
"    \"\"\"Return the table reference, creating an empty table if it does not exist.\"\"\"\n",
"    client = bigquery.Client(project=bq_project_id)\n",
"\n",
"    # DatasetReference avoids the deprecated Client.dataset() helper\n",
"    table_ref = bigquery.DatasetReference(bq_project_id, dataset_name).table(table_name)\n",
"\n",
"    try:\n",
"        client.get_table(table_ref)\n",
"        print(f\"Destination table {table_ref} exists\")\n",
"    except NotFound:\n",
"        print(f\"Creating the table {table_ref}.......\")\n",
"        table = bigquery.Table(table_ref)\n",
"        client.create_table(table)\n",
"\n",
"    return table_ref\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cannot find the dataset hence creating.......\n",
"Creating the table three-p-o.imdb_people.title_principals.......\n",
"Creating the table three-p-o.imdb_people.title_crew.......\n",
"Creating the table three-p-o.imdb_people.title_basics.......\n",
"Creating the table three-p-o.imdb_people.name_basics.......\n",
"Cannot find the dataset hence creating.......\n",
"Creating the table three-p-o.imdb_ratings.reviews.......\n",
"Creating the table three-p-o.imdb_ratings.title_ratings.......\n",
"Done!\n"
]
}
],
"source": [
"# Create destination tables and copy table data\n",
"from google.cloud import bigquery\n",
"\n",
"# Initialize the BigQuery client\n",
"client = bigquery.Client(project=PROJECT_ID)\n",
"\n",
"for BQ_SRC_DATASET, BQ_SRC_TABLES, BQ_SRC_REGION, BQ_DST_DATASET, BQ_DST_REGION in zip(BQ_SRC_DATASETS, BQ_SRC_TABLES_LIST, BQ_SRC_REGIONS, BQ_DST_DATASETS, BQ_DST_REGIONS):\n",
"\n",
"    # Create the destination dataset (if it already exists, it is reused as-is)\n",
"    dst_dataset_ref = createBQDataset(BQ_DST_PROJECT, BQ_DST_DATASET, BQ_DST_REGION)\n",
"\n",
"    if not BQ_SRC_TABLES:\n",
"        # If tables are not explicitly provided, get the list of tables from BigQuery\n",
"        dataset_id = f\"{BQ_SRC_PROJECT}.{BQ_SRC_DATASET}\"\n",
"        bq_tables_obj = client.list_tables(dataset_id)\n",
"        BQ_SRC_TABLES = [table_obj.table_id for table_obj in bq_tables_obj]\n",
"\n",
"    for BQ_SRC_TABLE in BQ_SRC_TABLES:\n",
"\n",
"        dst_table_ref = createBQTable(BQ_DST_PROJECT, BQ_DST_DATASET, BQ_SRC_TABLE)\n",
"        src_table_ref = bigquery.DatasetReference(BQ_SRC_PROJECT, BQ_SRC_DATASET).table(BQ_SRC_TABLE)\n",
"\n",
"        job_config = bigquery.CopyJobConfig(write_disposition=\"WRITE_TRUNCATE\")\n",
"\n",
"        copy_job = client.copy_table(src_table_ref, dst_table_ref, job_config=job_config)\n",
"        # Wait for the job to complete and surface any errors\n",
"        copy_job.result()\n",
"\n",
"print('Done!')\n"
]
},
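{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional: verify the copy by printing a row count for every table in each destination dataset. This sketch reuses the `client` from the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional verification: row counts of the copied tables\n",
"for dst_dataset in BQ_DST_DATASETS:\n",
"    for table_item in client.list_tables(f\"{BQ_DST_PROJECT}.{dst_dataset}\"):\n",
"        table = client.get_table(table_item.reference)\n",
"        print(f\"{table.full_table_id}: {table.num_rows} rows\")"
]
},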
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### If all the above steps executed successfully, the BigQuery public datasets have been copied to your GCP project"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}