{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "18689af1",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# Copyright 2023 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"id": "bbb9bd29",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# Dataproc Metastore Quickstart"
]
},
{
"cell_type": "markdown",
"id": "d9a4c073-9967-447a-b32e-1db752f6be3b",
"metadata": {
"tags": []
},
"source": [
"#### Dataproc Metastore\n",
"\n",
"- Dataproc Metastore is a fully managed, highly available, autohealing, serverless, Apache Hive metastore (HMS) that runs on Google Cloud.\n",
"- Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata.\n",
"- This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing engines and tools you're using.\n",
"- More on [Dataproc Metastore service documentation](https://cloud.google.com/dataproc-metastore/docs/overview)\n",
"\n",
"#### Datasets in the public Dataproc Metastore instance\n",
"\n",
"- You can configure you Dataproc cluster or Serverless Runtime to connect to our public read-only Dataproc Metastore and read the dataset tables using *spark.read.table(\"public_datasets.\\<table_name\\>\")*\n",
"\n",
"|GCP project|DMPS instance|Location|Version|\n",
"|-----------------------------|-------------------|-----------|-----|\n",
"|dataproc-workspaces-notebooks|public-metastore-v1|us-central1|3.1.2|\n"
]
},
{
"cell_type": "markdown",
"id": "3c9fb208-2689-48b5-806d-6504a3212f25",
"metadata": {
"tags": []
},
"source": [
"## Using Dataproc Metastore Public Datasets"
]
},
{
"cell_type": "markdown",
"id": "ca685386-aa7b-4a40-b083-a2c0cedeae7c",
"metadata": {},
"source": [
"#### Using Dataproc Jupyter Lab plugin\n",
"\n",
"- Option 1, via the Jupyter Lab UI:\n",
" 1) Create New Runtime Template\n",
" 2) In the Metastore section, select the **dataproc-workspaces-notebooks** GCP project\n",
" 3) Select **projects/dataproc-workspaces-notebooks/locations/us-central1/services/public-metastore-v1**\n",
" 4) Select this runtime as Jupyter Kernel\n",
"\n",
"<center><img src=\"../../docs/images/create-runtime.png\"/></center>\n",
"<center><img src=\"../../docs/images/metastore-select.png\"/></center>\n",
"\n",
"- Option 2, via CLI:\n",
" 1) Create New Runtime Template configuration yaml file ([example](./runtime_configuration.yaml))\n",
" 3) Use the CLI instructions below to create your dataproc serverless runtime\n",
" 4) Select this runtime as Jupyter Kernel\n",
"\n",
" **CLI instructions**\n",
"\n",
" - Create a dataproc serverless runtime: ```gcloud beta dataproc session-templates import TEMPLATE_ID --source=SOURCE_FILE --location=DATAPROC-REGION --project=PROJECT_ID```\n",
" - Viewing a runtime template configuration: ```gcloud beta dataproc session-templates describe TEMPLATE_ID --location=DATAPROC-REGION --project=PROJECT_ID```\n",
" - Listing runtime templates in a project and region: ```gcloud beta dataproc session-templates list --location=DATAPROC-REGION --project=PROJECT_ID```\n",
" - Exporting a runtime template configuration to a file: ```gcloud beta dataproc session-templates export TEMPLATE_ID --destination=FILE --location=DATAPROC-REGION --project=PROJECT_ID```\n",
" - Exporting a runtime template configuration to standard output: ```gcloud beta dataproc session-templates export TEMPLATE_ID```\n",
" - Deleting a runtime template: ```gcloud beta dataproc session-templates delete TEMPLATE_ID --location=DATAPROC-REGION --project=PROJECT_ID```\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "0d816e14-f4f5-49cc-81ad-76f28a91b383",
"metadata": {},
"source": [
"#### Using Dataproc Cluster\n",
"\n",
"Create a Dataproc Cluster with a Dataproc Metastore service attached to it, via the UI or the following gcloud command \n",
"\n",
"1) Export variables\n",
"```console\n",
"export GCP_PROJECT=<your_gcp_project>\n",
"export REGION=<your_region>\n",
"export CLUSTER_IMAGE_VERSION=<your_image_version> # ex: 2.0-debian10\n",
"export CLUSTER_NAME=<your_desired_cluster_name> # ex: cluster-with-federation\n",
"export SERVICE_ACCOUNT=<your_service_account>\n",
"```\n",
"\n",
"2) Create Dataproc cluster with a Dataproc Metastore service attached\n",
"```console\n",
"gcloud dataproc clusters create $CLUSTER_NAME \\\n",
" --region=$REGION \\\n",
" --project=$GCP_PROJECT \\\n",
" --service-account=$SERVICE_ACCOUNT \\\n",
" --image-version=$CLUSTER_IMAGE_VERSION \\\n",
" --scopes=https://www.googleapis.com/auth/cloud-platform \\\n",
" --enable-component-gateway \\\n",
" --optional-components JUPYTER \\\n",
" --dataproc-metastore projects/dataproc-workspaces-notebooks/locations/us-central1/services/metastore-dev\n",
"\n",
"# * For image version > 2.1, the --scopes=https://www.googleapis.com/auth/cloud-platform flag is not needed\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "92232ed2-c734-4463-82ff-e34a18f305b1",
"metadata": {},
"source": [
"#### Use PySpark to list tables in the **public_datasets**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e03a39d4-d08f-4ceb-a11b-6cea8329ccb7",
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a34ce061-651b-4fb9-88bb-bd59d4eaf2e0",
"metadata": {},
"outputs": [],
"source": [
"spark = SparkSession.builder \\\n",
" .appName(\"Dataproc Metastore Example\") \\\n",
" .enableHiveSupport() \\\n",
" .getOrCreate()"
]
},
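{
"cell_type": "markdown",
"id": "a7c31f20",
"metadata": {},
"source": [
"Optionally, verify that the session is attached to the metastore by listing its databases. Assuming the runtime or cluster is configured with **public-metastore-v1** as described above, **public_datasets** should appear in the output."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8d42e31",
"metadata": {},
"outputs": [],
"source": [
"# List the databases exposed by the attached Hive metastore.\n",
"# With public-metastore-v1 attached, this should include public_datasets.\n",
"spark.sql(\"SHOW DATABASES\").show()"
]
},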
{
"cell_type": "markdown",
"id": "e59a3538",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Available tables"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "faeaca6f",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"spark.sql(\"SHOW TABLES IN public_datasets\").show()"
]
},
{
"cell_type": "markdown",
"id": "59efb57c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"| database| tableName|isTemporary|\n",
"|---------------|-----------------|-----------|\n",
"|public_datasets| cuad_v1| false|\n",
"|public_datasets| winequality_red| false|\n",
"|public_datasets|winequality_white| false|\n",
"|public_datasets|real_estate_sales| false|\n",
"|public_datasets|sms_spam_collection| false|\n",
"|public_datasets|us_customer_price_index_yearly| false|\n",
"|public_datasets|ai4i_2020_predictive_maintenance| false|\n",
"|public_datasets|stanford_online_products| false|\n",
"|public_datasets|youtube_ucg| false|"
]
},
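{
"cell_type": "markdown",
"id": "c9e53f42",
"metadata": {},
"source": [
"## Reading a table\n",
"\n",
"A minimal sketch of reading one of the public dataset tables with *spark.read.table*, as described at the top of this notebook. It uses **winequality_red** from the listing above; any other table name from **public_datasets** can be substituted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0f64a53",
"metadata": {},
"outputs": [],
"source": [
"# Read one of the public dataset tables through the attached metastore.\n",
"# winequality_red is taken from the table listing above; any other\n",
"# public_datasets table name can be used instead.\n",
"df = spark.read.table(\"public_datasets.winequality_red\")\n",
"df.printSchema()\n",
"df.show(5)"
]
}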
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}