notebooks/quickstart/dataproc_metastore/metastore_spark_quickstart.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "18689af1", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "id": "bbb9bd29", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "# Dataproc Metastore Quickstart" ] }, { "cell_type": "markdown", "id": "d9a4c073-9967-447a-b32e-1db752f6be3b", "metadata": { "tags": [] }, "source": [ "#### Dataproc Metastore\n", "\n", "- Dataproc Metastore is a fully managed, highly available, autohealing, serverless, Apache Hive metastore (HMS) that runs on Google Cloud.\n", "- Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata.\n", "- This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing engines and tools you're using.\n", "- More on [Dataproc Metastore service documentation](https://cloud.google.com/dataproc-metastore/docs/overview)\n", "\n", "#### Datasets in the public Dataproc Metastore instance\n", "\n", "- You can configure you Dataproc cluster or Serverless Runtime to connect to our public read-only Dataproc Metastore and read the dataset tables using *spark.read.table(\"public_datasets.\\<table_name\\>\")*\n", "\n", "|GCP project|DMPS instance|Location|Version|\n", "|-----------------------------|-------------------|-----------|-----|\n", "|dataproc-workspaces-notebooks|public-metastore-v1|us-central1|3.1.2|\n" ] }, { "cell_type": "markdown", "id": "3c9fb208-2689-48b5-806d-6504a3212f25", "metadata": { "tags": [] }, "source": [ "## Using Dataproc Metastore Public Datasets" ] }, { "cell_type": "markdown", "id": "ca685386-aa7b-4a40-b083-a2c0cedeae7c", "metadata": {}, "source": [ "#### Using Dataproc Jupyter Lab plugin\n", "\n", "- Option 1, via the Jupyter Lab UI:\n", " 1) Create New Runtime Template\n", " 2) In the Metastore section, select the **dataproc-workspaces-notebooks** GCP project\n", " 3) Select **projects/dataproc-workspaces-notebooks/locations/us-central1/services/public-metastore-v1**\n", " 4) Select this runtime as Jupyter Kernel\n", "\n", "<center><img src=\"../../docs/images/create-runtime.png\"/></center>\n", "<center><img src=\"../../docs/images/metastore-select.png\"/></center>\n", "\n", "- Option 2, via CLI:\n", " 1) Create New Runtime Template configuration yaml file ([example](./runtime_configuration.yaml))\n", " 3) Use the CLI instructions below to create your dataproc serverless runtime\n", " 4) Select this runtime as Jupyter Kernel\n", "\n", " **CLI instructions**\n", "\n", " - Create a dataproc serverless runtime: ```gcloud beta dataproc session-templates import TEMPLATE_ID --source=SOURCE_FILE --location=DATAPROC-REGION --project=PROJECT_ID```\n", " - Viewing a 
{ "cell_type": "markdown", "id": "0d816e14-f4f5-49cc-81ad-76f28a91b383", "metadata": {}, "source": [ "#### Using a Dataproc cluster\n", "\n", "Create a Dataproc cluster with a Dataproc Metastore service attached to it, via the UI or with the following gcloud commands.\n", "\n", "1) Export variables\n", "```console\n", "export GCP_PROJECT=<your_gcp_project>\n", "export REGION=<your_region>\n", "export CLUSTER_IMAGE_VERSION=<your_image_version> # ex: 2.0-debian10\n", "export CLUSTER_NAME=<your_desired_cluster_name> # ex: cluster-with-federation\n", "export SERVICE_ACCOUNT=<your_service_account>\n", "```\n", "\n", "2) Create a Dataproc cluster with the public Dataproc Metastore service attached\n", "```console\n", "gcloud dataproc clusters create $CLUSTER_NAME \\\n", " --region=$REGION \\\n", " --project=$GCP_PROJECT \\\n", " --service-account=$SERVICE_ACCOUNT \\\n", " --image-version=$CLUSTER_IMAGE_VERSION \\\n", " --scopes=https://www.googleapis.com/auth/cloud-platform \\\n", " --enable-component-gateway \\\n", " --optional-components JUPYTER \\\n", " --dataproc-metastore projects/dataproc-workspaces-notebooks/locations/us-central1/services/public-metastore-v1\n", "\n", "# * For image versions > 2.1, the --scopes=https://www.googleapis.com/auth/cloud-platform flag is not needed\n", "```" ] }, { "cell_type": "markdown", "id": "92232ed2-c734-4463-82ff-e34a18f305b1", "metadata": {}, "source": [ "#### Use PySpark to list the tables in the **public_datasets** database:" ] }, { "cell_type": "code", "execution_count": null, "id": "e03a39d4-d08f-4ceb-a11b-6cea8329ccb7", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession" ] }, { "cell_type": "code", "execution_count": null, "id": "a34ce061-651b-4fb9-88bb-bd59d4eaf2e0", "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder \\\n", " .appName(\"Dataproc Metastore Example\") \\\n", " .enableHiveSupport() \\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "id": "e59a3538", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Available tables" ] }, { "cell_type": "code", "execution_count": null, "id": "faeaca6f", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "spark.sql(\"SHOW TABLES IN public_datasets\").show()" ] }, { "cell_type": "markdown", "id": "59efb57c", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "| database| tableName|isTemporary|\n", "|---------------|-----------------|-----------|\n", "|public_datasets| cuad_v1| false|\n", "|public_datasets| winequality_red| false|\n", "|public_datasets|winequality_white| false|\n", "|public_datasets|real_estate_sales| false|\n", "|public_datasets|sms_spam_collection| false|\n", "|public_datasets|us_customer_price_index_yearly| false|\n", "|public_datasets|ai4i_2020_predictive_maintenance| false|\n", "|public_datasets|stanford_online_products| false|\n", "|public_datasets|youtube_ucg| false|" ] }
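, { "cell_type": "markdown", "id": "3f2b9c10", "metadata": {}, "source": [ "## Read a table\n", "\n", "A minimal sketch of reading one of the public tables through the metastore with *spark.read.table*, as described above. The table name *winequality_red* is taken from the listing above; any of the listed tables can be read the same way." ] }, { "cell_type": "code", "execution_count": null, "id": "7d4e8a22", "metadata": {}, "outputs": [], "source": [ "# Load one of the public tables through the metastore and preview a few rows\n", "df = spark.read.table(\"public_datasets.winequality_red\")\n", "df.printSchema()\n", "df.show(5)" ] }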
false|\n", "|public_datasets|us_customer_price_index_yearly| false|\n", "|public_datasets|ai4i_2020_predictive_maintenance| false|\n", "|public_datasets|stanford_online_products| false|\n", "|public_datasets|youtube_ucg| false|" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel) (Local)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 5 }