In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataproc Metastore Quickstart

#### Dataproc Metastore

- Dataproc Metastore is a fully managed, highly available, auto-healing, serverless Apache Hive metastore (HMS) that runs on Google Cloud.
- Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata.
- This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing engines and tools you're using.
- More on [Dataproc Metastore service documentation](https://cloud.google.com/dataproc-metastore/docs/overview)

#### Datasets in the public Dataproc Metastore instance

- You can configure your Dataproc cluster or Serverless Runtime to connect to our public read-only Dataproc Metastore and read the dataset tables using *spark.read.table("public_datasets.\<table_name\>")*

|GCP project|DPMS instance|Location|Version|
|-----------------------------|-------------------|-----------|-----|
|dataproc-workspaces-notebooks|public-metastore-v1|us-central1|3.1.2|
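
As a minimal sketch of the `spark.read.table` pattern mentioned above: the helper names `qualified_table` and `read_public_table` are hypothetical, and the code assumes that `spark` is a `SparkSession` whose cluster or runtime is already attached to the public metastore.

```python
def qualified_table(table_name: str, database: str = "public_datasets") -> str:
    """Return the `database.table` string that spark.read.table expects."""
    if not table_name or "." in table_name:
        raise ValueError(f"expected a bare table name, got {table_name!r}")
    return f"{database}.{table_name}"


def read_public_table(spark, table_name: str):
    """Load one table of the public database as a DataFrame."""
    return spark.read.table(qualified_table(table_name))


# Usage in a notebook attached to the public metastore (not run here):
# df = read_public_table(spark, "winequality_red")
# df.printSchema()
```

Keeping the database name in one place avoids repeating the `public_datasets.` prefix in every cell.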


## Using Dataproc Metastore Public Datasets

#### Using Dataproc Jupyter Lab plugin

- Option 1, via the Jupyter Lab UI:
    1) Create New Runtime Template
    2) In the Metastore section, select the **dataproc-workspaces-notebooks** GCP project
    3) Select **projects/dataproc-workspaces-notebooks/locations/us-central1/services/public-metastore-v1**
    4) Select this runtime as Jupyter Kernel

<center><img src="../../docs/images/create-runtime.png"/></center>
<center><img src="../../docs/images/metastore-select.png"/></center>

- Option 2, via CLI:
    1) Create a new runtime template configuration YAML file ([example](./runtime_configuration.yaml))
    2) Use the CLI instructions below to create your Dataproc Serverless runtime
    3) Select this runtime as Jupyter Kernel

    **CLI instructions**

    - Create a Dataproc Serverless runtime from a template file: ```gcloud beta dataproc session-templates import TEMPLATE_ID --source=SOURCE_FILE --location=DATAPROC-REGION --project=PROJECT_ID```
    - View a runtime template configuration: ```gcloud beta dataproc session-templates describe TEMPLATE_ID --location=DATAPROC-REGION --project=PROJECT_ID```
    - List the runtime templates in a project and region: ```gcloud beta dataproc session-templates list --location=DATAPROC-REGION --project=PROJECT_ID```
    - Export a runtime template configuration to a file: ```gcloud beta dataproc session-templates export TEMPLATE_ID --destination=FILE --location=DATAPROC-REGION --project=PROJECT_ID```
    - Export a runtime template configuration to standard output: ```gcloud beta dataproc session-templates export TEMPLATE_ID```
    - Delete a runtime template: ```gcloud beta dataproc session-templates delete TEMPLATE_ID --location=DATAPROC-REGION --project=PROJECT_ID```
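
If you prefer to drive these commands from a notebook cell or script, the sketch below assembles the same command lines as argument lists. The function `session_template_command` and the values `my-template` and `my-project` are hypothetical placeholders. Only the flags shown in the CLI instructions above are used, and nothing is executed unless you pass the result to `subprocess.run` yourself.

```python
import shlex
from typing import List, Optional

_ACTIONS = {"import", "describe", "list", "export", "delete"}


def session_template_command(
    action: str,
    *,
    location: str,
    project: str,
    template_id: Optional[str] = None,
    source: Optional[str] = None,
    destination: Optional[str] = None,
) -> List[str]:
    """Build the argv for one `gcloud beta dataproc session-templates` call."""
    if action not in _ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if action != "list" and not template_id:
        raise ValueError(f"{action!r} needs a template_id")

    argv = ["gcloud", "beta", "dataproc", "session-templates", action]
    if action != "list":
        argv.append(template_id)
    if action == "import":
        if not source:
            raise ValueError("'import' needs a source file")
        argv.append(f"--source={source}")
    if action == "export" and destination:
        argv.append(f"--destination={destination}")
    argv += [f"--location={location}", f"--project={project}"]
    return argv


cmd = session_template_command(
    "import",
    template_id="my-template",            # placeholder template ID
    source="runtime_configuration.yaml",  # the example file linked above
    location="us-central1",
    project="my-project",                 # placeholder project ID
)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Building a list rather than a single string sidesteps shell quoting issues when the command is eventually executed.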




#### Using Dataproc Cluster

Create a Dataproc cluster with a Dataproc Metastore service attached to it, either via the UI or with the following gcloud commands.

1) Export variables
```console
export GCP_PROJECT=<your_gcp_project>
export REGION=<your_region>
export CLUSTER_IMAGE_VERSION=<your_image_version> # ex: 2.0-debian10
export CLUSTER_NAME=<your_desired_cluster_name> # ex: cluster-with-federation
export SERVICE_ACCOUNT=<your_service_account>
```

2) Create Dataproc cluster with a Dataproc Metastore service attached
```console
gcloud dataproc clusters create $CLUSTER_NAME \
    --region=$REGION \
    --project=$GCP_PROJECT \
    --service-account=$SERVICE_ACCOUNT \
    --image-version=$CLUSTER_IMAGE_VERSION \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --enable-component-gateway \
    --optional-components JUPYTER \
    --dataproc-metastore projects/dataproc-workspaces-notebooks/locations/us-central1/services/public-metastore-v1

# * For image version > 2.1, the --scopes=https://www.googleapis.com/auth/cloud-platform flag is not needed
```
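
The value passed to `--dataproc-metastore` follows the `projects/<project>/locations/<location>/services/<service>` pattern seen throughout this notebook. The sketch below builds and checks such a name before it reaches a command line. Both helper names are hypothetical, and the regular expression only checks the overall shape, not whether the service exists.

```python
import re

_SERVICE_RE = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/services/(?P<service>[^/]+)$"
)


def metastore_service_name(project: str, location: str, service: str) -> str:
    """Assemble the full resource name of a Dataproc Metastore service."""
    return f"projects/{project}/locations/{location}/services/{service}"


def parse_metastore_service_name(name: str) -> dict:
    """Split a resource name into its parts, or raise ValueError if malformed."""
    match = _SERVICE_RE.match(name)
    if match is None:
        raise ValueError(f"not a metastore service resource name: {name!r}")
    return match.groupdict()


# The public instance from the table at the top of this notebook.
public_metastore = metastore_service_name(
    "dataproc-workspaces-notebooks", "us-central1", "public-metastore-v1"
)
print(public_metastore)
print(parse_metastore_service_name(public_metastore))
```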

#### Use PySpark to list the tables in the **public_datasets** database

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder \
    .appName("Dataproc Metastore Example") \
    .enableHiveSupport() \
    .getOrCreate()

## Available tables

In [None]:
spark.sql("SHOW TABLES IN public_datasets").show()

|database       |tableName                       |isTemporary|
|---------------|--------------------------------|-----------|
|public_datasets|cuad_v1                         |false      |
|public_datasets|winequality_red                 |false      |
|public_datasets|winequality_white               |false      |
|public_datasets|real_estate_sales               |false      |
|public_datasets|sms_spam_collection             |false      |
|public_datasets|us_customer_price_index_yearly  |false      |
|public_datasets|ai4i_2020_predictive_maintenance|false      |
|public_datasets|stanford_online_products        |false      |
|public_datasets|youtube_ucg                     |false      |
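
The same listing is available programmatically through the Spark catalog API, which is handy when you want to loop over the tables. This is a rough sketch: `list_public_tables` and `preview` are hypothetical helpers, and `spark` is assumed to be the Hive-enabled session created earlier.

```python
def list_public_tables(spark, database: str = "public_datasets"):
    """Return the sorted names of the non-temporary tables in `database`."""
    return sorted(
        table.name
        for table in spark.catalog.listTables(database)
        if not table.isTemporary
    )


def preview(spark, table_name: str, database: str = "public_datasets", rows: int = 5):
    """Print the schema and the first few rows of one table."""
    df = spark.table(f"{database}.{table_name}")
    df.printSchema()
    df.show(rows)


# Usage in a notebook attached to the public metastore (not run here):
# for name in list_public_tables(spark):
#     print(name)
# preview(spark, "winequality_red")
```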