{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "1d896dad-68b4-4405-8612-b8696fb68b03",
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2023 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"id": "ff6ecae8-6c51-4611-bcfc-9bf71dce657b",
"metadata": {},
"source": [
"# Delta files from Google Cloud Storage"
]
},
{
"cell_type": "markdown",
"id": "a0be0044-5c18-46bd-8fb1-d0e5b16d638a",
"metadata": {},
"source": [
"<table align=\"left\">\n",
"<td>\n",
"<a href=\"https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/foundational/delta_format/delta_quickstart.ipynb\">\n",
"<img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
"View on GitHub\n",
"</a>\n",
"</td>\n",
"<td>\n",
"<a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/ai-ml-recipes/main/notebooks/foundational/delta_format/delta_quickstart.ipynb\">\n",
"<img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n",
"Open in Vertex AI Workbench\n",
"</a>\n",
"</td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "27f989cc-913d-4796-85a3-7d8705d40c89",
"metadata": {},
"source": [
"In this quickstart, you'll learn how to read and write data stored in the Delta Lake format on Google Cloud Storage (GCS).\n",
"The notebook runs on a Dataproc Serverless runtime configured with the required Delta Lake libraries."
]
},
{
"cell_type": "markdown",
"id": "bc3f3b77-83b6-4be6-8b60-fe0d350e5c86",
"metadata": {},
"source": [
"When creating a Dataproc Runtime Template using the Dataproc JupyterLab Plugin, set the following Spark property:\n",
"\n",
"**spark.jars** \n",
"gs://dataproc-metastore-public-binaries/dependencies/delta-storage-3.1.0.jar,gs://dataproc-metastore-public-binaries/dependencies/delta-spark_2.13-3.1.0.jar\n",
"\n",
"<center><img src=\"../../docs/images/delta-jars.png\" width=\"50%\" height=\"50%\"/></center>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "39bc5ec2-d9fe-4e4a-8606-c3b9ee0ac6d1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ce62d5e-ef60-4074-89ad-e11740407c3c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"spark = SparkSession.builder \\\n",
" .appName(\"delta\") \\\n",
" .config(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\") \\\n",
" .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\") \\\n",
" .enableHiveSupport() \\\n",
" .getOrCreate()"
]
},
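{
"cell_type": "markdown",
"id": "example-check-delta-config-md",
"metadata": {},
"source": [
"If the kernel already started a Spark session, `getOrCreate()` returns that session, and static settings such as `spark.sql.extensions` may not take effect. As an optional sanity check, the following sketch prints the two Delta settings requested above. It assumes the `spark` variable from the previous cell; an empty string means the setting is not present on the active session."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "example-check-delta-config-code",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Optional check: print the Delta-related settings of the active session.\n",
"# An empty string means the setting is not present on this session.\n",
"for key in (\"spark.sql.extensions\", \"spark.sql.catalog.spark_catalog\"):\n",
"    print(key, \"=\", spark.conf.get(key, \"\"))"
]
},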
{
"cell_type": "markdown",
"id": "bf5eba75-792f-4974-a8f6-0d7dbf3a12ee",
"metadata": {},
"source": [
"#### You can read a dataset in the Delta format from our public datasets like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2416d16a-38aa-4345-80c2-5e3239bcc405",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df = spark.read.format(\"delta\").load(\"gs://dataproc-metastore-public-binaries/gas_sensors\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "909d3afe-4e3e-4d3b-ad47-fb9a24944d63",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df.show()"
]
},
{
"cell_type": "markdown",
"id": "04940250-b2e7-4e87-a702-4817f10238f3",
"metadata": {},
"source": [
"| Times| COppm|Humidityrh|TemperatureC|FlowratemLmin|HeatervoltageV| R1MOhm| R2MOhm| R3MOhm| R4MOhm| R5MOhm| R6MOhm| R7MOhm| R8MOhm|R9MOhm|R10MOhm|R11MOhm|R12MOhm|R13MOhm|R14MOhm|\n",
"|------|------|----------|------------|-------------|--------------|-------|-------|-------|-------|-------|-------|-------|-------|------|-------|-------|-------|-------|-------|\n",
"|0.0000|0.0000| 32.7023| 9.9032| 240.8069| 0.8900| 0.0738| 0.1314| 0.0987| 0.0936| 0.1026| 0.1152| 0.1105| 0.0891|0.0951| 0.1083| 0.1037| 0.1009| 0.0927| 0.1009|\n",
"|0.3070|0.0000| 45.5699| 13.7998| 240.8029| 0.8950| 0.0786| 0.1286| 0.1019| 0.0932| 0.1051| 0.1129| 0.1128| 0.0905|0.0958| 0.1103| 0.1043| 0.1025| 0.0942| 0.1020|\n",
"|0.6170|0.0000| 58.5539| 17.7318| 240.7989| 0.8979| 0.0816| 0.1287| 0.1052| 0.0940| 0.1075| 0.1128| 0.1149| 0.0918|0.0966| 0.1119| 0.1050| 0.1036| 0.0958| 0.1028|\n",
"|0.9240|0.0000| 71.4123| 21.6256| 240.7949| 0.8971| 0.0834| 0.1303| 0.1083| 0.0954| 0.1100| 0.1138| 0.1172| 0.0928|0.0972| 0.1130| 0.1054| 0.1044| 0.0968| 0.1033|\n",
"|1.2340|0.0000| 83.8100| 25.3800| 240.7917| 0.8980| 0.0851| 0.1324| 0.1112| 0.0967| 0.1120| 0.1151| 0.1194| 0.0936|0.0976| 0.1138| 0.1057| 0.1049| 0.0975| 0.1038|\n",
"|1.5410|0.0000| 83.8100| 25.3800| 240.8061| 0.8986| 0.0867| 0.1343| 0.1138| 0.0981| 0.1139| 0.1165| 0.1213| 0.0942|0.0980| 0.1144| 0.1058| 0.1053| 0.0980| 0.1041|"
]
},
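{
"cell_type": "markdown",
"id": "example-transform-md",
"metadata": {},
"source": [
"Before writing data back, you will usually transform it. As a minimal sketch of one possible transformation, the next cell keeps only the rows with a non-zero CO reading and adds a temperature column in Fahrenheit. The column names `COppm` and `TemperatureC` come from the output above; the names `transformed_df` and `TemperatureF`, and the choice of filter, are hypothetical and assume these columns hold numeric values. To save the result, call `transformed_df.write` instead of `df.write`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "example-transform-code",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from pyspark.sql import functions as F\n",
"\n",
"# Illustrative transformation (an assumption, not a required step):\n",
"# keep rows where CO is present and derive a Fahrenheit temperature column.\n",
"transformed_df = (\n",
"    df.filter(F.col(\"COppm\") > 0)\n",
"    .withColumn(\"TemperatureF\", F.col(\"TemperatureC\") * 9 / 5 + 32)\n",
")\n",
"\n",
"# Inspect the resulting columns and types before writing.\n",
"transformed_df.printSchema()"
]
},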
{
"cell_type": "markdown",
"id": "52939565-a2a7-43d3-9f21-690d8117f9b4",
"metadata": {},
"source": [
"#### You can write transformed data to your bucket in the Delta format like this:"
]
},
{
"cell_type": "markdown",
"id": "bba858c1-d1b1-4042-9608-35b0dc381569",
"metadata": {},
"source": [
"`df.write.mode(\"append\").format(\"delta\").save(\"gs://<YOUR_GCS_BUCKET>/<PATH>/\")`"
]
}
],
"metadata": {
"environment": {
"kernel": "9c39b79e5d2e7072beb4bd59-delta-runtime",
"name": "workbench-notebooks.m120",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m120"
},
"kernelspec": {
"display_name": "delta-runtime on Serverless Spark (Remote)",
"language": "python",
"name": "9c39b79e5d2e7072beb4bd59-delta-runtime"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}