data-analytics-demos/bigquery-data-governance/colab-enterprise/03-Data-Insights.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "8rRxIQAxABNK" }, "source": [ "### Overview" ] }, { "cell_type": "markdown", "metadata": { "id": "IauvRjuUABNL" }, "source": [ "**Overview**: Data insights offers an automated way to explore and understand your data. With data insights, Gemini uses metadata to generate natural language questions about your table and the queries to answer them. This helps you uncover patterns, assess data quality, and perform statistical analysis.\n", "\n", "This notebook will gather all the tables in the raw, enriched and curated zones and create a data insights scan on each table. Data insights will allow you to generate working queries on each table to start your analytical process.\n", "\n", "**Process Flow:** \n", "\n", "1. Select all the tables in the raw, enriched, and curated datasets.\n", "2. Create a list of dictionaries to hold the table details and some additional fields to hold the scan name and scan state.\n", "3. Set concurrency level. We will run up to 5 scans at once.\n", "4. While all scans not completed:\n", " * Count the number of scans currently in the Pending, Running, Unspecified, and Cancelling states.\n", " * If we are less than our concurrency level (5) then start more scans.\n", " * When starting the scan, save the scan name and set the state to Unspecified.\n", " * Count the number of scans currently in the Pending, Running, Unspecified, and Cancelling states.\n", " * If zero are running, exit loop.\n", "5. For each successful scan:\n", " * Update the labels with the associated BigQuery table so the scan will show in the Google Console user interface.\n", "\n", "Notes:\n", "* You should run a Data Profile scan first since you get better insights if a profile scan exists.\n", "* This notebook runs the scans manually. Typically, you should schedule a scan on a schedule and not worry about processing.\n", "\n", "Cost:\n", "* Approximate cost: Less than 1 dollar\n", "\n", "Author:\n", "* Adam Paternostro" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Uynoj7r5ABNL" }, "outputs": [], "source": [ "# Architecture Diagram\n", "from IPython.display import Image\n", "Image(url='https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-Data-Governance-Data-Insights.png', width=1200)" ] }, { "cell_type": "markdown", "metadata": { "id": "S7zKuqTpABNL" }, "source": [ "### Video Walkthrough" ] }, { "cell_type": "markdown", "metadata": { "id": "8qRI86K6ABNL" }, "source": [ "[Video](https://storage.googleapis.com/data-analytics-golden-demo/colab-videos/Data-Insights.mp4)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Fc0k4fw4ABNL" }, "outputs": [], "source": [ "from IPython.display import HTML\n", "\n", "HTML(\"\"\"\n", "<video width=\"800\" height=\"600\" controls>\n", " <source src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-videos/Data-Insights.mp4\" type=\"video/mp4\">\n", " Your browser does not support the video tag.\n", "</video>\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": { "id": "HMsUvoF4BP7Y" }, "source": [ "### License" ] }, { "cell_type": "markdown", "metadata": { "id": "jQgQkbOvj55d" }, "source": [ "```\n", "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "m65vp54BUFRi" }, "source": [ "### Pip installs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5MaWM6H5i6rX" }, "outputs": [], "source": [ "# NOTE: All calls in this notebooks are done via REST APIs\n", "\n", "# PIP Installs (if necessary)\n", "import sys\n", "\n", "# !{sys.executable} -m pip install REPLACE-ME" ] }, { "cell_type": "markdown", "metadata": { "id": "UmyL-Rg4Dr_f" }, "source": [ "### Initialize" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xOYsEVSXp6IP" }, "outputs": [], "source": [ "from PIL import Image\n", "from IPython.display import HTML\n", "import IPython.display\n", "import google.auth\n", "import requests\n", "import json\n", "import uuid\n", "import base64\n", "import os\n", "import cv2\n", "import random\n", "import time\n", "import datetime\n", "import base64\n", "import random\n", "\n", "import logging\n", "from tenacity import retry, wait_exponential, stop_after_attempt, before_sleep_log, retry_if_exception" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wMlHl3bnkFPZ" }, "outputs": [], "source": [ "# Set these (run this cell to verify the output)\n", "\n", "bigquery_location = \"${bigquery_location}\"\n", "dataplex_region = \"${dataplex_region}\"\n", "\n", "# Get the current date and time\n", "now = datetime.datetime.now()\n", "\n", "# Format the date and time as desired\n", "formatted_date = now.strftime(\"%Y-%m-%d-%H-%M\")\n", "\n", "# Get some values using gcloud\n", "project_id = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n", "user = !(gcloud auth list --filter=status:ACTIVE --format=\"value(account)\")\n", "\n", "if len(user) != 1:\n", " raise RuntimeError(f\"user is not set: {user}\")\n", "user = user[0]\n", "\n", "print(f\"project_id = {project_id}\")\n", "print(f\"user = {user}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "sZ6m_wGrK0YG" }, "source": [ "### Helper Methods" ] }, { "cell_type": "markdown", "metadata": { "id": "JbOjdSP1kN9T" }, "source": [ "#### restAPIHelper\n", "Calls the Google Cloud REST API using the current users credentials." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "40wlwnY4kM11" }, "outputs": [], "source": [ "def restAPIHelper(url: str, http_verb: str, request_body: str) -> str:\n", " \"\"\"Calls the Google Cloud REST API passing in the current users credentials\"\"\"\n", "\n", " import requests\n", " import google.auth\n", " import json\n", "\n", " # Get an access token based upon the current user\n", " creds, project = google.auth.default()\n", " auth_req = google.auth.transport.requests.Request()\n", " creds.refresh(auth_req)\n", " access_token=creds.token\n", "\n", " headers = {\n", " \"Content-Type\" : \"application/json\",\n", " \"Authorization\" : \"Bearer \" + access_token\n", " }\n", "\n", " if http_verb == \"GET\":\n", " response = requests.get(url, headers=headers)\n", " elif http_verb == \"POST\":\n", " response = requests.post(url, json=request_body, headers=headers)\n", " elif http_verb == \"PUT\":\n", " response = requests.put(url, json=request_body, headers=headers)\n", " elif http_verb == \"PATCH\":\n", " response = requests.patch(url, json=request_body, headers=headers)\n", " elif http_verb == \"DELETE\":\n", " response = requests.delete(url, headers=headers)\n", " else:\n", " raise RuntimeError(f\"Unknown HTTP verb: {http_verb}\")\n", "\n", " if response.status_code == 200:\n", " return json.loads(response.content)\n", " #image_data = json.loads(response.content)[\"predictions\"][0][\"bytesBase64Encoded\"]\n", " else:\n", " error = f\"Error restAPIHelper -> ' Status: '{response.status_code}' Text: '{response.text}'\"\n", " raise RuntimeError(error)" ] }, { "cell_type": "markdown", "metadata": { "id": "ILcRKf-zgsP5" }, "source": [ "#### RunQuery\n", "Runs a BigQuery query." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HCo0Mr6zgsyn" }, "outputs": [], "source": [ "def RunQuery(sql):\n", " import time\n", " from google.cloud import bigquery\n", " client = bigquery.Client()\n", "\n", " if (sql.startswith(\"SELECT\") or sql.startswith(\"WITH\")):\n", " df_result = client.query(sql).to_dataframe()\n", " return df_result\n", " else:\n", " job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n", " query_job = client.query(sql, job_config=job_config)\n", "\n", " # Check on the progress by getting the job's updated state.\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " while query_job.state != \"DONE\":\n", " time.sleep(2)\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " if query_job.error_result == None:\n", " return True\n", " else:\n", " raise Exception(query_job.error_result)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZQYL-OEvAkZ7" }, "source": [ "### Data Insights Scan - Helper Methods" ] }, { "cell_type": "markdown", "metadata": { "id": "BVXA0swqCbLn" }, "source": [ "#### existsDataInsightScan\n", "- Tests to see if a Data Insights Scan exists\n", "- Returns True/False" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XEqmygbrCaYn" }, "outputs": [], "source": [ "def existsDataInsightScan(project_id, dataplex_region, data_insights_scan_name):\n", " \"\"\"Test to see if a scan exists.\"\"\"\n", "\n", " # Gather existing data scans\n", " # https://cloud.google.com/dataplex/docs/reference/rest/v1/projects.locations.dataScans/list\n", "\n", " url = f\"https://dataplex.googleapis.com/v1/projects/{project_id}/locations/{dataplex_region}/dataScans\"\n", "\n", " # Gather existing data scans\n", " json_result = restAPIHelper(url, \"GET\", None)\n", " print(f\"createDataDocumentScan (GET) json_result: {json_result}\")\n", "\n", " # Test to see if data scan exists, if so return\n", " if \"dataScans\" in json_result:\n", " for item in json_result[\"dataScans\"]:\n", " print(f\"Scan names: {item['name']}\")\n", " if item[\"name\"] == f\"projects/{project_id}/locations/{dataplex_region}/dataScans/{data_insights_scan_name}\":\n", " print(f\"Data Document Scan {data_insights_scan_name} already exists\")\n", " return True\n", "\n", " return False" ] }, { "cell_type": "markdown", "metadata": { "id": "D-k_P6NwAvT5" }, "source": [ "#### createDataInsightScan\n", "- Creates a insights scan, but does not run it\n", "- If the scan exists, the does nothing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GCfNKo1TAkBE" }, "outputs": [], "source": [ "def createDataInsightScan(project_id, dataplex_region, data_insight_scan_name, data_insight_display_name, bigquery_dataset_name, bigquery_table_name):\n", " \"\"\"Creates the data insights scan.\"\"\"\n", "\n", " if existsDataInsightScan(project_id, dataplex_region, data_insight_scan_name) == False:\n", " # Create a new scan\n", " # https://cloud.google.com/dataplex/docs/reference/rest/v1/projects.locations.dataScans/create\n", " print(\"Creating Data Insight Scan\")\n", "\n", " url = f\"https://dataplex.googleapis.com/v1/projects/{project_id}/locations/{dataplex_region}/dataScans?dataScanId={data_insight_scan_name}\"\n", "\n", " request_body = {\n", " \"displayName\": data_insight_display_name,\n", " \"type\": \"DATA_DOCUMENTATION\",\n", " \"dataDocumentationSpec\": {},\n", " \"data\":{\n", " \"resource\": f\"//bigquery.googleapis.com/projects/{project_id}/datasets/{bigquery_dataset_name}/tables/{bigquery_table_name}\"\n", " }\n", " }\n", "\n", " json_result = restAPIHelper(url, \"POST\", request_body)\n", "\n", " name = json_result[\"metadata\"][\"target\"]\n", " print(f\"Data Insight Scan created: {name}\")\n", " else:\n", " print(f\"Data Insight Scan exists: projects/{project_id}/locations/{dataplex_region}/dataScans/{data_insight_scan_name}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "q7LBwnD6A2pM" }, "source": [ "#### startDataInsightScan\n", "- Starts a data insight scan (async)\n", "- Returns the \"job name\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "C7Xdx8bKAj-a" }, "outputs": [], "source": [ "def startDataInsightScan(project_id, dataplex_region, data_insight_scan_name):\n", " \"\"\"Starts the scan\"\"\"\n", "\n", " # Create a new scan\n", " # https://cloud.google.com/dataplex/docs/reference/rest/v1/projects.locations.dataScans/run\n", " print(\"Running Data Insight Scan\")\n", "\n", " url = f\"https://dataplex.googleapis.com/v1/projects/{project_id}/locations/{dataplex_region}/dataScans/{data_insight_scan_name}:run\"\n", "\n", " request_body = { }\n", "\n", " json_result = restAPIHelper(url, \"POST\", request_body)\n", " job_name = json_result[\"job\"][\"name\"]\n", " job_state = json_result[\"job\"][\"state\"]\n", " print(f\"Document Data Scan Run created: {job_name} - State: {job_state}\")\n", "\n", " return job_name\n" ] }, { "cell_type": "markdown", "metadata": { "id": "X8db9qIHBOO8" }, "source": [ "#### getStateDataInsightScan\n", "- Gets the state of a scan (to see if it is done)\n", "- Returns the \"state\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JhHrg0lcAj7r" }, "outputs": [], "source": [ "def getStateDataInsightScan(project_id, dataplex_region, data_insight_scan_job_name):\n", " \"\"\"Runs the data insight scan job and monitors until it completes\"\"\"\n", "\n", " # Gets the \"state\" of a scan\n", " url = f\"https://dataplex.googleapis.com/v1/{data_insight_scan_job_name}\"\n", " json_result = restAPIHelper(url, \"GET\", None)\n", " return json_result[\"state\"]\n", " #== \"STATE_UNSPECIFIED\" or json_result[\"state\"] == \"RUNNING\" or json_result[\"state\"] == \"PENDING\":\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Kd51jAhcZBUm" }, "source": [ "#### updateBigQueryTableDataplexLabels\n", "- Patches the BigQuery table so that we associate the a Dataplex item with the BigQuery table so you see it in the UI\n", "- Returns nothing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "q0D8cTWCZBpM" }, "outputs": [], "source": [ "def updateBigQueryTableDataplexLabels(project_id, dataplex_region, dataplex_asset_type, dataplex_asset_scan_name, bigquery_dataset_name, bigquery_table_name):\n", " \"\"\"Sets the labels on the BigQuery table so users can see the scans in the Console.\"\"\"\n", "\n", " # Patch BigQuery\n", " # https://cloud.google.com/dataplex/docs/reference/rest/v1/projects.locations.dataScans/create\n", " print(\"Patching BigQuery Dataplex Labels\")\n", "\n", " url = f\"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/datasets/{bigquery_dataset_name}/tables/{bigquery_table_name}\"\n", "\n", " request_body = {}\n", " if dataplex_asset_type == \"DATA-PROFILE-SCAN\":\n", " request_body = {\n", " \"labels\" : {\n", " \"dataplex-dp-published-project\" : project_id,\n", " \"dataplex-dp-published-location\" : dataplex_region,\n", " \"dataplex-dp-published-scan\" : dataplex_asset_scan_name,\n", " }\n", " }\n", " elif dataplex_asset_type == \"DATA-INSIGHTS-SCAN\":\n", " request_body = {\n", " \"labels\" : {\n", " \"dataplex-data-documentation-published-project\" : project_id,\n", " \"dataplex-data-documentation-published-location\" : dataplex_region,\n", " \"dataplex-data-documentation-published-scan\" : dataplex_asset_scan_name,\n", " }\n", " }\n", " else:\n", " raise Exception(f\"Unknown dataplex_asset_type of {dataplex_asset_type}\")\n", "\n", " json_result = restAPIHelper(url, \"PATCH\", request_body)\n", " print(json_result)" ] }, { "cell_type": "markdown", "metadata": { "id": "QCFEfpdmEydA" }, "source": [ "### Run Data Insight Scan - Algorithm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "q702fsShe5vH" }, "outputs": [], "source": [ "# Get all the tables we want to scan for each dataset\n", "scans_to_perform = []\n", "dataset_list = [\"${bigquery_governed_data_raw_dataset}\",\"${bigquery_governed_data_enriched_dataset}\",\"${bigquery_governed_data_curated_dataset}\"]\n", "\n", "sql = \"\"\n", "for dataset_name in dataset_list:\n", " sql += f\"SELECT table_schema, table_name, table_type from `{dataset_name}.INFORMATION_SCHEMA.TABLES` UNION ALL \"\n", "\n", "# Remove training union all\n", "sql = sql.rstrip(\" UNION ALL \")\n", "\n", "result_df = RunQuery(sql)\n", "table_list = []\n", "\n", "# data_insight_scan_name: \"Field data_scan_id must contain only lowercase letters, numbers, and/or hyphens\n", "for index, row in result_df.iterrows():\n", " item = {\n", " \"project_id\": project_id,\n", " \"dataplex_region\": dataplex_region,\n", " \"data_insight_scan_name\": f\"{row['table_schema']}-{row['table_name']}-insight-scan\".lower().replace(\"_\",\"-\"),\n", " \"data_insight_display_name\": f\"{row['table_name']} scan\",\n", " \"bigquery_dataset_name\": row['table_schema'],\n", " \"bigquery_table_name\": row['table_name'],\n", "\n", " # Used by below loop for processing\n", " \"data_insight_scan_state\": \"\",\n", " \"data_insight_scan_job_name\": \"\"\n", " }\n", " scans_to_perform.append(item)\n", "\n", " print(f\"item: {item}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QfY7hsAHMnfp" }, "outputs": [], "source": [ "# Run the scans (up to a certain concurrency level)\n", "numberOfScansToRunConcurrently = 5\n", "\n", "while True:\n", " # Count the number of scans that are running\n", " concurrentScanCount = 0\n", " for item in scans_to_perform:\n", " if item[\"data_insight_scan_state\"] == \"PENDING\" or \\\n", " item[\"data_insight_scan_state\"] == \"STATE_UNSPECIFIED\" or \\\n", " item[\"data_insight_scan_state\"] == \"RUNNING\" or \\\n", " item[\"data_insight_scan_state\"] == \"CANCELING\":\n", " # Update our count\n", " print(f\"Concurrent Scan Count: {item['bigquery_dataset_name']}.{item['bigquery_table_name']} -> {item['data_insight_scan_state']}\")\n", " concurrentScanCount += 1\n", " else:\n", " print(f\"Concurrent Scan Count: {item['bigquery_dataset_name']}.{item['bigquery_table_name']} -> {item['data_insight_scan_state']}\")\n", "\n", " print(f\"concurrentScanCount: {concurrentScanCount}\")\n", "\n", " # Start new scans under our concurrency count\n", " scansStarted = -1\n", " while concurrentScanCount < numberOfScansToRunConcurrently and scansStarted != 0:\n", " # Start new scans up to the concurrency limit\n", " scansStarted = 0\n", " for item in scans_to_perform:\n", " if concurrentScanCount < numberOfScansToRunConcurrently and \\\n", " item[\"data_insight_scan_state\"] == \"\":\n", " # start a new scan\n", " createDataInsightScan(item[\"project_id\"], item[\"dataplex_region\"],\n", " item[\"data_insight_scan_name\"], item[\"data_insight_display_name\"],\n", " item[\"bigquery_dataset_name\"], item[\"bigquery_table_name\"])\n", " started = False\n", " item[\"data_insight_scan_job_name\"] = \"\"\n", " while started == False:\n", " try:\n", " item[\"data_insight_scan_job_name\"] = startDataInsightScan(item[\"project_id\"], item[\"dataplex_region\"], item[\"data_insight_scan_name\"])\n", " item[\"data_insight_scan_state\"] = \"STATE_UNSPECIFIED\"\n", " started = True\n", " scansStarted += 1\n", " concurrentScanCount += 1\n", " except Exception as e:\n", " scan_full_name = f'projects/{item[\"project_id\"]}/locations/{item[\"dataplex_region\"]}/dataScans/{item[\"data_insight_scan_name\"]}'\n", " message = f\"Provided DataScan '{scan_full_name}' does not exist.\"\n", " print(message)\n", " if message in str(e):\n", " print(f\"Data scan is not available to start. Waiting...\")\n", " time.sleep(5)\n", " else:\n", " raise e # Re-raise the exception for other errors\n", "\n", "\n", " # Update the status for the scans that are processing\n", " for item in scans_to_perform:\n", " if item[\"data_insight_scan_state\"] == \"PENDING\" or \\\n", " item[\"data_insight_scan_state\"] == \"STATE_UNSPECIFIED\" or \\\n", " item[\"data_insight_scan_state\"] == \"RUNNING\" or \\\n", " item[\"data_insight_scan_state\"] == \"CANCELING\":\n", " # Get the latest state\n", " item[\"data_insight_scan_state\"] = getStateDataInsightScan(item[\"project_id\"], item[\"dataplex_region\"], item[\"data_insight_scan_job_name\"])\n", "\n", " if concurrentScanCount == 0:\n", " # nothing processing- exit\n", " break\n", " else:\n", " # wait for processing\n", " print(f\"concurrentScanCount: {concurrentScanCount}\")\n", " time.sleep(10)\n", "\n", "# Update the BigQuery labels so our scans show in the Console UI\n", "for item in scans_to_perform:\n", " if item[\"data_insight_scan_state\"] == \"SUCCEEDED\":\n", " # skip CANCELLED or FAILED states\n", " updateBigQueryTableDataplexLabels(item[\"project_id\"], item[\"dataplex_region\"],\n", " \"DATA-INSIGHTS-SCAN\", item[\"data_insight_scan_name\"],\n", " item[\"bigquery_dataset_name\"], item[\"bigquery_table_name\"])\n", "\n", " print(f\"Associated scan for table {item['bigquery_dataset_name']}.{item['bigquery_table_name']} associated with BigQuery Console UI.\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "42IxhtRRrvR-" }, "source": [ "### Clean Up" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6lF2Z7skFbvf" }, "outputs": [], "source": [ "# Placeholder" ] }, { "cell_type": "markdown", "metadata": { "id": "ASQ2BPisXDA0" }, "source": [ "### Reference Links\n" ] }, { "cell_type": "markdown", "metadata": { "id": "rTY6xJdZ3ul8" }, "source": [ "- [REPLACE-ME](https://REPLACE-ME)" ] } ], "metadata": { "colab": { "collapsed_sections": [ "8rRxIQAxABNK", "S7zKuqTpABNL", "HMsUvoF4BP7Y", "m65vp54BUFRi", "UmyL-Rg4Dr_f", "sZ6m_wGrK0YG", "JbOjdSP1kN9T", "42IxhtRRrvR-", "ASQ2BPisXDA0" ], "name": "03-Data-Insights", "private_outputs": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }

data-analytics-demos/bigquery-data-governance/colab-enterprise/03-Data-Insights.ipynb (755 lines of code) (raw):