notebooks/bqml-generate-text-embedding-model.ipynb (288 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "pSOcMZ9jBSCJ" }, "source": [ "\n", "# BigQuery ML (BQML) - Generate Text Embedding Using Pre-trained TensorFlow Models\n", "\n", "This notebook will explore how to generate NNLM, SWIVEL, and BERT text embedding models using pre-trained TensorFlow models with [TextEmbeddingModelGenerator](https://github.com/GoogleCloudPlatform/bigquery-ml-utils/blob/master/model_generator/text_embedding_generator.py) from the`bigquery-ml-utils` library. The TextEmbeddingModelGenerator automatically loads one of the three text embedding model ([NNLM](https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2), [SWIVEL](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1), [BERT](https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/4)) from TensorFlow hub and integrates a default signature such that the resulting model can be immediately integrated with BQML.\n", "\n", "This notebook will cover:\n", "- Installing appropriate libraries.\n", "- Generating desired text embedding model with TextEmbeddingModelGenerator.\n", "- Exporting the generated model to a GCS bucket.\n", "\n", "**This content will accompany the blog post - TBD**\n", "\n", "---\n", "\n", "**Prerequisites:**\n", "\n", "None\n", "\n", "**Services Used:**\n", "- BigQuery\n", "- TensorFlow Hub: Workbench (this notebook)\n", "- GCS\n", "\n", "**Resources:**\n", "- [BigQuery ML (BQML) Overview](https://cloud.google.com/bigquery/docs/bqml-introduction)\n", "- [Overview of BQML methods and workflows](https://cloud.google.com/bigquery/docs/e2e-journey)\n", "- [BigQuery](https://cloud.google.com/bigquery)\n", " - [Documentation](https://cloud.google.com/bigquery/docs/query-overview)\n", " - [API](https://cloud.google.com/bigquery/docs/reference/libraries-overview)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lxaimNFUErfC" }, "source": [ "---\n", "## Colab Setup\n", "\n", "To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/bigquery-ml-utils/blob/master/notebooks/bqml-generate-text-embedding-model.ipynb) and run the cells in this section. Otherwise, skip this section.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yj9IHjVg_rYM" }, "outputs": [], "source": [ "!pip install bigquery_ml_utils" ] }, { "cell_type": "markdown", "metadata": { "id": "suoRpOeXBO0f" }, "source": [ "**RESTART RUNTIME**\n", "\n", "The Next cell will restart the runtime by first stopping it and then Colab will automatically restart - you may need to dismiss a popup warning letting you know about this unexpected restart. This restart makes the installs above available to the current session." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qpei8j71LVwJ" }, "outputs": [], "source": [ "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "0Jqrjlb9Lllh" }, "source": [ "---\n", "## Environment Setup\n", "Import necessary packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "z_9Ll7JbL3l5" }, "outputs": [], "source": [ "from bigquery_ml_utils import model_generator\n", "import tensorflow as tf\n", "import tensorflow_text" ] }, { "cell_type": "markdown", "metadata": { "id": "dWi4Ms4sOfIp" }, "source": [ "---\n", "\n", "## Generate a Text Embedding Model\n", "`bigquery-ml-utils` currently offers three text embedding models - NNLM, SWIVEL, and BERT." ] }, { "cell_type": "markdown", "metadata": { "id": "12FwYkGxO78X" }, "source": [ "Initiate desired model and local output directory:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8F3eyaZ6PMyf" }, "outputs": [], "source": [ "MODEL_NAME = \"swivel\" # options: {\"nnlm\", \"swivel\", \"bert\"}\n", "LOCAL_OUTPUT_DIR = \"./swivel\" # replace with desired local output directory" ] }, { "cell_type": "markdown", "metadata": { "id": "E1j3jHMAPfy9" }, "source": [ "Establish an instance of TextEmbeddingModelGenerator and generate desired text embedding model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IrzTfgnUPpDF" }, "outputs": [], "source": [ "text_embedding_model_generator = model_generator.TextEmbeddingModelGenerator()\n", "text_embedding_model_generator.generate_text_embedding_model(MODEL_NAME, LOCAL_OUTPUT_DIR)" ] }, { "cell_type": "markdown", "metadata": { "id": "fHzB4A4rPsDk" }, "source": [ "Print generated model's signature to confirm that model has been correctly generated:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UESlFb7cP0I4" }, "outputs": [], "source": [ "reload_embedding_model = tf.saved_model.load(LOCAL_OUTPUT_DIR)\n", "print(reload_embedding_model.signatures[\"serving_default\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "ANkOq0ZSP4rx" }, "source": [ "---\n", "## Export Model to GCS Bucket\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jkow0oipSUfW" }, "source": [ "Authenticate gcloud account with Google sign-in:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9YUPNq5vQ0Vp" }, "outputs": [], "source": [ "import googleapiclient\n", "from google.colab import auth as google_auth\n", "\n", "PROJECT_ID=\"sample-project-id\" # replace with project ID\n", "\n", "google_auth.authenticate_user()\n", "!gcloud config set project {PROJECT_ID}" ] }, { "cell_type": "markdown", "metadata": { "id": "P8rtBdTxRLi_" }, "source": [ "Copy model's contents to specified GCS bucket:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ly1U-nbqSBxV" }, "outputs": [], "source": [ "GCS_BUCKET=\"bashtest\" # replace with GCS bucket name\n", "\n", "# if bucket doesn't exist, make new bucket\n", "# otherwise, use existing bucket\n", "!gsutil mb gs://{GCS_BUCKET}/\n", "\n", "# copy model's content to bucket and list out the bucket's content\n", "!gsutil cp -r {LOCAL_OUTPUT_DIR} gs://{GCS_BUCKET}/\n", "!gsutil ls gs://{GCS_BUCKET}/{MODEL_NAME}" ] }, { "cell_type": "markdown", "metadata": { "id": "kENRlvYmSf9Y" }, "source": [ "---\n", "\n", "## Make predictions\n", "\n", "Once the generated text embedding model is copied into a GCS bucket, make predictions by following one of the two paths listed below:\n", "- NNLM, SWIVEL: [Make predictions with imported TensorFlow models](https://cloud.google.com/bigquery/docs/making-predictions-with-imported-tensorflow-models)\n", "- BERT: [Make predictions with remote models on Vertex AI](https://cloud.google.com/bigquery/docs/bigquery-ml-remote-model-tutorial)" ] } ], "metadata": { "colab": { "private_outputs": true, "provenance": [ { "file_id": "1764POEDanktlxne7WxHla6cReEwXgaM8", "timestamp": 1690992478894 } ] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }