notebooks/bqml-generate-text-embedding-model.ipynb (288 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "pSOcMZ9jBSCJ"
},
"source": [
"\n",
"# BigQuery ML (BQML) - Generate Text Embedding Using Pre-trained TensorFlow Models\n",
"\n",
"This notebook will explore how to generate NNLM, SWIVEL, and BERT text embedding models using pre-trained TensorFlow models with [TextEmbeddingModelGenerator](https://github.com/GoogleCloudPlatform/bigquery-ml-utils/blob/master/model_generator/text_embedding_generator.py) from the`bigquery-ml-utils` library. The TextEmbeddingModelGenerator automatically loads one of the three text embedding model ([NNLM](https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2), [SWIVEL](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1), [BERT](https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/4)) from TensorFlow hub and integrates a default signature such that the resulting model can be immediately integrated with BQML.\n",
"\n",
"This notebook will cover:\n",
"- Installing appropriate libraries.\n",
"- Generating desired text embedding model with TextEmbeddingModelGenerator.\n",
"- Exporting the generated model to a GCS bucket.\n",
"\n",
"**This content will accompany the blog post - TBD**\n",
"\n",
"---\n",
"\n",
"**Prerequisites:**\n",
"\n",
"None\n",
"\n",
"**Services Used:**\n",
"- BigQuery\n",
"- TensorFlow Hub: Workbench (this notebook)\n",
"- GCS\n",
"\n",
"**Resources:**\n",
"- [BigQuery ML (BQML) Overview](https://cloud.google.com/bigquery/docs/bqml-introduction)\n",
"- [Overview of BQML methods and workflows](https://cloud.google.com/bigquery/docs/e2e-journey)\n",
"- [BigQuery](https://cloud.google.com/bigquery)\n",
" - [Documentation](https://cloud.google.com/bigquery/docs/query-overview)\n",
" - [API](https://cloud.google.com/bigquery/docs/reference/libraries-overview)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lxaimNFUErfC"
},
"source": [
"---\n",
"## Colab Setup\n",
"\n",
"To run this notebook in Colab click [](https://colab.research.google.com/github/GoogleCloudPlatform/bigquery-ml-utils/blob/master/notebooks/bqml-generate-text-embedding-model.ipynb) and run the cells in this section. Otherwise, skip this section.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yj9IHjVg_rYM"
},
"outputs": [],
"source": [
"!pip install bigquery_ml_utils"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "suoRpOeXBO0f"
},
"source": [
"**RESTART RUNTIME**\n",
"\n",
"The Next cell will restart the runtime by first stopping it and then Colab will automatically restart - you may need to dismiss a popup warning letting you know about this unexpected restart. This restart makes the installs above available to the current session."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qpei8j71LVwJ"
},
"outputs": [],
"source": [
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0Jqrjlb9Lllh"
},
"source": [
"---\n",
"## Environment Setup\n",
"Import necessary packages:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "z_9Ll7JbL3l5"
},
"outputs": [],
"source": [
"from bigquery_ml_utils import model_generator\n",
"import tensorflow as tf\n",
"import tensorflow_text"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dWi4Ms4sOfIp"
},
"source": [
"---\n",
"\n",
"## Generate a Text Embedding Model\n",
"`bigquery-ml-utils` currently offers three text embedding models - NNLM, SWIVEL, and BERT."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "12FwYkGxO78X"
},
"source": [
"Initiate desired model and local output directory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8F3eyaZ6PMyf"
},
"outputs": [],
"source": [
"MODEL_NAME = \"swivel\" # options: {\"nnlm\", \"swivel\", \"bert\"}\n",
"LOCAL_OUTPUT_DIR = \"./swivel\" # replace with desired local output directory"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E1j3jHMAPfy9"
},
"source": [
"Establish an instance of TextEmbeddingModelGenerator and generate desired text embedding model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IrzTfgnUPpDF"
},
"outputs": [],
"source": [
"text_embedding_model_generator = model_generator.TextEmbeddingModelGenerator()\n",
"text_embedding_model_generator.generate_text_embedding_model(MODEL_NAME, LOCAL_OUTPUT_DIR)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fHzB4A4rPsDk"
},
"source": [
"Print generated model's signature to confirm that model has been correctly generated:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UESlFb7cP0I4"
},
"outputs": [],
"source": [
"reload_embedding_model = tf.saved_model.load(LOCAL_OUTPUT_DIR)\n",
"print(reload_embedding_model.signatures[\"serving_default\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ANkOq0ZSP4rx"
},
"source": [
"---\n",
"## Export Model to GCS Bucket\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jkow0oipSUfW"
},
"source": [
"Authenticate gcloud account with Google sign-in:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9YUPNq5vQ0Vp"
},
"outputs": [],
"source": [
"import googleapiclient\n",
"from google.colab import auth as google_auth\n",
"\n",
"PROJECT_ID=\"sample-project-id\" # replace with project ID\n",
"\n",
"google_auth.authenticate_user()\n",
"!gcloud config set project {PROJECT_ID}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P8rtBdTxRLi_"
},
"source": [
"Copy model's contents to specified GCS bucket:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ly1U-nbqSBxV"
},
"outputs": [],
"source": [
"GCS_BUCKET=\"bashtest\" # replace with GCS bucket name\n",
"\n",
"# if bucket doesn't exist, make new bucket\n",
"# otherwise, use existing bucket\n",
"!gsutil mb gs://{GCS_BUCKET}/\n",
"\n",
"# copy model's content to bucket and list out the bucket's content\n",
"!gsutil cp -r {LOCAL_OUTPUT_DIR} gs://{GCS_BUCKET}/\n",
"!gsutil ls gs://{GCS_BUCKET}/{MODEL_NAME}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kENRlvYmSf9Y"
},
"source": [
"---\n",
"\n",
"## Make predictions\n",
"\n",
"Once the generated text embedding model is copied into a GCS bucket, make predictions by following one of the two paths listed below:\n",
"- NNLM, SWIVEL: [Make predictions with imported TensorFlow models](https://cloud.google.com/bigquery/docs/making-predictions-with-imported-tensorflow-models)\n",
"- BERT: [Make predictions with remote models on Vertex AI](https://cloud.google.com/bigquery/docs/bigquery-ml-remote-model-tutorial)"
]
}
],
"metadata": {
"colab": {
"private_outputs": true,
"provenance": [
{
"file_id": "1764POEDanktlxne7WxHla6cReEwXgaM8",
"timestamp": 1690992478894
}
]
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}