notebook/gen_ai_jss.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "JAPoU8Sm5E6e", "tags": [] }, "source": [ "# Generative AI Document Summarization\n", "\n", "<table align=\"left\">\n", "\n", " <td>\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/terraform-genai-doc-summarization/blob/main/notebook/gen_ai_jss.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/terraform-genai-doc-summarization/blob/main/notebook/gen_ai_jss.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td> \n", "</table>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "24743cf4a1e1" }, "source": [ "**_NOTE_**: This notebook has been tested in the following environment:\n", "\n", "* Python version = 3.8.16 (local)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "tvgnzT1CKxrO" }, "source": [ "## Overview\n", "\n", "This notebook is a companion to the [Generative AI Document Summarization Jump Start Solution](https://cloud.google.com/architecture/ai-ml/generative-ai-document-summarization). With this notebook, you can use the summarization solution to create summaries of academic PDF files. In the notebook, you will programmatically upload a PDF file to a Cloud Storage bucket and then view the summary of that PDF in a BigQuery table. \n", "\n", "+ Learn more about [using text chat LLM with Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview).\n", "+ Learn more about [querying tables in Cloud BigQuery](https://cloud.google.com/bigquery/docs/tables).\n", "+ Learn more about [creating EventArc triggers for Cloud Functions](https://cloud.google.com/functions/docs/calling/eventarc).\n", "+ Learn more about [storing data in Cloud Storage](https://cloud.google.com/storage/docs/uploading-objects).\n", "+ Learn more about [transcribing PDFs with Cloud Vision OCR](https://cloud.google.com/vision/docs/pdf)." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "d975e698c9a4" }, "source": [ "### Objective\n", "\n", "In this tutorial, you learn how to trigger a Cloud Function process that transcribes characters from a PDF, stores the complete PDF text in a Storage bucket, summarizes the PDF, and then upserts the document data (summary, complete text, URI) into a BigQuery table.\n", "\n", "This tutorial uses the following Google Cloud services and resources:\n", "\n", "- Vertex AI Generative AI\n", "- Cloud BigQuery\n", "- Cloud Vision OCR\n", "- Cloud EventArc triggers\n", "- Cloud Functions\n", "- Cloud Storage\n", "\n", "The steps performed include:\n", "\n", "- Trigger an EventArc event by uploading a PDF to a Cloud Storage bucket\n", "- Query the BigQuery table to see the results of the summarization process" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "08d289fa873f" }, "source": [ "### Dataset\n", "\n", "This notebook uses a [Kaggle dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) that contains a large collection of academic summaries from [arXiv.org](https://arxiv.org/). This dataset is made publicly available through a Cloud Storage bucket." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "aed92deeb4a0" }, "source": [ "### Costs \n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* BigQuery\n", "* Vision\n", "* Cloud Functions\n", "* EventArc\n", "* Cloud Storage\n", "\n", "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),\n", "and [BigQuery pricing](https://cloud.google.com/bigquery/pricing),\n", "and [Cloud Vision pricing](https://cloud.google.com/vision/pricing),\n", "and [Cloud Functions pricing](https://cloud.google.com/functions/pricing),\n", "and [Cloud EventArc pricing](https://cloud.google.com/eventarc/pricing),\n", "and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), \n", "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/?hl=en_US&_ga=2.16285759.-826855678.1689377111#id=78888c9b-02ac-4130-9327-fecd7f4cfb11)\n", "to generate a cost estimate based on your projected usage." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "i7EUnXsZhAGF" }, "source": [ "## Installation\n", "\n", "Install the following packages required to execute this notebook. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%writefile requirements.txt\n", "\n", "google-cloud-aiplatform\n", "google-cloud-bigquery\n", "google-cloud-logging\n", "google-cloud-storage\n", "google-cloud-vision\n", "polling2\n", "tqdm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2b4ef9b72d43" }, "outputs": [], "source": [ "# Install the packages\n", "import os\n", "\n", "if not os.getenv(\"IS_TESTING\"):\n", " USER = \"--user\"\n", "else:\n", " USER = \"\"\n", "! 
pip3 install {USER} --upgrade -r requirements.txt" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "58707a750154" }, "source": [ "### Restart the kernel (Colab only)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f200f10a1da3" }, "outputs": [], "source": [ "# Automatically restart kernel after installs so that your environment can access the new packages\n", "import IPython\n", "\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "BF1j6f9HApxa" }, "source": [ "## Before you begin\n", "\n", "### Set up your Google Cloud project\n", "\n", "This notebook assumes that you have already deployed this solution using either the [Terraform script](https://github.com/GoogleCloudPlatform/terraform-genai-doc-summarization) or using the [Solutions console](https://console.cloud.google.com/products/solutions/details/generative-ai-document-summarization). During this deployment, several actions required to run this solution were performed on your behalf:\n", "\n", "1. The [Cloud Function](https://console.cloud.google.com/functions/list) was deployed.\n", "\n", "2. The [EventArc trigger](https://console.cloud.google.com/eventarc/triggers) was applied to the input Cloud Storage bucket.\n", "\n", "3. The following APIs were enabled for you: \n", "\n", " - [Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com)\n", " - [BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com)\n", " - [Cloud Vision API](https://console.cloud.google.com/flows/enableapi?apiid=vision.googleapis.com)\n", "\n", "\n", "<div style=\"background-color:rgb(225,245,254); color:rgb(1,87,155); padding:2px;\"><strong>Note:</strong> It is recommended to run this notebook from <a href=\"https://colab.sandbox.google.com/\">Google Colaboratory</a>. If you are running this notebook locally instead, you need to install the <a href=\"https://cloud.google.com/sdk\" target=\"_blank\">Cloud SDK</a>.</div>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "WReHDGG5g0XY" }, "source": [ "#### Set your project ID\n", "\n", "**If you don't know your project ID**, try the following:\n", "* Run `gcloud config list`.\n", "* Run `gcloud projects list`.\n", "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oM1iC_MfAts1" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-name]\" # @param {type:\"string\"}\n", "\n", "# Set the project id\n", "! gcloud config set project {PROJECT_ID}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "region" }, "source": [ "#### Region\n", "\n", "You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "region" }, "outputs": [], "source": [ "REGION = \"us-central1\" # @param {type: \"string\"}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "sBCra4QMA2wR" }, "source": [ "### Authenticate your Google Cloud account\n", "\n", "Depending on your Jupyter environment, you may have to manually authenticate. Follow one of the relevant instructions below." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ef21552ccea8" }, "source": [ "**1. Colab, uncomment and run**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "603adbbf0532" }, "outputs": [], "source": [ "# from google.colab import auth\n", "# auth.authenticate_user()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "de775a3773ba" }, "source": [ "**2. Local JupyterLab instance, uncomment and run:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "254614fa0c46" }, "outputs": [], "source": [ "# ! gcloud auth login" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "f6b2ccc891ed" }, "source": [ "**3. Service account or other**\n", "* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "960505627ddf" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PyQmSRbKA8r-" }, "outputs": [], "source": [ "import json\n", "import os\n", "import polling2\n", "import re\n", "\n", "from datetime import datetime\n", "from typing import Sequence, Mapping\n", "from tqdm.notebook import tqdm\n", "from google.cloud import aiplatform\n", "from google.cloud import bigquery\n", "from google.cloud import logging\n", "from google.cloud import storage\n", "from google.cloud import vision\n", "\n", "import vertexai\n", "from vertexai.preview.language_models import TextGenerationModel\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Download test data\n", "\n", "This Jump Start Solution uses data from [arXiv.org](https://arxiv.org/) to demonstrate the summarization capabilities of Vertex AI. [Kaggle.com](https://www.kaggle.com/datasets/Cornell-University/arxiv) has made many arXiv.org scholarly papers available, free of charge, from a Google Cloud Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List all the comparative linguistics papers from Cloud Storage\n", "! gsutil ls gs://arxiv-dataset/arxiv/cmp-lg/pdf/9404" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the available arXiv.org papers has been selected for you in the following cell. You can swap out the selected paper with another from the same source. Be sure to choose a paper that is single-column format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = '9404002v1'\n", "file_uri = f'gs://arxiv-dataset/arxiv/cmp-lg/pdf/9404/{filename}.pdf'\n", "\n", "# Create a local folder and download some test PDFs\n", "if not os.path.exists('pdfs'):\n", " os.mkdir('pdfs')\n", "\n", "! gsutil cp -r $file_uri pdfs/" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Upload test data to Storage bucket\n", "\n", "The Terraform scripts for this solution applies an EventArc trigger to a Cloud Storage bucket. When a PDF is uploaded to the storage bucket, the EventArc trigger fires, starting the summarization process." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "INPUT_BUCKET = f'{PROJECT_ID}_uploads'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Running the following cell uploads a local PDF file (downloaded in the previous section) to the target Cloud Storage bucket. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file_complete_text = f'{filename}_summary.txt'\n", "pdf = f'pdfs/{filename}.pdf'\n", "\n", "storage_client = storage.Client(project=PROJECT_ID)\n", "bucket = storage_client.bucket(INPUT_BUCKET)\n", "blob = bucket.blob(pdf)\n", "blob.upload_from_filename(pdf)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This upload process kicks off the summarization process. You can view the progress of the summarization process in the [Logs Explorer](https://console.cloud.google.com/logs/query for your project).\n", "\n", "To filter the logs, click the **Log name** drop-down menu and type \"summarization-by-llm\". Select the \"summarization-by-llm\" logger in the menu and then click apply to close the drop-down." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Optional: View summarization process in Cloud Logging\n", "\n", "You can view the results of the summarization Cloud Function as it writes updates to Cloud Logging. Each run of the summarization pipeline is associated with a `cloud_event_id`. By filtering for this ID, you can track the summarization process." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logger_name = 'summarization-by-llm'\n", "\n", "@polling2.poll_decorator(check_success=lambda x: x != '', step=0.5, timeout=90)\n", "def get_cloud_event_id(pdf_filename, bar):\n", " logging_client = logging.Client(project=PROJECT_ID)\n", " logger = logging_client.logger(logger_name)\n", " \n", " pattern = 'cloud_event_id\\((.*)\\):'\n", " cloud_id = ''\n", " for entry in logger.list_entries(filter_=pdf_filename, max_results=100):\n", " entry_text = entry.payload\n", " res = re.search(pattern, entry_text)\n", " if res != None:\n", " cloud_id = res.group(1)\n", " print(cloud_id)\n", " bar.update(100)\n", " \n", " if cloud_id != '':\n", " return cloud_id\n", " return cloud_id" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with tqdm(total=100) as bar:\n", " cloud_event_id = get_cloud_event_id(filename, bar)\n", " bar.close()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the `cloud_event_id`, we can filter for this cloud event." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f'cloud_event_id: {cloud_event_id}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@polling2.poll_decorator(step=10, timeout=70)\n", "def get_cloud_event_logs(cloud_event_id):\n", " print(\"polling\")\n", " logging_client = logging.Client(project=PROJECT_ID)\n", " logger = logging_client.logger(logger_name)\n", " \n", " entries = []\n", " for entry in logger.list_entries(filter_=cloud_event_id, max_results=100):\n", " entry_text = entry.payload\n", " entries.append(entry_text)\n", " return entries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entries = []\n", "bar = tqdm(total=6)\n", "\n", "for _ in range(6):\n", " tmp_entries = get_cloud_event_logs(cloud_event_id)\n", " for e in tmp_entries:\n", " if e not in entries:\n", " bar.update(1)\n", " entries.append(e)\n", " print(e)\n", " \n", " " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Query the BigQuery table to see the summary\n", "\n", "Once the summarization flow has completed, the summary of the PDF document should be available for you to read. To get the summary of the PDF document, you can query the BigQuery table that contains the summary.\n", "\n", "If you do not get a result the first time you run the query, then the summarization pipeline might still be running. You might need to wait a minute to allow the pipeline to finish and to try the query again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bigquery_client = bigquery.Client(project=PROJECT_ID)\n", "\n", "table_name = f\"{PROJECT_ID}.summary_dataset.summary_table\"\n", "\n", "# Compose the SQL query to select the summary for the PDF document\n", "sql_query = f\"SELECT summary FROM `{table_name}` WHERE filename LIKE '%{file_complete_text}%'\"\n", "\n", "job = bigquery_client.query(sql_query)\n", "rows = job.result()\n", "row_list = list(rows)\n", "\n", "if len(row_list) != 0:\n", " summary = row_list[0]\n", "\n", "print(summary['summary'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional: Run pipeline components individually\n", "\n", "The summarization pipeline is composed of multiple independent components. There is a component that performs optical character recognition on the PDF, another that stores data in a Storage bucket, another that performs summarization with a LLM, and yet another that stores new rows into the BigQuery table.\n", "\n", "In this section, you can run each component individually to understand how they work together." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Perform OCR with Cloud Vision\n", "\n", "The first component in the pipeline performs optical character recognition (OCR) using Cloud Vision. Run the following cells to run optical character recognition on the PDF file you downloaded previously.\n", "\n", "Note that OCR can take a while to complete. You might need to wait for a result." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def document_extract(\n", " bucket: str,\n", " name: str,\n", " output_bucket: str,\n", " project_id: str,\n", " timeout: int = 420,\n", ") -> str:\n", " \"\"\"Perform OCR with PDF/TIFF as source files on GCS.\n", "\n", " Original sample is here:\n", " https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/vision/snippets/detect/detect.py#L806\n", "\n", " Note: This function can cause the IOPub data rate to be exceeded on a\n", " Jupyter server. This rate can be changed by setting the variable\n", " `--ServerApp.iopub_data_rate_limit\n", "\n", " Args:\n", " bucket (str): GCS URI of the bucket containing the PDF/TIFF files.\n", " name (str): name of the PDF/TIFF file.\n", " output_bucket: bucket to store output in\n", " timeout (int): Timeout in seconds for the request.\n", "\n", "\n", " Returns:\n", " str: the complete text\n", " \"\"\"\n", "\n", " gcs_source_uri = f\"gs://{bucket}/{name}\"\n", " prefix = \"ocr\"\n", " gcs_destination_uri = f\"gs://{output_bucket}/{prefix}/\"\n", " mime_type = \"application/pdf\"\n", " batch_size = 2\n", "\n", " # Perform Vision OCR\n", " client = vision.ImageAnnotatorClient()\n", "\n", " feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)\n", "\n", " gcs_source = vision.GcsSource(uri=gcs_source_uri)\n", " input_config = vision.InputConfig(gcs_source=gcs_source, mime_type=mime_type)\n", "\n", " gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)\n", " output_config = vision.OutputConfig(\n", " gcs_destination=gcs_destination, batch_size=batch_size\n", " )\n", "\n", " async_request = vision.AsyncAnnotateFileRequest(\n", " features=[feature], input_config=input_config, output_config=output_config\n", " )\n", "\n", " operation = client.async_batch_annotate_files(requests=[async_request])\n", "\n", " print(\"OCR: waiting for the operation to finish.\")\n", " operation.result(timeout=timeout)\n", "\n", " # Once the request has completed and the output has been\n", " # written to GCS, we can list all the output files.\n", " return get_ocr_output_from_bucket(gcs_destination_uri, output_bucket, project_id)\n", "\n", "\n", "def get_ocr_output_from_bucket(gcs_destination_uri: str,\n", " bucket_name: str,\n", " project_id: str) -> str:\n", " \"\"\"Iterates over blobs in output bucket to get full OCR result.\n", "\n", " Arguments:\n", " gcs_destination_uri: the URI where the OCR output was saved.\n", " bucket_name: the name of the bucket where the output was saved.\n", "\n", " Returns:\n", " The full text of the document\n", " \"\"\"\n", " storage_client = storage.Client(project=project_id)\n", "\n", " match = re.match(r\"gs://([^/]+)/(.+)\", gcs_destination_uri)\n", " prefix = match.group(2)\n", " bucket = storage_client.get_bucket(bucket_name)\n", "\n", " # List objects with the given prefix, filtering out folders.\n", " blob_list = [\n", " blob\n", " for blob in list(bucket.list_blobs(prefix=prefix))\n", " if not blob.name.endswith(\"/\")\n", " ]\n", "\n", " # Concatenate all text from the blobs\n", " complete_text = \"\"\n", " for output in blob_list:\n", " json_string = output.download_as_bytes().decode(\"utf-8\")\n", " response = json.loads(json_string)\n", "\n", " # The actual response for the first page of the input file.\n", " page_response = response[\"responses\"][0]\n", " annotation = page_response[\"fullTextAnnotation\"]\n", "\n", " complete_text = complete_text + annotation[\"text\"]\n", "\n", " return 
complete_text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket = \"arxiv-dataset\"\n", "pdf_name = \"arxiv/cmp-lg/pdf/9404/9404002v1.pdf\"\n", "output_bucket = f\"{PROJECT_ID}_output\"\n", "\n", "complete_text = document_extract(bucket=bucket,\n", " name=pdf_name,\n", " output_bucket=output_bucket,\n", " project_id=PROJECT_ID)\n", "\n", "# Entire text is long; print just first 1000 characters\n", "print(complete_text[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summarize with Vertex AI LLM\n", "\n", "Next, you can send the complete text of the PDF to be summarized. Vertex AI allows you to use many different types of LLM models. In this case, you use a LLM model designed for text summarization, `text-bison@001`. You send a prediction request to Vertex AI, providing the name of the LLM you want to use. The Vertex AI service then sends the model's response back to you. In the following cells, the Python SDK for Vertex AI provides all of the helper methods and classes you need to perform this process.\n", "\n", "Note that Vertex AI predictions have a limit of characters that can be sent in a request payload. For this reason, a heuristic is needed to isolate only certain text blocks that you need--the abstract and the conclusion." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def predict_large_language_model(\n", " project_id: str,\n", " model_name: str,\n", " temperature: float,\n", " max_decode_steps: int,\n", " top_p: float,\n", " top_k: int,\n", " content: str,\n", " location: str = \"us-central1\",\n", " tuned_model_name: str = \"\",\n", ") -> str:\n", " \"\"\"Predict using a Large Language Model.\n", "\n", " Args:\n", " project_id (str): the Google Cloud project ID\n", " model_name (str): the name of the LLM model to use\n", " temperature (float): controls the randomness of predictions\n", " max_decode_steps (int): the maximum number of decode steps\n", " top_p (float): cumulative probability of parameter highest vocabulary tokens\n", " top_k (int): number of highest propbability vocabulary tokens to keep for top-k-filtering\n", " content (str): the text to summarize\n", " location (str): the Google Cloud region to run in\n", " tuned_model_name (str): a tuned LLM model to use; default is none\n", "\n", " Returns:\n", " The summarization of the content\n", " \"\"\"\n", " vertexai.init(\n", " project=project_id,\n", " location=location,\n", " )\n", "\n", " model = TextGenerationModel.from_pretrained(model_name)\n", " if tuned_model_name:\n", " model = model.get_tuned_model(tuned_model_name)\n", " response = model.predict(\n", " content,\n", " temperature=temperature,\n", " max_output_tokens=max_decode_steps,\n", " top_k=top_k,\n", " top_p=top_p,\n", " )\n", " return response.text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ABSTRACT_H1 = \"abstract\"\n", "CONCLUSION_H1 = \"conclusion\"\n", "ABSTRACT_LENGTH = 150 * 10 # Abstract recommended max word length * avg 10 letters long\n", "CONCLUSION_LENGTH = 200 * 10 # Conclusion max word legnth * avg 10 letters long\n", "\n", "def truncate_complete_text(complete_text: str) -> str:\n", " \"\"\"Extracts the abstract and conclusion from an academic paper.\n", "\n", " Uses a heuristics to approximate the extent of the abstract and conclusion.\n", " For abstract: assumes beginning after the string `abstract` and extends for 6-7 sentences\n", " For conclusion: 
assumes it begins after the string `conclusion` and extends for 7-9 sentences\n", "\n", "    Args:\n", "        complete_text (str): the complete text of the academic paper\n", "\n", "    Returns:\n", "        str: the truncated paper\n", "    \"\"\"\n", "    complete_text = complete_text.lower()\n", "    abstract_start = complete_text.find(ABSTRACT_H1)\n", "    conclusion_start = complete_text.find(CONCLUSION_H1)\n", "\n", "    abstract = complete_text[abstract_start:abstract_start + ABSTRACT_LENGTH]\n", "    conclusion = complete_text[conclusion_start:]\n", "    if len(conclusion) > CONCLUSION_LENGTH:\n", "        conclusion = conclusion[:CONCLUSION_LENGTH]\n", "\n", "    return f\"\"\"\n", "    Abstract: {abstract}\n", "\n", "    Conclusion: {conclusion}\n", "    \"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_name = \"text-bison@001\"\n", "temperature = 0.2\n", "max_decode_steps = 1024\n", "top_p = 0.8\n", "top_k = 40\n", "\n", "prompt = 'Summarize:'\n", "extracted_text_trunc = truncate_complete_text(complete_text=complete_text)\n", "content = f\"{prompt}\\n{extracted_text_trunc}\"\n", "\n", "summary = predict_large_language_model(\n", "    project_id=PROJECT_ID,\n", "    model_name=model_name,\n", "    temperature=temperature,\n", "    top_p=top_p,\n", "    top_k=top_k,\n", "    max_decode_steps=max_decode_steps,\n", "    content=content)\n", "\n", "print(summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Store summary in Cloud Storage\n", "\n", "The outputs from multiple steps in the summarization process are stored in Cloud Storage. The following cells save the summarization text as a TXT file in a Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def upload_to_gcs(project: str, bucket: str, name: str, data: str):\n", "    \"\"\"Upload a string to a Google Cloud Storage bucket.\n", "\n", "    Args:\n", "        project (str): the Google Cloud project ID\n", "        bucket (str): the name of the Storage bucket. Do not include \"gs://\"\n", "        name (str): the name of the file to create in the bucket\n", "        data (str): the data to store\n", "\n", "    \"\"\"\n", "    client = storage.Client(project=project)\n", "    bucket = client.get_bucket(bucket)\n", "    blob = bucket.blob(name)\n", "    blob.upload_from_string(data)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "summary_text_filename = \"summaries/manual.txt\"\n", "\n", "upload_to_gcs(project=PROJECT_ID, bucket=output_bucket, name=summary_text_filename, data=summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upsert data into BigQuery\n", "\n", "Now that you have the summary for the file, you can update the BigQuery table that contains all of the file summaries.\n", "\n", "The following cells update the BigQuery table in your project named `summary_dataset.summary_table` with the summaries created by the Vertex AI LLM."
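] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before inserting rows, you can optionally confirm that the destination table exists and inspect its schema. This quick check assumes the `summary_dataset.summary_table` table created by the solution's deployment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional check: print the schema of the table that the next cells insert into.\n", "bq_client = bigquery.Client(project=PROJECT_ID)\n", "table = bq_client.get_table(f\"{PROJECT_ID}.summary_dataset.summary_table\")\n", "for field in table.schema:\n", "    print(f\"{field.name}: {field.field_type}\")"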
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def write_summarization_to_table(\n", " project_id: str,\n", " dataset_id: str,\n", " table_id: str,\n", " bucket: str,\n", " filename: str,\n", " complete_text: str,\n", " complete_text_uri: str,\n", " summary: str,\n", " summary_uri: str,\n", " timestamp: datetime,\n", ") -> Sequence[Mapping]:\n", " \"\"\"Updates the BigQuery table with the document summarization\n", "\n", " Original sample is here:\n", " https://cloud.google.com/bigquery/docs/samples/bigquery-table-insert-rows-explicit-none-insert-ids\n", "\n", " Args:\n", " project_id (str): the Google Cloud project ID\n", " dataset_id (str): the name of the BigQuery dataset\n", " table_id (str): the name of the BigQuery table\n", " bucket (str): the name of the bucket with the PDF\n", " filename (str): path of PDF relative to bucket root\n", " complete_text (str): the complete text of the PDF\n", " complete_text_uri (str): the Storage URI of the complete TXT document\n", " summary (str): the text summary of the document\n", " summary_uri (str): the Storage URI of the summary TXT document\n", " timestamp (datetime): when the processing occurred\n", " \"\"\"\n", " client = bigquery.Client(project=project_id)\n", "\n", " table_name = f\"{project_id}.{dataset_id}.{table_id}\"\n", "\n", " rows_to_insert = [\n", " {\n", " \"bucket\": bucket,\n", " \"filename\": filename,\n", " \"extracted_text\": complete_text,\n", " \"summary_uri\": summary_uri,\n", " \"summary\": summary,\n", " \"complete_text_uri\": complete_text_uri,\n", " \"timestamp\": timestamp.isoformat(),\n", " }\n", " ]\n", "\n", " errors = client.insert_rows_json(\n", " table_name, rows_to_insert, row_ids=bigquery.AutoRowIDs.GENERATE_UUID\n", " )\n", " if errors != []:\n", " logging_client = logging.Client()\n", " logger = logging_client.logger(logger_name)\n", " logger.log(\n", " f\"Encountered errors while inserting rows: {errors}\", severity=\"ERROR\"\n", " )\n", " return errors\n", "\n", " return []" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_id = \"summary_dataset\"\n", "table_id = \"summary_table\"\n", "bucket = \"gs://arxiv-dataset\"\n", "\n", "errors = write_summarization_to_table(\n", " project_id=PROJECT_ID,\n", " dataset_id=dataset_id,\n", " table_id=table_id,\n", " bucket=bucket,\n", " filename=pdf_name,\n", " complete_text=complete_text,\n", " complete_text_uri=\"\",\n", " summary=summary,\n", " summary_uri=f\"gs://{output_bucket}/{summary_text_filename}\",\n", " timestamp=datetime.now(),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can query the BigQuery table to ensure that the PDF summary has been inserted into the table." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bigquery_client = bigquery.Client(project=PROJECT_ID)\n", "\n", "table_name = f\"{PROJECT_ID}.summary_dataset.summary_table\"\n", "\n", "# Compose the SQL query to select the summary for the PDF document\n", "sql_query = f\"SELECT summary FROM `{table_name}` WHERE filename LIKE '%{pdf_name}%'\"\n", "\n", "job = bigquery_client.query(sql_query)\n", "rows = job.result()\n", "row_list = list(rows)\n", "\n", "if len(row_list) != 0:\n", " summary = row_list[0]\n", "\n", "print(summary['summary'])" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "notebook_template.ipynb", "toc_visible": true }, "environment": { "kernel": "python3", "name": "common-cu110.m103", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cu110:m103" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 4 }