toolbox-batch-processing/documentai-toolbox-batch-entity-extraction.ipynb (466 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "2c03d1b3", "metadata": {}, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "id": "bd9f052f", "metadata": {}, "source": [ "# Batch Processing with Document AI Toolbox\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/document-ai-samples/blob/main/toolbox-batch-processing/documentai-toolbox-batch-entity-extraction.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fdocument-ai-samples%2Fmain%2Ftoolbox-batch-processing%2Fdocumentai-toolbox-batch-entity-extraction.ipynb\">\n", " <img width=\"32px\" src=\"https://storage.googleapis.com/github-repo/colab_enterprise.svg\" alt=\"Google Cloud Colab Enterprise logo\"><br> Run in Colab Enterprise\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/document-ai-samples/blob/main/toolbox-batch-processing/documentai-toolbox-batch-entity-extraction.ipynb\">\n", " <img width=\"32px\" 
src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/blob/main/toolbox-batch-processing/documentai-toolbox-batch-entity-extraction.ipynb\">\n", " <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", "</table>\n" ] }, { "cell_type": "markdown", "id": "bad843d2", "metadata": {}, "source": [ "[Document AI Toolbox](https://cloud.google.com/document-ai/docs/toolbox) is an SDK for Python that provides utility\n", "functions for managing, manipulating, and extracting information from the [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) object.\n", "\n", "It creates a [\"wrapped\" document object](https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document) from a processed document response from JSON files in\n", "Cloud Storage, local JSON files, or output directly from the [`process_document()`](https://cloud.google.com/document-ai/docs/reference/rest/v1/projects.locations.processors/process) method.\n", "\n", "It can perform the following actions:\n", "\n", "- Combine fragmented [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) JSON files from Batch Processing into a single [\"wrapped\" document](https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document).\n", "  - Export shards as a unified [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document).\n", "\n", "- Get 
[`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) output from:\n", "  - [Cloud Storage](https://cloud.google.com/storage)\n", "  - [`BatchProcessMetadata`](https://cloud.google.com/document-ai/docs/reference/rest/Shared.Types/BatchProcessMetadata)\n", "  - [`Operation` name](https://cloud.google.com/document-ai/docs/reference/rest/Shared.Types/ListOperationsResponse#Operation.FIELDS.name)\n", "\n", "- Access text from [`Pages`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#page), [`Lines`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#line), [`Paragraphs`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#paragraph), [`FormFields`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#formfield), and [`Tables`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#table) without handling [`Layout`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#Layout) information.\n", "\n", "- Search for [`Pages`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#page) containing a target string or matching a regular expression.\n", "\n", "- Search for [`FormFields`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#formfield) by name.\n", "\n", "- Search for [`Entities`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#entity) by type.\n", "\n", "- Convert [`Tables`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#table) to a [pandas](https://pandas.pydata.org/) DataFrame or CSV.\n", "\n", "- Insert [`Entities`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#entity) and [`FormFields`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#formfield) into a [BigQuery](https://cloud.google.com/bigquery) table.\n", "\n", "- Split a PDF file based on [output from a Splitter/Classifier 
processor](https://cloud.google.com/document-ai/docs/splitters).\n", "\n", "- Extract image [`Entities`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#entity) from [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) [bounding boxes](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#boundingpoly).\n", "\n", "- Convert [`Documents`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) to and from commonly used formats:\n", "  - [Cloud Vision API](https://cloud.google.com/vision) [`AnnotateFileResponse`](https://cloud.google.com/vision/docs/reference/rest/v1/BatchAnnotateFilesResponse#AnnotateFileResponse)\n", "  - [hOCR](https://en.wikipedia.org/wiki/HOCR)\n", "  - Third-party document processing formats\n", "\n", "- Create batches of documents for processing from a [Cloud Storage](https://cloud.google.com/) folder.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1ouFwNhyEWPf", "metadata": { "id": "1ouFwNhyEWPf" }, "outputs": [], "source": [ "%pip install --upgrade --user -q google-cloud-documentai google-cloud-documentai-toolbox pandas" ] }, { "cell_type": "markdown", "id": "d61f2f79", "metadata": {}, "source": [ "**Colab only:** Run the following cell to restart the kernel, or use the restart button. In Vertex AI Workbench, you can restart the kernel using the button at the top." ] }, { "cell_type": "code", "execution_count": null, "id": "af87f6d8", "metadata": {}, "outputs": [], "source": [ "# Automatically restart kernel after installs so that your environment can access the new packages\n", "import IPython\n", "\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "id": "d10c67cc", "metadata": {}, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. 
⚠️</b>\n", "</div>" ] }, { "cell_type": "markdown", "id": "cd2caade", "metadata": {}, "source": [ "### Authenticating your notebook environment\n", "\n", "* If you are using **Colab** to run this notebook, run the cell below and continue.\n", "* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)." ] }, { "cell_type": "code", "execution_count": null, "id": "665bf8e0", "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "# Additional authentication is required for Google Colab\n", "if \"google.colab\" in sys.modules:\n", "    # Authenticate user to Google Cloud\n", "    from google.colab import auth\n", "\n", "    auth.authenticate_user()" ] }, { "cell_type": "code", "execution_count": null, "id": "7DSQUZl7wtY8", "metadata": { "id": "7DSQUZl7wtY8" }, "outputs": [], "source": [ "# TODO(developer): Fill these variables before running the sample.\n", "project_id = \"YOUR_PROJECT_ID\"  # @param {type:\"string\"}\n", "# https://cloud.google.com/document-ai/docs/regions\n", "location = \"us\"  # @param {type:\"string\"}\n", "\n", "# Create processor before running sample\n", "# https://cloud.google.com/document-ai/docs/create-processor\n", "processor_id = \"YOUR_PROCESSOR_ID\"  # @param {type:\"string\"}\n", "# https://cloud.google.com/document-ai/docs/manage-processor-versions\n", "processor_version_id = \"stable\"  # @param {type:\"string\"}\n", "\n", "# Format: `gs://bucket/directory/`\n", "gcs_input_uri = \"YOUR_INPUT_BUCKET\"  # @param {type:\"string\"}\n", "# Must end with a trailing slash `/`. Format: `gs://bucket/directory/subdirectory/`\n", "gcs_output_uri = \"YOUR_OUTPUT_BUCKET\"  # @param {type:\"string\"}\n", "\n", "batch_size = 1000\n", "# Optional. 
The fields to return in the Document object.\n", "field_mask = \"text,entities,pages,shardInfo\" # @param {type:\"string\"}" ] }, { "cell_type": "code", "execution_count": null, "id": "Sh1oolV7Mael", "metadata": { "id": "Sh1oolV7Mael" }, "outputs": [], "source": [ "# Set the project id\n", "!gcloud config set project {project_id}\n", "!gcloud auth application-default login -q" ] }, { "cell_type": "code", "execution_count": null, "id": "x9hTHkFrX_5N", "metadata": { "executionInfo": { "elapsed": 257, "status": "ok", "timestamp": 1694541469217, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "x9hTHkFrX_5N" }, "outputs": [], "source": [ "from IPython.display import display\n", "\n", "from typing import List, Optional\n", "\n", "# https://googleapis.dev/python/google-api-core/latest/client_options.html\n", "from google.api_core.client_options import ClientOptions\n", "\n", "# https://cloud.google.com/python/docs/reference/documentai/latest\n", "from google.cloud import documentai\n", "\n", "# https://cloud.google.com/document-ai/docs/toolbox\n", "from google.cloud import documentai_toolbox\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "3Iaq7M5MvkqG", "metadata": { "id": "3Iaq7M5MvkqG" }, "source": [ "## Batch Processing\n", "\n", "- Create batches of 1000 documents in Google Cloud Storage.\n", "- Make a batch processing request for each batch.\n", "- Get long-running operation ID for each request." 
] }, { "cell_type": "code", "execution_count": null, "id": "35856bf2-aa5e-436b-977a-9e5725b1a595", "metadata": { "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1694541463780, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "35856bf2-aa5e-436b-977a-9e5725b1a595", "trusted": true }, "outputs": [], "source": [ "def batch_process_toolbox(\n", "    project_id: str,\n", "    location: str,\n", "    processor_id: str,\n", "    processor_version_id: str,\n", "    gcs_input_uri: str,\n", "    gcs_output_uri: str,\n", "    batch_size: int,\n", "    field_mask: Optional[str] = None,\n", "    skip_human_review: bool = True,\n", ") -> List:\n", "    client = documentai.DocumentProcessorServiceClient(\n", "        client_options=ClientOptions(\n", "            api_endpoint=f\"{location}-documentai.googleapis.com\"\n", "        )\n", "    )\n", "\n", "    # The full resource name of the processor version, e.g.:\n", "    # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}\n", "    name = client.processor_version_path(\n", "        project_id, location, processor_id, processor_version_id\n", "    )\n", "\n", "    # Cloud Storage URI for the Output Directory\n", "    output_config = documentai.DocumentOutputConfig(\n", "        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(\n", "            gcs_uri=gcs_output_uri, field_mask=field_mask\n", "        )\n", "    )\n", "\n", "    # Create batches of documents for processing\n", "    # https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.utilities.gcs_utilities\n", "    gcs_bucket_name, gcs_prefix = documentai_toolbox.gcs_utilities.split_gcs_uri(\n", "        gcs_input_uri\n", "    )\n", "    batches = documentai_toolbox.gcs_utilities.create_batches(\n", "        gcs_bucket_name, gcs_prefix, batch_size=batch_size\n", "    )\n", "\n", "    operations = []\n", "\n", "    print(f\"{len(batches)} batches created.\")\n", "    for batch in batches:\n", "        print(f\"{len(batch.gcs_documents.documents)} files in batch.\")\n", "        
print(batch.gcs_documents.documents)\n", "\n", "        # https://cloud.google.com/document-ai/docs/send-request?hl=en#async-processor\n", "        # `batch_process_documents()` returns a Long Running Operation (LRO)\n", "        operation = client.batch_process_documents(\n", "            request=documentai.BatchProcessRequest(\n", "                name=name,\n", "                input_documents=batch,\n", "                document_output_config=output_config,\n", "                skip_human_review=skip_human_review,\n", "            )\n", "        )\n", "        operations.append(operation)\n", "\n", "    return operations" ] }, { "cell_type": "markdown", "id": "op0ZCWTIwDgR", "metadata": { "id": "op0ZCWTIwDgR" }, "source": [ "## Retrieve results once processing is complete\n", "\n", "- Get output [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) JSON files from `gcs_output_uri` based on the Operation ID." ] }, { "cell_type": "code", "execution_count": null, "id": "KxVFCVNVLLwW", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2604, "status": "ok", "timestamp": 1694541481158, "user": { "displayName": "", "userId": "" }, "user_tz": 300 }, "id": "KxVFCVNVLLwW", "outputId": "2ada6f15-b774-4f55-fa73-b0e7064cd437" }, "outputs": [], "source": [ "operations = batch_process_toolbox(\n", "    project_id,\n", "    location,\n", "    processor_id,\n", "    processor_version_id,\n", "    gcs_input_uri,\n", "    gcs_output_uri,\n", "    batch_size,\n", "    field_mask,\n", ")\n", "\n", "# Can do this asynchronously to avoid blocking\n", "documents: List[documentai_toolbox.document.Document] = []\n", "\n", "TIMEOUT = 60\n", "\n", "for operation in operations:\n", "    # https://cloud.google.com/document-ai/docs/long-running-operations\n", "    print(f\"Waiting for operation {operation.operation.name}\")\n", "    operation.result(timeout=TIMEOUT)\n", "    documents.extend(\n", "        documentai_toolbox.document.Document.from_batch_process_metadata(\n", "            documentai.BatchProcessMetadata(operation.metadata)\n", "        )\n", "    )" ] }, { "cell_type": "markdown", 
"id": "445FQsfrwc4N", "metadata": { "id": "445FQsfrwc4N" }, "source": [ "## Print results\n", "\n", "- Export extracted entities as dictionary\n", "- Load into Pandas DataFrame\n", "- Print DataFrame" ] }, { "cell_type": "code", "execution_count": null, "id": "f1486462", "metadata": {}, "outputs": [], "source": [ "for document in documents:\n", "    # https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document#google_cloud_documentai_toolbox_wrappers_document_Document_entities_to_dict\n", "    entities = document.entities_to_dict()\n", "    # Optional: Export to BQ\n", "    # job = document.entities_to_bigquery(dataset_name, table_name, project_id=project_id)\n", "\n", "    df = pd.DataFrame([entities])\n", "\n", "    display(df)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.undefined.undefined" } }, "nbformat": 4, "nbformat_minor": 5 }