{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "2c03d1b3",
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"id": "bd9f052f",
"metadata": {},
"source": [
"# Batch Processing with Document AI Toolbox\n",
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/document-ai-samples/blob/main/toolbox-batch-processing/documentai-toolbox-batch-entity-extraction.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fdocument-ai-samples%2Fmain%2Ftoolbox-batch-processing%2Fdocumentai-toolbox-batch-entity-extraction.ipynb\">\n",
" <img width=\"32px\" src=\"https://storage.googleapis.com/github-repo/colab_enterprise.svg\" alt=\"Google Cloud Colab Enterprise logo\"><br> Run in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/document-ai-samples/blob/main/toolbox-batch-processing/documentai-toolbox-batch-entity-extraction.ipynb\">\n",
" <img width=\"32px\" src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/blob/main/toolbox-batch-processing/documentai-toolbox-batch-entity-extraction.ipynb\">\n",
" <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
"</table>\n"
]
},
{
"cell_type": "markdown",
"id": "bad843d2",
"metadata": {},
"source": [
"[Document AI Toolbox](https://cloud.google.com/document-ai/docs/toolbox) is an SDK for Python that provides utility\n",
"functions for managing, manipulating, and extracting information from the [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) object.\n",
"\n",
"It creates a [\"wrapped\" document object](https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document) from a processed document response from JSON files in\n",
"Cloud Storage, local JSON files, or output directly from the [`process_document()`](https://cloud.google.com/document-ai/docs/reference/rest/v1/projects.locations.processors/process) method.\n",
"\n",
"It can perform the following actions:\n",
"\n",
"- Combine fragmented [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) JSON files from Batch Processing into a single [\"wrapped\" document](https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document).\n",
" - Export shards as a unified [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document).\n",
"\n",
"- Get [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) output from:\n",
" - [Cloud Storage](https://cloud.google.com/storage)\n",
" - [`BatchProcessMetadata`](https://cloud.google.com/document-ai/docs/reference/rest/Shared.Types/BatchProcessMetadata)\n",
" - [`Operation` name](https://cloud.google.com/document-ai/docs/reference/rest/Shared.Types/ListOperationsResponse#Operation.FIELDS.name)\n",
"\n",
"- Access text from [`Pages`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#page), [`Lines`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#line), [`Paragraphs`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#paragraph), [`FormFields`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#formfield), and [`Tables`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#table) without handling [`Layout`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#Layout) information.\n",
"\n",
"- Search for [`Pages`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#page) containing a target string or matching a regular expression.\n",
"\n",
"- Search for [`FormFields`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#formfield) by name.\n",
"\n",
"- Search for [`Entities`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#entity) by type.\n",
"\n",
"- Convert [`Tables`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#table) to a [Pandas](https://pandas.pydata.org/) Dataframe or CSV.\n",
"\n",
"- Insert [`Entities`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#entity) and [`FormFields`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#formfield) into a [BigQuery](https://cloud.google.com/bigquery) table.\n",
"\n",
"- Split a PDF file based on [output from a Splitter/Classifier processor]([#splitting](https://cloud.google.com/document-ai/docs/splitters)).\n",
"\n",
"- Extract image [`Entities`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#entity) from [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) [bounding boxes](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#boundingpoly).\n",
"\n",
"- Convert [`Documents`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) to and from commonly used formats:\n",
" - [Cloud Vision API](https://cloud.google.com/vision) [`AnnotateFileResponse`](https://cloud.google.com/vision/docs/reference/rest/v1/BatchAnnotateFilesResponse#AnnotateFileResponse)\n",
" - [hOCR](https://en.wikipedia.org/wiki/HOCR)\n",
" - Third-party document processing formats\n",
"\n",
"- Create batches of documents for processing from a [Cloud Storage](https://cloud.google.com/) folder.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ouFwNhyEWPf",
"metadata": {
"id": "1ouFwNhyEWPf"
},
"outputs": [],
"source": [
"%pip install --upgrade --user -q google-cloud-documentai google-cloud-documentai-toolbox pandas"
]
},
{
"cell_type": "markdown",
"id": "d61f2f79",
"metadata": {},
"source": [
"**Colab only:** Run the following cell to restart the kernel or use the restart button. For Vertex AI Workbench you can restart the terminal using the button on top."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af87f6d8",
"metadata": {},
"outputs": [],
"source": [
"# Automatically restart kernel after installs so that your environment can access the new packages\n",
"import IPython\n",
"\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "d10c67cc",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "cd2caade",
"metadata": {},
"source": [
"### Authenticating your notebook environment\n",
"\n",
"* If you are using **Colab** to run this notebook, uncomment the cell below and continue.\n",
"* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "665bf8e0",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"# Additional authentication is required for Google Colab\n",
"if \"google.colab\" in sys.modules:\n",
" # Authenticate user to Google Cloud\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7DSQUZl7wtY8",
"metadata": {
"id": "7DSQUZl7wtY8"
},
"outputs": [],
"source": [
"# TODO(developer): Fill these variables before running the sample.\n",
"project_id = \"YOUR_PROJECT_ID\" # @param {type:\"string\"}\n",
"# https://cloud.google.com/document-ai/docs/regions\n",
"location = \"us\" # @param {type:\"string\"}\n",
"\n",
"# Create processor before running sample\n",
"# https://cloud.google.com/document-ai/docs/create-processor\n",
"processor_id = \"YOUR_PROCESSOR_ID\" # @param {type:\"string\"}\n",
"# https://cloud.google.com/document-ai/docs/manage-processor-versions\n",
"processor_version_id = \"stable\" # @param {type:\"string\"}\n",
"\n",
"# Format: `gs://bucket/directory/`\n",
"gcs_input_uri = \"YOUR_INPUT_BUCKET\" # @param {type:\"string\"}\n",
"# Must end with a trailing slash `/`. Format: `gs://bucket/directory/subdirectory/`\n",
"gcs_output_uri = \"YOUR_OUTPUT_BUCKET\" # @param {type:\"string\"}\n",
"\n",
"batch_size = 1000\n",
"# Optional. The fields to return in the Document object.\n",
"field_mask = \"text,entities,pages,shardInfo\" # @param {type:\"string\"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "Sh1oolV7Mael",
"metadata": {
"id": "Sh1oolV7Mael"
},
"outputs": [],
"source": [
"# Set the project id\n",
"!gcloud config set project {project_id}\n",
"!gcloud auth application-default login -q"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "x9hTHkFrX_5N",
"metadata": {
"executionInfo": {
"elapsed": 257,
"status": "ok",
"timestamp": 1694541469217,
"user": {
"displayName": "",
"userId": ""
},
"user_tz": 300
},
"id": "x9hTHkFrX_5N"
},
"outputs": [],
"source": [
"from IPython.display import display\n",
"\n",
"from typing import List, Optional\n",
"\n",
"# https://googleapis.dev/python/google-api-core/latest/client_options.html\n",
"from google.api_core.client_options import ClientOptions\n",
"\n",
"# https://cloud.google.com/python/docs/reference/documentai/latest\n",
"from google.cloud import documentai\n",
"\n",
"# https://cloud.google.com/document-ai/docs/toolbox\n",
"from google.cloud import documentai_toolbox\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "3Iaq7M5MvkqG",
"metadata": {
"id": "3Iaq7M5MvkqG"
},
"source": [
"## Batch Processing\n",
"\n",
"- Create batches of 1000 documents in Google Cloud Storage.\n",
"- Make a batch processing request for each batch.\n",
"- Get long-running operation ID for each request."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35856bf2-aa5e-436b-977a-9e5725b1a595",
"metadata": {
"executionInfo": {
"elapsed": 3,
"status": "ok",
"timestamp": 1694541463780,
"user": {
"displayName": "",
"userId": ""
},
"user_tz": 300
},
"id": "35856bf2-aa5e-436b-977a-9e5725b1a595",
"trusted": true
},
"outputs": [],
"source": [
"def batch_process_toolbox(\n",
" project_id: str,\n",
" location: str,\n",
" processor_id: str,\n",
" processor_version_id: str,\n",
" gcs_input_uri: str,\n",
" gcs_output_uri: str,\n",
" batch_size: int,\n",
" field_mask: Optional[str] = None,\n",
" skip_human_review: bool = True,\n",
") -> List:\n",
" client = documentai.DocumentProcessorServiceClient(\n",
" client_options=ClientOptions(\n",
" api_endpoint=f\"{location}-documentai.googleapis.com\"\n",
" )\n",
" )\n",
"\n",
" # The full resource name of the processor version, e.g.:\n",
" # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}\n",
" name = client.processor_version_path(\n",
" project_id, location, processor_id, processor_version_id\n",
" )\n",
"\n",
" # Cloud Storage URI for the Output Directory\n",
" output_config = documentai.DocumentOutputConfig(\n",
" gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(\n",
" gcs_uri=gcs_output_uri, field_mask=field_mask\n",
" )\n",
" )\n",
"\n",
" # Create batches of documents for processing\n",
" # https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.utilities.gcs_utilities\n",
" gcs_bucket_name, gcs_prefix = documentai_toolbox.gcs_utilities.split_gcs_uri(\n",
" gcs_input_uri\n",
" )\n",
" batches = documentai_toolbox.gcs_utilities.create_batches(\n",
" gcs_bucket_name, gcs_prefix, batch_size=batch_size\n",
" )\n",
"\n",
" operations = []\n",
"\n",
" print(f\"{len(batches)} batches created.\")\n",
" for batch in batches:\n",
" print(f\"{len(batch.gcs_documents.documents)} files in batch.\")\n",
" print(batch.gcs_documents.documents)\n",
"\n",
" # https://cloud.google.com/document-ai/docs/send-request?hl=en#async-processor\n",
" # `batch_process_documents()` returns a Long Running Operation (LRO)\n",
" operation = client.batch_process_documents(\n",
" request=documentai.BatchProcessRequest(\n",
" name=name,\n",
" input_documents=batch,\n",
" document_output_config=output_config,\n",
" skip_human_review=skip_human_review,\n",
" )\n",
" )\n",
" operations.append(operation)\n",
"\n",
" return operations"
]
},
{
"cell_type": "markdown",
"id": "op0ZCWTIwDgR",
"metadata": {
"id": "op0ZCWTIwDgR"
},
"source": [
"## Retrieve results once processing is complete\n",
"\n",
"- Get output [`Document`](https://cloud.google.com/document-ai/docs/reference/rest/v1/Document) JSON files from `gcs_output_bucket` based on the Operation ID."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "KxVFCVNVLLwW",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 2604,
"status": "ok",
"timestamp": 1694541481158,
"user": {
"displayName": "",
"userId": ""
},
"user_tz": 300
},
"id": "KxVFCVNVLLwW",
"outputId": "2ada6f15-b774-4f55-fa73-b0e7064cd437"
},
"outputs": [],
"source": [
"operations = batch_process_toolbox(\n",
" project_id,\n",
" location,\n",
" processor_id,\n",
" processor_version_id,\n",
" gcs_input_uri,\n",
" gcs_output_uri,\n",
" batch_size,\n",
" field_mask,\n",
")\n",
"\n",
"# Can do this asynchronously to avoid blocking\n",
"documents: List[documentai_toolbox.document.Document] = []\n",
"\n",
"TIMEOUT = 60\n",
"\n",
"for operation in operations:\n",
" # https://cloud.google.com/document-ai/docs/long-running-operations\n",
" print(f\"Waiting for operation {operation.operation.name}\")\n",
" operation.result(timeout=TIMEOUT)\n",
" documents.extend(\n",
" documentai_toolbox.document.Document.from_batch_process_metadata(\n",
" documentai.BatchProcessMetadata(operation.metadata)\n",
" )\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "445FQsfrwc4N",
"metadata": {
"id": "445FQsfrwc4N"
},
"source": [
"## Print results\n",
"\n",
"- Export extracted entities as dictionary\n",
"- Load into Pandas DataFrame\n",
"- Print DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1486462",
"metadata": {},
"outputs": [],
"source": [
"for document in documents:\n",
" # https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document#google_cloud_documentai_toolbox_wrappers_document_Document_entities_to_dict\n",
" entities = document.entities_to_dict()\n",
" # Optional: Export to BQ\n",
" # job = document.entities_to_bigquery(dataset_name, table_name, project_id=project_id)\n",
"\n",
" df = pd.DataFrame([entities])\n",
"\n",
" display(df)"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.undefined.undefined"
}
},
"nbformat": 4,
"nbformat_minor": 5
}