incubator-tools/enhance_checkbox/code.ipynb (392 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "id": "5afb9e19-d1ce-4777-b6c0-53b58bdbfc05", "metadata": {}, "source": [ "# Enhance checkbox performance with OCR 2.0" ] }, { "cell_type": "markdown", "id": "57ad0cd7-7c40-44e6-99b7-595c01d9bced", "metadata": {}, "source": [ "## Objective\n", "\n", "This tool assists in improving checkbox performance by importing OCR2.0 output JSON processed with premium features into CDE2.0. Make sure to create a schema that includes entity names using efficient prompt engineering techniques and utilize the auto-labeling feature with a foundational model that reduces labeling efforts by 80%.\n", "\n", "## Disclaimer\n", "\n", "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.\n", "\n", "## Prerequisites\n", " 1. Vertex AI Notebook\n", " 2. Access to Projects and Document AI Processors\n" ] }, { "cell_type": "markdown", "id": "e904a37c-5d34-424c-b0e4-4b7541c1065c", "metadata": { "tags": [] }, "source": [ "## Step by Step procedure " ] }, { "cell_type": "markdown", "id": "e61a193a-969e-4747-a394-e5aa33f4a31c", "metadata": {}, "source": [ "### Processes Documents through OCR 2.0\n", "\n", "Run the below code with inputs of yours" ] }, { "cell_type": "code", "execution_count": null, "id": "fb430d60-628d-483b-b1a1-392af638b797", "metadata": {}, "outputs": [], "source": [ "# Run this cell to download utilities module\n", "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py\n", "# !pip install google-cloud-documentai #Run if needed" ] }, { "cell_type": "code", "execution_count": null, "id": "f3f0dea2-758c-4e5b-b68d-c372e991c415", "metadata": {}, "outputs": [], "source": [ "import json\n", "import utilities\n", "import pandas as pd\n", "from google.cloud import documentai_v1 as documentai\n", "from google.api_core.client_options import ClientOptions\n", "from google.longrunning.operations_pb2 import GetOperationRequest\n", "from google.cloud import storage" ] }, { "cell_type": "code", "execution_count": null, "id": "e54ebe9a-796c-4fa9-a503-acc0e62f2dfe", "metadata": {}, "outputs": [], "source": [ "# Configuration\n", "project_id = \"<PROJECT_ID>\" # Google Cloud Project ID\n", "location = \"<LOCATION>\" # Document AI Processor Location (e.g., \"us\" or \"eu\")\n", "processor_id = \"<PROCESSOR_ID>\" # Document AI OCR Processor ID (Make sure to set the OCR V2 Version as Default)\n", "gcs_input_uri = \"<GCS_INPUT_URI>\" # Google Cloud Storage input folder path containing PDF or image files with checkboxes\n", "gcs_output_uri = \"<GCS_OUTPUT_URI>\" # Google Cloud Storage output folder path where the JSON results will be stored\n", "\n", "\n", "def batch_process_documents_with_premium_options(\n", " project_id: str,\n", " location: str,\n", " processor_id: str,\n", " gcs_input_uri: str,\n", " gcs_output_uri: str,\n", " processor_version_id: Optional[str] = None,\n", " timeout: int = 500,\n", ") -> Any:\n", " \"\"\"It will perform Batch Process on raw input documents\n", "\n", " Args:\n", " project_id (str): GCP project ID\n", " location (str): Processor location us or eu\n", " processor_id (str): GCP DocumentAI ProcessorID\n", " gcs_input_uri (str): GCS path which contains all input files\n", " gcs_output_uri (str): GCS path to store processed JSON results\n", " processor_version_id (str, optional): VersionID of GCP DocumentAI Processor. Defaults to None.\n", " timeout (int, optional): Maximum waiting time for operation to complete. Defaults to 500.\n", "\n", " Returns:\n", " operation.Operation: LRO operation ID for current batch-job\n", " \"\"\"\n", "\n", " opts = {\"api_endpoint\": f\"{location}-documentai.googleapis.com\"}\n", " client = documentai.DocumentProcessorServiceClient(client_options=opts)\n", " input_config = documentai.BatchDocumentsInputConfig(\n", " gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)\n", " )\n", " output_config = documentai.DocumentOutputConfig(\n", " gcs_output_config={\"gcs_uri\": gcs_output_uri}\n", " )\n", " process_options = documentai.ProcessOptions(\n", " ocr_config=documentai.OcrConfig(\n", " enable_native_pdf_parsing=True,\n", " enable_image_quality_scores=True,\n", " enable_symbol=True,\n", " premium_features=documentai.OcrConfig.PremiumFeatures(\n", " enable_selection_mark_detection=True,\n", " compute_style_info=True,\n", " enable_math_ocr=True,\n", " ),\n", " )\n", " )\n", " print(\"Documents are processing(batch-documents)...\")\n", " name = (\n", " client.processor_version_path(\n", " project_id, location, processor_id, processor_version_id\n", " )\n", " if processor_version_id\n", " else client.processor_path(project_id, location, processor_id)\n", " )\n", " request = documentai.types.document_processor_service.BatchProcessRequest(\n", " name=name,\n", " input_documents=input_config,\n", " document_output_config=output_config,\n", " process_options=process_options,\n", " )\n", " operation = client.batch_process_documents(request)\n", " print(\"Waiting for operation to complete...\")\n", " operation.result(timeout=timeout)\n", " return operation\n", "\n", "\n", "# Batch process documents using the specified processor\n", "process_documents_result = batch_process_documents_with_premium_options(\n", " project_id=project_id,\n", " location=location,\n", " processor_id=processor_id,\n", " gcs_input_uri=gcs_input_uri,\n", " gcs_output_uri=gcs_output_uri,\n", ")\n", "\n", "print(\"Batch processing result:\", process_documents_result)" ] }, { "cell_type": "markdown", "id": "67469482-174d-475b-ac82-9ecc47c612cb", "metadata": {}, "source": [ "## Instructions\n", "\n", "## Follow any of the two approaches mentioned below\n", "\n", "1. Uptrain or Finetune a Pretrained CDE Model\n", "2. Custom CDE Model" ] }, { "cell_type": "markdown", "id": "2b765663-394e-4006-82c6-c458abd0f0a5", "metadata": {}, "source": [ "## Approach 1 - Uptrain or Finetune a Pretrained CDE Model" ] }, { "cell_type": "markdown", "id": "180edd9b-91fa-4728-b07e-3c2d3d0c03d6", "metadata": {}, "source": [ "1. Create a CDE processor and ensure that you have access to `pretrained-foundation-model-v1.1-2024-03-12` in your CDE processor. If you don't have access, raise a bug and request to be allowlisted for your project.\n", "\n", "2. Add the entity schema for the checkbox entities to your processor. Here's an example schema:\n", "\n", "<td><img src=\"./images/cde_2_checkbox_schema.png\" style=\"height: 200px; width: 600px;\"></td>\n", "\n", "3. Import the OCR 2.0 JSON files obtained from the step &quot;Process Documents through OCR 2.0&quot; into the processor. Import them as training and test data. During the import process, select the option to auto-label using the `pretrained-foundation-model-v1.1-2024-03-12` processor.\n", "\n", "4. Once the JSON files are auto-labeled and imported into the processor, verify the documents to ensure that all the checkbox labels are correct. Mark them as labeled after verifying all the auto-labeled entities.\n", "\n", "5. Uptrain a model on top of `pretrained-foundation-model-v1.1-2024-03-12` or any existing version that has OCR 2.0 as the base OCR. The uptraining process typically takes around an hour and creates a new model version.\n", "\n", "6. If the up-trained version doesn't provide the targeted F1-score, experiment with fine-tuning using different knobs. For example:\n", " - Train step: 400/350/300\n", " - Learning rate: 0.5/1.5/0.6/1\n", " \n", "<td><img src=\"./images/finetuned.png\" style=\"height: 450px; width: 400px;\"></td>\n", " \n", " Fine-tuning may take more than 3 hours for model training and creates a new fine-tuned version.\n", "\n", "By following these steps, you can uptrain or finetune a pretrained CDE model to improve its performance on checkbox entity recognition." ] }, { "cell_type": "markdown", "id": "573e7ffa-c93d-4f85-8a3a-721be7e418c8", "metadata": {}, "source": [ "## Checkbox Post Processing script for Finetuned model version\n", "\n", "From the Fine-Tuned processor version, you may not find the following field \"normalizedValue\":{\"booleanValue\":true} for checkboxes, if it is so. Here is the post processing script to handle the check_boxes efficiently shown with sample output.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "cfe9e634-36b3-42e2-b209-ceb229ead895", "metadata": {}, "outputs": [], "source": [ "project_id = \"YOUR_PROJECT_ID\" # Google Cloud project ID\n", "location = \"us\" # Processor location Eg. us or eu\n", "processor_id = \"YOUR_CDE_PROCESSOR_ID\" # CDE processor ID\n", "processor_version = \"YOUR_CDE_PROCESSOR_VERSION\" # CDE processor Version either uptrained or Finetuned which you got from the above step\n", "bucket_name = \"YOUR_BUCKET_NAME\"\n", "file_path = (\n", " \"PATH/TO/YOUR/FILE.pdf\" # Path to the file within the specified bucket of your PDF\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "12e5d94a-5a8e-45ea-a6d3-d0127ef3f4a5", "metadata": {}, "outputs": [], "source": [ "# read pdf file from the bucket\n", "storage_client = storage.Client()\n", "bucket = storage_client.bucket(bucket_name)\n", "blob = bucket.blob(file_path)\n", "pdf_bytes = blob.download_as_bytes()\n", "\n", "# post-processing code to handle checkboxes\n", "res = utilities.process_document_sample(\n", " project_id, location, processor_id, pdf_bytes, processor_version\n", ")\n", "\n", "entity_list = []\n", "for ent in res.document.entities:\n", " normalized_value = ent.mention_text\n", " if ent.type_.endswith(\"checkbox\"):\n", " if \"☑\" in ent.mention_text:\n", " normalized_value = \"True\"\n", " elif \"☐\" in ent.mention_text:\n", " normalized_value = \"False\"\n", " else:\n", " pass\n", " entity_list.append((ent.type_, ent.mention_text, ent.confidence, normalized_value))\n", "entities_df = pd.DataFrame(\n", " entity_list,\n", " columns=[\"Entity_Name\", \"Entity_Value\", \"Confidence_Score\", \"Normalized_Value\"],\n", ")" ] }, { "cell_type": "markdown", "id": "197505f3-53ce-498d-aa0f-cbfaae42688d", "metadata": { "tags": [] }, "source": [ "### Output\n", "\n", "<td><img src=\"./images/df_output.png\" style=\"height: 200px; width: 600px;\"></td>" ] }, { "cell_type": "markdown", "id": "7ccea18a-6d39-474c-ace3-f3e05d46c00d", "metadata": {}, "source": [ "## Approach 2 - Custom CDE Model" ] }, { "cell_type": "markdown", "id": "861d76c8-a005-4a25-b57c-8578df16b26f", "metadata": {}, "source": [ "If you are not seeing the checkboxes detected properly with the Fine-Tuned version and have less F1 score. Try to train a custom model based cde processor from scratch with the same OCR2.0 jsons.\n", "\n", "<td><img src=\"./images/model_based.png\" style=\"height: 450px; width: 420px;\"></td>\n", "\n", "\n", "Once the custom model based processor is trained you can use that for the checkbox prediction.\n", "\n", "**Caveat** : If we process the pdf’s directly using the custom model, it fails to predict checkboxes as the custom model is using OCR1.0 as a base ocr. Hence we need to process the pdf with OCR 2.0 first and feed the output json to a custom trained model. Here is the script for the same with a sample processor output." ] }, { "cell_type": "code", "execution_count": null, "id": "9e778f2d-897f-4e15-9573-a34e87881395", "metadata": { "tags": [] }, "outputs": [], "source": [ "import utilities\n", "\n", "# Configuration\n", "project_id = \"<PROJECT_ID>\" # Google Cloud Project ID\n", "location = \"<LOCATION>\" # Document AI Processor Location (e.g., \"us\" or \"eu\")\n", "processor_id = \"<PROCESSOR_ID>\" # Doc AI Processor ID (Make sure to set the Custom model based Version which you got from previous step as Default)\n", "gcs_input_uri = \"<GCS_INPUT_URI>\" # Google Cloud Storage input folder path of OCR 2.0 JSON files obtained from the step \"Process Documents through OCR 2.0\"\n", "gcs_output_uri = \"<GCS_OUTPUT_URI>\" # Google Cloud Storage output folder path where the JSON results will be stored\n", "\n", "# Batch process documents using the specified processor\n", "process_documents_result = utilities.batch_process_documents_sample(\n", " project_id=project_id,\n", " location=location,\n", " processor_id=processor_id,\n", " gcs_input_uri=gcs_input_uri,\n", " gcs_output_uri=gcs_output_uri,\n", ")\n", "\n", "print(\"Batch processing result:\", process_documents_result)" ] }, { "cell_type": "markdown", "id": "c955a0b6-7f16-4e70-9270-988e4a5b6892", "metadata": { "tags": [] }, "source": [ "**Output: JSON in the Output folder with the checkbox entities**\n", "\n", "<td><img src=\"./images/model_json_out.png\" style=\"height: 450px; width: 450px;\"></td>" ] }, { "cell_type": "code", "execution_count": null, "id": "fbf38143-7016-4a41-a184-f216a2193d0a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "environment": { "kernel": "conda-root-py", "name": "workbench-notebooks.m118", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m118" }, "kernelspec": { "display_name": "Python 3 (ipykernel) (Local)", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }