{
"cells": [
{
"cell_type": "markdown",
"id": "5afb9e19-d1ce-4777-b6c0-53b58bdbfc05",
"metadata": {},
"source": [
"# Enhance checkbox performance with OCR 2.0"
]
},
{
"cell_type": "markdown",
"id": "57ad0cd7-7c40-44e6-99b7-595c01d9bced",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This tool assists in improving checkbox performance by importing OCR2.0 output JSON processed with premium features into CDE2.0. Make sure to create a schema that includes entity names using efficient prompt engineering techniques and utilize the auto-labeling feature with a foundational model that reduces labeling efforts by 80%.\n",
"\n",
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.\n",
"\n",
"## Prerequisites\n",
" 1. Vertex AI Notebook\n",
" 2. Access to Projects and Document AI Processors\n"
]
},
{
"cell_type": "markdown",
"id": "e904a37c-5d34-424c-b0e4-4b7541c1065c",
"metadata": {
"tags": []
},
"source": [
"## Step by Step procedure "
]
},
{
"cell_type": "markdown",
"id": "e61a193a-969e-4747-a394-e5aa33f4a31c",
"metadata": {},
"source": [
"### Processes Documents through OCR 2.0\n",
"\n",
"Run the below code with inputs of yours"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb430d60-628d-483b-b1a1-392af638b797",
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py\n",
"# !pip install google-cloud-documentai #Run if needed"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3f0dea2-758c-4e5b-b68d-c372e991c415",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import utilities\n",
"import pandas as pd\n",
"from google.cloud import documentai_v1 as documentai\n",
"from google.api_core.client_options import ClientOptions\n",
"from google.longrunning.operations_pb2 import GetOperationRequest\n",
"from google.cloud import storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e54ebe9a-796c-4fa9-a503-acc0e62f2dfe",
"metadata": {},
"outputs": [],
"source": [
"# Configuration\n",
"project_id = \"<PROJECT_ID>\" # Google Cloud Project ID\n",
"location = \"<LOCATION>\" # Document AI Processor Location (e.g., \"us\" or \"eu\")\n",
"processor_id = \"<PROCESSOR_ID>\" # Document AI OCR Processor ID (Make sure to set the OCR V2 Version as Default)\n",
"gcs_input_uri = \"<GCS_INPUT_URI>\" # Google Cloud Storage input folder path containing PDF or image files with checkboxes\n",
"gcs_output_uri = \"<GCS_OUTPUT_URI>\" # Google Cloud Storage output folder path where the JSON results will be stored\n",
"\n",
"\n",
"def batch_process_documents_with_premium_options(\n",
" project_id: str,\n",
" location: str,\n",
" processor_id: str,\n",
" gcs_input_uri: str,\n",
" gcs_output_uri: str,\n",
" processor_version_id: Optional[str] = None,\n",
" timeout: int = 500,\n",
") -> Any:\n",
" \"\"\"It will perform Batch Process on raw input documents\n",
"\n",
" Args:\n",
" project_id (str): GCP project ID\n",
" location (str): Processor location us or eu\n",
" processor_id (str): GCP DocumentAI ProcessorID\n",
" gcs_input_uri (str): GCS path which contains all input files\n",
" gcs_output_uri (str): GCS path to store processed JSON results\n",
" processor_version_id (str, optional): VersionID of GCP DocumentAI Processor. Defaults to None.\n",
" timeout (int, optional): Maximum waiting time for operation to complete. Defaults to 500.\n",
"\n",
" Returns:\n",
" operation.Operation: LRO operation ID for current batch-job\n",
" \"\"\"\n",
"\n",
" opts = {\"api_endpoint\": f\"{location}-documentai.googleapis.com\"}\n",
" client = documentai.DocumentProcessorServiceClient(client_options=opts)\n",
" input_config = documentai.BatchDocumentsInputConfig(\n",
" gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)\n",
" )\n",
" output_config = documentai.DocumentOutputConfig(\n",
" gcs_output_config={\"gcs_uri\": gcs_output_uri}\n",
" )\n",
" process_options = documentai.ProcessOptions(\n",
" ocr_config=documentai.OcrConfig(\n",
" enable_native_pdf_parsing=True,\n",
" enable_image_quality_scores=True,\n",
" enable_symbol=True,\n",
" premium_features=documentai.OcrConfig.PremiumFeatures(\n",
" enable_selection_mark_detection=True,\n",
" compute_style_info=True,\n",
" enable_math_ocr=True,\n",
" ),\n",
" )\n",
" )\n",
" print(\"Documents are processing(batch-documents)...\")\n",
" name = (\n",
" client.processor_version_path(\n",
" project_id, location, processor_id, processor_version_id\n",
" )\n",
" if processor_version_id\n",
" else client.processor_path(project_id, location, processor_id)\n",
" )\n",
" request = documentai.types.document_processor_service.BatchProcessRequest(\n",
" name=name,\n",
" input_documents=input_config,\n",
" document_output_config=output_config,\n",
" process_options=process_options,\n",
" )\n",
" operation = client.batch_process_documents(request)\n",
" print(\"Waiting for operation to complete...\")\n",
" operation.result(timeout=timeout)\n",
" return operation\n",
"\n",
"\n",
"# Batch process documents using the specified processor\n",
"process_documents_result = batch_process_documents_with_premium_options(\n",
" project_id=project_id,\n",
" location=location,\n",
" processor_id=processor_id,\n",
" gcs_input_uri=gcs_input_uri,\n",
" gcs_output_uri=gcs_output_uri,\n",
")\n",
"\n",
"print(\"Batch processing result:\", process_documents_result)"
]
},
{
"cell_type": "markdown",
"id": "67469482-174d-475b-ac82-9ecc47c612cb",
"metadata": {},
"source": [
"## Instructions\n",
"\n",
"## Follow any of the two approaches mentioned below\n",
"\n",
"1. Uptrain or Finetune a Pretrained CDE Model\n",
"2. Custom CDE Model"
]
},
{
"cell_type": "markdown",
"id": "2b765663-394e-4006-82c6-c458abd0f0a5",
"metadata": {},
"source": [
"## Approach 1 - Uptrain or Finetune a Pretrained CDE Model"
]
},
{
"cell_type": "markdown",
"id": "180edd9b-91fa-4728-b07e-3c2d3d0c03d6",
"metadata": {},
"source": [
"1. Create a CDE processor and ensure that you have access to `pretrained-foundation-model-v1.1-2024-03-12` in your CDE processor. If you don't have access, raise a bug and request to be allowlisted for your project.\n",
"\n",
"2. Add the entity schema for the checkbox entities to your processor. Here's an example schema:\n",
"\n",
"<td><img src=\"./images/cde_2_checkbox_schema.png\" style=\"height: 200px; width: 600px;\"></td>\n",
"\n",
"3. Import the OCR 2.0 JSON files obtained from the step "Process Documents through OCR 2.0" into the processor. Import them as training and test data. During the import process, select the option to auto-label using the `pretrained-foundation-model-v1.1-2024-03-12` processor.\n",
"\n",
"4. Once the JSON files are auto-labeled and imported into the processor, verify the documents to ensure that all the checkbox labels are correct. Mark them as labeled after verifying all the auto-labeled entities.\n",
"\n",
"5. Uptrain a model on top of `pretrained-foundation-model-v1.1-2024-03-12` or any existing version that has OCR 2.0 as the base OCR. The uptraining process typically takes around an hour and creates a new model version.\n",
"\n",
"6. If the up-trained version doesn't provide the targeted F1-score, experiment with fine-tuning using different knobs. For example:\n",
" - Train step: 400/350/300\n",
" - Learning rate: 0.5/1.5/0.6/1\n",
" \n",
"<td><img src=\"./images/finetuned.png\" style=\"height: 450px; width: 400px;\"></td>\n",
" \n",
" Fine-tuning may take more than 3 hours for model training and creates a new fine-tuned version.\n",
"\n",
"By following these steps, you can uptrain or finetune a pretrained CDE model to improve its performance on checkbox entity recognition."
]
},
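{
"cell_type": "markdown",
"id": "b3d2f5c1-1a2b-4c3d-8e4f-000000000001",
"metadata": {},
"source": [
"Optionally, before importing and uptraining, you can verify via the API which processor versions are available on your CDE processor (including the pretrained foundation model). The cell below is a best-effort sketch: it reuses the `project_id` and `location` variables defined earlier, and `cde_processor_id` is a placeholder for your CDE processor's ID."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3d2f5c1-1a2b-4c3d-8e4f-000000000002",
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: list the versions available on your CDE processor.\n",
"# `cde_processor_id` is a placeholder for your CDE processor's ID.\n",
"cde_processor_id = \"<CDE_PROCESSOR_ID>\"\n",
"\n",
"client = documentai.DocumentProcessorServiceClient(\n",
"    client_options={\"api_endpoint\": f\"{location}-documentai.googleapis.com\"}\n",
")\n",
"parent = client.processor_path(project_id, location, cde_processor_id)\n",
"for version in client.list_processor_versions(parent=parent):\n",
"    print(version.display_name, \"->\", version.name)"
]
},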
{
"cell_type": "markdown",
"id": "573e7ffa-c93d-4f85-8a3a-721be7e418c8",
"metadata": {},
"source": [
"## Checkbox Post Processing script for Finetuned model version\n",
"\n",
"From the Fine-Tuned processor version, you may not find the following field \"normalizedValue\":{\"booleanValue\":true} for checkboxes, if it is so. Here is the post processing script to handle the check_boxes efficiently shown with sample output.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cfe9e634-36b3-42e2-b209-ceb229ead895",
"metadata": {},
"outputs": [],
"source": [
"project_id = \"YOUR_PROJECT_ID\" # Google Cloud project ID\n",
"location = \"us\" # Processor location Eg. us or eu\n",
"processor_id = \"YOUR_CDE_PROCESSOR_ID\" # CDE processor ID\n",
"processor_version = \"YOUR_CDE_PROCESSOR_VERSION\" # CDE processor Version either uptrained or Finetuned which you got from the above step\n",
"bucket_name = \"YOUR_BUCKET_NAME\"\n",
"file_path = (\n",
" \"PATH/TO/YOUR/FILE.pdf\" # Path to the file within the specified bucket of your PDF\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12e5d94a-5a8e-45ea-a6d3-d0127ef3f4a5",
"metadata": {},
"outputs": [],
"source": [
"# read pdf file from the bucket\n",
"storage_client = storage.Client()\n",
"bucket = storage_client.bucket(bucket_name)\n",
"blob = bucket.blob(file_path)\n",
"pdf_bytes = blob.download_as_bytes()\n",
"\n",
"# post-processing code to handle checkboxes\n",
"res = utilities.process_document_sample(\n",
" project_id, location, processor_id, pdf_bytes, processor_version\n",
")\n",
"\n",
"entity_list = []\n",
"for ent in res.document.entities:\n",
" normalized_value = ent.mention_text\n",
" if ent.type_.endswith(\"checkbox\"):\n",
" if \"☑\" in ent.mention_text:\n",
" normalized_value = \"True\"\n",
" elif \"☐\" in ent.mention_text:\n",
" normalized_value = \"False\"\n",
" else:\n",
" pass\n",
" entity_list.append((ent.type_, ent.mention_text, ent.confidence, normalized_value))\n",
"entities_df = pd.DataFrame(\n",
" entity_list,\n",
" columns=[\"Entity_Name\", \"Entity_Value\", \"Confidence_Score\", \"Normalized_Value\"],\n",
")"
]
},
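{
"cell_type": "markdown",
"id": "b3d2f5c1-1a2b-4c3d-8e4f-000000000003",
"metadata": {},
"source": [
"If you would rather work with a typed boolean than the \"True\"/\"False\" strings above, the glyph check can be factored into a small helper and, if desired, written back into each entity's `normalized_value.boolean_value` (the field that may be absent in fine-tuned output). This is a best-effort sketch that reuses the `res` result from the previous cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3d2f5c1-1a2b-4c3d-8e4f-000000000004",
"metadata": {},
"outputs": [],
"source": [
"def checkbox_state(mention_text: str):\n",
"    \"\"\"Maps OCR 2.0 selection-mark glyphs to a boolean (None if no glyph found).\"\"\"\n",
"    if \"☑\" in mention_text:\n",
"        return True\n",
"    if \"☐\" in mention_text:\n",
"        return False\n",
"    return None\n",
"\n",
"\n",
"# Write the boolean back onto each checkbox entity's normalized value.\n",
"for ent in res.document.entities:\n",
"    if ent.type_.endswith(\"checkbox\"):\n",
"        state = checkbox_state(ent.mention_text)\n",
"        if state is not None:\n",
"            ent.normalized_value.boolean_value = state"
]
},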
{
"cell_type": "markdown",
"id": "197505f3-53ce-498d-aa0f-cbfaae42688d",
"metadata": {
"tags": []
},
"source": [
"### Output\n",
"\n",
"<td><img src=\"./images/df_output.png\" style=\"height: 200px; width: 600px;\"></td>"
]
},
{
"cell_type": "markdown",
"id": "7ccea18a-6d39-474c-ace3-f3e05d46c00d",
"metadata": {},
"source": [
"## Approach 2 - Custom CDE Model"
]
},
{
"cell_type": "markdown",
"id": "861d76c8-a005-4a25-b57c-8578df16b26f",
"metadata": {},
"source": [
"If you are not seeing the checkboxes detected properly with the Fine-Tuned version and have less F1 score. Try to train a custom model based cde processor from scratch with the same OCR2.0 jsons.\n",
"\n",
"<td><img src=\"./images/model_based.png\" style=\"height: 450px; width: 420px;\"></td>\n",
"\n",
"\n",
"Once the custom model based processor is trained you can use that for the checkbox prediction.\n",
"\n",
"**Caveat** : If we process the pdf’s directly using the custom model, it fails to predict checkboxes as the custom model is using OCR1.0 as a base ocr. Hence we need to process the pdf with OCR 2.0 first and feed the output json to a custom trained model. Here is the script for the same with a sample processor output."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e778f2d-897f-4e15-9573-a34e87881395",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import utilities\n",
"\n",
"# Configuration\n",
"project_id = \"<PROJECT_ID>\" # Google Cloud Project ID\n",
"location = \"<LOCATION>\" # Document AI Processor Location (e.g., \"us\" or \"eu\")\n",
"processor_id = \"<PROCESSOR_ID>\" # Doc AI Processor ID (Make sure to set the Custom model based Version which you got from previous step as Default)\n",
"gcs_input_uri = \"<GCS_INPUT_URI>\" # Google Cloud Storage input folder path of OCR 2.0 JSON files obtained from the step \"Process Documents through OCR 2.0\"\n",
"gcs_output_uri = \"<GCS_OUTPUT_URI>\" # Google Cloud Storage output folder path where the JSON results will be stored\n",
"\n",
"# Batch process documents using the specified processor\n",
"process_documents_result = utilities.batch_process_documents_sample(\n",
" project_id=project_id,\n",
" location=location,\n",
" processor_id=processor_id,\n",
" gcs_input_uri=gcs_input_uri,\n",
" gcs_output_uri=gcs_output_uri,\n",
")\n",
"\n",
"print(\"Batch processing result:\", process_documents_result)"
]
},
{
"cell_type": "markdown",
"id": "c955a0b6-7f16-4e70-9270-988e4a5b6892",
"metadata": {
"tags": []
},
"source": [
"**Output: JSON in the Output folder with the checkbox entities**\n",
"\n",
"<td><img src=\"./images/model_json_out.png\" style=\"height: 450px; width: 450px;\"></td>"
]
},
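{
"cell_type": "markdown",
"id": "b3d2f5c1-1a2b-4c3d-8e4f-000000000005",
"metadata": {},
"source": [
"As a best-effort sketch, the output JSONs can be loaded back from Cloud Storage and their checkbox entities inspected. This assumes the `gcs_output_uri` variable from the previous cell and an output path of the form `gs://bucket/prefix`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3d2f5c1-1a2b-4c3d-8e4f-000000000006",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: load the batch output JSONs and print the checkbox entities.\n",
"from google.cloud import documentai_v1 as documentai\n",
"from google.cloud import storage\n",
"\n",
"storage_client = storage.Client()\n",
"bucket_name, _, prefix = gcs_output_uri.removeprefix(\"gs://\").partition(\"/\")\n",
"bucket = storage_client.bucket(bucket_name)\n",
"for blob in bucket.list_blobs(prefix=prefix):\n",
"    if not blob.name.endswith(\".json\"):\n",
"        continue\n",
"    doc = documentai.Document.from_json(\n",
"        blob.download_as_bytes(), ignore_unknown_fields=True\n",
"    )\n",
"    for ent in doc.entities:\n",
"        if ent.type_.endswith(\"checkbox\"):\n",
"            print(blob.name, ent.type_, ent.mention_text, ent.confidence)"
]
},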
{
"cell_type": "code",
"execution_count": null,
"id": "fbf38143-7016-4a41-a184-f216a2193d0a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"environment": {
"kernel": "conda-root-py",
"name": "workbench-notebooks.m118",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m118"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}