# Enhance checkbox performance with OCR 2.0

## Objective

This tool helps improve checkbox extraction by importing OCR 2.0 output JSON, processed with premium features, into CDE 2.0. Create a schema whose entity names follow good prompt-engineering practice, and use the auto-labeling feature with a foundation model, which reduces labeling effort by 80%.

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Prerequisites
 1. Vertex AI Notebook
 2. Access to Projects and Document AI Processors


## Step-by-Step Procedure

### Process Documents through OCR 2.0

Run the code below with your own inputs.

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py
# !pip install google-cloud-documentai #Run if needed

In [None]:
import json
from typing import Any, Optional  # Used in the function signature below
import utilities
import pandas as pd
from google.cloud import documentai_v1 as documentai
from google.api_core.client_options import ClientOptions
from google.longrunning.operations_pb2 import GetOperationRequest
from google.cloud import storage

In [None]:
# Configuration
project_id = "<PROJECT_ID>"  # Google Cloud Project ID
location = "<LOCATION>"  # Document AI Processor Location (e.g., "us" or "eu")
processor_id = "<PROCESSOR_ID>"  # Document AI OCR Processor ID (Make sure to set the OCR V2 Version as Default)
gcs_input_uri = "<GCS_INPUT_URI>"  # Google Cloud Storage input folder path containing PDF or image files with checkboxes
gcs_output_uri = "<GCS_OUTPUT_URI>"  # Google Cloud Storage output folder path where the JSON results will be stored


def batch_process_documents_with_premium_options(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_input_uri: str,
    gcs_output_uri: str,
    processor_version_id: Optional[str] = None,
    timeout: int = 500,
) -> Any:
    """Batch process raw input documents with OCR premium features enabled

    Args:
        project_id (str): GCP project ID
        location (str): Processor location us or eu
        processor_id (str): GCP DocumentAI ProcessorID
        gcs_input_uri (str): GCS path which contains all input files
        gcs_output_uri (str): GCS path to store processed JSON results
        processor_version_id (str, optional): VersionID of GCP DocumentAI Processor. Defaults to None.
        timeout (int, optional): Maximum waiting time for operation to complete. Defaults to 500.

    Returns:
        operation.Operation: Long-running operation (LRO) for the current batch job, returned after it completes
    """

    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}
    client = documentai.DocumentProcessorServiceClient(client_options=opts)
    input_config = documentai.BatchDocumentsInputConfig(
        gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
    )
    output_config = documentai.DocumentOutputConfig(
        gcs_output_config={"gcs_uri": gcs_output_uri}
    )
    process_options = documentai.ProcessOptions(
        ocr_config=documentai.OcrConfig(
            enable_native_pdf_parsing=True,
            enable_image_quality_scores=True,
            enable_symbol=True,
            premium_features=documentai.OcrConfig.PremiumFeatures(
                enable_selection_mark_detection=True,
                compute_style_info=True,
                enable_math_ocr=True,
            ),
        )
    )
    print("Documents are processing(batch-documents)...")
    name = (
        client.processor_version_path(
            project_id, location, processor_id, processor_version_id
        )
        if processor_version_id
        else client.processor_path(project_id, location, processor_id)
    )
    request = documentai.types.document_processor_service.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
        process_options=process_options,
    )
    operation = client.batch_process_documents(request)
    print("Waiting for operation to complete...")
    operation.result(timeout=timeout)
    return operation


# Batch process documents using the specified processor
process_documents_result = batch_process_documents_with_premium_options(
    project_id=project_id,
    location=location,
    processor_id=processor_id,
    gcs_input_uri=gcs_input_uri,
    gcs_output_uri=gcs_output_uri,
)

print("Batch processing result:", process_documents_result)
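
Before importing the results, it can help to confirm that selection-mark detection actually produced checkbox elements. The sketch below is a minimal check, not part of the original tool. It assumes the output JSON uses the camelCase field names of the Document AI JSON format (`pages`, `visualElements`, `type`) and that checkbox elements are typed `filled_checkbox` or `unfilled_checkbox`. The helper name `count_checkboxes` and the sample document are hypothetical.

```python
from collections import Counter


def count_checkboxes(document: dict) -> Counter:
    """Count checkbox visual elements in one Document AI JSON (loaded as a dict)."""
    counts = Counter()
    for page in document.get("pages", []):
        for element in page.get("visualElements", []):
            element_type = element.get("type", "")
            # Assumed type names for selection marks in OCR 2.0 output.
            if element_type in ("filled_checkbox", "unfilled_checkbox"):
                counts[element_type] += 1
    return counts


# Tiny hand-written document, used only to illustrate the expected shape.
sample = {
    "pages": [
        {
            "visualElements": [
                {"type": "filled_checkbox"},
                {"type": "unfilled_checkbox"},
                {"type": "unfilled_checkbox"},
            ]
        },
        {},  # A page without any visual elements
    ]
}
print(count_checkboxes(sample))
```

To apply this to real output, load each JSON file from the output folder (for example with `json.loads(blob.download_as_text())`) and pass the resulting dict to the helper. A document with checkboxes but zero counts suggests the processor's default version is not OCR 2.0 or the premium feature was not enabled.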

## Instructions

Follow either of the two approaches below:

1. Uptrain or Fine-tune a Pretrained CDE Model
2. Custom CDE Model

## Approach 1 - Uptrain or Fine-tune a Pretrained CDE Model

1. Create a CDE processor and ensure that you have access to `pretrained-foundation-model-v1.1-2024-03-12` in your CDE processor. If you don't have access, raise a bug and request to be allowlisted for your project.

2. Add the entity schema for the checkbox entities to your processor. Here's an example schema:

<td><img src="./images/cde_2_checkbox_schema.png" style="height: 200px; width: 600px;"></td>

3. Import the OCR 2.0 JSON files obtained from the step "Process Documents through OCR 2.0" into the processor. Import them as training and test data. During the import process, select the option to auto-label using the `pretrained-foundation-model-v1.1-2024-03-12` processor version.

4. Once the JSON files are auto-labeled and imported into the processor, verify the documents to ensure that all the checkbox labels are correct. Mark them as labeled after verifying all the auto-labeled entities.

5. Uptrain a model on top of `pretrained-foundation-model-v1.1-2024-03-12` or any existing version that has OCR 2.0 as the base OCR. The uptraining process typically takes around an hour and creates a new model version.

6. If the uptrained version doesn't reach the target F1 score, experiment with fine-tuning using different hyperparameter values. For example:
   - Training steps: 400/350/300
   - Learning rate: 0.5/1.5/0.6/1
   
<td><img src="./images/finetuned.png" style="height: 450px; width: 400px;"></td>
   
   Fine-tuning may take more than 3 hours for model training and creates a new fine-tuned version.
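
One way to keep track of the fine-tuning experiments is to enumerate the combinations up front and record the F1 score of each run as it finishes. The sketch below is only an illustration: it assumes the values listed in step 6 are tried in every combination, and the names `experiments` and `best_run` are hypothetical. The training itself is still started from the Document AI console.

```python
from itertools import product
from typing import Dict, List

# Values taken from the example in step 6.
train_steps = [400, 350, 300]
learning_rates = [0.5, 1.5, 0.6, 1]

# Every (train_steps, learning_rate) pair, one fine-tuning run each.
experiments: List[Dict[str, float]] = [
    {"train_steps": steps, "learning_rate": rate}
    for steps, rate in product(train_steps, learning_rates)
]


def best_run(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Return the run with the highest recorded F1 score.

    Each entry is expected to be one of the experiment dicts above
    with an added "f1" key filled in after evaluation.
    """
    return max(results, key=lambda run: run["f1"])


print(len(experiments), "combinations to try")
```

Because each fine-tuning run can take several hours, it may be more practical to try a few combinations first and stop once the target F1 score is reached.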

By following these steps, you can uptrain or fine-tune a pretrained CDE model to improve its performance on checkbox entity recognition.

## Checkbox Post-Processing Script for the Fine-Tuned Model Version

Output from a fine-tuned processor version may lack the field `"normalizedValue":{"booleanValue":true}` for checkbox entities. If so, the post-processing script below derives the checked state from the checkbox glyph in the entity text. Sample output follows the script.


In [None]:
project_id = "YOUR_PROJECT_ID"  # Google Cloud project ID
location = "us"  # Processor location, e.g., "us" or "eu"
processor_id = "YOUR_CDE_PROCESSOR_ID"  # CDE processor ID
processor_version = "YOUR_CDE_PROCESSOR_VERSION"  # CDE processor version ID (uptrained or fine-tuned) obtained from the steps above
bucket_name = "YOUR_BUCKET_NAME"
file_path = (
    "PATH/TO/YOUR/FILE.pdf"  # Path to the file within the specified bucket of your PDF
)

In [None]:
# read pdf file from the bucket
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(file_path)
pdf_bytes = blob.download_as_bytes()

# post-processing code to handle checkboxes
res = utilities.process_document_sample(
    project_id, location, processor_id, pdf_bytes, processor_version
)

entity_list = []
for ent in res.document.entities:
    # Default to the raw text; override only for recognizable checkbox glyphs
    normalized_value = ent.mention_text
    if ent.type_.endswith("checkbox"):
        if "☑" in ent.mention_text:
            normalized_value = "True"
        elif "☐" in ent.mention_text:
            normalized_value = "False"
    entity_list.append((ent.type_, ent.mention_text, ent.confidence, normalized_value))
entities_df = pd.DataFrame(
    entity_list,
    columns=["Entity_Name", "Entity_Value", "Confidence_Score", "Normalized_Value"],
)
entities_df  # Display the resulting table in the notebook

### Output

<td><img src="./images/df_output.png" style="height: 200px; width: 600px;"></td>
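
If downstream code expects the `normalizedValue` field itself rather than a DataFrame column, the same glyph rule can be applied to the saved Document JSON. The sketch below is a minimal variant under stated assumptions: the JSON uses the camelCase keys `entities`, `type`, `mentionText` and `normalizedValue`, checkbox entity types end with `checkbox` as in the script above, and nested child entities are not handled. The function name and the sample entity names are hypothetical.

```python
import copy


def add_boolean_normalized_value(document: dict) -> dict:
    """Return a copy of a Document JSON dict with booleanValue set on checkbox entities."""
    result = copy.deepcopy(document)  # Leave the caller's dict unchanged
    for entity in result.get("entities", []):
        if not entity.get("type", "").endswith("checkbox"):
            continue
        text = entity.get("mentionText", "")
        if "☑" in text:
            checked = True
        elif "☐" in text:
            checked = False
        else:
            continue  # No recognizable glyph; leave the entity untouched
        entity.setdefault("normalizedValue", {})["booleanValue"] = checked
    return result


# Hand-written sample with hypothetical entity names, for illustration only.
sample = {
    "entities": [
        {"type": "married_checkbox", "mentionText": "☑ Married"},
        {"type": "single_checkbox", "mentionText": "☐ Single"},
        {"type": "name", "mentionText": "Jane Doe"},
    ]
}
patched = add_boolean_normalized_value(sample)
print(patched["entities"][0])
```

Working on a deep copy keeps the original prediction intact, which makes it easy to compare the raw and patched output.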

## Approach 2 - Custom CDE Model

If the fine-tuned version still does not detect checkboxes properly and the F1 score remains low, train a custom model-based CDE processor from scratch with the same OCR 2.0 JSON files.

<td><img src="./images/model_based.png" style="height: 450px; width: 420px;"></td>


Once the custom model-based processor is trained, you can use it for checkbox prediction.

**Caveat**: Processing PDFs directly with the custom model fails to predict checkboxes, because the custom model uses OCR 1.0 as its base OCR. Process the PDF with OCR 2.0 first, then feed the output JSON to the custom-trained model. The script below does this, followed by a sample of the processor output.

In [None]:
import utilities

# Configuration
project_id = "<PROJECT_ID>"  # Google Cloud Project ID
location = "<LOCATION>"  # Document AI Processor Location (e.g., "us" or "eu")
processor_id = "<PROCESSOR_ID>"  # Doc AI Processor ID (Make sure to set the Custom model based Version which you got from previous step as Default)
gcs_input_uri = "<GCS_INPUT_URI>"  # Google Cloud Storage input folder path of OCR 2.0 JSON files obtained from the step "Process Documents through OCR 2.0"
gcs_output_uri = "<GCS_OUTPUT_URI>"  # Google Cloud Storage output folder path where the JSON results will be stored

# Batch process documents using the specified processor
process_documents_result = utilities.batch_process_documents_sample(
    project_id=project_id,
    location=location,
    processor_id=processor_id,
    gcs_input_uri=gcs_input_uri,
    gcs_output_uri=gcs_output_uri,
)

print("Batch processing result:", process_documents_result)

**Output: JSON in the Output folder with the checkbox entities**

<td><img src="./images/model_json_out.png" style="height: 450px; width: 450px;"></td>