# Reverse Annotation Tool

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This tool annotates (labels) entities in a document by matching expected values against the OCR text tokens. The notebook expects an input file in tabular format: the header row names the entities to be labeled in every document, and the following rows hold the expected values for each document. The script calls the processor and parses each input document. The parsed document is then annotated wherever an input entity value is found in the OCR text. The result is an output JSON file with updated entities, exported to a storage bucket path. These JSON files can be imported into a processor to verify that the annotations match the input file supplied to the script.
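
The core idea, locating a known value in the OCR text and recording its character offsets, can be sketched with the standard library alone. The function name `find_mentions` and the sample text are illustrative assumptions, not part of the notebook; the notebook goes further and turns such offsets into Document AI text anchors and page anchors.

```python
import re
from typing import List, Tuple


def find_mentions(ocr_text: str, value: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of every exact occurrence of value."""
    # re.escape keeps characters such as "." or "$" in the value literal
    return [(m.start(), m.end()) for m in re.finditer(re.escape(value), ocr_text)]


# Hypothetical OCR text and CSV value, for illustration only
ocr_text = "Invoice No: INV-001\nTotal: 120.50\nRef INV-001 paid\n"
offsets = find_mentions(ocr_text, "INV-001")
print(offsets)
```

Every occurrence is returned, so a value that appears twice in a document yields two candidate annotations.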

# Prerequisites
* Vertex AI Notebook
* Input CSV file containing the list of files to be labeled.
* Document AI Processor
* GCS bucket for the input documents and for writing the output.

# Step-by-Step Procedure

## 1. Import Modules/Packages

In [7]:
!pip install google-cloud-documentai
!pip install google-cloud-storage
!pip install numpy
!pip install pandas
!pip install fuzzywuzzy

In [8]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [3]:
import csv
import re
from typing import Dict, List, Tuple, Union

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage

# The processing loop calls these helpers as `utilities.<name>`,
# so the module itself must be imported.
import utilities

## 2. Input Details

* **PROJECT_ID** : GCP project Id
* **LOCATION** : Location of DocumentAI processor, either `us` or `eu`
* **PROCESSOR_ID** : DocumentAI processor Id
* **PROCESSOR_VERSION** : DocumentAI processor version Id (e.g. pretrained-invoice-v2.0-2023-12-06)
* **INPUT_BUCKET** : GCS folder path that contains the input PDF files
* **OUTPUT_BUCKET** : GCS folder path where the post-processing results are stored
* **READ_SCHEMA_FILENAME** : CSV file containing the entity data (type and mention_text) to be annotated. Column 1 (`FileNames`) contains the file names. Column 2 and every following column is named after an entity type and contains the text to annotate for that entity. Columns whose header starts with `line_item/` are treated as line-item properties. In other words, it is a schema file in tabular form: the header row lists the names of the entities to identify and annotate, and the following rows hold the values to extract for each file.

    <img src='./images/csv_sample.png' width=800 height=400>
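
As a rough illustration of this layout, the sketch below builds a small schema file in memory and separates the header-level entities from the `line_item/` columns. The entity names, file name and values are made up for the example, and the two-row layout for a file with two line items is an assumption about how such a file might be arranged.

```python
import csv
import io

# Hypothetical schema file: entity names and values are made up for illustration
sample_csv = (
    "FileNames,invoice_id,total_amount,line_item/description,line_item/amount\n"
    "invoice_1.pdf,INV-001,120.50,Widget,100.00\n"
    "invoice_1.pdf,,,Shipping,20.50\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
header_fields = [h for h in reader.fieldnames if h != "FileNames" and "/" not in h]
line_item_fields = [h for h in reader.fieldnames if h.startswith("line_item/")]

for row in reader:
    # Empty cells mean "nothing to annotate for this entity in this row"
    values = {k: v for k, v in row.items() if k != "FileNames" and v}
    print(row["FileNames"], values)
```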

In [10]:
PROJECT_ID = "xx-xx-xx"
PROCESSOR_ID = "xx-xx-xx"
PROCESSOR_VERSION = "pretrained-invoice-v2.0-2023-12-06"
INPUT_BUCKET = "gs://BUCKET_NAME/reverse_annotation_tool/input/"
OUTPUT_BUCKET = "gs://BUCKET_NAME/reverse_annotation_tool/output/"
LOCATION = "us"
# CSV file holding the schema (header row) and the values to annotate
READ_SCHEMA_FILENAME = "schema_and_data.csv"
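
The notebook derives the bucket name and the object prefix from these `gs://` paths by splitting on `/`, and it appends file names directly to the path, so both paths should end with a trailing slash. A minimal sketch of that split follows; the helper name `split_gcs_uri` is hypothetical and is not part of `utilities.py`.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {gcs_uri}")
    # Everything before the first "/" is the bucket, the rest is the prefix
    bucket_name, _, prefix = gcs_uri[len("gs://"):].partition("/")
    return bucket_name, prefix


bucket_name, prefix = split_gcs_uri("gs://BUCKET_NAME/reverse_annotation_tool/input/")
print(bucket_name)
print(prefix)
```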

## 3. Run Below Code-Cells

In [None]:
def read_input_schema(read_schema_filename: str) -> pd.DataFrame:
    """
    Reads an input schema from a CSV file.

    Args:
    - read_schema_filename (str): Path to the CSV file containing the schema.

    Returns:
    - pd.DataFrame: DataFrame containing the schema data.
    """

    df = pd.read_csv(read_schema_filename, dtype=str)
    df = df.drop(df[df["FileNames"] == "Type"].index)
    df.replace("", np.nan, inplace=True)
    return df


def get_token_range(json_data: documentai.Document) -> Dict[range, Dict[str, int]]:
    """
    Gets the token ranges from the provided JSON data.

    Args:
    - json_data (documentai.Document): JSON data containing page and token information.

    Returns:
    - dict: Dictionary containing token ranges with page number and token number information.
    """

    token_range = {}
    for pn, page in enumerate(json_data.pages):
        for tn, token in enumerate(page.tokens):
            ts = token.layout.text_anchor.text_segments[0]
            start_index = ts.start_index
            end_index = ts.end_index
            token_range[range(start_index, end_index)] = {
                "page_number": pn,
                "token_number": tn,
            }
    return token_range


def fix_page_anchor_entity(
    entity: documentai.Document.Entity,
    json_data: documentai.Document,
    token_range: Dict[range, Dict[str, int]],
) -> documentai.Document.Entity:
    """
    Fixes the page anchor entity based on the provided JSON data and token range.

    Args:
    - entity (documentai.Document.Entity): Entity object to be fixed.
    - json_data (documentai.Document): JSON data containing page and token information.
    - token_range (Dict[range, Dict[str, int]]):
        Dictionary containing token ranges with page number and token number information.

    Returns:
    - documentai.Document.Entity: Fixed entity object.
    """

    start = entity.text_anchor.text_segments[0].start_index
    end = entity.text_anchor.text_segments[0].end_index - 1

    for j in token_range:
        if start in j:
            lower_token = token_range[j]
    for j in token_range:
        if end in j:
            upper_token = token_range[j]

    lower_token_data = (
        json_data.pages[lower_token["page_number"]]
        .tokens[lower_token["token_number"]]
        .layout.bounding_poly.normalized_vertices
    )
    upper_token_data = (
        json_data.pages[int(upper_token["page_number"])]
        .tokens[int(upper_token["token_number"])]
        .layout.bounding_poly.normalized_vertices
    )

    def get_coords(
        normalized_vertex: documentai.NormalizedVertex,
    ) -> Tuple[float, float]:
        return normalized_vertex.x, normalized_vertex.y

    xa, ya = get_coords(lower_token_data[0])
    xa_, ya_ = get_coords(upper_token_data[0])

    xb, yb = get_coords(lower_token_data[1])
    xb_, yb_ = get_coords(upper_token_data[1])

    xc, yc = get_coords(lower_token_data[2])
    xc_, yc_ = get_coords(upper_token_data[2])

    xd, yd = get_coords(lower_token_data[3])
    xd_, yd_ = get_coords(upper_token_data[3])

    cord1 = {"x": min(xa, xa_), "y": min(ya, ya_)}
    cord2 = {"x": max(xb, xb_), "y": min(yb, yb_)}
    cord3 = {"x": max(xc, xc_), "y": max(yc, yc_)}
    cord4 = {"x": min(xd, xd_), "y": max(yd, yd_)}
    nvs = []
    for coords in [cord1, cord2, cord3, cord4]:
        x, y = coords["x"], coords["y"]
        nv = documentai.NormalizedVertex(x=x, y=y)
        nvs.append(nv)
    entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices = nvs
    entity.page_anchor.page_refs[0].page = lower_token["page_number"]
    return entity


def create_entity(
    mention_text: str, type_: str, match: re.Match
) -> documentai.Document.Entity:
    """
    Creates a Document Entity based on the provided mention text, type, and match object.

    Args:
    - mention_text (str): The text to be mentioned in the entity.
    - type_ (str): The type of the entity.
    - match (re.Match): Match object representing the start and end indices of the mention text.

    Returns:
    - documentai.Document.Entity: The created Document Entity.
    """

    entity = documentai.Document.Entity()
    entity.mention_text = mention_text
    entity.type_ = type_
    bp = documentai.BoundingPoly(normalized_vertices=[])
    # Text offsets are integers; match.end() is exclusive
    ts = documentai.Document.TextAnchor.TextSegment(
        start_index=match.start(), end_index=match.end()
    )
    entity.text_anchor.text_segments = [ts]
    entity.page_anchor.page_refs = [
        documentai.Document.PageAnchor.PageRef(bounding_poly=bp)
    ]

    return entity


# Line Items processing
def extract_anchors(prop: documentai.Document.Entity) -> Tuple[str, str]:
    """It will look for text anchors and page anchors in Entity object

    Args:
        prop (documentai.Document.Entity): DocumentAI Entity object

    Returns:
        Tuple[str, str]: It contains text_anchors and page_anchors in string-format
    """
    text_anchor = f"{prop.text_anchor.text_segments}" if prop.text_anchor else "MISSING"
    page_anchor = f"{prop.page_anchor.page_refs[0]}" if prop.page_anchor else "MISSING"
    return text_anchor, page_anchor


def improved_similarity_score(str1: str, str2: str) -> float:
    """Returns the word-level Jaccard similarity between two strings

    Args:
        str1 (str): First text
        str2 (str): Second text

    Returns:
        float: Number of distinct words shared by both texts divided by the
        number of distinct words in either text (0.0 if both are empty)
    """

    str1_parts = set(str1.split())
    str2_parts = set(str2.split())
    common_parts = str1_parts.intersection(str2_parts)
    total_parts = str1_parts.union(str2_parts)
    if not total_parts:
        return 0.0
    return len(common_parts) / len(total_parts)


def pair_items_with_improved_similarity(
    gt_dict: Dict[str, Dict[str, str]], pred_dict: Dict[str, Dict[str, str]]
) -> Dict[str, str]:
    """Pairs ground-truth and predicted line items based on their similarity score

    Args:
        gt_dict (Dict[str, Dict[str, str]]): A dictionary containing ground truth data
        pred_dict (Dict[str, Dict[str, str]]): A dictionary containing prediction data

    Returns:
        Dict[str, str]: Maps each ground-truth line-item key to the key of its
        best-matching predicted line item (None when nothing overlaps)
    """
    pairings = {}
    for gt_key, gt_values in gt_dict.items():
        gt_concat = " ".join(gt_values.values()).lower()
        best_match_key = None
        best_score = -1
        for pred_key, pred_values in pred_dict.items():
            pred_values_only = {k: v["value"] for k, v in pred_values.items()}
            pred_concat = " ".join(pred_values_only.values()).lower()
            score = improved_similarity_score(gt_concat, pred_concat)
            if score > best_score:
                best_score = score
                best_match_key = pred_key
        pairings[gt_key] = best_match_key if best_score > 0 else None
    return pairings


def process_documents(
    csv_file_path: str, doc_obj: documentai.Document, file_name: str
) -> Tuple[Dict[str, Dict[str, str]], Dict[str, Dict[str, str]], Dict[str, str]]:
    """
    Helper that collects the ground-truth line items for one file, groups the predicted
    line-item entities, and pairs each ground-truth line item with its best match

    Args:
        csv_file_path (str):
            Path of the CSV file containing the text to be annotated in the doc-proto object
        doc_obj (documentai.Document): DocumentAI Doc proto object
        file_name (str): File name used to select the matching rows of the CSV file

    Returns:
        Tuple[Dict[str, Dict[str, str]],Dict[str, Dict[str, str]],Dict[str, str]]:
            ground-truth line items, grouped predicted entities, and the pairing of each
            ground-truth line item with its best-matching predicted line item
    """
    line_items_dict = {}
    entity_groups_dict = {}

    # Read and process the CSV file
    with open(csv_file_path, mode="r", newline="") as csv_file:
        csv_reader = csv.reader(csv_file)
        headers = next(csv_reader)
        for index, row in enumerate(csv_reader):
            row_dict = dict(zip(headers, row))
            if row_dict.get("FileNames") == file_name:
                line_item_details = {}
                has_line_item_values = False
                for header, value in row_dict.items():
                    if "line_item/" in header and value:
                        has_line_item_values = True
                        line_item_details[header] = value
                if line_item_details and has_line_item_values:
                    line_items_dict[f"gt_line_item_{index}"] = line_item_details

    n = 1
    for entity in doc_obj.entities:
        if entity.properties:
            entity_details = {}
            for prop in entity.properties:
                key = prop.type_
                value = prop.mention_text
                text_anchor, page_anchor = extract_anchors(prop)
                entity_details[key] = {
                    "value": value,
                    "text_anchor": text_anchor,
                    "page_anchor": page_anchor,
                }
            entity_groups_dict[f"pred_line_item_{n}"] = entity_details
            n += 1

    # Pair items using the improved similarity score
    improved_pairings = pair_items_with_improved_similarity(
        line_items_dict, entity_groups_dict
    )

    return line_items_dict, entity_groups_dict, improved_pairings


def extract_bounding_box_and_page(layout_info: str) -> Dict[str, Union[int, float]]:
    """It is used to get xy-coords and its page number from page_anchor

    Args:
        layout_info (str): DocumentAI token object page_anchor data in string format

    Returns:
        Dict[str, Union[int, float]]: It contains page_number and xy-coords of token
    """
    x_values = []
    y_values = []
    page = 0  # Default page number

    for line in layout_info.split("\n"):
        if "x:" in line:
            _, x_value = line.split(":")
            x_values.append(float(x_value.strip()))
        elif "y:" in line:
            _, y_value = line.split(":")
            y_values.append(float(y_value.strip()))
        elif line.startswith("page:"):
            _, page = line.split(":")
            page = int(page.strip())

    return {
        "page": page,
        "min_x": min(x_values),
        "max_x": max(x_values),
        "min_y": min(y_values),
        "max_y": max(y_values),
    }


def get_page_anc_line(line_dict_1: Dict[str, str]) -> Dict[str, Union[int, float]]:
    """It is used to get xy-coords and its page number from page_anchor

    Args:
        line_dict_1 (Dict[str, str]): Dictionary which holds page_anchor data

    Returns:
        Dict[str, Union[int, float]]:
            It returns page_number and xy-coords of based on page_anchor object
    """

    val_s = []
    for en1, val1 in line_dict_1.items():
        page_anc_dict = extract_bounding_box_and_page(val1["page_anchor"])
        val_s.append(page_anc_dict)
    page_line = {
        "page": val_s[0]["page"],
        "min_x": min(entry["min_x"] for entry in val_s),
        "max_x": max(entry["max_x"] for entry in val_s),
        "min_y": min(entry["min_y"] for entry in val_s),
        "max_y": max(entry["max_y"] for entry in val_s),
    }

    return page_line


def get_cleaned_text(text: str) -> str:
    """Lowercases the text and removes spaces and newline characters

    Args:
        text (str): Text to be cleaned

    Returns:
        str: Lowercased text without spaces or newline characters
    """
    return text.lower().replace(" ", "").replace("\n", "")


def get_match(gt_mt_split: List[str], mt_temp: str) -> Union[List[str], None]:
    """Checks a token against the remaining ground-truth words

    Args:
        gt_mt_split (List[str]): Ground-truth words that have not been matched yet
        mt_temp (str): Token text to compare against each word in gt_mt_split

    Returns:
        Union[List[str], None]: gt_mt_split with the first fuzzy-matched word
        removed (the list is modified in place), or None if no word matches
    """
    for mt in gt_mt_split:
        if fuzz.ratio(get_cleaned_text(mt_temp), get_cleaned_text(mt)) > 75:
            # Consume the matched word so it cannot be matched again
            gt_mt_split.remove(mt)
            return gt_mt_split
    return None


def get_new_entity(
    doc_obj: documentai.Document,
    page_anc_dict: Dict[str, Union[int, float]],
    gt_mt: str,
    type_en: str,
) -> Union[documentai.Document.Entity, None]:
    """It creates new entity based on provided page_anchor, mention text and entity type

    Args:
        doc_obj (documentai.Document): DocumentAI Doc proto object
        page_anc_dict (Dict[str, Union[int, float]]):
            It contains page_number and xy-coords of based on page_anchor object
        gt_mt (str): text which uses as mention_text for entity object
        type_en (str): text which uses as type_ for an entity object

    Returns:
        Union[documentai.Document.Entity, None]:
        It creates new entity based on provided page_anchor, mention text and entity type
    """
    text_anc = []
    page_anc = {"x": [], "y": []}
    mt_text = ""
    gt_mt_split = gt_mt.split()
    for page_num, _ in enumerate(doc_obj.pages):
        if page_num == int(page_anc_dict["page"]):
            for token in doc_obj.pages[page_num].tokens:
                vertices = token.layout.bounding_poly.normalized_vertices
                minx_token, miny_token = min(point.x for point in vertices), min(
                    point.y for point in vertices
                )
                maxx_token, maxy_token = max(point.x for point in vertices), max(
                    point.y for point in vertices
                )
                token_seg = token.layout.text_anchor.text_segments
                for seg in token_seg:
                    token_start, token_end = seg.start_index, seg.end_index
                if (
                    abs(miny_token - page_anc_dict["min_y"]) <= 0.02
                    and abs(maxy_token - page_anc_dict["max_y"]) <= 0.02
                ):
                    mt_temp = doc_obj.text[token_start:token_end]

                    if (
                        get_cleaned_text(mt_temp) in gt_mt.lower().replace(" ", "")
                        or fuzz.ratio(
                            get_cleaned_text(mt_temp), gt_mt.lower().replace(" ", "")
                        )
                        > 70
                    ):
                        if len(mt_temp) <= 2:
                            if (
                                fuzz.ratio(
                                    mt_temp.lower().replace(" ", "").replace("\n", ""),
                                    gt_mt.lower().replace(" ", ""),
                                )
                                > 80
                            ):
                                ts = documentai.Document.TextAnchor.TextSegment(
                                    start_index=token_start, end_index=token_end
                                )
                                text_anc.append(ts)
                                page_anc["x"].extend([minx_token, maxx_token])
                                page_anc["y"].extend([miny_token, maxy_token])
                                mt_text += mt_temp
                        else:
                            ts = documentai.Document.TextAnchor.TextSegment(
                                start_index=token_start, end_index=token_end
                            )
                            text_anc.append(ts)
                            page_anc["x"].extend([minx_token, maxx_token])
                            page_anc["y"].extend([miny_token, maxy_token])
                            mt_text += mt_temp
                    else:
                        match_mt = get_match(gt_mt_split, mt_temp)
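                        # get_match returns an empty (falsy) list when the last
                        # remaining word matched, so compare against None
                        if match_mt is not None: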
                            ts = documentai.Document.TextAnchor.TextSegment(
                                start_index=token_start, end_index=token_end
                            )
                            text_anc.append(ts)
                            page_anc["x"].extend([minx_token, maxx_token])
                            page_anc["y"].extend([miny_token, maxy_token])
                            mt_text += mt_temp

    try:
        x, y = page_anc.values()
        page_anc_new = [
            {"x": min(x), "y": min(y)},
            {"x": max(x), "y": min(y)},
            {"x": max(x), "y": max(y)},
            {"x": min(x), "y": max(y)},
        ]
        nvs = []
        for xy in page_anc_new:
            nv = documentai.NormalizedVertex(**xy)
            nvs.append(nv)
        new_entity = documentai.Document.Entity()
        new_entity.mention_text = mt_text
        new_entity.type_ = type_en
        ta = documentai.Document.TextAnchor(content=mt_text, text_segments=text_anc)
        new_entity.text_anchor = ta
        bp = documentai.BoundingPoly(normalized_vertices=nvs)
        page_ref = documentai.Document.PageAnchor.PageRef(
            page=str(page_anc_dict["page"]), bounding_poly=bp
        )
        new_entity.page_anchor.page_refs = [page_ref]
        return new_entity
    except ValueError:
        return None


def parse_page_anchor(page_anchor_str: str) -> documentai.Document.PageAnchor:
    """It creates page_anchor proto-object based on provided page_anchor text

    Args:
        page_anchor_str (str): page anchor data in string format

    Returns:
        documentai.Document.PageAnchor:
        newly created page_anchor proto-object based on provided page_anchor text
    """
    # Extract normalized vertices using the provided reference approach
    vertices = []
    page = "0"  # Default to page 0 if not specified
    lines = page_anchor_str.split("\n")
    for idx, line in enumerate(lines):
        if "x:" in line:
            x = float(line.split(":")[1].strip())
            # Ensure there's a corresponding 'y' line following 'x' line
            if idx + 1 < len(lines) and "y:" in lines[idx + 1]:
                y = float(lines[idx + 1].split(":")[1].strip())
                nv = documentai.NormalizedVertex(x=x, y=y)
                vertices.append(nv)
        elif line.startswith("page:"):
            page = line.split(":")[1].strip()

    bp = documentai.BoundingPoly(normalized_vertices=vertices)
    page_ref = documentai.Document.PageAnchor.PageRef(page=page, bounding_poly=bp)
    page_anchor = documentai.Document.PageAnchor(page_refs=[page_ref])
    return page_anchor


def parse_text_anchor(
    text_anchor_str: str, content_str: str
) -> documentai.Document.TextAnchor:
    """
    It builds DocAI text Anchor object based on provided text anchor and content in string format

    Args:
        text_anchor_str (str): DocAI text anchor object in string format
        content_str (str): text to add to the text anchor object

    Returns:
        documentai.Document.TextAnchor:
        Text Anchor object created based on provided text anchor and content in string format
    """

    # Simplified parsing for 'text_segments' from 'text_anchor'
    segments_matches = re.findall(
        r"start_index: (\d+)\nend_index: (\d+)", text_anchor_str
    )
    text_segments = []
    for si, ei in segments_matches:
        ts = documentai.Document.TextAnchor.TextSegment(start_index=si, end_index=ei)
        text_segments.append(ts)

    ta = documentai.Document.TextAnchor(
        text_segments=text_segments, content=content_str
    )
    return ta


def construct_predicted_value_details(
    predicted_value: Dict[str, str], type_: str
) -> documentai.Document.Entity:
    """Creates a new entity object in doc-proto format

    Args:
        predicted_value (Dict[str, str]):
            Dictionary containing text_anchor, page_anchor and value as strings
        type_ (str): Text to be assigned to entity.type_

    Returns:
        documentai.Document.Entity: The new entity object
    """
    page_anchor = parse_page_anchor(predicted_value["page_anchor"])
    text_anchor = parse_text_anchor(
        predicted_value["text_anchor"], predicted_value.get("value", "")
    )
    ent = documentai.Document.Entity()
    ent.mention_text = predicted_value.get("value", "")
    ent.page_anchor = page_anchor
    ent.text_anchor = text_anchor
    ent.type_ = type_
    return ent


df_schema = read_input_schema(READ_SCHEMA_FILENAME)
#  Group by 'FileNames'
grouped = df_schema.groupby("FileNames", as_index=False)
processed_rows = []

for name, group in grouped:
    # Get the total number of columns
    max_columns = len(group.columns)
    combined_row = []

    # Iterate over rows in the group
    for index, row in group.iterrows():
        row_list = row.tolist()
        row_filled = row_list + [np.nan] * (
            max_columns - len(row_list)
        )  # Extend with NaNs if less than max_columns
        combined_row.extend(row_filled)

    processed_rows.append(combined_row)


headers = [
    header.strip()
    for header in pd.read_csv(READ_SCHEMA_FILENAME, nrows=0).columns.tolist()
]
prefix = "line_item/"

# Extract the part after 'line_item/' for each item that starts with the prefix
unique_entities = [item.split("/")[-1] for item in headers if item.startswith(prefix)]

processed_files = set()  # Set to keep track of processed FileNames


client = storage.Client()
bucket = client.get_bucket(INPUT_BUCKET.split("/")[2])

for row in processed_rows:
    file_name = row[0]  # The first item is 'FileNames'

    if file_name not in processed_files:
        print("Processing:", file_name)
        file_name_path = INPUT_BUCKET + file_name
        file_name_path = "/".join(file_name_path.split("/")[3:])
        blob = bucket.blob(file_name_path)
        content = blob.download_as_bytes()
        res = utilities.process_document_sample(
            project_id=PROJECT_ID,
            location=LOCATION,
            processor_id=PROCESSOR_ID,
            pdf_bytes=content,
            processor_version=PROCESSOR_VERSION,
        )
        res_dict = res.document
        token_range = get_token_range(res_dict)

        # Add the file_name to the set of processed files
        processed_files.add(file_name)

    parser_entities = res_dict.entities

    # Process the line items
    line_items_dict, entity_groups_dict, improved_pairings = process_documents(
        READ_SCHEMA_FILENAME, res_dict, file_name
    )

    entities = []  # Initialize the list to hold all entities

    for match_key, match_value in improved_pairings.items():
        if match_value:
            entity = documentai.Document.Entity()
            mention_texts = []  # To hold all mention_text values for concatenation

            for gt_k, gt_v in line_items_dict[match_key].items():
                is_correct = True
                if gt_k in entity_groups_dict[match_value].keys():
                    predicted_value = entity_groups_dict[match_value][gt_k]
                    similarity = fuzz.ratio(gt_v, predicted_value["value"])
                    if similarity < 90:
                        is_correct = False
                    else:
                        predicted_value_details = construct_predicted_value_details(
                            predicted_value, gt_k
                        )
                        entity.type_ = gt_k.split("/")[
                            0
                        ]  # Assuming 'line_item' is the desired type
                        entity.properties.append(predicted_value_details)
                        mention_texts.append(predicted_value_details.mention_text)
                else:
                    is_correct = False

                if not is_correct:
                    page_line = get_page_anc_line(entity_groups_dict[match_value])
                    new_ent = get_new_entity(res_dict, page_line, gt_v, gt_k)
                    if new_ent is not None:
                        entity.type_ = gt_k.split("/")[
                            0
                        ]  # Assuming 'line_item' is the desired type
                        entity.properties.append(new_ent)
                        mention_texts.append(new_ent.mention_text)
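                    elif gt_k in entity_groups_dict[match_value]:
                        # Fall back to the parser's prediction, but only when one
                        # exists for this field; otherwise `predicted_value` would
                        # be undefined or left over from a previous field.
                        predicted_value_details = construct_predicted_value_details(
                            entity_groups_dict[match_value][gt_k], gt_k
                        )
                        entity.properties.append(predicted_value_details)
                        mention_texts.append(predicted_value_details.mention_text)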

            # Concatenate all mention_texts for the parent entity
            entity.mention_text = " ".join(mention_texts).strip()

            # Add the entity to the list if it has been populated with properties
            if entity.properties:
                entities.append(entity)

    # Initialize containers for processed and unprocessed entities
    list_of_entities = []
    list_of_entities_not_mapped = []
    processed_entities = set()  # Set to track processed entities

    # Iterate over rows and headers
    for i in range(0, len(row), len(headers)):
        row_slice = row[i : i + len(headers)]
        for j in range(1, len(headers)):
            type_ = headers[j]
            mention_text = row_slice[j]
            if "/" in type_:
                continue  # Skip if type_ contains '/'
            # Check if entity matches parser_entities before using re.finditer
            matched = False
            for proc_ent in parser_entities:
                if proc_ent.type_ == type_ and proc_ent.mention_text == mention_text:
                    matched = True
                    # Directly append the parser entity
                    list_of_entities.append(proc_ent)
                    break  # Exit loop after match is found

            # If no match is found, proceed with finditer to create a new entity
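            # Missing CSV cells are NaN, which is truthy, so test for them explicitly
            if not matched and pd.notna(mention_text) and mention_text:
                # The value must be followed by a space, a comma or a newline
                occurrences = re.finditer(
                    re.escape(str(mention_text)) + r"[ ,\n]", res_dict.text
                )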
                for m in occurrences:
                    start, end = m.start(), m.end()
                    entity_id = (mention_text, start, end)
                    if entity_id not in processed_entities:
                        entity = create_entity(mention_text, type_, m)
                        try:
                            entity_modified = fix_page_anchor_entity(
                                entity, res_dict, token_range
                            )
                            processed_entities.add(entity_id)
                            list_of_entities.append(entity_modified)
                        except Exception as e:
                            print(
                                "Not able to find " + mention_text + " in the OCR:", e
                            )
                            continue

    # Replace the document's entities: header-level entities first, then line items
    res_dict.entities = list_of_entities
    for entity in entities:
        res_dict.entities.append(entity)

    # Write the final output to GCS
    output_bucket_name = OUTPUT_BUCKET.split("/")[2]
    output_path_within_bucket = (
        "/".join(OUTPUT_BUCKET.split("/")[3:]) + file_name + ".json"
    )
    utilities.store_document_as_json(
        documentai.Document.to_json(res_dict),
        output_bucket_name,
        output_path_within_bucket,
    )
print("Process Completed!!!")
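
To make the line-item pairing in the cell above more concrete: `improved_similarity_score` is a word-level Jaccard overlap, and each ground-truth row is paired with the predicted line item that scores highest. The standalone sketch below reproduces that calculation on made-up strings; the function name `jaccard_words` and the sample values are assumptions for illustration only.

```python
def jaccard_words(a: str, b: str) -> float:
    """Share of distinct words common to both strings (0.0 when both are empty)."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Made-up ground-truth row and two candidate predicted line items
ground_truth = "widget 2 100.00"
candidates = {
    "pred_line_item_1": "widget 2 100.00 usd",
    "pred_line_item_2": "shipping 1 20.50",
}
scores = {key: jaccard_words(ground_truth, text) for key, text in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because the score works on whole words, a single OCR error inside a word removes that word from the overlap, which is why the notebook falls back to character-level fuzzy matching for individual fields.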

## 4. Output Details

As shown below, the values listed in the CSV file are annotated in the Document AI output (Document proto).
<img src='./images/output_sample.png' width=800 height=400>
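
Before importing the output files into a processor, a quick sanity check is to count the entities per type in an exported JSON file. The sketch below uses a heavily trimmed, hypothetical document and assumes the exported JSON uses camelCase keys such as `mentionText`; adjust the key names if your files differ.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed output document; real files contain many more fields
sample_output = """
{
  "entities": [
    {"type": "invoice_id", "mentionText": "INV-001"},
    {"type": "total_amount", "mentionText": "120.50"},
    {"type": "line_item", "mentionText": "Widget 100.00",
     "properties": [
       {"type": "line_item/description", "mentionText": "Widget"},
       {"type": "line_item/amount", "mentionText": "100.00"}
     ]}
  ]
}
"""

document = json.loads(sample_output)
# Count top-level entities only; line-item fields live under "properties"
counts = Counter(entity["type"] for entity in document["entities"])
for entity_type, count in sorted(counts.items()):
    print(entity_type, count)
```

Comparing these counts with the non-empty cells of the CSV row for the same file gives a rough indication of whether any value could not be located in the OCR text.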