# Line-Item Comparison Notebook

* Author: docai-incubator@google.com

# Disclaimer
This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

# Objective
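This notebook compares line items across Document AI JSON outputs: ground-truth documents are compared against both parsed and post-processed documents. Child entities are matched by page, bounding-box overlap, and fuzzy text similarity, and the results are written to a CSV file.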

# Prerequisites
* Vertex AI Notebook
* Parsed JSON files in a GCS folder
* GCS folders containing the ground-truth, parsed, and post-processed JSON files

# Step-by-Step Procedure

# 1. Imports

Import necessary libraries for processing.

In [None]:
# Download incubator-tools utilities module to present-working-directory
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
!pip install google-cloud-storage fuzzywuzzy pandas google-cloud-documentai -q

In [None]:
from google.cloud import storage
from fuzzywuzzy import fuzz
import pandas as pd
from pprint import pprint
import utilities
from google.cloud import documentai_v1beta3 as documentai

# 2. Input Details

* **project_id** : Give your GCP Project ID
* **gt_jsons_uri** : GCS path of the folder that contains the ground-truth JSON files
* **parsed_jsons_uri** : GCS path of the folder that contains the parsed (document-processed) JSON results
* **post_processed_jsons_uri** : GCS path of the folder that contains the post-processed JSON results

**NOTE**:
* All GCS paths must end with a trailing slash (`/`)
* File names must be identical across the ground-truth, parsed, and post-processed folders
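
The two requirements in the note can be checked before running the comparison. The following is a minimal sketch: `split_gcs_uri` and `missing_files` are hypothetical helpers introduced here for illustration and are not part of `utilities.py`.

```python
from typing import List, Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/prefix/' into (bucket, prefix), enforcing the trailing slash."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    if not uri.endswith("/"):
        raise ValueError(f"GCS folder URI must end with '/': {uri!r}")
    # Everything before the first '/' is the bucket; the rest is the folder prefix
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    return bucket, prefix


def missing_files(gt_names: List[str], other_names: List[str]) -> List[str]:
    """Return ground-truth file names that have no same-named file in the other folder."""
    return sorted(set(gt_names) - set(other_names))
```

For example, `split_gcs_uri("gs://my-bucket/invoices/gt/")` yields the bucket `my-bucket` and the prefix `invoices/gt/`, while a URI without the trailing slash raises `ValueError`.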

In [None]:
project_id = "xxxx-xxxx-xxxx"
gt_jsons_uri = "gs://xx/xxx/xxxx/"
parsed_jsons_uri = "gs://xx/xxxx/xxxx/xx/"
post_processed_jsons_uri = "gs://xx/xxx/xxxx/xx/"

# 3. Script Execution

### Main Comparison Function

This function compares the line items of two documents and returns a DataFrame with the comparison results.
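
The function treats two child entities as the same item when they sit on the same page, their bounding boxes overlap with an IoU of at least 0.2, and their text similarity exceeds 0.8. The sketch below illustrates that rule on plain tuples and strings rather than Document AI objects. It makes two assumptions: `difflib.SequenceMatcher` is a standard-library stand-in for `fuzz.ratio` (scores may differ slightly), and degenerate zero-area boxes are reported as non-overlapping, whereas the notebook accepts them. `bbox_iou` and `is_match` are hypothetical names.

```python
from difflib import SequenceMatcher


def bbox_iou(box_a, box_b):
    """IoU of two boxes given as (x_min, x_max, y_min, y_max) in normalized coordinates."""
    x_min = max(box_a[0], box_b[0])
    x_max = min(box_a[1], box_b[1])
    y_min = max(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection
    intersection = max(x_max - x_min, 0.0) * max(y_max - y_min, 0.0)
    area_a = (box_a[1] - box_a[0]) * (box_a[3] - box_a[2])
    area_b = (box_b[1] - box_b[0]) * (box_b[3] - box_b[2])
    union = area_a + area_b - intersection
    # Assumption of this sketch: zero-area boxes count as no overlap
    return intersection / union if union > 1e-10 else 0.0


def is_match(page_a, box_a, text_a, page_b, box_b, text_b,
             min_iou=0.2, min_text_ratio=0.8):
    """Apply the three criteria: same page, enough overlap, similar text."""
    text_ratio = SequenceMatcher(None, text_a, text_b).ratio()
    return (
        page_a == page_b
        and bbox_iou(box_a, box_b) >= min_iou
        and text_ratio > min_text_ratio
    )
```

The IoU threshold is deliberately loose: it only needs to confirm that both entities refer to roughly the same region, while the text comparison guards against neighbouring fields being paired.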


In [None]:
def get_comparision_dataframe(doc_gt, doc_pp):
    """
    Compares the line items of two documents and returns a DataFrame with the results.

    Parameters:
    doc_gt (Document): Ground-truth document.
    doc_pp (Document): Parsed or post-processed document to compare against it.

    Returns:
    DataFrame: One row per child entity, with the GT and PP line, id, type, and
    mention text, plus a "Match" column (TP, FP, or FN).
    """

    def get_line_items(doc):
        line_items = []
        line_dict = {}
        sub_items = []

        df = pd.DataFrame(columns=["line", "id", "type", "mentionText", "ver"])

        for entity1 in doc.entities:
            if entity1.properties:
                if entity1.type == "line_item":
                    line_items.append(entity1)
                    for subitem in entity1.properties:
                        sub_items.append(subitem)

        for i in range(len(line_items)):
            for ent in line_items[i].properties:
                if i in line_dict.keys():
                    line_dict[i].append(ent)
                else:
                    line_dict[i] = [ent]

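        def get_min_max_ver(ent1):
            """Returns [page, top-left, bottom-right] for the entity, or [] if it has no bounding box."""
            x1 = []
            y1 = []
            # Default result, so the function never returns an unbound name
            a = []
            try:
                # An unset page is treated as page 0
                p = ent1.page_anchor.page_refs[0].page or 0
            except Exception:
                p = 0
            try:
                for ver in ent1.page_anchor.page_refs[
                    0
                ].bounding_poly.normalized_vertices:
                    x1.append(ver.x)
                    y1.append(ver.y)
                a = [
                    {"page": p},
                    {"x": min(x1), "y": min(y1)},
                    {"x": max(x1), "y": max(y1)},
                ]
            except Exception:
                # Missing page reference or empty vertex list: keep the default
                pass
            return a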

        dict_line_1 = {}

        for line_1, entities_1 in line_dict.items():
            for entity in entities_1:
                try:
                    entity_id = entity.id
                except:
                    entity_id = ""

                if line_1 in dict_line_1.keys():
                    temp_df = pd.DataFrame(
                        [
                            [
                                line_1,
                                entity_id,
                                entity.type,
                                entity.mention_text,
                                get_min_max_ver(entity),
                            ]
                        ],
                        columns=df.columns,
                    )
                    df = pd.concat([df, temp_df])
                    dict_line_1[line_1].append(
                        {
                            entity_id: [
                                {"type": entity.type},
                                {"mentionText": entity.mention_text},
                                {"ver": get_min_max_ver(entity)},
                            ]
                        }
                    )
                else:
                    dict_line_1[line_1] = [
                        {
                            entity_id: [
                                {"type": entity.type},
                                {"mentionText": entity.mention_text},
                                {"ver": get_min_max_ver(entity)},
                            ]
                        }
                    ]
                    temp_df = pd.DataFrame(
                        [
                            [
                                line_1,
                                entity_id,
                                entity.type,
                                entity.mention_text,
                                get_min_max_ver(entity),
                            ]
                        ],
                        columns=df.columns,
                    )
                    df = pd.concat([df, temp_df])

        return dict_line_1, df, line_dict, sub_items

    def BBoxOverlap(entity1, entity2):
        def valid_bbox_iou(gt_bbox, pred_bbox) -> bool:
            """Returns True if the two bounding boxes overlap with an IoU of at least 0.2."""
            if len(gt_bbox.normalized_vertices) != 4:
                return False
            if len(pred_bbox.normalized_vertices) != 4:
                return True
            # bbox represent as [x_min, x_max, y_min, y_max]
            bbox1 = get_bounding_bbox(gt_bbox)
            bbox2 = get_bounding_bbox(pred_bbox)
            xmin = max(bbox1[0], bbox2[0])
            xmax = min(bbox1[1], bbox2[1])
            ymin = max(bbox1[2], bbox2[2])
            ymax = min(bbox1[3], bbox2[3])
            intersection_area = max(xmax - xmin, 0.0) * max(ymax - ymin, 0.0)
            union_area = (
                (bbox1[1] - bbox1[0]) * (bbox1[3] - bbox1[2])
                + (bbox2[1] - bbox2[0]) * (bbox2[3] - bbox2[2])
                - intersection_area
            )
            if union_area < 1e-10:
                return True
            iou = intersection_area / union_area
            return xmax > xmin and ymax > ymin and iou >= 0.2

        def get_bounding_bbox(bbox):
            """Returns the list representation for the bounding box."""
            x_coordinates = get_bounding_poly_x(bbox)
            y_coordinates = get_bounding_poly_y(bbox)
            # bbox represent as [x_min, x_max, y_min, y_max]
            return [
                min(x_coordinates),
                max(x_coordinates),
                min(y_coordinates),
                max(y_coordinates),
            ]

        def get_bounding_poly_x(bounding_poly):
            """Returns the list for x coordinates for the bounding poly."""
            return [
                normalized_vertices.x
                for normalized_vertices in bounding_poly.normalized_vertices
            ]

        def get_bounding_poly_y(bounding_poly):
            """Returns the list for y coordinates for the bounding poly."""
            return [
                normalized_vertices.y
                for normalized_vertices in bounding_poly.normalized_vertices
            ]

        gt_bbox = entity1.page_anchor.page_refs[0].bounding_poly
        pred_bbox = entity2.page_anchor.page_refs[0].bounding_poly
        return valid_bbox_iou(gt_bbox, pred_bbox)

    dict_line_GT, df1, line_dict_GT, sub_items_GT = get_line_items(doc_gt)
    dict_line_PP, df2, line_dict_PP, sub_items_pp = get_line_items(doc_pp)
    df1.to_csv("GT_line.csv")
    df2.to_csv("post_line.csv")

    def check_page_match(ent1, ent2):
        """Returns True if both entities are anchored to the same page."""

        def get_page(ent):
            # A missing page reference or an unset page is treated as page 0
            try:
                return ent.page_anchor.page_refs[0].page or 0
            except Exception:
                return 0

        return get_page(ent1) == get_page(ent2)

    entities_match = []
    entities_nomatch = {}
    ent_1 = []
    ent_gt_matched = []
    ent_pp_matched = []
    for line_gt, ent_gt in line_dict_GT.items():
        for ent1 in ent_gt:
            for line_pp, ent_pp in line_dict_PP.items():
                for ent2 in ent_pp:
                    # Match on same page, overlapping boxes, and similar text
                    if (
                        check_page_match(ent1, ent2)
                        and BBoxOverlap(ent1, ent2)
                        and fuzz.ratio(ent1.mention_text, ent2.mention_text) / 100 > 0.8
                    ):
                        gt = {str(line_gt) + "_GT": [ent1]}
                        pp = {str(line_pp) + "_PP": [ent2]}
                        entities_match.append([gt, pp])
                        ent_gt_matched.append(ent1)
                        ent_pp_matched.append(ent2)
    df_merge = pd.DataFrame(
        columns=[
            "line_GT",
            "line_PP",
            "id_GT",
            "id_PP",
            "type_GT",
            "type_PP",
            "mentionText_GT",
            "mentionText_PP",
        ]
    )
    for item in entities_match:
        for entity in item:
            for line, ent in entity.items():
                if "_GT" in line:
                    line_GT = line
                    id_GT = ent[0].id
                    type_GT = ent[0].type
                    mentionText_GT = ent[0].mention_text
                elif "_PP" in line:
                    line_PP = line
                    try:
                        e_id = ent[0].id
                    except:
                        e_id = ""
                    id_PP = e_id
                    type_PP = ent[0].type
                    mentionText_PP = ent[0].mention_text
                    # print(line_GT)
        temp_df = pd.DataFrame(
            [
                [
                    line_GT,
                    line_PP,
                    id_GT,
                    id_PP,
                    type_GT,
                    type_PP,
                    mentionText_GT,
                    mentionText_PP,
                ]
            ],
            columns=df_merge.columns,
        )
        df_merge = pd.concat([df_merge, temp_df])
    left_over_GT = []
    for ent11 in sub_items_GT:
        if ent11 not in ent_gt_matched:
            left_over_GT.append(ent11)

    for line_gt, ent_gt in line_dict_GT.items():
        for ent1 in ent_gt:
            for ent2 in left_over_GT:
                if ent1 == ent2:
                    line_GT = str(line_gt) + "_GT"
                    line_PP = "_____"
                    id_GT = ent1.id
                    id_PP = "_____"
                    type_GT = ent1.type
                    type_PP = "_____"
                    mentionText_GT = ent1.mention_text
                    mentionText_PP = "_____"
                    temp_df = pd.DataFrame(
                        [
                            [
                                line_GT,
                                line_PP,
                                id_GT,
                                id_PP,
                                type_GT,
                                type_PP,
                                mentionText_GT,
                                mentionText_PP,
                            ]
                        ],
                        columns=df_merge.columns,
                    )
                    df_merge = pd.concat([df_merge, temp_df])
    left_over_pp = []
    for ent11 in sub_items_pp:
        if ent11 not in ent_pp_matched:
            left_over_pp.append(ent11)

    for line_pp, ent_pp in line_dict_PP.items():
        for ent1 in ent_pp:
            for ent2 in left_over_pp:
                if ent1 == ent2:
                    line_GT = "_____"
                    line_PP = str(line_pp) + "_PP"
                    id_GT = "_____"
                    try:
                        en_id = ent1.id
                    except:
                        en_id = ""
                    id_PP = en_id
                    type_GT = "_____"
                    type_PP = ent1.type
                    mentionText_GT = "_____"
                    mentionText_PP = ent1.mention_text
                    temp_df = pd.DataFrame(
                        [
                            [
                                line_GT,
                                line_PP,
                                id_GT,
                                id_PP,
                                type_GT,
                                type_PP,
                                mentionText_GT,
                                mentionText_PP,
                            ]
                        ],
                        columns=df_merge.columns,
                    )
                    df_merge = pd.concat([df_merge, temp_df])

    match = []
    for l1 in entities_match:
        k = []
        for item1 in l1:
            for lin1, en1 in item1.items():
                k.append(lin1)
        match.append(k)

    counts = {}
    for item in match:
        key = tuple(item)
        counts[key] = counts.get(key, 0) + 1

    line_change = {}
    for line_match, count in counts.items():
        l1 = int(line_match[0].split("_")[0])
        l2 = int(line_match[1].split("_")[0])
        # print(l1,l2)

        if line_match[0] in line_change.keys():
            if line_change[line_match[0]]["count"] >= count:
                pass
            else:
                line_change[line_match[0]] = {"pp": line_match[1], "count": count}
        else:
            line_change[line_match[0]] = {"pp": line_match[1], "count": count}

    for gt_line, value in line_change.items():
        df_merge["line_PP"] = df_merge["line_PP"].replace(value["pp"], gt_line)

    def check_match(row):
        if (
            row["line_GT"] == row["line_PP"]
            and row["line_GT"] != "_____"
            and row["line_PP"] != "_____"
        ):
            return "TP"
        elif (
            row["line_GT"] != row["line_PP"]
            and row["line_GT"] != "_____"
            and row["line_PP"] != "_____"
        ):
            return "FP"
        elif row["line_GT"] == "_____":
            return "FN"
        elif row["line_PP"] == "_____":
            return "FN"

    # Add a new column 'Match' indicating the match
    df_merge["Match"] = df_merge.apply(check_match, axis=1)

    return df_merge
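
The final part of `get_comparision_dataframe` aligns line numbering between the two documents: each ground-truth line is paired with the predicted line that shares the most matched entities with it, and that predicted line is then relabelled with the ground-truth line name. The following sketch shows only the voting step on plain `(gt_line, pp_line)` pairs. `majority_line_mapping` is a hypothetical name, and the sketch is a simplified reading of the logic above rather than a drop-in replacement.

```python
from collections import Counter


def majority_line_mapping(matched_pairs):
    """Map each ground-truth line to the predicted line it shares the most matches with.

    matched_pairs: iterable of (gt_line, pp_line) tuples, one per matched entity.
    """
    counts = Counter(matched_pairs)
    best = {}
    for (gt_line, pp_line), count in counts.items():
        # Replace only on a strictly higher count, so ties keep the first pairing seen
        if gt_line not in best or count > best[gt_line][1]:
            best[gt_line] = (pp_line, count)
    return {gt_line: pp_line for gt_line, (pp_line, _) in best.items()}


# Two entities of GT line 0 landed on PP line 1 and one on PP line 2
pairs = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 2)]
mapping = majority_line_mapping(pairs)
```

Here GT line 0 maps to PP line 1, so the single entity that landed on PP line 2 is the one that ends up flagged as assigned to the wrong line.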

Run the line-item comparison for every ground-truth file using the function defined above.


In [None]:
GT_list, GT_file_dict = utilities.file_names(gt_jsons_uri)
parsed_list, parsed_file_dict = utilities.file_names(parsed_jsons_uri)
post_processed_list, post_processed_file_dict = utilities.file_names(
    post_processed_jsons_uri
)

GT_bucket = gt_jsons_uri.split("/")[2]
parsed_bucket = parsed_jsons_uri.split("/")[2]
post_processed_json_bucket = post_processed_jsons_uri.split("/")[2]

from fuzzywuzzy import fuzz

df_compare_all_files = pd.DataFrame()
df_compare_accuracy = pd.DataFrame()
for GT_file, GT_file_path in GT_file_dict.items():
    # Files are paired by name, so each name must exist in all three folders.
    # Each document is read from its own folder's path, not the ground-truth path.
    doc_gt = utilities.documentai_json_proto_downloader(GT_bucket, GT_file_path)
    doc_parser = utilities.documentai_json_proto_downloader(
        parsed_bucket, parsed_file_dict[GT_file]
    )
    doc_pp = utilities.documentai_json_proto_downloader(
        post_processed_json_bucket, post_processed_file_dict[GT_file]
    )

    # break
    df_compare_gt_pp = get_comparision_dataframe(doc_gt, doc_pp)
    df_compare_gt_parser = get_comparision_dataframe(doc_gt, doc_parser)
    file_accuracy_pp = (df_compare_gt_pp["Match"].value_counts().get("TP", 0)) / (
        (df_compare_gt_pp["Match"].value_counts().get("TP", 0))
        + (df_compare_gt_pp["Match"].value_counts().get("FP", 0))
        + (df_compare_gt_pp["Match"].value_counts().get("FN", 0))
    )
    temp_df_compare_gt_pp = pd.DataFrame(
        [
            [
                GT_file,
                "-",
                "Accuracy",
                "-",
                "GT",
                "/",
                "post-processed",
                "-",
                round(file_accuracy_pp, 3),
            ]
        ],
        columns=df_compare_gt_pp.columns,
    )
    df_compare_gt_pp = pd.concat([df_compare_gt_pp, temp_df_compare_gt_pp])
    file_accuracy_p = (df_compare_gt_parser["Match"].value_counts().get("TP", 0)) / (
        (df_compare_gt_parser["Match"].value_counts().get("TP", 0))
        + (df_compare_gt_parser["Match"].value_counts().get("FP", 0))
        + (df_compare_gt_parser["Match"].value_counts().get("FN", 0))
    )
    temp_df_compare_gt_parser = pd.DataFrame(
        [
            [
                GT_file,
                "-",
                "Accuracy",
                "-",
                "GT",
                "/",
                "parsed",
                "-",
                round(file_accuracy_p, 3),
            ]
        ],
        columns=df_compare_gt_parser.columns,
    )
    df_compare_gt_parser = pd.concat([df_compare_gt_parser, temp_df_compare_gt_parser])
    frames = [df_compare_all_files, df_compare_gt_pp, df_compare_gt_parser]
    df_compare_all_files = pd.concat(frames)
    # break
df_compare_all_files.to_csv("compare_all.csv")

# 4. Output

The script writes a CSV file, `compare_all.csv`, that shows the differences between the ground-truth, parsed, and post-processed files.
<img src="./images/sample_output.png" width=800 height=400></img>


- **GT (Ground Truth):** The reference annotations for a document.
- **PP (Parsed/Post-Processed):** The parsed or post-processed output that is compared against the ground truth.
- **line_GT and line_PP:** The line-item numbers of a child entity in each document, used to check whether the entity is assigned to the same line item in both.
- **Match:** Indicates whether the child entity is assigned to the correct line item.
    - **TP (True Positive):** The entity is found in both documents and assigned to the same line item.
    - **FP (False Positive):** The entity is found in both documents but assigned to different line items.
    - **FN (False Negative):** The entity is present in only one of the two documents, so its counterpart in the ground truth or in the parsed/post-processed output is missing.
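
The per-file accuracy row in the CSV is the share of true positives among all labelled rows, TP / (TP + FP + FN). As a minimal sketch on a plain list of labels, where `line_item_accuracy` is a hypothetical helper and returning 0.0 for an empty list is a choice made for this sketch:

```python
from collections import Counter


def line_item_accuracy(match_labels):
    """Return TP / (TP + FP + FN) for a list of 'TP', 'FP', and 'FN' labels."""
    counts = Counter(match_labels)
    total = counts["TP"] + counts["FP"] + counts["FN"]
    # Avoid dividing by zero when a document has no line-item entities
    return counts["TP"] / total if total else 0.0


labels = ["TP", "TP", "FP", "FN"]
accuracy = line_item_accuracy(labels)
```

With two of the four rows labelled TP, the accuracy is 0.5.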