# Labeled Dataset Validation

* Author: docai-incubator@google.com

# Disclaimer
This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Purpose and Description
This tool takes labeled JSON files as input and reports labeling issues: blank entities, entities with text anchor or page anchor problems, and overlapping entities. The results are written to CSV files and kept in dictionaries for further use.

# Prerequisite
* Vertex AI Notebook
* Labeled JSON files in a GCS folder

# Step By Step Procedure

## 1. Import Modules/Packages

In [None]:
!pip install pandas
!pip install google-cloud-documentai

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [8]:
from collections import defaultdict
from typing import DefaultDict, List, Tuple

import pandas as pd
from google.cloud import documentai_v1beta3 as documentai

from utilities import (
    bb_intersection_over_union,
    documentai_json_proto_downloader,
    file_names,
)

## 2. Input Details

* **GCS_INPUT_PATH**: GCS folder path of the labeled JSON files

In [9]:
GCS_INPUT_PATH = "gs://bucket/path_to/input"
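The processing cell in the next step derives the bucket name with `GCS_INPUT_PATH.split("/")[2]`, which assumes the path has the form `gs://<bucket>/<prefix>`. A minimal sketch of a stricter split is shown below; `split_gcs_path` is a hypothetical helper introduced here for illustration and is not part of `utilities.py`.

```python
from typing import Tuple


def split_gcs_path(gcs_path: str) -> Tuple[str, str]:
    """Split 'gs://bucket/prefix' into (bucket, prefix). Hypothetical helper."""
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(f"Expected a path starting with {scheme!r}: {gcs_path!r}")
    # Everything up to the first "/" after the scheme is the bucket name
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {gcs_path!r}")
    return bucket, prefix.strip("/")


bucket, prefix = split_gcs_path("gs://bucket/path_to/input")
print(bucket)  # bucket
print(prefix)  # path_to/input
```

Failing early on a malformed path can be easier to debug than an unexpected bucket name further down the run.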

## 3. Run Below Code Cell

In [None]:
def get_x_y_list(
    bounding_poly: documentai.BoundingPoly,
) -> Tuple[List[float], List[float]]:
    """Split a BoundingPoly object into separate lists of normalized x and y coordinates

    Args:
        bounding_poly (documentai.BoundingPoly): Bounding polygon taken from an entity's page reference

    Returns:
        Tuple[List[float], List[float]]: The normalized x coordinates and y coordinates as separate lists
    """

    x, y = [], []
    normalized_vertices = bounding_poly.normalized_vertices
    for nv in normalized_vertices:
        x.append(nv.x)
        y.append(nv.y)
    return x, y


def get_page_wise_entities(
    doc: documentai.Document,
) -> DefaultDict[int, List[documentai.Document.Entity]]:
    """Group the entities of a Document object by page

    Args:
        doc (documentai.Document): Document proto object

    Returns:
        DefaultDict[int, List[documentai.Document.Entity]]: Dictionary with the page number as key and the list of entities on that page as value
    """

    entities_page = defaultdict(list)
    for entity in doc.entities:
        if entity.properties:
            for subentity in entity.properties:
                # Group each child entity by its own page, not the parent's
                page = subentity.page_anchor.page_refs[0].page
                entities_page[page].append(subentity)
            continue
        page = entity.page_anchor.page_refs[0].page
        entities_page[page].append(entity)

    return entities_page


# getting blank entities
def get_blank_entities(
    doc: documentai.Document,
) -> Tuple[List[str], List[documentai.Document.Entity]]:
    """Identify blank entities (empty mention_text) in a Document proto

    Args:
        doc (documentai.Document): Document proto object

    Returns:
        Tuple[List[str], List[documentai.Document.Entity]]: Entity types and the corresponding entities whose mention_text is empty.
            For a blank child entity, the type entry is a [parent type, child type] pair and the parent entity is recorded.
    """

    blank_space_ent_name = []
    blank_space_entities = []
    for entity in doc.entities:
        if not entity.mention_text:
            blank_space_entities.append(entity)
            blank_space_ent_name.append(entity.type_)
        if entity.properties:
            for subentity in entity.properties:
                if not subentity.mention_text:
                    blank_space_entities.append(entity)
                    blank_space_ent_name.append([entity.type_, subentity.type_])
    return blank_space_ent_name, blank_space_entities


# getting labelling issue entities(text and page anchor missing)
def get_labeling_issues(
    doc: documentai.Document,
) -> Tuple[List[str], List[documentai.Document.Entity]]:
    """Identify entities whose text anchor or page anchor is missing or incomplete

    Args:
        doc (documentai.Document): Document proto object

    Returns:
        Tuple[List[str], List[documentai.Document.Entity]]: Entity types and the corresponding entities whose text_anchor or page_anchor is empty or incomplete
    """

    labeling_issue_ent_name = []
    labeling_issue_ent = []

    for entity in doc.entities:
        # page anchor issues: no page reference, or a bounding polygon without 4 vertices
        page_refs = entity.page_anchor.page_refs
        ver = page_refs[0].bounding_poly.normalized_vertices if page_refs else []
        if len(ver) != 4:
            labeling_issue_ent_name.append(entity.type_)
            labeling_issue_ent.append(entity)
        # text anchor issues: no text segments
        index = entity.text_anchor.text_segments
        if not index:
            labeling_issue_ent_name.append(entity.type_)
            labeling_issue_ent.append(entity)

    return labeling_issue_ent_name, labeling_issue_ent


def get_overlapping_entities(
    doc: documentai.Document,
) -> Tuple[List[List[str]], List[List[documentai.Document.Entity]]]:
    """Identify overlapping entities: pairs on the same page whose bounding boxes have an IoU above 0.5

    Args:
        doc (documentai.Document): Document proto object

    Returns:
        Tuple[List[List[str]],List[List[documentai.Document.Entity]]]: Pairs of entity types and the corresponding pairs of entities that overlap.
            Each pair of entity types is reported once per document.
    """

    entities_type_overlap = []
    entities_overlap = []
    entitites_page_wise = get_page_wise_entities(doc)
    for page, entities in entitites_page_wise.items():
        for entity1 in entities:
            for entity2 in entities:
                if entity1 != entity2:
                    x, y = get_x_y_list(entity1.page_anchor.page_refs[0].bounding_poly)
                    box1 = [min(x), min(y), max(x), max(y)]
                    x, y = get_x_y_list(entity2.page_anchor.page_refs[0].bounding_poly)
                    box2 = [min(x), min(y), max(x), max(y)]
                    iou = bb_intersection_over_union(box1, box2)
                    if iou > 0.5:
                        if [
                            entity1.type_,
                            entity2.type_,
                        ] not in entities_type_overlap and [
                            entity2.type_,
                            entity1.type_,
                        ] not in entities_type_overlap:
                            entities_type_overlap.append([entity1.type_, entity2.type_])
                            entities_overlap.append([entity1, entity2])
    return entities_type_overlap, entities_overlap


print("Process Started")
df = pd.DataFrame(
    columns=["File_Name", "Blank_Entities", "Labeling_Issues", "Overlapping_Issues"]
)
file_names_list, file_dict = file_names(GCS_INPUT_PATH)
file_wise_ent_type = {}
file_wise_entities = {}
for filename, filepath in file_dict.items():
    input_bucket_name = GCS_INPUT_PATH.split("/")[2]
    print("\tfilename: ", filename)
    doc = documentai_json_proto_downloader(input_bucket_name, filepath)
    blank_ent_name, blank_entities = get_blank_entities(doc)
    labeling_issue_ent_name, labeling_issue_ent = get_labeling_issues(doc)
    entities_type_overlap, entities_overlap = get_overlapping_entities(doc)
    # Use separate flag names so the blank_entities list returned above is not overwritten
    has_blank_entities = has_labeling_issues = has_overlapping_issues = "No"
    if blank_ent_name:
        has_blank_entities = "Yes"
    if labeling_issue_ent_name:
        has_labeling_issues = "Yes"
    if entities_type_overlap:
        has_overlapping_issues = "Yes"
    new_row = {
        "File_Name": filename,
        "Blank_Entities": has_blank_entities,
        "Labeling_Issues": has_labeling_issues,
        "Overlapping_Issues": has_overlapping_issues,
    }
    df.loc[len(df)] = new_row
    file_wise_ent_type[filename] = {
        "Blank_ent_type": blank_ent_name,
        "Labeling_ent_type": labeling_issue_ent_name,
        "overlapping_ent_type": entities_type_overlap,
    }
    file_wise_entities[filename] = {
        "Blank_ent": blank_entities,
        "Labeling_ent": labeling_issue_ent,
        "overlapping_ent": entities_overlap,
    }

df_ent_type = pd.DataFrame.from_dict(file_wise_ent_type, orient="index")
df_ent_type.reset_index(inplace=True)
df_ent_type.rename(columns={"index": "file_name"}, inplace=True)
print("Writing Data to labeling_issues_with_entity_type.csv ")
df_ent_type.to_csv("./labeling_issues_with_entity_type.csv")
print("Writing data to labeling_issues.csv")
df.to_csv("./labeling_issues.csv")
print("Process Completed for all files")

## 4. Output Details

The output consists of two CSV files and a dictionary.

In the CSV files, the columns are as follows:  
**File_Name**: File name of the labeled JSON in the GCS folder  
**Blank_Entities**: Whether the file contains entities that are labeled blank, i.e. entities whose mentionText is empty  
**Yes** - blank entities were found  
**No** - no blank entities were found in the JSON  
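The blank-entity check can be illustrated on the raw JSON form of a labeled document. The sketch below works on a plain dictionary rather than the Document proto, and assumes the Document AI JSON field names `entities`, `type`, `mentionText` and `properties`; `find_blank_entity_types` and the sample document are hypothetical.

```python
from typing import Any, Dict, List


def find_blank_entity_types(doc: Dict[str, Any]) -> List[str]:
    """Return the types of entities (and child entities) with empty mentionText."""
    blank_types = []
    for entity in doc.get("entities", []):
        if not entity.get("mentionText", ""):
            blank_types.append(entity.get("type", ""))
        # Child entities are stored under "properties"
        for child in entity.get("properties", []):
            if not child.get("mentionText", ""):
                blank_types.append(f'{entity.get("type", "")} > {child.get("type", "")}')
    return blank_types


# Hypothetical labeled document with one blank entity and one blank child entity
sample_doc = {
    "entities": [
        {"type": "invoice_id", "mentionText": "INV-1"},
        {"type": "total_amount", "mentionText": ""},
        {
            "type": "line_item",
            "mentionText": "2 Widget",
            "properties": [
                {"type": "quantity", "mentionText": "2"},
                {"type": "description"},  # mentionText missing entirely
            ],
        },
    ]
}

print(find_blank_entity_types(sample_doc))
```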

#### Labeling issues:

Entities whose text anchor or page anchor is missing or incomplete are treated as labeling issues, because such entities prevent the JSON from being converted into proto format.  
**Yes** - at least one entity has labeling issues  
**No** - no labeling issues were found  
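As an illustration of what counts as a labeling issue, the sketch below applies the same two checks to a single entity in raw JSON form: the text anchor must have at least one text segment, and the first page reference must carry a bounding polygon with exactly four normalized vertices. It assumes the Document AI JSON field names `textAnchor`, `textSegments`, `pageAnchor`, `pageRefs`, `boundingPoly` and `normalizedVertices`; `anchor_issues` and the sample entities are hypothetical.

```python
from typing import Any, Dict, List


def anchor_issues(entity: Dict[str, Any]) -> List[str]:
    """Return which anchors of an entity are missing or incomplete."""
    issues = []
    segments = entity.get("textAnchor", {}).get("textSegments", [])
    if not segments:
        issues.append("text_anchor")
    page_refs = entity.get("pageAnchor", {}).get("pageRefs", [])
    vertices = []
    if page_refs:
        vertices = page_refs[0].get("boundingPoly", {}).get("normalizedVertices", [])
    if len(vertices) != 4:
        issues.append("page_anchor")
    return issues


# Hypothetical entities: one well-formed, one with both anchors missing
good_entity = {
    "type": "invoice_id",
    "textAnchor": {"textSegments": [{"startIndex": "10", "endIndex": "15"}]},
    "pageAnchor": {
        "pageRefs": [
            {
                "boundingPoly": {
                    "normalizedVertices": [
                        {"x": 0.1, "y": 0.1},
                        {"x": 0.3, "y": 0.1},
                        {"x": 0.3, "y": 0.2},
                        {"x": 0.1, "y": 0.2},
                    ]
                }
            }
        ]
    },
}
bad_entity = {"type": "total_amount", "mentionText": "42.00"}

print(anchor_issues(good_entity))
print(anchor_issues(bad_entity))
```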


#### Overlapping issues:
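Two entities on the same page whose bounding boxes overlap with an intersection-over-union (IoU) above 0.5, i.e. the same area labeled more than once with the same or a different entity type, as in the example below  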
<img src="./images/overlapping_issue.png">  
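The overlap decision rests on the intersection-over-union of two axis-aligned boxes given as `[x_min, y_min, x_max, y_max]`. The sketch below is a standalone illustration of that measure; it is not the `bb_intersection_over_union` implementation from `utilities.py`, which may differ in detail, and the function name `iou` and the sample boxes are hypothetical.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not intersect
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two nearly identical boxes (flagged as overlapping) and two disjoint boxes
nearly_same = iou([0.10, 0.10, 0.30, 0.20], [0.12, 0.10, 0.30, 0.20])
disjoint = iou([0.10, 0.10, 0.30, 0.20], [0.50, 0.50, 0.70, 0.60])
print(nearly_same > 0.5)  # True
print(disjoint > 0.5)  # False
```

With normalized coordinates the measure is independent of page size, so one threshold can be used across documents.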

### 1. labeling_issues.csv
<img src="./images/labeling_issues.csv.png">  

### 2. labeling_issues_with_entity_type.csv
This CSV file is the same as **labeling_issues.csv**, but lists the entity types that have issues instead of Yes/No flags.  

For blank entities and labeling issues, the column holds a list of the affected entity types.  

For overlapping issues, the column holds a nested list in which each inner list contains the two entity types that are labeled over the same area.  

<img src="./images/labeling_issues_with_entity_type.csv.png">  
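Because CSV cells are plain text, the lists in this file come back as strings when the file is read again. The sketch below shows one way to turn them back into Python lists with `ast.literal_eval`; the sample content is hypothetical, uses the column names written by the notebook, and omits the unnamed index column that `to_csv` adds by default.

```python
import ast
import csv
import io
from typing import Any, Dict, List

LIST_COLUMNS = ["Blank_ent_type", "Labeling_ent_type", "overlapping_ent_type"]

# Hypothetical file content for a single document
SAMPLE_CSV = (
    "file_name,Blank_ent_type,Labeling_ent_type,overlapping_ent_type\n"
    "doc1.json,\"['total_amount']\",[],\"[['invoice_id', 'purchase_order']]\"\n"
)


def parse_issue_rows(csv_text: str) -> List[Dict[str, Any]]:
    """Read the CSV text and convert the list-valued columns back into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in LIST_COLUMNS:
            # literal_eval only evaluates Python literals, not arbitrary code
            row[column] = ast.literal_eval(row[column])
        rows.append(row)
    return rows


for row in parse_issue_rows(SAMPLE_CSV):
    print(row["file_name"], row["overlapping_ent_type"])
```

When reading the file with pandas instead, the same conversion can be passed through the `converters` argument of `pd.read_csv`.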

**file_wise_entities** is a dictionary that holds the full details of every entity with issues, so that those entities can be deleted from the JSON if needed.  
<img src="./images/file_wise_entities_dict.png">  
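Deleting flagged entities is left to the user. As a minimal sketch of one possible approach on the raw JSON form, the hypothetical helper `remove_entities` below drops every top-level entity for which a caller-supplied predicate returns true and leaves the input untouched; the sample document and the blank-entity predicate are illustrative assumptions.

```python
import copy
from typing import Any, Callable, Dict, Tuple


def remove_entities(
    doc: Dict[str, Any],
    should_remove: Callable[[Dict[str, Any]], bool],
) -> Tuple[Dict[str, Any], int]:
    """Return a cleaned copy of the document and the number of entities removed."""
    cleaned = copy.deepcopy(doc)  # keep the original labeled data unchanged
    original = cleaned.get("entities", [])
    kept = [entity for entity in original if not should_remove(entity)]
    cleaned["entities"] = kept
    return cleaned, len(original) - len(kept)


# Hypothetical document with one blank entity
labeled_doc = {
    "entities": [
        {"type": "invoice_id", "mentionText": "INV-1"},
        {"type": "total_amount", "mentionText": ""},
    ]
}

cleaned_doc, removed = remove_entities(
    labeled_doc, lambda entity: not entity.get("mentionText", "")
)
print(removed)  # 1
print([entity["type"] for entity in cleaned_doc["entities"]])  # ['invoice_id']
```

Working on a copy makes it possible to compare the cleaned document against the original before overwriting anything in GCS.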

