# Combine Address Lines

* Author: docai-incubator@google.com


## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description
This is a post-processing script that combines split address lines into a single address entity. In the sample parsed JSON file, a single address was split into four separate address_line entities. The script corrects this by merging those lines into one entity and removing the leftover split entities from the JSON. When the lines are combined, the entity's normalized vertices and text segment indexes are updated to match the merged text.
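
As a rough illustration of what "combining" means for the two fields mentioned above, the sketch below works on plain dictionaries rather than Document AI objects. The helper names `union_box` and `combine_segments` and the sample values are hypothetical, and the camelCase keys (`startIndex`, `endIndex`) are assumed to match the exported JSON; the actual script operates on `documentai.Document.Entity` objects.

```python
from typing import Dict, List

Vertex = Dict[str, float]
Segment = Dict[str, str]


def union_box(vertices_a: List[Vertex], vertices_b: List[Vertex]) -> List[Vertex]:
    """Return the four corners of the smallest box covering both inputs."""
    points = vertices_a + vertices_b
    min_x = min(p["x"] for p in points)
    max_x = max(p["x"] for p in points)
    min_y = min(p["y"] for p in points)
    max_y = max(p["y"] for p in points)
    # Corner order: top-left, top-right, bottom-right, bottom-left.
    return [
        {"x": min_x, "y": min_y},
        {"x": max_x, "y": min_y},
        {"x": max_x, "y": max_y},
        {"x": min_x, "y": max_y},
    ]


def combine_segments(a: List[Segment], b: List[Segment]) -> List[Segment]:
    """Concatenate text segments and order them by start index."""
    return sorted(a + b, key=lambda s: int(s["startIndex"]))


# Hypothetical values for two address lines of the same address.
line_1_box = [
    {"x": 0.1, "y": 0.1},
    {"x": 0.4, "y": 0.1},
    {"x": 0.4, "y": 0.12},
    {"x": 0.1, "y": 0.12},
]
line_2_box = [
    {"x": 0.1, "y": 0.125},
    {"x": 0.35, "y": 0.125},
    {"x": 0.35, "y": 0.145},
    {"x": 0.1, "y": 0.145},
]
line_1_segments = [{"startIndex": "120", "endIndex": "138"}]
line_2_segments = [{"startIndex": "139", "endIndex": "160"}]

merged_box = union_box(line_1_box, line_2_box)
merged_segments = combine_segments(line_2_segments, line_1_segments)
print(merged_box)
print(merged_segments)
```

The merged entity keeps every original text segment, so the indexes still point into the document text, while the bounding box grows to cover both lines.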

## Prerequisites

1. Vertex AI Notebook
2. Parsed json files in GCS Folder.
3. Output folder to upload the updated json files.

## Step by Step procedure 

### 1. Input details


In [3]:
# INPUT : GCS URI of the folder containing the parsed JSON files
INPUT_PATH = "gs://xxxxx/xxxxxxxxxx/xxxxx"
# OUTPUT : GCS URI of the folder where the updated JSON files are saved
OUTPUT_PATH = "gs://xxxxxx/xxxxxxxxxx/xxxx"
entity_names = [
    "ship_to_address_line",
    "billing_address_line",
]  # List of entity types whose lines are combined; each type is processed separately


<ul>
    <li><b>INPUT_PATH:</b> GCS URI of the folder that contains the DocAI processed output json files.</li>
    <li><b>OUTPUT_PATH:</b> GCS URI of the folder where the updated json files are saved.</li>
    <li><b>entity_names:</b> List of entity types whose split lines need to be combined.</li>
</ul>
<div style="background-color:#f5f569" ><i><b>Note:</b> Each entity type in the list is combined separately. After merging, the script removes the last five characters of the entity type, so the names are expected to end in <code>_line</code> (for example, <code>ship_to_address_line</code> becomes <code>ship_to_address</code>).</i></div>
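
The script later derives the bucket name and the folder prefix from these URIs by splitting on `/`. A minimal sketch of that step, assuming URIs of the form `gs://bucket/folder/...`, is shown below. The helper name `split_gcs_uri` and the sample URI are hypothetical and not part of the tool.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Expected a URI starting with {scheme!r}, got {uri!r}")
    # Everything up to the first '/' after the scheme is the bucket name.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {uri!r}")
    # Drop a trailing slash so the prefix can be joined with file names later.
    return bucket, prefix.strip("/")


bucket_name, prefix_path = split_gcs_uri("gs://my-bucket/invoices/parsed/")
print(bucket_name, prefix_path)
```

Validating the scheme up front gives a clearer error than an unexpected bucket name further down the pipeline.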

### 2. Output

The post-processed json files are saved in the storage path provided by the user in <code>OUTPUT_PATH</code>. <br><hr>
<b>Comparison Between Input and Output File</b><br><br>
<h4><i>Post processing results</i></h4><br>
Running the post-processing script against the input data produces the output json data. The following images highlight the differences for these elements in the json document.<br>
<ul style="margin:5px">
    <li>Address</li>
    <li>Normalized Vertices</li>
    <li>Text Segment indexes</li>
</ul>


<img src="./Images/combine_address_lines_output_1.png" width=800 height=400 alt="Combine address line output image">
<img src="./Images/combine_address_lines_output_2.png" width=800 height=400 alt="Combine address line output image">
<img src="./Images/combine_address_lines_output_3.png" width=800 height=400 alt="Combine address line output image">
    
<span>When the output json document is imported into the processor, the address appears as a single entity with one bounding box, as shown:</span><br><br>
<img src="./Images/combine_address_lines_output_5.png" width=800 height=400 alt="Combine address line output image">
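
One quick way to check the result without importing it into a processor is to count the entities per type before and after. The sketch below assumes the json files have already been loaded into dictionaries with a top-level `entities` list whose items carry a `type` key. The helper name and the sample documents are hypothetical.

```python
from collections import Counter
from typing import Any, Dict


def count_entity_types(document: Dict[str, Any]) -> Counter:
    """Count how many entities of each type a parsed document contains."""
    return Counter(e.get("type", "") for e in document.get("entities", []))


# Hypothetical documents: four split lines before, one merged entity after.
before = {
    "entities": [
        {"type": "ship_to_address_line"},
        {"type": "ship_to_address_line"},
        {"type": "ship_to_address_line"},
        {"type": "ship_to_address_line"},
        {"type": "invoice_id"},
    ]
}
after = {
    "entities": [
        {"type": "ship_to_address"},
        {"type": "invoice_id"},
    ]
}

print(count_entity_types(before))
print(count_entity_types(after))
```

Entity types that are not listed in `entity_names` should show the same count in both documents.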

### 3. Run the code

In [None]:
!pip install google-cloud-documentai google-cloud-storage
!pip install tqdm

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
from io import BytesIO
import json
from google.cloud import storage
from google.cloud import documentai_v1beta3 as documentai
import copy
from tqdm.notebook import tqdm
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
from utilities import (
    file_names,
    documentai_json_proto_downloader,
    store_document_as_json,
    bbox_maker,
)

input_bucket_name = INPUT_PATH.split("/")[2]
input_bucket_path_prefix = "/".join(INPUT_PATH.split("/")[3:])
output_bucket_name = OUTPUT_PATH.split("/")[2]
output_prefix_path = "/".join(OUTPUT_PATH.split("/")[3:])


# Threshold used to decide whether to combine two address lines.
# Two address lines on the same page are combined when their distance is below 0.05.
dist_limit = 0.05


def combine_two_entities(
    entity1: documentai.Document.Entity,
    entity2: documentai.Document.Entity,
    js: documentai.Document,
) -> documentai.Document.Entity:
    """
    Combine two entities into one, with updated mention text, bounding box, text anchor and text segments.

    Parameters
    ----------
    entity1 : documentai.Document.Entity
        The first entity object from the input document to be merged.
    entity2 : documentai.Document.Entity
        The second entity object from the input document to be merged.
    js : documentai.Document
        The main document object; its text is used to rebuild the merged mention text.

    Returns
    -------
    documentai.Document.Entity
        Returns the new merged entity with the updated information.
    """

    new_entity = documentai.Document.Entity()
    new_entity.type = entity1.type
    text_anchor = documentai.Document.TextAnchor()
    textAnchorList = []

    entity1.text_anchor.text_segments = sorted(
        entity1.text_anchor.text_segments, key=lambda x: int(x.start_index)
    )
    entity2.text_anchor.text_segments = sorted(
        entity2.text_anchor.text_segments, key=lambda x: int(x.start_index)
    )
    for j in entity1.text_anchor.text_segments:
        textAnchorList.append(j)

    for j in entity2.text_anchor.text_segments:
        textAnchorList.append(j)
    textAnchorList = sorted(textAnchorList, key=lambda x: int(x.start_index))
    mentionText = ""
    for j in textAnchorList:
        if (js.text[int(j.end_index) : int(j.end_index) + 1] == "\n") or (
            js.text[int(j.end_index) : int(j.end_index) + 1] == " "
        ):
            mentionText += js.text[int(j.start_index) : int(j.end_index) + 1]
        else:
            mentionText += js.text[int(j.start_index) : int(j.end_index)]
    new_entity.mention_text = mentionText
    text_anchor.content = mentionText
    # Add all the text anchor present in entity1 & entity2
    temp_text_anchor_list = []
    for i in range(len(entity1.text_anchor.text_segments)):
        temp_text_anchor_list.append(entity1.text_anchor.text_segments[i])
    for i in range(len(entity2.text_anchor.text_segments)):
        temp_text_anchor_list.append(entity2.text_anchor.text_segments[i])
    text_anchor.text_segments = temp_text_anchor_list
    new_entity.text_anchor = text_anchor

    entity1_coordinates_list = []
    for i in entity1.page_anchor.page_refs[0].bounding_poly.normalized_vertices:
        entity1_coordinates_list.append({"x": i.x, "y": i.y})
    entity2_coordinates_list = []
    for i in entity2.page_anchor.page_refs[0].bounding_poly.normalized_vertices:
        entity2_coordinates_list.append({"x": i.x, "y": i.y})
    entity1_coordinates_list = bbox_maker(entity1_coordinates_list)
    entity2_coordinates_list = bbox_maker(entity2_coordinates_list)
    min_x = min(entity1_coordinates_list[0], entity2_coordinates_list[0])
    min_y = min(entity1_coordinates_list[1], entity2_coordinates_list[1])
    max_x = max(entity1_coordinates_list[2], entity2_coordinates_list[2])
    max_y = max(entity1_coordinates_list[3], entity2_coordinates_list[3])

    A = {"x": min_x, "y": min_y}
    B = {"x": max_x, "y": min_y}
    C = {"x": max_x, "y": max_y}
    D = {"x": min_x, "y": max_y}
    new_entity.page_anchor = entity1.page_anchor
    new_entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices = [A, B, C, D]
    return new_entity


def merge_address_lines(
    list_of_address_entities: List[documentai.Document.Entity],
    list_of_xy_coordinates: List[List[float]],
    list_of_page_numbers: List[int],
    js: documentai.Document,
) -> Tuple[List[documentai.Document.Entity], List[List[float]]]:
    """
    Merge address entities that lie close together on the same page. Helper functions merge the bounding boxes and calculate the distance between entities.

    Parameters
    ----------
    list_of_address_entities : List[documentai.Document.Entity]
        The entities whose type matches the entity name provided by the user.
    list_of_xy_coordinates: List[List[float]]
        The bounding boxes ([y_min, x_min, y_max, x_max]) of the entities in list_of_address_entities.
    list_of_page_numbers: List[int]
        The page numbers of the entities in list_of_address_entities.
    js: documentai.Document
        The original input document.

    Returns
    -------
    Tuple[List[documentai.Document.Entity], List[List[float]]]
        Returns a tuple with the list of merged entities and their bounding boxes.
    """
    # Copy of the text and object arrays
    entities_copied = copy.deepcopy(list_of_address_entities)
    entities_boxes_copied = copy.deepcopy(list_of_xy_coordinates)
    entities_page_numbers_copied = copy.deepcopy(list_of_page_numbers)

    def merge_boxes(box1: List[float], box2: List[float]) -> List[float]:
        """
        Given two boxes, return a larger box that covers both of them.
        Parameters
        ----------
        box1: List[float]
            Bounding box of the first entity
        box2: List[float]
            Bounding box of the second entity
        Returns
        -------
        List[float] :
            Bounding box that covers both entities.
        """
        return [
            min(box1[0], box2[0]),
            min(box1[1], box2[1]),
            max(box1[2], box2[2]),
            max(box1[3], box2[3]),
        ]

    def calc_sim(text: List[float], obj: List[float]) -> float:
        """
        Compute the distance between the bounding boxes of two entities.

        Parameters
        ----------
        text: List[float]
            Bounding box of the first entity.
        obj: List[float]
            Bounding box of the other entity.
        Returns
        -------
        float :
            Returns the sum of the smallest horizontal and vertical edge gaps; a smaller value means the boxes are closer.
        """
        # text: ymin, xmin, ymax, xmax
        # obj: ymin, xmin, ymax, xmax
        text_ymin, text_xmin, text_ymax, text_xmax = text
        obj_ymin, obj_xmin, obj_ymax, obj_xmax = obj

        x_dist = min(
            abs(text_xmin - obj_xmin),
            abs(text_xmin - obj_xmax),
            abs(text_xmax - obj_xmin),
            abs(text_xmax - obj_xmax),
        )
        y_dist = min(
            abs(text_ymin - obj_ymin),
            abs(text_ymin - obj_ymax),
            abs(text_ymax - obj_ymin),
            abs(text_ymax - obj_ymax),
        )

        dist = x_dist + y_dist
        return dist

    def merge_algo(
        entities_copied: List[documentai.Document.Entity],
        entities_boxes_copied: List[List[float]],
    ) -> Tuple[bool, List[documentai.Document.Entity], List[List[float]]]:
        """
        Main merge step: find the first pair of entities that are close enough, merge them, and return.

        Parameters
        ----------
        entities_copied : List[documentai.Document.Entity]
            The working list of entities, including any already merged.
        entities_boxes_copied: List[List[float]]
           The bounding boxes of the entities in the entities_copied list.
        Returns
        -------
        Tuple[bool, List[documentai.Document.Entity], List[List[float]]]
            Returns a tuple of a boolean (True if a merge happened), the entities and the bounding boxes of the entities.
        """
        for i, (entity1, entity_box_1, page_ent_1) in enumerate(
            zip(entities_copied, entities_boxes_copied, entities_page_numbers_copied)
        ):
            for j, (entity2, entity_box_2, page_ent_2) in enumerate(
                zip(
                    entities_copied, entities_boxes_copied, entities_page_numbers_copied
                )
            ):
                if j <= i:
                    continue
                # Merge the pair if their distance is below the defined limit and they are on the same page
                if (
                    calc_sim(entity_box_1, entity_box_2) < dist_limit
                    and page_ent_1 == page_ent_2
                ):
                    # print(calc_sim(entity_box_1, entity_box_2))
                    # Create a new box
                    new_box = merge_boxes(entity_box_1, entity_box_2)
                    # Create a new entity
                    new_entity = combine_two_entities(entity1, entity2, js)
                    entities_copied[i] = new_entity
                    del entities_copied[j]
                    del entities_page_numbers_copied[j]
                    entities_boxes_copied[i] = new_box
                    # delete the box of the entity that was merged away
                    del entities_boxes_copied[j]
                    # signal that a merge happened so the caller runs another pass
                    return True, entities_copied, entities_boxes_copied

        return False, entities_copied, entities_boxes_copied

    need_to_merge = True

    # Merge full text
    while need_to_merge:
        need_to_merge, entities_copied, entities_boxes_copied = merge_algo(
            entities_copied, entities_boxes_copied
        )

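    # Remove the last five characters of the type, i.e. the "_line" suffix
    # (for example "ship_to_address_line" becomes "ship_to_address").
    for entity in entities_copied:
        entity.type = entity.type[:-5]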
    return entities_copied, entities_boxes_copied


file_name_list = [
    i for i in list(file_names(INPUT_PATH)[1].values()) if i.endswith(".json")
]

for file_index in tqdm(range(0, len(file_name_list))):
    file_name = file_name_list[file_index]
    print("\nProcessing >>> ", file_name)
    try:
        document = documentai_json_proto_downloader(input_bucket_name, file_name)
        for entity_name in entity_names:
            document.entities = sorted(
                document.entities,
                key=lambda x: int(x.text_anchor.text_segments[0].start_index),
            )
            list_of_xy_coordinates = []
            list_of_address_entities = []
            list_of_page_numbers = []
            for entity in document.entities:
                if entity.type == entity_name:
                    print(" Processing >>>>>>>>>>>>>>>> ", entity.type)
                    list_of_address_entities.append(entity)
                    entity_coordinates_list = []
                    for i in entity.page_anchor.page_refs[
                        0
                    ].bounding_poly.normalized_vertices:
                        entity_coordinates_list.append({"x": i.x, "y": i.y})
                    entity_coordinates_list = bbox_maker(entity_coordinates_list)
                    x_min = entity_coordinates_list[0]
                    y_min = entity_coordinates_list[1]
                    x_max = entity_coordinates_list[2]
                    y_max = entity_coordinates_list[3]
                    list_of_xy_coordinates.append([y_min, x_min, y_max, x_max])
                    page = 0
                    if entity.page_anchor.page_refs[0].page:
                        page = int(entity.page_anchor.page_refs[0].page)
                    list_of_page_numbers.append(page)
            new_entities, new_entities_xy = merge_address_lines(
                list_of_address_entities,
                list_of_xy_coordinates,
                list_of_page_numbers,
                document,
            )
            for entity in document.entities:
                if entity.type != entity_name:
                    new_entities.append(entity)

            document.entities = new_entities

    except Exception as e:
        print(
            f"[x] {input_bucket_name}/{file_name} || Error : {str(e)}",
            "\t !!! Please review manually",
        )
        continue

    output_file_name = f"{output_prefix_path}/{file_name.split('/')[-1]}"
    store_document_as_json(
        documentai.Document.to_json(document), output_bucket_name, output_file_name
    )
    print(f"[✓] {output_bucket_name}/{output_file_name}")

print("\nCompleted")
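
To see how the distance threshold drives the merge, the sketch below reruns the same idea on plain bounding boxes, without any Document AI objects. It is a simplified illustration and not the script itself: page numbers and entity text are left out, the function names `edge_distance`, `cover_box` and `greedy_merge` are hypothetical, and the sample boxes are made up. Boxes use the same `[y_min, x_min, y_max, x_max]` order as the script.

```python
from typing import List

Box = List[float]  # [y_min, x_min, y_max, x_max], normalized to 0..1


def edge_distance(a: Box, b: Box) -> float:
    """Sum of the smallest horizontal and vertical gaps between box edges."""
    a_ymin, a_xmin, a_ymax, a_xmax = a
    b_ymin, b_xmin, b_ymax, b_xmax = b
    x_dist = min(abs(ax - bx) for ax in (a_xmin, a_xmax) for bx in (b_xmin, b_xmax))
    y_dist = min(abs(ay - by) for ay in (a_ymin, a_ymax) for by in (b_ymin, b_ymax))
    return x_dist + y_dist


def cover_box(a: Box, b: Box) -> Box:
    """Smallest box that covers both inputs."""
    return [min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])]


def greedy_merge(boxes: List[Box], limit: float) -> List[Box]:
    """Merge the first close pair, then rescan, until no pair is close enough."""
    merged = [list(box) for box in boxes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if edge_distance(merged[i], merged[j]) < limit:
                    merged[i] = cover_box(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


# Two stacked address lines and one unrelated box far away (made-up values).
line_1 = [0.10, 0.10, 0.12, 0.40]
line_2 = [0.125, 0.10, 0.145, 0.35]
far_box = [0.60, 0.60, 0.62, 0.90]

result = greedy_merge([line_1, line_2, far_box], limit=0.05)
print(result)
```

With these values the two stacked lines share a left edge and sit about 0.005 apart vertically, so they fall under the 0.05 limit and are merged, while the far box stays separate. Because the measure compares edges only, raising the limit can also join boxes that are merely aligned, so it is worth tuning against a few sample documents.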