{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "72fd064f-24f5-4d61-b0ad-2b2f3fe9427d",
"metadata": {},
"source": [
"# Combine Address Lines"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f5756f1a-631f-4c8a-bba0-98c6821d31a9",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9b1d12ef-55dd-4fbd-8389-db14ed038eb1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "94527514-1ae2-470b-96e2-0f48e4aa5e81",
"metadata": {},
"source": [
"## Purpose and Description\n",
"This is a post processing script which combines the split address into one address. In the parsed sample json file it is observed that the single address_line item has been split into four multiple address_lines. This can be corrected by combining the address lines into a single address and removing other split address elements in the json. The json Entity keys Normalized Vertices and Text Segments indexes are to be updated properly with correct values when the address line is combined."
]
},
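{
"attachments": {},
"cell_type": "markdown",
"id": "union-box-sketch-md",
"metadata": {},
"source": [
"The cell below is a minimal sketch, with hypothetical coordinates, of the core geometric update applied when address lines are combined: the merged entity's bounding box is the union of the individual boxes, taking the minimum of the top-left coordinates and the maximum of the bottom-right coordinates of the normalized vertices."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "union-box-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch with hypothetical values: merging two normalized bounding\n",
"# boxes (one per address line) into a single union box, the same update the\n",
"# post-processing code below applies to the combined entity.\n",
"box_a = [0.10, 0.20, 0.45, 0.24]  # [x_min, y_min, x_max, y_max]\n",
"box_b = [0.10, 0.25, 0.50, 0.29]\n",
"\n",
"union_box = [\n",
"    min(box_a[0], box_b[0]),\n",
"    min(box_a[1], box_b[1]),\n",
"    max(box_a[2], box_b[2]),\n",
"    max(box_a[3], box_b[3]),\n",
"]\n",
"print(union_box)  # [0.1, 0.2, 0.5, 0.29]"
]
},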
{
"attachments": {},
"cell_type": "markdown",
"id": "a8783f52-627b-4efa-b5d9-664ae2ca2564",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"1. Vertex AI Notebook\n",
"2. Parsed json files in GCS Folder.\n",
"3. Output folder to upload the updated json files."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "55cc5540-deb1-4449-8278-716488c54e5c",
"metadata": {},
"source": [
"## Step by Step procedure "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "63a5f9d9",
"metadata": {},
"source": [
"### 1. Input details\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "89120af5-c5f4-4897-a640-4ad9c5ce4739",
"metadata": {},
"outputs": [],
"source": [
"# INPUT : storage bucket name\n",
"INPUT_PATH = \"gs://xxxxx/xxxxxxxxxx/xxxxx\"\n",
"# OUTPUT : storage bucket's path\n",
"OUTPUT_PATH = \"gs://xxxxxx/xxxxxxxxxx/xxxx\"\n",
"entity_names = [\n",
" \"ship_to_address_line\",\n",
" \"billing_address_line\",\n",
"] # List of entities that needs to be combined individually"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "621e46f4",
"metadata": {},
"source": [
"\n",
"<ul>\n",
" <li><b>input_path :</b> GCS Storage name. It should contain DocAI processed output json files. This bucket is used for processing input files and saving output files in the folders.</li>\n",
" <li><b>output_path:</b> GCS URI of the folder, where the dataset is exported from the processor.</li>\n",
" <li><b>Entity_names:</b>list of entity_names that needs to be combined. </li>\n",
"</ul>\n",
"<div style=\"background-color:#f5f569\" ><i><b>Note:</b> List of pairs of entities that need to be splitted. Also, the entity name should be mentioned like this (small_entity,large_entity)</i><div>"
]
},
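{
"attachments": {},
"cell_type": "markdown",
"id": "gcs-path-parsing-md",
"metadata": {},
"source": [
"As a quick illustration with a hypothetical URI, the script derives the bucket name and the folder prefix from a gs:// path as sketched below; INPUT_PATH and OUTPUT_PATH are parsed the same way in the main cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "gcs-path-parsing-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example of how a gs:// URI splits into bucket name and prefix.\n",
"example_path = \"gs://my-bucket/parsed-jsons/invoices\"\n",
"bucket_name = example_path.split(\"/\")[2]  # \"my-bucket\"\n",
"prefix = \"/\".join(example_path.split(\"/\")[3:])  # \"parsed-jsons/invoices\"\n",
"print(bucket_name, prefix)"
]
},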
{
"attachments": {},
"cell_type": "markdown",
"id": "84356940-95a8-4489-bfc3-b85611f9558a",
"metadata": {},
"source": [
"### 2. Output"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "55b50561-6cd5-433b-88db-b1bcfbbaaacb",
"metadata": {},
"source": [
"The post processed json field can be found in the storage path provided by the user during the script execution that is output_bucket_path. <br><hr>\n",
"<b>Comparison Between Input and Output File</b><br><br>\n",
"<i><h4>Post processing results<h4><i><br>\n",
"Upon running the post processing script against input data. The resultant output json data is obtained. The following table highlights the differences for following elements in the json document.<br>\n",
"<ul style=\"margin:5px\">\n",
" <li>Address</li>\n",
" <li>Normalized Vertices</li>\n",
" <li>Text Segment indexes</li>\n",
"<ul>\n",
"\n",
"\n",
"<img src=\"./Images/combine_address_lines_output_1.png\" width=800 height=400 alt=\"Combine address line output image\">\n",
"<img src=\"./Images/combine_address_lines_output_2.png\" width=800 height=400 alt=\"Combine address line output image\">\n",
"<img src=\"./Images/combine_address_lines_output_3.png\" width=800 height=400 alt=\"Combine address line output image\">\n",
" \n",
"<span>When the output json document is imported into the processor, it is observed that the address is now a single entity and the bounding box as shown:</span><br><br>\n",
"<img src=\"./Images/combine_address_lines_output_5.png\" width=800 height=400 alt=\"Combine address line output image\">"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a6fa4328-3d99-4de4-a5eb-0f2033d78b79",
"metadata": {},
"source": [
"### 3. Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8efe6885-c7cd-4e6e-a5df-c3b5ad18df74",
"metadata": {},
"outputs": [],
"source": [
"!pip install google.cloud\n",
"!pip install tqdm"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b303ff29",
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e4cd9f2-ed3f-4df2-bd01-a2479a93bccc",
"metadata": {},
"outputs": [],
"source": [
"from io import BytesIO\n",
"import json\n",
"from google.cloud import storage\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"import copy\n",
"from tqdm.notebook import tqdm\n",
"from typing import Any, Dict, List, Optional, Sequence, Tuple, Union\n",
"from utilities import (\n",
" file_names,\n",
" documentai_json_proto_downloader,\n",
" store_document_as_json,\n",
" bbox_maker,\n",
")\n",
"\n",
"input_bucket_name = INPUT_PATH.split(\"/\")[2]\n",
"input_bucket_path_prefix = \"/\".join(INPUT_PATH.split(\"/\")[3:])\n",
"output_bucket_name = OUTPUT_PATH.split(\"/\")[2]\n",
"output_prefix_path = \"/\".join(OUTPUT_PATH.split(\"/\")[3:])\n",
"\n",
"\n",
"dist_limit = 0.05 # threshold to check whether to combine address line or not. If two address lines are closer than 0.1 then they will be combined.\n",
"\n",
"\n",
"def combine_two_entities(\n",
" entity1: documentai.Document.Entity,\n",
" entity2: documentai.Document.Entity,\n",
" js: documentai.Document,\n",
") -> documentai.Document.Entity:\n",
" \"\"\"\n",
" To combine two different entities into one with updated content, mention text, boundary box,text anchor and text segment.\n",
"\n",
" Parameters\n",
" ----------\n",
" entity1 : documentai.Document.Entity\n",
" The first entity object from the input document which need to be merged in one.\n",
" entity2 : documentai.Document.Entity\n",
" The second entity object from the input document which need to be merged in one.\n",
" js : documentai.Document.Entity\n",
" The main document object where the merged entity need to be append.\n",
"\n",
" Returns\n",
" -------\n",
" documentai.Document.Entity\n",
" Returns the new merged entity having updated information.\n",
" \"\"\"\n",
"\n",
" new_entity = documentai.Document.Entity()\n",
" new_entity.type = entity1.type\n",
" text_anchor = documentai.Document.TextAnchor()\n",
" textAnchorList = []\n",
"\n",
" entity1.text_anchor.text_segments = sorted(\n",
" entity1.text_anchor.text_segments, key=lambda x: int(x.start_index)\n",
" )\n",
" entity2.text_anchor.text_segments = sorted(\n",
" entity2.text_anchor.text_segments, key=lambda x: int(x.start_index)\n",
" )\n",
" for j in entity1.text_anchor.text_segments:\n",
" textAnchorList.append(j)\n",
"\n",
" for j in entity2.text_anchor.text_segments:\n",
" textAnchorList.append(j)\n",
" textAnchorList = sorted(textAnchorList, key=lambda x: int(x.start_index))\n",
" mentionText = \"\"\n",
" for j in textAnchorList:\n",
" if (js.text[int(j.end_index) : int(j.end_index) + 1] == \"\\n\") or (\n",
" js.text[int(j.end_index) : int(j.end_index) + 1] == \" \"\n",
" ):\n",
" mentionText += js.text[int(j.start_index) : int(j.end_index) + 1]\n",
" else:\n",
" mentionText += js.text[int(j.start_index) : int(j.end_index)]\n",
" new_entity.mention_text = mentionText\n",
" text_anchor.content = mentionText\n",
" # Add all the text anchor present in entity1 & entity2\n",
" temp_text_anchor_list = []\n",
" for i in range(len(entity1.text_anchor.text_segments)):\n",
" temp_text_anchor_list.append(entity1.text_anchor.text_segments[i])\n",
" for i in range(len(entity2.text_anchor.text_segments)):\n",
" temp_text_anchor_list.append(entity2.text_anchor.text_segments[i])\n",
" text_anchor.text_segments = temp_text_anchor_list\n",
" new_entity.text_anchor = text_anchor\n",
"\n",
" entity1_coordinates_list = []\n",
" for i in entity1.page_anchor.page_refs[0].bounding_poly.normalized_vertices:\n",
" entity1_coordinates_list.append({\"x\": i.x, \"y\": i.y})\n",
" entity2_coordinates_list = []\n",
" for i in entity2.page_anchor.page_refs[0].bounding_poly.normalized_vertices:\n",
" entity2_coordinates_list.append({\"x\": i.x, \"y\": i.y})\n",
" entity1_coordinates_list = bbox_maker(entity1_coordinates_list)\n",
" entity2_coordinates_list = bbox_maker(entity2_coordinates_list)\n",
" min_x = min(entity1_coordinates_list[0], entity2_coordinates_list[0])\n",
" min_y = min(entity1_coordinates_list[1], entity2_coordinates_list[1])\n",
" max_x = max(entity1_coordinates_list[2], entity2_coordinates_list[2])\n",
" max_y = max(entity1_coordinates_list[3], entity2_coordinates_list[3])\n",
"\n",
" A = {\"x\": min_x, \"y\": min_y}\n",
" B = {\"x\": max_x, \"y\": min_y}\n",
" C = {\"x\": max_x, \"y\": max_y}\n",
" D = {\"x\": min_x, \"y\": max_y}\n",
" new_entity.page_anchor = entity1.page_anchor\n",
" new_entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices = [A, B, C, D]\n",
" return new_entity\n",
"\n",
"\n",
"def merge_address_lines(\n",
" list_of_address_entities: List[documentai.Document.Entity],\n",
" list_of_xy_coordinates: List[float],\n",
" list_of_page_numbers: List[int],\n",
" js: documentai.Document.Entity,\n",
") -> Tuple[documentai.Document.Entity, float]:\n",
" \"\"\"\n",
" This function is the collection of multiple functions which merges the bounding boxes, calculates the distance between entities.\n",
"\n",
" Parameters\n",
" ----------\n",
" list_of_address_entities : List[documentai.Document.Entity]\n",
" The array having the address entities which matches with the entity name provided by user.\n",
" list_of_xy_coordinates: List[float]\n",
" The array of x/y coordinates of the entities in list_of_address_entities.\n",
" list_of_page_numbers: List[int]\n",
" The array of page number of the entities in list_of_address_entities.\n",
" js: documentai.Document.Entity\n",
" The entities from the original input document.\n",
"\n",
" Returns\n",
" -------\n",
" Tuple([documentai.Document.Entity,float])\n",
" Returns the tuple with array of new merged entities and thier bounding boxes.\n",
" \"\"\"\n",
" # Copy of the text and object arrays\n",
" entities_copied = copy.deepcopy(list_of_address_entities)\n",
" entities_boxes_copied = copy.deepcopy(list_of_xy_coordinates)\n",
" entities_page_numbers_copied = copy.deepcopy(list_of_page_numbers)\n",
"\n",
" def merge_boxes(box1: List[float], box2: List[float]) -> List[float]:\n",
" \"\"\"\n",
" Generate two text boxes a larger one that covers them\n",
" Parameters\n",
" ----------\n",
" box1: List[float]\n",
" Bounding box of the first entity\n",
" box2: List[float]\n",
" Bounding box of the second entity\n",
" Returns\n",
" -------\n",
" List[float] :\n",
" Bounding boxes of the both the merged entities.\n",
" \"\"\"\n",
" return [\n",
" min(box1[0], box2[0]),\n",
" min(box1[1], box2[1]),\n",
" max(box1[2], box2[2]),\n",
" max(box1[3], box2[3]),\n",
" ]\n",
"\n",
" def calc_sim(text: Tuple[float], obj: Tuple[float]) -> float:\n",
" \"\"\"\n",
" Computer a Matrix similarity of distances of the first entity and the other entity .\n",
"\n",
" Parameters\n",
" ----------\n",
" text: Tuple(float)\n",
" Bounding box of the first entity.\n",
" obj: Tuple(float)\n",
" Bounding box of the other entity.\n",
" Returns\n",
" -------\n",
" float :\n",
" Returns the distance similarity between the text and the object.\n",
" \"\"\"\n",
" # text: ymin, xmin, ymax, xmax\n",
" # obj: ymin, xmin, ymax, xmax\n",
" text_ymin, text_xmin, text_ymax, text_xmax = text\n",
" obj_ymin, obj_xmin, obj_ymax, obj_xmax = obj\n",
"\n",
" x_dist = min(\n",
" abs(text_xmin - obj_xmin),\n",
" abs(text_xmin - obj_xmax),\n",
" abs(text_xmax - obj_xmin),\n",
" abs(text_xmax - obj_xmax),\n",
" )\n",
" y_dist = min(\n",
" abs(text_ymin - obj_ymin),\n",
" abs(text_ymin - obj_ymax),\n",
" abs(text_ymax - obj_ymin),\n",
" abs(text_ymax - obj_ymax),\n",
" )\n",
"\n",
" dist = x_dist + y_dist\n",
" return dist\n",
"\n",
" def merge_algo(\n",
" entities_copied: List[documentai.Document.Entity],\n",
" entities_boxes_copied: List[float],\n",
" ) -> Tuple[List[documentai.Document.Entity], List[float]]:\n",
" \"\"\"\n",
" Principal algorithm for merge text and call other helper functions..\n",
"\n",
" Parameters\n",
" ----------\n",
" entities_copied : List[documentai.Document.Entity]\n",
" The array having the merged entities .\n",
" entities_boxes_copied: List[[float]\n",
" The array having the coordinates of the entities in entities_copied array.\n",
" Returns\n",
" -------\n",
" Tuple[List[documentai.Document.Entity],List[float]]\n",
" Returns the tuple of the boolean value,merged entites and coordinates of the entities.\n",
" \"\"\"\n",
" for i, (entity1, entity_box_1, page_ent_1) in enumerate(\n",
" zip(entities_copied, entities_boxes_copied, entities_page_numbers_copied)\n",
" ):\n",
" for j, (entity2, entity_box_2, page_ent_2) in enumerate(\n",
" zip(\n",
" entities_copied, entities_boxes_copied, entities_page_numbers_copied\n",
" )\n",
" ):\n",
" if j <= i:\n",
" continue\n",
" # Create a new box if a distances is less than distance limit defined\n",
" if (\n",
" calc_sim(entity_box_1, entity_box_2) < dist_limit\n",
" and page_ent_1 == page_ent_2\n",
" ):\n",
" # print(calc_sim(entity_box_1, entity_box_2))\n",
" # Create a new box\n",
" new_box = merge_boxes(entity_box_1, entity_box_2)\n",
" # Create a new entity\n",
" new_entity = combine_two_entities(entity1, entity2, js)\n",
" entities_copied[i] = new_entity\n",
" del entities_copied[j]\n",
" del entities_page_numbers_copied[j]\n",
" entities_boxes_copied[i] = new_box\n",
" # delete previous boxes\n",
" del entities_boxes_copied[j]\n",
" # return a new enity and combined bounding box\n",
" return True, entities_copied, entities_boxes_copied\n",
"\n",
" return False, entities_copied, entities_boxes_copied\n",
"\n",
" need_to_merge = True\n",
"\n",
" # Merge full text\n",
" while need_to_merge:\n",
" need_to_merge, entities_copied, entities_boxes_copied = merge_algo(\n",
" entities_copied, entities_boxes_copied\n",
" )\n",
"\n",
" for entity in entities_copied:\n",
" entity.type = entity.type[:-5]\n",
" return entities_copied, entities_boxes_copied\n",
"\n",
"\n",
"file_name_list = [\n",
" i for i in list(file_names(INPUT_PATH)[1].values()) if i.endswith(\".json\")\n",
"]\n",
"\n",
"for file_index in tqdm(range(0, len(file_name_list))):\n",
" file_name = file_name_list[file_index]\n",
" print(\"\\nProcessing >>> \", file_name)\n",
" try:\n",
" document = documentai_json_proto_downloader(input_bucket_name, file_name)\n",
" for entity_name in entity_names:\n",
" document.entities = sorted(\n",
" document.entities,\n",
" key=lambda x: int(x.text_anchor.text_segments[0].start_index),\n",
" )\n",
" list_of_xy_coordinates = []\n",
" list_of_address_entities = []\n",
" list_of_page_numbers = []\n",
" for entity in document.entities:\n",
" if entity.type == entity_name:\n",
" print(\" Processing >>>>>>>>>>>>>>>> \", entity.type)\n",
" list_of_address_entities.append(entity)\n",
" entity_coordinates_list = []\n",
" for i in entity.page_anchor.page_refs[\n",
" 0\n",
" ].bounding_poly.normalized_vertices:\n",
" entity_coordinates_list.append({\"x\": i.x, \"y\": i.y})\n",
" entity_coordinates_list = bbox_maker(entity_coordinates_list)\n",
" x_min = entity_coordinates_list[0]\n",
" y_min = entity_coordinates_list[1]\n",
" x_max = entity_coordinates_list[2]\n",
" y_max = entity_coordinates_list[3]\n",
" list_of_xy_coordinates.append([y_min, x_min, y_max, x_max])\n",
" page = 0\n",
" if entity.page_anchor.page_refs[0].page:\n",
" page = int(entity.page_anchor.page_refs[0].page)\n",
" list_of_page_numbers.append(page)\n",
" new_entities, new_entities_xy = merge_address_lines(\n",
" list_of_address_entities,\n",
" list_of_xy_coordinates,\n",
" list_of_page_numbers,\n",
" document,\n",
" )\n",
" for entity in document.entities:\n",
" if entity.type != entity_name:\n",
" new_entities.append(entity)\n",
"\n",
" document.entities = new_entities\n",
"\n",
" except Exception as e:\n",
" print(\n",
" f\"[x] {input_bucket_name}/{file_name} || Error : {str(e)}\",\n",
" \"\\t !!! Please review manually\",\n",
" )\n",
" continue\n",
"\n",
" output_file_name = f\"{output_prefix_path}/{file_name.split('/')[-1]}\"\n",
" store_document_as_json(\n",
" documentai.Document.to_json(document), output_bucket_name, output_file_name\n",
" )\n",
" print(f\"[✓] {output_bucket_name}/{output_file_name}\")\n",
"\n",
"print(\"\\nCompleted\")"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-root-py",
"name": "workbench-notebooks.m113",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/workbench-notebooks:m113"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}