{
"cells": [
{
"cell_type": "markdown",
"id": "62584f2b-3301-459c-858c-d00ffdc876aa",
"metadata": {},
"source": [
"# Date Entities Annotation Tool\n"
]
},
{
"cell_type": "markdown",
"id": "bc6a80ac-d18f-44bf-9e7d-bd1dc872ae1e",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "b46902c4-ab4a-4896-8232-dfcea92ff9cf",
"metadata": {},
"source": [
"## Disclaimer"
]
},
{
"cell_type": "markdown",
"id": "20ef4270-c85b-4984-b42d-cd5e0e6652b4",
"metadata": {},
"source": [
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "87b95543-bdbe-4d7f-8a21-3e9e564eccd0",
"metadata": {},
"source": [
"## Objective"
]
},
{
"cell_type": "markdown",
"id": "00d8d3d6-f47a-47ee-a812-7d62e78735ca",
"metadata": {},
"source": [
"This tool helps you annotate or label date entities in a table as line items. The names of date entities are tagged based on the header of the line items table. You can modify the header text to suit your needs and annotate the date values as line items. The notebook script expects the input file containing the invoice parser json output. The dates gets annotated if they are present in the OCR , and the headers are also recognized by the OCR. The result is an output json file with the labeled date entities as line items and exported to a storage bucket path. These result json files can be imported into a processor to further check the annotations and can be used for training."
]
},
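{
"cell_type": "markdown",
"id": "a7c3e1f0-5b2d-4c8e-9f10-2d3e4f5a6b7c",
"metadata": {},
"source": [
"For illustration, a date value found under a mapped header (for example `From`) is written back to the document as a `line_item` parent entity with a child entity of the mapped type. The snippet below is a simplified, hypothetical sketch of such an entity in the output JSON; the text, indexes, coordinates and confidence are placeholders and depend on the actual document:\n",
"\n",
"```json\n",
"{\n",
"  \"type\": \"line_item\",\n",
"  \"mentionText\": \"01/02/2023\",\n",
"  \"properties\": [\n",
"    {\n",
"      \"type\": \"line_item/date_from\",\n",
"      \"mentionText\": \"01/02/2023\",\n",
"      \"confidence\": 0.95,\n",
"      \"textAnchor\": {\"textSegments\": [{\"startIndex\": \"1250\", \"endIndex\": \"1260\"}]},\n",
"      \"pageAnchor\": {\"pageRefs\": [{\"boundingPoly\": {\"normalizedVertices\": [{\"x\": 0.12, \"y\": 0.43}]}}]}\n",
"    }\n",
"  ]\n",
"}\n",
"```"
]
},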
{
"cell_type": "markdown",
"id": "9e2d7dca-e8ab-4ce1-a853-7180e11677af",
"metadata": {},
"source": [
"## Prerequisites"
]
},
{
"cell_type": "markdown",
"id": "1e60fd9b-3144-4d3d-8c7b-87057d7a7a55",
"metadata": {},
"source": [
"* Vertex AI Notebook\n",
"* Input Json Files\n",
"* GCS bucket for processing of the input documents and writing the output.\n"
]
},
{
"cell_type": "markdown",
"id": "5d699412-71b2-4e9e-9768-b075da12c1d8",
"metadata": {},
"source": [
"## Step by Step procedure"
]
},
{
"cell_type": "markdown",
"id": "63ba572c-9d3a-4568-aac4-6ebe05b1e0b2",
"metadata": {},
"source": [
"### 1.Importing required modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a270e1b-a272-49d8-8329-46a0a3823b6e",
"metadata": {},
"outputs": [],
"source": [
"# Download incubator-tools utilities module to present-working-directory\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e611aaa-cc70-4ae0-a8d2-8d5cc8efd720",
"metadata": {},
"outputs": [],
"source": [
"!pip install google-cloud-storage google-cloud-documentai tqdm -q"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e11b6b93-03a1-4431-b335-cd643b35182a",
"metadata": {},
"outputs": [],
"source": [
"import utilities\n",
"from google.cloud import storage\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"from pprint import pprint\n",
"import traceback\n",
"import json\n",
"from tqdm import tqdm"
]
},
{
"cell_type": "markdown",
"id": "ea27819e-5d79-4fdf-b9ad-dba242779788",
"metadata": {},
"source": [
"### 2.Input and Output Path\n",
"* **project_id**: This contains the project ID of the project.\n",
"* **gcs_input_path**: This contains the storage bucket path of the input files.\n",
"* **gcs_output_path**: This contains the storage bucket path of the output files.\n",
"* **headers_entities**: This contains the table's header as the key and the associated entity name as its value "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "113d0d54-8331-4776-91cd-8f431701ac7e",
"metadata": {},
"outputs": [],
"source": [
"# input details\n",
"project_id = \"xx-xx-xx\"\n",
"gcs_input_path = \"gs://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/\"\n",
"gcs_output_path = \"gs://xxxxxxxxxxxxxxxxxxxxxxxxxxxxx/\"\n",
"\n",
"headers_entities = {\n",
" \"From\": \"line_item/date_from\",\n",
" \"To\": \"line_item/date_to\",\n",
" \"DESCRIPTION\": \"line_item/purchase_date\",\n",
" \"Date\": \"line_item/purchase_date\",\n",
"}\n",
"unique_entities = []\n",
"for value in headers_entities.values():\n",
" if value not in unique_entities:\n",
" unique_entities.append(value)"
]
},
{
"cell_type": "markdown",
"id": "da2a4d5f-548e-48ae-a9df-c7250672e1c8",
"metadata": {},
"source": [
"For instance, in the header_entities, you should supply the table's header as the key and the associated entity name as its value, as illustrated below. This ensures that all the data under that specific header will be labeled with the relevant entity name.\n",
"\n",
"headers_entities={'From':'line_item/date_from'}"
]
},
{
"cell_type": "markdown",
"id": "3decc8cb-d7b9-45bc-a78b-7933aedb64fe",
"metadata": {},
"source": [
"<img src=\"./images/image_1.png\" width=800 height=400></img>"
]
},
{
"cell_type": "markdown",
"id": "23195600-a29c-4d30-8b38-e0ad71b96a73",
"metadata": {},
"source": [
"### 3.Run the Code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b98f8950-99f7-4c40-b146-8923bf1e836c",
"metadata": {},
"outputs": [],
"source": [
"def get_token(json_dict: object, page: str, text_anchors_check: list):\n",
" \"\"\"THIS FUNCTION USED LOADED JSON, PAGE NUMBER AND TEXT ANCHORS AS INPUT AND GIVES THE X AND Y COORDINATES\n",
"\n",
" Args:\n",
" json_dict (object) : The document object containing entities.\n",
" page (str) : The page number as a string where these entities are found.\n",
" text_anchors_check (list) : The list contains text anchors information which need to be checked.\n",
"\n",
" Returns:\n",
" A tuple with three elements : A dictionary with keys 'min_x', 'min_y', 'max_x', and 'max_y' ; list containing textanchors ; confidence\n",
"\n",
" \"\"\"\n",
"\n",
" min_x = \"\"\n",
" for token in json_dict.pages[page].tokens:\n",
" if token.layout.text_anchor.text_segments == text_anchors_check:\n",
" normalized_vertices = token.layout.bounding_poly\n",
" min_x = min(vertex.x for vertex in normalized_vertices.normalized_vertices)\n",
" min_y = min(vertex.y for vertex in normalized_vertices.normalized_vertices)\n",
" max_x = max(vertex.x for vertex in normalized_vertices.normalized_vertices)\n",
" max_y = max(vertex.y for vertex in normalized_vertices.normalized_vertices)\n",
" text_anc_token = token.layout.text_anchor.text_segments\n",
" confidence = token.layout.confidence\n",
" if min_x == \"\":\n",
" for token in json_dict.pages[page].tokens:\n",
" if not token.layout.text_anchor.text_segments[0].start_index:\n",
" token.layout.text_anchor.text_segments[0].start_index = \"0\"\n",
" if (\n",
" abs(\n",
" int(token.layout.text_anchor.text_segments[0].start_index)\n",
" - int(text_anchors_check[0][\"startIndex\"])\n",
" )\n",
" <= 2\n",
" and abs(\n",
" int(token.layout.text_anchor.text_segments[0].end_index)\n",
" - int(text_anchors_check[0][\"endIndex\"])\n",
" )\n",
" <= 2\n",
" ):\n",
" normalized_vertices = token.layout.bounding_poly\n",
" min_x = min(\n",
" vertex.x for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" min_y = min(\n",
" vertex.y for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" max_x = max(\n",
" vertex.x for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" max_y = max(\n",
" vertex.y for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" text_anc_token = token.layout.text_anchor.text_segments\n",
" confidence = token.layout.confidence\n",
" return (\n",
" {\"min_x\": min_x, \"min_y\": min_y, \"max_x\": max_x, \"max_y\": max_y},\n",
" text_anc_token,\n",
" confidence,\n",
" )\n",
"\n",
"\n",
"def get_page_wise_entities(json_dict: object) -> dict:\n",
" \"\"\"\n",
" THIS FUNCTION GIVES THE ENTITIES SPEPERATED FROM EACH PAGE IN DICTIONARY FORMAT\n",
"\n",
" Args:\n",
" json_dict (object) : This contains loaded document object.\n",
"\n",
" Returns:\n",
" Returns a dictionary having a structure of {page: [entities]}.\n",
" \"\"\"\n",
"\n",
" entities_page = {}\n",
" for entity in json_dict.entities:\n",
" page = \"0\"\n",
" try:\n",
" if not entity.page_anchor.page_refs[0].page:\n",
" page = entity.page_anchor.page_refs[0].page\n",
"\n",
" if page in entities_page.keys():\n",
" entities_page[page].append(entity)\n",
" else:\n",
" entities_page[page] = [entity]\n",
" except:\n",
" pass\n",
" return entities_page\n",
"\n",
"\n",
"def get_min_max_y_lineitem(json_dict: object, page: int, ent2: list):\n",
" \"\"\"\n",
" Extracts minimum and maximum Y-coordinates for line items from a JSON dictionary.\n",
"\n",
" Args:\n",
" json_dict (object): Documentobject containing the JSON structure.\n",
" page (int): Integer representing the page number.\n",
" ent2 (list): List of entities to be considered.\n",
"\n",
" Returns:\n",
" min_y_line (float): Minimum Y-coordinate for the line items.\n",
" max_y_line (float): Maximum Y-coordinate for the line items.\n",
" \"\"\"\n",
"\n",
" min_y_considered = 0\n",
" max_y_considered = 0\n",
"\n",
" line_items_all = []\n",
" for entity in ent2:\n",
" if entity.properties and entity.type == \"line_item\":\n",
" line_items_all.append(entity)\n",
" if line_items_all != []:\n",
" if len(line_items_all) > 1 or len(line_items_all[0].properties) > 2:\n",
" min_y_line = 1\n",
" max_y_line = 0\n",
" min_y_child = 1\n",
" min_y_child_Mt = \"\"\n",
" entity_mentiontext = \"\"\n",
" for line_item in line_items_all:\n",
" norm_ver = line_item.page_anchor.page_refs[\n",
" 0\n",
" ].bounding_poly.normalized_vertices\n",
" for ver in norm_ver:\n",
" min_y_temp = min(vertex.y for vertex in norm_ver)\n",
" max_y_temp = max(vertex.y for vertex in norm_ver)\n",
" if min_y_line > min_y_temp:\n",
" min_y_line = min_y_temp\n",
" entity_mentiontext = line_item.mention_text\n",
" for child_ent in line_item.properties:\n",
" norm_ver_child = child_ent.page_anchor.page_refs[\n",
" 0\n",
" ].bounding_poly.normalized_vertices\n",
" for ver_child in norm_ver_child:\n",
" min_y_child_temp = min(\n",
" vertex.y for vertex in norm_ver_child\n",
" )\n",
" if min_y_child > min_y_child_temp:\n",
" min_y_child = min_y_child_temp\n",
" try:\n",
" min_y_child_Mt = child_ent.mention_text\n",
" except:\n",
" pass\n",
" if max_y_line < max_y_temp:\n",
" max_y_line = max_y_temp\n",
" else:\n",
" pass\n",
"\n",
" return min_y_line, max_y_line\n",
"\n",
"\n",
"def get_date_entities(json_dict: object, headers_entities: dict) -> object:\n",
" \"\"\"\n",
" This function retrieves the dates from the document present in the line items.\n",
"\n",
" Args:\n",
" json_dict (object): Documentobject containing the JSON structure.\n",
" headers_entities (dict): Dictionary containing the headers to be annotated in the document.\n",
"\n",
" Returns:\n",
" Returns a document object after being annotated in the document.\n",
" \"\"\"\n",
"\n",
" def get_dates(json_dict):\n",
" import re\n",
"\n",
" date_pattern = r\"\\b(?:\\d{1,2}\\/\\d{1,2}\\/\\d{2,4}|\\d{1,2}-\\d{1,2}-\\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \\d{1,2} \\d{4})\\b\"\n",
" matches = list(re.finditer(date_pattern, json_dict.text))\n",
" match_dates_dict = {}\n",
" n = 0\n",
" for match in matches:\n",
" match_dates_dict[n] = {\n",
" \"match\": match.group(),\n",
" \"textanc\": [\n",
" {\"startIndex\": str(match.start()), \"endIndex\": str(match.end())}\n",
" ],\n",
" }\n",
" n = n + 1\n",
" return match_dates_dict\n",
"\n",
" def line_item_dates(\n",
" json_dict, page, match_dates_dict, min_y_lineitem, max_y_lineitem\n",
" ):\n",
" final_dates_dict = {}\n",
" m = 0\n",
"\n",
" for i, v in match_dates_dict.items():\n",
" try:\n",
" ver, text_anc_token, confidence = get_token(\n",
" json_dict, page, v[\"textanc\"]\n",
" )\n",
"\n",
" if (\n",
" ver[\"min_y\"] >= min_y_lineitem - 0.02\n",
" and ver[\"max_y\"] <= max_y_lineitem + 0.02\n",
" ):\n",
" match_dates_dict[i][\"ver\"] = ver\n",
" final_dates_dict[m] = {\n",
" \"date\": v[\"match\"],\n",
" \"textanc\": text_anc_token,\n",
" \"normalizedver\": ver,\n",
" \"confidence\": confidence,\n",
" }\n",
" m = m + 1\n",
" except:\n",
" continue\n",
" return final_dates_dict\n",
"\n",
" def get_headers_dict(\n",
" json_dict, page, headers_entities, min_y_lineitem, max_y_lineitem\n",
" ):\n",
" header_exist_dict = {}\n",
" header_btw = {}\n",
" list_headers = list(headers_entities.keys())\n",
" for token in json_dict.pages[page].tokens:\n",
" normalized_vertices = token.layout.bounding_poly\n",
" try:\n",
" min_x = min(\n",
" vertex.x for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" min_y = min(\n",
" vertex.y for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" max_x = max(\n",
" vertex.x for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" max_y = max(\n",
" vertex.y for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" if not token.layout.text_anchor.text_segments[0].start_index:\n",
" token.layout.text_anchor.text_segments[0].start_index = \"0\"\n",
" start_1 = token.layout.text_anchor.text_segments[0].start_index\n",
" end_1 = token.layout.text_anchor.text_segments[0].end_index\n",
" if min_y <= min_y_lineitem and abs(min_y - min_y_lineitem) <= 0.1:\n",
" text_1 = (\n",
" json_dict.text[int(start_1) : int(end_1)]\n",
" .replace(\"\\n\", \"\")\n",
" .replace(\" \", \"\")\n",
" )\n",
" temp_dict = {}\n",
" if text_1 in list_headers:\n",
" temp_dic_2 = {}\n",
" temp_dic_2[text_1] = {\n",
" \"pageanc\": {\n",
" \"min_x\": min_x,\n",
" \"min_y\": min_y,\n",
" \"max_x\": max_x,\n",
" \"max_y\": max_y,\n",
" },\n",
" \"text_anc\": {\"startIndex\": start_1, \"endIndex\": end_1},\n",
" }\n",
" header_exist_dict[b] = temp_dic_2\n",
" b += 1\n",
"\n",
" except:\n",
" pass\n",
" b = 0\n",
" for token in json_dict.pages[page].tokens:\n",
" normalized_vertices = token.layout.bounding_poly\n",
" try:\n",
" min_x = min(\n",
" vertex.x for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" min_y = min(\n",
" vertex.y for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" max_x = max(\n",
" vertex.x for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" max_y = max(\n",
" vertex.y for vertex in normalized_vertices.normalized_vertices\n",
" )\n",
" if not token.layout.text_anchor.text_segments[0].start_index:\n",
" token.layout.text_anchor.text_segments[0].start_index = \"0\"\n",
" start_1 = token.layout.text_anchor.text_segments[0].start_index\n",
" end_1 = token.layout.text_anchor.text_segments[0].end_index\n",
" if min_y > min_y_lineitem and max_y < max_y_lineitem:\n",
" text_1 = (\n",
" json_dict.text[int(start_1) : int(end_1)]\n",
" .replace(\"\\n\", \"\")\n",
" .replace(\" \", \"\")\n",
" )\n",
" temp_dict = {}\n",
" if text_1 in list_headers:\n",
" temp_dic_2 = {}\n",
" temp_dic_2[text_1] = {\n",
" \"pageanc\": {\n",
" \"min_x\": min_x,\n",
" \"min_y\": min_y,\n",
" \"max_x\": max_x,\n",
" \"max_y\": max_y,\n",
" },\n",
" \"text_anc\": {\"startIndex\": start_1, \"endIndex\": end_1},\n",
" }\n",
" header_exist_dict[b] = temp_dic_2\n",
" b += 1\n",
"\n",
" except:\n",
" pass\n",
" return header_exist_dict\n",
"\n",
" def create_entities(final_match, page, json_dict):\n",
" for i1, v1 in final_match.items():\n",
" if (\n",
" v1[\"considered_ent\"] != \"\"\n",
" and v1[\"considered_ent\"] != {}\n",
" and len(v1[\"considered_ent\"]) != 0\n",
" ):\n",
" parent_entity = documentai.Document.Entity()\n",
" new_entity = documentai.Document.Entity()\n",
" new_entity.confidence = v1[\"date_ent\"][\"confidence\"]\n",
" new_entity.mention_text = v1[\"date_ent\"][\"date\"]\n",
" min_x = v1[\"date_ent\"][\"normalizedver\"][\"min_x\"]\n",
" min_y = v1[\"date_ent\"][\"normalizedver\"][\"min_y\"]\n",
" max_x = v1[\"date_ent\"][\"normalizedver\"][\"max_x\"]\n",
" max_y = v1[\"date_ent\"][\"normalizedver\"][\"max_y\"]\n",
" ver_1 = [\n",
" {\"x\": min_x, \"y\": min_y},\n",
" {\"x\": min_x, \"y\": max_y},\n",
" {\"x\": max_x, \"y\": min_y},\n",
" {\"x\": max_x, \"y\": max_y},\n",
" ]\n",
" page_ref_1 = new_entity.page_anchor.PageRef()\n",
" page_ref_1.bounding_poly.normalized_vertices.extend(ver_1)\n",
" new_entity.page_anchor.page_refs.append(page_ref_1)\n",
" new_entity.text_anchor.text_segments = v1[\"date_ent\"][\"textanc\"]\n",
" new_entity.type = headers_entities[list(v1[\"considered_ent\"].keys())[0]]\n",
" parent_entity.properties.append(new_entity)\n",
" parent_entity.confidence = new_entity.confidence\n",
" parent_entity.mention_text = new_entity.mention_text\n",
" parent_entity.page_anchor = new_entity.page_anchor\n",
" parent_entity.text_anchor = new_entity.text_anchor\n",
" parent_entity.type = headers_entities[\n",
" list(v1[\"considered_ent\"].keys())[0]\n",
" ].split(\"/\")[0]\n",
" json_dict.entities.append(parent_entity)\n",
" return json_dict\n",
"\n",
" def final_match(dates_1, headers_1):\n",
" temp_dic = {}\n",
" final_match = {}\n",
" q = 0\n",
" duplicates = []\n",
" for i, j in dates_1.items():\n",
" diff = 1\n",
" considered_ent = \"\"\n",
" considered_ent_list = []\n",
" if j not in duplicates:\n",
" for k, l in headers_1.items():\n",
" for k2, l2 in l.items():\n",
" if (\n",
" abs(j[\"normalizedver\"][\"min_x\"] - l2[\"pageanc\"][\"min_x\"])\n",
" <= 0.05\n",
" and abs(\n",
" j[\"normalizedver\"][\"max_x\"] - l2[\"pageanc\"][\"max_x\"]\n",
" )\n",
" <= 0.1\n",
" ):\n",
" if (\n",
" j[\"normalizedver\"][\"max_y\"] - l2[\"pageanc\"][\"min_y\"]\n",
" >= 0\n",
" ):\n",
" if (\n",
" diff\n",
" > j[\"normalizedver\"][\"max_y\"]\n",
" - l2[\"pageanc\"][\"min_y\"]\n",
" ):\n",
" temp_dic = {}\n",
" diff = (\n",
" j[\"normalizedver\"][\"max_y\"]\n",
" - l2[\"pageanc\"][\"min_y\"]\n",
" )\n",
" temp_dic[k2] = l2\n",
" duplicates.append(j)\n",
" considered_ent = temp_dic\n",
" temp_dic = {}\n",
" final_match[q] = {\"date_ent\": j, \"considered_ent\": considered_ent}\n",
" q += 1\n",
"\n",
" return final_match\n",
"\n",
" date_entities_final = []\n",
"\n",
" try:\n",
" page_wise_ent = get_page_wise_entities(json_dict)\n",
" except Exception as e:\n",
" print(\"page wise entities issue--> \", e)\n",
" return json_dict\n",
"\n",
" for page, page_entities in page_wise_ent.items():\n",
" try:\n",
" min_y_lineitem, max_y_lineitem = get_min_max_y_lineitem(\n",
" json_dict, int(page), page_entities\n",
" )\n",
" except Exception as e:\n",
" print(\"COULDNT GET MIN Y AND MAX Y FOR LINE ITEMS--> \", e)\n",
" continue\n",
"\n",
" try:\n",
" match_dates_dict = get_dates(json_dict)\n",
" except Exception as e:\n",
" print(\"COULDNT FIND DATES IN THE TEXT--> \", e)\n",
" continue\n",
"\n",
" try:\n",
" final_dates_dict = line_item_dates(\n",
" json_dict,\n",
" int(page),\n",
" match_dates_dict,\n",
" min_y_lineitem,\n",
" max_y_lineitem,\n",
" )\n",
" except Exception as e:\n",
" print(\"NO DATES IN LINE ITEM RANGE--> \", e)\n",
" continue\n",
"\n",
" try:\n",
" header_exist_dict = get_headers_dict(\n",
" json_dict,\n",
" int(page),\n",
" headers_entities,\n",
" min_y_lineitem,\n",
" max_y_lineitem,\n",
" )\n",
" except Exception as e:\n",
" print(\"NO HEADERS FOUND MATCHING--> \", e)\n",
" continue\n",
"\n",
" try:\n",
" final_match = final_match(final_dates_dict, header_exist_dict)\n",
" except Exception as e:\n",
" print(\"Match not found-->\", e)\n",
" continue\n",
"\n",
" try:\n",
" date_entities = create_entities(final_match, page, json_dict)\n",
" except Exception as e:\n",
" print(\"COULDNT CREATE ENTITIES--> \", e)\n",
"\n",
" return json_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4d6093e-49b1-465c-b912-5e4948bb59e8",
"metadata": {},
"outputs": [],
"source": [
"def load_json_from_gcs(bucket_name, blob_name):\n",
" \"\"\"Load a JSON file from a Google Cloud Storage bucket.\"\"\"\n",
" storage_client = storage.Client()\n",
" bucket = storage_client.bucket(bucket_name)\n",
" blob = bucket.blob(blob_name)\n",
"\n",
" # Download the contents of the blob as a string and then parse it as JSON\n",
" json_data = json.loads(blob.download_as_text())\n",
" return json_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89262441-20da-435b-91b4-ea507de40a45",
"metadata": {},
"outputs": [],
"source": [
"file_names_list, file_dict = utilities.file_names(gcs_input_path)\n",
"for filename, filepath in tqdm(file_dict.items(), desc=\"Progress\"):\n",
" input_bucket_name = gcs_input_path.split(\"/\")[2]\n",
" print(input_bucket_name)\n",
" if \".json\" in filepath:\n",
" filepath = \"gs://\" + input_bucket_name + \"/\" + filepath\n",
" print(filepath)\n",
" # json_dict = utilities.load_json(filepath)\n",
" blob_name = \"/\".join(filepath.split(\"/\")[3:])\n",
" json_dict = load_json_from_gcs(input_bucket_name, blob_name)\n",
" json_dict = documentai.Document.from_json(json.dumps(json_dict))\n",
" json_updated = get_date_entities(json_dict, headers_entities)\n",
"\n",
" output_bucket_name = gcs_output_path.split(\"/\")[2]\n",
" output_path_within_bucket = \"/\".join(gcs_output_path.split(\"/\")[3:]) + filename\n",
" utilities.store_document_as_json(\n",
" documentai.Document.to_json(json_updated),\n",
" output_bucket_name,\n",
" output_path_within_bucket,\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "e1fc626f-38ba-4332-b1b4-e038fa258e32",
"metadata": {},
"source": [
"#### **Note**: After running this **date entities code** we need to run the **line item improver code** inorder to combine this date entities with other line items in the respective row.\n"
]
},
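{
"cell_type": "markdown",
"id": "c9d8e7f6-1a2b-4c3d-8e9f-0a1b2c3d4e5f",
"metadata": {},
"source": [
"Before running the line item improver, you can optionally spot-check the annotations in memory. The cell below is a minimal sketch that assumes the processing loop above has just run, so `json_updated` still holds the last annotated document; it prints the line-item date entities of the mapped types."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0e1f2a3-4b5c-4d6e-9f80-1a2b3c4d5e6f",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: list the date entities present in the last processed\n",
"# document (assumes the processing loop above has already run).\n",
"for entity in json_updated.entities:\n",
"    if entity.type_ == \"line_item\":\n",
"        for child in entity.properties:\n",
"            if child.type_ in unique_entities:\n",
"                print(child.type_, \"->\", child.mention_text)"
]
},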
{
"cell_type": "markdown",
"id": "d08142e9-1fe4-462c-88b9-5931473c59a6",
"metadata": {},
"source": [
"### 4.Output after running line item improver code\n"
]
},
{
"cell_type": "markdown",
"id": "9b94c4a1-94fc-4ed8-b84a-50d050e29a60",
"metadata": {},
"source": [
"<img src=\"./images/output_sample_1_2.png\" width=800 height=400></img>\n",
"<img src=\"./images/output_sample_1_1.png\" width=800 height=400></img>"
]
},
{
"cell_type": "markdown",
"id": "ffd22cdf-ab6d-47e7-ac31-205255602d3c",
"metadata": {},
"source": [
"### 5.Edge cases"
]
},
{
"cell_type": "markdown",
"id": "17db89a3-3de5-4f3d-9b45-50c0e8f63143",
"metadata": {},
"source": [
"The line item merging may not work as expected for some documents due to the layout. This is illustrated in the screenshot below.\n"
]
},
{
"cell_type": "markdown",
"id": "be29fd0c-1754-4cca-bd31-bdcfd10c1c5f",
"metadata": {},
"source": [
"<img src=\"./images/output_sample_2_2.png\" width=800 height=400></img>\n",
"<img src=\"./images/output_sample_2_1.png\" width=800 height=400></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m104",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m104"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}