{
"cells": [
{
"cell_type": "markdown",
"id": "d9ad8788-654c-4e38-9d6d-42158d95c6ee",
"metadata": {},
"source": [
"# Line Item Improver Crosspage"
]
},
{
"cell_type": "markdown",
"id": "08400e10-6639-43a6-8e2a-0b4f28ba08e2",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "8cc2d117-9e2a-4ebe-8aef-94ad4f2b46b1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "94f1d1c0-823d-46d4-9a4a-e5ffa56c6b55",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This tool uses parsed json files and merges the line items spanning across 2 pages which are supposed to be under a single line item and updates the json."
]
},
{
"cell_type": "markdown",
"id": "ffe00a75-567e-4f33-b4b4-aa72f9f4fe50",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Jupyter Platform to run Python code\n",
"* Parsed json files in GCS Folder\n",
"* Output GCS folder to upload the updated json files\n"
]
},
{
"cell_type": "markdown",
"id": "9b33b7f0-baf8-421c-b476-2d25b1ee9b02",
"metadata": {},
"source": [
"## Step by Step Procedure"
]
},
{
"cell_type": "markdown",
"id": "eadfe192-5691-4053-8670-32ef3e5bea8c",
"metadata": {},
"source": [
"### 1. Import Modules/Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea066b24-5e3b-4c28-a56f-263fc6e9aa67",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d8e2a54-e250-4924-8f58-9a4642e6ddf3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install google-cloud-documentai google-cloud-storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6495b161-ac1b-40b2-a668-fe835da2950a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"from pathlib import Path\n",
"from tqdm import tqdm\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"from typing import Any, Dict, List, Tuple, Set\n",
"\n",
"from utilities import file_names, store_document_as_json"
]
},
{
"cell_type": "markdown",
"id": "1709ea5e-bf14-44ab-9963-cda14b092dc2",
"metadata": {},
"source": [
"### 2. Input Details"
]
},
{
"cell_type": "markdown",
"id": "7ce030a1-663b-445e-a38e-24dc415942c2",
"metadata": {},
"source": [
"* **gcs_input_path**: Provide the gcs path of the parent folder where the sub-folders contain input files. Please follow the folder structure described earlier.\n",
"* **project_id**: Project ID/Number of the Project.\n",
"* **line_item_name**: Parent entity name of the document which needs to be get merge.\n",
"* **gcs_output_path**: Provide gcs path where the output json files have to be saved"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36eb0a45-7e29-41b4-a9e0-44dab8f55c17",
"metadata": {},
"outputs": [],
"source": [
"gcs_input_path = \"gs://<<bucket_name>>/<<subfolder_path>>/\" # Parsed json files path , end '/' is mandatory\n",
"project_id = \"project_id\" # project ID\n",
"line_item_name = \"parent_entity_name\" # Name of the line item entity (parent entity) to be merged as per processor schema\n",
"gcs_output_path = \"gs://<<bucket_name>>/<<subfolder_path>>/\" # output path where the updated jsons to be saved, end '/' is mandatory"
]
},
{
"cell_type": "markdown",
"id": "b4a4b1ed-2e86-4a8e-b00a-ebf6e41921ef",
"metadata": {},
"source": [
"### 3. Run the required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b2afdc0-3873-4db9-a7b7-e64cf0dce00b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def get_page_wise_sorted_line_items_and_schema(\n",
" json_dict: Dict[str, Any]\n",
") -> Tuple[Dict[str, List[Dict[str, Any]]], List[str]]:\n",
" \"\"\"\n",
" Extracts and sorts line items by page and gathers a schema of unique property types.\n",
"\n",
" Args:\n",
" - json_dict (Dict[str, Any]): The JSON dictionary containing entity data.\n",
"\n",
" Returns:\n",
" - Tuple[Dict[str, List[Dict[str, Any]]], List[str]]:\n",
" - A dictionary where keys are page numbers and values are lists of sorted line items for that page.\n",
" - A sorted list of unique property types across all line items.\n",
" \"\"\"\n",
" line_items_page = {}\n",
" for entity in json_dict[\"entities\"]:\n",
" page = \"0\"\n",
" try:\n",
" if \"page\" in entity[\"pageAnchor\"][\"pageRefs\"][0].keys():\n",
" page = entity[\"pageAnchor\"][\"pageRefs\"][0][\"page\"]\n",
"\n",
" if entity[\"type\"] == line_item_name:\n",
" if \"properties\" in entity.keys() and entity[\"properties\"]:\n",
" if page in line_items_page:\n",
" line_items_page[page].append(entity)\n",
" else:\n",
" line_items_page[page] = [entity]\n",
" except Exception:\n",
" pass\n",
"\n",
" sorted_line_items = {}\n",
" schema_types = set() # Collect all unique types across line items\n",
" for page, line_items in line_items_page.items():\n",
" sorted_line_items[page] = sort_line_items_y(\n",
" line_items\n",
" ) # Sort line items by y-coordinate\n",
" for item in line_items:\n",
" for property_item in item[\"properties\"]:\n",
" schema_types.add(\n",
" property_item.get(\"type\")\n",
" ) # Collect unique property types\n",
"\n",
" return sorted_line_items, sorted(schema_types)\n",
"\n",
"\n",
"def sort_line_items_y(line_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:\n",
" \"\"\"\n",
" Sorts line items by the minimum y-coordinate of their bounding polygons.\n",
"\n",
" Args:\n",
" - line_items (List[Dict[str, Any]]): A list of line items to be sorted.\n",
"\n",
" Returns:\n",
" - List[Dict[str, Any]]: A sorted list of line items by their y-coordinate.\n",
" \"\"\"\n",
" return sorted(\n",
" line_items,\n",
" key=lambda item: min(\n",
" vertex[\"y\"]\n",
" for vertex in item[\"pageAnchor\"][\"pageRefs\"][0][\"boundingPoly\"][\n",
" \"normalizedVertices\"\n",
" ]\n",
" ),\n",
" )\n",
"\n",
"\n",
"def merge_line_items_cross_page(\n",
" line_items_cur_page: List[Dict[str, Any]],\n",
" line_items_next_page: List[Dict[str, Any]],\n",
" li_schema: List[str],\n",
") -> Dict[str, Any]:\n",
" \"\"\"\n",
" Merges line items across pages if they belong to the same logical entity.\n",
"\n",
" Args:\n",
" - line_items_cur_page (List[Dict[str, Any]]): List of line items from the current page.\n",
" - line_items_next_page (List[Dict[str, Any]]): List of line items from the next page.\n",
" - li_schema (List[str]): List of unique property types in the schema.\n",
"\n",
" Returns:\n",
" - Dict[str, Any]: The merged line item entity, or None if no merge is required.\n",
" \"\"\"\n",
" last_li_cur_page = line_items_cur_page[-1]\n",
" first_li_next_page = line_items_next_page[0]\n",
"\n",
" # Check for common entities between last and first line items\n",
" value_counts = {li_type: 0 for li_type in li_schema}\n",
" for child_i, child_j in zip(\n",
" last_li_cur_page[\"properties\"], first_li_next_page[\"properties\"]\n",
" ):\n",
" value_counts[child_i[\"type\"]] += 1\n",
" value_counts[child_j[\"type\"]] += 1\n",
"\n",
" for count in value_counts.values():\n",
" if count > 1:\n",
" return None # No merge required\n",
"\n",
" # Collect missing child entities from the schema\n",
" missing_child_entities = []\n",
" for li_type in li_schema:\n",
" found = False\n",
" for child in last_li_cur_page[\"properties\"]:\n",
" if li_type == child[\"type\"]:\n",
" found = True\n",
" break\n",
" if not found:\n",
" missing_child_entities.append(li_type)\n",
"\n",
" # Merge the line items\n",
" merged_entity = dict(last_li_cur_page)\n",
" for li_type in missing_child_entities:\n",
" for child in first_li_next_page[\"properties\"]:\n",
" if li_type == child[\"type\"]:\n",
" merged_entity[\"properties\"].append(child)\n",
" break\n",
"\n",
" # Update bounding polygon, text anchors, and mention text\n",
" merged_entity[\"pageAnchor\"][\"pageRefs\"].extend(\n",
" first_li_next_page[\"pageAnchor\"][\"pageRefs\"]\n",
" )\n",
" merged_entity[\"textAnchor\"][\"textSegments\"].extend(\n",
" first_li_next_page[\"textAnchor\"][\"textSegments\"]\n",
" )\n",
" merged_entity[\"textAnchor\"][\"content\"] += (\n",
" \"\\n\" + first_li_next_page[\"textAnchor\"][\"content\"]\n",
" )\n",
" merged_entity[\"mentionText\"] += \" \" + first_li_next_page[\"mentionText\"]\n",
"\n",
" return merged_entity"
]
},
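{
"cell_type": "markdown",
"id": "3f0c2b1a-9d4e-4a5e-9c1d-0a1b2c3d4e5f",
"metadata": {},
"source": [
"The next cell is optional. It is a minimal sketch built on two synthetic, hand-made entity dictionaries (illustrative values only, not a real Document AI output) to show when `merge_line_items_cross_page` combines the last line item of one page with the first line item of the next page: a merge happens only when their child entity types do not overlap."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e2d1c0b-5a4f-4e3d-8b2a-1f2e3d4c5b6a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Minimal synthetic example (illustrative values only): the last line item of page 0\n",
"# is missing its amount, which shows up as the first line item of page 1.\n",
"demo_last_li = {\n",
"    \"type\": \"line_item\",\n",
"    \"mentionText\": \"Widget A 2\",\n",
"    \"properties\": [\n",
"        {\"type\": \"line_item/description\", \"mentionText\": \"Widget A\"},\n",
"        {\"type\": \"line_item/quantity\", \"mentionText\": \"2\"},\n",
"    ],\n",
"    \"pageAnchor\": {\n",
"        \"pageRefs\": [\n",
"            {\n",
"                \"page\": \"0\",\n",
"                \"boundingPoly\": {\"normalizedVertices\": [{\"x\": 0.1, \"y\": 0.9}]},\n",
"            }\n",
"        ]\n",
"    },\n",
"    \"textAnchor\": {\n",
"        \"textSegments\": [{\"startIndex\": \"0\", \"endIndex\": \"10\"}],\n",
"        \"content\": \"Widget A 2\",\n",
"    },\n",
"}\n",
"demo_first_li = {\n",
"    \"type\": \"line_item\",\n",
"    \"mentionText\": \"19.99\",\n",
"    \"properties\": [{\"type\": \"line_item/amount\", \"mentionText\": \"19.99\"}],\n",
"    \"pageAnchor\": {\n",
"        \"pageRefs\": [\n",
"            {\n",
"                \"page\": \"1\",\n",
"                \"boundingPoly\": {\"normalizedVertices\": [{\"x\": 0.1, \"y\": 0.05}]},\n",
"            }\n",
"        ]\n",
"    },\n",
"    \"textAnchor\": {\n",
"        \"textSegments\": [{\"startIndex\": \"12\", \"endIndex\": \"17\"}],\n",
"        \"content\": \"19.99\",\n",
"    },\n",
"}\n",
"demo_schema = [\"line_item/amount\", \"line_item/description\", \"line_item/quantity\"]\n",
"merged = merge_line_items_cross_page([demo_last_li], [demo_first_li], demo_schema)\n",
"# Expected: the merged line item carries all three child types\n",
"print([prop[\"type\"] for prop in merged[\"properties\"]])"
]
},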
{
"cell_type": "markdown",
"id": "ac1d85cf-7529-469b-be7e-3e00a778af87",
"metadata": {},
"source": [
"### 4. Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49dab42f-6c98-4aac-b804-cd3ec81127f8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" file_names_list, file_dict = file_names(gcs_input_path)\n",
" storage_client = storage.Client()\n",
" source_bucket = storage_client.bucket(gcs_input_path.split(\"/\")[2])\n",
"\n",
" for filename, filepath in tqdm(file_dict.items(), desc=\"Progress\"):\n",
" input_bucket_name = gcs_input_path.split(\"/\")[2]\n",
" if \".json\" in filepath:\n",
" output_bucket_name = gcs_output_path.split(\"/\")[2]\n",
" print(filename)\n",
" json_dict = json.loads(\n",
" source_bucket.blob(filepath).download_as_string().decode(\"utf-8\")\n",
" )\n",
"\n",
" # Flag to track whether or not to update the json_dict after merging, if no line items are merged, this flag will remain False\n",
" is_entity_updated = False\n",
"\n",
" # get lineitems by pages\n",
" (\n",
" line_items_pages,\n",
" line_items_schema,\n",
" ) = get_page_wise_sorted_line_items_and_schema(json_dict)\n",
" print(\"Schema: \", line_items_schema)\n",
"\n",
" # Merging the cross page line items\n",
" for i in range(len(line_items_pages)):\n",
" try:\n",
" cur_page_li = line_items_pages[str(i)]\n",
" next_page_li = line_items_pages[str(i + 1)]\n",
" merged_li = merge_line_items_cross_page(\n",
" cur_page_li, next_page_li, line_items_schema\n",
" )\n",
"\n",
" if merged_li is not None:\n",
" is_entity_updated = True\n",
" line_items_pages[str(i)].pop()\n",
" line_items_pages[str(i)].append(merged_li)\n",
"\n",
" line_items_pages[str(i + 1)].pop(0)\n",
" except:\n",
" pass\n",
" # print(\"No line items in page : \", i)\n",
"\n",
" if is_entity_updated:\n",
" # Updating the entities in original json\n",
" updated_line_items = []\n",
"\n",
" for i in range(len(line_items_pages)):\n",
" try:\n",
" updated_line_items.extend(line_items_pages[str(i)])\n",
" except:\n",
" pass\n",
" # print(\"No line items to update in page : \", i)\n",
"\n",
" updated_entities = []\n",
" for entity in json_dict[\"entities\"]:\n",
" if entity[\"type\"] != line_item_name:\n",
" updated_entities.append(entity)\n",
"\n",
" updated_entities.extend(updated_line_items)\n",
"\n",
" updated_entities = sort_line_items_y(updated_entities)\n",
" json_dict[\"entities\"] = updated_entities\n",
" print(\"Line items merged successfully\")\n",
" else:\n",
" print(\"No line items were merged\")\n",
" store_document_as_json(\n",
" json.dumps(json_dict),\n",
" output_bucket_name,\n",
" \"/\".join(gcs_output_path.split(\"/\")[3:]) + filename,\n",
" )"
]
},
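{
"cell_type": "markdown",
"id": "c4d5e6f7-8a9b-4c0d-9e1f-2a3b4c5d6e7f",
"metadata": {},
"source": [
"Optionally, the next cell is a small sketch (it reuses the `storage_client` and `gcs_output_path` defined above) that lists the updated JSON files written under the output prefix, as a quick sanity check."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8e9f0a1-2b3c-4d5e-8f6a-7b8c9d0e1f2a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Optional sanity check: list the updated JSON files written under gcs_output_path.\n",
"# Reuses the storage_client created while running the previous cell.\n",
"output_bucket = storage_client.bucket(gcs_output_path.split(\"/\")[2])\n",
"output_prefix = \"/\".join(gcs_output_path.split(\"/\")[3:])\n",
"for blob in output_bucket.list_blobs(prefix=output_prefix):\n",
"    if blob.name.endswith(\".json\"):\n",
"        print(f\"gs://{output_bucket.name}/{blob.name}\")"
]
},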
{
"cell_type": "markdown",
"id": "16e3f538-c1ab-4ff1-a720-80d07ea9245b",
"metadata": {},
"source": [
"### Output Details"
]
},
{
"cell_type": "markdown",
"id": "b4badf9b-1b40-4b6a-9ac9-b7a0f233018f",
"metadata": {
"tags": []
},
"source": [
"### Before\n",
"<img src='./images/before_1.png' width=600 height=800 alt=\"Sample Output\"></img>\n",
"<img src='./images/before_2.png' width=600 height=800 alt=\"Sample Output\"></img>\n",
"### After\n",
"<img src='./images/after_1.png' width=600 height=800 alt=\"Sample Output\"></img>\n",
"<img src='./images/after_2.png' width=600 height=800 alt=\"Sample Output\"></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m127",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m127"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}