{
"cells": [
{
"cell_type": "markdown",
"id": "d9ad8788-654c-4e38-9d6d-42158d95c6ee",
"metadata": {},
"source": [
"# Line Item Improver Crosspage"
]
},
{
"cell_type": "markdown",
"id": "08400e10-6639-43a6-8e2a-0b4f28ba08e2",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "8cc2d117-9e2a-4ebe-8aef-94ad4f2b46b1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "94f1d1c0-823d-46d4-9a4a-e5ffa56c6b55",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This tool uses parsed json files and merges the line items spanning across 2 pages which are supposed to be under a single line item and updates the json."
]
},
{
"cell_type": "markdown",
"id": "ffe00a75-567e-4f33-b4b4-aa72f9f4fe50",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Jupyter Platform to run Python code\n",
"* Parsed json files in GCS Folder\n",
"* Output GCS folder to upload the updated json files\n"
]
},
{
"cell_type": "markdown",
"id": "9b33b7f0-baf8-421c-b476-2d25b1ee9b02",
"metadata": {},
"source": [
"## Step by Step Procedure"
]
},
{
"cell_type": "markdown",
"id": "eadfe192-5691-4053-8670-32ef3e5bea8c",
"metadata": {},
"source": [
"### 1. Import Modules/Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea066b24-5e3b-4c28-a56f-263fc6e9aa67",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d8e2a54-e250-4924-8f58-9a4642e6ddf3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install google-cloud-documentai google-cloud-storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6495b161-ac1b-40b2-a668-fe835da2950a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"from pathlib import Path\n",
"from tqdm import tqdm\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"from typing import Any, Dict, List, Tuple, Set\n",
"\n",
"from utilities import file_names, store_document_as_json"
]
},
{
"cell_type": "markdown",
"id": "1709ea5e-bf14-44ab-9963-cda14b092dc2",
"metadata": {},
"source": [
"### 2. Input Details"
]
},
{
"cell_type": "markdown",
"id": "7ce030a1-663b-445e-a38e-24dc415942c2",
"metadata": {},
"source": [
"* **gcs_input_path**: Provide the gcs path of the parent folder where the sub-folders contain input files. Please follow the folder structure described earlier.\n",
"* **project_id**: Project ID/Number of the Project.\n",
"* **line_item_name**: Parent entity name of the document which needs to be get merge.\n",
"* **gcs_output_path**: Provide gcs path where the output json files have to be saved"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36eb0a45-7e29-41b4-a9e0-44dab8f55c17",
"metadata": {},
"outputs": [],
"source": [
"gcs_input_path = \"gs://<<bucket_name>>/<<subfolder_path>>/\" # Parsed json files path , end '/' is mandatory\n",
"project_id = \"project_id\" # project ID\n",
"line_item_name = \"parent_entity_name\" # Name of the line item entity (parent entity) to be merged as per processor schema\n",
"gcs_output_path = \"gs://<<bucket_name>>/<<subfolder_path>>/\" # output path where the updated jsons to be saved, end '/' is mandatory"
]
},
{
"cell_type": "markdown",
"id": "b4a4b1ed-2e86-4a8e-b00a-ebf6e41921ef",
"metadata": {},
"source": [
"### 3. Run the required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b2afdc0-3873-4db9-a7b7-e64cf0dce00b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def get_page_wise_sorted_line_items_and_schema(\n",
" json_dict: Dict[str, Any]\n",
") -> Tuple[Dict[str, List[Dict[str, Any]]], List[str]]:\n",
" \"\"\"\n",
" Extracts and sorts line items by page and gathers a schema of unique property types.\n",
"\n",
" Args:\n",
" - json_dict (Dict[str, Any]): The JSON dictionary containing entity data.\n",
"\n",
" Returns:\n",
" - Tuple[Dict[str, List[Dict[str, Any]]], List[str]]:\n",
" - A dictionary where keys are page numbers and values are lists of sorted line items for that page.\n",
" - A sorted list of unique property types across all line items.\n",
" \"\"\"\n",
" line_items_page = {}\n",
" for entity in json_dict[\"entities\"]:\n",
" page = \"0\"\n",
" try:\n",
" if \"page\" in entity[\"pageAnchor\"][\"pageRefs\"][0].keys():\n",
" page = entity[\"pageAnchor\"][\"pageRefs\"][0][\"page\"]\n",
"\n",
" if entity[\"type\"] == line_item_name:\n",
" if \"properties\" in entity.keys() and entity[\"properties\"]:\n",
" if page in line_items_page:\n",
" line_items_page[page].append(entity)\n",
" else:\n",
" line_items_page[page] = [entity]\n",
" except Exception:\n",
" pass\n",
"\n",
" sorted_line_items = {}\n",
" schema_types = set() # Collect all unique types across line items\n",
" for page, line_items in line_items_page.items():\n",
" sorted_line_items[page] = sort_line_items_y(\n",
" line_items\n",
" ) # Sort line items by y-coordinate\n",
" for item in line_items:\n",
" for property_item in item[\"properties\"]:\n",
" schema_types.add(\n",
" property_item.get(\"type\")\n",
" ) # Collect unique property types\n",
"\n",
" return sorted_line_items, sorted(schema_types)\n",
"\n",
"\n",
"def sort_line_items_y(line_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:\n",
" \"\"\"\n",
" Sorts line items by the minimum y-coordinate of their bounding polygons.\n",
"\n",
" Args:\n",
" - line_items (List[Dict[str, Any]]): A list of line items to be sorted.\n",
"\n",
" Returns:\n",
" - List[Dict[str, Any]]: A sorted list of line items by their y-coordinate.\n",
" \"\"\"\n",
" return sorted(\n",
" line_items,\n",
" key=lambda item: min(\n",
" vertex[\"y\"]\n",
" for vertex in item[\"pageAnchor\"][\"pageRefs\"][0][\"boundingPoly\"][\n",
" \"normalizedVertices\"\n",
" ]\n",
" ),\n",
" )\n",
"\n",
"\n",
"def merge_line_items_cross_page(\n",
" line_items_cur_page: List[Dict[str, Any]],\n",
" line_items_next_page: List[Dict[str, Any]],\n",
" li_schema: List[str],\n",
") -> Dict[str, Any]:\n",
" \"\"\"\n",
" Merges line items across pages if they belong to the same logical entity.\n",
"\n",
" Args:\n",
" - line_items_cur_page (List[Dict[str, Any]]): List of line items from the current page.\n",
" - line_items_next_page (List[Dict[str, Any]]): List of line items from the next page.\n",
" - li_schema (List[str]): List of unique property types in the schema.\n",
"\n",
" Returns:\n",
" - Dict[str, Any]: The merged line item entity, or None if no merge is required.\n",
" \"\"\"\n",
" last_li_cur_page = line_items_cur_page[-1]\n",
" first_li_next_page = line_items_next_page[0]\n",
"\n",
" # Check for common entities between last and first line items\n",
" value_counts = {li_type: 0 for li_type in li_schema}\n",
" for child_i, child_j in zip(\n",
" last_li_cur_page[\"properties\"], first_li_next_page[\"properties\"]\n",
" ):\n",
" value_counts[child_i[\"type\"]] += 1\n",
" value_counts[child_j[\"type\"]] += 1\n",
"\n",
" for count in value_counts.values():\n",
" if count > 1:\n",
" return None # No merge required\n",
"\n",
" # Collect missing child entities from the schema\n",
" missing_child_entities = []\n",
" for li_type in li_schema:\n",
" found = False\n",
" for child in last_li_cur_page[\"properties\"]:\n",
" if li_type == child[\"type\"]:\n",
" found = True\n",
" break\n",
" if not found:\n",
" missing_child_entities.append(li_type)\n",
"\n",
" # Merge the line items\n",
" merged_entity = dict(last_li_cur_page)\n",
" for li_type in missing_child_entities:\n",
" for child in first_li_next_page[\"properties\"]:\n",
" if li_type == child[\"type\"]:\n",
" merged_entity[\"properties\"].append(child)\n",
" break\n",
"\n",
" # Update bounding polygon, text anchors, and mention text\n",
" merged_entity[\"pageAnchor\"][\"pageRefs\"].extend(\n",
" first_li_next_page[\"pageAnchor\"][\"pageRefs\"]\n",
" )\n",
" merged_entity[\"textAnchor\"][\"textSegments\"].extend(\n",
" first_li_next_page[\"textAnchor\"][\"textSegments\"]\n",
" )\n",
" merged_entity[\"textAnchor\"][\"content\"] += (\n",
" \"\\n\" + first_li_next_page[\"textAnchor\"][\"content\"]\n",
" )\n",
" merged_entity[\"mentionText\"] += \" \" + first_li_next_page[\"mentionText\"]\n",
"\n",
" return merged_entity"
]
},
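{
"cell_type": "markdown",
"id": "3f0c2b1a-9d4e-4a5e-9c1d-0a1b2c3d4e5f",
"metadata": {},
"source": [
"The next cell is optional. It is a minimal sketch built on two synthetic, hand-made entity dictionaries (illustrative values only, not a real Document AI output) to show when `merge_line_items_cross_page` combines the last line item of one page with the first line item of the next page: a merge happens only when their child entity types do not overlap."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e2d1c0b-5a4f-4e3d-8b2a-1f2e3d4c5b6a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Minimal synthetic example (illustrative values only): the last line item of page 0\n",
"# is missing its amount, which shows up as the first line item of page 1.\n",
"demo_last_li = {\n",
"    \"type\": \"line_item\",\n",
"    \"mentionText\": \"Widget A 2\",\n",
"    \"properties\": [\n",
"        {\"type\": \"line_item/description\", \"mentionText\": \"Widget A\"},\n",
"        {\"type\": \"line_item/quantity\", \"mentionText\": \"2\"},\n",
"    ],\n",
"    \"pageAnchor\": {\n",
"        \"pageRefs\": [\n",
"            {\n",
"                \"page\": \"0\",\n",
"                \"boundingPoly\": {\"normalizedVertices\": [{\"x\": 0.1, \"y\": 0.9}]},\n",
"            }\n",
"        ]\n",
"    },\n",
"    \"textAnchor\": {\n",
"        \"textSegments\": [{\"startIndex\": \"0\", \"endIndex\": \"10\"}],\n",
"        \"content\": \"Widget A 2\",\n",
"    },\n",
"}\n",
"demo_first_li = {\n",
"    \"type\": \"line_item\",\n",
"    \"mentionText\": \"19.99\",\n",
"    \"properties\": [{\"type\": \"line_item/amount\", \"mentionText\": \"19.99\"}],\n",
"    \"pageAnchor\": {\n",
"        \"pageRefs\": [\n",
"            {\n",
"                \"page\": \"1\",\n",
"                \"boundingPoly\": {\"normalizedVertices\": [{\"x\": 0.1, \"y\": 0.05}]},\n",
"            }\n",
"        ]\n",
"    },\n",
"    \"textAnchor\": {\n",
"        \"textSegments\": [{\"startIndex\": \"12\", \"endIndex\": \"17\"}],\n",
"        \"content\": \"19.99\",\n",
"    },\n",
"}\n",
"demo_schema = [\"line_item/amount\", \"line_item/description\", \"line_item/quantity\"]\n",
"merged = merge_line_items_cross_page([demo_last_li], [demo_first_li], demo_schema)\n",
"# Expected: the merged line item carries all three child types\n",
"print([prop[\"type\"] for prop in merged[\"properties\"]])"
]
},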
{
"cell_type": "markdown",
"id": "ac1d85cf-7529-469b-be7e-3e00a778af87",
"metadata": {},
"source": [
"### 4. Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49dab42f-6c98-4aac-b804-cd3ec81127f8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" file_names_list, file_dict = file_names(gcs_input_path)\n",
" storage_client = storage.Client()\n",
" source_bucket = storage_client.bucket(gcs_input_path.split(\"/\")[2])\n",
"\n",
" for filename, filepath in tqdm(file_dict.items(), desc=\"Progress\"):\n",
" input_bucket_name = gcs_input_path.split(\"/\")[2]\n",
" if \".json\" in filepath:\n",
" output_bucket_name = gcs_output_path.split(\"/\")[2]\n",
" print(filename)\n",
" json_dict = json.loads(\n",
" source_bucket.blob(filepath).download_as_string().decode(\"utf-8\")\n",
" )\n",
"\n",
" # Flag to track whether or not to update the json_dict after merging, if no line items are merged, this flag will remain False\n",
" is_entity_updated = False\n",
"\n",
" # get lineitems by pages\n",
" (\n",
" line_items_pages,\n",
" line_items_schema,\n",
" ) = get_page_wise_sorted_line_items_and_schema(json_dict)\n",
" print(\"Schema: \", line_items_schema)\n",
"\n",
" # Merging the cross page line items\n",
" for i in range(len(line_items_pages)):\n",
" try:\n",
" cur_page_li = line_items_pages[str(i)]\n",
" next_page_li = line_items_pages[str(i + 1)]\n",
" merged_li = merge_line_items_cross_page(\n",
" cur_page_li, next_page_li, line_items_schema\n",
" )\n",
"\n",
" if merged_li is not None:\n",
" is_entity_updated = True\n",
" line_items_pages[str(i)].pop()\n",
" line_items_pages[str(i)].append(merged_li)\n",
"\n",
" line_items_pages[str(i + 1)].pop(0)\n",
" except:\n",
" pass\n",
" # print(\"No line items in page : \", i)\n",
"\n",
" if is_entity_updated:\n",
" # Updating the entities in original json\n",
" updated_line_items = []\n",
"\n",
" for i in range(len(line_items_pages)):\n",
" try:\n",
" updated_line_items.extend(line_items_pages[str(i)])\n",
" except:\n",
" pass\n",
" # print(\"No line items to update in page : \", i)\n",
"\n",
" updated_entities = []\n",
" for entity in json_dict[\"entities\"]:\n",
" if entity[\"type\"] != line_item_name:\n",
" updated_entities.append(entity)\n",
"\n",
" updated_entities.extend(updated_line_items)\n",
"\n",
" updated_entities = sort_line_items_y(updated_entities)\n",
" json_dict[\"entities\"] = updated_entities\n",
" print(\"Line items merged successfully\")\n",
" else:\n",
" print(\"No line items were merged\")\n",
" store_document_as_json(\n",
" json.dumps(json_dict),\n",
" output_bucket_name,\n",
" \"/\".join(gcs_output_path.split(\"/\")[3:]) + filename,\n",
" )"
]
},
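{
"cell_type": "markdown",
"id": "c4d5e6f7-8a9b-4c0d-9e1f-2a3b4c5d6e7f",
"metadata": {},
"source": [
"Optionally, the next cell is a small sketch (it reuses the `storage_client` and `gcs_output_path` defined above) that lists the updated JSON files written under the output prefix, as a quick sanity check."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8e9f0a1-2b3c-4d5e-8f6a-7b8c9d0e1f2a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Optional sanity check: list the updated JSON files written under gcs_output_path.\n",
"# Reuses the storage_client created while running the previous cell.\n",
"output_bucket = storage_client.bucket(gcs_output_path.split(\"/\")[2])\n",
"output_prefix = \"/\".join(gcs_output_path.split(\"/\")[3:])\n",
"for blob in output_bucket.list_blobs(prefix=output_prefix):\n",
"    if blob.name.endswith(\".json\"):\n",
"        print(f\"gs://{output_bucket.name}/{blob.name}\")"
]
},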
{
"cell_type": "markdown",
"id": "16e3f538-c1ab-4ff1-a720-80d07ea9245b",
"metadata": {},
"source": [
"### Output Details"
]
},
{
"cell_type": "markdown",
"id": "b4badf9b-1b40-4b6a-9ac9-b7a0f233018f",
"metadata": {
"tags": []
},
"source": [
"### Before\n",
"<img src='./images/before_1.png' width=600 height=800 alt=\"Sample Output\"></img>\n",
"<img src='./images/before_2.png' width=600 height=800 alt=\"Sample Output\"></img>\n",
"### After\n",
"<img src='./images/after_1.png' width=600 height=800 alt=\"Sample Output\"></img>\n",
"<img src='./images/after_2.png' width=600 height=800 alt=\"Sample Output\"></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m127",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m127"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}