incubator-tools/header_footer_entities/header_footer_entities.ipynb

{ "cells": [
 { "cell_type": "markdown", "id": "29399724", "metadata": {}, "source": [ "# Header and Footer Entities" ] },
 { "cell_type": "markdown", "id": "1e4a6def", "metadata": {}, "source": [ "* Author: docai-incubator@google.com" ] },
 { "cell_type": "markdown", "id": "6029ebef", "metadata": {}, "source": [
  "## Disclaimer\n",
  "\n",
  "This tool is not supported by the Google engineering or product teams. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
 ] },
 { "cell_type": "markdown", "id": "18bbfedd", "metadata": {}, "source": [
  "# Objective\n",
  "This tool extracts header and footer text from Document AI OCR results by examining the normalized bounding boxes of tokens near the top and bottom of each page, and appends the collected text to the document as `header` and `footer` entities."
 ] },
 { "cell_type": "markdown", "id": "4a77d9e9", "metadata": {}, "source": [
  "# Prerequisites\n",
  "* Vertex AI Notebook\n",
  "* GCS Folder Path\n",
  "* DocumentAI Parsed JSONs"
 ] },
 { "cell_type": "markdown", "id": "cad788f5", "metadata": {}, "source": [ "# Step-by-Step Procedure" ] },
 { "cell_type": "markdown", "id": "3e511f7a", "metadata": {}, "source": [ "## 1. Import Modules/Packages" ] },
 { "cell_type": "code", "execution_count": null, "id": "80ecc19e-d0fb-4435-9284-44318db1937c", "metadata": {}, "outputs": [], "source": [ "# !pip install google-cloud-documentai" ] },
 { "cell_type": "code", "execution_count": null, "id": "10e90cfe", "metadata": {}, "outputs": [], "source": [
  "# Run this cell to download the utilities module\n",
  "# !wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
 ] },
 { "cell_type": "code", "execution_count": null, "id": "2d4b7289", "metadata": { "tags": [] }, "outputs": [], "source": [
  "from google.cloud import documentai_v1beta3 as documentai\n",
  "\n",
  "from utilities import (\n",
  "    documentai_json_proto_downloader,\n",
  "    file_names,\n",
  "    store_document_as_json,\n",
  ")"
 ] },
 { "cell_type": "markdown", "id": "4aa4caea", "metadata": {}, "source": [ "## 2. Input Details" ] },
 { "cell_type": "markdown", "id": "717bb8cc", "metadata": {}, "source": [
  "* **GCS_INPUT_URI** : GCS folder path containing the Document AI processor JSON results\n",
  "* **GCS_OUTPUT_URI** : GCS folder path where the post-processed results are stored\n",
  "* **Y_STOP_GAP_HEADER** : Minimum vertical gap between the header and the body text, used to auto-detect the header boundary\n",
  "* **Y_STOP_GAP_FOOTER** : Minimum vertical gap between the footer and the body text, used to auto-detect the footer boundary\n",
  "* **Y_HEADER_BORDER** : Fixed normalized y-coordinate below which text is treated as header (e.g. 0.06); when set, it overrides the gap-based detection\n",
  "* **Y_FOOTER_BORDER** : Fixed normalized y-coordinate above which text is treated as footer (e.g. 0.90); when set, it overrides the gap-based detection\n",
  "\n",
  "A small sketch of the gap heuristic follows the configuration cell below."
 ] },
 { "cell_type": "code", "execution_count": null, "id": "bfdd838b", "metadata": { "tags": [] }, "outputs": [], "source": [
  "GCS_INPUT_URI = \"gs://BUCKET/header_footer_entities/input/\"\n",
  "GCS_OUTPUT_URI = \"gs://BUCKET/header_footer_entities/output/\"\n",
  "\n",
  "# configurable parameters\n",
  "Y_STOP_GAP_HEADER = 0.0260  # minimum gap between the header and the body text\n",
  "Y_STOP_GAP_FOOTER = 0.001  # minimum gap between the footer and the body text\n",
  "# OR\n",
  "Y_HEADER_BORDER = None  # e.g. 0.06, maximum y at which header text ends\n",
  "Y_FOOTER_BORDER = None  # e.g. 0.90, minimum y at which footer text starts"
 ] },
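 { "cell_type": "markdown", "id": "gap-heuristic-demo-md", "metadata": {}, "source": [
  "The next cell is a small, self-contained sketch (it is not part of the pipeline, and the `token_tops` values are made-up sample data) illustrating how the `Y_STOP_GAP_*` heuristic works: the top edges of all token bounding boxes on a page are sorted, and scanning from the top, the first gap larger than the threshold marks where the body text begins (everything above it is header); scanning from the bottom, the last such gap marks where the body text ends (everything below it is footer)."
 ] },
 { "cell_type": "code", "execution_count": null, "id": "gap-heuristic-demo-code", "metadata": {}, "outputs": [], "source": [
  "# Illustration only: made-up, sorted top-edge y-coordinates of tokens on one page.\n",
  "token_tops = [0.03, 0.031, 0.12, 0.15, 0.18, 0.21, 0.24, 0.95, 0.951]\n",
  "\n",
  "\n",
  "def find_header_boundary(tops, stop_gap):\n",
  "    \"\"\"Return the y of the first body line; tokens strictly above it are header.\"\"\"\n",
  "    for a, b in zip(tops, tops[1:]):\n",
  "        if b - a > stop_gap:\n",
  "            return b\n",
  "    return None  # no qualifying gap -> no header detected\n",
  "\n",
  "\n",
  "def find_footer_boundary(tops, stop_gap):\n",
  "    \"\"\"Return the y of the last body line; tokens strictly below it are footer.\"\"\"\n",
  "    for i in range(len(tops) - 1, 0, -1):\n",
  "        if tops[i] - tops[i - 1] > stop_gap:\n",
  "            return tops[i - 1]\n",
  "    return None  # no qualifying gap -> no footer detected\n",
  "\n",
  "\n",
  "print(find_header_boundary(token_tops, 0.0260))  # 0.12 -> 0.03, 0.031 are header\n",
  "print(find_footer_boundary(token_tops, 0.0260))  # 0.24 -> 0.95, 0.951 are footer"
 ] },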
 { "cell_type": "markdown", "id": "3347e5d1", "metadata": {}, "source": [ "## 3. Run the Code Cells Below" ] },
 { "cell_type": "code", "execution_count": null, "id": "73ea5afe-dee7-4a17-a256-b67f852e79a7", "metadata": { "tags": [] }, "outputs": [], "source": [
  "def add_header_footer_entities(json_data: documentai.Document) -> documentai.Document:\n",
  "    \"\"\"Creates header and footer entities from any text that falls inside the\n",
  "    configured header/footer regions of each page.\n",
  "\n",
  "    Args:\n",
  "        json_data (documentai.Document): DocumentAI processor result as a Document object\n",
  "\n",
  "    Returns:\n",
  "        documentai.Document: Post-processed Document object\n",
  "    \"\"\"\n",
  "\n",
  "    for page in json_data.pages:\n",
  "        page_num = page.page_number\n",
  "        # Sorted top edges (minimum y) of every token bounding box on the page.\n",
  "        y_list_sorted = sorted(\n",
  "            min(v.y for v in token.layout.bounding_poly.normalized_vertices)\n",
  "            for token in page.tokens\n",
  "        )\n",
  "        header_text_anchors = []\n",
  "        header_page_x = []\n",
  "        header_page_y = []\n",
  "        footer_text_anchors = []\n",
  "        footer_page_x = []\n",
  "        footer_page_y = []\n",
  "        # Scan from the top: the first vertical gap larger than Y_STOP_GAP_HEADER\n",
  "        # marks where the body text begins; everything above it is header.\n",
  "        max_y_header = None\n",
  "        for i in range(len(y_list_sorted) - 1):\n",
  "            if y_list_sorted[i + 1] - y_list_sorted[i] > Y_STOP_GAP_HEADER:\n",
  "                max_y_header = y_list_sorted[i + 1]\n",
  "                break\n",
  "        # Scan from the bottom: the last vertical gap larger than Y_STOP_GAP_FOOTER\n",
  "        # marks where the body text ends; everything below it is footer.\n",
  "        min_y_footer = None\n",
  "        for i in range(len(y_list_sorted) - 1, 0, -1):\n",
  "            if y_list_sorted[i] - y_list_sorted[i - 1] > Y_STOP_GAP_FOOTER:\n",
  "                min_y_footer = y_list_sorted[i - 1]\n",
  "                break\n",
  "        # Fixed borders, when configured, override the gap-based detection.\n",
  "        if Y_HEADER_BORDER is not None:\n",
  "            max_y_header = Y_HEADER_BORDER\n",
  "        if Y_FOOTER_BORDER is not None:\n",
  "            min_y_footer = Y_FOOTER_BORDER\n",
  "        # Collect the text anchors and box extents of every token that falls\n",
  "        # inside the header or footer region.\n",
  "        for token in page.tokens:\n",
  "            vertices = token.layout.bounding_poly.normalized_vertices\n",
  "            minx_b, miny_b = min(v.x for v in vertices), min(v.y for v in vertices)\n",
  "            maxx_b, maxy_b = max(v.x for v in vertices), max(v.y for v in vertices)\n",
  "            if max_y_header is not None and miny_b < max_y_header:\n",
  "                for text_anc in token.layout.text_anchor.text_segments:\n",
  "                    header_text_anchors.append(\n",
  "                        documentai.Document.TextAnchor.TextSegment(\n",
  "                            start_index=text_anc.start_index,\n",
  "                            end_index=text_anc.end_index,\n",
  "                        )\n",
  "                    )\n",
  "                header_page_x.extend([minx_b, maxx_b])\n",
  "                header_page_y.extend([miny_b, maxy_b])\n",
  "            if min_y_footer is not None and miny_b > min_y_footer:\n",
  "                for text_anc in token.layout.text_anchor.text_segments:\n",
  "                    footer_text_anchors.append(\n",
  "                        documentai.Document.TextAnchor.TextSegment(\n",
  "                            start_index=text_anc.start_index,\n",
  "                            end_index=text_anc.end_index,\n",
  "                        )\n",
  "                    )\n",
  "                footer_page_x.extend([minx_b, maxx_b])\n",
  "                footer_page_y.extend([miny_b, maxy_b])\n",
  "        # Rebuild the mention text in reading order.\n",
  "        header_text_anchors.sort(key=lambda x: x.end_index)\n",
  "        footer_text_anchors.sort(key=lambda x: x.end_index)\n",
  "        header_mention_text = \"\".join(\n",
  "            json_data.text[ts.start_index : ts.end_index] for ts in header_text_anchors\n",
  "        )\n",
  "        footer_mention_text = \"\".join(\n",
  "            json_data.text[ts.start_index : ts.end_index] for ts in footer_text_anchors\n",
  "        )\n",
  "        try:\n",
  "            # Bounding box covering all header tokens; min()/max() raise\n",
  "            # ValueError when the page has no header tokens.\n",
  "            header_norm_ver = [\n",
  "                documentai.NormalizedVertex(x=min(header_page_x), y=min(header_page_y)),\n",
  "                documentai.NormalizedVertex(x=max(header_page_x), y=min(header_page_y)),\n",
  "                documentai.NormalizedVertex(x=min(header_page_x), y=max(header_page_y)),\n",
  "                documentai.NormalizedVertex(x=max(header_page_x), y=max(header_page_y)),\n",
  "            ]\n",
  "            header_entity = documentai.Document.Entity()\n",
  "            header_entity.mention_text = header_mention_text\n",
  "            header_entity.type_ = \"header\"  # proto-plus exposes the `type` field as `type_`\n",
  "            bp = documentai.BoundingPoly(normalized_vertices=header_norm_ver)\n",
  "            pr = documentai.Document.PageAnchor.PageRef(\n",
  "                page=page_num - 1, bounding_poly=bp  # page refs are 0-indexed\n",
  "            )\n",
  "            header_entity.page_anchor = documentai.Document.PageAnchor(page_refs=[pr])\n",
  "            header_entity.text_anchor.text_segments = header_text_anchors\n",
  "            json_data.entities.append(header_entity)\n",
  "        except ValueError:\n",
  "            # No header on this page; still check for a footer below.\n",
  "            print(\"NO HEADER page_number\", page_num)\n",
  "        try:\n",
  "            footer_norm_ver = [\n",
  "                documentai.NormalizedVertex(x=min(footer_page_x), y=min(footer_page_y)),\n",
  "                documentai.NormalizedVertex(x=max(footer_page_x), y=min(footer_page_y)),\n",
  "                documentai.NormalizedVertex(x=min(footer_page_x), y=max(footer_page_y)),\n",
  "                documentai.NormalizedVertex(x=max(footer_page_x), y=max(footer_page_y)),\n",
  "            ]\n",
  "            footer_entity = documentai.Document.Entity()\n",
  "            footer_entity.mention_text = footer_mention_text\n",
  "            footer_entity.type_ = \"footer\"\n",
  "            bp = documentai.BoundingPoly(normalized_vertices=footer_norm_ver)\n",
  "            pr = documentai.Document.PageAnchor.PageRef(\n",
  "                page=page_num - 1, bounding_poly=bp\n",
  "            )\n",
  "            footer_entity.page_anchor = documentai.Document.PageAnchor(page_refs=[pr])\n",
  "            footer_entity.text_anchor.text_segments = footer_text_anchors\n",
  "            json_data.entities.append(footer_entity)\n",
  "        except ValueError:\n",
  "            print(\"NO FOOTER page_number\", page_num)\n",
  "    return json_data\n",
  "\n",
  "\n",
  "# List the input files in GCS and post-process each one.\n",
  "file_names_list, file_names_dict = file_names(GCS_INPUT_URI)\n",
  "for file_name in file_names_list:\n",
  "    print(\"file_name:\", file_name)\n",
  "    json_data = documentai_json_proto_downloader(\n",
  "        GCS_INPUT_URI.split(\"/\")[2], file_names_dict[file_name]\n",
  "    )\n",
  "    json_data = add_header_footer_entities(json_data)\n",
  "    print(f\"\\tUploading file {file_name} to GCS {GCS_OUTPUT_URI}\")\n",
  "    store_document_as_json(\n",
  "        documentai.Document.to_json(json_data),\n",
  "        GCS_OUTPUT_URI.split(\"/\")[2],\n",
  "        \"/\".join(GCS_OUTPUT_URI.split(\"/\")[3:]) + file_name,\n",
  "    )"
 ] },
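 { "cell_type": "markdown", "id": "verify-output-md", "metadata": {}, "source": [
  "As an optional sanity check, the sketch below (assuming the pipeline cell above has already run and written results to `GCS_OUTPUT_URI`) downloads one post-processed JSON with the same `utilities` helpers and prints the newly added `header`/`footer` entities."
 ] },
 { "cell_type": "code", "execution_count": null, "id": "verify-output-code", "metadata": {}, "outputs": [], "source": [
  "# Optional, illustrative sanity check: inspect one post-processed file.\n",
  "out_files_list, out_files_dict = file_names(GCS_OUTPUT_URI)\n",
  "sample = documentai_json_proto_downloader(\n",
  "    GCS_OUTPUT_URI.split(\"/\")[2], out_files_dict[out_files_list[0]]\n",
  ")\n",
  "for entity in sample.entities:\n",
  "    if entity.type_ in (\"header\", \"footer\"):\n",
  "        page = entity.page_anchor.page_refs[0].page\n",
  "        # Print a short preview of each header/footer mention text.\n",
  "        print(f\"page {page} {entity.type_}: {entity.mention_text[:80]!r}\")"
 ] },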
 { "cell_type": "markdown", "id": "a9eda18a", "metadata": {}, "source": [
  "## 4. Output Details\n",
  "\n",
  "Refer to the image below for the post-processed results."
 ] },
 { "cell_type": "markdown", "id": "7347f175", "metadata": {}, "source": [ "<img src='./images/output.png' width=1000 height=800></img>" ] }
 ],
 "metadata": { "environment": { "kernel": "conda-base-py", "name": "workbench-notebooks.m125", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125" }, "kernelspec": { "display_name": "Python 3 (ipykernel) (Local)", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } },
 "nbformat": 4,
 "nbformat_minor": 5
}