{
"cells": [
{
"cell_type": "markdown",
"id": "29399724",
"metadata": {},
"source": [
"# Header and Footer entities"
]
},
{
"cell_type": "markdown",
"id": "1e4a6def",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "6029ebef",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "18bbfedd",
"metadata": {},
"source": [
"# Objective\n",
"The objective is to extract headers and footers from documents using OCR by leveraging bounding boxes at the top and bottom of each page. "
]
},
{
"cell_type": "markdown",
"id": "4a77d9e9",
"metadata": {},
"source": [
"# Prerequisites\n",
"* Vertex AI Notebook\n",
"* GCS Folder Path\n",
"* DocumentAI Parsed JSONs"
]
},
{
"cell_type": "markdown",
"id": "cad788f5",
"metadata": {},
"source": [
"# Step-by-Step Procedure"
]
},
{
"cell_type": "markdown",
"id": "3e511f7a",
"metadata": {},
"source": [
"## 1. Import Modules/Packages"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "80ecc19e-d0fb-4435-9284-44318db1937c",
"metadata": {},
"outputs": [],
"source": [
"# !pip install google-cloud-documentai"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "10e90cfe",
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"# !wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2d4b7289",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from google.cloud import documentai_v1beta3 as documentai\n",
"\n",
"from utilities import (\n",
" documentai_json_proto_downloader,\n",
" file_names,\n",
" store_document_as_json,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "4aa4caea",
"metadata": {},
"source": [
"## 2. Input Details"
]
},
{
"cell_type": "markdown",
"id": "717bb8cc",
"metadata": {},
"source": [
"* **GCS_INPUT_URI** : It is input GCS folder path which contains DocumentAI processor JSON results\n",
"* **GCS_OUTPUT_URI** : It is a GCS folder path to store post-processing results\n",
"* **Y_STOP_GAP_HEADER** : Minimum gap to check between the header and general text\n",
"* **Y_STOP_GAP_FOOTER**: Minimum gap to check between the footer and general text\n",
"* **Y_HEADER_BORDER** : Header text end maximum position (eg: 0.06)\n",
"* **Y_FOOTER_BORDER**: Footer text start position (eg: 0.90)"
]
},
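{
"cell_type": "markdown",
"id": "f3a9c5e1",
"metadata": {},
"source": [
"The tool supports two ways of locating the header/footer boundary: a gap-based scan that cuts the sorted token y-positions at the first vertical gap wider than `Y_STOP_GAP_HEADER` / `Y_STOP_GAP_FOOTER`, and fixed borders (`Y_HEADER_BORDER` / `Y_FOOTER_BORDER`) that override the scan when set. The next cell is a minimal, self-contained sketch of the gap-based heuristic on made-up token y-positions; the values are illustrative only."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7d2b8a4",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of the gap-based header cut (hypothetical y values)\n",
"toy_token_ys = [0.02, 0.03, 0.12, 0.13, 0.15, 0.88, 0.95]  # made-up normalized y positions\n",
"threshold = 0.0260  # plays the role of Y_STOP_GAP_HEADER\n",
"\n",
"ys = sorted(toy_token_ys)\n",
"boundary = None\n",
"for a, b in zip(ys, ys[1:]):\n",
"    if b - a > threshold:\n",
"        boundary = b  # first token below the gap starts the body text\n",
"        break\n",
"print(\"tokens with y <\", boundary, \"are treated as header\")"
]
},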
{
"cell_type": "code",
"execution_count": 6,
"id": "bfdd838b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"GCS_INPUT_URI = \"gs://BUCKET/header_footer_entities/input/\"\n",
"GCS_OUTPUT_URI = \"gs://BUCKET/header_footer_entities/output/\"\n",
"\n",
"# configurable parameters\n",
"Y_STOP_GAP_HEADER = 0.0260\n",
"Y_STOP_GAP_FOOTER = 0.001\n",
"# OR\n",
"Y_HEADER_BORDER = None # 0.06\n",
"Y_FOOTER_BORDER = None # 0.90"
]
},
{
"cell_type": "markdown",
"id": "3347e5d1",
"metadata": {},
"source": [
"## 3. Run Below Code-Cells"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "73ea5afe-dee7-4a17-a256-b67f852e79a7",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"file_name: AWS DPA-0.json\n",
"\tUploading file AWS DPA-0.json to GCS gs://siddamv/tools/header_footer_entities/output/\n"
]
}
],
"source": [
"def add_header_footer_entities(json_data: documentai.Document) -> documentai.Document:\n",
" \"\"\"\n",
" It will create header and footer entities if any text falls under predefined variables region\n",
"\n",
" Args:\n",
" json_data (documentai.Document): DocumentAI Processor result in Document object format\n",
"\n",
" Returns:\n",
" documentai.Document: Post processed Document object\n",
" \"\"\"\n",
"\n",
" for page in json_data.pages:\n",
" page_num = page.page_number\n",
" y_list = []\n",
" for token in page.tokens:\n",
" vertices = token.layout.bounding_poly.normalized_vertices\n",
" minx_b, miny_b = min(point.x for point in vertices), min(\n",
" point.y for point in vertices\n",
" )\n",
" maxx_b, maxy_b = max(point.x for point in vertices), max(\n",
" point.y for point in vertices\n",
" )\n",
" y_list.append(miny_b)\n",
" y_list_sorted = sorted(y_list)\n",
" max_y_header = None\n",
" header_text_anchors = []\n",
" header_page_x = []\n",
" header_page_y = []\n",
" footer_text_anchors = []\n",
" footer_page_x = []\n",
" footer_page_y = []\n",
" for i in range(len(y_list_sorted) - 1):\n",
" if y_list_sorted[i + 1] - y_list_sorted[i] > Y_STOP_GAP_HEADER:\n",
" max_y_header = y_list_sorted[i + 1]\n",
" break\n",
" min_y_header = None\n",
" for i in range(len(y_list_sorted) - 1, 0, -1):\n",
" if y_list_sorted[i] - y_list_sorted[i - 1] > Y_STOP_GAP_FOOTER:\n",
" min_y_header = y_list_sorted[i - 1]\n",
" break\n",
" header_text = \"\"\n",
" footer_text = \"\"\n",
" if Y_HEADER_BORDER != None:\n",
" max_y_header = Y_HEADER_BORDER\n",
" if Y_FOOTER_BORDER != None:\n",
" min_y_header = Y_FOOTER_BORDER\n",
" for token in page.tokens:\n",
" vertices = token.layout.bounding_poly.normalized_vertices\n",
" minx_b, miny_b = min(point.x for point in vertices), min(\n",
" point.y for point in vertices\n",
" )\n",
" maxx_b, maxy_b = max(point.x for point in vertices), max(\n",
" point.y for point in vertices\n",
" )\n",
" if miny_b < max_y_header:\n",
" anc = token.layout.text_anchor.text_segments\n",
" for text_anc in anc:\n",
" start_index = text_anc.start_index\n",
" end_index = text_anc.end_index\n",
" header_text += json_data.text[start_index:end_index]\n",
" header_ts = documentai.Document.TextAnchor.TextSegment(\n",
" start_index=start_index, end_index=end_index\n",
" )\n",
" header_text_anchors.append(header_ts)\n",
" header_page_x.extend([minx_b, maxx_b])\n",
" header_page_y.extend([miny_b, maxy_b])\n",
" if miny_b > min_y_header:\n",
" anc = token.layout.text_anchor.text_segments\n",
" for text_anc in anc:\n",
" start_index = text_anc.start_index\n",
" end_index = text_anc.end_index\n",
" footer_text += json_data.text[start_index:end_index]\n",
" footer_ts = documentai.Document.TextAnchor.TextSegment(\n",
" start_index=start_index, end_index=end_index\n",
" )\n",
" footer_text_anchors.append(footer_ts)\n",
" footer_page_x.extend([minx_b, maxx_b])\n",
" footer_page_y.extend([miny_b, maxy_b])\n",
" sorted_footer_text_anchors = sorted(\n",
" footer_text_anchors, key=lambda x: x.end_index\n",
" )\n",
" sorted_header_text_anchors = sorted(\n",
" header_text_anchors, key=lambda x: x.end_index\n",
" )\n",
" header_mention_text = \"\"\n",
" for an1 in sorted_header_text_anchors:\n",
" header_mention_text += json_data.text[an1.start_index : an1.end_index]\n",
" footer_mention_text = \"\"\n",
" for an1 in sorted_footer_text_anchors:\n",
" footer_mention_text += json_data.text[an1.start_index : an1.end_index]\n",
" try:\n",
" normalized_vertex_0 = documentai.NormalizedVertex(\n",
" x=min(header_page_x), y=min(header_page_y)\n",
" )\n",
" normalized_vertex_1 = documentai.NormalizedVertex(\n",
" x=max(header_page_x), y=min(header_page_y)\n",
" )\n",
" normalized_vertex_2 = documentai.NormalizedVertex(\n",
" x=min(header_page_x), y=max(header_page_y)\n",
" )\n",
" normalized_vertex_3 = documentai.NormalizedVertex(\n",
" x=max(header_page_x), y=max(header_page_y)\n",
" )\n",
" header_norm_ver = [\n",
" normalized_vertex_0,\n",
" normalized_vertex_1,\n",
" normalized_vertex_2,\n",
" normalized_vertex_3,\n",
" ]\n",
" header_entity = documentai.Document.Entity()\n",
" header_entity.mention_text = header_mention_text\n",
" header_entity.type = \"header\"\n",
" bp = documentai.BoundingPoly(normalized_vertices=header_norm_ver)\n",
" pr = documentai.Document.PageAnchor.PageRef(\n",
" page=str(page_num - 1), bounding_poly=bp\n",
" )\n",
" pa = documentai.Document.PageAnchor()\n",
" pa.page_refs = [pr]\n",
" header_entity.page_anchor = pa\n",
" header_entity.text_anchor.text_segments = header_text_anchors\n",
" json_data.entities.append(header_entity)\n",
"\n",
" except ValueError:\n",
" print(\"NO HEADER page_number\", page_num)\n",
" continue\n",
" try:\n",
" normalized_vertex_0 = documentai.NormalizedVertex(\n",
" x=min(footer_page_x), y=min(footer_page_y)\n",
" )\n",
" normalized_vertex_1 = documentai.NormalizedVertex(\n",
" x=max(footer_page_x), y=min(footer_page_y)\n",
" )\n",
" normalized_vertex_2 = documentai.NormalizedVertex(\n",
" x=min(footer_page_x), y=max(footer_page_y)\n",
" )\n",
" normalized_vertex_3 = documentai.NormalizedVertex(\n",
" x=max(footer_page_x), y=max(footer_page_y)\n",
" )\n",
" footer_norm_ver = [\n",
" normalized_vertex_0,\n",
" normalized_vertex_1,\n",
" normalized_vertex_2,\n",
" normalized_vertex_3,\n",
" ]\n",
" footer_entity = documentai.Document.Entity()\n",
" footer_entity.mention_text = footer_mention_text\n",
" footer_entity.type = \"footer\"\n",
" bp = documentai.BoundingPoly(normalized_vertices=footer_norm_ver)\n",
" pr = documentai.Document.PageAnchor.PageRef(\n",
" page=str(page_num - 1), bounding_poly=bp\n",
" )\n",
" pa = documentai.Document.PageAnchor()\n",
" pa.page_refs = [pr]\n",
" footer_entity.page_anchor = pa\n",
" footer_entity.text_anchor.text_segments = footer_text_anchors\n",
" json_data.entities.append(footer_entity)\n",
" except ValueError:\n",
" print(\"NO FOOTER page_number\", page_num)\n",
" continue\n",
" return json_data\n",
"\n",
"\n",
"# getting list of files and path from GCS\n",
"file_names_list, file_names_dict = file_names(GCS_INPUT_URI)\n",
"# looping each file for adding footer and header entities\n",
"for i in range(len(file_names_list)):\n",
" print(\"file_name: \", file_names_list[i])\n",
" file_name = file_names_list[i]\n",
" json_data = documentai_json_proto_downloader(\n",
" GCS_INPUT_URI.split(\"/\")[2], file_names_dict[file_names_list[i]]\n",
" )\n",
" y = 0\n",
" json_data = add_header_footer_entities(json_data)\n",
" print(f\"\\tUploading file {file_name} to GCS {GCS_OUTPUT_URI}\")\n",
" store_document_as_json(\n",
" documentai.Document.to_json(json_data),\n",
" GCS_OUTPUT_URI.split(\"/\")[2],\n",
" (\"/\").join(GCS_OUTPUT_URI.split(\"/\")[3:]) + file_name,\n",
" )"
]
},
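{
"cell_type": "markdown",
"id": "c9a4f7d2",
"metadata": {},
"source": [
"To sanity-check the results, the cell below re-downloads one of the post-processed JSONs from `GCS_OUTPUT_URI` and prints the `header` and `footer` entities that were added. This is a minimal sketch reusing the same helper functions as above, and it assumes at least one file was processed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5b6e8f0",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: inspect the entities added to the first output file\n",
"out_files_list, out_files_dict = file_names(GCS_OUTPUT_URI)\n",
"if out_files_list:\n",
"    doc = documentai_json_proto_downloader(\n",
"        GCS_OUTPUT_URI.split(\"/\")[2], out_files_dict[out_files_list[0]]\n",
"    )\n",
"    for entity in doc.entities:\n",
"        if entity.type_ in (\"header\", \"footer\"):\n",
"            print(entity.type_, \"->\", repr(entity.mention_text))"
]
},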
{
"cell_type": "markdown",
"id": "a9eda18a",
"metadata": {},
"source": [
"# 4. Output Details\n",
"\n",
"Refer below images for postprocessed results"
]
},
{
"cell_type": "markdown",
"id": "7347f175",
"metadata": {},
"source": [
"<img src='./images/output.png' width=1000 height=800></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m125",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}