incubator-tools/Label_Section_Headers/Label_Section_Headers.ipynb (738 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "21c9480d-5b4e-4119-b13c-2ee5aee673ed",
"metadata": {},
"source": [
"# Label Section Headers"
]
},
{
"cell_type": "markdown",
"id": "63f388a5-6ad4-430d-86e0-15c3834d4d99",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "2ffbe40b-977d-4fd6-b72a-d3f3e80aaef1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "d284197f-200c-4cc8-92e7-3ee0fd6a5c07",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This tool is helpful to label headers, sub-headers and sub-sub-headers based on the provided range of font-sizes for each header-type. It creates new entities for each header type.\n",
"Added few parameters so that User/Customer can able to tune this tool as per their layout \n",
"headers - text only with bold is considered for headers \n",
"[section]_text - text only with non-bold/plain is considered for headers\n",
"\n",
"based on sections_count 1 or 2 or 3 \n",
" * it will select entity labels as (h) or (h, sh) or (h,sh,ssh) \n",
"\n",
"based bold_text_threshold range - [0,1] \n",
" * it will discard font_sizes which falls under mentioned threshold, then it will group all fonts which are above threshold creates groups based on `section_count`\n",
"\n",
"\n",
"footer_threshold & header_threshold - [0-1] i.e, normalized y-vertices point \n",
"* most data in footer and header sections are in bold-text, if we utilize these thresholds to discard all tokens which fall in these regions, we improve bold fonts distribution. \n",
"NOTE: if user utilizing footer_threshold & header_threshold, it is recommended to keep bold_text_threshold to zero, so that all tokens are used to create entities"
]
},
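{
"cell_type": "markdown",
"id": "demo-font-bagging-md",
"metadata": {},
"source": [
"The cell below is a small illustrative sketch (added for clarity, not part of the tool itself) of how a bold-font frequency distribution is thresholded with `bold_text_threshold` and split into `sections_count` buckets; the values and variable names are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "demo-font-bagging-code",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"# Hypothetical bold-font distribution: font size -> fraction of bold tokens,\n",
"# already sorted by frequency in descending order (as the tool produces it)\n",
"demo_bold_font_dist = {11: 0.40, 12: 0.30, 14: 0.22, 18: 0.08}\n",
"demo_sections_count = 3\n",
"demo_bold_text_threshold = 0.1  # font sizes covering less than 10% of bold tokens are dropped\n",
"\n",
"kept = [size for size, freq in demo_bold_font_dist.items() if freq > demo_bold_text_threshold]\n",
"per_group = math.ceil(len(kept) / demo_sections_count)\n",
"buckets = [kept[i * per_group : (i + 1) * per_group] for i in range(demo_sections_count)]\n",
"# Reversing the buckets assigns the least frequent (typically header-like) bold fonts to \"h\"\n",
"h_demo, sh_demo, ssh_demo = buckets[::-1]\n",
"print(\"h:\", h_demo, \"sh:\", sh_demo, \"ssh:\", ssh_demo)"
]
},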
{
"cell_type": "markdown",
"id": "e3d578b7-e284-42b1-aa22-e4a83de19f2d",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Access to vertex AI Notebook or Google Colab\n",
"* Python\n",
"* Vertex AI Notebook\n",
"* GCS Folder Path\n",
"* Document OCR Processor"
]
},
{
"cell_type": "markdown",
"id": "40b217d0-dd1f-4c8a-9f86-2d62b1571d2f",
"metadata": {},
"source": [
"## Step by Step procedure "
]
},
{
"cell_type": "markdown",
"id": "19cc2724-2b2a-4ba5-90f4-c6f660d5ff4e",
"metadata": {},
"source": [
"### 1.Importing Required Modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d32f550e-62c5-4744-9586-c9ccdb83871f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a601674d-2625-48ec-b0f2-c0b8c8ae3008",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import math\n",
"from collections import Counter\n",
"from typing import Dict, List, Sequence, Tuple, Optional, Union\n",
"\n",
"from google.api_core.client_options import ClientOptions\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"from google.cloud import storage\n",
"from utilities import file_names, store_document_as_json"
]
},
{
"cell_type": "markdown",
"id": "6a52ba38-da1d-4e32-b536-fdd763ff2b3f",
"metadata": {},
"source": [
"### 2.Setup the inputs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0b345d1-3c93-4e69-b44f-43af9091dbd7",
"metadata": {},
"outputs": [],
"source": [
"project_id = \"xx-xx-xx\"\n",
"location = \"us\" # Format is \"us\" or \"eu\"\n",
"processor_id = \"xx-xx-xx\"\n",
"processor_version = \"pretrained-ocr-v2.0-2023-06-02\"\n",
"mime_type = \"application/pdf\"\n",
"input_gcs_path = \"gs://BUCKET_NAME/headers_layout_detection/samples/\"\n",
"output_gcs_path = \"gs://BUCKET_NAME/headers_layout_detection/output/\"\n",
"# to (h) or (h, sh) or (h,sh,ssh) <- section_count\n",
"sections_count = 2\n",
"# (0 - 1 range) 0.1 -> leassthan 10% of bold fonts converage is discarded\n",
"# if you are not willing discard any text in bold then also provide zero\n",
"bold_text_threshold = 0\n",
"# 0-1 range, 0.1 -> 10% of page bottom to exclude for token,\n",
"# i.e, it covers (0 - 0.90 normalized y coord)\n",
"footer_offset = 0.1\n",
"# 0-1 range, 0.1 -> 10% of page bottom to exclude for token\n",
"# i.e, it covers (0.10 - 1.0 normalized y coord)\n",
"header_offset = 0.07"
]
},
{
"cell_type": "markdown",
"id": "dafd90e3-ee0c-4a1f-b251-2d90b7416152",
"metadata": {},
"source": [
"`processor_id`: DocumentAI OCR Processor Id\n",
"\n",
"`location`: Processor Location\n",
"\n",
"`processor_version`: OCR processor version of V2.X\n",
"\n",
"`input_gcs_path`: GCS folder path containing input samples\n",
"\n",
"`output_gcs_path`: GCS folder path containing to store post-processed results\n",
"\n",
"`mime_type`: mime type of input files\n",
"\n",
"`sections_count` : Number of header/section levels required (1/2/3)\n",
" * 1/2/3 -> (h) or (h, sh) or (h,sh,ssh)\n",
" \n",
"`bold_text_threshold` : Threshold value to discard fonts which falls less than this value\n",
" * [0 - 1] -> 0.1 means less than 10% covered bold fonts are discarded \n",
" \n",
"`footer_offset` : Threshold value(normalized y-coord) to discard tokens which falls in this range i.e, y coord [footer_offest, 1]\n",
" * [0-1] -> 0.1 means 10% of page bottom to exclude for tokens\n",
" \n",
"`header_offset` : Threshold value(normalized y-coord) to discard tokens which falls in this range, i.e, y coord [0, header_offset]\n",
" * [0-1] -> 0.1 means 10% of page header to exclude for tokens\n"
]
},
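{
"cell_type": "markdown",
"id": "offsets-demo-md",
"metadata": {},
"source": [
"The cell below is a minimal numeric illustration (added for clarity, with hypothetical values) of how `header_offset` and `footer_offset` define the band of normalized y-coordinates within which tokens are kept."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "offsets-demo-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical offsets: the top 7% and the bottom 10% of the page are excluded\n",
"demo_header_offset = 0.07\n",
"demo_footer_offset = 0.1\n",
"print(f\"Tokens are kept only inside y = ({demo_header_offset}, {1 - demo_footer_offset})\")\n",
"\n",
"# A token is discarded if any of its normalized y-vertices fall outside that band\n",
"sample_token_y = [0.93, 0.96]  # hypothetical y-vertices of a footer token\n",
"discard = max(sample_token_y) > 1 - demo_footer_offset or min(sample_token_y) < demo_header_offset\n",
"print(\"discard:\", discard)"
]
},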
{
"cell_type": "markdown",
"id": "9c1f4963-da7f-4c7b-8f6d-373e15486c59",
"metadata": {},
"source": [
"### 3.Run the required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06dfee6b-231f-4aa3-bb42-34da7d2d2f77",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def process_document_ocr_sample(\n",
" project_id: str,\n",
" location: str,\n",
" processor_id: str,\n",
" processor_version: str,\n",
" pdf_bytes: bytes,\n",
" mime_type: str,\n",
") -> documentai.Document:\n",
" \"\"\"\n",
" Processes a PDF document using the Document AI OCR processor.\n",
"\n",
" Args:\n",
" project_id (str): The Google Cloud project ID.\n",
" location (str): The location where the Document AI processor is hosted.\n",
" processor_id (str): The ID of the Document AI processor to use.\n",
" processor_version (str): The version of the processor to use (e.g., \"pretrained-ocr-v1.1\").\n",
" pdf_bytes (bytes): The PDF file content as a byte stream.\n",
" mime_type (str): The MIME type of the file (e.g., 'application/pdf').\n",
"\n",
" Returns:\n",
" documentai.Document: A Document object containing the OCR-processed results,\n",
" including text and layout information from the document.\n",
" \"\"\"\n",
"\n",
" client_opts = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\n",
" client = documentai.DocumentProcessorServiceClient(client_options=client_opts)\n",
" name = client.processor_version_path(\n",
" project_id, location, processor_id, processor_version\n",
" )\n",
" raw_doc = documentai.RawDocument(content=pdf_bytes, mime_type=mime_type)\n",
" premium_features = documentai.OcrConfig.PremiumFeatures(compute_style_info=True)\n",
" ocr_config = documentai.OcrConfig(\n",
" enable_native_pdf_parsing=False, premium_features=premium_features\n",
" )\n",
" process_options = documentai.ProcessOptions(ocr_config=ocr_config)\n",
" request = documentai.ProcessRequest(\n",
" name=name,\n",
" raw_document=raw_doc,\n",
" # Only supported for Document OCR processor\n",
" process_options=process_options,\n",
" )\n",
" result = client.process_document(request=request)\n",
" return result.document\n",
"\n",
"\n",
"def get_headers(\n",
" tokens: Sequence[documentai.Document.Page.Token],\n",
" fonts_list: List[int],\n",
" is_bold: bool = True,\n",
" footer_threshold: float = 0.05,\n",
" header_threshold: float = 0.05,\n",
"):\n",
" \"\"\"\n",
" Filters and retrieves header tokens from a sequence of document tokens based on font size, boldness,\n",
" and threshold settings for headers and footers.\n",
"\n",
" Args:\n",
" tokens (Sequence[documentai.Document.Page.Token]): A sequence of token objects from the document.\n",
" fonts_list (List[int]): A list of font sizes that are considered for header detection.\n",
" is_bold (bool): A flag to filter tokens based on whether they are bold or not. Defaults to True (bold).\n",
" footer_threshold (float): The vertical position threshold for discarding tokens as footers. Defaults to 0.05.\n",
" header_threshold (float): The vertical position threshold for discarding tokens as headers. Defaults to 0.05.\n",
"\n",
" Returns:\n",
" List[documentai.Document.Page.Token]: A list of tokens that qualify as headers based on the specified conditions.\n",
" \"\"\"\n",
" headers = []\n",
" for token in tokens:\n",
" token: documentai.Document.Page.Token = token\n",
" if is_token_need_to_discard(token, footer_threshold, header_threshold):\n",
" continue\n",
" font_size = token.style_info.font_size\n",
" bold = token.style_info.bold\n",
" if is_bold:\n",
" if font_size in fonts_list and bold:\n",
" headers.append(token)\n",
" else:\n",
" if font_size in fonts_list and not bold:\n",
" headers.append(token)\n",
" return headers\n",
"\n",
"\n",
"def get_groups(\n",
" headers: List[documentai.Document.Page.Token],\n",
") -> List[List[documentai.Document.Page.Token]]:\n",
" \"\"\"\n",
" Groups tokens (headers) based on their text anchor indices.\n",
"\n",
" Args:\n",
" headers (List[documentai.Document.Page.Token]): List of token headers.\n",
"\n",
" Returns:\n",
" List[List[documentai.Document.Page.Token]]: A list of grouped token headers.\n",
" \"\"\"\n",
"\n",
" group = []\n",
" groups = []\n",
" for idx, header in enumerate(headers[:-1]):\n",
" curr_ts = header.layout.text_anchor.text_segments[0]\n",
" next_ts = headers[idx + 1].layout.text_anchor.text_segments[0]\n",
" curr = curr_ts.start_index, curr_ts.end_index\n",
" nxt = next_ts.start_index, next_ts.end_index\n",
" if len(set([*curr, *nxt])) == 3:\n",
" group.append(header)\n",
" if len(set([*curr, *nxt])) == 4:\n",
" group.append(header)\n",
" groups.append(group)\n",
" group = []\n",
" else:\n",
" group.append(headers[-1])\n",
" groups.append(group)\n",
" return groups\n",
"\n",
"\n",
"def get_new_entity_attrs(\n",
" tokens: List[documentai.Document.Page.Token],\n",
") -> Tuple[List[int], List[int], List[float], List[float], List[float]]:\n",
" \"\"\"\n",
" Extracts attributes from a list of tokens, including start and end indices, x and y coordinates, and confidences.\n",
"\n",
" Args:\n",
" tokens (List[Document.Page.Token]): List of tokens from which to extract attributes.\n",
"\n",
" Returns:\n",
" Tuple[List[int], List[int], List[float], List[float], List[float]]:\n",
" - List of start indices (si) for the tokens.\n",
" - List of end indices (ei) for the tokens.\n",
" - List of x coordinates for the token bounding polygons.\n",
" - List of y coordinates for the token bounding polygons.\n",
" - List of confidence scores for each token's layout.\n",
" \"\"\"\n",
" si = []\n",
" ei = []\n",
" x, y = [], []\n",
" confidences = []\n",
" for token in tokens:\n",
" ts = token.layout.text_anchor.text_segments[0]\n",
" si.append(ts.start_index)\n",
" ei.append(ts.end_index)\n",
" token_x, token_y = get_x_y_list(token.layout.bounding_poly)\n",
" x.extend(token_x)\n",
" y.extend(token_y)\n",
" confidences.append(token.layout.confidence)\n",
" return si, ei, x, y, confidences\n",
"\n",
"\n",
"def get_x_y_list(\n",
" bounding_poly: documentai.BoundingPoly,\n",
") -> Tuple[List[float], List[float]]:\n",
" \"\"\"\n",
" Extracts the x and y coordinates from a BoundingPoly's normalized vertices.\n",
"\n",
" Args:\n",
" bounding_poly (BoundingPoly): A bounding polygon object that contains the normalized vertices.\n",
"\n",
" Returns:\n",
" Tuple[List[float], List[float]]:\n",
" - A list of x coordinates.\n",
" - A list of y coordinates.\n",
" \"\"\"\n",
" x, y = [], []\n",
" for nvs in bounding_poly.normalized_vertices:\n",
" x.append(nvs.x)\n",
" y.append(nvs.y)\n",
" return x, y\n",
"\n",
"\n",
"def get_normalized_vertices(\n",
" x: List[float], y: List[float]\n",
") -> List[documentai.NormalizedVertex]:\n",
" \"\"\"\n",
" Creates a list of normalized vertices (x, y coordinates) based on the bounding box\n",
" formed by the minimum and maximum values from the input x and y coordinates.\n",
"\n",
" Args:\n",
" x (List[float]): A list of x coordinates.\n",
" y (List[float]): A list of y coordinates.\n",
"\n",
" Returns:\n",
" List[NormalizedVertex]: A list of four `NormalizedVertex` objects representing\n",
" the corners of the bounding box defined by the input x and y coordinates.\n",
" \"\"\"\n",
" nvs = []\n",
" xy = [[min(x), min(y)], [max(x), min(y)], [max(x), max(y)], [min(x), max(y)]]\n",
" for _x, _y in xy:\n",
" nv = documentai.NormalizedVertex(x=_x, y=_y)\n",
" nvs.append(nv)\n",
" return nvs\n",
"\n",
"\n",
"def create_headers_ents(\n",
" ent_type: str,\n",
" text: str,\n",
" page_num: int,\n",
" groups: List[List[documentai.Document.Page.Token]],\n",
") -> List[documentai.Document.Entity]:\n",
" \"\"\"\n",
" Creates entities for header sections from token groups.\n",
"\n",
" Args:\n",
" ent_type (str): The type of entity being created.\n",
" text (str): The full text of the document.\n",
" page_num (int): The page number where the headers are found.\n",
" groups (List[List[Document.Page.Token]]): A list of groups of tokens representing headers.\n",
"\n",
" Returns:\n",
" List[Document.Entity]: A list of Document.Entity objects created from the token groups.\n",
" \"\"\"\n",
" entities = []\n",
" for tokens in groups:\n",
" si, ei, x, y, confidences = get_new_entity_attrs(tokens)\n",
" _confidence = sum(confidences) / len(confidences)\n",
" _type = ent_type\n",
" _ts = documentai.Document.TextAnchor.TextSegment(\n",
" start_index=min(si), end_index=max(ei)\n",
" )\n",
" _text_anchor = documentai.Document.TextAnchor(text_segments=[_ts])\n",
" _mention_text = text[_ts.start_index : _ts.end_index]\n",
" _bounding_poly = documentai.BoundingPoly()\n",
" _bounding_poly.normalized_vertices = get_normalized_vertices(x, y)\n",
" _page_ref = documentai.Document.PageAnchor.PageRef(\n",
" page=page_num, bounding_poly=_bounding_poly\n",
" )\n",
" _page_anchor = documentai.Document.PageAnchor(page_refs=[_page_ref])\n",
" entity = documentai.Document.Entity(\n",
" confidence=_confidence,\n",
" type_=_type,\n",
" mention_text=_mention_text,\n",
" text_anchor=_text_anchor,\n",
" page_anchor=_page_anchor,\n",
" )\n",
" entities.append(entity)\n",
" return entities\n",
"\n",
"\n",
"def is_token_need_to_discard(\n",
" token: documentai.Document.Page.Token,\n",
" footer_threshold: float = 0.05,\n",
" header_threshold: float = 0.05,\n",
") -> bool:\n",
" \"\"\"\n",
" Determines whether a token should be discarded based on its position\n",
" relative to header and footer thresholds.\n",
"\n",
" Args:\n",
" token (Document.Page.Token): The token to evaluate.\n",
" footer_threshold (float): The threshold for the footer position (0.0 to 1.0).\n",
" header_threshold (float): The threshold for the header position (0.0 to 1.0).\n",
"\n",
" Returns:\n",
" bool: True if the token should be discarded, False otherwise.\n",
" \"\"\"\n",
" footer_threshold = 1 - footer_threshold\n",
" bp = token.layout.bounding_poly\n",
" _, y = get_x_y_list(bp)\n",
" y_max = max(y, default=1)\n",
" y_min = min(y, default=0)\n",
" # to discard footer-section\n",
" if y_max > footer_threshold:\n",
" return True\n",
" # to discard top header-section\n",
" if y_min < header_threshold:\n",
" return True\n",
" return False\n",
"\n",
"\n",
"# to get document wise fonts distribution\n",
"def get_fonts_distribution(\n",
" doc: documentai.Document,\n",
" is_bold: bool = True,\n",
" footer_threshold: float = 0.05,\n",
" header_threshold: float = 0.05,\n",
") -> dict:\n",
" \"\"\"\n",
" Analyzes the font distribution in a Document object.\n",
"\n",
" Args:\n",
" doc (Document): The Document object containing pages and tokens.\n",
" is_bold (bool): Flag to include only bold fonts if True.\n",
" footer_threshold (float): The threshold for footer position (0.0 to 1.0).\n",
" header_threshold (float): The threshold for header position (0.0 to 1.0).\n",
"\n",
" Returns:\n",
" dict: A dictionary containing font sizes as keys and their frequency\n",
" distribution as values, sorted in descending order of frequency.\n",
" \"\"\"\n",
" fonts = []\n",
" for page in doc.pages:\n",
" for token in page.tokens:\n",
" if is_token_need_to_discard(token, footer_threshold, header_threshold):\n",
" continue\n",
" if is_bold:\n",
" if token.style_info and token.style_info.bold:\n",
" fonts.append(token.style_info.font_size)\n",
" else:\n",
" fonts.append(token.style_info.font_size)\n",
" font_freqs = Counter(fonts)\n",
" print(f\"\\tbold = {is_bold} tokens count - {len(fonts)}\")\n",
" count = sum(font_freqs.values())\n",
" fonts_freq_dist = {k: v / count for k, v in font_freqs.items()}\n",
" fonts_freq_dist = sorted(\n",
" fonts_freq_dist.items(), key=lambda item: item[1], reverse=True\n",
" )\n",
" fonts_freq_dist = dict(fonts_freq_dist)\n",
" return fonts_freq_dist\n",
"\n",
"\n",
"def bag_fonts_desc(\n",
" fonts_dist: Dict[str, float], no_of_groups: int, threshold: float = 0.1\n",
") -> List[List[str]]:\n",
" \"\"\"\n",
" Groups font sizes based on their frequency distribution.\n",
"\n",
" Args:\n",
" fonts_dist (Dict[str, float]): A dictionary containing font sizes as keys\n",
" and their frequency distribution as values.\n",
" no_of_groups (int): The number of groups to create.\n",
" threshold (float): The minimum frequency a font size must have to be included.\n",
" Defaults to 0.1.\n",
"\n",
" Returns:\n",
" List[List[str]]: A list of groups, where each group is a list of font sizes.\n",
" \"\"\"\n",
" fonts_dist = {k: v for k, v in fonts_dist.items() if v > threshold}\n",
" fonts = list(fonts_dist.keys())\n",
" fonts_cnt = len(fonts)\n",
" fonts_per_group = math.ceil(fonts_cnt / no_of_groups)\n",
" groups = []\n",
" for idx in range(no_of_groups):\n",
" start = idx * fonts_per_group\n",
" # end = start + fonts_per_group\n",
" end = (idx + 1) * fonts_per_group\n",
" group = fonts[start:end]\n",
" groups.append(group)\n",
" # h, sh, ssh\n",
" return groups[::-1]\n",
"\n",
"\n",
"def adjust_empty_header_font_tags(\n",
" fonts_list: List[Union[List[str], None]]\n",
") -> List[List[str]]:\n",
" \"\"\"\n",
" Adjusts the font tags for headers based on the number of sections present.\n",
"\n",
" Args:\n",
" fonts_list (List[Union[List[str], None]]): A list containing font sections\n",
" for headers, where each section\n",
" is a list of font sizes or None.\n",
"\n",
" Returns:\n",
" List[List[str]]: A list of three sections, each a list of font sizes,\n",
" ensuring that empty sections are added as needed.\n",
" \"\"\"\n",
" counter = 0\n",
" updated_list = []\n",
" for section in fonts_list:\n",
" if section:\n",
" counter += 1\n",
" updated_list.append(section)\n",
" if counter == 3:\n",
" return updated_list\n",
" elif counter == 2:\n",
" updated_list.append([])\n",
" return updated_list\n",
" elif counter == 1:\n",
" updated_list.extend([[], []])\n",
" return updated_list\n",
" elif not counter:\n",
" # to handle if page has no tokens\n",
" return [[], [], []]\n",
" return fonts_list\n",
"\n",
"\n",
"def tag_text_with_section_type(doc: documentai.Document) -> documentai.Document:\n",
" \"\"\"\n",
" Tags entities in a Document with their respective section types.\n",
"\n",
" Args:\n",
" doc (documentai.Document): The Document object containing entities to be tagged.\n",
"\n",
" Returns:\n",
" documentai.Document: The updated Document object with tagged entities.\n",
" \"\"\"\n",
"\n",
" updated_entities = []\n",
"\n",
" def ent_sort_key(ent: documentai.Document.Entity) -> tuple:\n",
" \"\"\"\n",
" Sort key for entities based on page number and y-coordinate.\n",
"\n",
" Args:\n",
" ent (documentai.Document.Entity): The entity to extract sorting keys from.\n",
"\n",
" Returns:\n",
" tuple: A tuple containing the page number and the y-coordinate.\n",
" \"\"\"\n",
" pg_no = ent.page_anchor.page_refs[0].page\n",
" y_coord = ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices[0].y\n",
" return pg_no, y_coord\n",
"\n",
" pre_ent_type = \"\"\n",
" old_entities = sorted(doc.entities, key=ent_sort_key)\n",
" for ent in old_entities:\n",
" if ent.type_ == \"text\" and pre_ent_type not in (\"\", \"text\"):\n",
" ent.type_ = f\"{pre_ent_type.rstrip('_text')}_text\"\n",
" pre_ent_type = ent.type_\n",
" updated_entities.append(ent)\n",
" doc.entities = updated_entities\n",
" return doc"
]
},
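{
"cell_type": "markdown",
"id": "get-groups-sanity-md",
"metadata": {},
"source": [
"The optional cell below (a sanity check added for illustration, not part of the original flow) builds three synthetic tokens carrying only text-anchor segments and verifies that `get_groups` merges the two adjoining tokens into one group while keeping the detached token separate."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "get-groups-sanity-code",
"metadata": {},
"outputs": [],
"source": [
"def make_demo_token(start: int, end: int) -> documentai.Document.Page.Token:\n",
"    \"\"\"Builds a minimal synthetic token with only a text-anchor segment (demo helper).\"\"\"\n",
"    segment = documentai.Document.TextAnchor.TextSegment(start_index=start, end_index=end)\n",
"    anchor = documentai.Document.TextAnchor(text_segments=[segment])\n",
"    layout = documentai.Document.Page.Layout(text_anchor=anchor)\n",
"    return documentai.Document.Page.Token(layout=layout)\n",
"\n",
"\n",
"# Tokens (0-5) and (5-10) share an index -> same group; token (12-20) starts a new group\n",
"demo_tokens = [make_demo_token(0, 5), make_demo_token(5, 10), make_demo_token(12, 20)]\n",
"demo_groups = get_groups(demo_tokens)\n",
"print([len(group) for group in demo_groups])  # expected output: [2, 1]"
]
},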
{
"cell_type": "markdown",
"id": "b22ed8ac-2e39-4266-8953-e1ca180e2bd4",
"metadata": {},
"source": [
"### 4.Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "485296ce-cf37-4017-a3ce-aee7405e035e",
"metadata": {},
"outputs": [],
"source": [
"file_list, files_dict = file_names(input_gcs_path)\n",
"input_bucket = input_gcs_path.split(\"/\")[2]\n",
"output_splits = output_gcs_path.split(\"/\")\n",
"output_bucket = output_splits[2]\n",
"output_dir = \"/\".join(output_splits[3:]).strip(\"/\")\n",
"storage_client = storage.Client()\n",
"bucket = storage_client.get_bucket(input_bucket)\n",
"for fn, fp in files_dict.items():\n",
" print(f\"File: {fn}\")\n",
" pdf_bytes = bucket.blob(fp).download_as_bytes()\n",
" try:\n",
" res = process_document_ocr_sample(\n",
" project_id, location, processor_id, processor_version, pdf_bytes, mime_type\n",
" )\n",
" except Exception as e:\n",
" print(f\"Unable to parse document due to {type(e)}, {str(e)}\")\n",
" continue\n",
" text = res.text\n",
" bold_dist = get_fonts_distribution(res, True, footer_offset, header_offset)\n",
" plain_dist = get_fonts_distribution(res, False, footer_offset, header_offset)\n",
" # 3 -> to group headers into 3 types i.e, headers, sub_headers, sub_sub_headers\n",
" bagged_fonts = bag_fonts_desc(\n",
" bold_dist, sections_count, threshold=bold_text_threshold\n",
" )\n",
" # print(f\"\\th, sh, ssh = {bagged_fonts}\")\n",
" # to move all empty fonts lists to end\n",
" h_fonts, sh_fonts, ssh_fonts = bag = adjust_empty_header_font_tags(bagged_fonts)\n",
" print(f\"\\th, sh, ssh = {bag}\")\n",
" text_fonts = list(plain_dist.keys())\n",
" print(\"\\tentities\", len(res.entities), end=\" \")\n",
" for page in res.pages:\n",
" page_num = page.page_number\n",
" ent_page_num = page_num - 1\n",
" headers = get_headers(page.tokens, h_fonts, True, footer_offset, header_offset)\n",
" sub_headers = get_headers(\n",
" page.tokens, sh_fonts, True, footer_offset, header_offset\n",
" )\n",
" sub_sub_headers = get_headers(\n",
" page.tokens, ssh_fonts, True, footer_offset, header_offset\n",
" )\n",
" text_headers = get_headers(\n",
" page.tokens, text_fonts, False, footer_offset, header_offset\n",
" )\n",
" if headers:\n",
" header_groups = get_groups(headers)\n",
" header_ents = create_headers_ents(\n",
" \"header\", text, ent_page_num, header_groups\n",
" )\n",
" res.entities.extend(header_ents)\n",
" if sub_headers:\n",
" sub_header_groups = get_groups(sub_headers)\n",
" sub_header_ents = create_headers_ents(\n",
" \"sub_header\", text, ent_page_num, sub_header_groups\n",
" )\n",
" res.entities.extend(sub_header_ents)\n",
" if sub_sub_headers:\n",
" sub_sub_header_groups = get_groups(sub_sub_headers)\n",
" sub_sub_header_ents = create_headers_ents(\n",
" \"sub_sub_header\", text, ent_page_num, sub_sub_header_groups\n",
" )\n",
" res.entities.extend(sub_sub_header_ents)\n",
" if text_headers:\n",
" text_header_groups = get_groups(text_headers)\n",
" sub_sub_header_ents = create_headers_ents(\n",
" \"text\", text, ent_page_num, text_header_groups\n",
" )\n",
" res.entities.extend(sub_sub_header_ents)\n",
" print(\"\\tentities\", len(res.entities))\n",
" res = tag_text_with_section_type(res)\n",
" json_str = documentai.Document.to_json(res, including_default_value_fields=False)\n",
" out_fn = fn.rsplit(\".\", 1)[0]\n",
" out_fp = f\"{output_dir}/{out_fn}.json\"\n",
" print(f\"\\tWriting data to gs://{output_bucket}/{out_fp}\")\n",
" store_document_as_json(json_str, output_bucket, out_fp)\n",
"print(\"Process Completed!!!\")"
]
},
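{
"cell_type": "markdown",
"id": "verify-output-md",
"metadata": {},
"source": [
"The optional cell below (an added convenience, not part of the original flow) reads one of the JSON files written above back from GCS and prints the newly created section entities. It assumes the `storage_client`, `output_bucket`, and `output_dir` variables from the previous cell are still in scope."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "verify-output-code",
"metadata": {},
"outputs": [],
"source": [
"# Read one post-processed document back and list its section entities\n",
"out_bucket = storage_client.get_bucket(output_bucket)\n",
"json_blobs = [b for b in out_bucket.list_blobs(prefix=output_dir) if b.name.endswith(\".json\")]\n",
"if json_blobs:\n",
"    sample_doc = documentai.Document.from_json(\n",
"        json_blobs[0].download_as_text(), ignore_unknown_fields=True\n",
"    )\n",
"    for entity in sample_doc.entities:\n",
"        if entity.type_ in (\"header\", \"sub_header\", \"sub_sub_header\") or entity.type_.endswith(\"_text\"):\n",
"            print(entity.type_, \"->\", entity.mention_text[:60])"
]
},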
{
"cell_type": "markdown",
"id": "d84fa2e6-1a60-49fd-8a75-7feba0e7da04",
"metadata": {},
"source": [
"### 5.Output\n",
"\n",
"The provided code identifies and classifies text elements as label headers, sub-headers, or sub-sub-headers within a document. It does this by analyzing the font size of the text and comparing it to predefined ranges for each header type. As a result, new entities are created that correspond to each identified header type.\n",
"\n",
"<img src=\"./Images/tool_ouput.png\"></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m125",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}