incubator-tools/Label_Section_Headers/Label_Section_Headers.ipynb (738 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "21c9480d-5b4e-4119-b13c-2ee5aee673ed",
"metadata": {},
"source": [
"# Label Section Headers"
]
},
{
"cell_type": "markdown",
"id": "63f388a5-6ad4-430d-86e0-15c3834d4d99",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "2ffbe40b-977d-4fd6-b72a-d3f3e80aaef1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "d284197f-200c-4cc8-92e7-3ee0fd6a5c07",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This tool is helpful to label headers, sub-headers and sub-sub-headers based on the provided range of font-sizes for each header-type. It creates new entities for each header type.\n",
"Added few parameters so that User/Customer can able to tune this tool as per their layout \n",
"headers - text only with bold is considered for headers \n",
"[section]_text - text only with non-bold/plain is considered for headers\n",
"\n",
"based on sections_count 1 or 2 or 3 \n",
" * it will select entity labels as (h) or (h, sh) or (h,sh,ssh) \n",
"\n",
"based bold_text_threshold range - [0,1] \n",
" * it will discard font_sizes which falls under mentioned threshold, then it will group all fonts which are above threshold creates groups based on `section_count`\n",
"\n",
"\n",
"footer_threshold & header_threshold - [0-1] i.e, normalized y-vertices point \n",
"* most data in footer and header sections are in bold-text, if we utilize these thresholds to discard all tokens which fall in these regions, we improve bold fonts distribution. \n",
"NOTE: if user utilizing footer_threshold & header_threshold, it is recommended to keep bold_text_threshold to zero, so that all tokens are used to create entities"
]
},
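{
"cell_type": "markdown",
"id": "demo-font-bagging-md",
"metadata": {},
"source": [
"The cell below is a small illustrative sketch (added for clarity, not part of the tool itself) of how a bold-font frequency distribution is thresholded with `bold_text_threshold` and split into `sections_count` buckets; the values and variable names are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "demo-font-bagging-code",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"# Hypothetical bold-font distribution: font size -> fraction of bold tokens,\n",
"# already sorted by frequency in descending order (as the tool produces it)\n",
"demo_bold_font_dist = {11: 0.40, 12: 0.30, 14: 0.22, 18: 0.08}\n",
"demo_sections_count = 3\n",
"demo_bold_text_threshold = 0.1  # font sizes covering less than 10% of bold tokens are dropped\n",
"\n",
"kept = [size for size, freq in demo_bold_font_dist.items() if freq > demo_bold_text_threshold]\n",
"per_group = math.ceil(len(kept) / demo_sections_count)\n",
"buckets = [kept[i * per_group : (i + 1) * per_group] for i in range(demo_sections_count)]\n",
"# Reversing the buckets assigns the least frequent (typically header-like) bold fonts to \"h\"\n",
"h_demo, sh_demo, ssh_demo = buckets[::-1]\n",
"print(\"h:\", h_demo, \"sh:\", sh_demo, \"ssh:\", ssh_demo)"
]
},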
{
"cell_type": "markdown",
"id": "e3d578b7-e284-42b1-aa22-e4a83de19f2d",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Access to vertex AI Notebook or Google Colab\n",
"* Python\n",
"* Vertex AI Notebook\n",
"* GCS Folder Path\n",
"* Document OCR Processor"
]
},
{
"cell_type": "markdown",
"id": "40b217d0-dd1f-4c8a-9f86-2d62b1571d2f",
"metadata": {},
"source": [
"## Step by Step procedure "
]
},
{
"cell_type": "markdown",
"id": "19cc2724-2b2a-4ba5-90f4-c6f660d5ff4e",
"metadata": {},
"source": [
"### 1.Importing Required Modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d32f550e-62c5-4744-9586-c9ccdb83871f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a601674d-2625-48ec-b0f2-c0b8c8ae3008",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import math\n",
"from collections import Counter\n",
"from typing import Dict, List, Sequence, Tuple, Optional, Union\n",
"\n",
"from google.api_core.client_options import ClientOptions\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"from google.cloud import storage\n",
"from utilities import file_names, store_document_as_json"
]
},
{
"cell_type": "markdown",
"id": "6a52ba38-da1d-4e32-b536-fdd763ff2b3f",
"metadata": {},
"source": [
"### 2.Setup the inputs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0b345d1-3c93-4e69-b44f-43af9091dbd7",
"metadata": {},
"outputs": [],
"source": [
"project_id = \"xx-xx-xx\"\n",
"location = \"us\" # Format is \"us\" or \"eu\"\n",
"processor_id = \"xx-xx-xx\"\n",
"processor_version = \"pretrained-ocr-v2.0-2023-06-02\"\n",
"mime_type = \"application/pdf\"\n",
"input_gcs_path = \"gs://BUCKET_NAME/headers_layout_detection/samples/\"\n",
"output_gcs_path = \"gs://BUCKET_NAME/headers_layout_detection/output/\"\n",
"# to (h) or (h, sh) or (h,sh,ssh) <- section_count\n",
"sections_count = 2\n",
"# (0 - 1 range) 0.1 -> leassthan 10% of bold fonts converage is discarded\n",
"# if you are not willing discard any text in bold then also provide zero\n",
"bold_text_threshold = 0\n",
"# 0-1 range, 0.1 -> 10% of page bottom to exclude for token,\n",
"# i.e, it covers (0 - 0.90 normalized y coord)\n",
"footer_offset = 0.1\n",
"# 0-1 range, 0.1 -> 10% of page bottom to exclude for token\n",
"# i.e, it covers (0.10 - 1.0 normalized y coord)\n",
"header_offset = 0.07"
]
},
{
"cell_type": "markdown",
"id": "dafd90e3-ee0c-4a1f-b251-2d90b7416152",
"metadata": {},
"source": [
"`processor_id`: DocumentAI OCR Processor Id\n",
"\n",
"`location`: Processor Location\n",
"\n",
"`processor_version`: OCR processor version of V2.X\n",
"\n",
"`input_gcs_path`: GCS folder path containing input samples\n",
"\n",
"`output_gcs_path`: GCS folder path containing to store post-processed results\n",
"\n",
"`mime_type`: mime type of input files\n",
"\n",
"`sections_count` : Number of header/section levels required (1/2/3)\n",
" * 1/2/3 -> (h) or (h, sh) or (h,sh,ssh)\n",
" \n",
"`bold_text_threshold` : Threshold value to discard fonts which falls less than this value\n",
" * [0 - 1] -> 0.1 means less than 10% covered bold fonts are discarded \n",
" \n",
"`footer_offset` : Threshold value(normalized y-coord) to discard tokens which falls in this range i.e, y coord [footer_offest, 1]\n",
" * [0-1] -> 0.1 means 10% of page bottom to exclude for tokens\n",
" \n",
"`header_offset` : Threshold value(normalized y-coord) to discard tokens which falls in this range, i.e, y coord [0, header_offset]\n",
" * [0-1] -> 0.1 means 10% of page header to exclude for tokens\n"
]
},
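{
"cell_type": "markdown",
"id": "offsets-demo-md",
"metadata": {},
"source": [
"The cell below is a minimal numeric illustration (added for clarity, with hypothetical values) of how `header_offset` and `footer_offset` define the band of normalized y-coordinates within which tokens are kept."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "offsets-demo-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical offsets: the top 7% and the bottom 10% of the page are excluded\n",
"demo_header_offset = 0.07\n",
"demo_footer_offset = 0.1\n",
"print(f\"Tokens are kept only inside y = ({demo_header_offset}, {1 - demo_footer_offset})\")\n",
"\n",
"# A token is discarded if any of its normalized y-vertices fall outside that band\n",
"sample_token_y = [0.93, 0.96]  # hypothetical y-vertices of a footer token\n",
"discard = max(sample_token_y) > 1 - demo_footer_offset or min(sample_token_y) < demo_header_offset\n",
"print(\"discard:\", discard)"
]
},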
{
"cell_type": "markdown",
"id": "9c1f4963-da7f-4c7b-8f6d-373e15486c59",
"metadata": {},
"source": [
"### 3.Run the required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06dfee6b-231f-4aa3-bb42-34da7d2d2f77",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def process_document_ocr_sample(\n",
" project_id: str,\n",
" location: str,\n",
" processor_id: str,\n",
" processor_version: str,\n",
" pdf_bytes: bytes,\n",
" mime_type: str,\n",
") -> documentai.Document:\n",
" \"\"\"\n",
" Processes a PDF document using the Document AI OCR processor.\n",
"\n",
" Args:\n",
" project_id (str): The Google Cloud project ID.\n",
" location (str): The location where the Document AI processor is hosted.\n",
" processor_id (str): The ID of the Document AI processor to use.\n",
" processor_version (str): The version of the processor to use (e.g., \"pretrained-ocr-v1.1\").\n",
" pdf_bytes (bytes): The PDF file content as a byte stream.\n",
" mime_type (str): The MIME type of the file (e.g., 'application/pdf').\n",
"\n",
" Returns:\n",
" documentai.Document: A Document object containing the OCR-processed results,\n",
" including text and layout information from the document.\n",
" \"\"\"\n",
"\n",
" client_opts = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\n",
" client = documentai.DocumentProcessorServiceClient(client_options=client_opts)\n",
" name = client.processor_version_path(\n",
" project_id, location, processor_id, processor_version\n",
" )\n",
" raw_doc = documentai.RawDocument(content=pdf_bytes, mime_type=mime_type)\n",
" premium_features = documentai.OcrConfig.PremiumFeatures(compute_style_info=True)\n",
" ocr_config = documentai.OcrConfig(\n",
" enable_native_pdf_parsing=False, premium_features=premium_features\n",
" )\n",
" process_options = documentai.ProcessOptions(ocr_config=ocr_config)\n",
" request = documentai.ProcessRequest(\n",
" name=name,\n",
" raw_document=raw_doc,\n",
" # Only supported for Document OCR processor\n",
" process_options=process_options,\n",
" )\n",
" result = client.process_document(request=request)\n",
" return result.document\n",
"\n",
"\n",
"def get_headers(\n",
" tokens: Sequence[documentai.Document.Page.Token],\n",
" fonts_list: List[int],\n",
" is_bold: bool = True,\n",
" footer_threshold: float = 0.05,\n",
" header_threshold: float = 0.05,\n",
"):\n",
" \"\"\"\n",
" Filters and retrieves header tokens from a sequence of document tokens based on font size, boldness,\n",
" and threshold settings for headers and footers.\n",
"\n",
" Args:\n",
" tokens (Sequence[documentai.Document.Page.Token]): A sequence of token objects from the document.\n",
" fonts_list (List[int]): A list of font sizes that are considered for header detection.\n",
" is_bold (bool): A flag to filter tokens based on whether they are bold or not. Defaults to True (bold).\n",
" footer_threshold (float): The vertical position threshold for discarding tokens as footers. Defaults to 0.05.\n",
" header_threshold (float): The vertical position threshold for discarding tokens as headers. Defaults to 0.05.\n",
"\n",
" Returns:\n",
" List[documentai.Document.Page.Token]: A list of tokens that qualify as headers based on the specified conditions.\n",
" \"\"\"\n",
" headers = []\n",
" for token in tokens:\n",
" token: documentai.Document.Page.Token = token\n",
" if is_token_need_to_discard(token, footer_threshold, header_threshold):\n",
" continue\n",
" font_size = token.style_info.font_size\n",
" bold = token.style_info.bold\n",
" if is_bold:\n",
" if font_size in fonts_list and bold:\n",
" headers.append(token)\n",
" else:\n",
" if font_size in fonts_list and not bold:\n",
" headers.append(token)\n",
" return headers\n",
"\n",
"\n",
"def get_groups(\n",
" headers: List[documentai.Document.Page.Token],\n",
") -> List[List[documentai.Document.Page.Token]]:\n",
" \"\"\"\n",
" Groups tokens (headers) based on their text anchor indices.\n",
"\n",
" Args:\n",
" headers (List[documentai.Document.Page.Token]): List of token headers.\n",
"\n",
" Returns:\n",
" List[List[documentai.Document.Page.Token]]: A list of grouped token headers.\n",
" \"\"\"\n",
"\n",
" group = []\n",
" groups = []\n",
" for idx, header in enumerate(headers[:-1]):\n",
" curr_ts = header.layout.text_anchor.text_segments[0]\n",
" next_ts = headers[idx + 1].layout.text_anchor.text_segments[0]\n",
" curr = curr_ts.start_index, curr_ts.end_index\n",
" nxt = next_ts.start_index, next_ts.end_index\n",
" if len(set([*curr, *nxt])) == 3:\n",
" group.append(header)\n",
" if len(set([*curr, *nxt])) == 4:\n",
" group.append(header)\n",
" groups.append(group)\n",
" group = []\n",
" else:\n",
" group.append(headers[-1])\n",
" groups.append(group)\n",
" return groups\n",
"\n",
"\n",
"def get_new_entity_attrs(\n",
" tokens: List[documentai.Document.Page.Token],\n",
") -> Tuple[List[int], List[int], List[float], List[float], List[float]]:\n",
" \"\"\"\n",
" Extracts attributes from a list of tokens, including start and end indices, x and y coordinates, and confidences.\n",
"\n",
" Args:\n",
" tokens (List[Document.Page.Token]): List of tokens from which to extract attributes.\n",
"\n",
" Returns:\n",
" Tuple[List[int], List[int], List[float], List[float], List[float]]:\n",
" - List of start indices (si) for the tokens.\n",
" - List of end indices (ei) for the tokens.\n",
" - List of x coordinates for the token bounding polygons.\n",
" - List of y coordinates for the token bounding polygons.\n",
" - List of confidence scores for each token's layout.\n",
" \"\"\"\n",
" si = []\n",
" ei = []\n",
" x, y = [], []\n",
" confidences = []\n",
" for token in tokens:\n",
" ts = token.layout.text_anchor.text_segments[0]\n",
" si.append(ts.start_index)\n",
" ei.append(ts.end_index)\n",
" token_x, token_y = get_x_y_list(token.layout.bounding_poly)\n",
" x.extend(token_x)\n",
" y.extend(token_y)\n",
" confidences.append(token.layout.confidence)\n",
" return si, ei, x, y, confidences\n",
"\n",
"\n",
"def get_x_y_list(\n",
" bounding_poly: documentai.BoundingPoly,\n",
") -> Tuple[List[float], List[float]]:\n",
" \"\"\"\n",
" Extracts the x and y coordinates from a BoundingPoly's normalized vertices.\n",
"\n",
" Args:\n",
" bounding_poly (BoundingPoly): A bounding polygon object that contains the normalized vertices.\n",
"\n",
" Returns:\n",
" Tuple[List[float], List[float]]:\n",
" - A list of x coordinates.\n",
" - A list of y coordinates.\n",
" \"\"\"\n",
" x, y = [], []\n",
" for nvs in bounding_poly.normalized_vertices:\n",
" x.append(nvs.x)\n",
" y.append(nvs.y)\n",
" return x, y\n",
"\n",
"\n",
"def get_normalized_vertices(\n",
" x: List[float], y: List[float]\n",
") -> List[documentai.NormalizedVertex]:\n",
" \"\"\"\n",
" Creates a list of normalized vertices (x, y coordinates) based on the bounding box\n",
" formed by the minimum and maximum values from the input x and y coordinates.\n",
"\n",
" Args:\n",
" x (List[float]): A list of x coordinates.\n",
" y (List[float]): A list of y coordinates.\n",
"\n",
" Returns:\n",
" List[NormalizedVertex]: A list of four `NormalizedVertex` objects representing\n",
" the corners of the bounding box defined by the input x and y coordinates.\n",
" \"\"\"\n",
" nvs = []\n",
" xy = [[min(x), min(y)], [max(x), min(y)], [max(x), max(y)], [min(x), max(y)]]\n",
" for _x, _y in xy:\n",
" nv = documentai.NormalizedVertex(x=_x, y=_y)\n",
" nvs.append(nv)\n",
" return nvs\n",
"\n",
"\n",
"def create_headers_ents(\n",
" ent_type: str,\n",
" text: str,\n",
" page_num: int,\n",
" groups: List[List[documentai.Document.Page.Token]],\n",
") -> List[documentai.Document.Entity]:\n",
" \"\"\"\n",
" Creates entities for header sections from token groups.\n",
"\n",
" Args:\n",
" ent_type (str): The type of entity being created.\n",
" text (str): The full text of the document.\n",
" page_num (int): The page number where the headers are found.\n",
" groups (List[List[Document.Page.Token]]): A list of groups of tokens representing headers.\n",
"\n",
" Returns:\n",
" List[Document.Entity]: A list of Document.Entity objects created from the token groups.\n",
" \"\"\"\n",
" entities = []\n",
" for tokens in groups:\n",
" si, ei, x, y, confidences = get_new_entity_attrs(tokens)\n",
" _confidence = sum(confidences) / len(confidences)\n",
" _type = ent_type\n",
" _ts = documentai.Document.TextAnchor.TextSegment(\n",
" start_index=min(si), end_index=max(ei)\n",
" )\n",
" _text_anchor = documentai.Document.TextAnchor(text_segments=[_ts])\n",
" _mention_text = text[_ts.start_index : _ts.end_index]\n",
" _bounding_poly = documentai.BoundingPoly()\n",
" _bounding_poly.normalized_vertices = get_normalized_vertices(x, y)\n",
" _page_ref = documentai.Document.PageAnchor.PageRef(\n",
" page=page_num, bounding_poly=_bounding_poly\n",
" )\n",
" _page_anchor = documentai.Document.PageAnchor(page_refs=[_page_ref])\n",
" entity = documentai.Document.Entity(\n",
" confidence=_confidence,\n",
" type_=_type,\n",
" mention_text=_mention_text,\n",
" text_anchor=_text_anchor,\n",
" page_anchor=_page_anchor,\n",
" )\n",
" entities.append(entity)\n",
" return entities\n",
"\n",
"\n",
"def is_token_need_to_discard(\n",
" token: documentai.Document.Page.Token,\n",
" footer_threshold: float = 0.05,\n",
" header_threshold: float = 0.05,\n",
") -> bool:\n",
" \"\"\"\n",
" Determines whether a token should be discarded based on its position\n",
" relative to header and footer thresholds.\n",
"\n",
" Args:\n",
" token (Document.Page.Token): The token to evaluate.\n",
" footer_threshold (float): The threshold for the footer position (0.0 to 1.0).\n",
" header_threshold (float): The threshold for the header position (0.0 to 1.0).\n",
"\n",
" Returns:\n",
" bool: True if the token should be discarded, False otherwise.\n",
" \"\"\"\n",
" footer_threshold = 1 - footer_threshold\n",
" bp = token.layout.bounding_poly\n",
" _, y = get_x_y_list(bp)\n",
" y_max = max(y, default=1)\n",
" y_min = min(y, default=0)\n",
" # to discard footer-section\n",
" if y_max > footer_threshold:\n",
" return True\n",
" # to discard top header-section\n",
" if y_min < header_threshold:\n",
" return True\n",
" return False\n",
"\n",
"\n",
"# to get document wise fonts distribution\n",
"def get_fonts_distribution(\n",
" doc: documentai.Document,\n",
" is_bold: bool = True,\n",
" footer_threshold: float = 0.05,\n",
" header_threshold: float = 0.05,\n",
") -> dict:\n",
" \"\"\"\n",
" Analyzes the font distribution in a Document object.\n",
"\n",
" Args:\n",
" doc (Document): The Document object containing pages and tokens.\n",
" is_bold (bool): Flag to include only bold fonts if True.\n",
" footer_threshold (float): The threshold for footer position (0.0 to 1.0).\n",
" header_threshold (float): The threshold for header position (0.0 to 1.0).\n",
"\n",
" Returns:\n",
" dict: A dictionary containing font sizes as keys and their frequency\n",
" distribution as values, sorted in descending order of frequency.\n",
" \"\"\"\n",
" fonts = []\n",
" for page in doc.pages:\n",
" for token in page.tokens:\n",
" if is_token_need_to_discard(token, footer_threshold, header_threshold):\n",
" continue\n",
" if is_bold:\n",
" if token.style_info and token.style_info.bold:\n",
" fonts.append(token.style_info.font_size)\n",
" else:\n",
" fonts.append(token.style_info.font_size)\n",
" font_freqs = Counter(fonts)\n",
" print(f\"\\tbold = {is_bold} tokens count - {len(fonts)}\")\n",
" count = sum(font_freqs.values())\n",
" fonts_freq_dist = {k: v / count for k, v in font_freqs.items()}\n",
" fonts_freq_dist = sorted(\n",
" fonts_freq_dist.items(), key=lambda item: item[1], reverse=True\n",
" )\n",
" fonts_freq_dist = dict(fonts_freq_dist)\n",
" return fonts_freq_dist\n",
"\n",
"\n",
"def bag_fonts_desc(\n",
" fonts_dist: Dict[str, float], no_of_groups: int, threshold: float = 0.1\n",
") -> List[List[str]]:\n",
" \"\"\"\n",
" Groups font sizes based on their frequency distribution.\n",
"\n",
" Args:\n",
" fonts_dist (Dict[str, float]): A dictionary containing font sizes as keys\n",
" and their frequency distribution as values.\n",
" no_of_groups (int): The number of groups to create.\n",
" threshold (float): The minimum frequency a font size must have to be included.\n",
" Defaults to 0.1.\n",
"\n",
" Returns:\n",
" List[List[str]]: A list of groups, where each group is a list of font sizes.\n",
" \"\"\"\n",
" fonts_dist = {k: v for k, v in fonts_dist.items() if v > threshold}\n",
" fonts = list(fonts_dist.keys())\n",
" fonts_cnt = len(fonts)\n",
" fonts_per_group = math.ceil(fonts_cnt / no_of_groups)\n",
" groups = []\n",
" for idx in range(no_of_groups):\n",
" start = idx * fonts_per_group\n",
" # end = start + fonts_per_group\n",
" end = (idx + 1) * fonts_per_group\n",
" group = fonts[start:end]\n",
" groups.append(group)\n",
" # h, sh, ssh\n",
" return groups[::-1]\n",
"\n",
"\n",
"def adjust_empty_header_font_tags(\n",
" fonts_list: List[Union[List[str], None]]\n",
") -> List[List[str]]:\n",
" \"\"\"\n",
" Adjusts the font tags for headers based on the number of sections present.\n",
"\n",
" Args:\n",
" fonts_list (List[Union[List[str], None]]): A list containing font sections\n",
" for headers, where each section\n",
" is a list of font sizes or None.\n",
"\n",
" Returns:\n",
" List[List[str]]: A list of three sections, each a list of font sizes,\n",
" ensuring that empty sections are added as needed.\n",
" \"\"\"\n",
" counter = 0\n",
" updated_list = []\n",
" for section in fonts_list:\n",
" if section:\n",
" counter += 1\n",
" updated_list.append(section)\n",
" if counter == 3:\n",
" return updated_list\n",
" elif counter == 2:\n",
" updated_list.append([])\n",
" return updated_list\n",
" elif counter == 1:\n",
" updated_list.extend([[], []])\n",
" return updated_list\n",
" elif not counter:\n",
" # to handle if page has no tokens\n",
" return [[], [], []]\n",
" return fonts_list\n",
"\n",
"\n",
"def tag_text_with_section_type(doc: documentai.Document) -> documentai.Document:\n",
" \"\"\"\n",
" Tags entities in a Document with their respective section types.\n",
"\n",
" Args:\n",
" doc (documentai.Document): The Document object containing entities to be tagged.\n",
"\n",
" Returns:\n",
" documentai.Document: The updated Document object with tagged entities.\n",
" \"\"\"\n",
"\n",
" updated_entities = []\n",
"\n",
" def ent_sort_key(ent: documentai.Document.Entity) -> tuple:\n",
" \"\"\"\n",
" Sort key for entities based on page number and y-coordinate.\n",
"\n",
" Args:\n",
" ent (documentai.Document.Entity): The entity to extract sorting keys from.\n",
"\n",
" Returns:\n",
" tuple: A tuple containing the page number and the y-coordinate.\n",
" \"\"\"\n",
" pg_no = ent.page_anchor.page_refs[0].page\n",
" y_coord = ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices[0].y\n",
" return pg_no, y_coord\n",
"\n",
" pre_ent_type = \"\"\n",
" old_entities = sorted(doc.entities, key=ent_sort_key)\n",
" for ent in old_entities:\n",
" if ent.type_ == \"text\" and pre_ent_type not in (\"\", \"text\"):\n",
" ent.type_ = f\"{pre_ent_type.rstrip('_text')}_text\"\n",
" pre_ent_type = ent.type_\n",
" updated_entities.append(ent)\n",
" doc.entities = updated_entities\n",
" return doc"
]
},
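{
"cell_type": "markdown",
"id": "get-groups-sanity-md",
"metadata": {},
"source": [
"The optional cell below (a sanity check added for illustration, not part of the original flow) builds three synthetic tokens carrying only text-anchor segments and verifies that `get_groups` merges the two adjoining tokens into one group while keeping the detached token separate."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "get-groups-sanity-code",
"metadata": {},
"outputs": [],
"source": [
"def make_demo_token(start: int, end: int) -> documentai.Document.Page.Token:\n",
"    \"\"\"Builds a minimal synthetic token with only a text-anchor segment (demo helper).\"\"\"\n",
"    segment = documentai.Document.TextAnchor.TextSegment(start_index=start, end_index=end)\n",
"    anchor = documentai.Document.TextAnchor(text_segments=[segment])\n",
"    layout = documentai.Document.Page.Layout(text_anchor=anchor)\n",
"    return documentai.Document.Page.Token(layout=layout)\n",
"\n",
"\n",
"# Tokens (0-5) and (5-10) share an index -> same group; token (12-20) starts a new group\n",
"demo_tokens = [make_demo_token(0, 5), make_demo_token(5, 10), make_demo_token(12, 20)]\n",
"demo_groups = get_groups(demo_tokens)\n",
"print([len(group) for group in demo_groups])  # expected output: [2, 1]"
]
},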
{
"cell_type": "markdown",
"id": "b22ed8ac-2e39-4266-8953-e1ca180e2bd4",
"metadata": {},
"source": [
"### 4.Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "485296ce-cf37-4017-a3ce-aee7405e035e",
"metadata": {},
"outputs": [],
"source": [
"file_list, files_dict = file_names(input_gcs_path)\n",
"input_bucket = input_gcs_path.split(\"/\")[2]\n",
"output_splits = output_gcs_path.split(\"/\")\n",
"output_bucket = output_splits[2]\n",
"output_dir = \"/\".join(output_splits[3:]).strip(\"/\")\n",
"storage_client = storage.Client()\n",
"bucket = storage_client.get_bucket(input_bucket)\n",
"for fn, fp in files_dict.items():\n",
" print(f\"File: {fn}\")\n",
" pdf_bytes = bucket.blob(fp).download_as_bytes()\n",
" try:\n",
" res = process_document_ocr_sample(\n",
" project_id, location, processor_id, processor_version, pdf_bytes, mime_type\n",
" )\n",
" except Exception as e:\n",
" print(f\"Unable to parse document due to {type(e)}, {str(e)}\")\n",
" continue\n",
" text = res.text\n",
" bold_dist = get_fonts_distribution(res, True, footer_offset, header_offset)\n",
" plain_dist = get_fonts_distribution(res, False, footer_offset, header_offset)\n",
" # 3 -> to group headers into 3 types i.e, headers, sub_headers, sub_sub_headers\n",
" bagged_fonts = bag_fonts_desc(\n",
" bold_dist, sections_count, threshold=bold_text_threshold\n",
" )\n",
" # print(f\"\\th, sh, ssh = {bagged_fonts}\")\n",
" # to move all empty fonts lists to end\n",
" h_fonts, sh_fonts, ssh_fonts = bag = adjust_empty_header_font_tags(bagged_fonts)\n",
" print(f\"\\th, sh, ssh = {bag}\")\n",
" text_fonts = list(plain_dist.keys())\n",
" print(\"\\tentities\", len(res.entities), end=\" \")\n",
" for page in res.pages:\n",
" page_num = page.page_number\n",
" ent_page_num = page_num - 1\n",
" headers = get_headers(page.tokens, h_fonts, True, footer_offset, header_offset)\n",
" sub_headers = get_headers(\n",
" page.tokens, sh_fonts, True, footer_offset, header_offset\n",
" )\n",
" sub_sub_headers = get_headers(\n",
" page.tokens, ssh_fonts, True, footer_offset, header_offset\n",
" )\n",
" text_headers = get_headers(\n",
" page.tokens, text_fonts, False, footer_offset, header_offset\n",
" )\n",
" if headers:\n",
" header_groups = get_groups(headers)\n",
" header_ents = create_headers_ents(\n",
" \"header\", text, ent_page_num, header_groups\n",
" )\n",
" res.entities.extend(header_ents)\n",
" if sub_headers:\n",
" sub_header_groups = get_groups(sub_headers)\n",
" sub_header_ents = create_headers_ents(\n",
" \"sub_header\", text, ent_page_num, sub_header_groups\n",
" )\n",
" res.entities.extend(sub_header_ents)\n",
" if sub_sub_headers:\n",
" sub_sub_header_groups = get_groups(sub_sub_headers)\n",
" sub_sub_header_ents = create_headers_ents(\n",
" \"sub_sub_header\", text, ent_page_num, sub_sub_header_groups\n",
" )\n",
" res.entities.extend(sub_sub_header_ents)\n",
" if text_headers:\n",
" text_header_groups = get_groups(text_headers)\n",
" sub_sub_header_ents = create_headers_ents(\n",
" \"text\", text, ent_page_num, text_header_groups\n",
" )\n",
" res.entities.extend(sub_sub_header_ents)\n",
" print(\"\\tentities\", len(res.entities))\n",
" res = tag_text_with_section_type(res)\n",
" json_str = documentai.Document.to_json(res, including_default_value_fields=False)\n",
" out_fn = fn.rsplit(\".\", 1)[0]\n",
" out_fp = f\"{output_dir}/{out_fn}.json\"\n",
" print(f\"\\tWriting data to gs://{output_bucket}/{out_fp}\")\n",
" store_document_as_json(json_str, output_bucket, out_fp)\n",
"print(\"Process Completed!!!\")"
]
},
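{
"cell_type": "markdown",
"id": "verify-output-md",
"metadata": {},
"source": [
"The optional cell below (an added convenience, not part of the original flow) reads one of the JSON files written above back from GCS and prints the newly created section entities. It assumes the `storage_client`, `output_bucket`, and `output_dir` variables from the previous cell are still in scope."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "verify-output-code",
"metadata": {},
"outputs": [],
"source": [
"# Read one post-processed document back and list its section entities\n",
"out_bucket = storage_client.get_bucket(output_bucket)\n",
"json_blobs = [b for b in out_bucket.list_blobs(prefix=output_dir) if b.name.endswith(\".json\")]\n",
"if json_blobs:\n",
"    sample_doc = documentai.Document.from_json(\n",
"        json_blobs[0].download_as_text(), ignore_unknown_fields=True\n",
"    )\n",
"    for entity in sample_doc.entities:\n",
"        if entity.type_ in (\"header\", \"sub_header\", \"sub_sub_header\") or entity.type_.endswith(\"_text\"):\n",
"            print(entity.type_, \"->\", entity.mention_text[:60])"
]
},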
{
"cell_type": "markdown",
"id": "d84fa2e6-1a60-49fd-8a75-7feba0e7da04",
"metadata": {},
"source": [
"### 5.Output\n",
"\n",
"The provided code identifies and classifies text elements as label headers, sub-headers, or sub-sub-headers within a document. It does this by analyzing the font size of the text and comparing it to predefined ranges for each header type. As a result, new entities are created that correspond to each identified header type.\n",
"\n",
"<img src=\"./Images/tool_ouput.png\"></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m125",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}