incubator-tools/PDF_Table_Identification/PDF_Table

{ "cells": [ { "cell_type": "markdown", "id": "88ca6d56-4016-4aec-8db8-847c3afbf5b5", "metadata": {}, "source": [ "# PDF Table Identification" ] }, { "cell_type": "markdown", "id": "8160ab36-c48f-4813-8d55-13e0239bcb97", "metadata": {}, "source": [ "* Author: docai-incubator@google.com" ] }, { "cell_type": "markdown", "id": "55da172d-3c2c-4c47-999e-ce4858321d77", "metadata": {}, "source": [ "## Disclaimer\n", "\n", "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied." ] }, { "cell_type": "markdown", "id": "9fb9881c-a602-4981-aa00-1bff93119301", "metadata": {}, "source": [ "## Objective\n", "\n", "This Python script automates the identification and extraction of pages containing tables from PDF documents, streamlining the pre-processing workflow." ] }, { "cell_type": "markdown", "id": "7054c694-8e7d-473b-9b07-a1e14e6cee67", "metadata": {}, "source": [ "## Considerations and Limitations\n", "\n", "The efficacy of the script may vary depending on the complexity of the documents. Image quality, lighting, and contrast variations can also affect performance. Manual intervention or specialized techniques may be required for optimal results.\n", "There's a potential risk of data misinterpretation during the table identification process. A page containing a table may be missed, or a text dense page may be interpreted as containing a table due to the rectangular text shape.\n" ] }, { "cell_type": "markdown", "id": "c84c769c-750d-4f25-8ddc-f4449670fe10", "metadata": {}, "source": [ "## Prerequisites\n", "* Access to vertex AI Notebook or Google Colab\n", "* GCS bucket for processing of the input files and output files" ] }, { "cell_type": "markdown", "id": "08ad90e8-2e28-4c1d-b2d8-8c0770b3c786", "metadata": { "tags": [] }, "source": [ "## Step by Step procedure" ] }, { "cell_type": "markdown", "id": "bf784758-0586-4e1c-a264-57663fae795c", "metadata": {}, "source": [ "### 1.Importing Required Modules" ] }, { "cell_type": "code", "execution_count": null, "id": "7cf64a0c-3a8a-4988-8a6a-43b304c74abd", "metadata": { "tags": [] }, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" ] }, { "cell_type": "code", "execution_count": 12, "id": "6b88928f-74c6-4e35-81d0-d8e19ca4ee60", "metadata": { "tags": [] }, "outputs": [], "source": [ "import cv2\n", "import numpy as np\n", "import pymupdf as fitz\n", "from google.cloud import storage\n", "import io\n", "from typing import Any, Dict, List, Optional, Sequence, Tuple, Union" ] }, { "cell_type": "markdown", "id": "735b9e3c-9166-43fa-be66-c0b44a00de90", "metadata": { "tags": [] }, "source": [ "### 2.Setup the inputs" ] }, { "cell_type": "code", "execution_count": null, "id": "7a383e47-0393-4b81-a047-190833d0d730", "metadata": { "tags": [] }, "outputs": [], "source": [ "input_bucket_name = \"<input-bucket>\"\n", "output_bucket_name = \"<output-bucket>\"\n", "document_path = \"Table Identification Test/Test_Table_ID.pdf\"\n", "output_full_path = \"Table Identification Test/output\"" ] }, { "cell_type": "markdown", "id": "9c76c0d0-1b0c-456a-921d-078a98a5d92f", "metadata": {}, "source": [ "### 3.Run the required functions" ] }, { "cell_type": "code", "execution_count": null, "id": "539a0f64-ed79-4731-b02e-922d61dc2877", "metadata": { "tags": [] }, "outputs": [], "source": [ "def find_tables_in_page(image: np.ndarray) -> list:\n", " \"\"\"\n", " Finds tables in an image and returns their contours.\n", "\n", " Args:\n", " image (np.ndarray): The input image to process (in BGR format).\n", "\n", " Returns:\n", " list: A list of contours that potentially represent tables in the image.\n", " \"\"\"\n", " gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)\n", " thresh = cv2.adaptiveThreshold(\n", " gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, 11, 4\n", " )\n", "\n", " # Detect horizontal lines\n", " horizontal_kernel = cv2.getStructuringElement(\n", " cv2.MORPH_RECT, (40, 1)\n", " ) # Reduced kernel width\n", " detect_horizontal = cv2.morphologyEx(\n", " thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2\n", " )\n", "\n", " # Detect vertical lines\n", " vertical_kernel = cv2.getStructuringElement(\n", " cv2.MORPH_RECT, (1, 40)\n", " ) # Reduced kernel height\n", " detect_vertical = cv2.morphologyEx(\n", " thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2\n", " )\n", "\n", " combined = cv2.bitwise_or(detect_horizontal, detect_vertical)\n", "\n", " # Dilate the image to close gaps between lines\n", " kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))\n", " combined = cv2.dilate(\n", " combined, kernel, iterations=3\n", " ) # Reduced iterations for finer control\n", "\n", " contours, hierarchy = cv2.findContours(\n", " combined, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE\n", " )\n", "\n", " table_contours = []\n", " for contour in contours:\n", " area = cv2.contourArea(contour)\n", " x, y, w, h = cv2.boundingRect(contour)\n", " aspect_ratio = float(w) / h if h != 0 else float(\"inf\")\n", "\n", " # Improved table contour filtering:\n", " if (\n", " area > 1000 and 0.5 <= aspect_ratio <= 5\n", " ): # Adjusted area threshold, reasonable aspect ratio\n", " table_contours.append(contour)\n", "\n", " return table_contours\n", "\n", "\n", "def identify_table_pages(doc: fitz.Document, dpi: int = 300) -> list:\n", " \"\"\"\n", " Identifies pages in a PDF document that contain tables.\n", "\n", " Args:\n", " doc (fitz.Document): The PDF document to analyze.\n", " dpi (int, optional): Dots per inch for the resolution of the images extracted from the PDF. Defaults to 300.\n", "\n", " Returns:\n", " list: A list of page indices where tables are found.\n", " \"\"\"\n", " table_pages = []\n", "\n", " for page_index in range(len(doc)):\n", " page = doc.load_page(page_index)\n", " mat = fitz.Matrix(dpi / 72, dpi / 72)\n", " pix = page.get_pixmap(matrix=mat)\n", " img_bytes = pix.tobytes(\"jpeg\")\n", " image = cv2.imdecode(np.frombuffer(img_bytes, np.uint8), cv2.IMREAD_COLOR)\n", " table_contours = find_tables_in_page(image)\n", " if table_contours: # Simplified condition\n", " print(f\"Table found on page {page_index + 1}\")\n", " table_pages.append(page_index)\n", " return table_pages\n", "\n", "\n", "def read_image_from_bytecode(img_bytes: bytes) -> np.ndarray:\n", " \"\"\"\n", " Reads an image from a bytecode and returns it as a NumPy array.\n", "\n", " Args:\n", " img_bytes (bytes): The image data in bytecode format.\n", "\n", " Returns:\n", " np.ndarray: The decoded image in BGR format, suitable for OpenCV processing.\n", " \"\"\"\n", " return cv2.imdecode(np.frombuffer(img_bytes, np.uint8), cv2.IMREAD_COLOR)\n", "\n", "\n", "def document_downloader(bucket_name: str, blob_name_with_prefix_path: str) -> bytes:\n", " \"\"\"\n", " Downloads a document from a Google Cloud Storage bucket and returns its byte content.\n", "\n", " Args:\n", " bucket_name (str): The name of the Cloud Storage bucket.\n", " blob_name_with_prefix_path (str): The path and name of the blob in the bucket.\n", "\n", " Returns:\n", " bytes: The downloaded document in byte format.\n", " \"\"\"\n", " storage_client = storage.Client()\n", " bucket = storage_client.bucket(bucket_name)\n", " blob = bucket.blob(blob_name_with_prefix_path)\n", " doc = blob.download_as_bytes()\n", " return doc\n", "\n", "\n", "def save_pdf_to_bucket(\n", " pdf_bytes: bytes, bucket_name: str, destination_blob_name: str\n", ") -> None:\n", " \"\"\"\n", " Saves a PDF to a Google Cloud Storage bucket.\n", "\n", " Args:\n", " pdf_bytes (bytes): The byte content of the PDF.\n", " bucket_name (str): The name of the Cloud Storage bucket.\n", " destination_blob_name (str): The destination path and name for the blob in the bucket.\n", "\n", " Returns:\n", " None\n", " \"\"\"\n", " storage_client = storage.Client()\n", " bucket = storage_client.bucket(bucket_name)\n", " blob = bucket.blob(destination_blob_name)\n", " blob.upload_from_string(pdf_bytes, content_type=\"application/pdf\")" ] }, { "cell_type": "markdown", "id": "41e63cfd-2709-4d63-96e2-c60703b4ff14", "metadata": {}, "source": [ "### 4.Run the code" ] }, { "cell_type": "code", "execution_count": null, "id": "ee9ddf02-8cd1-4b4f-8c92-3ba430ec738d", "metadata": { "tags": [] }, "outputs": [], "source": [ "file_name = document_path.split(\"/\")[-1]\n", "output_file_name = f\"{file_name}_tables.pdf\"\n", "document_bytes = document_downloader(input_bucket_name, document_path)\n", "pdf_document = fitz.open(\"pdf\", document_bytes)\n", "table_pages = identify_table_pages(pdf_document)\n", "\n", "\n", "# Create a new PDF document with only the pages that have tables\n", "output_pdf = fitz.open()\n", "for page_index in table_pages:\n", " output_pdf.insert_pdf(pdf_document, from_page=page_index, to_page=page_index)\n", "\n", "\n", "# Save the new PDF to a BytesIO object\n", "output_pdf_bytes = io.BytesIO()\n", "output_pdf.save(output_pdf_bytes)\n", "output_pdf_bytes.seek(0)\n", "\n", "\n", "# Save the new PDF to GCS\n", "save_pdf_to_bucket(\n", " output_pdf_bytes.read(),\n", " output_bucket_name,\n", " output_full_path + \"/\" + output_file_name,\n", ")\n", "print(\n", " f\"PDF with extracted tables saved to: gs://{output_bucket_name}/{output_full_path}/{output_file_name}\"\n", ")" ] }, { "cell_type": "markdown", "id": "6ca7b613-855b-42ec-b6e6-3010c289b5c2", "metadata": {}, "source": [ "### 5.Output\n", "\n", "This code extracts pages containing tables from a document and saves them as separate PDFs in a specified output location.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ea34affa-2a0d-4730-be80-2a4509107bb5", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "environment": { "kernel": "conda-base-py", "name": "workbench-notebooks.m125", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125" }, "kernelspec": { "display_name": "Python 3 (ipykernel) (Local)", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 5 }

incubator-tools/PDF_Table_Identification/PDF_Table_Identification.ipynb (377 lines of code) (raw):