incubator-tools/character_box_removal/character_box

{ "cells": [ { "cell_type": "markdown", "id": "d9ad8788-654c-4e38-9d6d-42158d95c6ee", "metadata": {}, "source": [ "# Character Box Removal" ] }, { "cell_type": "markdown", "id": "08400e10-6639-43a6-8e2a-0b4f28ba08e2", "metadata": {}, "source": [ "* Author: docai-incubator@google.com" ] }, { "cell_type": "markdown", "id": "8cc2d117-9e2a-4ebe-8aef-94ad4f2b46b1", "metadata": {}, "source": [ "## Disclaimer\n", "\n", "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied." ] }, { "cell_type": "markdown", "id": "94f1d1c0-823d-46d4-9a4a-e5ffa56c6b55", "metadata": {}, "source": [ "## Objective\n", "\n", "This document is intended as a guide to help with pre-processing PDFs with values inside of character boxes to remove the character boxes completely. This will help mitigate OCR confusion caused by the lines of the character boxes." ] }, { "cell_type": "markdown", "id": "ffe00a75-567e-4f33-b4b4-aa72f9f4fe50", "metadata": {}, "source": [ "## Prerequisites\n", "* Vertex AI Notebook.\n", "* Permission For Google DocAI Processors, Storage and Vertex AI Notebook." ] }, { "cell_type": "markdown", "id": "9b33b7f0-baf8-421c-b476-2d25b1ee9b02", "metadata": {}, "source": [ "## Step by Step Procedure" ] }, { "cell_type": "markdown", "id": "eadfe192-5691-4053-8670-32ef3e5bea8c", "metadata": {}, "source": [ "### 1. Import Modules/Packages" ] }, { "cell_type": "code", "execution_count": null, "id": "ea066b24-5e3b-4c28-a56f-263fc6e9aa67", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Run this cell to download utilities module\n", "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" ] }, { "cell_type": "code", "execution_count": null, "id": "2d8e2a54-e250-4924-8f58-9a4642e6ddf3", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install opencv-python matplotlib numpy pillow\n", "!pip install pdf2image" ] }, { "cell_type": "code", "execution_count": null, "id": "6495b161-ac1b-40b2-a668-fe835da2950a", "metadata": { "tags": [] }, "outputs": [], "source": [ "import cv2\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from PIL import Image\n", "from pdf2image import convert_from_path\n", "import os\n", "from google.cloud import storage\n", "import os\n", "from typing import List, Union" ] }, { "cell_type": "markdown", "id": "1709ea5e-bf14-44ab-9963-cda14b092dc2", "metadata": {}, "source": [ "### 2. Input Details" ] }, { "cell_type": "markdown", "id": "7ce030a1-663b-445e-a38e-24dc415942c2", "metadata": {}, "source": [ "* **local_folder_path**: Local directory containing input PDFs\n", "* **local_output_folder_path**: Local directory where output PDFs store\n", "* **bucket_name**: Bucket Name of GCS Storage\n", "* **gcs_folder_path**: Path to GCS directory where processed files will be stored" ] }, { "cell_type": "code", "execution_count": null, "id": "36eb0a45-7e29-41b4-a9e0-44dab8f55c17", "metadata": {}, "outputs": [], "source": [ "local_folder_path = \"<input-path>\" # Local directory containing input PDFs\n", "local_output_folder_path = \"<output-path>\" # Local directory where output PDFs store\n", "bucket_name = \"<bucket-name>\" # Replace with your GCS bucket name\n", "gcs_folder_path = (\n", " \"<output-directory>\" # Path to GCS directory where processed files will be stored\n", ")" ] }, { "cell_type": "markdown", "id": "b4a4b1ed-2e86-4a8e-b00a-ebf6e41921ef", "metadata": {}, "source": [ "### 3. Run the required functions" ] }, { "cell_type": "code", "execution_count": null, "id": "4b2afdc0-3873-4db9-a7b7-e64cf0dce00b", "metadata": { "tags": [] }, "outputs": [], "source": [ "def box_fun(image_path: str) -> np.ndarray:\n", " \"\"\"\n", " Processes an image to detect and remove lines, enhance text quality, and improve resolution.\n", "\n", " Args:\n", " image_path (str): Path to the input image.\n", "\n", " Returns:\n", " np.ndarray: The processed image as a NumPy array.\n", " \"\"\"\n", " # Read the image\n", " image = cv2.imread(image_path)\n", "\n", " # Increase the resolution by resizing the image\n", " image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)\n", "\n", " # Create a copy of the original image\n", " result = image.copy()\n", "\n", " # Convert to grayscale\n", " gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)\n", "\n", " # Apply adaptive thresholding\n", " thresh = cv2.adaptiveThreshold(\n", " gray,\n", " 255,\n", " cv2.ADAPTIVE_THRESH_GAUSSIAN_C,\n", " cv2.THRESH_BINARY_INV,\n", " 21, # Adjusted block size\n", " 5, # Adjusted constant\n", " )\n", "\n", " # Create kernels for line detection\n", " horizontal_kernel = cv2.getStructuringElement(\n", " cv2.MORPH_RECT, (50, 1)\n", " ) # Adjusted size\n", " vertical_kernel = cv2.getStructuringElement(\n", " cv2.MORPH_RECT, (1, 45)\n", " ) # Adjusted size\n", "\n", " # Detect horizontal and vertical lines\n", " horizontal_lines = cv2.morphologyEx(\n", " thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2\n", " )\n", " vertical_lines = cv2.morphologyEx(\n", " thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2\n", " )\n", "\n", " # Combine horizontal and vertical lines\n", " lines = cv2.addWeighted(horizontal_lines, 0.5, vertical_lines, 0.6, 0.0)\n", "\n", " # Dilate the lines to make them more prominent\n", " lines = cv2.dilate(\n", " lines, cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5)), iterations=2\n", " ) # Adjusted dilation\n", "\n", " # Remove the lines from the original image\n", " result = cv2.inpaint(result, lines, 5, cv2.INPAINT_TELEA) # Adjusted radius\n", "\n", " # Optional: Apply Gaussian blur for smoothing\n", " result = cv2.GaussianBlur(result, (3, 3), 0)\n", "\n", " # Enhance text quality with sharpening\n", " sharpen_kernel = np.array(\n", " [[-1, -1, -1], [-1, 10, -1], [-1, -1, -1]]\n", " ) # Adjusted kernel\n", " result = cv2.filter2D(result, -1, sharpen_kernel) # Apply sharpening once\n", "\n", " # Display the images using matplotlib\n", " plt.figure(figsize=(15, 5))\n", "\n", " plt.subplot(1, 3, 1)\n", " plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))\n", " plt.title(\"Original Image\")\n", "\n", " plt.subplot(1, 3, 2)\n", " plt.imshow(lines, cmap=\"gray\")\n", " plt.title(\"Detected Lines\")\n", "\n", " plt.subplot(1, 3, 3)\n", " plt.imshow(cv2.cvtColor(result, cv2.COLOR_BGR2RGB))\n", " plt.title(\"Result Image\")\n", "\n", " plt.tight_layout()\n", " plt.show()\n", "\n", " return result\n", "\n", "\n", "def process_pdf(pdf_path: str) -> List[Union[Image.Image, None]]:\n", " \"\"\"\n", " Processes each page of a PDF to remove lines, enhance quality, and save the results as a new PDF.\n", "\n", " Args:\n", " pdf_path (str): Path to the input PDF file.\n", "\n", " Returns:\n", " List[Union[Image.Image, None]]: List of processed images for each page.\n", " \"\"\"\n", " print(pdf_path)\n", " pages = convert_from_path(pdf_path)\n", " processed_images = []\n", "\n", " for i, page in enumerate(pages):\n", " # Save each page as a temporary PNG\n", " temp_image_path = f\"temp_page_{i}.png\"\n", " page.save(temp_image_path, \"PNG\")\n", "\n", " # Process the image\n", " processed_image = box_fun(temp_image_path)\n", "\n", " if isinstance(processed_image, np.ndarray):\n", " processed_image = Image.fromarray(processed_image.astype(\"uint8\"))\n", "\n", " processed_images.append(processed_image)\n", "\n", " os.remove(temp_image_path)\n", "\n", " filename = pdf_path.split(\".pdf\")[0].split(\"/\")[-1]\n", " output_pdf_path = f\"{local_output_folder_path}/improved_{filename}.pdf\"\n", " print(output_pdf_path)\n", "\n", " processed_images[0].save(\n", " output_pdf_path, save_all=True, append_images=processed_images[1:]\n", " )\n", "\n", " return processed_images\n", "\n", "\n", "def upload_files_to_gcs(\n", " local_folder_path: str, bucket_name: str, gcs_folder_path: str\n", ") -> None:\n", " \"\"\"\n", " Uploads all files from a local folder to a specified Google Cloud Storage (GCS) folder.\n", "\n", " Args:\n", " local_folder_path (str): The local folder path containing files to upload.\n", " bucket_name (str): The name of the GCS bucket.\n", " gcs_folder_path (str): The destination folder path in GCS.\n", " \"\"\"\n", " # Initialize the GCS client\n", " client = storage.Client()\n", " bucket = client.bucket(bucket_name)\n", "\n", " # List all files in the local folder\n", " files = [\n", " f\n", " for f in os.listdir(local_folder_path)\n", " if os.path.isfile(os.path.join(local_folder_path, f))\n", " ]\n", "\n", " print(\"Uploading files to GCS Bucket\")\n", " for file_name in files:\n", " # Full path to the local file\n", " local_file_path = os.path.join(local_folder_path, file_name)\n", "\n", " # GCS path (include the file name in the destination)\n", " gcs_blob_path = os.path.join(gcs_folder_path, file_name)\n", "\n", " # Create a blob in the bucket and upload the file\n", " blob = bucket.blob(gcs_blob_path)\n", " blob.upload_from_filename(local_file_path)\n", "\n", " print(f\"Uploaded {file_name} to gs://{bucket_name}/{gcs_blob_path}\")" ] }, { "cell_type": "markdown", "id": "ac1d85cf-7529-469b-be7e-3e00a778af87", "metadata": {}, "source": [ "### 4. Run the code" ] }, { "cell_type": "code", "execution_count": null, "id": "49dab42f-6c98-4aac-b804-cd3ec81127f8", "metadata": { "tags": [] }, "outputs": [], "source": [ "if __name__ == \"__main__\":\n", " # For Local Input Files\n", " files = [f\"{local_folder_path}/{f}\" for f in os.listdir(local_folder_path)]\n", "\n", " for file in files:\n", " if file.endswith(\".pdf\") or file.endswith(\".PDF\"):\n", " process_pdf(file)\n", "\n", " upload_files_to_gcs(local_output_folder_path, bucket_name, gcs_folder_path)" ] }, { "cell_type": "markdown", "id": "16e3f538-c1ab-4ff1-a720-80d07ea9245b", "metadata": {}, "source": [ "### Output Details" ] }, { "cell_type": "markdown", "id": "b4badf9b-1b40-4b6a-9ac9-b7a0f233018f", "metadata": { "tags": [] }, "source": [ "### Before Tooling\n", "<img src='./images/before_tooling.png' width=400 height=600 alt=\"Sample Output\"></img>\n", "### After Tooling\n", "<img src='./images/after_tooling.png' width=400 height=600 alt=\"Sample Output\"></img>" ] }, { "cell_type": "code", "execution_count": null, "id": "166fa02b-c3c1-4d2b-8a37-a843e961b2d2", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "environment": { "kernel": "conda-base-py", "name": "workbench-notebooks.m127", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m127" }, "kernelspec": { "display_name": "Python 3 (ipykernel) (Local)", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }

incubator-tools/character_box_removal/character_box_removal.ipynb (413 lines of code) (raw):