incubator-tools/character_box_removal/character_box_removal.ipynb (413 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "d9ad8788-654c-4e38-9d6d-42158d95c6ee",
"metadata": {},
"source": [
"# Character Box Removal"
]
},
{
"cell_type": "markdown",
"id": "08400e10-6639-43a6-8e2a-0b4f28ba08e2",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "8cc2d117-9e2a-4ebe-8aef-94ad4f2b46b1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "94f1d1c0-823d-46d4-9a4a-e5ffa56c6b55",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This document is intended as a guide to help with pre-processing PDFs with values inside of character boxes to remove the character boxes completely. This will help mitigate OCR confusion caused by the lines of the character boxes."
]
},
{
"cell_type": "markdown",
"id": "ffe00a75-567e-4f33-b4b4-aa72f9f4fe50",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Vertex AI Notebook.\n",
"* Permission For Google DocAI Processors, Storage and Vertex AI Notebook."
]
},
{
"cell_type": "markdown",
"id": "9b33b7f0-baf8-421c-b476-2d25b1ee9b02",
"metadata": {},
"source": [
"## Step by Step Procedure"
]
},
{
"cell_type": "markdown",
"id": "eadfe192-5691-4053-8670-32ef3e5bea8c",
"metadata": {},
"source": [
"### 1. Import Modules/Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea066b24-5e3b-4c28-a56f-263fc6e9aa67",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d8e2a54-e250-4924-8f58-9a4642e6ddf3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install opencv-python matplotlib numpy pillow\n",
"!pip install pdf2image"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6495b161-ac1b-40b2-a668-fe835da2950a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import cv2\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"from PIL import Image\n",
"from pdf2image import convert_from_path\n",
"import os\n",
"from google.cloud import storage\n",
"import os\n",
"from typing import List, Union"
]
},
{
"cell_type": "markdown",
"id": "1709ea5e-bf14-44ab-9963-cda14b092dc2",
"metadata": {},
"source": [
"### 2. Input Details"
]
},
{
"cell_type": "markdown",
"id": "7ce030a1-663b-445e-a38e-24dc415942c2",
"metadata": {},
"source": [
"* **local_folder_path**: Local directory containing input PDFs\n",
"* **local_output_folder_path**: Local directory where output PDFs store\n",
"* **bucket_name**: Bucket Name of GCS Storage\n",
"* **gcs_folder_path**: Path to GCS directory where processed files will be stored"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36eb0a45-7e29-41b4-a9e0-44dab8f55c17",
"metadata": {},
"outputs": [],
"source": [
"local_folder_path = \"<input-path>\" # Local directory containing input PDFs\n",
"local_output_folder_path = \"<output-path>\" # Local directory where output PDFs store\n",
"bucket_name = \"<bucket-name>\" # Replace with your GCS bucket name\n",
"gcs_folder_path = (\n",
" \"<output-directory>\" # Path to GCS directory where processed files will be stored\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b4a4b1ed-2e86-4a8e-b00a-ebf6e41921ef",
"metadata": {},
"source": [
"### 3. Run the required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b2afdc0-3873-4db9-a7b7-e64cf0dce00b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def box_fun(image_path: str) -> np.ndarray:\n",
" \"\"\"\n",
" Processes an image to detect and remove lines, enhance text quality, and improve resolution.\n",
"\n",
" Args:\n",
" image_path (str): Path to the input image.\n",
"\n",
" Returns:\n",
" np.ndarray: The processed image as a NumPy array.\n",
" \"\"\"\n",
" # Read the image\n",
" image = cv2.imread(image_path)\n",
"\n",
" # Increase the resolution by resizing the image\n",
" image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)\n",
"\n",
" # Create a copy of the original image\n",
" result = image.copy()\n",
"\n",
" # Convert to grayscale\n",
" gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)\n",
"\n",
" # Apply adaptive thresholding\n",
" thresh = cv2.adaptiveThreshold(\n",
" gray,\n",
" 255,\n",
" cv2.ADAPTIVE_THRESH_GAUSSIAN_C,\n",
" cv2.THRESH_BINARY_INV,\n",
" 21, # Adjusted block size\n",
" 5, # Adjusted constant\n",
" )\n",
"\n",
" # Create kernels for line detection\n",
" horizontal_kernel = cv2.getStructuringElement(\n",
" cv2.MORPH_RECT, (50, 1)\n",
" ) # Adjusted size\n",
" vertical_kernel = cv2.getStructuringElement(\n",
" cv2.MORPH_RECT, (1, 45)\n",
" ) # Adjusted size\n",
"\n",
" # Detect horizontal and vertical lines\n",
" horizontal_lines = cv2.morphologyEx(\n",
" thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2\n",
" )\n",
" vertical_lines = cv2.morphologyEx(\n",
" thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2\n",
" )\n",
"\n",
" # Combine horizontal and vertical lines\n",
" lines = cv2.addWeighted(horizontal_lines, 0.5, vertical_lines, 0.6, 0.0)\n",
"\n",
" # Dilate the lines to make them more prominent\n",
" lines = cv2.dilate(\n",
" lines, cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5)), iterations=2\n",
" ) # Adjusted dilation\n",
"\n",
" # Remove the lines from the original image\n",
" result = cv2.inpaint(result, lines, 5, cv2.INPAINT_TELEA) # Adjusted radius\n",
"\n",
" # Optional: Apply Gaussian blur for smoothing\n",
" result = cv2.GaussianBlur(result, (3, 3), 0)\n",
"\n",
" # Enhance text quality with sharpening\n",
" sharpen_kernel = np.array(\n",
" [[-1, -1, -1], [-1, 10, -1], [-1, -1, -1]]\n",
" ) # Adjusted kernel\n",
" result = cv2.filter2D(result, -1, sharpen_kernel) # Apply sharpening once\n",
"\n",
" # Display the images using matplotlib\n",
" plt.figure(figsize=(15, 5))\n",
"\n",
" plt.subplot(1, 3, 1)\n",
" plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))\n",
" plt.title(\"Original Image\")\n",
"\n",
" plt.subplot(1, 3, 2)\n",
" plt.imshow(lines, cmap=\"gray\")\n",
" plt.title(\"Detected Lines\")\n",
"\n",
" plt.subplot(1, 3, 3)\n",
" plt.imshow(cv2.cvtColor(result, cv2.COLOR_BGR2RGB))\n",
" plt.title(\"Result Image\")\n",
"\n",
" plt.tight_layout()\n",
" plt.show()\n",
"\n",
" return result\n",
"\n",
"\n",
"def process_pdf(pdf_path: str) -> List[Union[Image.Image, None]]:\n",
" \"\"\"\n",
" Processes each page of a PDF to remove lines, enhance quality, and save the results as a new PDF.\n",
"\n",
" Args:\n",
" pdf_path (str): Path to the input PDF file.\n",
"\n",
" Returns:\n",
" List[Union[Image.Image, None]]: List of processed images for each page.\n",
" \"\"\"\n",
" print(pdf_path)\n",
" pages = convert_from_path(pdf_path)\n",
" processed_images = []\n",
"\n",
" for i, page in enumerate(pages):\n",
" # Save each page as a temporary PNG\n",
" temp_image_path = f\"temp_page_{i}.png\"\n",
" page.save(temp_image_path, \"PNG\")\n",
"\n",
" # Process the image\n",
" processed_image = box_fun(temp_image_path)\n",
"\n",
" if isinstance(processed_image, np.ndarray):\n",
" processed_image = Image.fromarray(processed_image.astype(\"uint8\"))\n",
"\n",
" processed_images.append(processed_image)\n",
"\n",
" os.remove(temp_image_path)\n",
"\n",
" filename = pdf_path.split(\".pdf\")[0].split(\"/\")[-1]\n",
" output_pdf_path = f\"{local_output_folder_path}/improved_{filename}.pdf\"\n",
" print(output_pdf_path)\n",
"\n",
" processed_images[0].save(\n",
" output_pdf_path, save_all=True, append_images=processed_images[1:]\n",
" )\n",
"\n",
" return processed_images\n",
"\n",
"\n",
"def upload_files_to_gcs(\n",
" local_folder_path: str, bucket_name: str, gcs_folder_path: str\n",
") -> None:\n",
" \"\"\"\n",
" Uploads all files from a local folder to a specified Google Cloud Storage (GCS) folder.\n",
"\n",
" Args:\n",
" local_folder_path (str): The local folder path containing files to upload.\n",
" bucket_name (str): The name of the GCS bucket.\n",
" gcs_folder_path (str): The destination folder path in GCS.\n",
" \"\"\"\n",
" # Initialize the GCS client\n",
" client = storage.Client()\n",
" bucket = client.bucket(bucket_name)\n",
"\n",
" # List all files in the local folder\n",
" files = [\n",
" f\n",
" for f in os.listdir(local_folder_path)\n",
" if os.path.isfile(os.path.join(local_folder_path, f))\n",
" ]\n",
"\n",
" print(\"Uploading files to GCS Bucket\")\n",
" for file_name in files:\n",
" # Full path to the local file\n",
" local_file_path = os.path.join(local_folder_path, file_name)\n",
"\n",
" # GCS path (include the file name in the destination)\n",
" gcs_blob_path = os.path.join(gcs_folder_path, file_name)\n",
"\n",
" # Create a blob in the bucket and upload the file\n",
" blob = bucket.blob(gcs_blob_path)\n",
" blob.upload_from_filename(local_file_path)\n",
"\n",
" print(f\"Uploaded {file_name} to gs://{bucket_name}/{gcs_blob_path}\")"
]
},
{
"cell_type": "markdown",
"id": "ac1d85cf-7529-469b-be7e-3e00a778af87",
"metadata": {},
"source": [
"### 4. Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49dab42f-6c98-4aac-b804-cd3ec81127f8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" # For Local Input Files\n",
" files = [f\"{local_folder_path}/{f}\" for f in os.listdir(local_folder_path)]\n",
"\n",
" for file in files:\n",
" if file.endswith(\".pdf\") or file.endswith(\".PDF\"):\n",
" process_pdf(file)\n",
"\n",
" upload_files_to_gcs(local_output_folder_path, bucket_name, gcs_folder_path)"
]
},
{
"cell_type": "markdown",
"id": "16e3f538-c1ab-4ff1-a720-80d07ea9245b",
"metadata": {},
"source": [
"### Output Details"
]
},
{
"cell_type": "markdown",
"id": "b4badf9b-1b40-4b6a-9ac9-b7a0f233018f",
"metadata": {
"tags": []
},
"source": [
"### Before Tooling\n",
"<img src='./images/before_tooling.png' width=400 height=600 alt=\"Sample Output\"></img>\n",
"### After Tooling\n",
"<img src='./images/after_tooling.png' width=400 height=600 alt=\"Sample Output\"></img>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "166fa02b-c3c1-4d2b-8a37-a843e961b2d2",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m127",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m127"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}