{
"cells": [
{
"cell_type": "markdown",
"id": "d9ad8788-654c-4e38-9d6d-42158d95c6ee",
"metadata": {},
"source": [
"# Image Segmentation"
]
},
{
"cell_type": "markdown",
"id": "08400e10-6639-43a6-8e2a-0b4f28ba08e2",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "8cc2d117-9e2a-4ebe-8aef-94ad4f2b46b1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "94f1d1c0-823d-46d4-9a4a-e5ffa56c6b55",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This document provides a step-by-step guide on how to process a PDF file containing multiple images separated by white spaces, extract individual images, and save each image as a separate page in a new PDF. "
]
},
{
"cell_type": "markdown",
"id": "ffe00a75-567e-4f33-b4b4-aa72f9f4fe50",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Python : Jupyter Notebook (Vertex).\n",
"* Storage Bucket."
]
},
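{
"cell_type": "markdown",
"id": "3f6c1a2e-0b7d-4c8e-9a1f-5d2e7b8c9d01",
"metadata": {},
"source": [
"If this notebook is not already authenticated against Google Cloud, you can optionally set up Application Default Credentials and a default project first. The commands below are a minimal sketch; `<<project_id>>` is a placeholder for your own project ID."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e4d2b1c-5a6f-4e3d-8c9b-0a1f2e3d4c02",
"metadata": {},
"outputs": [],
"source": [
"# Optional: authenticate and set the default project (uncomment if needed).\n",
"# !gcloud auth application-default login\n",
"# !gcloud config set project <<project_id>>"
]
},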
{
"cell_type": "markdown",
"id": "9b33b7f0-baf8-421c-b476-2d25b1ee9b02",
"metadata": {},
"source": [
"## Step by Step Procedure"
]
},
{
"cell_type": "markdown",
"id": "eadfe192-5691-4053-8670-32ef3e5bea8c",
"metadata": {},
"source": [
"### 1. Import Modules/Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea066b24-5e3b-4c28-a56f-263fc6e9aa67",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d8e2a54-e250-4924-8f58-9a4642e6ddf3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install google-cloud-documentai google-cloud-storage\n",
"!pip install opencv-python-headless fpdf pdf2image"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6495b161-ac1b-40b2-a668-fe835da2950a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"import cv2\n",
"import numpy as np\n",
"import os\n",
"from fpdf import FPDF\n",
"from pdf2image import convert_from_path\n",
"\n",
"from utilities import file_names"
]
},
{
"cell_type": "markdown",
"id": "1709ea5e-bf14-44ab-9963-cda14b092dc2",
"metadata": {},
"source": [
"### 2. Input Details"
]
},
{
"cell_type": "markdown",
"id": "7ce030a1-663b-445e-a38e-24dc415942c2",
"metadata": {},
"source": [
"* **input_file_path**: Provide the gcs path of the parent folder where the sub-folders contain input files. Please follow the folder structure described earlier.\n",
"* **output_file_path**: Provide gcs path where the output json files have to be saved"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36eb0a45-7e29-41b4-a9e0-44dab8f55c17",
"metadata": {},
"outputs": [],
"source": [
"input_file_path = \"gs://<<bucket_name>>/<<input_pdf_images>>/\"\n",
"output_file_path = \"gs://<<bucket_name>>/<<output_pdf_images>>/\""
]
},
{
"cell_type": "markdown",
"id": "b4a4b1ed-2e86-4a8e-b00a-ebf6e41921ef",
"metadata": {},
"source": [
"### 3.Run the required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b2afdc0-3873-4db9-a7b7-e64cf0dce00b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def download_pdf_from_gcs(\n",
" bucket_name: str, source_blob_name: str, destination_file_name: str\n",
") -> None:\n",
" \"\"\"\n",
" Download a PDF file from a Google Cloud Storage (GCS) bucket to local storage.\n",
"\n",
" Parameters:\n",
" bucket_name (str): Name of the GCS bucket.\n",
" source_blob_name (str): Name of the blob (file) in the GCS bucket.\n",
" destination_file_name (str): Path to save the downloaded file locally.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" storage_client = storage.Client()\n",
" bucket = storage_client.bucket(bucket_name)\n",
" blob = bucket.blob(source_blob_name)\n",
"\n",
" blob.download_to_filename(destination_file_name)\n",
" print(f\"PDF file {source_blob_name} downloaded to {destination_file_name}.\")\n",
"\n",
"\n",
"def upload_pdf_to_gcs(\n",
" bucket_name: str, source_file_name: str, destination_blob_name: str\n",
") -> None:\n",
" \"\"\"\n",
" Upload a PDF file from local storage to a GCS bucket.\n",
"\n",
" Parameters:\n",
" bucket_name (str): Name of the GCS bucket.\n",
" source_file_name (str): Path of the local file to be uploaded.\n",
" destination_blob_name (str): Name of the blob (file) in the GCS bucket.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" storage_client = storage.Client()\n",
" bucket = storage_client.bucket(bucket_name)\n",
" blob = bucket.blob(destination_blob_name)\n",
"\n",
" blob.upload_from_filename(source_file_name)\n",
" print(f\"PDF file {source_file_name} uploaded to {destination_blob_name}.\")\n",
"\n",
"\n",
"def split_pdf_images_into_pages(input_pdf_path: str, output_pdf_path: str) -> None:\n",
" \"\"\"\n",
" Split images in a PDF into separate pages and save the output as a new PDF.\n",
"\n",
" Parameters:\n",
" input_pdf_path (str): Path to the input PDF file.\n",
" output_pdf_path (str): Path to save the output PDF file.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" temp_folder = \"temp_images\"\n",
" if not os.path.exists(temp_folder):\n",
" os.makedirs(temp_folder)\n",
"\n",
" pages = convert_from_path(input_pdf_path, dpi=300)\n",
" pdf = FPDF()\n",
"\n",
" for page_number, page_image in enumerate(pages):\n",
" page_array = np.array(page_image)\n",
" color_image = cv2.cvtColor(page_array, cv2.COLOR_RGB2BGR)\n",
"\n",
" gray_image = cv2.cvtColor(color_image, cv2.COLOR_BGR2GRAY)\n",
" _, binary = cv2.threshold(gray_image, 240, 255, cv2.THRESH_BINARY)\n",
" inverted_binary = cv2.bitwise_not(binary)\n",
"\n",
" contours, _ = cv2.findContours(\n",
" inverted_binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE\n",
" )\n",
"\n",
" bounding_boxes = [cv2.boundingRect(c) for c in contours]\n",
" sorted_bounding_boxes = sorted(bounding_boxes, key=lambda x: x[1])\n",
"\n",
" min_width, min_height = 100, 100\n",
" for i, (x, y, w, h) in enumerate(sorted_bounding_boxes):\n",
" if w > min_width and h > min_height:\n",
" cropped_image = color_image[y : y + h, x : x + w]\n",
" temp_image_path = os.path.join(\n",
" temp_folder, f\"temp_image_{page_number}_{i}.png\"\n",
" )\n",
" cv2.imwrite(temp_image_path, cropped_image)\n",
"\n",
" pdf.add_page()\n",
" pdf.image(temp_image_path, x=10, y=10, w=190)\n",
"\n",
" pdf.output(output_pdf_path)\n",
" print(f\"Output PDF created: {output_pdf_path}\")\n",
"\n",
"\n",
"def process_pdf_in_gcs(\n",
" input_bucket_name: str,\n",
" input_blob_name: str,\n",
" output_bucket_name: str,\n",
" output_blob_name: str,\n",
") -> None:\n",
" \"\"\"\n",
" Process a PDF stored in a GCS bucket by splitting its images into pages\n",
" and saving the output back to a GCS bucket.\n",
"\n",
" Parameters:\n",
" input_bucket_name (str): Name of the input GCS bucket.\n",
" input_blob_name (str): Name of the input blob (file) in the GCS bucket.\n",
" output_bucket_name (str): Name of the output GCS bucket.\n",
" output_blob_name (str): Name of the output blob (file) in the GCS bucket.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" local_input_pdf = \"input.pdf\"\n",
" local_output_pdf = \"output.pdf\"\n",
"\n",
" download_pdf_from_gcs(input_bucket_name, input_blob_name, local_input_pdf)\n",
" split_pdf_images_into_pages(local_input_pdf, local_output_pdf)\n",
" upload_pdf_to_gcs(output_bucket_name, local_output_pdf, output_blob_name)\n",
"\n",
" os.remove(local_input_pdf)\n",
" os.remove(local_output_pdf)"
]
},
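{
"cell_type": "markdown",
"id": "c2a9e8d7-1f3b-4a5c-8e6d-9b0a1c2d3e03",
"metadata": {},
"source": [
"Optionally, you can sanity-check the segmentation logic on a single local PDF before running the full GCS pipeline. The file names below (`sample.pdf`, `sample_output.pdf`) are placeholders for a local test file of your own."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4b8c7a6-2e5f-4d1a-9c8b-7a6e5d4c3b04",
"metadata": {},
"outputs": [],
"source": [
"# Optional local test: split one local PDF and inspect the result before\n",
"# processing files in GCS. sample.pdf is a placeholder file name.\n",
"# split_pdf_images_into_pages(\"sample.pdf\", \"sample_output.pdf\")"
]
},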
{
"cell_type": "markdown",
"id": "ac1d85cf-7529-469b-be7e-3e00a778af87",
"metadata": {},
"source": [
"### 4.Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49dab42f-6c98-4aac-b804-cd3ec81127f8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" input_bucket = input_file_path.split(\"/\")[2]\n",
" output_bucket = output_file_path.split(\"/\")[2]\n",
"\n",
" input_blob_files = list(file_names(input_file_path)[1].values())\n",
"\n",
" for input_blob in input_blob_files:\n",
" output_blob = (\n",
" \"/\".join(output_file_path.split(\"/\")[3:]) + input_blob.split(\"/\")[-1]\n",
" )\n",
" process_pdf_in_gcs(input_bucket, input_blob, output_bucket, output_blob)\n",
" print(\"Splitting for all the files are done.\")"
]
},
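{
"cell_type": "markdown",
"id": "e5c9d8b7-3f6a-4b2c-8d1e-0a9b8c7d6e05",
"metadata": {},
"source": [
"Optionally, list the objects under the output prefix to verify that a segmented PDF was written for each input file. This sketch reuses the `output_file_path` defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6d0e9c8-4a7b-4c3d-9e2f-1b0c9d8e7f06",
"metadata": {},
"outputs": [],
"source": [
"# Optional: list the generated PDFs in the output folder to verify the run.\n",
"output_bucket_name = output_file_path.split(\"/\")[2]\n",
"output_prefix = \"/\".join(output_file_path.split(\"/\")[3:])\n",
"for blob in storage.Client().list_blobs(output_bucket_name, prefix=output_prefix):\n",
"    print(blob.name)"
]
},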
{
"cell_type": "markdown",
"id": "16e3f538-c1ab-4ff1-a720-80d07ea9245b",
"metadata": {},
"source": [
"### Output Details"
]
},
{
"cell_type": "markdown",
"id": "b4badf9b-1b40-4b6a-9ac9-b7a0f233018f",
"metadata": {
"tags": []
},
"source": [
"### Before Splitting\n",
"<img src='./images/before_splitting.png' width=400 height=600 alt=\"Sample Output\"></img>\n",
"### After Splitting\n",
"<img src='./images/after_splitting.png' width=400 height=600 alt=\"Sample Output\"></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m127",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m127"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}