{
"cells": [
{
"cell_type": "markdown",
"id": "d9ad8788-654c-4e38-9d6d-42158d95c6ee",
"metadata": {},
"source": [
"# Image Segmentation"
]
},
{
"cell_type": "markdown",
"id": "08400e10-6639-43a6-8e2a-0b4f28ba08e2",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "8cc2d117-9e2a-4ebe-8aef-94ad4f2b46b1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "94f1d1c0-823d-46d4-9a4a-e5ffa56c6b55",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This document provides a step-by-step guide on how to process a PDF file containing multiple images separated by white spaces, extract individual images, and save each image as a separate page in a new PDF. "
]
},
{
"cell_type": "markdown",
"id": "ffe00a75-567e-4f33-b4b4-aa72f9f4fe50",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Python : Jupyter Notebook (Vertex).\n",
"* Storage Bucket."
]
},
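{
"cell_type": "markdown",
"id": "3f6c1a2e-0b7d-4c8e-9a1f-5d2e7b8c9d01",
"metadata": {},
"source": [
"If this notebook is not already authenticated against Google Cloud, you can optionally set up Application Default Credentials and a default project first. The commands below are a minimal sketch; `<<project_id>>` is a placeholder for your own project ID."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e4d2b1c-5a6f-4e3d-8c9b-0a1f2e3d4c02",
"metadata": {},
"outputs": [],
"source": [
"# Optional: authenticate and set the default project (uncomment if needed).\n",
"# !gcloud auth application-default login\n",
"# !gcloud config set project <<project_id>>"
]
},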
{
"cell_type": "markdown",
"id": "9b33b7f0-baf8-421c-b476-2d25b1ee9b02",
"metadata": {},
"source": [
"## Step by Step Procedure"
]
},
{
"cell_type": "markdown",
"id": "eadfe192-5691-4053-8670-32ef3e5bea8c",
"metadata": {},
"source": [
"### 1. Import Modules/Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea066b24-5e3b-4c28-a56f-263fc6e9aa67",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d8e2a54-e250-4924-8f58-9a4642e6ddf3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install google-cloud-documentai google-cloud-storage\n",
"!pip install opencv-python-headless fpdf pdf2image"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6495b161-ac1b-40b2-a668-fe835da2950a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"import cv2\n",
"import numpy as np\n",
"import os\n",
"from fpdf import FPDF\n",
"from pdf2image import convert_from_path\n",
"\n",
"from utilities import file_names"
]
},
{
"cell_type": "markdown",
"id": "1709ea5e-bf14-44ab-9963-cda14b092dc2",
"metadata": {},
"source": [
"### 2. Input Details"
]
},
{
"cell_type": "markdown",
"id": "7ce030a1-663b-445e-a38e-24dc415942c2",
"metadata": {},
"source": [
"* **input_file_path**: Provide the gcs path of the parent folder where the sub-folders contain input files. Please follow the folder structure described earlier.\n",
"* **output_file_path**: Provide gcs path where the output json files have to be saved"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36eb0a45-7e29-41b4-a9e0-44dab8f55c17",
"metadata": {},
"outputs": [],
"source": [
"input_file_path = \"gs://<<bucket_name>>/<<input_pdf_images>>/\"\n",
"output_file_path = \"gs://<<bucket_name>>/<<output_pdf_images>>/\""
]
},
{
"cell_type": "markdown",
"id": "b4a4b1ed-2e86-4a8e-b00a-ebf6e41921ef",
"metadata": {},
"source": [
"### 3.Run the required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b2afdc0-3873-4db9-a7b7-e64cf0dce00b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def download_pdf_from_gcs(\n",
" bucket_name: str, source_blob_name: str, destination_file_name: str\n",
") -> None:\n",
" \"\"\"\n",
" Download a PDF file from a Google Cloud Storage (GCS) bucket to local storage.\n",
"\n",
" Parameters:\n",
" bucket_name (str): Name of the GCS bucket.\n",
" source_blob_name (str): Name of the blob (file) in the GCS bucket.\n",
" destination_file_name (str): Path to save the downloaded file locally.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" storage_client = storage.Client()\n",
" bucket = storage_client.bucket(bucket_name)\n",
" blob = bucket.blob(source_blob_name)\n",
"\n",
" blob.download_to_filename(destination_file_name)\n",
" print(f\"PDF file {source_blob_name} downloaded to {destination_file_name}.\")\n",
"\n",
"\n",
"def upload_pdf_to_gcs(\n",
" bucket_name: str, source_file_name: str, destination_blob_name: str\n",
") -> None:\n",
" \"\"\"\n",
" Upload a PDF file from local storage to a GCS bucket.\n",
"\n",
" Parameters:\n",
" bucket_name (str): Name of the GCS bucket.\n",
" source_file_name (str): Path of the local file to be uploaded.\n",
" destination_blob_name (str): Name of the blob (file) in the GCS bucket.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" storage_client = storage.Client()\n",
" bucket = storage_client.bucket(bucket_name)\n",
" blob = bucket.blob(destination_blob_name)\n",
"\n",
" blob.upload_from_filename(source_file_name)\n",
" print(f\"PDF file {source_file_name} uploaded to {destination_blob_name}.\")\n",
"\n",
"\n",
"def split_pdf_images_into_pages(input_pdf_path: str, output_pdf_path: str) -> None:\n",
" \"\"\"\n",
" Split images in a PDF into separate pages and save the output as a new PDF.\n",
"\n",
" Parameters:\n",
" input_pdf_path (str): Path to the input PDF file.\n",
" output_pdf_path (str): Path to save the output PDF file.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" temp_folder = \"temp_images\"\n",
" if not os.path.exists(temp_folder):\n",
" os.makedirs(temp_folder)\n",
"\n",
" pages = convert_from_path(input_pdf_path, dpi=300)\n",
" pdf = FPDF()\n",
"\n",
" for page_number, page_image in enumerate(pages):\n",
" page_array = np.array(page_image)\n",
" color_image = cv2.cvtColor(page_array, cv2.COLOR_RGB2BGR)\n",
"\n",
" gray_image = cv2.cvtColor(color_image, cv2.COLOR_BGR2GRAY)\n",
" _, binary = cv2.threshold(gray_image, 240, 255, cv2.THRESH_BINARY)\n",
" inverted_binary = cv2.bitwise_not(binary)\n",
"\n",
" contours, _ = cv2.findContours(\n",
" inverted_binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE\n",
" )\n",
"\n",
" bounding_boxes = [cv2.boundingRect(c) for c in contours]\n",
" sorted_bounding_boxes = sorted(bounding_boxes, key=lambda x: x[1])\n",
"\n",
" min_width, min_height = 100, 100\n",
" for i, (x, y, w, h) in enumerate(sorted_bounding_boxes):\n",
" if w > min_width and h > min_height:\n",
" cropped_image = color_image[y : y + h, x : x + w]\n",
" temp_image_path = os.path.join(\n",
" temp_folder, f\"temp_image_{page_number}_{i}.png\"\n",
" )\n",
" cv2.imwrite(temp_image_path, cropped_image)\n",
"\n",
" pdf.add_page()\n",
" pdf.image(temp_image_path, x=10, y=10, w=190)\n",
"\n",
" pdf.output(output_pdf_path)\n",
" print(f\"Output PDF created: {output_pdf_path}\")\n",
"\n",
"\n",
"def process_pdf_in_gcs(\n",
" input_bucket_name: str,\n",
" input_blob_name: str,\n",
" output_bucket_name: str,\n",
" output_blob_name: str,\n",
") -> None:\n",
" \"\"\"\n",
" Process a PDF stored in a GCS bucket by splitting its images into pages\n",
" and saving the output back to a GCS bucket.\n",
"\n",
" Parameters:\n",
" input_bucket_name (str): Name of the input GCS bucket.\n",
" input_blob_name (str): Name of the input blob (file) in the GCS bucket.\n",
" output_bucket_name (str): Name of the output GCS bucket.\n",
" output_blob_name (str): Name of the output blob (file) in the GCS bucket.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" local_input_pdf = \"input.pdf\"\n",
" local_output_pdf = \"output.pdf\"\n",
"\n",
" download_pdf_from_gcs(input_bucket_name, input_blob_name, local_input_pdf)\n",
" split_pdf_images_into_pages(local_input_pdf, local_output_pdf)\n",
" upload_pdf_to_gcs(output_bucket_name, local_output_pdf, output_blob_name)\n",
"\n",
" os.remove(local_input_pdf)\n",
" os.remove(local_output_pdf)"
]
},
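{
"cell_type": "markdown",
"id": "c2a9e8d7-1f3b-4a5c-8e6d-9b0a1c2d3e03",
"metadata": {},
"source": [
"Optionally, you can sanity-check the segmentation logic on a single local PDF before running the full GCS pipeline. The file names below (`sample.pdf`, `sample_output.pdf`) are placeholders for a local test file of your own."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4b8c7a6-2e5f-4d1a-9c8b-7a6e5d4c3b04",
"metadata": {},
"outputs": [],
"source": [
"# Optional local test: split one local PDF and inspect the result before\n",
"# processing files in GCS. sample.pdf is a placeholder file name.\n",
"# split_pdf_images_into_pages(\"sample.pdf\", \"sample_output.pdf\")"
]
},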
{
"cell_type": "markdown",
"id": "ac1d85cf-7529-469b-be7e-3e00a778af87",
"metadata": {},
"source": [
"### 4.Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49dab42f-6c98-4aac-b804-cd3ec81127f8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" input_bucket = input_file_path.split(\"/\")[2]\n",
" output_bucket = output_file_path.split(\"/\")[2]\n",
"\n",
" input_blob_files = list(file_names(input_file_path)[1].values())\n",
"\n",
" for input_blob in input_blob_files:\n",
" output_blob = (\n",
" \"/\".join(output_file_path.split(\"/\")[3:]) + input_blob.split(\"/\")[-1]\n",
" )\n",
" process_pdf_in_gcs(input_bucket, input_blob, output_bucket, output_blob)\n",
" print(\"Splitting for all the files are done.\")"
]
},
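{
"cell_type": "markdown",
"id": "e5c9d8b7-3f6a-4b2c-8d1e-0a9b8c7d6e05",
"metadata": {},
"source": [
"Optionally, list the objects under the output prefix to verify that a segmented PDF was written for each input file. This sketch reuses the `output_file_path` defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6d0e9c8-4a7b-4c3d-9e2f-1b0c9d8e7f06",
"metadata": {},
"outputs": [],
"source": [
"# Optional: list the generated PDFs in the output folder to verify the run.\n",
"output_bucket_name = output_file_path.split(\"/\")[2]\n",
"output_prefix = \"/\".join(output_file_path.split(\"/\")[3:])\n",
"for blob in storage.Client().list_blobs(output_bucket_name, prefix=output_prefix):\n",
"    print(blob.name)"
]
},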
{
"cell_type": "markdown",
"id": "16e3f538-c1ab-4ff1-a720-80d07ea9245b",
"metadata": {},
"source": [
"### Output Details"
]
},
{
"cell_type": "markdown",
"id": "b4badf9b-1b40-4b6a-9ac9-b7a0f233018f",
"metadata": {
"tags": []
},
"source": [
"### Before Splitting\n",
"<img src='./images/before_splitting.png' width=400 height=600 alt=\"Sample Output\"></img>\n",
"### After Splitting\n",
"<img src='./images/after_splitting.png' width=400 height=600 alt=\"Sample Output\"></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m127",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m127"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}