{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "ijGzTHJJUCPY" }, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "VEqbX8OhE8y9" }, "source": [ "# Document Processing with Gemini\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Run in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fdocument-processing%2Fdocument_processing.ipynb\">\n", " <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Run in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb\">\n", " <img 
src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/document-processing/document_processing.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://goo.gle/4jhBze9\">\n", " <img width=\"32px\" src=\"https://cdn.qwiklabs.com/assets/gcp_cloud-e3a77215f0b8bfa9b3f611c0d2208c7e8708ed31.svg\" alt=\"Google Cloud logo\"><br> Open in Cloud Skills Boost\n", " </a>\n", " </td>\n", "</table>\n", "\n", "<div style=\"clear: both;\"></div>\n", "\n", "<b>Share to:</b>\n", "\n", "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n", "</a>\n", "\n", "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n", "</a>\n", "\n", "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" 
src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n", "</a>\n", "\n", "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n", "</a>\n", "\n", "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n", "</a> \n" ] }, { "cell_type": "markdown", "metadata": { "id": "fb49ff2efb96" }, "source": [ "| Authors |\n", "| --- |\n", "| [Holt Skinner](https://github.com/holtskinner) |\n", "| [Renato Leite](https://github.com/leiterenato) |" ] }, { "cell_type": "markdown", "metadata": { "id": "CkHPv2myT2cx" }, "source": [ "## Overview\n", "\n", "In today's information-driven world, the volume of digital documents generated daily is staggering. From emails and reports to legal contracts and scientific papers, businesses and individuals alike are inundated with vast amounts of textual data. Extracting meaningful insights from these documents efficiently and accurately has become a paramount challenge.\n", "\n", "Document processing involves a range of tasks, including text extraction, classification, summarization, and translation, among others. Traditional methods often rely on rule-based algorithms or statistical models, which may struggle with the nuances and complexities of natural language.\n", "\n", "Generative AI offers a promising alternative to understand, generate, and manipulate text using natural language prompting. 
Gemini on Vertex AI allows these models to be used in a scalable manner through:\n", "\n", "- [Vertex AI Studio](https://cloud.google.com/generative-ai-studio) in the Cloud Console\n", "- [Vertex AI REST API](https://cloud.google.com/vertex-ai/docs/reference/rest)\n", "- [Google Gen AI SDK for Python](https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview)\n", "\n", "For more information, see the [Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview) documentation.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "DrkcqHrrwMAo" }, "source": [ "### Objectives\n", "\n", "In this tutorial, you will learn how to use the Gemini API in Vertex AI with the Google Gen AI SDK for Python to process PDF documents.\n", "\n", "You will complete the following tasks:\n", "\n", "- Install the SDK\n", "- Use the Gemini 2.0 Flash model to:\n", " - Extract structured entities from an unstructured document\n", " - Classify document types\n", " - Combine classification and entity extraction into a single workflow\n", " - Answer questions from documents\n", " - Summarize documents\n", " - Extract Table Data as HTML\n", " - Translate documents\n", " - Compare and contrast similar documents\n", " - Identify and extract relevant pages from a PDF" ] }, { "cell_type": "markdown", "metadata": { "id": "C9nEPojogw-g" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "- Vertex AI\n", "\n", "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "r11Gu7qNgx1p" }, "source": [ "## Getting Started\n" ] }, { "cell_type": "markdown", "metadata": { "id": "No17Cw5hgx12" }, "source": [ "### Install Google Gen AI SDK for Python\n" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": { "id": "tFy3H3aPgx12" }, "outputs": [], "source": [ "%pip install --upgrade --quiet google-genai pypdf" ] }, { "cell_type": "markdown", "metadata": { "id": "dmWOrTJ3gx13" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NyKGtVQjgx13" }, "outputs": [], "source": [ "import sys\n", "\n", "# Additional authentication is required for Google Colab\n", "if \"google.colab\" in sys.modules:\n", " # Authenticate user to Google Cloud\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "DF4l8DTdWgPY" }, "source": [ "### Set Google Cloud project information and create client\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n", "\n", "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Nqwi-5ufWp_B" }, "outputs": [], "source": [ "import os\n", "\n", "from google import genai\n", "\n", "PROJECT_ID = \"[your-project-id]\" # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n", "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n", " PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n", "\n", "LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")\n", "\n", "client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "jXHfaVS66_01" }, "source": [ "### Import libraries\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lslYAvw37JGQ" }, "outputs": [], "source": [ "from datetime import date\n", "from enum import Enum\n", "import json\n", "\n", "from IPython.display import Markdown, display\n", "from google.genai.types import GenerateContentConfig, Part\n", "from pydantic import BaseModel, Field\n", "import pypdf\n", "\n", "PDF_MIME_TYPE = \"application/pdf\"\n", "JSON_MIME_TYPE = \"application/json\"\n", "ENUM_MIME_TYPE = \"text/x.enum\"" ] }, { "cell_type": "markdown", "metadata": { "id": "FTMywdzUORIA" }, "source": [ "### Load the Gemini 2.0 Flash model\n", "\n", "Gemini 2.0 Flash (`gemini-2.0-flash`) is a multimodal model. You can include text, images, video, and documents such as PDFs in your prompt requests and get text or code responses.\n", "\n", "Learn more about all [Gemini models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e771399cfc79" }, "outputs": [], "source": [ "MODEL_ID = \"gemini-2.0-flash\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "Wy75sLb-yjNn" }, "source": [ "## Entity Extraction\n", "\n", "[Named Entity Extraction](https://en.wikipedia.org/wiki/Named-entity_recognition) is a technique of Natural Language Processing to identify specific fields and values from unstructured text. For example, you can find key-value pairs from a filled out form, or get all of the important data from an invoice categorized by the type." ] }, { "cell_type": "markdown", "metadata": { "id": "7a75f6e4bd54" }, "source": [ "### Extract entities from an invoice\n", "\n", "In this example, you will use a sample invoice and get all of the information in a structured format.\n", "\n", "This is the prompt to be sent to Gemini along with the PDF document. Feel free to edit this for your specific use case." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0841cb312d46" }, "outputs": [], "source": [ "entity_extraction_system_instruction = \"\"\"You are a document entity extraction specialist. 
Given a document, your task is to extract the text value of the entities provided in the schema.\n", "- The values must only include text found in the document\n", "- Do not normalize any entity values.\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "id": "802016a08f79" }, "source": [ "We will use [Controlled generation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output) to tell the model which fields need to be extracted.\n", "\n", "The response schema is specified in the `response_schema` parameter in `config`, and the model output will strictly follow that schema.\n", "\n", "You can provide the schemas as [Pydantic](https://docs.pydantic.dev/) models or a [JSON](https://www.json.org/json-en.html) string and the model will respond as JSON or an [Enum](https://docs.python.org/3/library/enum.html) depending on the value set in `response_mime_type`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "608a06507932" }, "outputs": [], "source": [ "class Address(BaseModel):\n", " street: str | None = Field(None, example=\"123 Main St\")\n", " city: str | None = Field(None, example=\"Springfield\")\n", " state: str | None = Field(None, example=\"IL\")\n", " postal_code: str | None = Field(None, example=\"62704\")\n", " country: str | None = Field(None, example=\"USA\")\n", "\n", "\n", "class LineItem(BaseModel):\n", " amount: float = Field(..., example=100.00)\n", " description: str | None = Field(None, example=\"Laptop\")\n", " product_code: str | None = Field(None, example=\"LPT-001\")\n", " quantity: int = Field(..., example=2)\n", " unit: str | None = Field(None, example=\"pcs\")\n", " unit_price: float = Field(..., example=50.00)\n", "\n", "\n", "class VAT(BaseModel):\n", " amount: float = Field(..., example=20.00)\n", " category_code: str | None = Field(None, example=\"A\")\n", " tax_amount: float | None = Field(None, example=5.00)\n", " tax_rate: float | None = Field(\n", " None, 
example=10.0\n", " ) # Percentage as a float (e.g., 10 for 10%)\n", " total_amount: float = Field(..., example=200.00)\n", "\n", "\n", "class Party(BaseModel):\n", " name: str = Field(..., example=\"Google\")\n", " street: str | None = Field(None, example=\"456 Business Rd\")\n", " city: str | None = Field(None, example=\"Metropolis\")\n", " state: str | None = Field(None, example=\"NY\")\n", " postal_code: str | None = Field(None, example=\"10001\")\n", " country: str | None = Field(None, example=\"USA\")\n", " email: str | None = Field(None, example=\"contact@google.com\")\n", " phone: str | None = Field(None, example=\"+1-555-1234\")\n", " website: str | None = Field(None, example=\"https://google.com\")\n", " tax_id: str | None = Field(None, example=\"123456789\")\n", " registration: str | None = Field(None, example=\"Reg-98765\")\n", " iban: str | None = Field(None, example=\"US1234567890123456789\")\n", " payment_ref: str | None = Field(None, example=\"INV-2024-001\")\n", "\n", "\n", "class Invoice(BaseModel):\n", " invoice_id: str = Field(..., example=\"INV-2024-001\")\n", " invoice_date: str = Field(..., example=\"2024-02-03\")\n", " supplier: Party\n", " receiver: Party\n", " line_items: list[LineItem]\n", " vat: list[VAT]" ] }, { "cell_type": "markdown", "metadata": { "id": "91dcaf17c2ce" }, "source": [ "For this example, we will download a PDF document to local storage and send the file bytes to the API for processing.\n", "\n", "You can view the document [here](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/invoice.pdf)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "42b044f767e3" }, "outputs": [], "source": [ "# Download a PDF from Google Cloud Storage\n", "! 
gsutil cp \"gs://cloud-samples-data/generative-ai/pdf/invoice.pdf\" ./invoice.pdf" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KzqjpEiryjNo" }, "outputs": [], "source": [ "# Load file bytes\n", "with open(\"invoice.pdf\", \"rb\") as f:\n", " file_bytes = f.read()\n", "\n", "# Send to Gemini API\n", "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " \"The following document is an invoice.\",\n", " Part.from_bytes(data=file_bytes, mime_type=PDF_MIME_TYPE),\n", " ],\n", " config=GenerateContentConfig(\n", " system_instruction=entity_extraction_system_instruction,\n", " temperature=0,\n", " response_schema=Invoice,\n", " response_mime_type=JSON_MIME_TYPE,\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "131e8044cf70" }, "source": [ "The extracted data is available as an `Invoice` object through the `response.parsed` field." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "63f7f16fabc7" }, "outputs": [], "source": [ "invoice_data = response.parsed\n", "print(\"\\n-------Extracted Entities--------\")\n", "print(invoice_data)" ] }, { "cell_type": "markdown", "metadata": { "id": "c82b9d10e9d1" }, "source": [ "Alternatively, `response.text` can be parsed as JSON into a Python dictionary for use in other applications." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ce9731cb0a84" }, "outputs": [], "source": [ "json_object = json.loads(response.text)\n", "print(json_object)" ] }, { "cell_type": "markdown", "metadata": { "id": "c7cdda6aa720" }, "source": [ "You can see that Gemini extracted all of the relevant fields from the document." 
] }, { "cell_type": "markdown", "metadata": { "id": "4dca9fa02c05" }, "source": [ "### Extract entities from a payslip\n", "\n", "Let's try with another type of document, a payslip or paystub.\n", "\n", "In this example, we will use a document hosted on Google Cloud Storage and process it by passing the URI.\n", "\n", "You can view the document [here](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/earnings_statement.pdf)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3ca20cd3f738" }, "outputs": [], "source": [ "class Payslip(BaseModel):\n", " employee_id: str = Field(..., description=\"Unique identifier for the employee\")\n", " employee_name: str = Field(..., description=\"Full name of the employee\")\n", " pay_period_start: date = Field(..., description=\"Start date of the pay period\")\n", " pay_period_end: date = Field(..., description=\"End date of the pay period\")\n", " gross_income: float = Field(..., description=\"Total income before deductions\")\n", " federal_tax: float = Field(..., description=\"Federal tax deduction amount\")\n", " state_tax: float | None = Field(\n", " 0.0, description=\"State tax deduction amount, if applicable\"\n", " )\n", " social_security: float = Field(..., description=\"Social Security deduction amount\")\n", " medicare: float = Field(..., description=\"Medicare deduction amount\")\n", " other_deductions: float | None = Field(\n", " 0.0, description=\"Other deductions (e.g., health insurance, retirement)\"\n", " )\n", " net_income: float = Field(..., description=\"Income after all deductions\")\n", " payment_date: date = Field(..., description=\"Date the payment was issued\")\n", " hours_worked: float | None = Field(\n", " None, description=\"Total hours worked in the pay period\"\n", " )\n", " hourly_rate: float | None = Field(\n", " None, description=\"Employee's hourly rate, if applicable\"\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": 
"06d34a6f08d9" }, "outputs": [], "source": [ "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " \"The following document is a Payslip.\",\n", " Part.from_uri(\n", " file_uri=\"gs://cloud-samples-data/generative-ai/pdf/earnings_statement.pdf\",\n", " mime_type=PDF_MIME_TYPE,\n", " ),\n", " ],\n", " config=GenerateContentConfig(\n", " system_instruction=entity_extraction_system_instruction,\n", " temperature=0,\n", " response_schema=Payslip,\n", " response_mime_type=JSON_MIME_TYPE,\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "230b3ae51289" }, "outputs": [], "source": [ "print(\"\\n-------Extracted Entities--------\")\n", "print(response.parsed)" ] }, { "cell_type": "markdown", "metadata": { "id": "Uhtahn_jTZKC" }, "source": [ "## Document Classification\n", "\n", "Document classification is the process for identifying the type of document. For example, invoice, W-2, receipt, etc.\n", "\n", "In this example, you will use a [sample tax form (W-9)](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/w9.pdf) and get the specific type of document from a specified `Enum`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d797c2bfb490" }, "outputs": [], "source": [ "classification_prompt = \"\"\"You are a document classification specialist. 
Given a document, your task is to find which category the document belongs to from the document categories provided in the schema.\"\"\"\n", "\n", "\n", "class DocumentCategory(Enum):\n", " TAX_1040_2019 = \"1040_2019\"\n", " TAX_1040_2020 = \"1040_2020\"\n", " TAX_1099_R = \"1099-r\"\n", " BANK_STATEMENT = \"bank_statement\"\n", " CREDIT_CARD_STATEMENT = \"credit_card_statement\"\n", " EXPENSE = \"expense\"\n", " TAX_1120S_2019 = \"form_1120S_2019\"\n", " TAX_1120S_2020 = \"form_1120S_2020\"\n", " INVESTMENT_RETIREMENT_STATEMENT = \"investment_retirement_statement\"\n", " INVOICE = \"invoice\"\n", " PAYSTUB = \"paystub\"\n", " PROPERTY_INSURANCE = \"property_insurance\"\n", " PURCHASE_ORDER = \"purchase_order\"\n", " UTILITY_STATEMENT = \"utility_statement\"\n", " W2 = \"w2\"\n", " W9 = \"w9\"\n", " DRIVER_LICENSE = \"driver_license\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7dcab4a008a5" }, "outputs": [], "source": [ "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " \"Classify the following document.\",\n", " Part.from_uri(\n", " file_uri=\"https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/w9.pdf\",\n", " mime_type=PDF_MIME_TYPE,\n", " ),\n", " ],\n", " config=GenerateContentConfig(\n", " system_instruction=classification_prompt,\n", " temperature=0,\n", " response_schema=DocumentCategory,\n", " response_mime_type=ENUM_MIME_TYPE,\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "200922ddac39" }, "outputs": [], "source": [ "print(\"\\n-------Document Classification--------\")\n", "print(response.text)\n", "print(response.parsed)" ] }, { "cell_type": "markdown", "metadata": { "id": "d99b968e9faa" }, "source": [ "You can see that Gemini successfully categorized the document." 
] }, { "cell_type": "markdown", "metadata": { "id": "9c41c7273b66" }, "source": [ "### Chaining Classification and Extraction\n", "\n", "These techniques can also be chained together to extract any number of document types.\n", "\n", "For example, if you have multiple types of documents to process, you can send each document to Gemini with a classification prompt, then based on that output, you can write logic to decide which extraction prompt to use.\n", "\n", "These are the sample documents:\n", "\n", "- [US Driver License](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf)\n", "- [Invoice](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/INVOICE_PROCESSOR/google_invoice.pdf)\n", "- [Form W-2](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/FORM_W2_PROCESSOR/2020FormW-2.pdf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "69fd5883a812" }, "outputs": [], "source": [ "class W2Form(BaseModel):\n", " control_number: str | None = Field(None)\n", " ein: str = Field(...)\n", "\n", " employee_first_name: str = Field(...)\n", " employee_last_name: str = Field(...)\n", " employee_address_street: str = Field(...)\n", " employee_address_city: str = Field(...)\n", " employee_address_state: str = Field(...)\n", " employee_address_zip: str = Field(...)\n", "\n", " employer_name: str = Field(...)\n", " employer_address_street: str = Field(...)\n", " employer_address_city: str = Field(...)\n", " employer_address_state: str = Field(...)\n", " employer_address_zip: str = Field(...)\n", " employer_state_id_number: str | None = Field(None)\n", "\n", " wages_tips_other_compensation: float = Field(...)\n", " federal_income_tax_withheld: float = Field(...)\n", " social_security_wages: float = Field(...)\n", " social_security_tax_withheld: float = Field(...)\n", " medicare_wages_and_tips: float = Field(...)\n", " medicare_tax_withheld: float = 
Field(...)\n", "\n", " state: str | None = Field(None)\n", " state_wages_tips_etc: float | None = Field(None)\n", " state_income_tax: float | None = Field(None)\n", "\n", " box_12_code: str | None = Field(None)\n", " box_12_value: str | None = Field(None)\n", "\n", " form_year: int = Field(...)\n", "\n", "\n", "class DriversLicense(BaseModel):\n", " address: str = Field(\n", " ..., title=\"Address\", description=\"The address of the individual.\"\n", " )\n", " date_of_birth: date = Field(\n", " ..., title=\"Date of Birth\", description=\"The birthdate of the individual.\"\n", " )\n", " document_id: str = Field(\n", " ...,\n", " title=\"Document ID\",\n", " description=\"The unique document ID for the driver's license.\",\n", " )\n", " expiration_date: date = Field(\n", " ...,\n", " title=\"Expiration Date\",\n", " description=\"The expiration date of the driver's license.\",\n", " )\n", " family_name: str = Field(\n", " ...,\n", " title=\"Family Name\",\n", " description=\"The family name (last name) of the individual.\",\n", " )\n", " given_names: str = Field(\n", " ...,\n", " title=\"Given Names\",\n", " description=\"The given names (first and middle names) of the individual.\",\n", " )\n", " issue_date: date = Field(\n", " ..., title=\"Issue Date\", description=\"The issue date of the driver's license.\"\n", " )\n", "\n", "\n", "# Map classification types to schemas\n", "classification_to_schema = {\n", " DocumentCategory.INVOICE: Invoice,\n", " DocumentCategory.W2: W2Form,\n", " DocumentCategory.DRIVER_LICENSE: DriversLicense,\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2c806b4d757e" }, "outputs": [], "source": [ "gcs_uris = [\n", " \"gs://cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf\",\n", " \"gs://cloud-samples-data/documentai/SampleDocuments/INVOICE_PROCESSOR/google_invoice.pdf\",\n", " \"gs://cloud-samples-data/documentai/SampleDocuments/FORM_W2_PROCESSOR/2020FormW-2.pdf\",\n", 
"]\n", "\n", "for gcs_uri in gcs_uris:\n", " print(f\"\\nFile: {gcs_uri}\\n\")\n", "\n", " # Send to Gemini with Classification Prompt\n", " classification_response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " \"Classify the following document.\",\n", " Part.from_uri(file_uri=gcs_uri, mime_type=PDF_MIME_TYPE),\n", " ],\n", " config=GenerateContentConfig(\n", " system_instruction=classification_prompt,\n", " temperature=0,\n", " response_schema=DocumentCategory,\n", " response_mime_type=ENUM_MIME_TYPE,\n", " ),\n", " )\n", "\n", " print(f\"Document Classification: {classification_response.text}\")\n", "\n", " # Get Extraction schema based on Classification\n", " extraction_schema = classification_to_schema.get(classification_response.parsed)\n", "\n", " if not extraction_schema:\n", " print(f\"Document does not belong to a specified class. Skipping extraction.\")\n", " continue\n", "\n", " # Send to Gemini with Extraction Prompt\n", " extraction_response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " f\"Extract the entities from the following {classification_response.text} document.\",\n", " Part.from_uri(file_uri=gcs_uri, mime_type=PDF_MIME_TYPE),\n", " ],\n", " config=GenerateContentConfig(\n", " system_instruction=classification_prompt,\n", " temperature=0,\n", " response_schema=extraction_schema,\n", " response_mime_type=JSON_MIME_TYPE,\n", " ),\n", " )\n", "\n", " print(\"\\n-------Extracted Entities--------\")\n", " print(extraction_response.parsed)" ] }, { "cell_type": "markdown", "metadata": { "id": "322abdb6d63d" }, "source": [ "## Document Question Answering\n", "\n", "Gemini can be used to answer questions about a document.\n", "\n", "This example answers a question about the Transformer model paper [\"Attention is all you need\"](https://arxiv.org/pdf/1706.03762), we will be loading the PDF file directly from the source on [arXiv](https://arxiv.org)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "f47a8b63ce13" }, "outputs": [], "source": [ "qa_system_instruction = \"You are a question answering specialist. Given a question and a context, your task is to provide the answer to the question based on the context provided. Give the answer first, followed by an explanation.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "636f158c24fb" }, "outputs": [], "source": [ "# Send Q&A Prompt to Gemini\n", "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " \"What is the attention mechanism?\",\n", " Part.from_uri(\n", " file_uri=\"gs://cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf\",\n", " mime_type=PDF_MIME_TYPE,\n", " ),\n", " ],\n", " config=GenerateContentConfig(\n", " system_instruction=qa_system_instruction,\n", " temperature=0,\n", " response_mime_type=\"text/plain\",\n", " ),\n", ")\n", "\n", "print(f\"Answer: {response.text}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "d5881bdeb3b0" }, "source": [ "## Document Summarization\n", "\n", "Gemini can also be used to summarize or paraphrase a document's contents. Your prompt can specify how detailed the summary should be or specific formatting, such as bullet points or paragraphs." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "85b23b916ffa" }, "outputs": [], "source": [ "summarization_system_instruction = \"\"\"You are a professional document summarization specialist. 
Given a document, your task is to provide a detailed summary of the content of the document.\n", "\n", "If it includes images, provide descriptions of the images.\n", "If it includes tables, extract all elements of the tables.\n", "If it includes graphs, explain the findings in the graphs.\n", "Do not include any numbers that are not mentioned in the document.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "01c2c8c947e0" }, "outputs": [], "source": [ "# Send Summarization Prompt to Gemini\n", "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " \"Summarize the following document.\",\n", " Part.from_uri(\n", " file_uri=\"gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf\",\n", " mime_type=PDF_MIME_TYPE,\n", " ),\n", " ],\n", " config=GenerateContentConfig(\n", " system_instruction=summarization_system_instruction,\n", " temperature=0,\n", " response_mime_type=\"text/plain\",\n", " ),\n", ")\n", "\n", "display(Markdown(f\"### Document Summary\"))\n", "display(Markdown(response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "85015f00a36f" }, "source": [ "## Table parsing from documents\n", "\n", "Gemini can parse contents of a table and return it in a structured format, such as HTML or markdown." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b780755d42e0" }, "outputs": [], "source": [ "table_extraction_prompt = \"\"\"What is the HTML code of the table in this document?\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2ad318a19c6a" }, "outputs": [], "source": [ "# Send Table Extraction Prompt to Gemini\n", "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " table_extraction_prompt,\n", " Part.from_uri(\n", " file_uri=\"gs://cloud-samples-data/generative-ai/pdf/salary_table.pdf\",\n", " mime_type=PDF_MIME_TYPE,\n", " ),\n", " ],\n", " config=GenerateContentConfig(temperature=0),\n", ")\n", "\n", "display(Markdown(response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "1ebe7318abf6" }, "source": [ "## Document Translation\n", "\n", "Gemini can translate documents between languages. This example translates meeting notes from English into French and Spanish." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c03f55376e76" }, "outputs": [], "source": [ "translation_prompt = \"\"\"Translate the first paragraph into French and Spanish. 
Label each paragraph with the target language.\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0e22d1c06508" }, "outputs": [], "source": [ "# Send Translation Prompt to Gemini\n", "response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        translation_prompt,\n", "        Part.from_uri(\n", "            file_uri=\"gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf\",\n", "            mime_type=PDF_MIME_TYPE,\n", "        ),\n", "    ],\n", "    config=GenerateContentConfig(\n", "        temperature=0,\n", "    ),\n", ")\n", "\n", "display(Markdown(\"### Translations\"))\n", "display(Markdown(response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "0e8111f438db" }, "source": [ "## Document Comparison\n", "\n", "Gemini can compare and contrast the contents of multiple documents. This example compares the 2013 and 2023 versions of IRS Form 1040 and asks how the standard deduction changed.\n", "\n", "Note: when working with multiple documents, the order can matter and should be specified in your prompt." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "62bd15c5553f" }, "outputs": [], "source": [ "comparison_prompt = \"\"\"The first document is from 2013, the second one from 2023. 
How did the standard deduction evolve?\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e5f07456ed8d" }, "outputs": [], "source": [ "# Send Comparison Prompt to Gemini\n", "response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        comparison_prompt,\n", "        Part.from_uri(\n", "            file_uri=\"gs://cloud-samples-data/generative-ai/pdf/form_1040_2013.pdf\",\n", "            mime_type=PDF_MIME_TYPE,\n", "        ),\n", "        Part.from_uri(\n", "            file_uri=\"gs://cloud-samples-data/generative-ai/pdf/form_1040_2023.pdf\",\n", "            mime_type=PDF_MIME_TYPE,\n", "        ),\n", "    ],\n", "    config=GenerateContentConfig(temperature=0),\n", ")\n", "\n", "display(Markdown(\"### Comparison\"))\n", "display(Markdown(response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "1f99e3fb7a97" }, "source": [ "## Document Page Extraction\n", "\n", "This example uses Gemini to identify the pages relevant to a question and then creates a new PDF containing only those pages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ee236963ec3d" }, "outputs": [], "source": [ "PROMPT_PAGES = \"\"\"\n", "Return the numbers of all pages in the document above that contain information related to the question below.\n", "<Instructions>\n", " - Use the document above as your only source of information to determine which pages are related to the question below.\n", " - Return the page numbers of the document above that are related to the question. When in doubt, return the page anyway.\n", " - The page numbers should be in the format of a list of integers, e.g. [1, 2, 3].\n", "</Instructions>\n", "<Suggestions>\n", " - The document above is a financial report with various tables, charts, infographics, lists, and additional text information.\n", " - Pay CLOSE ATTENTION to the chart legends and chart COLORS to determine the pages. 
Colors may indicate which information is important for determining the pages.\n", " - The color of the chart legends represents the color of the bars in the chart.\n", " - Use ONLY this document as context to determine the pages.\n", " - In most cases, the page number can be found in the footer.\n", "</Suggestions>\n", "<Question>\n", "{question}\n", "</Question>\n", "\"\"\"\n", "\n", "\n", "def pdf_slice(input_file: str, output_file: str, pages: list[int]) -> None:\n", "    \"\"\"Write a new PDF to output_file containing only the given pages of input_file.\n", "\n", "    Pages are 1-based; page numbers outside the document are skipped.\n", "    \"\"\"\n", "    pdf_reader = pypdf.PdfReader(input_file)\n", "    pdf_writer = pypdf.PdfWriter()\n", "    for page_num in pages:\n", "        if 1 <= page_num <= len(pdf_reader.pages):\n", "            pdf_writer.add_page(pdf_reader.pages[page_num - 1])\n", "    pdf_writer.write(output_file)" ] }, { "cell_type": "markdown", "metadata": { "id": "5a8fad605e6a" }, "source": [ "Enter your question and the URL of your PDF." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ffcad6183966" }, "outputs": [], "source": [ "question = \"From the Consolidated Balance Sheet, what was the difference between the total assets from 2022 to 2023?\" # @param {type: \"string\"}\n", "pdf_path = \"https://storage.googleapis.com/github-repo/generative-ai/gemini/use-cases/document-processing/CymbalBankFinancialStatements.pdf\" # @param {type: \"string\"}\n", "local_pdf = os.path.basename(pdf_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "25ba38644e57" }, "source": [ "Extract the relevant pages using Gemini and print them."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "31408efe4118" }, "outputs": [], "source": [ "response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        \"<Document>\",\n", "        Part.from_uri(file_uri=pdf_path, mime_type=PDF_MIME_TYPE),\n", "        \"</Document>\",\n", "        PROMPT_PAGES.format(question=question),\n", "    ],\n", "    config=GenerateContentConfig(\n", "        temperature=0,\n", "        response_mime_type=JSON_MIME_TYPE,\n", "        response_schema=list[int],\n", "    ),\n", ")\n", "pages = response.parsed\n", "print(pages)" ] }, { "cell_type": "markdown", "metadata": { "id": "6d2a92e816d4" }, "source": [ "Download the PDF file to local storage." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "33474e737de4" }, "outputs": [], "source": [ "!wget {pdf_path} -O {local_pdf}" ] }, { "cell_type": "markdown", "metadata": { "id": "bde5fbbbbe08" }, "source": [ "To improve the chance that the sliced PDF contains the full answer, we also include the page immediately after each selected page." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c05346c2724e" }, "outputs": [], "source": [ "expanded_pages = set(pages).union(page + 1 for page in pages)\n", "pdf_slice(input_file=local_pdf, output_file=\"sample.pdf\", pages=sorted(expanded_pages))" ] } ], "metadata": { "colab": { "name": "document_processing.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }
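
The last cell of the notebook adds the page after each selected page and leaves it to `pdf_slice` to drop any page number past the end of the document. A minimal sketch of the same expansion as a standalone helper might look like the following. The name `expand_pages` and its `total_pages` and `lookahead` parameters are hypothetical and do not appear in the notebook; the sketch assumes 1-based page numbers, as `pdf_slice` does.

```python
def expand_pages(pages: list[int], total_pages: int, lookahead: int = 1) -> list[int]:
    """Return sorted, de-duplicated 1-based page numbers.

    Each valid page in `pages` is kept, together with up to `lookahead`
    pages that follow it. Anything outside 1..total_pages is dropped.
    (Hypothetical helper, not part of the original notebook.)
    """
    expanded: set[int] = set()
    for page in pages:
        # Ignore page numbers the model returned that are not in the document.
        if not 1 <= page <= total_pages:
            continue
        for offset in range(lookahead + 1):
            candidate = page + offset
            if candidate <= total_pages:
                expanded.add(candidate)
    return sorted(expanded)


# Example: pages 3, 4 and 10 of a 10-page document.
print(expand_pages([3, 4, 10], total_pages=10))  # prints [3, 4, 5, 10]
```

Filtering before slicing keeps the page list that is printed or logged consistent with the pages that actually end up in the output PDF.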

gemini/use-cases/document-processing/document_processing.ipynb (1,280 lines of code) (raw):