
{ "cells": [ { "cell_type": "markdown", "id": "5cbcbc6b-0c0b-42d5-8619-ebd1f4bf8657", "metadata": {}, "source": [ "# Update CDE Schema" ] }, { "cell_type": "markdown", "id": "a054d9da-3034-4fb0-b828-cd536198f68c", "metadata": {}, "source": [ "* Author: docai-incubator@google.com" ] }, { "cell_type": "markdown", "id": "108c2600-51ad-4085-9555-7f3332ccc1b4", "metadata": {}, "source": [ "## Disclaimer\n", "\n", "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.\n" ] }, { "cell_type": "markdown", "id": "4c8d9223-2537-4ae0-98a5-32e372d1419d", "metadata": {}, "source": [ "## Objective\n", "\n", "This tool lets users edit a Custom Document Extractor (CDE) schema via API call. It supports adding new schema entities (with single-level nesting), deleting entities (deletion fails if the processor has previously been trained on the entity), modifying the Occurrence Type, and modifying the Data Type.\n", "\n", "Support for modifying entity Descriptions (for compatibility with the prompting feature) is planned." ] }, { "cell_type": "markdown", "id": "07918d4f-aae3-428d-8a71-4fbe2d6c8c90", "metadata": {}, "source": [ "## Prerequisites\n", "* Vertex AI Workbench notebook or Colab (if using Colab, authenticate first)\n", "* Processor details\n", "* Permissions for Google Cloud Storage and Vertex AI Notebooks."
] }, { "cell_type": "markdown", "id": "c92d9b0c-4c03-41b6-aa2e-fcf838791610", "metadata": {}, "source": [ "## Step-by-Step Procedure" ] }, { "cell_type": "markdown", "id": "26d50f73-088d-4b58-9cfc-f276f2201f9c", "metadata": {}, "source": [ "### 1. Import Required Modules" ] }, { "cell_type": "code", "execution_count": null, "id": "ad01d013-4198-463f-9c6b-2792ec407efc", "metadata": { "tags": [] }, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" ] }, { "cell_type": "code", "execution_count": null, "id": "a512cb48-2d05-4f19-9669-64af75a1edcd", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install google-cloud-documentai google-cloud-storage" ] }, { "cell_type": "code", "execution_count": null, "id": "ad415a6d-6b54-4241-a781-678a527f0ed7", "metadata": { "tags": [] }, "outputs": [], "source": [ "from google.cloud import documentai_v1beta3" ] }, { "cell_type": "markdown", "id": "110046b1-ebea-4a2b-990d-5acda4a3b66c", "metadata": {}, "source": [ "### 2. Set up the inputs\n", "\n", "* `project_id` : The ID of the Google Cloud project that contains the processor.\n", "* `location` : The location of the processor, e.g., `us` or `eu`.\n", "* `processor_id` : The ID of the Custom Document Extractor processor whose schema will be updated."
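, "\n", "These three values are combined into the full processor resource name that the Document AI API expects; the dataset schema resource read and written later in this notebook appends `/dataset/datasetSchema` to this pattern:\n", "\n", "```\n", "projects/{project_id}/locations/{location}/processors/{processor_id}\n", "```"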
] }, { "cell_type": "code", "execution_count": null, "id": "41618163-fafe-438a-a8d4-25571fc600b4", "metadata": {}, "outputs": [], "source": [
"project_id = \"<project-id>\"  # Project ID of the project\n",
"location = \"us\"  # Location of the processor: us or eu\n",
"processor_id = \"<cde-processor-id>\"  # ID of the processor whose schema will be updated\n",
"processor_name = f\"projects/{project_id}/locations/{location}/processors/{processor_id}\""
] }, { "cell_type": "markdown", "id": "846011bb-eb92-454d-9ea9-6c9763490bce", "metadata": {}, "source": [ "### 3. Run the required functions" ] }, { "cell_type": "code", "execution_count": null, "id": "6127e1d9-6a45-4827-a10c-481343697abc", "metadata": { "tags": [] }, "outputs": [], "source": [
"def get_dataset_schema(processor_name: str) -> documentai_v1beta3.DatasetSchema:\n",
"    \"\"\"\n",
"    Retrieves the dataset schema for a given Document AI processor.\n",
"\n",
"    Args:\n",
"        processor_name (str): The name of the processor from which to retrieve the dataset schema.\n",
"            The processor name should be in the format:\n",
"            'projects/{project_id}/locations/{location}/processors/{processor_id}'.\n",
"\n",
"    Returns:\n",
"        documentai_v1beta3.DatasetSchema: The dataset schema associated with the processor.\n",
"    \"\"\"\n",
"    # Create a client\n",
"    client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
"    # Initialize request argument(s)\n",
"    request = documentai_v1beta3.GetDatasetSchemaRequest(\n",
"        name=processor_name + \"/dataset/datasetSchema\",\n",
"    )\n",
"\n",
"    # Make the request\n",
"    response = client.get_dataset_schema(request=request)\n",
"    return response\n",
"\n",
"\n",
"def update_dataset_schema(\n",
"    schema: documentai_v1beta3.DatasetSchema,\n",
") -> documentai_v1beta3.DatasetSchema:\n",
"    \"\"\"\n",
"    Updates the dataset schema for a given Document AI processor.\n",
"\n",
"    Args:\n",
"        schema (documentai_v1beta3.DatasetSchema): The schema object to update.\n",
"            
It should contain\n",
"            the `name` and `document_schema` fields that represent the dataset\n",
"            and document schema definitions respectively.\n",
"\n",
"    Returns:\n",
"        documentai_v1beta3.DatasetSchema: The updated dataset schema response.\n",
"    \"\"\"\n",
"    # Create a client\n",
"    client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
"    # Initialize request argument(s)\n",
"    request = documentai_v1beta3.UpdateDatasetSchemaRequest(\n",
"        dataset_schema={\"name\": schema.name, \"document_schema\": schema.document_schema}\n",
"    )\n",
"\n",
"    # Make the request\n",
"    response = client.update_dataset_schema(request=request)\n",
"    # Handle the response\n",
"    return response\n",
"\n",
"\n",
"def modify_schema(\n",
"    schema: documentai_v1beta3.DatasetSchema, changes: dict\n",
") -> documentai_v1beta3.DatasetSchema:\n",
"    \"\"\"\n",
"    Modifies the schema based on the provided changes.\n",
"\n",
"    Args:\n",
"        schema (documentai_v1beta3.DatasetSchema): The original dataset schema to modify.\n",
"        changes (dict): A dictionary specifying the changes to apply.\n",
"            The dictionary keys represent the change types and can include:\n",
"            - \"rename\": Renames an entity or property.\n",
"            - \"change_type\": Changes the value type of an entity or property.\n",
"            - \"change_occurrence\": Modifies the occurrence type of an entity or property.\n",
"            - \"add_field\": Adds a new field to the schema.\n",
"            - \"delete_field\": Deletes an existing field from the schema.\n",
"            Each change type has corresponding details in `change_data` like:\n",
"            - \"old_name\" (for rename), \"new_name\", \"parent_entity\" (for nested fields), etc.\n",
"\n",
"    Returns:\n",
"        documentai_v1beta3.DatasetSchema: The modified dataset schema.\n",
"    \"\"\"\n",
"\n",
"    for change_type, change_data in changes.items():\n",
"        if change_type == \"rename\":\n",
"            if change_data[\"parent_entity\"] is None:\n",
"                # Rename a top-level entity\n",
"                for entity_type in schema.document_schema.entity_types:\n",
"                    for prop in entity_type.properties:\n",
"                        if prop.name == change_data[\"old_name\"]:\n",
"                            prop.name = change_data[\"new_name\"]\n",
"            else:\n",
"                # Rename a nested entity within its parent entity\n",
"                for parent_entity in schema.document_schema.entity_types:\n",
"                    if parent_entity.name == change_data[\"parent_entity\"]:\n",
"                        for parent_prop in parent_entity.properties:\n",
"                            if parent_prop.name == change_data[\"old_name\"]:\n",
"                                parent_prop.name = change_data[\"new_name\"]\n",
"                        break\n",
"\n",
"        elif change_type == \"change_type\":\n",
"            for entity_type in schema.document_schema.entity_types:\n",
"                for prop in entity_type.properties:\n",
"                    if prop.name == change_data[\"entity_name\"]:\n",
"                        prop.value_type = change_data[\"new_type\"]\n",
"\n",
"        elif change_type == \"change_occurrence\":\n",
"            for entity_type in schema.document_schema.entity_types:\n",
"                for prop in entity_type.properties:\n",
"                    if prop.name == change_data[\"entity_name\"]:\n",
"                        prop.occurrence_type = change_data[\"new_occurrence_type\"]\n",
"\n",
"        elif change_type == \"add_field\":\n",
"            new_entity = {\n",
"                \"name\": change_data[\"entity_name\"],\n",
"                \"value_type\": change_data[\"value_type\"],\n",
"                \"occurrence_type\": change_data[\"occurrence_type\"],\n",
"            }\n",
"            if change_data[\"parent_entity\"] is None:\n",
"                # Add a top-level field to the document entity\n",
"                for entity_type in schema.document_schema.entity_types:\n",
"                    if entity_type.base_types[0] == \"document\":\n",
"                        entity_type.properties.append(new_entity)\n",
"            else:\n",
"                # Add a nested field under the parent entity\n",
"                parent_found = False\n",
"                for parent_entity in schema.document_schema.entity_types:\n",
"                    if parent_entity.name == change_data[\"parent_entity\"]:\n",
"                        parent_entity.properties.append(new_entity)\n",
"                        parent_found = True\n",
"                        break\n",
"\n",
"                # Create a new parent entity type if it does not exist yet\n",
"                if not parent_found:\n",
"                    new_parent_entity = documentai_v1beta3.DocumentSchema.EntityType(\n",
"                        name=change_data[\"parent_entity\"],\n",
"                        base_types=[\"object\"],\n",
"                        properties=[new_entity],\n",
"                    )\n",
"                    schema.document_schema.entity_types.append(new_parent_entity)\n",
"                    # Register the new parent as a property of the document entity\n",
"                    new_parent_prop = {\n",
"                        \"name\": change_data[\"parent_entity\"],\n",
"                        \"value_type\": change_data[\"parent_entity\"],\n",
"                        \"occurrence_type\": \"OPTIONAL_MULTIPLE\",  # TODO: Update the code to set this with change parameter\n",
"                    }\n",
"                    for entity_type in schema.document_schema.entity_types:\n",
"                        if entity_type.base_types[0] == \"document\":\n",
"                            entity_type.properties.append(new_parent_prop)\n",
"\n",
"        elif change_type == \"delete_field\":\n",
"            if change_data[\"parent_entity\"] is None:\n",
"                # Delete a top-level field (copy the list so removal is safe while iterating)\n",
"                for entity_type in schema.document_schema.entity_types:\n",
"                    for prop in list(entity_type.properties):\n",
"                        if prop.name == change_data[\"entity_name\"]:\n",
"                            entity_type.properties.remove(prop)\n",
"            else:\n",
"                # Delete a nested field from its parent entity\n",
"                for parent_entity in schema.document_schema.entity_types:\n",
"                    if parent_entity.name == change_data[\"parent_entity\"]:\n",
"                        for parent_prop in list(parent_entity.properties):\n",
"                            if parent_prop.name == change_data[\"entity_name\"]:\n",
"                                parent_entity.properties.remove(parent_prop)\n",
"                        break\n",
"\n",
"    return schema"
] }, { "cell_type": "markdown", "id": "8eb2b3e0-8d10-4a57-83ab-330b041fa187", "metadata": {}, "source": [ "### 4. Run the code" ] }, { "cell_type": "code", "execution_count": null, "id": "90b255b0-cb06-4f4e-8f0e-75c2d56a52bd", "metadata": { "tags": [] }, "outputs": [], "source": [
"def main():\n",
"    # Get the current schema from the processor\n",
"    schema = get_dataset_schema(processor_name)\n",
"\n",
"    # Define schema changes\n",
"    # TODO: Define your schema changes\n",
"    # NOTE: `changes` is a dict, so each change type can appear only once per\n",
"    # call; run main() again to apply additional changes of the same type.\n",
"    changes = {\n",
"        \"rename\": {\n",
"            \"old_name\": \"entity_1\",\n",
"            \"new_name\": \"new_entity_1\",\n",
"            \"parent_entity\": None,\n",
"        },\n",
"        \"change_type\": {\"entity_name\": \"entity_2\", \"new_type\": 
\"money\"},\n",
"        \"change_occurrence\": {\n",
"            \"entity_name\": \"entity_3\",\n",
"            \"new_occurrence_type\": \"OPTIONAL_ONCE\",\n",
"        },\n",
"        \"add_field\": {\n",
"            \"entity_name\": \"new_child_field\",\n",
"            \"value_type\": \"string\",\n",
"            \"occurrence_type\": \"OPTIONAL_MULTIPLE\",\n",
"            \"parent_entity\": \"parent_entity\",\n",
"        },\n",
"        \"delete_field\": {\"entity_name\": \"entity_4\", \"parent_entity\": None},\n",
"    }\n",
"\n",
"    # Apply changes to the schema\n",
"    updated_schema = modify_schema(schema, changes)\n",
"\n",
"    # Print the updated schema\n",
"    # print(\"Updated schema:\")\n",
"    # print(updated_schema)\n",
"\n",
"    # Update the dataset schema in the processor\n",
"    response_update = update_dataset_schema(updated_schema)\n",
"    print(\"Schema Updated\")\n",
"\n",
"\n",
"main()"
] }, { "cell_type": "markdown", "id": "c21c9da6-5333-4b95-90dc-7f78425b597e", "metadata": {}, "source": [ "### 5. Output\n", "\n", "This updates the processor's schema according to the changes specified in the `changes` dictionary." ] }, { "cell_type": "markdown", "id": "6390a5e8-f8c8-49de-b832-05a43d7357d0", "metadata": { "tags": [] }, "source": [ "#### Schema Before Tooling\n", "<img src=\"./Images/Before_schema_updation.png\" width=800 height=400>\n", "\n", "#### Schema After Tooling\n", "<img src=\"./Images/After_schema_updation.png\" width=800 height=400>" ] }, { "cell_type": "code", "execution_count": null, "id": "25da77cc-daa6-445e-8a20-33b7be984ead", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "environment": { "kernel": "conda-base-py", "name": "workbench-notebooks.m125", "type": "gcloud", "uri": 
"us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125" }, "kernelspec": { "display_name": "Python 3 (ipykernel) (Local)", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 5 }