incubator-tools/update_cde_schema/update_cde_schema.ipynb (441 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "5cbcbc6b-0c0b-42d5-8619-ebd1f4bf8657",
"metadata": {},
"source": [
"# Update CDE Schema"
]
},
{
"cell_type": "markdown",
"id": "a054d9da-3034-4fb0-b828-cd536198f68c",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "108c2600-51ad-4085-9555-7f3332ccc1b4",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.\n"
]
},
{
"cell_type": "markdown",
"id": "4c8d9223-2537-4ae0-98a5-32e372d1419d",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"The objective of this tooling is to enable users to edit a Custom Document Extractor schema via API call. This includes adding new schema entities (support for single level nesting), deleting entities (this will fail if a processor has been trained on the entity previously), modifying Occurrence Type, and modifying Data Type.\n",
"\n",
"Eventual support for Description modification to be compatible with the prompting feature."
]
},
{
"cell_type": "markdown",
"id": "07918d4f-aae3-428d-8a71-4fbe2d6c8c90",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Vertex AI Notebook Or Colab (If using Colab, use authentication)\n",
"* Processor details\n",
"* Permission For Google Storage and Vertex AI Notebook."
]
},
{
"cell_type": "markdown",
"id": "c92d9b0c-4c03-41b6-aa2e-fcf838791610",
"metadata": {},
"source": [
"## Step by Step procedure "
]
},
{
"cell_type": "markdown",
"id": "26d50f73-088d-4b58-9cfc-f276f2201f9c",
"metadata": {},
"source": [
"### 1.Importing Required Modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad01d013-4198-463f-9c6b-2792ec407efc",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a512cb48-2d05-4f19-9669-64af75a1edcd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install google-cloud-documentai google-cloud-storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad415a6d-6b54-4241-a781-678a527f0ed7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from google.cloud import documentai_v1beta3"
]
},
{
"cell_type": "markdown",
"id": "110046b1-ebea-4a2b-990d-5acda4a3b66c",
"metadata": {},
"source": [
"### 2.Setup the inputs\n",
"\n",
"* `project_id` : A unique identifier for a Google Cloud project.\n",
"* `location` : The geographic region of the resource or operation, e.g., us-central1.\n",
"* `processor_id` : Identifier for the Google Cloud processor."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41618163-fafe-438a-a8d4-25571fc600b4",
"metadata": {},
"outputs": [],
"source": [
"project_id = \"<project-id>\" # Project ID of the project\n",
"location = \"us\" # location of the processor: us or eu\n",
"processor_id = \"<cde-processor-id>\" # Processor id of processor from which the schema has to be exported to spreadsheet\n",
"processor_name = f\"projects/{project_id}/locations/{location}/processors/{processor_id}\""
]
},
{
"cell_type": "markdown",
"id": "846011bb-eb92-454d-9ea9-6c9763490bce",
"metadata": {},
"source": [
"### 3.Run the required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6127e1d9-6a45-4827-a10c-481343697abc",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def get_dataset_schema(processor_name: str) -> documentai_v1beta3.DatasetSchema:\n",
" \"\"\"\n",
" Retrieves the dataset schema for a given Document AI processor.\n",
"\n",
" Args:\n",
" processor_name (str): The name of the processor from which to retrieve the dataset schema.\n",
" The processor name should be in the format:\n",
" 'projects/{project_id}/locations/{location}/processors/{processor_id}'.\n",
"\n",
" Returns:\n",
" documentai_v1beta3.DatasetSchema: The dataset schema associated with the processor.\n",
" \"\"\"\n",
" # Create a client\n",
" client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
" # Initialize request argument(s)\n",
" request = documentai_v1beta3.GetDatasetSchemaRequest(\n",
" name=processor_name + \"/dataset/datasetSchema\",\n",
" )\n",
"\n",
" # Make the request\n",
" response = client.get_dataset_schema(request=request)\n",
" return response\n",
"\n",
"\n",
"def update_dataset_schema(\n",
" schema: documentai_v1beta3.DatasetSchema,\n",
") -> documentai_v1beta3.DatasetSchema:\n",
" \"\"\"\n",
" Updates the dataset schema for a given Document AI processor.\n",
"\n",
" Args:\n",
" schema (documentai_v1beta3.DatasetSchema): The schema object to update. It should contain\n",
" the `name` and `document_schema` fields that\n",
" represent the dataset and document schema\n",
" definitions respectively.\n",
"\n",
" Returns:\n",
" documentai_v1beta3.DatasetSchema: The updated dataset schema response.\n",
" \"\"\"\n",
" # Create a client\n",
" client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
" # Initialize request argument(s)\n",
" request = documentai_v1beta3.UpdateDatasetSchemaRequest(\n",
" dataset_schema={\"name\": schema.name, \"document_schema\": schema.document_schema}\n",
" )\n",
"\n",
" # Make the request\n",
" response = client.update_dataset_schema(request=request)\n",
" # Handle the response\n",
" return response\n",
"\n",
"\n",
"def modify_schema(\n",
" schema: documentai_v1beta3.DatasetSchema, changes: dict\n",
") -> documentai_v1beta3.DatasetSchema:\n",
" \"\"\"\n",
" Modifies the schema based on the provided changes.\n",
"\n",
" Args:\n",
" schema (documentai_v1beta3.DatasetSchema): The original dataset schema to modify.\n",
" changes (dict): A dictionary specifying the changes to apply.\n",
" The dictionary keys represent the change types and can include:\n",
" - \"rename\": Renames an entity or property.\n",
" - \"change_type\": Changes the value type of an entity or property.\n",
" - \"change_occurrence\": Modifies the occurrence type of an entity or property.\n",
" - \"add_field\": Adds a new field to the schema.\n",
" - \"delete_field\": Deletes an existing field from the schema.\n",
" Each change type has corresponding details in `change_data` like:\n",
" - \"old_name\" (for rename), \"new_name\", \"parent_entity\" (for nested fields), etc.\n",
"\n",
" Returns:\n",
" documentai_v1beta3.DatasetSchema: The modified dataset schema.\n",
" \"\"\"\n",
"\n",
" for change_type, change_data in changes.items():\n",
" if change_type == \"rename\":\n",
" for entity_type in schema.document_schema.entity_types:\n",
" for prop in entity_type.properties:\n",
" if prop.name == change_data[\"old_name\"]:\n",
" prop.name = change_data[\"new_name\"]\n",
" if change_data[\"parent_entity\"] is not None:\n",
" # Rename nested entity within parent entity\n",
" for parent_entity in schema.document_schema.entity_types:\n",
" if parent_entity.name == change_data[\"parent_entity\"]:\n",
" for parent_prop in parent_entity.properties:\n",
" if parent_prop.name == change_data[\"old_name\"]:\n",
" parent_prop.name = change_data[\"new_name\"]\n",
" break\n",
"\n",
" elif change_type == \"change_type\":\n",
" for entity_type in schema.document_schema.entity_types:\n",
" for prop in entity_type.properties:\n",
" if prop.name == change_data[\"entity_name\"]:\n",
" prop.value_type = change_data[\"new_type\"]\n",
"\n",
" elif change_type == \"change_occurrence\":\n",
" for entity_type in schema.document_schema.entity_types:\n",
" for prop in entity_type.properties:\n",
" if prop.name == change_data[\"entity_name\"]:\n",
" prop.occurrence_type = change_data[\"new_occurrence_type\"]\n",
"\n",
" elif change_type == \"add_field\":\n",
" # print(\"Add field called for \")\n",
" # print(change_data)\n",
" new_entity = {\n",
" \"name\": change_data[\"entity_name\"],\n",
" \"value_type\": change_data[\"value_type\"],\n",
" \"occurrence_type\": change_data[\"occurrence_type\"],\n",
" }\n",
" for entity_type in schema.document_schema.entity_types:\n",
" if entity_type.base_types[0] == \"document\":\n",
" entity_type.properties.append(new_entity)\n",
"\n",
" if change_data[\"parent_entity\"] is not None:\n",
" # Add nested field under parent entity\n",
" parent_found = False\n",
" for parent_entity in schema.document_schema.entity_types:\n",
" if parent_entity.name == change_data[\"parent_entity\"]:\n",
" parent_entity.properties.append(new_entity)\n",
" parent_found = True\n",
" break\n",
"\n",
" # Create new entity type if parent not found\n",
" if not parent_found:\n",
" new_parent_entity = (\n",
" documentai_v1beta3.DocumentSchema.EntityType(\n",
" name=change_data[\"parent_entity\"],\n",
" base_types=[\"object\"],\n",
" properties=[new_entity],\n",
" )\n",
" )\n",
" schema.document_schema.entity_types.append(\n",
" new_parent_entity\n",
" )\n",
" new_parent_entity = {\n",
" \"name\": change_data[\"parent_entity\"],\n",
" \"value_type\": change_data[\"parent_entity\"],\n",
" \"occurrence_type\": \"OPTIONAL_MULTIPLE\", # TODO: Update the code to set this with change parameter\n",
" }\n",
" print(schema.document_schema.entity_types[0])\n",
" schema.document_schema.entity_types[\n",
" 0\n",
" ].properties.append(new_parent_entity)\n",
"\n",
" elif change_type == \"delete_field\":\n",
" for entity_type in schema.document_schema.entity_types:\n",
" for prop in entity_type.properties:\n",
" if prop.name == change_data[\"entity_name\"]:\n",
" entity_type.properties.remove(prop)\n",
" if change_data[\"parent_entity\"] is not None:\n",
" # Delete nested field from parent entity\n",
" for parent_entity in schema.document_schema.entity_types:\n",
" if parent_entity.name == change_data[\"parent_entity\"]:\n",
" for parent_prop in parent_entity.properties:\n",
" if (\n",
" parent_prop.name\n",
" == change_data[\"entity_name\"]\n",
" ):\n",
" parent_entity.properties.remove(parent_prop)\n",
" break\n",
"\n",
" return schema"
]
},
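{
"cell_type": "markdown",
"id": "validate-changes-note",
"metadata": {},
"source": [
"`modify_schema` reads keys such as `old_name` and `entity_name` directly from each change, so a missing or misspelled key only surfaces as a `KeyError` partway through the edits. The sketch below checks a `changes` dictionary up front. `REQUIRED_KEYS` and `find_change_problems` are hypothetical names that are not part of the original tool, and the required keys are an assumption based on the example in step 4."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "validate-changes-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper (not part of the original tool): checks changes before they are applied.\n",
"# Assumption: the required keys mirror the ones used in the step 4 example.\n",
"REQUIRED_KEYS = {\n",
"    \"rename\": {\"old_name\", \"new_name\", \"parent_entity\"},\n",
"    \"change_type\": {\"entity_name\", \"new_type\"},\n",
"    \"change_occurrence\": {\"entity_name\", \"new_occurrence_type\"},\n",
"    \"add_field\": {\"entity_name\", \"value_type\", \"occurrence_type\", \"parent_entity\"},\n",
"    \"delete_field\": {\"entity_name\", \"parent_entity\"},\n",
"}\n",
"\n",
"\n",
"def find_change_problems(changes: dict) -> list:\n",
"    \"\"\"Returns a list of problem descriptions; an empty list means no problem was found.\"\"\"\n",
"    problems = []\n",
"    for change_type, change_data in changes.items():\n",
"        if change_type not in REQUIRED_KEYS:\n",
"            problems.append(f\"unknown change type: {change_type!r}\")\n",
"            continue\n",
"        missing = sorted(REQUIRED_KEYS[change_type] - set(change_data))\n",
"        if missing:\n",
"            problems.append(f\"{change_type}: missing keys {missing}\")\n",
"    return problems\n",
"\n",
"\n",
"# Example: \"new_name\" is missing, so one problem is reported\n",
"print(find_change_problems({\"rename\": {\"old_name\": \"entity_1\", \"parent_entity\": None}}))"
]
},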
{
"cell_type": "markdown",
"id": "8eb2b3e0-8d10-4a57-83ab-330b041fa187",
"metadata": {},
"source": [
"### 4.Run the code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90b255b0-cb06-4f4e-8f0e-75c2d56a52bd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def main():\n",
" # Get schema from your source\n",
" schema = response_document_schema = get_dataset_schema(processor_name)\n",
"\n",
" # Define schema changes\n",
" # TODO: Define your schema changes\n",
" changes = {\n",
" \"rename\": {\n",
" \"old_name\": \"entity_1\",\n",
" \"new_name\": \"new_entity_1\",\n",
" \"parent_entity\": None,\n",
" },\n",
" \"change_type\": {\"entity_name\": \"entity_2\", \"new_type\": \"money\"},\n",
" \"change_occurrence\": {\n",
" \"entity_name\": \"entity_3\",\n",
" \"new_occurrence_type\": \"OPTIONAL_ONCE\",\n",
" },\n",
" \"add_field\": {\n",
" \"entity_name\": \"new_entity_field\",\n",
" \"value_type\": \"string\",\n",
" \"occurrence_type\": \"OPTIONAL_MULTIPLE\",\n",
" \"parent_entity\": None,\n",
" },\n",
" \"add_field\": {\n",
" \"entity_name\": \"new_child_field\",\n",
" \"value_type\": \"string\",\n",
" \"occurrence_type\": \"OPTIONAL_MULTIPLE\",\n",
" \"parent_entity\": \"parent_entity\",\n",
" },\n",
" \"delete_field\": {\"entity_name\": \"entity_4\", \"parent_entity\": None},\n",
" \"rename\": {\n",
" \"old_name\": \"child_entity_1\",\n",
" \"new_name\": \"rename_child_entity_1\",\n",
" \"parent_entity\": \"parent_entity\",\n",
" },\n",
" }\n",
"\n",
" # Apply changes to the schema\n",
" updated_schema = modify_schema(schema, changes)\n",
"\n",
" # Print the updated schema\n",
" # print(\"Updated schema:\")\n",
" # print(updated_schema)\n",
"\n",
" # Update dataset schema in your system\n",
" response_update = update_dataset_schema(updated_schema)\n",
" print(\"Schema Updated\")\n",
"\n",
"\n",
"main()"
]
},
{
"cell_type": "markdown",
"id": "c21c9da6-5333-4b95-90dc-7f78425b597e",
"metadata": {},
"source": [
"### 5.Output\n",
"\n",
"This will update schema in the processor as per the change schema mention"
]
},
{
"cell_type": "markdown",
"id": "6390a5e8-f8c8-49de-b832-05a43d7357d0",
"metadata": {
"tags": []
},
"source": [
"#### Schema Before Tooling\n",
"<img src=\"./Images/Before_schema_updation.png\" width=800 height=400 ></img>\n",
"\n",
"#### Schema After Tooling\n",
"<img src=\"./Images/After_schema_updation.png\" width=800 height=400 ></img>"
]
},
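{
"cell_type": "markdown",
"id": "verify-schema-note",
"metadata": {},
"source": [
"To confirm that the update took effect, the schema can be fetched again and its fields listed. This is a minimal sketch that reuses `get_dataset_schema` and `processor_name` from the earlier steps; the variable name `refreshed_schema` and the printed layout are only illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "verify-schema-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Re-fetch the schema and list every field with its data type and occurrence type\n",
"refreshed_schema = get_dataset_schema(processor_name)\n",
"for entity_type in refreshed_schema.document_schema.entity_types:\n",
"    print(f\"{entity_type.name}: base_types={list(entity_type.base_types)}\")\n",
"    for prop in entity_type.properties:\n",
"        print(f\"  - {prop.name} ({prop.value_type}, {prop.occurrence_type.name})\")"
]
},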
{
"cell_type": "code",
"execution_count": null,
"id": "25da77cc-daa6-445e-8a20-33b7be984ead",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"environment": {
"kernel": "conda-base-py",
"name": "workbench-notebooks.m125",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel) (Local)",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}