# Update CDE Schema

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.


## Objective

This tool lets users edit a Custom Document Extractor (CDE) schema through API calls. It supports adding new schema entities (with single-level nesting), renaming entities, deleting entities (deletion fails if a processor has previously been trained on the entity), and modifying an entity's Occurrence Type and Data Type.

Support for modifying entity Descriptions, for compatibility with the prompting feature, is planned.

## Prerequisites
* Vertex AI Notebook or Colab (if using Colab, authenticate first)
* Processor details
* Permissions for Google Cloud Storage and Vertex AI Notebook
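
For the Colab case, the authentication step could look like the minimal sketch below. It assumes the `google.colab.auth` helper that ships with Colab runtimes; the function name `authenticate_if_colab` is only illustrative. Outside Colab (for example in a Vertex AI Notebook, where credentials are normally picked up from the environment) the function does nothing.

```python
def authenticate_if_colab() -> bool:
    """Run the Colab sign-in flow when this code runs inside Colab.

    Returns True if the Colab flow ran, False otherwise.
    """
    try:
        # This module is only importable inside a Colab runtime.
        from google.colab import auth
    except ImportError:
        return False
    auth.authenticate_user()
    return True


ran_colab_auth = authenticate_if_colab()
print("Colab authentication ran:", ran_colab_auth)
```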

## Step-by-step procedure

### 1. Importing Required Modules

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
!pip install google-cloud-documentai google-cloud-storage

In [None]:
from google.cloud import documentai_v1beta3

### 2. Set up the inputs

* `project_id` : The ID of the Google Cloud project that owns the processor.
* `location` : The location of the processor, e.g., `us` or `eu`.
* `processor_id` : The ID of the Custom Document Extractor processor whose schema will be updated.

In [None]:
project_id = "<project-id>"  # ID of the Google Cloud project
location = "us"  # Location of the processor: "us" or "eu"
processor_id = "<cde-processor-id>"  # ID of the CDE processor whose schema will be updated
processor_name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
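
The clients in this notebook are created with default options, which target the global `documentai.googleapis.com` endpoint. For processors outside `us`, Document AI generally expects a regional endpoint of the form `{location}-documentai.googleapis.com`. The sketch below shows one way to derive it; `regional_endpoint` is a hypothetical helper, not part of this tool, and the commented lines show how it might be passed to the client.

```python
def regional_endpoint(location: str) -> str:
    """Build the regional Document AI endpoint for a location such as "us" or "eu"."""
    location = location.strip().lower()
    if not location:
        raise ValueError("location must not be empty")
    return f"{location}-documentai.googleapis.com"


print(regional_endpoint("eu"))  # eu-documentai.googleapis.com

# With the client library installed, the endpoint can be supplied like this:
# from google.api_core.client_options import ClientOptions
# client = documentai_v1beta3.DocumentServiceClient(
#     client_options=ClientOptions(api_endpoint=regional_endpoint(location))
# )
```

If the processor lives in `eu`, the two `DocumentServiceClient()` calls in the next step would likely need these client options as well.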

### 3. Run the required functions

In [None]:
def get_dataset_schema(processor_name: str) -> documentai_v1beta3.DatasetSchema:
    """
    Retrieves the dataset schema for a given Document AI processor.

    Args:
        processor_name (str): The name of the processor from which to retrieve the dataset schema.
                              The processor name should be in the format:
                              'projects/{project_id}/locations/{location}/processors/{processor_id}'.

    Returns:
        documentai_v1beta3.DatasetSchema: The dataset schema associated with the processor.
    """
    # Create a client
    client = documentai_v1beta3.DocumentServiceClient()

    # Initialize request argument(s)
    request = documentai_v1beta3.GetDatasetSchemaRequest(
        name=processor_name + "/dataset/datasetSchema",
    )

    # Make the request
    response = client.get_dataset_schema(request=request)
    return response


def update_dataset_schema(
    schema: documentai_v1beta3.DatasetSchema,
) -> documentai_v1beta3.DatasetSchema:
    """
    Updates the dataset schema for a given Document AI processor.

    Args:
        schema (documentai_v1beta3.DatasetSchema): The schema object to update. It should contain
                                                   the `name` and `document_schema` fields that
                                                   represent the dataset and document schema
                                                   definitions respectively.

    Returns:
        documentai_v1beta3.DatasetSchema: The updated dataset schema response.
    """
    # Create a client
    client = documentai_v1beta3.DocumentServiceClient()

    # Initialize request argument(s)
    request = documentai_v1beta3.UpdateDatasetSchemaRequest(
        dataset_schema={"name": schema.name, "document_schema": schema.document_schema}
    )

    # Make the request
    response = client.update_dataset_schema(request=request)
    # Handle the response
    return response


def modify_schema(
    schema: documentai_v1beta3.DatasetSchema, changes: dict
) -> documentai_v1beta3.DatasetSchema:
    """
    Modifies the schema based on the provided changes.

    Args:
        schema (documentai_v1beta3.DatasetSchema): The original dataset schema to modify.
        changes (dict): A dictionary specifying the changes to apply.
                        The dictionary keys represent the change types and can include:
                          - "rename": Renames an entity or property.
                          - "change_type": Changes the value type of an entity or property.
                          - "change_occurrence": Modifies the occurrence type of an entity or property.
                          - "add_field": Adds a new field to the schema.
                          - "delete_field": Deletes an existing field from the schema.
                        Each change type has corresponding details in `change_data` like:
                          - "old_name" (for rename), "new_name", "parent_entity" (for nested fields), etc.

    Returns:
        documentai_v1beta3.DatasetSchema: The modified dataset schema.
    """

    entity_types = schema.document_schema.entity_types
    Property = documentai_v1beta3.DocumentSchema.EntityType.Property

    def matching_types(parent_name):
        # With a parent name, restrict the change to that entity type;
        # otherwise search the properties of every entity type.
        if parent_name is None:
            return list(entity_types)
        return [et for et in entity_types if et.name == parent_name]

    for change_type, change_data in changes.items():
        parent_name = change_data.get("parent_entity")

        if change_type == "rename":
            for entity_type in matching_types(parent_name):
                for prop in entity_type.properties:
                    if prop.name == change_data["old_name"]:
                        prop.name = change_data["new_name"]

        elif change_type == "change_type":
            for entity_type in matching_types(parent_name):
                for prop in entity_type.properties:
                    if prop.name == change_data["entity_name"]:
                        prop.value_type = change_data["new_type"]

        elif change_type == "change_occurrence":
            for entity_type in matching_types(parent_name):
                for prop in entity_type.properties:
                    if prop.name == change_data["entity_name"]:
                        prop.occurrence_type = change_data["new_occurrence_type"]

        elif change_type == "add_field":
            new_property = Property(
                name=change_data["entity_name"],
                value_type=change_data["value_type"],
                occurrence_type=change_data["occurrence_type"],
            )
            # The document-level entity type is the one with base type "document".
            root = next((et for et in entity_types if "document" in et.base_types), None)
            if root is None:
                raise ValueError("Schema has no document-level entity type")
            parent = next((et for et in entity_types if et.name == parent_name), None)

            if parent_name is None:
                # Top-level field: attach it to the document-level entity type.
                root.properties.append(new_property)
            elif parent is not None:
                # Nested field: attach it to the existing parent entity type only.
                parent.properties.append(new_property)
            else:
                # Parent not found: create its entity type with the new field,
                # then reference it from the document-level entity type.
                entity_types.append(
                    documentai_v1beta3.DocumentSchema.EntityType(
                        name=parent_name,
                        base_types=["object"],
                        properties=[new_property],
                    )
                )
                root.properties.append(
                    Property(
                        name=parent_name,
                        value_type=parent_name,
                        occurrence_type="OPTIONAL_MULTIPLE",  # TODO: Update the code to set this with change parameter
                    )
                )

        elif change_type == "delete_field":
            for entity_type in matching_types(parent_name):
                # Walk the indices in reverse so deleting does not skip items.
                for index in reversed(range(len(entity_type.properties))):
                    if entity_type.properties[index].name == change_data["entity_name"]:
                        del entity_type.properties[index]

    return schema
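
To make the nesting convention used by `modify_schema` concrete: a nested field lives in its own entity type, and the document-level entity type refers to it through a property whose `value_type` equals that entity type's name. The sketch below lists field paths under that assumption. It uses `SimpleNamespace` stand-ins rather than real API objects, and `list_field_paths` as well as all entity names are made up for illustration.

```python
from types import SimpleNamespace as NS


def list_field_paths(document_schema) -> list:
    """Return field paths such as "invoice_id" or "line_item/amount"."""
    types_by_name = {et.name: et for et in document_schema.entity_types}
    paths = []
    for entity_type in document_schema.entity_types:
        if "document" not in entity_type.base_types:
            continue  # only start from the document-level entity type
        for prop in entity_type.properties:
            paths.append(prop.name)
            # A property whose value_type names another entity type is a parent.
            child_type = types_by_name.get(prop.value_type)
            if child_type is not None:
                paths.extend(f"{prop.name}/{child.name}" for child in child_type.properties)
    return paths


# Stand-in schema: one simple field and one parent field with a single child.
demo_schema = NS(
    entity_types=[
        NS(
            name="custom_extraction_document_type",
            base_types=["document"],
            properties=[
                NS(name="invoice_id", value_type="string"),
                NS(name="line_item", value_type="line_item"),
            ],
        ),
        NS(
            name="line_item",
            base_types=["object"],
            properties=[NS(name="amount", value_type="money")],
        ),
    ]
)

print(list_field_paths(demo_schema))
# ['invoice_id', 'line_item', 'line_item/amount']
```

Because the helper only reads `entity_types`, `base_types`, `properties`, `name`, and `value_type`, it should also accept `get_dataset_schema(processor_name).document_schema`, which is handy for checking entity names before writing the change dictionary.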

### 4. Run the code

In [None]:
def main():
    # Fetch the current dataset schema of the processor
    schema = get_dataset_schema(processor_name)

    # Define schema changes
    # TODO: Define your schema changes
    # A Python dict keeps only the last value for a repeated key, so repeated
    # change types are split across separate dicts and applied in order.
    change_batches = [
        {
            "rename": {
                "old_name": "entity_1",
                "new_name": "new_entity_1",
                "parent_entity": None,
            },
            "change_type": {"entity_name": "entity_2", "new_type": "money"},
            "change_occurrence": {
                "entity_name": "entity_3",
                "new_occurrence_type": "OPTIONAL_ONCE",
            },
            "add_field": {
                "entity_name": "new_entity_field",
                "value_type": "string",
                "occurrence_type": "OPTIONAL_MULTIPLE",
                "parent_entity": None,
            },
            "delete_field": {"entity_name": "entity_4", "parent_entity": None},
        },
        {
            "add_field": {
                "entity_name": "new_child_field",
                "value_type": "string",
                "occurrence_type": "OPTIONAL_MULTIPLE",
                "parent_entity": "parent_entity",
            },
            "rename": {
                "old_name": "child_entity_1",
                "new_name": "rename_child_entity_1",
                "parent_entity": "parent_entity",
            },
        },
    ]

    # Apply each batch of changes to the schema
    updated_schema = schema
    for changes in change_batches:
        updated_schema = modify_schema(updated_schema, changes)

    # Print the updated schema
    # print("Updated schema:")
    # print(updated_schema)

    # Update dataset schema in your system
    response_update = update_dataset_schema(updated_schema)
    print("Schema Updated")


main()
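
Since the update is written straight to the processor, it can help to check a change dictionary before calling `modify_schema`. The sketch below is one possible pre-flight check, not part of the tool: `validate_changes` is a hypothetical helper, the required keys are taken from the `modify_schema` docstring and the example in `main()`, and the occurrence type names are assumed to match the API's `OccurrenceType` enum.

```python
REQUIRED_KEYS = {
    "rename": {"old_name", "new_name", "parent_entity"},
    "change_type": {"entity_name", "new_type"},
    "change_occurrence": {"entity_name", "new_occurrence_type"},
    "add_field": {"entity_name", "value_type", "occurrence_type", "parent_entity"},
    "delete_field": {"entity_name", "parent_entity"},
}

# Assumed to mirror the OccurrenceType enum of the Document AI schema API.
OCCURRENCE_TYPES = {
    "OPTIONAL_ONCE",
    "OPTIONAL_MULTIPLE",
    "REQUIRED_ONCE",
    "REQUIRED_MULTIPLE",
}


def validate_changes(changes: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    for change_type, change_data in changes.items():
        required = REQUIRED_KEYS.get(change_type)
        if required is None:
            problems.append(f"unknown change type: {change_type}")
            continue
        missing = sorted(required - set(change_data))
        if missing:
            problems.append(f"{change_type}: missing keys {missing}")
        occurrence = change_data.get("occurrence_type") or change_data.get(
            "new_occurrence_type"
        )
        if occurrence is not None and occurrence not in OCCURRENCE_TYPES:
            problems.append(f"{change_type}: unknown occurrence type {occurrence!r}")
    return problems


# Example with two deliberate mistakes.
demo_changes = {
    "add_field": {
        "entity_name": "new_entity_field",
        "value_type": "string",
        "occurrence_type": "OPTIONAL_MANY",  # typo on purpose
        "parent_entity": None,
    },
    "delete_field": {"entity_name": "entity_4"},  # parent_entity left out
}
for problem in validate_changes(demo_changes):
    print(problem)
```

Note that a dict literal cannot carry the same change type twice; Python silently keeps only the last value for a repeated key, so a check like this never sees the earlier one.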

### 5. Output

Running the code updates the processor's schema according to the changes defined in `main()`.
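
Besides checking the console screenshots, the result can be verified programmatically by comparing the schema fetched before and after the update. The following is a rough sketch with made-up names (`field_signatures`, `diff_schemas`, and the stand-in entities); it relies only on the `entity_types`, `properties`, `name`, and `value_type` attributes, so it is demonstrated here on `SimpleNamespace` objects.

```python
from types import SimpleNamespace as NS


def field_signatures(document_schema) -> set:
    """Collect (entity type, field, value type) triples for every property."""
    return {
        (entity_type.name, prop.name, prop.value_type)
        for entity_type in document_schema.entity_types
        for prop in entity_type.properties
    }


def diff_schemas(before, after) -> dict:
    """Report which field signatures were added or removed."""
    old, new = field_signatures(before), field_signatures(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def root_with(*names):
    # Stand-in document schema with string fields on a single root entity type.
    props = [NS(name=n, value_type="string") for n in names]
    return NS(entity_types=[NS(name="root", base_types=["document"], properties=props)])


before = root_with("entity_1", "entity_4")
after = root_with("new_entity_1", "new_entity_field")
print(diff_schemas(before, after))
```

A rename shows up as one removed and one added entry. With real data, the "before" snapshot needs to come from its own `get_dataset_schema` call (or a copy), because `modify_schema` edits the schema object in place.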

#### Schema Before Tooling
<img src="./Images/Before_schema_updation.png" width=800 height=400 ></img>

#### Schema After Tooling
<img src="./Images/After_schema_updation.png" width=800 height=400 ></img>