incubator-tools/document-schema-from-form-parser-output/document-schema-from-form-parser-output.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "55646894-baf1-4c1c-9790-5fc16468a282",
"metadata": {},
"source": [
"# Document Schema from Form Parser Output"
]
},
{
"cell_type": "markdown",
"id": "9c1e5f0d-91ea-4766-8ab3-f66429e66e1b",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"cell_type": "markdown",
"id": "528c5a36-a860-468f-a30a-08d051880aee",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied."
]
},
{
"cell_type": "markdown",
"id": "c195555a-cb6d-4896-af89-3e97b3312cc1",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This Document guides to create document schema from key value pairs of form parser output in csv format (The generated schema can be reviewed, updated). This schema can be used to update for any parser through API.\n"
]
},
{
"cell_type": "markdown",
"id": "3b5734e7-8944-485a-9412-ae063ef57318",
"metadata": {},
"source": [
"## Prerequisites\n",
"* Python : Jupyter notebook (Vertex AI) \n",
"* Service account permissions in projects.\n",
"* GCS Folder Path which has form parser parsed jsons\n"
]
},
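{
"cell_type": "markdown",
"id": "0f3a6c1e-1a2b-4c3d-8e4f-5a6b7c8d9e0f",
"metadata": {},
"source": [
"Before running the notebook, it is worth confirming that your credentials can reach the bucket holding the parsed JSONs. A minimal sketch (the bucket name `xxxx` is a placeholder for your own):\n",
"\n",
"```python\n",
"# Quick sanity check: list a few objects from the bucket with current credentials\n",
"from google.cloud import storage\n",
"\n",
"client = storage.Client()\n",
"for blob in client.list_blobs(\"xxxx\", max_results=5):  # replace with your bucket name\n",
"    print(blob.name)\n",
"```"
]
},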
{
"cell_type": "markdown",
"id": "ce627c16-65b6-4870-b0b7-28d5fe599905",
"metadata": {},
"source": [
"## Step by Step procedure \n",
"\n",
"### 1.Importing Required Modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d1964d8-b1e4-42b7-a399-40861581c371",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f1928e8-b562-443d-b17d-88f42cebda45",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"from google.cloud import storage\n",
"from tqdm import tqdm\n",
"from utilities import *\n",
"import pandas as pd\n",
"from collections import Counter, defaultdict\n",
"import csv"
]
},
{
"cell_type": "markdown",
"id": "9fdcdd55-89b2-4b67-838e-d82fc5c2d957",
"metadata": {},
"source": [
"### 2.Setup the required inputs\n",
"* `project_id` : Your Google project id or name\n",
"* `formparser_parsed_jsons_path` : GCS storegar path where the form parser output is saved"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d717d6b4-4bdd-4d04-b9e4-da57f4dd008f",
"metadata": {},
"outputs": [],
"source": [
"project_id = \"xxx-xxxx-xxxx\" # your project id\n",
"formparser_parsed_jsons_path = \"gs://xxxx/xxxx/xxx/\" # path of the form parser output"
]
},
{
"cell_type": "markdown",
"id": "10d320b8-c2c8-46b9-8c10-33c5147347a6",
"metadata": {},
"source": [
"### 3.Importing Required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92e49ea6-f84c-47e5-a562-1632f8e6e2a5",
"metadata": {},
"outputs": [],
"source": [
"def get_schema_file(json_dict: object) -> List[Dict[str, Any]]:\n",
" \"\"\"\n",
" Extracts schema information from Document AI output and returns a list of entities with their types and occurrences.\n",
"\n",
" Args:\n",
" json_dict (object): The OCR output in the form of a Document AI document.\n",
"\n",
" Returns:\n",
" List[Dict[str, Any]]: A list of dictionaries representing entities with their types, occurrences, mention text, and value type.\n",
" \"\"\"\n",
"\n",
" entities_kv = []\n",
" for page_number, page_data in enumerate(json_dict.pages):\n",
" for formField_number, formField_data in enumerate(\n",
" getattr(page_data, \"form_fields\", [])\n",
" ):\n",
" # Cleaning the entity name\n",
" key_name = re.sub(\n",
" r\"[^\\w\\s]\",\n",
" \"\",\n",
" formField_data.field_name.text_anchor.content.replace(\" \", \"_\")\n",
" .lower()\n",
" .strip(),\n",
" )\n",
" if key_name[-1] == \"_\":\n",
" key_name = key_name[:-1]\n",
" if key_name:\n",
" # print(formField_data.field_value.bounding_poly.normalized_vertices)\n",
" ent_xy = {\"x\": [], \"y\": []}\n",
" text_anc = []\n",
" for xy in formField_data.field_value.bounding_poly.normalized_vertices:\n",
" ent_xy[\"x\"].append(xy.x)\n",
" ent_xy[\"y\"].append(xy.y)\n",
" for anc in formField_data.field_value.text_anchor.text_segments:\n",
" text_anc.append(\n",
" {\"start_index\": anc.start_index, \"end_index\": anc.end_index}\n",
" )\n",
"\n",
" page_anc_1 = [\n",
" {\"x\": min(ent_xy[\"x\"]), \"y\": min(ent_xy[\"y\"])},\n",
" {\"x\": min(ent_xy[\"x\"]), \"y\": max(ent_xy[\"y\"])},\n",
" {\"x\": max(ent_xy[\"x\"]), \"y\": min(ent_xy[\"y\"])},\n",
" {\"x\": max(ent_xy[\"x\"]), \"y\": max(ent_xy[\"y\"])},\n",
" ]\n",
"\n",
" entity_new = {\n",
" \"confidence\": formField_data.field_value.confidence,\n",
" \"mention_text\": formField_data.field_value.text_anchor.content,\n",
" \"page_anchor\": {\n",
" \"page_refs\": [\n",
" {\n",
" \"bounding_poly\": {\"normalized_vertices\": page_anc_1},\n",
" \"page\": str(page_number),\n",
" }\n",
" ]\n",
" },\n",
" \"text_anchor\": {\"text_segments\": text_anc},\n",
" \"type\": key_name,\n",
" }\n",
"\n",
" entities_kv.append(entity_new)\n",
"\n",
" file_schema = []\n",
" ent_considered = []\n",
" keys_dict = {}\n",
"\n",
" for entity in entities_kv:\n",
" if entity[\"type\"] in keys_dict:\n",
" keys_dict[entity[\"type\"]] = \"OPTIONAL_MULTIPLE\"\n",
" else:\n",
" keys_dict[entity[\"type\"]] = \"OPTIONAL_ONCE\"\n",
"\n",
" for ent in entities_kv:\n",
" if ent[\"type\"] not in ent_considered:\n",
" ent_considered.append(ent[\"type\"])\n",
" temp_ent = {\n",
" \"entity_type\": ent[\"type\"],\n",
" \"occurrence\": keys_dict[ent[\"type\"]],\n",
" \"entity_mention_text\": ent[\"mention_text\"],\n",
" \"value_type\": \"string\",\n",
" }\n",
" file_schema.append(temp_ent)\n",
"\n",
" return file_schema\n",
"\n",
"\n",
"def get_consolidated_schema(data: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:\n",
" \"\"\"\n",
" Consolidates schema information from a list of entities and returns a list of dictionaries with majority occurrence\n",
" and value type for each entity type.\n",
"\n",
" Args:\n",
" data (List[List[Dict[str, Any]]]): A list of entities, where each entity is represented by a dictionary.\n",
"\n",
" Returns:\n",
" List[Dict[str, Any]]: A list of dictionaries representing consolidated schema information for each entity type.\n",
" \"\"\"\n",
"\n",
" counters = {}\n",
" for item in data:\n",
" for entity in item:\n",
" entity_type = entity[\"entity_type\"]\n",
" occurrence = entity[\"occurrence\"]\n",
" value_type = entity[\"value_type\"]\n",
"\n",
" if entity_type not in counters:\n",
" counters[entity_type] = {\n",
" \"occurrence\": Counter(),\n",
" \"value_type\": Counter(),\n",
" }\n",
"\n",
" counters[entity_type][\"occurrence\"][occurrence] += 1\n",
" counters[entity_type][\"value_type\"][value_type] += 1\n",
"\n",
" # Create a new list of dictionaries with majority occurrence and value type for each entity type\n",
" result = []\n",
" for entity_type, counts in counters.items():\n",
" majority_occurrence = counts[\"occurrence\"].most_common(1)[0][0]\n",
" majority_value_type = counts[\"value_type\"].most_common(1)[0][0]\n",
"\n",
" result.append(\n",
" {\n",
" \"entity_type\": entity_type,\n",
" \"occurrence\": majority_occurrence,\n",
" \"value_type\": majority_value_type,\n",
" }\n",
" )\n",
" df = pd.DataFrame(result)\n",
"\n",
" df.to_csv(\"document_schema.csv\")\n",
"\n",
" return result\n",
"\n",
"\n",
"def get_allfiles_csv(data: Dict[str, List[Dict[str, Any]]]) -> None:\n",
" \"\"\"\n",
" Groups entities by filename and writes the data to a CSV file.\n",
"\n",
" Args:\n",
" data (Dict[str, List[Dict[str, Any]]]): A dictionary where keys are filenames and values are lists of entities.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
"\n",
" grouped_data = defaultdict(list)\n",
" for file_name, entities in data.items():\n",
" grouped_data[file_name].extend(entities)\n",
"\n",
" csv_file_path = \"Allfiles_data.csv\"\n",
" header = [\n",
" \"filename\",\n",
" \"entity_type\",\n",
" \"occurrence\",\n",
" \"entity_mention_text\",\n",
" \"value_type\",\n",
" ]\n",
"\n",
" with open(csv_file_path, \"w\", newline=\"\") as csvfile:\n",
" csv_writer = csv.DictWriter(csvfile, fieldnames=header)\n",
" csv_writer.writeheader()\n",
" csv_writer.writerows(\n",
" {\"filename\": file_name, **entity}\n",
" for file_name, entities in grouped_data.items()\n",
" for entity in entities\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "a9ed47fe-cea8-4577-9b22-526d6f7642b4",
"metadata": {},
"source": [
"### 4. Calling functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7349143-6b5b-48e9-9845-078d9aa365af",
"metadata": {},
"outputs": [],
"source": [
"# calling functions\n",
"bucket_name = formparser_parsed_jsons_path.split(\"/\")[2]\n",
"files = list(file_names(formparser_parsed_jsons_path)[1].values())\n",
"list_schema = []\n",
"file_wise = {}\n",
"for file in tqdm(files, desc=\"Status : \"):\n",
" json_dict = documentai_json_proto_downloader(bucket_name, file)\n",
" file_schema = get_schema_file(json_dict)\n",
" list_schema.append(file_schema)\n",
" file_wise[file.split(\"/\")[-1]] = file_schema\n",
"# if you need data for all files individually to review uncomment below line\n",
"# get_allfiles_csv(file_wise)\n",
"consolidated_schema = get_consolidated_schema(list_schema)"
]
},
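{
"cell_type": "markdown",
"id": "1a2b3c4d-5e6f-4a7b-8c9d-0e1f2a3b4c5d",
"metadata": {},
"source": [
"Before moving on, it can help to preview the consolidated result; a minimal sketch (`consolidated_schema` is produced by the cell above):\n",
"\n",
"```python\n",
"# Preview the consolidated schema that was also written to document_schema.csv\n",
"import pandas as pd\n",
"\n",
"df_preview = pd.DataFrame(consolidated_schema)\n",
"print(df_preview.head(10))\n",
"```"
]
},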
{
"cell_type": "markdown",
"id": "955f1175-2fd4-490d-bd4c-1f7d5531eab5",
"metadata": {},
"source": [
"### 5.CSV schema output\n",
"\n",
"Form parser output in UI\n",
"\n",
"<img src=\"./Images/Form_parser_output.png\" width=800 height=400 alt=\"Form_parser_output\"></img>\n",
"\n",
"Retrieved schema from code in the form of csv(‘document_schema.csv’)\n",
"\n",
"<img src=\"./Images/CSV_output.png\" width=800 height=400 alt=\"CSV_output\"></img>"
]
},
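{
"cell_type": "markdown",
"id": "2b3c4d5e-6f7a-4b8c-9d0e-1f2a3b4c5d6e",
"metadata": {},
"source": [
"For reference, `document_schema.csv` contains one row per entity type. The rows below are illustrative only; your entity names will come from your own documents:\n",
"\n",
"```\n",
",entity_type,occurrence,value_type\n",
"0,invoice_date,OPTIONAL_ONCE,string\n",
"1,line_item,OPTIONAL_MULTIPLE,string\n",
"```"
]
},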
{
"cell_type": "markdown",
"id": "773e10b1-347b-45b6-8d9e-127d35bf63d5",
"metadata": {},
"source": [
"### The above schema can be reviewed or modified as per the user requirements."
]
},
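{
"cell_type": "markdown",
"id": "3c4d5e6f-7a8b-4c9d-0e1f-2a3b4c5d6e7f",
"metadata": {},
"source": [
"A minimal sketch of one such edit, changing the occurrence of a single entity type before uploading (the entity name `invoice_date` is hypothetical):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Load the generated schema, adjust one entity's occurrence, and save it back\n",
"df = pd.read_csv(\"document_schema.csv\", index_col=0)\n",
"df.loc[df[\"entity_type\"] == \"invoice_date\", \"occurrence\"] = \"REQUIRED_ONCE\"\n",
"df.to_csv(\"document_schema.csv\")\n",
"```"
]
},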
{
"cell_type": "markdown",
"id": "7911bae8-dbd5-4290-9daf-1e61201c0b70",
"metadata": {
"tags": []
},
"source": [
"## Updating Schema to another parser"
]
},
{
"cell_type": "markdown",
"id": "96f0af86-3b3f-41e9-b923-226ec818e64e",
"metadata": {
"tags": []
},
"source": [
"### 1.Setup the required inputs\n",
"* `project_id` : Your Google project id or name\n",
"* `location_processor` : Location of processor\n",
"* `processor_id` : to which schema has to be updated\n",
"* `updated_schema_csv_path` : csv file modified or reviewed from above step"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ccaba42-fdcc-4b74-8e1c-619fcdd765eb",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"project_number = \"XXXXXXXXXXXXXXXX\" # project number\n",
"location_processor = \"us\" # location of processor\n",
"processor_id = \"xxxxxxxxxxxxxxxx\" # to which schema has to be updated\n",
"updated_schema_csv_path = (\n",
" \"document_schema.csv\" # csv file modified or reviewed from above step\n",
")"
]
},
{
"cell_type": "markdown",
"id": "bf26a3a0-3c9a-49d6-85a7-275664f65424",
"metadata": {},
"source": [
"### Required functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bcc9a0c7-af40-4fd8-8038-acdce986be80",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# helper functions\n",
"# get document schema\n",
"def get_dataset_schema(processor_name: str) -> Any:\n",
" \"\"\"\n",
" Retrieves the dataset schema for a specified processor.\n",
"\n",
" Args:\n",
" processor_name (str): The name of the processor.\n",
"\n",
" Returns:\n",
" Any: The response containing the dataset schema information.\n",
" \"\"\"\n",
"\n",
" # Create a client\n",
" from google.cloud import documentai_v1beta3\n",
"\n",
" client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
" # dataset_name = client.dataset_schema_path(project, location, processor)\n",
" # Initialize request argument(s)\n",
" request = documentai_v1beta3.GetDatasetSchemaRequest(\n",
" name=processor_name + \"/dataset/datasetSchema\",\n",
" )\n",
"\n",
" # Make the request\n",
" response = client.get_dataset_schema(request=request)\n",
"\n",
" return response\n",
"\n",
"\n",
"# update schema\n",
"def update_dataset_schema(schema: document.Document):\n",
" \"\"\"\n",
" Updates the dataset schema.\n",
"\n",
" Args:\n",
" schema (document.Document): The document containing the updated dataset schema.\n",
"\n",
" Returns:\n",
" document.Document: The response containing the updated dataset schema information.\n",
" \"\"\"\n",
"\n",
" from google.cloud import documentai_v1beta3\n",
"\n",
" # Create a client\n",
" client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
" # Initialize request argument(s)\n",
" request = documentai_v1beta3.UpdateDatasetSchemaRequest(\n",
" dataset_schema={\"name\": schema.name, \"document_schema\": schema.document_schema}\n",
" )\n",
" # Make the request\n",
" response = client.update_dataset_schema(request=request)\n",
"\n",
" # Handle the response\n",
" return response"
]
},
{
"cell_type": "markdown",
"id": "03ea2a33-7731-45c4-b796-925e441ebda4",
"metadata": {},
"source": [
"### Calling functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "616e70c6-ec64-426f-945a-dd1d0dbe3567",
"metadata": {},
"outputs": [],
"source": [
"# updating schema of processor\n",
"import pandas as pd\n",
"\n",
"df_updated = pd.read_csv(updated_schema_csv_path)\n",
"schema_updated = []\n",
"for m in range(len(df_updated)):\n",
" schema_ent = {\n",
" \"name\": df_updated.loc[m][\"entity_type\"],\n",
" \"value_type\": df_updated.loc[m][\"value_type\"],\n",
" \"occurrence_type\": df_updated.loc[m][\"occurrence\"],\n",
" }\n",
" schema_updated.append(schema_ent)\n",
"\n",
"response_schema = get_dataset_schema(\n",
" f\"projects/{project_number}/locations/{location_processor}/processors/{processor_id}\"\n",
")\n",
"\n",
"for i in response_schema.document_schema.entity_types:\n",
" for e3 in schema_updated:\n",
" i.properties.append(e3)\n",
"\n",
"response = update_dataset_schema(response_schema)"
]
},
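{
"cell_type": "markdown",
"id": "4d5e6f7a-8b9c-4d0e-1f2a-3b4c5d6e7f8a",
"metadata": {},
"source": [
"To confirm the update took effect, you can fetch the schema again and list the property names of each entity type; a minimal sketch using the helper defined above:\n",
"\n",
"```python\n",
"# Re-fetch the dataset schema and print each entity type's properties\n",
"verified = get_dataset_schema(\n",
"    f\"projects/{project_number}/locations/{location_processor}/processors/{processor_id}\"\n",
")\n",
"for entity_type in verified.document_schema.entity_types:\n",
"    for prop in entity_type.properties:\n",
"        print(entity_type.name, \"->\", prop.name)\n",
"```"
]
},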
{
"cell_type": "markdown",
"id": "b571acbf-420d-4a05-bd7e-a0a4b3de211b",
"metadata": {},
"source": [
"### output\n",
"The above script adds the schema in the parser as below\n",
"\n",
"<img src=\"./Images/processor_output.png\" width=800 height=400 alt=\"processor_output\"></img>"
]
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m112",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m112"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}