incubator-tools/document_level_accuracy/document_level_accuracy.ipynb (655 lines of code) (raw):

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "ea9ed8f8-1929-4898-8cdd-60d5612b9250", "metadata": {}, "source": [ "# Document Level Accuracy" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3aebb5e4-f65e-434c-b8ec-737d418c4bcf", "metadata": {}, "source": [ "* Author: docai-incubator@google.com" ] }, { "attachments": {}, "cell_type": "markdown", "id": "acdbd5f8-c3af-4176-b0f5-5526a1a4aeb8", "metadata": {}, "source": [ "## Disclaimer\n", "\n", "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied. " ] }, { "attachments": {}, "cell_type": "markdown", "id": "2a0ee50a-6427-414e-9be4-f99ca16654e6", "metadata": {}, "source": [ "## Objective\n", "This tool uses annotated docs (JSON files) from GCS bucket as input and then runs the same (image) files through the designated version of the processor. Comparison of Annotated json files and processed json files should be provided in a CSV file with difference and Document level accuracy stats. " ] }, { "attachments": {}, "cell_type": "markdown", "id": "42490221-3a88-4a03-9974-43da3ceed9fb", "metadata": {}, "source": [ "## Step by Step procedure " ] }, { "attachments": {}, "cell_type": "markdown", "id": "f4dd2b0e-2026-4f91-9885-2d8de58346ea", "metadata": {}, "source": [ "### 1. Install and import the required libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "04a57429-1b4a-4fea-a512-81c4ec94ede9", "metadata": {}, "outputs": [], "source": [ "!pip install pandas numpy google-cloud-storage google-cloud-documentai==2.16.0 PyPDF2 configparser pillow\n", "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" ] }, { "cell_type": "code", "execution_count": 50, "id": "2c1bf07d-b0ad-4c4f-8f06-de927501fb23", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import operator\n", "import difflib\n", "import json\n", "import os\n", "import pandas as pd\n", "import time\n", "import numpy as np\n", "from google.cloud import storage\n", "from google.cloud import documentai_v1beta3\n", "from PIL import Image\n", "from typing import (\n", " Container,\n", " Iterable,\n", " Iterator,\n", " List,\n", " Mapping,\n", " Optional,\n", " Sequence,\n", " Tuple,\n", " Union,\n", ")\n", "from PyPDF2 import PdfFileReader\n", "import configparser\n", "import warnings\n", "import ast\n", "import io\n", "import re\n", "import traceback\n", "import datetime\n", "from typing import Any, Dict, List, Optional, Sequence, Tuple, Union\n", "import utilities" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b3114c71-e9dc-4ef2-abd4-389aa40aca93", "metadata": {}, "source": [ "### 2. Setup the reuired Input Details" ] }, { "cell_type": "code", "execution_count": 51, "id": "3a13e64c-befa-4b50-848e-a89107aef24d", "metadata": {}, "outputs": [], "source": [ "project_id = \"your-project-id\"\n", "location = \"your-location\"\n", "processor_id = \"your-processor-id\"\n", "processor_version = \"your-processor-version\"\n", "groundtruth_bucket_uri = \"gs://your-bucket-uri\"\n", "critical_entities = [\"entity1\", \"entity2\", \"entity3\", \"entity4\", \"entity5\"]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c104efc9-ed01-439b-a580-6d87babeca3e", "metadata": {}, "source": [ "Enter the input details with necessary information as outlined below:\n", "\n", "- `project_id`: Provide the project ID of your Google Cloud project.\n", "- `groundtruth_bucket_uri`: Provide the Google Cloud Storage (GCS) path of the annotated JSON files.\n", "- `critical_entities`: Provide a list of critical entities for which you require document level accuracy.\n", " - Example: `['invoice_id','invoice_date','receiver_name','receiver_address','supplier_name']`\n", "- `processor_id`: Provide the processor ID of your Document AI processor.\n", "- `processor_version_ID`: Provide the processor version ID.\n", "- `Location`: Specify the location (e.g., 'us' or 'eu') where your processor is created.\n", "\n", "Note: If the critical_entities parameter is provided as an empty list then the tool will compare all the entities." ] }, { "cell_type": "code", "execution_count": 52, "id": "5eede1ae-f723-47a8-a319-ac7979716dd5", "metadata": {}, "outputs": [], "source": [ "def f1_calculator(merged_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:\n", " \"\"\"\n", " Calculates F1-Score and other related metrics for entities in a dataframe.\n", "\n", " Args:\n", " merged_df (pd.DataFrame): A pandas DataFrame containing the merged data.\n", "\n", " Returns:\n", " Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing two DataFrames.\n", " The first DataFrame contains the calculated metrics.\n", " The second DataFrame is the original input DataFrame.\n", " \"\"\"\n", " metrics_list = []\n", "\n", " # Function to calculate metrics\n", " def calculate_metrics(subset_df, entity_name):\n", " TP_Count = len(subset_df[subset_df[\"match\"] == \"TP\"])\n", " FP_Count = len(subset_df[subset_df[\"match\"] == \"FP\"])\n", " FN_Count = len(subset_df[subset_df[\"match\"] == \"FN\"])\n", " TN_Count = len(subset_df[subset_df[\"match\"] == \"TN\"])\n", " metrics = {\n", " \"entity_name\": entity_name,\n", " \"Accuracy\": round(\n", " (TP_Count + TN_Count) / (TP_Count + FP_Count + FN_Count + TN_Count), 2\n", " )\n", " if TP_Count\n", " else 0,\n", " \"Precision\": round(TP_Count / (TP_Count + FP_Count), 2) if TP_Count else 0,\n", " \"Recall\": round(TP_Count / (TP_Count + FN_Count), 2) if TP_Count else 0,\n", " \"F1-Score\": round(2 * TP_Count / (2 * TP_Count + FP_Count + FN_Count), 2)\n", " if TP_Count\n", " else 0,\n", " }\n", " return metrics\n", "\n", " # Calculate metrics for all entities\n", " all_entities_metrics = calculate_metrics(merged_df, \"All Entities\")\n", " metrics_list.append(all_entities_metrics)\n", "\n", " # Calculate metrics for each unique entity\n", " for entity in merged_df[\"entity_name\"].unique():\n", " entity_df = merged_df[merged_df[\"entity_name\"] == entity]\n", " entity_metrics = calculate_metrics(entity_df, entity)\n", " metrics_list.append(entity_metrics)\n", "\n", " # Convert list of dictionaries to DataFrame\n", " All_metrics = pd.DataFrame(metrics_list)\n", " return All_metrics, merged_df\n", "\n", "\n", "def doc_proto_to_dataframe(data: documentai_v1beta3.Document) -> pd.DataFrame:\n", " \"\"\"It will convert Document Proto object to DataFrame. Returns entities in dataframe format\n", "\n", " Args:\n", " data (documentai_v1beta3.Document): It is Document Proto Object\n", "\n", " Returns:\n", " pd.DataFrame: It is a DataFrame which having all entities data as rows\n", " \"\"\"\n", "\n", " df = pd.DataFrame(columns=[\"type_\", \"mention_text\", \"bbox\"])\n", " if not data.entities:\n", " print(\"No entities Found\")\n", " return df\n", " for entity in data.entities:\n", " if entity.properties:\n", " for sub_entity in entity.properties:\n", " df = add_entity_to_dataframe(sub_entity, df)\n", " continue\n", " df = add_entity_to_dataframe(entity, df)\n", " return df\n", "\n", "\n", "def add_entity_to_dataframe(\n", " entity: documentai_v1beta3.Document.Entity, df: pd.DataFrame\n", ") -> pd.DataFrame:\n", " \"\"\"It will append entity data to given DataFrame\n", "\n", " Args:\n", " entity (documentai_v1beta3.Document.Entity): An entity from Document Object\n", " df (pd.DataFrame): Target Dataframe to add an entity as new row\n", "\n", " Returns:\n", " pd.DataFrame: It is a Dataframe with newly appended entity as row\n", " \"\"\"\n", "\n", " if entity.mention_text:\n", " coord1, _, coord3, _ = entity.page_anchor.page_refs[\n", " 0\n", " ].bounding_poly.normalized_vertices\n", " bbox = [coord1.x, coord1.y, coord3.x, coord3.y]\n", " df.loc[len(df.index)] = [entity.type_, entity.mention_text, bbox]\n", " else:\n", " df.loc[len(df.index)] = [entity.type_, \"Entity not found.\", []]\n", " return df\n", "\n", "\n", "def compare_doc_proto_convert_dataframe(\n", " file1: documentai_v1beta3.Document, file2: documentai_v1beta3.Document\n", ") -> Tuple[pd.DataFrame, np.float64]:\n", " \"\"\"Compares the entities between two files and returns the results in a dataframe\n", "\n", " Args:\n", " file1 (documentai_v1beta3.Document): It is Document Proto Object\n", " file2 (documentai_v1beta3.Document): It is also Document Proto Object to compare with previous\n", "\n", " Returns:\n", " Tuple[pd.DataFrame, np.float64]: It returns Dataframe and matched score\n", " between two Document Protos\n", " \"\"\"\n", "\n", " df_file1 = doc_proto_to_dataframe(file1)\n", " df_file2 = doc_proto_to_dataframe(file2)\n", " file1_entities = [entity[0] for entity in df_file1.values]\n", " file2_entities = [entity[0] for entity in df_file2.values]\n", "\n", " # find entities which are present only once in both files\n", " # these entities will be matched directly\n", " common_entities = set(file1_entities).intersection(set(file2_entities))\n", " exclude_entities = []\n", " for entity in common_entities:\n", " if file1_entities.count(entity) > 1 or file2_entities.count(entity) > 1:\n", " exclude_entities.append(entity)\n", " for entity in exclude_entities:\n", " common_entities.remove(entity)\n", " df_compare = pd.DataFrame(\n", " columns=[\"entity_name\", \"initial_prediction\", \"current_prediction\"]\n", " )\n", " for entity in common_entities:\n", " value1 = df_file1[df_file1[\"type_\"] == entity].iloc[0][\"mention_text\"]\n", " value2 = df_file2[df_file2[\"type_\"] == entity].iloc[0][\"mention_text\"]\n", " df_compare.loc[len(df_compare.index)] = [entity, value1, value2]\n", " # common entities are removed from df_file1 and df_file2\n", " df_file1 = utilities.remove_row(df_file1, entity)\n", " df_file2 = utilities.remove_row(df_file2, entity)\n", "\n", " # remaining entities are matched comparing the area of IOU across them\n", " mention_text2 = pd.Series(dtype=str)\n", " for index, row in enumerate(df_file1.values):\n", " matched_index = utilities.find_match(row, df_file2)\n", " if matched_index is not None:\n", " mention_text2.loc[index] = df_file2.loc[matched_index][1]\n", " df_file2 = df_file2.drop(matched_index)\n", " else:\n", " mention_text2.loc[index] = \"Entity not found.\"\n", "\n", " df_file1[\"mention_text2\"] = mention_text2.values\n", " df_file1 = df_file1.drop([\"bbox\"], axis=1)\n", " df_file1.rename(\n", " columns={\n", " \"type_\": \"entity_name\",\n", " \"mention_text\": \"initial_prediction\",\n", " \"mention_text2\": \"current_prediction\",\n", " },\n", " inplace=True,\n", " )\n", " df_compare = pd.concat([df_compare, df_file1], ignore_index=True)\n", "\n", " # adding entities which are present in file2 but not in file1\n", " for row in df_file2.values:\n", " df_compare.loc[len(df_compare.index)] = [row[0], \"Entity not found.\", row[1]]\n", "\n", " df_compare[\"match\"] = (\n", " df_compare[\"initial_prediction\"] == df_compare[\"current_prediction\"]\n", " )\n", " df_compare[\"fuzzy ratio\"] = df_compare.apply(utilities.get_match_ratio, axis=1)\n", " if list(df_compare.index):\n", " score = df_compare[\"fuzzy ratio\"].sum() / len(df_compare.index)\n", " else:\n", " score = 0\n", " return df_compare, score\n", "\n", "\n", "def classify_row(row: pd.Series) -> str:\n", " \"\"\"\n", " Classifies a row into categories based on the comparison of 'initial_prediction'\n", " and 'current_prediction' values.\n", "\n", " Args:\n", " row (pd.Series): A row from a pandas DataFrame, expected to contain\n", " 'initial_prediction' and 'current_prediction' columns.\n", "\n", " Returns:\n", " str: The classification result, which can be 'TN', 'FN', 'FP', 'TP', or an error message.\n", " \"\"\"\n", " if (\n", " row[\"initial_prediction\"] == \"Entity not found.\"\n", " and row[\"current_prediction\"] == \"Entity not found.\"\n", " ):\n", " return \"TN\"\n", " elif (\n", " row[\"initial_prediction\"] != \"Entity not found.\"\n", " and row[\"current_prediction\"] == \"Entity not found.\"\n", " ):\n", " return \"FN\"\n", " elif (\n", " row[\"initial_prediction\"] == \"Entity not found.\"\n", " and row[\"current_prediction\"] != \"Entity not found.\"\n", " ):\n", " return \"FP\"\n", " elif (\n", " row[\"initial_prediction\"] != \"Entity not found.\"\n", " and row[\"current_prediction\"] != \"Entity not found.\"\n", " ):\n", " if row[\"initial_prediction\"] == row[\"current_prediction\"]:\n", " return \"TP\"\n", " else:\n", " return \"FP\"\n", " else:\n", " return \"Something went Wrong.\"" ] }, { "cell_type": "code", "execution_count": null, "id": "245dce21-4879-445a-8e5f-c596e30edf15", "metadata": {}, "outputs": [], "source": [ "try:\n", " storage_client = storage.Client()\n", "\n", " # Current time for unique bucket names\n", " time_stamp = datetime.datetime.now().strftime(\"%Y%m%d%H%M%S\")\n", "\n", " print(\"Creating temporary buckets\")\n", " groundtruth_bucket_name = \"groundtruth-vb_temp_\" + time_stamp\n", " parsed_output_bucket_name = \"processedoutput-vb_temp_\" + time_stamp\n", "\n", " # Extract the ground truth bucket name from the URI\n", " ground_truth_bucket = groundtruth_bucket_uri.split(\"/\")[2]\n", "\n", " # Create temporary buckets\n", " utilities.check_create_bucket(groundtruth_bucket_name)\n", " utilities.check_create_bucket(parsed_output_bucket_name)\n", " warnings.simplefilter(action=\"ignore\", category=FutureWarning)\n", " try:\n", " ground_truth_files, ground_truth_dict = utilities.file_names(\n", " groundtruth_bucket_uri\n", " )\n", " print(\"Copying files to temporary bucket\")\n", " for file_name in ground_truth_files:\n", " utilities.copy_blob(\n", " ground_truth_bucket,\n", " ground_truth_dict[file_name],\n", " groundtruth_bucket_name,\n", " file_name,\n", " )\n", "\n", " # List files in the new bucket\n", " files_list = [\n", " blob.name\n", " for blob in storage_client.bucket(groundtruth_bucket_name).list_blobs()\n", " ]\n", " except Exception as e:\n", " print(\"Unable to process files due to: \", e)\n", "\n", " for file_name in files_list:\n", " print(groundtruth_bucket_name, file_name)\n", " input_path_json = utilities.blob_downloader(groundtruth_bucket_name, file_name)\n", " pdf_bytes, synthesized_images = utilities.create_pdf_bytes_from_json(\n", " input_path_json\n", " )\n", "\n", " try:\n", " res = utilities.process_document_sample(\n", " project_id, location, processor_id, pdf_bytes, processor_version\n", " )\n", " document_json = documentai_v1beta3.Document.to_json(res.document).encode(\n", " \"utf-8\"\n", " )\n", " blob = storage_client.bucket(parsed_output_bucket_name).blob(file_name)\n", " blob.upload_from_string(document_json, content_type=\"application/json\")\n", " except Exception as e:\n", " print(f\"Unable to process file {file_name} due to: \", e)\n", "\n", " (\n", " relation_dict,\n", " relation_non_matched_files_dict,\n", " ) = utilities.matching_files_two_buckets(\n", " groundtruth_bucket_name, parsed_output_bucket_name\n", " )\n", " # print(relation_dict)\n", " test_harness_merged = pd.DataFrame()\n", " accuracy_docs = []\n", " print(\"comparing the Annotated Jsons and Processed jsons ....Wait for Summary \")\n", " for i in relation_dict:\n", " groundtruth_json = utilities.blob_downloader(groundtruth_bucket_name, i)\n", " parsed_output_json = utilities.blob_downloader(\n", " parsed_output_bucket_name, relation_dict[i]\n", " )\n", " # test_harness_output = compare_groundtruth_and_output(groundtruth_json, parsed_output_json)[0]\n", "\n", " groundtruth_json_string = json.dumps(groundtruth_json)\n", " parsed_json_string = json.dumps(parsed_output_json)\n", "\n", " groundtruth_json_proto = documentai_v1beta3.Document.from_json(\n", " groundtruth_json_string\n", " )\n", " parsed_output_json_proto = documentai_v1beta3.Document.from_json(\n", " parsed_json_string\n", " )\n", "\n", " test_harness_output = compare_doc_proto_convert_dataframe(\n", " groundtruth_json_proto, parsed_output_json_proto\n", " )[0]\n", "\n", " test_harness_output[\"match\"] = test_harness_output.apply(classify_row, axis=1)\n", "\n", " # Save to CSV\n", " # test_harness_output.to_csv(\"test_harness_output.csv\", index=False)\n", " column = [relation_dict[i]] * test_harness_output.shape[0]\n", " # print(column)\n", " test_harness_output.insert(loc=0, column=\"File Name\", value=column)\n", " Document_accuracy = \"\"\n", " dict_files = {}\n", " if len(critical_entities) > 0:\n", " for j in critical_entities:\n", " try:\n", " if (\n", " test_harness_output[test_harness_output[\"entity_name\"] == j][\n", " \"match\"\n", " ]\n", " .value_counts()\n", " .FP\n", " > 0\n", " ):\n", " Document_accuracy = \"NO\"\n", " break\n", " except AttributeError:\n", " try:\n", " if (\n", " test_harness_output[\n", " test_harness_output[\"entity_name\"] == j\n", " ][\"match\"]\n", " .value_counts()\n", " .FN\n", " > 0\n", " ):\n", " Document_accuracy = \"NO\"\n", " break\n", " except AttributeError:\n", " Document_accuracy = \"YES\"\n", " else:\n", " try:\n", " if test_harness_output[\"match\"].value_counts().FP > 0:\n", " Document_accuracy = \"NO\"\n", " break\n", " except AttributeError:\n", " try:\n", " if test_harness_output[\"match\"].value_counts().FN > 0:\n", " Document_accuracy = \"NO\"\n", " break\n", " except AttributeError:\n", " Document_accuracy = \"YES\"\n", " # print(Document_accuracy)\n", "\n", " dict_files[i] = Document_accuracy\n", " accuracy_docs.append(dict_files)\n", " frames = [test_harness_merged, test_harness_output]\n", " test_harness_merged = pd.concat(frames)\n", " try:\n", " utilities.bucket_delete(groundtruth_bucket_name)\n", " utilities.bucket_delete(parsed_output_bucket_name)\n", " except:\n", " pass\n", "\n", " output = f1_calculator(test_harness_merged)[0]\n", " Match_YES = 0\n", " Match_NO = 0\n", " try:\n", " Match_YES = test_harness_merged[\"fuzzy ratio\"].value_counts().YES\n", " print(\"*******************SUMMARY**************************\")\n", " print(\"NO OF DOCUMENTS HAVE 100% DOCUMENT ACCURACY =\", Match_YES)\n", " except:\n", " print(\"NO OF DOCUMENTS HAVE 100% DOCUMENT ACCURACY =\", Match_YES)\n", " try:\n", " Match_NO = test_harness_merged[\"fuzzy ratio\"].value_counts().NO\n", " print(\"NO OF DOCUMENTS DOESNT HAVE 100% DOCUMENT ACCURACY =\", Match_NO)\n", " except:\n", " print(\"NO OF DOCUMENTS DOESNT HAVE 100% DOCUMENT ACCURACY =\", Match_NO)\n", "\n", " rejected_docs = []\n", " for i in range(len(accuracy_docs)):\n", " for j in accuracy_docs[i]:\n", " if accuracy_docs[i][j] == \"NO\":\n", " rejected_docs.append(j)\n", "\n", " print(\"\\n\")\n", " print(\n", " \"LIST OF DOCUMENTS WHICH DOESNT HAVE 100% DOCUMENT ACCURACY\\n\",\n", " rejected_docs,\n", " \"\\n\",\n", " )\n", " print(\"***********FOR DETAILS SEE THE CSV FILE CREATED******************\")\n", "\n", " df = pd.DataFrame()\n", " for i in range(len(critical_entities)):\n", " df1 = test_harness_merged[\n", " test_harness_merged[\"entity_name\"] == (critical_entities[i])\n", " ]\n", " df = pd.concat([df, df1])\n", " df2 = test_harness_merged[test_harness_merged[\"fuzzy ratio\"] == \"YES\"]\n", " df3 = test_harness_merged[test_harness_merged[\"fuzzy ratio\"] == \"NO\"]\n", " df = pd.concat([df, df2, df3])\n", " df = df.sort_values(by=[\"File Name\"])\n", " df = df.reset_index(drop=True)\n", " df = df.to_csv(\"Document_Level_Accuracy.csv\")\n", "\n", "except Exception as e:\n", " try:\n", " utilities.bucket_delete(groundtruth_bucket_name)\n", " utilities.bucket_delete(parsed_output_bucket_name)\n", " except Exception as inner_e:\n", " print(\"Error during bucket deletion:\", inner_e)\n", " traceback.print_exc()\n", "\n", " print(\"Unable to process the file:\", e)\n", " traceback.print_exc()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4867f1f6-7d0f-48f7-9913-c9b754cc8407", "metadata": {}, "source": [ "### **Output**" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5619aa50-05b8-4aae-8591-e73e3563b23b", "metadata": {}, "source": [ "The CSV file should have all the details of mismatch as shown below with Document level accuracy in ‘YES’ or ‘NO’.\n", "In the comparison, if all the entities in annotated jsons and processed json are matching 100% then the Document level accuracy is shown as YES else NO." ] }, { "attachments": {}, "cell_type": "markdown", "id": "1d60b550-dcdd-4a6c-ae7a-55f28d8e9615", "metadata": {}, "source": [ "<td><img src=\"./images/output.png\" width=800 height=400></td>" ] }, { "cell_type": "code", "execution_count": null, "id": "9f713574-f82d-4b41-8bee-413f68d5dcb9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "71ca46b5-21c9-4f5c-985d-1e434427889d", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "environment": { "kernel": "python3", "name": "common-cpu.m104", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cpu:m104" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }