{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Labeled Dataset Validation"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Disclaimer\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Purpose and Description\n",
"This tool uses labeled json files as an input and gives whether there are any labeling issues like blank entities, entities which have text anchor or page anchor issues and overlapping of entities as output in a csv and dictionary format for further use."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Prerequisite\n",
"* Vertex AI Notebook\n",
"* Labeled json files in GCS Folder"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step By Step Procedure"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Import Modules/Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install pandas\n",
"!pip install google-cloud-documentai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"from typing import DefaultDict, List, Tuple\n",
"\n",
"import pandas as pd\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"\n",
"from utilities import (\n",
" bb_intersection_over_union,\n",
" documentai_json_proto_downloader,\n",
" file_names,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Input Details\n",
"\n",
"* **GCS_INPUT_PATH**: GCS folder path of labelled JSON files"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"GCS_INPUT_PATH = \"gs://bucket/path_to/input\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Run Below Code Cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_x_y_list(\n",
" bounding_poly: documentai.BoundingPoly,\n",
") -> Tuple[List[float], List[float]]:\n",
" \"\"\"It takes BoundingPoly object and separates it x & y normalized coordinates as lists\n",
"\n",
" Args:\n",
" bounding_poly (documentai.BoundingPoly): A token of Document Page object\n",
"\n",
" Returns:\n",
" Tuple[List[float], List[float]]: It returns x & y normalized coordinates as separate lists\n",
" \"\"\"\n",
"\n",
" x, y = [], []\n",
" normalized_vertices = bounding_poly.normalized_vertices\n",
" for nv in normalized_vertices:\n",
" x.append(nv.x)\n",
" y.append(nv.y)\n",
" return x, y\n",
"\n",
"\n",
"def get_page_wise_entities(\n",
" doc: documentai.Document,\n",
") -> DefaultDict[int, List[documentai.Document.Entity]]:\n",
" \"\"\"It gives page-wise entites for all pages in Document object\n",
"\n",
" Args:\n",
" doc (documentai.Document): Documnet Proto Object\n",
"\n",
" Returns:\n",
" DefaultDict[int, List[documentai.Document.Entity]]: Dictionary which contains page number as key and list of all entities in corresponding page\n",
" \"\"\"\n",
"\n",
" entities_page = defaultdict(list)\n",
" for entity in doc.entities:\n",
" page = 0\n",
" if entity.properties:\n",
" for subentity in entity.properties:\n",
" page = entity.page_anchor.page_refs[0].page\n",
" entities_page[page].append(subentity)\n",
" continue\n",
" page = entity.page_anchor.page_refs[0].page\n",
" entities_page[page].append(entity)\n",
"\n",
" return entities_page\n",
"\n",
"\n",
"# getting blank entities\n",
"def get_blank_entities(\n",
" doc: documentai.Document,\n",
") -> Tuple[List[str], List[documentai.Document.Entity]]:\n",
" \"\"\"It is helpful to identify blank entities in document-proto\n",
"\n",
" Args:\n",
" doc (documentai.Document): Documnet Proto Object\n",
"\n",
" Returns:\n",
" Tuple[List[str], List[documentai.Document.Entity]]: It returns entity-type and corresponding entity as list of strings and list of entities whose mention_text is empty\n",
" \"\"\"\n",
"\n",
" blank_space_ent_name = []\n",
" blank_space_entities = []\n",
" for entity in doc.entities:\n",
" if not entity.mention_text:\n",
" blank_space_entities.append(entity)\n",
" blank_space_ent_name.append(entity.type_)\n",
" if entity.properties:\n",
" for subentity in entity.properties:\n",
" if not subentity.mention_text:\n",
" blank_space_entities.append(entity)\n",
" blank_space_ent_name.append([entity.type_, subentity.type_])\n",
" return blank_space_ent_name, blank_space_entities\n",
"\n",
"\n",
"# getting labelling issue entities(text and page anchor missing)\n",
"def get_labeling_issues(\n",
" doc: documentai.Document,\n",
") -> Tuple[List[str], List[documentai.Document.Entity]]:\n",
" \"\"\"It helps to identify labelling issue entities in which text-anchors or page-anchors missing\n",
"\n",
" Args:\n",
" doc (documentai.Document): Documnet Proto Object\n",
"\n",
" Returns:\n",
" Tuple[List[str], List[documentai.Document.Entity]]: It returns entity-type and corresponding entity as list of strings and list of entities whose text_anchor or page_anchor is empty\n",
" \"\"\"\n",
"\n",
" labeling_issue_ent_name = []\n",
" labeling_issue_ent = []\n",
"\n",
" for entity in doc.entities:\n",
" # page anchor issues\n",
" ver = entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices\n",
" if (not ver) or len(ver) != 4:\n",
" labeling_issue_ent_name.append(entity.type_)\n",
" labeling_issue_ent.append(entity)\n",
" # text anchor issues\n",
" index = entity.text_anchor.text_segments\n",
" if not index:\n",
" labeling_issue_ent_name.append(entity.type_)\n",
" labeling_issue_ent.append(entity)\n",
"\n",
" return labeling_issue_ent_name, labeling_issue_ent\n",
"\n",
"\n",
"def get_overlapping_entities(\n",
" doc: documentai.Document,\n",
") -> Tuple[List[List[str]], List[List[documentai.Document.Entity]]]:\n",
" \"\"\"It helps to identify overlapping entities(same data with different entity-type)\n",
"\n",
" Args:\n",
" doc (documentai.Document): Documnet Proto Object\n",
"\n",
" Returns:\n",
" Tuple[List[List[str]],List[List[documentai.Document.Entity]]]: It returns entity-type and entity as list-of-lists whose entities have same data with different entity-type\n",
" \"\"\"\n",
"\n",
" entities_type_overlap = []\n",
" entities_overlap = []\n",
" entitites_page_wise = get_page_wise_entities(doc)\n",
" for page, entities in entitites_page_wise.items():\n",
" for entity1 in entities:\n",
" for entity2 in entities:\n",
" if entity1 != entity2:\n",
" x, y = get_x_y_list(entity1.page_anchor.page_refs[0].bounding_poly)\n",
" box1 = [min(x), min(y), max(x), max(y)]\n",
" x, y = get_x_y_list(entity2.page_anchor.page_refs[0].bounding_poly)\n",
" box2 = [min(x), min(y), max(x), max(y)]\n",
" iou = bb_intersection_over_union(box1, box2)\n",
" if iou > 0.5:\n",
" if [\n",
" entity1.type_,\n",
" entity2.type_,\n",
" ] not in entities_type_overlap and [\n",
" entity2.type_,\n",
" entity1.type_,\n",
" ] not in entities_type_overlap:\n",
" entities_type_overlap.append([entity1.type_, entity2.type_])\n",
" entities_overlap.append([entity1, entity2])\n",
" return entities_type_overlap, entities_overlap\n",
"\n",
"\n",
"print(\"Process Started\")\n",
"df = pd.DataFrame(\n",
" columns=[\"File_Name\", \"Blank_Entities\", \"Labeling_Issues\", \"Overlapping_Issues\"]\n",
")\n",
"file_names_list, file_dict = file_names(GCS_INPUT_PATH)\n",
"file_wise_ent_type = {}\n",
"file_wise_entities = {}\n",
"for filename, filepath in file_dict.items():\n",
" input_bucket_name = GCS_INPUT_PATH.split(\"/\")[2]\n",
" print(\"\\tfilename: \", filename)\n",
" doc = documentai_json_proto_downloader(input_bucket_name, filepath)\n",
" blank_ent_name, blank_entities = get_blank_entities(doc)\n",
" labeling_issue_ent_name, labeling_issue_ent = get_labeling_issues(doc)\n",
" entities_type_overlap, entities_overlap = get_overlapping_entities(doc)\n",
" blank_entities = labeling_issues = overlapping_issues = \"No\"\n",
" if blank_ent_name:\n",
" blank_entities = \"Yes\"\n",
" if labeling_issue_ent_name:\n",
" labeling_issues = \"Yes\"\n",
" if entities_type_overlap:\n",
" overlapping_issues = \"Yes\"\n",
" new_row = {\n",
" \"File_Name\": filename,\n",
" \"Blank_Entities\": blank_entities,\n",
" \"Labeling_Issues\": labeling_issues,\n",
" \"Overlapping_Issues\": overlapping_issues,\n",
" }\n",
" df.loc[len(df)] = new_row\n",
" file_wise_ent_type[filename] = {\n",
" \"Blank_ent_type\": blank_ent_name,\n",
" \"Labeling_ent_type\": labeling_issue_ent_name,\n",
" \"overlapping_ent_type\": entities_type_overlap,\n",
" }\n",
" file_wise_entities[filename] = {\n",
" \"Blank_ent\": blank_entities,\n",
" \"Labeling_ent\": labeling_issue_ent,\n",
" \"overlapping_ent\": entities_overlap,\n",
" }\n",
"\n",
"df_ent_type = pd.DataFrame.from_dict(file_wise_ent_type, orient=\"index\")\n",
"df_ent_type.reset_index(inplace=True)\n",
"df_ent_type.rename(columns={\"index\": \"file_name\"}, inplace=True)\n",
"print(\"Writing Data to labeling_issues_with_entity_type.csv \")\n",
"df_ent_type.to_csv(\"./labeling_issues_with_entity_type.csv\")\n",
"print(\"Writing data to labeling_issues.csv\")\n",
"df.to_csv(\"./labeling_issues.csv\")\n",
"print(\"Process Completed for all files\")"
]
},
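{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The overlap check above relies on `bb_intersection_over_union` from the downloaded `utilities` module, with an IoU threshold of 0.5. The optional cell below is a small, self-contained sketch of the same intersection-over-union idea (a local stand-in, not the `utilities` implementation) so the threshold can be sanity-checked on toy boxes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: a local IoU sketch. This is an illustrative stand-in for\n",
"# utilities.bb_intersection_over_union, assuming boxes as [x_min, y_min, x_max, y_max].\n",
"def iou_sketch(box_a, box_b):\n",
"    x_left = max(box_a[0], box_b[0])\n",
"    y_top = max(box_a[1], box_b[1])\n",
"    x_right = min(box_a[2], box_b[2])\n",
"    y_bottom = min(box_a[3], box_b[3])\n",
"    if x_right <= x_left or y_bottom <= y_top:\n",
"        return 0.0  # boxes do not overlap\n",
"    inter = (x_right - x_left) * (y_bottom - y_top)\n",
"    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])\n",
"    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])\n",
"    return inter / (area_a + area_b - inter)\n",
"\n",
"\n",
"# Two boxes covering mostly the same region score well above the 0.5 threshold\n",
"print(iou_sketch([0.1, 0.1, 0.5, 0.2], [0.12, 0.1, 0.5, 0.2]))  # ~0.95"
]
},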
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Output Details"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The output is in 2 CSV files and a dictionary. \n",
"\n",
"In the CSV files , the columns are as below \n",
"**File_Name**: File name of the labeled json in GCP folder \n",
"**Blank_Entities**: The entities which are labeled blank or which doesn't have anything in the mentionText of the entity \n",
"**Yes**- Denotes there are Blank entities \n",
"**No** - No Blank entities found in the json \n",
"\n",
"#### Labeling issues:\n",
"\n",
"The entities which have issues in Text anchors or Page anchors are treated as labeling issues because of which you cannot convert into proto format. \n",
"**Yes** - denotes there are few entities which have labeling issues \n",
"**No** - denotes no labeling issues \n",
"\n",
"\n",
"#### Overlapping issues:\n",
"The entities where it is labeled more than once with same or different entity type like below \n",
"<img src=\"./images/overlapping_issue.png\"> \n",
"\n",
"### 1. labeling_issues.csv\n",
"<img src=\"./images/labeling_issues.csv.png\"> \n",
"\n",
"### 2. labeling_issues_with_entity_type.csv\n",
"This CSV file is the same as **labeling_issues.csv** but with a list of entity types which have issues. \n",
"\n",
"For Blank_entities and Labeling issues, the entity type needs to be provided in a list which have issues \n",
"\n",
"But for overlapping issues , it gives the entity name in the nested list where each list has 2 values which are labeled in the same area. \n",
"\n",
"<img src=\"./images/labeling_issues_with_entity_type.csv.png\"> \n",
"\n",
"**file_wise_entities** is a dictionary where it has all the entity details which issues and can be deleted if needed from the json. \n",
"<img src=\"./images/file_wise_entities_dict.png\"> \n",
"\n"
]
},
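{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a minimal sketch of how the **file_wise_entities** dictionary could be used to drop the flagged blank entities from each Document and save the cleaned JSON locally for review. It assumes the variables from step 3 are still in memory; `CLEANED_JSON_DIR` is an illustrative local folder, and the content-based `not in` comparison is an assumption, not part of the tool."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal cleanup sketch (illustrative, not part of the tool): remove flagged\n",
"# blank entities from each Document and write the cleaned JSON locally.\n",
"import os\n",
"\n",
"CLEANED_JSON_DIR = \"./cleaned_jsons\"  # hypothetical local output folder\n",
"os.makedirs(CLEANED_JSON_DIR, exist_ok=True)\n",
"\n",
"for filename, filepath in file_dict.items():\n",
"    flagged = file_wise_entities.get(filename, {}).get(\"Blank_ent\", [])\n",
"    if not flagged:\n",
"        continue\n",
"    doc = documentai_json_proto_downloader(input_bucket_name, filepath)\n",
"    # Proto messages compare field-wise, so identical blank entities are all dropped\n",
"    doc.entities = [e for e in doc.entities if e not in flagged]\n",
"    out_path = os.path.join(CLEANED_JSON_DIR, os.path.basename(filename))\n",
"    with open(out_path, \"w\") as out_file:\n",
"        out_file.write(documentai.Document.to_json(doc))\n",
"    print(f\"Cleaned {filename} -> {out_path}\")"
]
},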
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m104",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m104"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}