incubator-tools/cer_wer/cer_wer.ipynb (289 lines of code) (raw):

{ "cells": [
{ "attachments": {}, "cell_type": "markdown", "id": "72fd064f-24f5-4d61-b0ad-2b2f3fe9427d", "metadata": {}, "source": [ "# Character and Word Error Rate" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "f5756f1a-631f-4c8a-bba0-98c6821d31a9", "metadata": {}, "source": [ "* Author: docai-incubator@google.com\n" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "9b1d12ef-55dd-4fbd-8389-db14ed038eb1", "metadata": {}, "source": [ "## Disclaimer\n", "\n", "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied." ] },
{ "attachments": {}, "cell_type": "markdown", "id": "94527514-1ae2-470b-96e2-0f48e4aa5e81", "metadata": {}, "source": [ "## Purpose and Description" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "cdf4462a-fc0d-477f-9c39-4fec383ba4ca", "metadata": {}, "source": [ "This tool evaluates the character error rate (CER) and word error rate (WER) of the OCR text in exported Document AI JSON files against ground-truth text files.\n" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "a8783f52-627b-4efa-b5d9-664ae2ca2564", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "1. Vertex AI Notebook\n", "2. Labeled JSON files in a GCS folder\n" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "55cc5540-deb1-4449-8278-716488c54e5c", "metadata": {}, "source": [ "## Step-by-Step Procedure" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "d49f5b2d-f7fd-4403-a175-b95cc804f6ba", "metadata": {}, "source": [ "### 1. Input Details" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "397acecf-72a3-4913-b5b3-68256ac06f0f", "metadata": {}, "source": [ "* **JSONS_PATH** = \"gs://xxxx/xxxxxxxx\"\n", "* **GROUNDTRUTH_PATH** = \"gs://xxxx/xxxxx\"" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "03dff98d-ef2d-4096-9c82-ac2106333a8d", "metadata": {}, "source": [ "* **JSONS_PATH**: GCS location of the dataset exported from the processor that needs to be evaluated.\n", "* **GROUNDTRUTH_PATH**: GCS location of the ground truth: one `.txt` file per document, containing the text content of that document.\n", "\n", "Note: Each JSON file and its corresponding ground-truth file must have the same name (apart from the extension)." ] },
{ "attachments": {}, "cell_type": "markdown", "id": "a6fa4328-3d99-4de4-a5eb-0f2033d78b79", "metadata": {}, "source": [ "### 2. Run the Code" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "678ebb32-9b15-4a98-a982-55eb7137a1f9", "metadata": {}, "source": [ "Run the sample code. It evaluates the provided documents, prints the mean CER and WER, and writes the per-file results to a CSV file (`output.csv`).\n" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "b02a0f30-9f93-46bf-b871-f34b4df08894", "metadata": {}, "source": [ "### 3. Output" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "b4232146-0beb-4713-92ff-bf23606d45bf", "metadata": {}, "source": [ "<img src=\"./Images/cer_wer_output.png\" width=800 height=400 alt=\"Cer Wer Output CSV image\">" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "8b2c2216-4282-4433-9a74-cd503d874dad", "metadata": {}, "source": [ "### Sample Code" ] },
{ "cell_type": "code", "execution_count": null, "id": "53accc1c", "metadata": {}, "outputs": [], "source": [ "# Run this cell to download utilities module\n", "# !wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" ] },
{ "cell_type": "code", "execution_count": null, "id": "6dca87d7", "metadata": {}, "outputs": [], "source": [ "%pip install asrtoolkit\n", "%pip install numpy\n", "%pip install pandas\n", "%pip install google-cloud-storage" ] },
{ "cell_type": "code", "execution_count": null, "id": "f7f4c055-7020-4031-9898-7812c0172436", "metadata": {}, "outputs": [], "source": [
 "from asrtoolkit import cer as cer_2\n",
 "from asrtoolkit import wer as wer_2\n",
 "import os\n",
 "import numpy as np\n",
 "import pandas as pd\n",
 "import json\n",
 "from google.cloud import storage\n",
 "from utilities import (\n",
 "    store_document_as_json,\n",
 "    documentai_json_proto_downloader,\n",
 "    file_names,\n",
 "    blob_downloader,\n",
 ")\n",
 "\n",
 "JSONS_PATH = \"gs://xxxxxx/xxxxxxxxxx\"\n",
 "GROUNDTRUTH_PATH = \"gs://xxxxxx/xxxxx/xxxxxxx\"\n",
 "\n",
 "\"\"\"\n",
 "Each document is compared with its ground truth. The script prints the\n",
 "mean CER/WER and writes a CSV file with the CER/WER of each individual file.\n",
 "\"\"\"\n",
 "# Per-file results; converted to a DataFrame after the loop.\n",
 "rows = []\n",
 "\n",
 "json_files = file_names(JSONS_PATH)[1].values()\n",
 "json_files = [i for i in list(json_files) if i.endswith(\".json\")]\n",
 "\n",
 "groundtruth_files = file_names(GROUNDTRUTH_PATH)[1].values()\n",
 "groundtruth_files = [i for i in list(groundtruth_files) if i.endswith(\".txt\")]\n",
 "\n",
 "# Create the storage client and bucket handle once instead of once per file.\n",
 "storage_client = storage.Client()\n",
 "groundtruth_bucket = storage_client.bucket(GROUNDTRUTH_PATH.split(\"/\")[2])\n",
 "jsons_bucket_name = JSONS_PATH.split(\"/\")[2]\n",
 "\n",
 "for file in json_files:\n",
 "    file_name = file.replace(\".json\", \"\").split(\"/\")[-1]\n",
 "    # Reset for every document so a previous match is never reused.\n",
 "    groundtruth_content = \"\"\n",
 "    for groundtruth_file in groundtruth_files:\n",
 "        # Match on the exact base name; a substring test would also pair\n",
 "        # \"doc1\" with \"doc10.txt\".\n",
 "        stem = os.path.splitext(os.path.basename(groundtruth_file))[0]\n",
 "        if stem == file_name:\n",
 "            blob = groundtruth_bucket.blob(groundtruth_file)\n",
 "            groundtruth_content = blob.download_as_string().decode()\n",
 "            break\n",
 "\n",
 "    document_proto = documentai_json_proto_downloader(jsons_bucket_name, file)\n",
 "    if hasattr(document_proto, \"text\"):\n",
 "        ocr_text = document_proto.text\n",
 "    else:\n",
 "        ocr_text = \"\"\n",
 "\n",
 "    if groundtruth_content and ocr_text:\n",
 "        cer = cer_2(groundtruth_content, ocr_text)\n",
 "        wer = wer_2(groundtruth_content, ocr_text)\n",
 "\n",
 "        rows.append(\n",
 "            {\n",
 "                \"filename\": file_name,\n",
 "                \"ocr_text\": ocr_text,\n",
 "                \"groundtruth_text\": groundtruth_content,\n",
 "                \"cer\": cer,\n",
 "                \"wer\": wer,\n",
 "            }\n",
 "        )\n",
 "    else:\n",
 "        print(f'Skipping file \"{file_name}\": ground truth or OCR text is missing')\n",
 "\n",
 "# Build the DataFrame in one step rather than through the private\n",
 "# DataFrame._append method.\n",
 "df_output = pd.DataFrame(\n",
 "    rows, columns=[\"filename\", \"ocr_text\", \"groundtruth_text\", \"cer\", \"wer\"]\n",
 ")\n",
 "df_output.to_csv(\"output.csv\")\n",
 "\n",
 "# Overall performance\n",
 "mean_cer = df_output[\"cer\"].mean()\n",
 "mean_wer = df_output[\"wer\"].mean()\n",
 "print(f\"Mean CER = {round(mean_cer, 2)}%, Mean WER = {round(mean_wer, 2)}%\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "2f942754-ca9a-4c85-8803-eb54596b108f", "metadata": {}, "outputs": [], "source": [] }
],
"metadata": { "environment": { "kernel": "python3", "name": "common-cpu.m104", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cpu:m104" }, "kernelspec": { "display_name": "Python 3", "language":
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }