{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "72fd064f-24f5-4d61-b0ad-2b2f3fe9427d",
"metadata": {},
"source": [
"# Character and Word Error Rate"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f5756f1a-631f-4c8a-bba0-98c6821d31a9",
"metadata": {},
"source": [
"* Author: docai-incubator@google.com\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9b1d12ef-55dd-4fbd-8389-db14ed038eb1",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "94527514-1ae2-470b-96e2-0f48e4aa5e81",
"metadata": {},
"source": [
"## Purpose and Description"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cdf4462a-fc0d-477f-9c39-4fec383ba4ca",
"metadata": {},
"source": [
"This tool evaluates the character error rate (CER) and word error rate (WER) of Document AI OCR output against ground-truth text files.\n"
]
},
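{
"attachments": {},
"cell_type": "markdown",
"id": "cer-wer-formula-note",
"metadata": {},
"source": [
"For reference, both metrics are the usual edit-distance ratios against the ground truth (these are the standard definitions, not something specific to this tool):\n",
"\n",
"$$\\text{CER} = \\frac{S_c + D_c + I_c}{N_c} \\qquad \\text{WER} = \\frac{S_w + D_w + I_w}{N_w}$$\n",
"\n",
"where $S$, $D$ and $I$ are the character/word substitutions, deletions and insertions needed to turn the ground-truth text into the OCR text, and $N$ is the total number of characters ($c$) or words ($w$) in the ground truth. The sample code below reports both values as percentages."
]
},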
{
"attachments": {},
"cell_type": "markdown",
"id": "a8783f52-627b-4efa-b5d9-664ae2ca2564",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"1. Vertex AI Notebook\n",
"2. Parsed JSON files (exported from the processor) in a GCS folder\n",
"3. Ground-truth `.txt` files in a GCS folder\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "55cc5540-deb1-4449-8278-716488c54e5c",
"metadata": {},
"source": [
"## Step-by-Step Procedure"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d49f5b2d-f7fd-4403-a175-b95cc804f6ba",
"metadata": {},
"source": [
"### 1. Input Details"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "397acecf-72a3-4913-b5b3-68256ac06f0f",
"metadata": {},
"source": [
"* **JSONS_PATH** = \"gs://xxxx/xxxxxxxx\"\n",
"* **GROUNDTRUTH_PATH** = \"gs://xxxx/xxxxx\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "03dff98d-ef2d-4096-9c82-ac2106333a8d",
"metadata": {},
"source": [
"* **JSONS_PATH**: GCS path of the parsed JSON dataset exported from the processor that needs to be evaluated.\n",
"* **GROUNDTRUTH_PATH**: GCS path of the ground-truth `.txt` files, each containing the expected text content of one document.\n",
"\n",
"Note: Each JSON file and its corresponding ground-truth text file must share the same base name (for example, `invoice_1.json` pairs with `invoice_1.txt`). A quick pairing check is sketched in the optional cell below."
]
},
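{
"attachments": {},
"cell_type": "markdown",
"id": "gt-pairing-check-md",
"metadata": {},
"source": [
"Before running the evaluation, you can optionally confirm that every exported JSON has a matching ground-truth `.txt`. The cell below is a minimal sketch of such a check: it lists the blobs under both prefixes with the `google-cloud-storage` client (installed in the Sample Code section below and typically already available on Vertex AI notebooks) and compares base filenames only. The `list_base_names` helper and the placeholder paths are illustrative; reuse your own `JSONS_PATH`/`GROUNDTRUTH_PATH` values."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "gt-pairing-check",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): confirm every exported JSON has a ground-truth\n",
"# .txt file with the same base name under GROUNDTRUTH_PATH.\n",
"from google.cloud import storage\n",
"\n",
"JSONS_PATH = \"gs://xxxxxx/xxxxxxxxxx\"  # placeholder, same as in the sample code\n",
"GROUNDTRUTH_PATH = \"gs://xxxxxx/xxxxx/xxxxxxx\"  # placeholder, same as in the sample code\n",
"\n",
"\n",
"def list_base_names(gcs_path: str, suffix: str) -> set:\n",
"    \"\"\"Return the base filenames (without extension) found under a gs:// prefix.\"\"\"\n",
"    bucket_name, _, prefix = gcs_path.replace(\"gs://\", \"\").partition(\"/\")\n",
"    client = storage.Client()\n",
"    return {\n",
"        blob.name.split(\"/\")[-1].rsplit(\".\", 1)[0]\n",
"        for blob in client.list_blobs(bucket_name, prefix=prefix)\n",
"        if blob.name.endswith(suffix)\n",
"    }\n",
"\n",
"\n",
"missing = list_base_names(JSONS_PATH, \".json\") - list_base_names(GROUNDTRUTH_PATH, \".txt\")\n",
"if missing:\n",
"    print(\"JSON files without a matching ground-truth .txt:\", sorted(missing))\n",
"else:\n",
"    print(\"All JSON files have a matching ground-truth file.\")"
]
},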
{
"attachments": {},
"cell_type": "markdown",
"id": "a6fa4328-3d99-4de4-a5eb-0f2033d78b79",
"metadata": {},
"source": [
"### 2. Run the Code"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "678ebb32-9b15-4a98-a982-55eb7137a1f9",
"metadata": {},
"source": [
"Run the sample code below. It evaluates every document against its ground-truth file, prints the mean CER and WER, and writes the per-file results to `output.csv`.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b02a0f30-9f93-46bf-b871-f34b4df08894",
"metadata": {},
"source": [
"### 3. Output"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b4232146-0beb-4713-92ff-bf23606d45bf",
"metadata": {},
"source": [
"<img src=\"./Images/cer_wer_output.png\" width=800 height=400 alt=\"Cer Wer Output CSV image\">"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "8b2c2216-4282-4433-9a74-cd503d874dad",
"metadata": {},
"source": [
"### Sample Code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53accc1c",
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to download utilities module\n",
"# !wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dca87d7",
"metadata": {},
"outputs": [],
"source": [
"%pip install asrtoolkit\n",
"%pip install numpy\n",
"%pip install pandas\n",
"%pip install google-cloud-storage"
]
},
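{
"attachments": {},
"cell_type": "markdown",
"id": "metric-sanity-md",
"metadata": {},
"source": [
"If you first want to see how the `asrtoolkit` metrics behave, the optional cell below is a small sanity check on two made-up strings. The reference (ground-truth) text is passed first and the hypothesis (OCR) text second, matching the argument order used in the main code cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "metric-sanity-check",
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check of the asrtoolkit metrics on toy strings (illustrative only).\n",
"from asrtoolkit import cer, wer\n",
"\n",
"reference = \"the quick brown fox jumps over the lazy dog\"\n",
"hypothesis = \"the quick brown fax jumps over the lazy dog\"\n",
"\n",
"# Reference (ground truth) first, hypothesis (OCR text) second.\n",
"print(\"CER =\", cer(reference, hypothesis))\n",
"print(\"WER =\", wer(reference, hypothesis))"
]
},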
{
"cell_type": "code",
"execution_count": null,
"id": "f7f4c055-7020-4031-9898-7812c0172436",
"metadata": {},
"outputs": [],
"source": [
"# Metric functions are aliased so they are not shadowed by the per-file cer/wer values below.\n",
"from asrtoolkit import cer as cer_2\n",
"from asrtoolkit import wer as wer_2\n",
"import pandas as pd\n",
"from google.cloud import storage\n",
"from utilities import (\n",
"    documentai_json_proto_downloader,\n",
"    file_names,\n",
")\n",
"\n",
"JSONS_PATH = \"gs://xxxxxx/xxxxxxxxxx\"\n",
"GROUNDTRUTH_PATH = \"gs://xxxxxx/xxxxx/xxxxxxx\"\n",
"\n",
"# Each Document AI JSON is compared with its ground-truth text file; the script\n",
"# prints the mean CER/WER and writes a CSV with the per-file CER/WER values.\n",
"\n",
"# Per-file results collected here and turned into a DataFrame after the loop.\n",
"rows = []\n",
"\n",
"json_files = file_names(JSONS_PATH)[1].values()\n",
"json_files = [i for i in list(json_files) if i.endswith(\".json\")]\n",
"\n",
"groundtruth_files = file_names(GROUNDTRUTH_PATH)[1].values()\n",
"groundtruth_files = [i for i in list(groundtruth_files) if i.endswith(\".txt\")]\n",
"\n",
"# Create the storage client and bucket handles once, outside the loop.\n",
"storage_client = storage.Client()\n",
"groundtruth_bucket = storage_client.bucket(GROUNDTRUTH_PATH.split(\"/\")[2])\n",
"jsons_bucket_name = JSONS_PATH.split(\"/\")[2]\n",
"\n",
"for file in json_files:\n",
"    file_name = file.replace(\".json\", \"\").split(\"/\")[-1]\n",
"    # Find the ground-truth .txt whose name matches this JSON file.\n",
"    for groundtruth_file in groundtruth_files:\n",
"        if file_name in groundtruth_file:\n",
"            blob = groundtruth_bucket.blob(groundtruth_file)\n",
"            groundtruth_content = blob.download_as_bytes().decode()\n",
"            break\n",
"    else:\n",
"        groundtruth_content = \"\"\n",
"    # Download the Document AI JSON and read its full OCR text.\n",
"    document_proto = documentai_json_proto_downloader(jsons_bucket_name, file)\n",
"    ocr_text = document_proto.text if hasattr(document_proto, \"text\") else \"\"\n",
"    if groundtruth_content and ocr_text:\n",
"        cer = cer_2(groundtruth_content, ocr_text)\n",
"        wer = wer_2(groundtruth_content, ocr_text)\n",
"        rows.append(\n",
"            {\n",
"                \"filename\": file_name,\n",
"                \"ocr_text\": ocr_text,\n",
"                \"groundtruth_text\": groundtruth_content,\n",
"                \"cer\": cer,\n",
"                \"wer\": wer,\n",
"            }\n",
"        )\n",
"    else:\n",
"        print(f'Skipping file \"{file_name}\": ground truth or JSON file is missing.')\n",
"\n",
"df_output = pd.DataFrame(\n",
"    rows, columns=[\"filename\", \"ocr_text\", \"groundtruth_text\", \"cer\", \"wer\"]\n",
")\n",
"df_output.to_csv(\"output.csv\")\n",
"\n",
"# Overall performance\n",
"mean_cer = df_output[\"cer\"].mean()\n",
"mean_wer = df_output[\"wer\"].mean()\n",
"print(f\"Mean CER = {round(mean_cer, 2)}%, Mean WER = {round(mean_wer, 2)}%\")"
]
},
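{
"attachments": {},
"cell_type": "markdown",
"id": "inspect-output-md",
"metadata": {},
"source": [
"Optionally, the generated `output.csv` can be reloaded to list the documents with the highest CER, which helps to spot problem files. The cell below is a minimal sketch of that."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "inspect-output",
"metadata": {},
"outputs": [],
"source": [
"# Optional: reload the results and show the documents with the highest CER.\n",
"# output.csv is produced by the sample code above.\n",
"import pandas as pd\n",
"\n",
"results = pd.read_csv(\"output.csv\", index_col=0)\n",
"print(results[[\"filename\", \"cer\", \"wer\"]].sort_values(\"cer\", ascending=False).head(10))"
]
},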
{
"cell_type": "code",
"execution_count": null,
"id": "2f942754-ca9a-4c85-8803-eb54596b108f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m104",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m104"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}