# Character and Word Error Rate

* Author: docai-incubator@google.com


## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description

This tool evaluates OCR quality by computing the character error rate (CER) and word error rate (WER) of the text extracted by a processor against ground-truth text.
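As a rough illustration of what the two metrics measure, the sketch below computes both from a plain Levenshtein edit distance: CER over characters and WER over whitespace-separated words, each as a percentage of the reference length. The names `edit_distance`, `simple_cer`, and `simple_wer` are illustrative and not part of the tool. The tool itself relies on `asrtoolkit`, which may normalize text differently, so its numbers can differ from this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def simple_cer(reference, hypothesis):
    """Character error rate: units are single characters."""
    return error_rate(reference, hypothesis)


def simple_wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())
```

For example, a single substituted character in an 11-character reference gives a CER of about 9.09%, while one wrong word out of four gives a WER of 25%.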


## Prerequisites

1. Vertex AI Notebook
2. Labeled JSON files in a GCS folder


## Step-by-Step Procedure

### 1. Input Details

* **JSONS_PATH** = "gs://xxxx/xxxxxxxx"
* **GROUNDTRUTH_PATH** = "gs://xxxx/xxxxx"

* **JSONS_PATH**: GCS path of the dataset (JSON files) exported from the processor that is to be evaluated.
* **GROUNDTRUTH_PATH**: GCS path of the ground truth, that is, plain-text (`.txt`) files containing the content of each document.
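Both inputs are `gs://` URIs, and the sample code derives the bucket name from them with `split("/")[2]`. The sketch below shows one way to split such a URI into bucket and prefix with basic validation; `split_gcs_path` is a hypothetical helper, not something provided by the tool or by `utilities.py`.

```python
def split_gcs_path(gcs_path):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(f"not a GCS path: {gcs_path!r}")
    # partition() leaves the prefix empty when only a bucket is given.
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {gcs_path!r}")
    return bucket, prefix
```

Validating the inputs up front gives a clearer error than an `IndexError` later in the run if a path is mistyped.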

Note: Each JSON file and its corresponding ground-truth file must have the same name, apart from the extension (`.json` and `.txt`).
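The naming rule can be checked before running the evaluation. The following sketch pairs files by their base name without extension and reports JSON files that have no ground truth; `pair_by_stem` and the example paths are hypothetical and only illustrate the rule.

```python
import posixpath


def pair_by_stem(json_paths, groundtruth_paths):
    """Return ({stem: (json_path, txt_path)}, [json paths without ground truth])."""

    def stem(path):
        # "exports/invoice_1.json" -> "invoice_1"
        return posixpath.splitext(posixpath.basename(path))[0]

    txt_by_stem = {stem(p): p for p in groundtruth_paths if p.endswith(".txt")}
    pairs, unmatched = {}, []
    for path in json_paths:
        if not path.endswith(".json"):
            continue
        match = txt_by_stem.get(stem(path))
        if match is None:
            unmatched.append(path)
        else:
            pairs[stem(path)] = (path, match)
    return pairs, unmatched
```

Matching on the exact base name avoids accidental pairings such as `invoice_1.json` with `invoice_10.txt`, which a substring comparison would allow.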

### 2. Run the Code

Run the sample code. It compares each document with its ground truth, writes the per-file CER and WER to a CSV file (`output.csv`), and prints the mean CER and WER across all evaluated documents.


### 3. Output

<img src="./Images/cer_wer_output.png" width=800 height=400 alt="Cer Wer Output CSV image">
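Once the CSV exists, it can be summarized without rerunning the evaluation. The sketch below uses only the standard library and assumes the `filename`, `cer`, and `wer` columns produced by the sample code; the `summarize` helper and the in-memory sample rows are made up for illustration and are not real results.

```python
import csv
import io
from statistics import mean


def summarize(csv_file, worst_n=3):
    """Return mean CER/WER and the files with the highest CER, or None if empty."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return None
    worst = sorted(rows, key=lambda r: float(r["cer"]), reverse=True)[:worst_n]
    return {
        "mean_cer": mean(float(r["cer"]) for r in rows),
        "mean_wer": mean(float(r["wer"]) for r in rows),
        "worst_by_cer": [r["filename"] for r in worst],
    }


# Made-up sample standing in for output.csv.
sample = io.StringIO(
    "filename,cer,wer\n"
    "doc_a,2.0,5.0\n"
    "doc_b,4.0,15.0\n"
    "doc_c,0.0,0.0\n"
)
summary = summarize(sample, worst_n=2)
print(summary)
```

For the real file, something like `with open("output.csv", newline="") as f: summarize(f)` should work; opening with `newline=""` lets the `csv` module handle the multi-line `ocr_text` and `groundtruth_text` fields.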

### Sample Code

In [None]:
# Run this cell to download utilities module
# !wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
%pip install asrtoolkit
%pip install numpy
%pip install pandas
%pip install google-cloud-storage

In [None]:
from asrtoolkit import cer as cer_2
from asrtoolkit import wer as wer_2
import pandas as pd
from google.cloud import storage
from utilities import (
    documentai_json_proto_downloader,
    file_names,
)

JSONS_PATH = "gs://xxxxxx/xxxxxxxxxx"
GROUNDTRUTH_PATH = "gs://xxxxxx/xxxxx/xxxxxxx"

# Compare each exported document with its ground truth, write the per-file
# CER/WER to output.csv, and report the mean CER/WER across all files.
json_files = [
    path for path in file_names(JSONS_PATH)[1].values() if path.endswith(".json")
]
groundtruth_files = [
    path for path in file_names(GROUNDTRUTH_PATH)[1].values() if path.endswith(".txt")
]
# Map each ground-truth base name (without ".txt") to its blob path so that
# files are paired by exact name rather than by substring.
groundtruth_by_name = {
    path.split("/")[-1][: -len(".txt")]: path for path in groundtruth_files
}

# Create the storage client and bucket handle once, outside the loop.
json_bucket_name = JSONS_PATH.split("/")[2]
groundtruth_bucket = storage.Client().bucket(GROUNDTRUTH_PATH.split("/")[2])

rows = []
for file in json_files:
    file_name = file.split("/")[-1][: -len(".json")]

    # Download the matching ground truth, if there is one.
    groundtruth_content = ""
    groundtruth_file = groundtruth_by_name.get(file_name)
    if groundtruth_file:
        blob = groundtruth_bucket.blob(groundtruth_file)
        groundtruth_content = blob.download_as_text()

    # Load the exported document and take its full OCR text.
    document_proto = documentai_json_proto_downloader(json_bucket_name, file)
    ocr_text = getattr(document_proto, "text", "")

    if groundtruth_content and ocr_text:
        rows.append(
            {
                "filename": file_name,
                "ocr_text": ocr_text,
                "groundtruth_text": groundtruth_content,
                "cer": cer_2(groundtruth_content, ocr_text),
                "wer": wer_2(groundtruth_content, ocr_text),
            }
        )
    else:
        print(f'Skipping "{file_name}": ground truth or OCR text is missing or empty')

# Build the DataFrame once instead of appending row by row.
df_output = pd.DataFrame(
    rows, columns=["filename", "ocr_text", "groundtruth_text", "cer", "wer"]
)
df_output.to_csv("output.csv")

# Overall performances
mean_cer = df_output["cer"].mean()
mean_wer = df_output["wer"].mean()
print(f"Mean CER = {round(mean_cer,2)}%, Mean WER = {round(mean_wer,2)}%")