# AutoML - Train "the best" NLP NER model for a named entity recognition dataset.

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](../../../resources/workspace/workspace.ipynb) 
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](../../../README.md) - check the getting started section

Named entity recognition (NER) is a sub-task of information extraction (IE) that locates and categorizes specified entities in a body or bodies of text. NER is also known as entity identification, entity chunking, or entity extraction.

This notebook uses the AutoML NLP NER task to train a model on prepared datasets derived from the CoNLL-2003 dataset, introduced by Sang et al. in [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition](https://paperswithcode.com/paper/introduction-to-the-conll-2003-shared-task). A derived version is also available on Kaggle: [CoNLL003 (English-version)](https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?select=valid.txt).

CoNLL-2003 is a named entity recognition dataset released as a part of CoNLL-2003 shared task: language-independent named entity recognition.
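
To make the task concrete, here is a minimal sketch of how token-level NER labels map to entity spans. It assumes BIO-style tags (`B-` starts an entity, `I-` continues it, `O` marks tokens outside any entity) and a hand-written example sentence. The helper name `extract_entities` is hypothetical and not part of AutoML.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    words, label = [], None  # the entity currently being built, if any
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and tag[2:] == label
        if continues:
            words.append(token)
            continue
        # Any other tag closes the entity currently being built.
        if words:
            entities.append((" ".join(words), label))
        if tag.startswith("B-"):
            words, label = [token], tag[2:]
        else:
            # "O", or an I- tag that does not continue an open entity.
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written example, not taken from the dataset files.
tokens = ["Peter", "Blackburn", "visited", "Brussels", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(extract_entities(tokens, tags))
```

A trained NER model predicts one such tag per token; grouping the tags as above recovers the entities a reader would recognize.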

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import automl, Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import ResourceConfiguration

from pprint import pprint

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need identifier parameters: a subscription ID, resource group, and workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use [default Azure authentication](https://docs.microsoft.com/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

# 2. Data

This model training uses the datasets from Kaggle [CoNLL003 (English-version)](https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?select=valid.txt), in particular the following files for training and validation:

- Training dataset file (train.txt)
- Validation dataset file (valid.txt)

Both files are placed within their related MLTable folder.

Please make use of the MLTable files present in separate folders at the same location (in the repo) as this notebook.
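
Before uploading, it can help to sanity-check a data file. The sketch below assumes a CoNLL-style layout: one token per line with the label as the last whitespace-separated field, and blank lines (or `-DOCSTART-` markers) between sentences. That layout is an assumption about `train.txt` and `valid.txt`, so verify it against your own files. The helper name `summarize_conll` is hypothetical.

```python
from collections import Counter


def summarize_conll(lines):
    """Return (sentence_count, label_counts) for CoNLL-style lines."""
    sentences = 0
    in_sentence = False
    labels = Counter()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A separator closes the sentence being read, if any.
            if in_sentence:
                sentences += 1
            in_sentence = False
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"Expected 'token ... label', got: {line!r}")
        labels[fields[-1]] += 1  # the label is assumed to be the last field
        in_sentence = True
    if in_sentence:
        sentences += 1
    return sentences, labels


# Hand-written sample; for a real file, pass an open file object instead, e.g.
# with open("./training-mltable-folder/train.txt", encoding="utf-8") as f:
#     print(summarize_conll(f))
sample = ["EU B-ORG", "rejects O", "German B-MISC", "call O", "", "Peter B-PER", "Blackburn I-PER"]
sentence_count, label_counts = summarize_conll(sample)
print(sentence_count, dict(label_counts))
```

A heavily skewed label distribution or an unexpected label name at this stage is usually easier to fix than after a training job has run.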

In [None]:
# MLTable folders
training_mltable_path = "./training-mltable-folder/"
validation_mltable_path = "./validation-mltable-folder/"

# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(type=AssetTypes.MLTABLE, path=training_mltable_path)

# Validation MLTable defined locally, with local data to be uploaded
my_validation_data_input = Input(type=AssetTypes.MLTABLE, path=validation_mltable_path)

# WITH REMOTE PATH: If available already in the cloud/workspace-blob-store
# my_training_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_training_mltable")
# my_validation_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_validation_mltable")

For documentation on creating your own MLTable assets for jobs beyond this notebook:
- https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-mltable details how to write MLTable YAMLs (required for each MLTable asset).
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK covers how to work with them in the v2 CLI/SDK.

## 2.1 Configure and run the AutoML NLP Text NER training job
In this section we will configure and run the AutoML job, for training the model.    

In [None]:
# Create the AutoML job with the related factory-function.

exp_name = "dpv2-nlp-text-ner-experiment"
exp_timeout = 60
text_ner_job = automl.text_ner(
    # name="dpv2-nlp-text-ner-job-01",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    tags={"my_custom_tag": "My custom value"},
)

text_ner_job.set_limits(timeout_minutes=exp_timeout)
text_ner_job.resources = ResourceConfiguration(instance_type="Standard_NC6s_v3")

## 2.2 Run the AutoML job
Using the `MLClient` created earlier, we will now submit this AutoML job to the workspace.

In [None]:
# Submit the AutoML job

returned_job = ml_client.jobs.create_or_update(
    text_ner_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

In [None]:
ml_client.jobs.stream(returned_job.name)
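
`jobs.stream` blocks while it prints the job's logs. If you would rather check in periodically, the sketch below polls the job status instead. It assumes that `ml_client.jobs.get(name).status` reports `"Completed"`, `"Failed"`, or `"Canceled"` once the job has ended; the helper `wait_for_job` and its parameters are hypothetical, not part of the SDK.

```python
import time

# Statuses treated as final in this sketch (an assumption; adjust if needed).
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(ml_client, job_name, poll_seconds=60, max_polls=None):
    """Poll a job until it reaches a terminal status or max_polls is hit."""
    polls = 0
    while True:
        status = ml_client.jobs.get(job_name).status
        polls += 1
        if status in TERMINAL_STATUSES:
            return status
        if max_polls is not None and polls >= max_polls:
            return status  # give up and report the last status seen
        time.sleep(poll_seconds)


# Usage with the objects created above (requires a workspace connection):
# final_status = wait_for_job(ml_client, returned_job.name)
# print(f"Job finished with status: {final_status}")
```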

## 2.3 Runs with models from Hugging Face (Preview)

In addition to the model algorithms supported natively by AutoML, you can launch individual runs to explore any model algorithm from the Hugging Face transformers library that supports token classification. Please refer to this [documentation](https://huggingface.co/models?pipeline_tag=token-classification&library=transformers&sort=trending) for the list of models.

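If you wish to try a model algorithm (say microsoft/xdoc-base-funsd), set it through the `model_name` training parameter of your AutoML NLP job as follows. Note that the cell below passes `roberta-base-openai-detector`; substitute the model you want to try.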

In [None]:
# Compute target setup

from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

compute_name = "gpu-cluster-nc6s-v3"

try:
    _ = ml_client.compute.get(compute_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="Standard_NC6s_v3",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=4,
    )
    ml_client.begin_create_or_update(compute_config).result()

In [None]:
# Create the AutoML job with the related factory-function.

text_ner_hf_job = automl.text_ner(
    experiment_name=exp_name,
    compute=compute_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    tags={"my_custom_tag": "My custom value"},
)

text_ner_hf_job.set_limits(timeout_minutes=exp_timeout)
text_ner_hf_job.set_training_parameters(model_name="roberta-base-openai-detector")

In [None]:
# Submit the AutoML job

returned_hf_job = ml_client.jobs.create_or_update(
    text_ner_hf_job
)  # submit the job to the backend

print(f"Created job: {returned_hf_job}")

In [None]:
ml_client.jobs.stream(returned_hf_job.name)

## 2.4 Hyperparameter Sweep Runs (Public Preview)

AutoML allows you to easily train models for Named Entity Recognition on your text data. You can control the model algorithm to be used, specify hyperparameter values for your model, as well as perform a sweep across the hyperparameter space to generate an optimal model.

When using AutoML for text tasks, you can specify the model algorithm using the `model_name` parameter. You can either specify a single model or choose to sweep over multiple models. Please refer to the [sweep notebook](https://github.com/Azure/azureml-examples/blob/48957c70bd53912077e81a180f424f650b414107/sdk/python/jobs/automl-standalone-jobs/automl-nlp-text-named-entity-recognition-task-distributed-sweeping/automl-nlp-text-ner-task-distributed-with-sweeping.ipynb) for detailed instructions on configuring and submitting a sweep job.

# 3. Retrieve Model Information from the Best Trial of the Model

Once all the trials complete training, we can retrieve the best model and deploy it.

In [None]:
# Obtain best child run id
returned_nlp_job = ml_client.jobs.get(name=returned_job.name)
best_child_run_id = returned_nlp_job.tags["automl_best_child_run_id"]
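
If the job has not finished or did not succeed, the `automl_best_child_run_id` tag may be missing and the lookup above raises a bare `KeyError`. The sketch below wraps the lookup so the error says what went wrong. The helper `get_best_child_run_id` is hypothetical, and the stand-in job object exists only so the example runs without a workspace.

```python
from types import SimpleNamespace

BEST_CHILD_TAG = "automl_best_child_run_id"


def get_best_child_run_id(job):
    """Return the best child run id from a job's tags, or raise a clear error."""
    tags = getattr(job, "tags", None) or {}
    if BEST_CHILD_TAG not in tags:
        status = getattr(job, "status", "unknown")
        raise RuntimeError(
            f"Tag '{BEST_CHILD_TAG}' not found (job status: {status}). "
            "Has the job completed successfully?"
        )
    return tags[BEST_CHILD_TAG]


# Stand-in object for illustration; with a real job use:
# best_child_run_id = get_best_child_run_id(ml_client.jobs.get(name=returned_job.name))
fake_job = SimpleNamespace(tags={BEST_CHILD_TAG: "example-run-id_1"}, status="Completed")
print(get_best_child_run_id(fake_job))
```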

### Initialize MLflow Client
Use the MLflow interface (`MlflowClient`) to access the results and other information (such as models, artifacts, and metrics) of a previously completed AutoML trial.

Initialize the MLflow client here and set Azure ML as its tracking backend.

*IMPORTANT*: you need to have the latest MLflow packages installed:

`pip install azureml-mlflow`

`pip install mlflow`

### Obtain the tracking URI for MLflow

In [None]:
import mlflow

# Obtain the tracking URI from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

In [None]:
# Set the MLFLOW TRACKING URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

In [None]:
from mlflow.tracking.client import MlflowClient
from mlflow.artifacts import download_artifacts

# Initialize MLflow client
mlflow_client = MlflowClient()

### Get the AutoML parent Job

In [None]:
job_name = returned_job.name

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

print("Parent Run: ")
print(mlflow_parent_run)

### Get the AutoML best child run

In [None]:
# Get the best model's child run
best_run = mlflow_client.get_run(best_child_run_id)

print("Best child run: ")
print(best_run)
# note: best model's child run id can also be retrieved through:
# best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]

In [None]:
best_run.data.metrics
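
`best_run.data.metrics` is a plain dictionary mapping metric names to numeric values, so it can be printed in a more readable form. The following is a small sketch; `format_metrics` is a hypothetical helper, and the example values are made up rather than taken from a real run.

```python
def format_metrics(metrics):
    """Render a {name: value} mapping as aligned text, sorted by metric name."""
    width = max((len(name) for name in metrics), default=0)
    return "\n".join(
        f"{name:<{width}}  {value:.4f}" for name, value in sorted(metrics.items())
    )


# Made-up values for illustration only; with a real run use:
# print(format_metrics(best_run.data.metrics))
example_metrics = {"metric_b": 0.5, "metric_a": 0.25}
print(format_metrics(example_metrics))
```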

### Download the best model locally

Access the results (such as Models, Artifacts, Metrics) of a previously completed AutoML Run.

In [None]:
import os

# Create local folder
local_dir = "./artifact_downloads"
if not os.path.exists(local_dir):
    os.mkdir(local_dir)
# Download run's artifacts/outputs
local_path = download_artifacts(
    run_id=best_run.info.run_id, artifact_path="outputs", dst_path=local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

In [None]:
# Show the contents of the MLflow model folder
os.listdir("./artifact_downloads/outputs/mlflow-model")
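
`os.listdir` shows only the top level of the folder. To see everything that was downloaded, including nested files and their sizes, a recursive listing such as the sketch below can be used. The helper `list_files` is hypothetical, and the folder path is the one used in the cells above.

```python
import os


def list_files(root):
    """Return sorted (relative_path, size_in_bytes) pairs for every file under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            found.append((os.path.relpath(full_path, root), os.path.getsize(full_path)))
    return sorted(found)


model_dir = "./artifact_downloads/outputs/mlflow-model"
if os.path.isdir(model_dir):
    for relative_path, size in list_files(model_dir):
        print(f"{size:>12,d}  {relative_path}")
else:
    print(f"{model_dir} not found; run the download cell first.")
```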

# 4. Model Deployment and Inference

## 4.1 Create managed online endpoint

In [None]:
# import required libraries
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
    ProbeSettings,
)

# Creating a unique endpoint name with current datetime to avoid conflicts
import datetime

online_endpoint_name = "nlp-ner" + datetime.datetime.now().strftime("%m%d%H%M%f")

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="ner endpoint",
    auth_mode="key",
    tags={"foo": "bar"},
)
print(online_endpoint_name)

In [None]:
ml_client.begin_create_or_update(endpoint).result()

## 4.2 Register best model and deploy
### Register Model

In [None]:
# Register the best model directly from the best child run's output artifacts

model_name = "nlp-ner-model"
model = Model(
    path=f"azureml://jobs/{best_run.info.run_id}/outputs/artifacts/outputs/mlflow-model/",
    name=model_name,
    description="my sample nlp NER task",
    type=AssetTypes.MLFLOW_MODEL,
)
# Alternatively, register the model from the locally downloaded files
# model = Model(
#     path="artifact_downloads/outputs/mlflow-model/",
#     name=model_name,
#     description="",
#     type=AssetTypes.MLFLOW_MODEL,
# )
registered_model = ml_client.models.create_or_update(model)

In [None]:
registered_model.id

### Deploy

List of SKUs supported for Azure Machine Learning managed online endpoints: https://docs.microsoft.com/azure/machine-learning/reference-managed-online-endpoints-vm-sku-list

In [None]:
deployment = ManagedOnlineDeployment(
    name="ner-deploy",
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type="Standard_DS4_v2",
    instance_count=1,
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        timeout=2,
        period=10,
        initial_delay=2000,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=10,
        success_threshold=1,
        timeout=10,
        period=10,
        initial_delay=2000,
    ),
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

In [None]:
# deployment to take 100% traffic
endpoint.traffic = {"ner-deploy": 100}
ml_client.begin_create_or_update(endpoint).result()

### Get endpoint details

In [None]:
# Get the details for online endpoint
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

# existing traffic details
print(endpoint.traffic)

# Get the scoring URI
print(endpoint.scoring_uri)

## 4.3 Test Deployment

In [None]:
import json

request_json = {"input_data": ["None"]}
request_file_name = "sample_request_data.json"
with open(request_file_name, "w") as request_file:
    json.dump(request_json, request_file)

In [None]:
resp = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name=deployment.name,
    request_file=request_file_name,
)
resp
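
The endpoint can also be called over plain HTTPS using the scoring URI printed earlier. The sketch below only builds the request, assuming key-based auth (`auth_mode="key"`, as configured above) with the key sent as a bearer token. The helper `build_scoring_request`, the placeholder URL, and the placeholder key are hypothetical; the payload reuses the `input_data` shape from the request file.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, payload):
    """Build (but do not send) a POST request for an online endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder values so the sketch runs offline.
example_request = build_scoring_request(
    "https://example.invalid/score", "<ENDPOINT_KEY>", {"input_data": ["None"]}
)
print(example_request.get_method(), example_request.full_url)

# With a live endpoint (requires a workspace connection):
# keys = ml_client.online_endpoints.get_keys(name=online_endpoint_name)
# request = build_scoring_request(endpoint.scoring_uri, keys.primary_key, request_json)
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```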

### Delete the endpoint and its deployment

In [None]:
ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()

# Next Steps
You can see further examples of other AutoML tasks such as Regression, Image-Object-Detection, Time-Series-Forecasting, etc.