# AutoML: Train "the best" NLP NER model for the CoNLL 2003 dataset.

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](../../../resources/workspace/workspace.ipynb) 
- A Compute Cluster. [Check this notebook to create a compute cluster](../../../resources/compute/compute.ipynb)
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](../../../README.md) - check the getting started section
- Installed azure-identity package


**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create an `AutoML Text Named Entity Recognition Training Job` with the `text_ner()` factory function
- Specify custom models and hyperparameters to sweep over during training ***(Public Preview)*** 
- Leverage multi-node distribution to accelerate large model training
- Obtain the model and score predictions with it

Named entity recognition (NER) is a sub-task of information extraction (IE) that seeks out and categorizes specified entities in a body or bodies of texts. NER is also known simply as entity identification, entity chunking and entity extraction.
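
To make the task concrete, here is a minimal sketch of how token-level labels map to entity spans. It assumes the BIO tagging convention (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The sentence labels are hand-written for illustration, and `extract_entities` is a hypothetical helper, not part of any SDK.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, label in zip(tokens, labels):
        # An I- tag extends the open entity only if the types match.
        if label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
            continue

        # Any other label closes the entity being built, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

        # A B- tag opens a new entity; O tags and stray I- tags are skipped.
        if label.startswith("B-"):
            current_tokens, current_type = [token], label[2:]

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hand-labelled example (illustrative labels, not model output).
tokens = ["The", "European", "Commission", "made", "a", "ruling", "on", "Friday"]
labels = ["O", "B-ORG", "I-ORG", "O", "O", "O", "O", "O"]
entities = extract_entities(tokens, labels)
print(entities)  # [('European Commission', 'ORG')]
```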

This notebook trains a model using prepared datasets derived from the CoNLL-2003 dataset, introduced by Tjong Kim Sang and De Meulder in [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition](https://paperswithcode.com/paper/introduction-to-the-conll-2003-shared-task). The derived version is available on Kaggle: [CoNLL003 (English-version)](https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?select=valid.txt). Below, we go over how to use AutoML to train a Text NER model. We use the CoNLL dataset for training, demonstrate how to sweep over models to find the best-performing one for the task, and deploy the resulting model for inference.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, NlpLearningRateScheduler
from azure.ai.ml.automl import SearchSpace
from azure.ai.ml.sweep import Choice, Uniform, BanditPolicy

from azure.ai.ml import automl

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the [default azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.
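
`MLClient.from_config` reads the workspace identifiers from a `config.json` file. As a rough sketch, the snippet below writes a file with the three keys it expects. The helper name `write_workspace_config` and the file name `config.example.json` are made up for this example, and the placeholder values must be replaced with your own.

```python
import json
from pathlib import Path


def write_workspace_config(path, subscription_id, resource_group, workspace_name):
    """Write a workspace config file with the three workspace identifiers."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(path)
    path.write_text(json.dumps(config, indent=2))
    return path


# Placeholders: replace with the details of your own workspace, then save
# the result as config.json so that MLClient.from_config can pick it up.
config_path = write_workspace_config(
    "config.example.json",
    "<SUBSCRIPTION_ID>",
    "<RESOURCE_GROUP>",
    "<AML_WORKSPACE_NAME>",
)
print(config_path.read_text())
```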

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

### Show Azure ML Workspace information

In [None]:
workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
output

# 2. Data

This model training uses the datasets from Kaggle [CoNLL003 (English-version)](https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?select=valid.txt), in particular the following files for training and validation:

- Training dataset file (train.txt)
- Validation dataset file (valid.txt)

Each file is placed within its own MLTable folder.

Use the MLTable folders provided in the repo, at the same location as this notebook.
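
Before uploading, it can help to check that a data file is well formed. The sketch below assumes the two-column `token label` layout (one token per line, a single space before the label, a blank line between sentences). Verify this against the files in the MLTable folders, since the layout is an assumption here and `parse_conll` is a hypothetical helper.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, label) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split(" ")
        if len(parts) != 2:
            raise ValueError(
                f"line {line_number}: expected 'token label', got {line!r}"
            )
        current.append((parts[0], parts[1]))
    # Keep the last sentence if the text does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Tiny hand-written sample in the assumed layout.
sample = "EU B-ORG\nrejects O\nGerman B-MISC\ncall O\n\nPeter B-PER\nBlackburn I-PER\n"
sentences = parse_conll(sample)
print(len(sentences), sentences[1])
```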

In [None]:
# MLTable folders
training_mltable_path = "./training-mltable-folder/"
validation_mltable_path = "./validation-mltable-folder/"

# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(type=AssetTypes.MLTABLE, path=training_mltable_path)

# Validation MLTable defined locally, with local data to be uploaded
my_validation_data_input = Input(type=AssetTypes.MLTABLE, path=validation_mltable_path)

# WITH REMOTE PATH: If available already in the cloud/workspace-blob-store
# my_training_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_training_mltable")
# my_validation_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_validation_mltable")

For documentation on creating your own MLTable assets for jobs beyond this notebook:
- https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-mltable details how to write MLTable YAMLs (required for each MLTable asset).
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK covers how to work with them in the v2 CLI/SDK.

# 3. Compute target setup

You will need to provide a [Compute Target](https://docs.microsoft.com/en-us/azure/machine-learning/concept-azure-machine-learning-architecture#computes) that will be used for your AutoML model training. AutoML models for NLP tasks require [GPU SKUs](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) such as the ones from the NC, NCv2, NCv3, ND, NDv2 and NCasT4 series. We recommend using the NCv3-series (with V100 GPUs) for faster training. A compute target with a multi-GPU VM SKU uses all of its GPUs to speed up training. Additionally, a compute target with multiple nodes allows for faster training, either by exploring the model search space in parallel or by distributing per-model training across multiple nodes.

In [None]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

compute_name = "gpu-cluster-nc6s-v3"

try:
    _ = ml_client.compute.get(compute_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="Standard_NC6s_v3",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=4,
    )
    ml_client.begin_create_or_update(compute_config).result()

# 4. Configure and run the AutoML NLP Text NER training job
AutoML allows you to easily train models for Text Classification (single- or multi-label) and Named Entity Recognition on your text data. You can control the model algorithm to be used, specify hyperparameter values for your model, as well as perform a sweep across the hyperparameter space to generate an optimal model.

When using AutoML for text tasks, you can specify the model algorithm using the `model_name` parameter. You can either specify a single model or choose to sweep over multiple models. Please refer to the <font color='blue'><a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-nlp-models?tabs=python#model-sweeping-and-hyperparameter-tuning-preview">docs</a></font> for the list of supported models and hyperparameters.

## 4.1 Train with default hyperparameters for a single, specified algorithm
Before doing a large sweep to search for the optimal models and hyperparameters, we recommend trying the default values for a given model to get a first baseline. Next, you can explore different models and hyperparameters, allowing for an iterative approach. With multiple models and hyperparameters, the search space grows exponentially, meaning you will need more iterations to find optimal configurations.
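
The growth of the search space is easy to see by counting combinations. In the sketch below the candidate values are hypothetical and chosen only to illustrate the multiplication. Each additional hyperparameter multiplies the number of configurations by its number of candidate values.

```python
from itertools import product

# Hypothetical candidate values, chosen only to illustrate the growth.
search_space = {
    "model_name": ["bert-base-cased", "roberta-base", "distilroberta-base"],
    "learning_rate": [1e-5, 2e-5, 5e-5, 1e-4],
    "weight_decay": [0.0, 0.01, 0.1],
}

# Every combination of one value per hyperparameter is one configuration.
configurations = list(product(*search_space.values()))
print(len(configurations))  # 3 * 4 * 3 = 36
```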

The following functions are used to configure the AutoML NLP job:

### text_ner() function parameters:
The `text_ner()` factory function allows the user to configure the training job.
- `compute` - the compute on which the AutoML job will run. In this example we use the GPU cluster set up in section 3 (`compute_name`, here 'gpu-cluster-nc6s-v3'). You can replace it with any other compute in the workspace.
- `experiment_name` - the name of the experiment. An experiment is like a folder with multiple runs from the AzureML Workspace that should be related to the same logical machine learning experiment.
- `name` - the name of the Job/Run. This is an optional property. If not specified, a random name will be generated.
- `primary_metric` - the metric that AutoML will optimize for during sweeping.

### set_limits() function parameters:
This is an optional configuration method to set limit parameters such as timeouts.
- `timeout_minutes` - maximum amount of time in minutes that the whole AutoML job can take before the job terminates. If not specified, the default job's total timeout is 6 days (8,640 minutes).
- `max_nodes` - if the underlying compute is a multi-node cluster, specify the maximum number of nodes to use for the experiment. The default is 1. This value can be increased to enable multi-node distribution. Note that if insufficient nodes are available on the compute compared to this value, a smaller value is used.

### set_training_parameters() function parameters:
This is an optional configuration method ***(public preview)*** to configure fixed settings or parameters that will _not_ be changed during the job parameter space sweeping. Specifying a `model_name` for instance fixes that model during training, and a range of models should not be specified in the parameter sweeping space for that same job. Some key parameters of this function are:
- `model_name` - the name of the ML algorithm, or model, that we want to use during training.
- `learning_rate` - the initial learning rate to use during training.
- `learning_rate_scheduler` - the learning rate scheduler to use during training.
- `warmup_ratio` - ratio of total training steps used to warmup from 0 to the initial `learning_rate`.

Please refer to <font color='blue'><a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-nlp-models?tabs=python#model-sweeping-and-hyperparameter-tuning-preview">docs</a></font> for the full list of supported NLP models and hyperparameters.
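
To illustrate how `learning_rate`, `learning_rate_scheduler` and `warmup_ratio` interact, here is a minimal sketch of a linear schedule with warmup. It follows the common convention (linear ramp from 0 to the initial learning rate, then linear decay to 0). The exact schedule AutoML applies internally is not specified here, so treat the function and the numbers as assumptions.

```python
def linear_schedule_with_warmup(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 1,000 steps, initial learning rate 5e-5, 10% warmup.
for step in (0, 50, 100, 550, 1000):
    lr = linear_schedule_with_warmup(
        step, total_steps=1000, base_lr=5e-5, warmup_ratio=0.1
    )
    print(step, lr)
```

With these numbers the rate reaches its peak at step 100 (10% of the run) and returns to zero at step 1,000.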
    
For example, to run 2-way distributed training for a given model algorithm, say `roberta-base`, with a linear learning rate warmup over the first 10% of the total training steps, you can configure your AutoML NLP job as follows:

In [None]:
# general job parameters
exp_name = "dpv2-nlp-text-ner-experiment"

In [None]:
# Create the AutoML job with the related factory-function.

text_ner_job = automl.text_ner(
    compute=compute_name,
    # name="dpv2-text-ner-job-01",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    tags={"my_custom_tag": "My custom value"},
)


# Set limits
text_ner_job.set_limits(timeout_minutes=120, max_nodes=2)

# Pass the fixed parameters
text_ner_job.set_training_parameters(
    model_name="roberta-base",
    learning_rate_scheduler=NlpLearningRateScheduler.LINEAR,
    warmup_ratio=0.1,
)

## Submitting an AutoML job for NLP tasks
Once you've configured the job, you can submit it in the workspace in order to train an NLP model using your training dataset.

In [None]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    text_ner_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

In [None]:
ml_client.jobs.stream(returned_job.name)

## 4.2 Model & Hyperparameter Sweeping for AutoML NLP (Public Preview)
When using AutoML NLP, we can perform a _sweep_ over a defined parameter space to find the optimal model and hyperparameters. For large pretrained text DNNs, hyperparameter sweeping often yields less lift than switching to a more powerful model, so we focus our sweeping search space on model exploration. Whenever hyperparameters are not specified, default values are used for the specified algorithm.

### set_limits() parameters
The `set_limits` function has some useful limits specific to sweep procedures:
- `max_trials` - parameter for maximum number of configurations to sweep. This must be an integer between 1 and 1000. Defaults to 1.
- `max_concurrent_trials` - maximum number of runs that can run concurrently. If not specified, defaults to 1. If specified, the value must be an integer between 1 and 100. **Note**: if `max_nodes` is also specified, concurrent scheduling is given priority over multi-node distribution. For example, given an 8 node cluster with `max_nodes=4` and `max_concurrent_trials=4`, four single-node runs will be scheduled at all times until the max_trials limit is exhausted. With `max_nodes=8` and `max_concurrent_trials=4`, only then would we see four two-node distributed runs active at all times.
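
The interplay between `max_nodes` and `max_concurrent_trials` can be sketched as a small calculation. This is only an illustration of the rule described above, assuming nodes are split evenly across concurrent trials. `plan_node_allocation` is a hypothetical helper and not the actual scheduler.

```python
def plan_node_allocation(max_nodes, max_concurrent_trials):
    """Return (concurrent_trials, nodes_per_trial) under the rule described above."""
    # Concurrency is satisfied first, limited by the nodes available to the job.
    concurrent_trials = min(max_concurrent_trials, max_nodes)
    # The nodes are then shared evenly across the concurrent trials.
    nodes_per_trial = max_nodes // concurrent_trials
    return concurrent_trials, nodes_per_trial


print(plan_node_allocation(max_nodes=4, max_concurrent_trials=4))  # (4, 1)
print(plan_node_allocation(max_nodes=8, max_concurrent_trials=4))  # (4, 2)
```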


### set_sweep() parameters
The `set_sweep` function is used to configure the sweep settings:
- `sampling_algorithm` - sampling method to use for sweeping over the defined parameter space. Please refer to <font color='blue'><a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-nlp-models?tabs=cli#sampling-methods-for-the-sweep">docs</a></font> for the list of supported sampling methods.
- `early_termination` - early termination policy to end poorly performing runs. If no termination policy is specified, all configurations are run to completion. Please refer to this <font color='blue'><a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters#early-termination">page</a></font> for supported early termination policies.

In the following example, we use random sampling to pick samples from the parameter space and specify a total of 4 iterations, running 2 iterations at a time on our compute target.
    
We leverage the Bandit early termination policy, which will terminate poorly performing configs (those that are not within 5% slack of the best performing config), thus significantly saving compute resources.
    
For more details on model and hyperparameter sweeping, please refer to the <font color='blue'><a href="https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-nlp-models?tabs=cli#model-sweeping-and-hyperparameter-tuning-preview">docs</a></font>.
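
As a rough sketch of the Bandit rule for a metric that is being maximized: a trial is terminated when its metric falls below `best_metric / (1 + slack_factor)`, checked every `evaluation_interval` intervals once `delay_evaluation` intervals have passed. The function below is a hypothetical illustration with made-up metric values, not the service implementation.

```python
def bandit_should_terminate(
    trial_metric,
    best_metric,
    slack_factor,
    interval,
    evaluation_interval=1,
    delay_evaluation=0,
):
    """Decide whether a trial is cut, for a metric that is being maximized."""
    # No decision is made before the delay has elapsed.
    if interval < delay_evaluation:
        return False
    # The policy is only applied every `evaluation_interval` intervals.
    if interval % evaluation_interval != 0:
        return False
    # Trials outside the allowed slack of the best trial are terminated.
    threshold = best_metric / (1 + slack_factor)
    return trial_metric < threshold


# Made-up metrics: best trial at 0.90, 5% slack gives a threshold near 0.857.
settings = dict(slack_factor=0.05, evaluation_interval=2, delay_evaluation=6)
print(bandit_should_terminate(0.84, 0.90, interval=6, **settings))  # True
print(bandit_should_terminate(0.88, 0.90, interval=6, **settings))  # False
print(bandit_should_terminate(0.50, 0.90, interval=4, **settings))  # False
```

The last call returns `False` even for a weak trial because interval 4 falls before `delay_evaluation=6`.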

In [None]:
# Create the AutoML job with the related factory-function.

text_ner_job = automl.text_ner(
    compute=compute_name,
    # name="dpv2-text-ner-job-02",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    tags={"my_custom_tag": "My custom value"},
)

text_ner_job.set_limits(
    timeout_minutes=120, max_trials=4, max_concurrent_trials=2, max_nodes=4
)

text_ner_job.extend_search_space(
    [
        SearchSpace(
            model_name=Choice(["bert-base-cased", "roberta-base"]),
        ),
        SearchSpace(
            model_name=Choice(["distilroberta-base"]),
            weight_decay=Uniform(0.01, 0.1),
        ),
    ]
)

text_ner_job.set_sweep(
    sampling_algorithm="Random",
    early_termination=BanditPolicy(
        evaluation_interval=2, slack_factor=0.05, delay_evaluation=6
    ),
)

In [None]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    text_ner_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

In [None]:
ml_client.jobs.stream(returned_job.name)

When sweeping through the parameters in the provided search space, it can be useful to visualize the different configurations that were tried using the HyperDrive UI. To reach it, open the 'Child jobs' tab in the UI of the main AutoML NLP job from above and select the HyperDrive parent run, then open that run's 'Trials' tab. Alternatively, the cell below retrieves the HyperDrive parent run directly, from which you can navigate to its 'Trials' tab:

In [None]:
hd_job = ml_client.jobs.get(returned_job.name + "_HD")
hd_job

## 4.3 Manual hyperparameter sweeping for models from Hugging Face (Preview)
You can use any model algorithm from the Hugging Face Transformers library, either for an individual run or as part of a hyperparameter sweep. You can also combine model algorithms supported natively by [AutoML](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-nlp-models?view=azureml-api-2&tabs=cli) with model algorithms from [Hugging Face](https://huggingface.co/models?pipeline_tag=token-classification&library=transformers&sort=trending).

In this example, we sweep over `bert-large-cased`, `roberta-base`, and `roberta-base-openai-detector`, sampling `weight_decay` from a uniform range for the last of these, to find the model with the best value of the primary metric.

In [None]:
# Create the AutoML job with the related factory-function.

text_ner_job = automl.text_ner(
    compute=compute_name,
    # name="dpv2-text-ner-job-02",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    tags={"my_custom_tag": "My custom value"},
)

text_ner_job.set_limits(timeout_minutes=120, max_trials=4, max_concurrent_trials=2)

text_ner_job.extend_search_space(
    [
        SearchSpace(
            model_name=Choice(["bert-large-cased", "roberta-base"]),
        ),
        SearchSpace(
            model_name=Choice(["roberta-base-openai-detector"]),
            weight_decay=Uniform(0.01, 0.1),
        ),
    ]
)

text_ner_job.set_sweep(
    sampling_algorithm="Random",
    early_termination=BanditPolicy(
        evaluation_interval=2, slack_factor=0.05, delay_evaluation=6
    ),
)

# 5. Retrieve the Best Model
Once all the trials complete training, we can retrieve the best model and deploy it.

## Initialize MLflow Client
The models and artifacts that are produced by AutoML can be accessed via the MLflow interface. Initialize the MLflow client here and set the backend to Azure ML via the MLflow Client. IMPORTANT: you need to have installed the latest MLflow packages with:

`pip install azureml-mlflow`

`pip install mlflow`

### Obtain the tracking URI for MLflow

In [None]:
import mlflow

# Obtain the tracking URI from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

In [None]:
# Set the MLflow tracking URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

In [None]:
from mlflow.tracking.client import MlflowClient
from mlflow.artifacts import download_artifacts

# Initialize MLflow client
mlflow_client = MlflowClient()

### Get the AutoML parent job

In [None]:
job_name = returned_job.name

# Example if providing a specific job name/ID
# job_name = "joyful_carrot_rv9jrjk6c6"

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

print("Parent Run: ")
print(mlflow_parent_run)

In [None]:
# Print parent run tags. 'automl_best_child_run_id' tag should be there.
print(mlflow_parent_run.data.tags)

### Get the AutoML best child run

In [None]:
# Get the best model's child run
best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)

print("Best child run: ")
print(best_run)

### Get best model run's metrics

Access the results (such as Models, Artifacts, and Metrics) of a previously completed AutoML Run.

In [None]:
import pandas as pd

pd.DataFrame(best_run.data.metrics, index=[0]).T

### Download the best model locally

In [None]:
import os

# Create local folder
local_dir = "./artifact_downloads"
os.makedirs(local_dir, exist_ok=True)

In [None]:
# Download run's artifacts/outputs
local_path = download_artifacts(
    run_id=best_run.info.run_id, artifact_path="outputs", dst_path=local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

In [None]:
# Show the contents of the MLflow model folder
os.listdir("./artifact_downloads/outputs/mlflow-model")

# 6. Register best model and deploy

## 6.1 Create managed online endpoint

In [None]:
# import required libraries
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
    ProbeSettings,
)

In [None]:
# Creating a unique endpoint name with current datetime to avoid conflicts
import datetime

online_endpoint_name = "conll-ner-" + datetime.datetime.now().strftime("%m%d%H%M%f")

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is a sample online endpoint for deploying a model",
    auth_mode="key",
    tags={"foo": "bar"},
)
print(online_endpoint_name)

In [None]:
ml_client.begin_create_or_update(endpoint).result()

## 6.2 Deploy

In [None]:
# deploying the mlflow-model
model_name = "conll-ner-mlflow-model"
model = Model(
    path=f"azureml://jobs/{best_run.info.run_id}/outputs/artifacts/outputs/mlflow-model/",
    name=model_name,
    description="my sample ner model",
    type=AssetTypes.MLFLOW_MODEL,
)

# Alternatively, register the model from the locally downloaded files
# model = Model(
#     path="artifact_downloads/outputs/mlflow-model/",
#     name=model_name,
#     description="my sample ner model",
#     type=AssetTypes.MLFLOW_MODEL,
# )

registered_model = ml_client.models.create_or_update(model)

In [None]:
registered_model.id

### Deploy

In [None]:
deployment = ManagedOnlineDeployment(
    name="conll-ner-mlflow-dpl",
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type="Standard_DS4_V2",
    instance_count=1,
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        timeout=2,
        period=10,
        initial_delay=2000,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=10,
        success_threshold=1,
        timeout=10,
        period=10,
        initial_delay=2000,
    ),
)

In [None]:
ml_client.online_deployments.begin_create_or_update(deployment).result()

In [None]:
# set our ner endpoint to take 100% of traffic
endpoint.traffic = {"conll-ner-mlflow-dpl": 100}
ml_client.begin_create_or_update(endpoint).result()

### Get endpoint details

In [None]:
# Get the details for online endpoint
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

# existing traffic details
print(endpoint.traffic)

# Get the scoring URI
print(endpoint.scoring_uri)

### Test the deployment
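
The request in the cells below sends each sentence as a single string with one token per line. A small helper can build that payload from pre-tokenized sentences. This is a sketch: `build_ner_request` is a hypothetical name, and it mirrors the `input_data` layout used in this notebook rather than a documented schema.

```python
import json


def build_ner_request(sentences):
    """Build the request payload: one string per sentence, one token per line."""
    formatted = ["\n".join(tokens) + "\n" for tokens in sentences]
    return {"input_data": formatted}


# One pre-tokenized sentence, matching the example used in this notebook.
request = build_ner_request(
    [["The", "European", "Commission", "made", "a", "ruling", "on", "Friday"]]
)
print(json.dumps(request))
```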

In [None]:
CoNLL_formatted_string = """The
European
Commission
made
a
ruling
on
Friday
"""
request_json = {"input_data": [CoNLL_formatted_string]}

In [None]:
import json

request_file_name = "sample_request_data.json"
with open(request_file_name, "w") as request_file:
    json.dump(request_json, request_file)

In [None]:
resp = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name=deployment.name,
    request_file=request_file_name,
)
# Show the raw response returned by the endpoint
print(resp)

### Delete the deployment and endpoint

Once you are done with the model, you can delete the endpoint and associated deployment if you wish.

In [None]:
ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

# Next Steps

You can see further examples of other AutoML tasks, such as regression, image-classification, time-series forecasting, etc. in other notebooks of this repo.