sdk/python/jobs/automl-standalone-jobs/automl-nlp-text-named-entity-recognition-task/automl-nlp-text-ner-task.ipynb

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# AutoML - Train \"the best\" NLP NER model for a named entity recognition dataset.\n", "\n", "**Requirements** - In order to benefit from this tutorial, you will need:\n", "- A basic understanding of Machine Learning\n", "- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)\n", "- An Azure ML workspace. [Check this notebook for creating a workspace](../../../resources/workspace/workspace.ipynb)\n", "- A Python environment\n", "- Installed Azure Machine Learning Python SDK v2 - [install instructions](../../../README.md) - check the getting started section\n", "\n", "Named entity recognition (NER) is a sub-task of information extraction (IE) that seeks out and categorizes specified entities in a body of text. NER is also known as entity identification, entity chunking, and entity extraction.\n", "\n", "This notebook uses the AutoML NLP NER task to train a model on prepared datasets derived from the CoNLL-2003 dataset, introduced by Sang et al. in [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition](https://paperswithcode.com/paper/introduction-to-the-conll-2003-shared-task); a derived version is also available on Kaggle as [CoNLL003 (English-version)](https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?select=valid.txt).\n", "\n", "CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task on language-independent named entity recognition." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Connect to Azure Machine Learning Workspace\n", "\n", "The [workspace](https://docs.microsoft.com/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.\n", "\n", "## 1.1. Import the required libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1634852261599 } }, "outputs": [], "source": [ "# Import required libraries\n", "from azure.identity import DefaultAzureCredential\n", "from azure.ai.ml import automl, Input, MLClient\n", "from azure.ai.ml.constants import AssetTypes\n", "from azure.ai.ml.entities import ResourceConfiguration\n", "\n", "from pprint import pprint" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2. Configure workspace details and get a handle to the workspace\n", "\n", "To connect to a workspace, we need identifier parameters - a subscription ID, resource group, and workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the [default Azure authentication](https://docs.microsoft.com/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1634852261884 }, "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "credential = DefaultAzureCredential()\n", "ml_client = None\n", "try:\n", "    ml_client = MLClient.from_config(credential)\n", "except Exception as ex:\n", "    print(ex)\n", "    # Enter details of your AML workspace\n", "    subscription_id = \"<SUBSCRIPTION_ID>\"\n", "    resource_group = \"<RESOURCE_GROUP>\"\n", "    workspace = \"<AML_WORKSPACE_NAME>\"\n", "    ml_client = MLClient(credential, subscription_id, resource_group, workspace)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Data\n", "\n", "This model training uses datasets from Kaggle [CoNLL003 (English-version)](https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion?select=valid.txt); in particular, the following datasets are used in the training and validation process:\n", "\n", "- Training dataset file (train.txt)\n", "- Validation dataset file (valid.txt)\n", "\n", "The files are in CoNLL format: each line holds a token and its entity tag, with blank lines separating sentences. Both files are placed within their related MLTable folder.\n", "\n", "Please make use of the MLTable files present in separate folders at the same location (in the repo) as this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MLTable folders\n", "training_mltable_path = \"./training-mltable-folder/\"\n", "validation_mltable_path = \"./validation-mltable-folder/\"\n", "\n", "# Training MLTable defined locally, with local data to be uploaded\n", "my_training_data_input = Input(type=AssetTypes.MLTABLE, path=training_mltable_path)\n", "\n", "# Validation MLTable defined locally, with local data to be uploaded\n", "my_validation_data_input = Input(type=AssetTypes.MLTABLE, path=validation_mltable_path)\n", "\n", "# WITH REMOTE PATH: If available already in the cloud/workspace-blob-store\n", "# my_training_data_input = Input(type=AssetTypes.MLTABLE, path=\"azureml://datastores/workspaceblobstore/paths/my_training_mltable\")\n", "# my_validation_data_input = Input(type=AssetTypes.MLTABLE, path=\"azureml://datastores/workspaceblobstore/paths/my_validation_mltable\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For documentation on creating your own MLTable assets for jobs beyond this notebook:\n", "- https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-mltable details how to write MLTable YAMLs (required for each MLTable asset).\n", "- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK covers how to work with them in the v2 CLI/SDK." ] }
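, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, an MLTable folder can also be registered as a versioned data asset so it can be reused across jobs by name. The following is a minimal sketch, left commented out; the asset name and version are illustrative placeholders:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# OPTIONAL (sketch): register the training MLTable as a versioned data asset,\n", "# so it can be referenced by name in later jobs. Name/version are illustrative.\n", "# from azure.ai.ml.entities import Data\n", "#\n", "# my_training_data_asset = Data(\n", "#     path=training_mltable_path,\n", "#     type=AssetTypes.MLTABLE,\n", "#     name=\"conll-ner-training-data\",\n", "#     version=\"1\",\n", "# )\n", "# ml_client.data.create_or_update(my_training_data_asset)" ] }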
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1634852262026 }, "jupyter": { "outputs_hidden": false, "source_hidden": false }, "name": "text-ner-configuration", "nteract": { "transient": { "deleting": false } }, "scrolled": true }, "outputs": [], "source": [ "# Create the AutoML job with the related factory-function.\n", "\n", "exp_name = \"dpv2-nlp-text-ner-experiment\"\n", "exp_timeout = 60\n", "text_ner_job = automl.text_ner(\n", " # name=\"dpv2-nlp-text-ner-job-01\",\n", " experiment_name=exp_name,\n", " training_data=my_training_data_input,\n", " validation_data=my_validation_data_input,\n", " tags={\"my_custom_tag\": \"My custom value\"},\n", ")\n", "\n", "text_ner_job.set_limits(timeout_minutes=exp_timeout)\n", "text_ner_job.resources = ResourceConfiguration(instance_type=\"Standard_NC6s_v3\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2 Run the Command\n", "Using the `MLClient` created earlier, we will now run this Commandas a job in the workspace." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1634852267930 }, "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } }, "scrolled": true }, "outputs": [], "source": [ "# Submit the AutoML job\n", "\n", "returned_job = ml_client.jobs.create_or_update(\n", " text_ner_job\n", ") # submit the job to the backend\n", "\n", "print(f\"Created job: {returned_job}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_client.jobs.stream(returned_job.name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3 Runs with models from Hugging Face (Preview)\n", "\n", "In addition to the model algorithms supported natively by AutoML, you can launch individual runs to explore any model algorithm from HuggingFace transformers library that supports text classification. 
"If you wish to try a specific model algorithm, you can specify the job for your AutoML NLP runs as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute target setup\n", "\n", "from azure.ai.ml.entities import AmlCompute\n", "from azure.core.exceptions import ResourceNotFoundError\n", "\n", "compute_name = \"gpu-cluster-nc6s-v3\"\n", "\n", "try:\n", "    _ = ml_client.compute.get(compute_name)\n", "    print(\"Found existing compute target.\")\n", "except ResourceNotFoundError:\n", "    print(\"Creating a new compute target...\")\n", "    compute_config = AmlCompute(\n", "        name=compute_name,\n", "        type=\"amlcompute\",\n", "        size=\"Standard_NC6s_v3\",\n", "        idle_time_before_scale_down=120,\n", "        min_instances=0,\n", "        max_instances=4,\n", "    )\n", "    ml_client.begin_create_or_update(compute_config).result()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the AutoML job with the related factory-function.\n", "\n", "text_ner_hf_job = automl.text_ner(\n", "    experiment_name=exp_name,\n", "    compute=compute_name,\n", "    training_data=my_training_data_input,\n", "    validation_data=my_validation_data_input,\n", "    tags={\"my_custom_tag\": \"My custom value\"},\n", ")\n", "\n", "text_ner_hf_job.set_limits(timeout_minutes=exp_timeout)\n", "text_ner_hf_job.set_training_parameters(model_name=\"roberta-base-openai-detector\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Submit the AutoML job\n", "\n", "returned_hf_job = ml_client.jobs.create_or_update(\n", "    text_ner_hf_job\n", ")  # submit the job to the backend\n", "\n", "print(f\"Created job: {returned_hf_job}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_client.jobs.stream(returned_hf_job.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.4 Hyperparameter Sweep Runs (Public Preview)\n", "\n", "AutoML allows you to easily train models for Named Entity Recognition on your text data. You can control the model algorithm to be used, specify hyperparameter values for your model, as well as perform a sweep across the hyperparameter space to generate an optimal model.\n", "\n", "When using AutoML for text tasks, you can specify the model algorithm using the `model_name` parameter. You can either specify a single model or choose to sweep over multiple models; a minimal configuration sketch is shown below. Please refer to the <font color='blue'><a href=\"https://github.com/Azure/azureml-examples/blob/48957c70bd53912077e81a180f424f650b414107/sdk/python/jobs/automl-standalone-jobs/automl-nlp-text-named-entity-recognition-task-distributed-sweeping/automl-nlp-text-ner-task-distributed-with-sweeping.ipynb\">sweep notebook</a></font> for detailed instructions on configuring and submitting a sweep job." ] }
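, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell is a minimal, non-authoritative sketch of such a sweep configuration, left commented out; the model names, hyperparameter ranges, and trial budget are illustrative assumptions. Please consult the sweep notebook linked above for the full, validated workflow:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SKETCH: sweep over candidate model algorithms (all values are illustrative).\n", "# from azure.ai.ml.automl import SearchSpace\n", "# from azure.ai.ml.sweep import BanditPolicy, Choice, Uniform\n", "#\n", "# text_ner_sweep_job = automl.text_ner(\n", "#     experiment_name=exp_name,\n", "#     compute=compute_name,\n", "#     training_data=my_training_data_input,\n", "#     validation_data=my_validation_data_input,\n", "# )\n", "#\n", "# # Give the sweep a budget of several trials, running two at a time\n", "# text_ner_sweep_job.set_limits(\n", "#     timeout_minutes=120, max_trials=4, max_concurrent_trials=2\n", "# )\n", "#\n", "# # Candidate models and an example learning-rate range to explore\n", "# text_ner_sweep_job.extend_search_space(\n", "#     [\n", "#         SearchSpace(\n", "#             model_name=Choice([\"bert-base-cased\", \"roberta-base\"]),\n", "#             learning_rate=Uniform(1e-5, 5e-5),\n", "#         ),\n", "#     ]\n", "# )\n", "#\n", "# # Random sampling with bandit early termination of weak trials\n", "# text_ner_sweep_job.set_sweep(\n", "#     sampling_algorithm=\"Random\",\n", "#     early_termination=BanditPolicy(\n", "#         evaluation_interval=2, slack_factor=0.05, delay_evaluation=6\n", "#     ),\n", "# )" ] }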
, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 3 Retrieve Model Information from the Best Trial\n", "\n", "Once all the trials complete training, we can retrieve the best model and deploy it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Obtain best child run id\n", "returned_nlp_job = ml_client.jobs.get(name=returned_job.name)\n", "best_child_run_id = returned_nlp_job.tags[\"automl_best_child_run_id\"]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### Initialize MLflow Client\n", "Use the MLflow interface (`MlflowClient`) to access the results and other information (such as models, artifacts, and metrics) of a previously completed AutoML trial.\n", "\n", "Initialize the MLflow client here, and set the backend to Azure ML via the MLflow client.\n", "\n", "*IMPORTANT*: you need to have installed the latest MLflow packages with:\n", "\n", "`pip install azureml-mlflow`\n", "\n", "`pip install mlflow`\n", "\n", "### Obtain the tracking URI for MLflow" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "import mlflow\n", "\n", "# Obtain the tracking URI from MLClient\n", "MLFLOW_TRACKING_URI = ml_client.workspaces.get(\n", "    name=ml_client.workspace_name\n", ").mlflow_tracking_uri\n", "\n", "print(MLFLOW_TRACKING_URI)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# Set the MLflow tracking URI\n", "mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)\n", "print(\"\\nCurrent tracking uri: {}\".format(mlflow.get_tracking_uri()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "from mlflow.tracking.client import MlflowClient\n", "from mlflow.artifacts import download_artifacts\n", "\n", "# Initialize MLflow client\n", "mlflow_client = MlflowClient()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### Get the AutoML parent Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "job_name = returned_job.name\n", "\n", "# Get the parent run\n", "mlflow_parent_run = mlflow_client.get_run(job_name)\n", "\n", "print(\"Parent Run: \")\n", "print(mlflow_parent_run)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### Get the AutoML best child run" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# Get the best model's child run\n", "best_run = mlflow_client.get_run(best_child_run_id)\n", "\n", "print(\"Best child run: \")\n", "print(best_run)\n", "# note: best model's child run id can also be retrieved through:\n", "# best_child_run_id = mlflow_parent_run.data.tags[\"automl_best_child_run_id\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [
"best_run.data.metrics" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### Download the best model locally\n", "\n", "Access the results (such as Models, Artifacts, Metrics) of a previously completed AutoML Run." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "import os\n", "\n", "# Create local folder\n", "local_dir = \"./artifact_downloads\"\n", "if not os.path.exists(local_dir):\n", " os.mkdir(local_dir)\n", "# Download run's artifacts/outputs\n", "local_path = download_artifacts(\n", " run_id=best_run.info.run_id, artifact_path=\"outputs\", dst_path=local_dir\n", ")\n", "print(\"Artifacts downloaded in: {}\".format(local_path))\n", "print(\"Artifacts: {}\".format(os.listdir(local_path)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# Show the contents of the MLFlow model folder\n", "os.listdir(\"./artifact_downloads/outputs/mlflow-model\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "# 4 Model Deployment and Inference\n", "\n", "## 4.1 Create managed online endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# import required libraries\n", "from azure.ai.ml.entities import (\n", " ManagedOnlineEndpoint,\n", " ManagedOnlineDeployment,\n", " Model,\n", " Environment,\n", " CodeConfiguration,\n", " ProbeSettings,\n", ")\n", "\n", "# Creating a unique endpoint name with current datetime to avoid conflicts\n", "import datetime\n", "\n", "online_endpoint_name = \"nlp-ner\" + datetime.datetime.now().strftime(\"%m%d%H%M%f\")\n", "\n", "# create an online endpoint\n", "endpoint = ManagedOnlineEndpoint(\n", " name=online_endpoint_name,\n", " description=\"ner endpoint\",\n", " auth_mode=\"key\",\n", " tags={\"foo\": \"bar\"},\n", ")\n", "print(online_endpoint_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "ml_client.begin_create_or_update(endpoint).result()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "## 4.2 Register best model and deploy\n", "### Register Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# Register best model\n", "## Register model\n", "\n", "model_name = \"nlp-ner-model\"\n", "model = Model(\n", " path=f\"azureml://jobs/{best_run.info.run_id}/outputs/artifacts/outputs/mlflow-model/\",\n", " name=model_name,\n", " description=\"my sample nlp NER task\",\n", " type=AssetTypes.MLFLOW_MODEL,\n", ")\n", "# for downloaded file\n", "# model = Model(\n", "# path=path=\"artifact_downloads/outputs/mlflow-model/\",\n", "# name=model_name,\n", "# description=\"\",\n", "# type=AssetTypes.MLFLOW_MODEL,\n", "# 
"registered_model = ml_client.models.create_or_update(model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "registered_model.id" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Deploy\n", "\n", "For the list of virtual machine SKUs that are supported for Azure Machine Learning managed online endpoints, see https://docs.microsoft.com/azure/machine-learning/reference-managed-online-endpoints-vm-sku-list" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "deployment = ManagedOnlineDeployment(\n", "    name=\"ner-deploy\",\n", "    endpoint_name=online_endpoint_name,\n", "    model=registered_model.id,\n", "    instance_type=\"Standard_DS4_v2\",\n", "    instance_count=1,\n", "    liveness_probe=ProbeSettings(\n", "        failure_threshold=30,\n", "        success_threshold=1,\n", "        timeout=2,\n", "        period=10,\n", "        initial_delay=2000,\n", "    ),\n", "    readiness_probe=ProbeSettings(\n", "        failure_threshold=10,\n", "        success_threshold=1,\n", "        timeout=10,\n", "        period=10,\n", "        initial_delay=2000,\n", "    ),\n", ")\n", "ml_client.online_deployments.begin_create_or_update(deployment).result()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# deployment to take 100% traffic\n", "endpoint.traffic = {\"ner-deploy\": 100}\n", "ml_client.begin_create_or_update(endpoint).result()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### Get endpoint details" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# Get the details for online endpoint\n", "endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)\n", "\n", "# existing traffic details\n", "print(endpoint.traffic)\n", "\n", "# Get the scoring URI\n", "print(endpoint.scoring_uri)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "## 4.3 Test Deployment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "import json\n", "\n", "# Minimal sample payload; replace the placeholder string with your own text to run NER on\n", "request_json = {\"input_data\": [\"None\"]}\n", "request_file_name = \"sample_request_data.json\"\n", "with open(request_file_name, \"w\") as request_file:\n", "    json.dump(request_json, request_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "resp = ml_client.online_endpoints.invoke(\n", "    endpoint_name=online_endpoint_name,\n", "    deployment_name=deployment.name,\n", "    request_file=request_file_name,\n", ")\n", "resp" ] }
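, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As an alternative to `MLClient.online_endpoints.invoke`, the endpoint can also be called over plain REST with the scoring URI and an endpoint key. The following is a minimal sketch and assumes the `requests` package is available in your environment:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SKETCH: score via REST using the endpoint key (key auth was configured above)\n", "import requests\n", "\n", "keys = ml_client.online_endpoints.get_keys(name=online_endpoint_name)\n", "headers = {\n", "    \"Authorization\": f\"Bearer {keys.primary_key}\",\n", "    \"Content-Type\": \"application/json\",\n", "}\n", "response = requests.post(endpoint.scoring_uri, json=request_json, headers=headers)\n", "print(response.status_code)\n", "print(response.text)" ] }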
, { "attachments": {}, "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### Delete Deployment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Next Steps\n", "You can see further examples of other AutoML tasks such as Regression, Image-Object-Detection, Time-Series-Forecasting, etc." ] } ], "metadata": { "kernel_info": { "name": "v2" }, "kernelspec": { "display_name": "Python 3.10 - SDK V2", "language": "python", "name": "python310-sdkv2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" }, "microsoft": { "host": { "AzureML": { "notebookHasBeenCompleted": true } } }, "nteract": { "version": "nteract-front-end@1.0.0" }, "vscode": { "interpreter": { "hash": "2d563d8fd79d5fa8e61eca0bf87fe3e27a5507bc1a7f6ef398659c2c45d91002" } } }, "nbformat": 4, "nbformat_minor": 1 }