# Training and tracking an XGBoost classifier with MLflow using Service Principal Authentication

This notebook demonstrates how to track experiments with MLflow in Azure ML, using a Service Principal to authenticate against Azure. Authentication is handled automatically when you run inside an Azure ML compute (Compute Instances, Compute Clusters). However, if you use any other type of compute (Azure Databricks, your laptop, etc.), you have to provide credentials to access Azure ML services. By default, the `azureml-mlflow` plug-in uses interactive authentication. There may be cases where interactivity is not possible, for instance when you are running inside a job or an unattended system. In those cases, you can use a Service Principal to authenticate against the services. In this example, we walk through the steps to train a model using MLflow connected to Azure ML using a Service Principal.

We will consider the [Heart Disease Data Set](https://archive.ics.uci.edu/ml/datasets/heart+disease). This database contains 76 attributes, but we will be using a subset of 14 of them. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. In this example we concentrate on simply distinguishing presence (values 1, 2, 3, 4) from absence (value 0).

In [None]:
# Ensure you have the dependencies for this notebook
%pip install -r xgboost_service_principal.txt

In [None]:
import warnings

warnings.simplefilter("ignore")

## Configuring the experiment

### Configuring how MLflow will authenticate

We are going to configure the communication between Azure ML and MLflow. By default, the plug-in `azureml-mlflow` uses interactive authentication to authenticate against Azure. However, we can change this by populating specific environment variables. The following environment variables, if configured, will result in the communication being established using a Service Principal:

In [None]:
import os

os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>"
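Hard-coding a client secret in a notebook makes it easy to leak. As a minimal sketch, assuming the three variables are exported in the shell or injected by the job runner before the notebook starts, the cell below only checks that they are present. The helper name `missing_service_principal_vars` is hypothetical and is not part of `azureml-mlflow` or `azure-identity`.

```python
import os

# Variables read by the environment-based credential when authenticating
# as a Service Principal with a client secret.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_service_principal_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Service Principal environment variables are set.")
```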

### Getting access to the workspace

We can use these credentials to get access to the workspace. To do that, we need the workspace details:

In [None]:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

> **Hint for automation:** If you are running in a context where the Azure CLI is installed, like a GitHub Workflow or an Azure DevOps Pipeline, you can use `AzureCliCredential` here instead of `DefaultAzureCredential` to get the associated credentials. In that case, configuring the environment variables `AZURE_TENANT_ID`, `AZURE_CLIENT_ID` and `AZURE_CLIENT_SECRET` is not required.

In [None]:
credentials = DefaultAzureCredential()

We could have used `ClientSecretCredential` instead of `DefaultAzureCredential` and passed `client_id`, `client_secret` and `tenant_id` explicitly. However, since we already configured the environment variables, we can skip that step and use `DefaultAzureCredential`, which picks up the values for us.

In [None]:
ml_client = MLClient(
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    credential=credentials,
)

In [None]:
ws = ml_client.workspaces.get(name=workspace)

### Configuring the tracking URL in MLflow and creating the experiment

We can now use `ws.mlflow_tracking_uri` to get access to the tracking URL for the given workspace. Since the environment variables we configured before are populated, when we call `mlflow.set_experiment`, those credentials will be used to authenticate against the service.

In [None]:
import mlflow

# Point MLflow at the tracking server of the Azure ML workspace
mlflow.set_tracking_uri(ws.mlflow_tracking_uri)
mlflow.set_experiment(experiment_name="heart-condition-classifier")

## Exploring the data

In [None]:
import pandas as pd

In [None]:
file_url = "https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv"
df = pd.read_csv(file_url)

In [None]:
df

As we can see, some of the variables are categorical. To make it simpler for our model to handle these values, let's use their encoded values:

In [None]:
df["thal"] = df["thal"].astype("category").cat.codes

The encoded values look as follows:

In [None]:
df["thal"].unique()
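`cat.codes` replaces each label with its position in the sorted list of categories, so the numbers carry no meaning of their own. The sketch below uses a few made-up `thal`-like labels (not rows from the actual dataset) to show how the mapping from code back to label could be recovered before the column is overwritten.

```python
import pandas as pd

# Toy labels for illustration only; the real column may hold other values.
thal = pd.Series(["normal", "fixed", "reversible", "fixed"])

as_category = thal.astype("category")

# Categories are sorted, and each code is the index of the label in that order.
code_to_label = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes.tolist()

print(code_to_label)
print(codes)
```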

Let's split our dataset into train and test sets, so we can assess the performance of the model on data it has not seen during training.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.3
)
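The split above is random, so metrics will vary between executions. One possible variation, shown here on a small synthetic frame rather than the heart dataset, fixes `random_state` for reproducibility and passes `stratify` so both splits keep the same class balance. The seed value `42` and the toy column `age` are arbitrary choices for this sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10 rows, balanced binary target.
toy = pd.DataFrame(
    {
        "age": [41, 52, 63, 37, 58, 60, 45, 49, 55, 67],
        "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    }
)

toy_features = toy.drop("target", axis=1)
toy_labels = toy["target"]

# random_state makes the split repeatable; stratify preserves class proportions.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_labels, test_size=0.3, random_state=42, stratify=toy_labels
)

print(len(toy_X_train), len(toy_X_test))
```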

## Training a model

We are going to use autologging capabilities in MLflow to track parameters and metrics:

In [None]:
mlflow.xgboost.autolog()

Let's create a simple classifier and train it:

In [None]:
from xgboost import XGBClassifier

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")

As soon as the `fit` method is executed, MLflow will start a run in Azure ML to track the experiment. However, it is a good idea to start the run manually so you have the run ID at hand. This is not required, though.

> Important: When running training routines in Azure ML as jobs, you don't need to start or end the run in your training code as it is automatically done for you by Azure ML.

In [None]:
run = mlflow.start_run()

In [None]:
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

## Logging extra metrics

Autologging in XGBoost logs metrics like validation loss; however, it won't log metrics that are specific to a classification problem. In this case, we are going to pay closer attention to our ability to detect a heart condition while avoiding type II errors (false negatives) as much as possible. To calculate the metrics, we are going to use our test dataset:

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, recall_score

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
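A type II error is a false negative: a patient with heart disease predicted as healthy. Recall is the fraction of actual positives the model catches, so a higher recall means fewer type II errors. The sketch below computes it by hand on made-up labels to make the relationship explicit; the values are illustrative and unrelated to the model trained here.

```python
# Made-up ground truth and predictions (1 = heart disease present).
actual = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]

true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
false_negatives = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

# Recall = TP / (TP + FN); each false negative is one type II error.
manual_recall = true_positives / (true_positives + false_negatives)

print(f"Type II errors: {false_negatives}, recall: {manual_recall:.2f}")
```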

In [None]:
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("Recall: %.2f%%" % (recall * 100.0))

## Exploring the experiment with MLflow

Let's first end the experiment run so we can review it:

In [None]:
mlflow.end_run()

We can query the run again to see what's been logged:

In [None]:
run = mlflow.get_run(run.info.run_id)

Let's explore the parameters that got logged:

In [None]:
pd.DataFrame(data=[run.data.params], index=["Value"]).T

Let's explore the metrics' values:

In [None]:
pd.DataFrame(data=[run.data.metrics], index=["Value"]).T

> Notice how the metrics calculated with `scikit-learn` were automatically tracked for us; none of them were manually added to the run. MLflow also uses naming conventions that include variable names to help you understand what was logged: `X_test` is part of the metric name, indicating that the metric corresponds to the testing split of the dataset.

Let's explore artifacts that got logged in the run. This requires us to use the MLflow client:

In [None]:
client = mlflow.tracking.MlflowClient()

In [None]:
client.list_artifacts(run_id=run.info.run_id)

As you can see in this example, the following artifacts are available in the run:

* `feature_importance_weight.json` -> the feature importance of the model we created.
* `feature_importance_weight.png` -> a plot of the feature importance mentioned above, stored as an image.
* `metric_info.json` -> a JSON representation of all the metrics captured by XGBoost.
* `model` -> the path where the model is stored. Note that this artifact is a directory.
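
Because `model` is a directory, a single `list_artifacts` call does not show the files inside it. The sketch below walks the tree recursively. `iter_artifact_files` is a hypothetical helper; it assumes only that the client exposes `list_artifacts(run_id, path)` returning entries with `path` and `is_dir` attributes, as `MlflowClient` does.

```python
def iter_artifact_files(client, run_id, path=None):
    """Yield the path of every file artifact in a run, descending into directories."""
    for entry in client.list_artifacts(run_id, path):
        if entry.is_dir:
            # Recurse into sub-directories such as `model`.
            yield from iter_artifact_files(client, run_id, entry.path)
        else:
            yield entry.path
```

With the objects defined in this notebook, `list(iter_artifact_files(client, run.info.run_id))` would return the flat list of file paths.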

You can download any artifact using the function `mlflow.artifacts.download_artifacts`:

In [None]:
file_path = mlflow.artifacts.download_artifacts(
    run_id=run.info.run_id, artifact_path="feature_importance_weight.png"
)

Since the artifact is an image, we can display it in the following way:

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as img

image = img.imread(file_path)
plt.imshow(image)
plt.show()