{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Create production machine learning pipelines\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The core of a machine learning pipeline is to split a complete machine learning task into a multistep workflow. Each step is a manageable component that can be developed, optimized, configured, and automated individually. Steps are connected through well-defined interfaces. The Azure Machine Learning pipeline service automatically orchestrates all the dependencies between pipeline steps. The benefits of using a pipeline are standardized the MLOps practice, scalable team collaboration, training efficiency and cost reduction. To learn more about the benefits of pipelines, see [What are Azure Machine Learning pipelines](https://learn.microsoft.comazure/machine-learning/concept-ml-pipelines).\n",
"\n",
"In this tutorial, you use Azure Machine Learning to create a production ready machine learning project, using Azure Machine Learning Python SDK v2.\n",
"\n",
"This means you will be able to leverage the AzureML Python SDK to:\n",
"\n",
"- Get a handle to your Azure Machine Learning workspace\n",
"- Create Azure Machine Learning data assets\n",
"- Create reusable Azure Machine Learning components\n",
"- Create, validate and run Azure Machine Learning pipelines\n",
"\n",
"During this tutorial, you create an Azure Machine Learning pipeline to train a model for credit default prediction. The pipeline handles two steps: \n",
"\n",
"1. Data preparation\n",
"1. Training and registering the trained model\n",
"\n",
"The next image shows a simple pipeline as you'll see it in the Azure studio once submitted.\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"* If you opened this notebook from Azure Machine Learning studio, you need a compute instance to run the code. If you don't have a compute instance, select **Create compute** on the toolbar to first create one. You can use all the default settings. \n",
"\n",
" \n",
"\n",
"* If your Azure Machine Learning workspace is configured with a managed virtual network, you may need to add outbound rules to allow access to the public Python package repositories. For more information, see [Scenario: Access public machine learning packages](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-managed-network#scenario-access-public-machine-learning-packages).\n",
"\n",
"* If you're seeing this notebook elsewhere, complete [Create resources you need to get started](https://docs.microsoft.com/azure/machine-learning/quickstart-create-resources) to create an Azure Machine Learning workspace and a compute instance.\n",
"\n",
"* Complete the tutorial [Tutorial: Upload, access and explore your data](explore-data.ipynb) to create the data asset you need in this tutorial. Make sure you run all the code to create the initial data asset. You can optionally explore the data and revise it, but you'll only need the initial data to complete this tutorial.\n",
"\n",
"## Set your kernel\n",
"\n",
"* If your compute instance is stopped, start it now. \n",
" \n",
" \n",
"\n",
"* Once your compute instance is running, make sure the that the kernel, found on the top right, is `Python 3.10 - SDK v2`. If not, use the dropdown to select this kernel.\n",
"\n",
" \n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up the pipeline resources\n",
"\n",
"The Azure Machine Learning framework can be used from CLI, Python SDK, or studio interface. In this example, you use the Azure Machine Learning Python SDK v2 to create a pipeline. \n",
"\n",
"Before creating the pipeline, you need the following resources:\n",
"\n",
"* The data asset for training\n",
"* The software environment to run the pipeline\n",
"* A compute resource to where the job runs\n",
"\n",
"## Create handle to workspace\n",
"\n",
"Before we dive in the code, you need a way to reference your workspace. You'll create `ml_client` for a handle to the workspace. You'll then use `ml_client` to manage resources and jobs.\n",
"\n",
"In the next cell, enter your Subscription ID, Resource Group name and Workspace name. To find these values:\n",
"\n",
"1. In the upper right Azure Machine Learning studio toolbar, select your workspace name.\n",
"1. Copy the value for workspace, resource group and subscription ID into the code.\n",
"1. You'll need to copy one value, close the area and paste, then come back for the next one.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "ml_client"
},
"outputs": [],
"source": [
"from azure.ai.ml import MLClient\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"# authenticate\n",
"credential = DefaultAzureCredential()\n",
"\n",
"SUBSCRIPTION = \"<SUBSCRIPTION_ID>\"\n",
"RESOURCE_GROUP = \"<RESOURCE_GROUP>\"\n",
"WS_NAME = \"<AML_WORKSPACE_NAME>\"\n",
"# Get a handle to the workspace\n",
"ml_client = MLClient(\n",
" credential=credential,\n",
" subscription_id=SUBSCRIPTION,\n",
" resource_group_name=RESOURCE_GROUP,\n",
" workspace_name=WS_NAME,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"> [!NOTE]\n",
"> Creating MLClient will not connect to the workspace. The client initialization is lazy, it will wait for the first time it needs to make a call (this will happen in the next code cell).\n",
"\n",
"Verify the connection by making a call to `ml_client`. Since this is the first time that you're making a call to the workspace, you may be asked to authenticate. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify that the handle works correctly.\n",
"# If you ge an error here, modify your SUBSCRIPTION, RESOURCE_GROUP, and WS_NAME in the previous cell.\n",
"ws = ml_client.workspaces.get(WS_NAME)\n",
"print(ws.location, \":\", ws.resource_group)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Access the registered data asset\n",
"\n",
"Start by getting the data that you previously registered in [Tutorial: Upload, access and explore your data](explore-data.ipynb).\n",
"\n",
"* Azure Machine Learning uses a `Data` object to register a reusable definition of data, and consume data within a pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "update-credit_data"
},
"outputs": [],
"source": [
"# get a handle of the data asset and print the URI\n",
"credit_data = ml_client.data.get(name=\"credit-card\", version=\"initial\")\n",
"print(f\"Data asset URI: {credit_data.path}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a job environment for pipeline steps\n",
"\n",
"So far, you've created a development environment on the compute instance, your development machine. You also need an environment to use for each step of the pipeline. Each step can have its own environment, or you can use some common environments for multiple steps.\n",
"\n",
"In this example, you create a conda environment for your jobs, using a conda yaml file.\n",
"First, create a directory to store the file in."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"name": "dependencies_dir"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"dependencies_dir = \"./dependencies\"\n",
"os.makedirs(dependencies_dir, exist_ok=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, create the file in the dependencies directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "conda.yaml"
},
"outputs": [],
"source": [
"%%writefile {dependencies_dir}/conda.yaml\n",
"name: model-env\n",
"channels:\n",
" - conda-forge\n",
"dependencies:\n",
" - python=3.8\n",
" - numpy=1.21.2\n",
" - pip=21.2.4\n",
" - scikit-learn=0.24.2\n",
" - scipy=1.7.1\n",
" - pandas>=1.1,<1.2\n",
" - pip:\n",
" - inference-schema[numpy-support]==1.3.0\n",
" - xlrd==2.0.1\n",
" - mlflow== 2.4.1\n",
" - azureml-mlflow==1.51.0"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The specification contains some usual packages, that you use in your pipeline (numpy, pip), together with some Azure Machine Learning specific packages (azureml-mlflow).\n",
"\n",
"The Azure Machine Learning packages aren't mandatory to run Azure Machine Learning jobs. However, adding these packages let you interact with Azure Machine Learning for logging metrics and registering models, all inside the Azure Machine Learning job. You use them in the training script later in this tutorial.\n",
"\n",
"Use the *yaml* file to create and register this custom environment in your workspace:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "custom_env_name"
},
"outputs": [],
"source": [
"from azure.ai.ml.entities import Environment\n",
"\n",
"custom_env_name = \"aml-scikit-learn\"\n",
"\n",
"pipeline_job_env = Environment(\n",
" name=custom_env_name,\n",
" description=\"Custom environment for Credit Card Defaults pipeline\",\n",
" tags={\"scikit-learn\": \"0.24.2\"},\n",
" conda_file=os.path.join(dependencies_dir, \"conda.yaml\"),\n",
" image=\"mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest\",\n",
" version=\"0.2.0\",\n",
")\n",
"pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)\n",
"\n",
"print(\n",
" f\"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}\"\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build the training pipeline\n",
"\n",
"Now that you have all assets required to run your pipeline, it's time to build the pipeline itself.\n",
"\n",
"Azure Machine Learning pipelines are reusable ML workflows that usually consist of several components. The typical life of a component is:\n",
"\n",
"- Write the yaml specification of the component, or create it programmatically using `ComponentMethod`.\n",
"- Optionally, register the component with a name and version in your workspace, to make it reusable and shareable.\n",
"- Load that component from the pipeline code.\n",
"- Implement the pipeline using the component's inputs, outputs and parameters.\n",
"- Submit the pipeline.\n",
"\n",
"There are two ways to create a component, programmatic and yaml definition. The next two sections walk you through creating a component both ways. You can either create the two components trying both options or pick your preferred method.\n",
"\n",
"> [!NOTE]\n",
"> In this tutorial for simplicity we are using the same compute for all components. However, you can set different computes for each component, for example by adding a line like `train_step.compute = \"cpu-cluster\"`. To view an example of building a pipeline with different computes for each component, see the [Basic pipeline job section in the cifar-10 pipeline tutorial](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/pipelines/2b_train_cifar_10_with_pytorch/train_cifar_10_with_pytorch.ipynb).\n",
"\n",
"### Create component 1: data prep (using programmatic definition)\n",
"\n",
"Let's start by creating the first component. This component handles the preprocessing of the data. The preprocessing task is performed in the *data_prep.py* Python file.\n",
"\n",
"First create a source folder for the data_prep component:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "data_prep_src_dir"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"data_prep_src_dir = \"./components/data_prep\"\n",
"os.makedirs(data_prep_src_dir, exist_ok=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This script performs the simple task of splitting the data into train and test datasets. Azure Machine Learning mounts datasets as folders to the computes, therefore, we created an auxiliary `select_first_file` function to access the data file inside the mounted input folder. \n",
"\n",
"[MLFlow](https://learn.microsoft.com/articles/machine-learning/concept-mlflow) is used to log the parameters and metrics during our pipeline run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "def-main"
},
"outputs": [],
"source": [
"%%writefile {data_prep_src_dir}/data_prep.py\n",
"import os\n",
"import argparse\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"import logging\n",
"import mlflow\n",
"\n",
"\n",
"def main():\n",
" \"\"\"Main function of the script.\"\"\"\n",
"\n",
" # input and output arguments\n",
" parser = argparse.ArgumentParser()\n",
" parser.add_argument(\"--data\", type=str, help=\"path to input data\")\n",
" parser.add_argument(\"--test_train_ratio\", type=float, required=False, default=0.25)\n",
" parser.add_argument(\"--train_data\", type=str, help=\"path to train data\")\n",
" parser.add_argument(\"--test_data\", type=str, help=\"path to test data\")\n",
" args = parser.parse_args()\n",
"\n",
" # Start Logging\n",
" mlflow.start_run()\n",
"\n",
" print(\" \".join(f\"{k}={v}\" for k, v in vars(args).items()))\n",
"\n",
" print(\"input data:\", args.data)\n",
"\n",
" credit_df = pd.read_csv(args.data, header=1, index_col=0)\n",
"\n",
" mlflow.log_metric(\"num_samples\", credit_df.shape[0])\n",
" mlflow.log_metric(\"num_features\", credit_df.shape[1] - 1)\n",
"\n",
" credit_train_df, credit_test_df = train_test_split(\n",
" credit_df,\n",
" test_size=args.test_train_ratio,\n",
" )\n",
"\n",
" # output paths are mounted as folder, therefore, we are adding a filename to the path\n",
" credit_train_df.to_csv(os.path.join(args.train_data, \"data.csv\"), index=False)\n",
"\n",
" credit_test_df.to_csv(os.path.join(args.test_data, \"data.csv\"), index=False)\n",
"\n",
" # Stop Logging\n",
" mlflow.end_run()\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" main()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have a script that can perform the desired task, create an Azure Machine Learning Component from it.\n",
"\n",
"Use the general purpose `CommandComponent` that can run command line actions. This command line action can directly call system commands or run a script. The inputs/outputs are specified on the command line via the `${{ ... }}` notation.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"name": "data_prep_component"
},
"outputs": [],
"source": [
"from azure.ai.ml import command\n",
"from azure.ai.ml import Input, Output\n",
"\n",
"data_prep_component = command(\n",
" name=\"data_prep_credit_defaults\",\n",
" display_name=\"Data preparation for training\",\n",
" description=\"reads a .xl input, split the input to train and test\",\n",
" inputs={\n",
" \"data\": Input(type=\"uri_folder\"),\n",
" \"test_train_ratio\": Input(type=\"number\"),\n",
" },\n",
" outputs=dict(\n",
" train_data=Output(type=\"uri_folder\", mode=\"rw_mount\"),\n",
" test_data=Output(type=\"uri_folder\", mode=\"rw_mount\"),\n",
" ),\n",
" # The source folder of the component\n",
" code=data_prep_src_dir,\n",
" command=\"\"\"python data_prep.py \\\n",
" --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} \\\n",
" --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}} \\\n",
" \"\"\",\n",
" environment=f\"{pipeline_job_env.name}:{pipeline_job_env.version}\",\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"Optionally, register the component in the workspace for future reuse.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Now we register the component to the workspace\n",
"data_prep_component = ml_client.create_or_update(data_prep_component.component)\n",
"\n",
"# Create (register) the component in your workspace\n",
"print(\n",
" f\"Component {data_prep_component.name} with Version {data_prep_component.version} is registered\"\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create component 2: training (using yaml definition)\n",
"\n",
"The second component that you create consumes the training and test data, train a tree based model and return the output model. Use Azure Machine Learning logging capabilities to record and visualize the learning progress.\n",
"\n",
"You used the `CommandComponent` class to create your first component. This time you use the yaml definition to define the second component. Each method has its own advantages. A yaml definition can actually be checked-in along the code, and would provide a readable history tracking. The programmatic method using `CommandComponent` can be easier with built-in class documentation and code completion.\n",
"\n",
"Create the directory for this component:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "train_src_dir"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"train_src_dir = \"./components/train\"\n",
"os.makedirs(train_src_dir, exist_ok=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the training script in the directory:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "train.py"
},
"outputs": [],
"source": [
"%%writefile {train_src_dir}/train.py\n",
"import argparse\n",
"from sklearn.ensemble import GradientBoostingClassifier\n",
"from sklearn.metrics import classification_report\n",
"import os\n",
"import pandas as pd\n",
"import mlflow\n",
"\n",
"\n",
"def select_first_file(path):\n",
" \"\"\"Selects first file in folder, use under assumption there is only one file in folder\n",
" Args:\n",
" path (str): path to directory or file to choose\n",
" Returns:\n",
" str: full path of selected file\n",
" \"\"\"\n",
" files = os.listdir(path)\n",
" return os.path.join(path, files[0])\n",
"\n",
"\n",
"# Start Logging\n",
"mlflow.start_run()\n",
"\n",
"# enable autologging\n",
"mlflow.sklearn.autolog()\n",
"\n",
"os.makedirs(\"./outputs\", exist_ok=True)\n",
"\n",
"\n",
"def main():\n",
" \"\"\"Main function of the script.\"\"\"\n",
"\n",
" # input and output arguments\n",
" parser = argparse.ArgumentParser()\n",
" parser.add_argument(\"--train_data\", type=str, help=\"path to train data\")\n",
" parser.add_argument(\"--test_data\", type=str, help=\"path to test data\")\n",
" parser.add_argument(\"--n_estimators\", required=False, default=100, type=int)\n",
" parser.add_argument(\"--learning_rate\", required=False, default=0.1, type=float)\n",
" parser.add_argument(\"--registered_model_name\", type=str, help=\"model name\")\n",
" parser.add_argument(\"--model\", type=str, help=\"path to model file\")\n",
" args = parser.parse_args()\n",
"\n",
" # paths are mounted as folder, therefore, we are selecting the file from folder\n",
" train_df = pd.read_csv(select_first_file(args.train_data))\n",
"\n",
" # Extracting the label column\n",
" y_train = train_df.pop(\"default payment next month\")\n",
"\n",
" # convert the dataframe values to array\n",
" X_train = train_df.values\n",
"\n",
" # paths are mounted as folder, therefore, we are selecting the file from folder\n",
" test_df = pd.read_csv(select_first_file(args.test_data))\n",
"\n",
" # Extracting the label column\n",
" y_test = test_df.pop(\"default payment next month\")\n",
"\n",
" # convert the dataframe values to array\n",
" X_test = test_df.values\n",
"\n",
" print(f\"Training with data of shape {X_train.shape}\")\n",
"\n",
" clf = GradientBoostingClassifier(\n",
" n_estimators=args.n_estimators, learning_rate=args.learning_rate\n",
" )\n",
" clf.fit(X_train, y_train)\n",
"\n",
" y_pred = clf.predict(X_test)\n",
"\n",
" print(classification_report(y_test, y_pred))\n",
"\n",
" # Registering the model to the workspace\n",
" print(\"Registering the model via MLFlow\")\n",
" mlflow.sklearn.log_model(\n",
" sk_model=clf,\n",
" registered_model_name=args.registered_model_name,\n",
" artifact_path=args.registered_model_name,\n",
" )\n",
"\n",
" # Saving the model to a file\n",
" mlflow.sklearn.save_model(\n",
" sk_model=clf,\n",
" path=os.path.join(args.model, \"trained_model\"),\n",
" )\n",
"\n",
" # Stop Logging\n",
" mlflow.end_run()\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" main()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see in this training script, once the model is trained, the model file is saved and registered to the workspace. Now you can use the registered model in inferencing endpoints.\n",
"\n",
"For the environment of this step, you use one of the built-in (curated) Azure Machine Learning environments. The tag `azureml`, tells the system to use look for the name in curated environments.\n",
"First, create the *yaml* file describing the component:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "train.yml"
},
"outputs": [],
"source": [
"%%writefile {train_src_dir}/train.yml\n",
"# <component>\n",
"name: train_credit_defaults_model\n",
"display_name: Train Credit Defaults Model\n",
"# version: 1 # Not specifying a version will automatically update the version\n",
"type: command\n",
"inputs:\n",
" train_data: \n",
" type: uri_folder\n",
" test_data: \n",
" type: uri_folder\n",
" learning_rate:\n",
" type: number \n",
" registered_model_name:\n",
" type: string\n",
"outputs:\n",
" model:\n",
" type: uri_folder\n",
"code: .\n",
"environment:\n",
" # for this step, we'll use an AzureML curate environment\n",
" azureml://registries/azureml/environments/sklearn-1.5/labels/latest\n",
"command: >-\n",
" python train.py \n",
" --train_data ${{inputs.train_data}} \n",
" --test_data ${{inputs.test_data}} \n",
" --learning_rate ${{inputs.learning_rate}}\n",
" --registered_model_name ${{inputs.registered_model_name}} \n",
" --model ${{outputs.model}}\n",
"# </component>\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create and register the component. Registering it allows you to re-use it in other pipelines. Also, anyone else with access to your workspace can use the registered component."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "train_component"
},
"outputs": [],
"source": [
"# importing the Component Package\n",
"from azure.ai.ml import load_component\n",
"\n",
"# Loading the component from the yml file\n",
"train_component = load_component(source=os.path.join(train_src_dir, \"train.yml\"))\n",
"\n",
"# Now we register the component to the workspace\n",
"train_component = ml_client.create_or_update(train_component)\n",
"\n",
"# Create (register) the component in your workspace\n",
"print(\n",
" f\"Component {train_component.name} with Version {train_component.version} is registered\"\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the pipeline from components\n",
"\n",
"Now that both your components are defined and registered, you can start implementing the pipeline.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
}
},
"source": [
"The Python functions returned by `load_component()` work as any regular Python function that we use within a pipeline to call each step.\n",
"\n",
"To code the pipeline, you use a specific `@dsl.pipeline` decorator that identifies the Azure Machine Learning pipelines. In the decorator, we can specify the pipeline description and default resources like compute and storage. Like a Python function, pipelines can have inputs. You can then create multiple instances of a single pipeline with different inputs.\n",
"\n",
"Here, we used *input data*, *split ratio* and *registered model name* as input variables. We then call the components and connect them via their inputs/outputs identifiers. The outputs of each step can be accessed via the `.outputs` property."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "pipeline"
},
"outputs": [],
"source": [
"# the dsl decorator tells the sdk that we are defining an Azure Machine Learning pipeline\n",
"from azure.ai.ml import dsl, Input, Output\n",
"\n",
"\n",
"@dsl.pipeline(\n",
" compute=\"serverless\", # \"serverless\" value runs pipeline on serverless compute\n",
" description=\"E2E data_perp-train pipeline\",\n",
")\n",
"def credit_defaults_pipeline(\n",
" pipeline_job_data_input,\n",
" pipeline_job_test_train_ratio,\n",
" pipeline_job_learning_rate,\n",
" pipeline_job_registered_model_name,\n",
"):\n",
" # using data_prep_function like a python call with its own inputs\n",
" data_prep_job = data_prep_component(\n",
" data=pipeline_job_data_input,\n",
" test_train_ratio=pipeline_job_test_train_ratio,\n",
" )\n",
"\n",
" # using train_func like a python call with its own inputs\n",
" train_job = train_component(\n",
" train_data=data_prep_job.outputs.train_data, # note: using outputs from previous step\n",
" test_data=data_prep_job.outputs.test_data, # note: using outputs from previous step\n",
" learning_rate=pipeline_job_learning_rate, # note: using a pipeline input as parameter\n",
" registered_model_name=pipeline_job_registered_model_name,\n",
" )\n",
"\n",
" # a pipeline returns a dictionary of outputs\n",
" # keys will code for the pipeline output identifier\n",
" return {\n",
" \"pipeline_job_train_data\": data_prep_job.outputs.train_data,\n",
" \"pipeline_job_test_data\": data_prep_job.outputs.test_data,\n",
" }"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now use your pipeline definition to instantiate a pipeline with your dataset, split rate of choice and the name you picked for your model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"attributes": {
"classes": [
"Python"
],
"id": ""
},
"name": "registered_model_name"
},
"outputs": [],
"source": [
"registered_model_name = \"credit_defaults_model\"\n",
"\n",
"# Let's instantiate the pipeline with the parameters of our choice\n",
"pipeline = credit_defaults_pipeline(\n",
" pipeline_job_data_input=Input(type=\"uri_file\", path=credit_data.path),\n",
" pipeline_job_test_train_ratio=0.25,\n",
" pipeline_job_learning_rate=0.05,\n",
" pipeline_job_registered_model_name=registered_model_name,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit the job \n",
"\n",
"It's now time to submit the job to run in Azure Machine Learning. This time you use `create_or_update` on `ml_client.jobs`.\n",
"\n",
"Here you also pass an experiment name. An experiment is a container for all the iterations one does on a certain project. All the jobs submitted under the same experiment name would be listed next to each other in Azure Machine Learning studio.\n",
"\n",
"Once completed, the pipeline registers a model in your workspace as a result of training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"name": "returned_job"
},
"outputs": [],
"source": [
"# submit the pipeline job\n",
"pipeline_job = ml_client.jobs.create_or_update(\n",
" pipeline,\n",
" # Project's name\n",
" experiment_name=\"e2e_registered_components\",\n",
")\n",
"ml_client.jobs.stream(pipeline_job.name)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can track the progress of your pipeline, by using the link generated in the previous cell. When you first select this link, you may see that the pipeline is still running. Once it's complete, you can examine each component's results.\n",
"\n",
"Double-click the **Train Credit Defaults Model** component. \n",
"\n",
"There are two important results you'll want to see about training:\n",
"\n",
"* View your logs:\n",
" 1. Select the **Outputs+logs** tab.\n",
" 1. Open the folders to `user_logs` > `std_log.txt`\n",
" This section shows the script run stdout.\n",
" \n",
"\n",
"* View your metrics: Select the **Metrics** tab. This section shows different logged metrics. In this example. mlflow `autologging`, has automatically logged the training metrics.\n",
" \n",
" "
]
},
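{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional programmatic check, the next cell is a minimal sketch that lists the versions of the model the training step registered. It assumes the pipeline job above completed successfully and that `registered_model_name` is still defined from the earlier cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: list the versions of the model registered by the pipeline.\n",
"# Assumes the pipeline job above completed successfully.\n",
"for m in ml_client.models.list(name=registered_model_name):\n",
"    print(f\"Model: {m.name}, version: {m.version}\")"
]
},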
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy the model as an online endpoint\n",
"To learn how to deploy your model to an online endpoint, see [Deploy a model as an online endpoint tutorial](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-deploy-model).\n"
]
},
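{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like a preview of what that looks like with the SDK v2, the next cell is a minimal sketch, not a substitute for the linked tutorial. It assumes the hypothetical endpoint name `credit-default-endpt-demo` is available in your region and that the pipeline above registered at least one model version. Note that a managed online deployment provisions billable compute and can take several minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment\n",
"\n",
"# Sketch only: endpoint names must be unique within an Azure region, so adjust as needed.\n",
"endpoint = ManagedOnlineEndpoint(name=\"credit-default-endpt-demo\", auth_mode=\"key\")\n",
"ml_client.online_endpoints.begin_create_or_update(endpoint).result()\n",
"\n",
"# Use the latest version of the model registered by the pipeline (an MLflow model,\n",
"# so no scoring script or environment is needed). Assumes numeric version strings.\n",
"latest_version = max(int(m.version) for m in ml_client.models.list(name=registered_model_name))\n",
"model = ml_client.models.get(name=registered_model_name, version=str(latest_version))\n",
"\n",
"deployment = ManagedOnlineDeployment(\n",
"    name=\"blue\",\n",
"    endpoint_name=endpoint.name,\n",
"    model=model,\n",
"    instance_type=\"Standard_DS3_v2\",\n",
"    instance_count=1,\n",
")\n",
"ml_client.online_deployments.begin_create_or_update(deployment).result()"
]
},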
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next Steps\n",
"\n",
"Learn how to [Schedule machine learning pipeline jobs](https://learn.microsoft.com/azure/machine-learning/how-to-schedule-pipeline-job)"
]
}
],
"metadata": {
"description": {
"description": "Create production ML pipelines with Python SDK v2 in a Jupyter notebook"
},
"kernel_info": {
"name": "python310-sdkv2"
},
"kernelspec": {
"display_name": "Python 3.10 - SDK v2",
"language": "python",
"name": "python310-sdkv2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}