{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"# Tutorial #2: Experiment and train models using features"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"In this tutorial series you will experience how features seamlessly integrates all the phases of ML lifecycle: Prototyping features, training and operationalizing.\n",
"\n",
"In part 1 of the tutorial you learnt how to create a feature set spec with custom transformations, enable materialization and perform backfill. In this tutorial you will will learn how to experiment with features to improve model performance. You will see how feature store increasing agility in the experimentation and training flows. \n",
"\n",
"You will perform the following:\n",
"- Prototype a create new `acccounts` feature set spec using existing precomputed values as features, unlike part 1 of the tutorial where we created feature set that had custom transformations. You will then Register the local feature set spec as a feature set in the feature store\n",
"- Select features for the model: You will select features from the `transactions` and `accounts` feature sets and save them as a feature-retrieval spec\n",
"- Run training pipeline that uses the Feature retrieval spec to train a new model. This pipeline will use the built in feature-retrieval component to generate the training data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"# Prerequisites\n",
"1. Please ensure you have executed part 1 of the tutorial"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"# Setup"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Configure Azure ML spark notebook\n",
"\n",
"1. In the \"Compute\" dropdown in the top nav, select \"Serverless Spark Compute\". \n",
"1. Click on \"configure session\" in top status bar -> click on \"Python packages\" -> click on \"upload conda file\" -> select the file azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml from your local machine; Also increase the session time out (idle time) if you want to avoid running the prerequisites frequently\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Start spark session"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550147448
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "start-spark-session",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# run this cell to start the spark session (any code block will start the session ). This can take around 10 mins.\n",
"print(\"start spark session\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Setup root directory for the samples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550160587
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "root-dir",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"import os\n",
"\n",
"# please update the dir to ./Users/<your_user_alias> (or any custom directory you uploaded the samples to).\n",
"# You can find the name from the directory structure in the left nav\n",
"root_dir = \"./Users/<your_user_alias>/featurestore_sample\"\n",
"\n",
"if os.path.isdir(root_dir):\n",
" print(\"The folder exists.\")\n",
"else:\n",
" print(\"The folder does not exist. Please create or fix the path\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Initialize the project workspace CRUD client\n",
"This is the current workspace where you will be running the tutorial notebook from"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550177764
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "init-ws-crud-client",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"### Initialize the MLClient of this project workspace\n",
"import os\n",
"from azure.ai.ml import MLClient\n",
"from azure.ai.ml.identity import AzureMLOnBehalfOfCredential\n",
"\n",
"project_ws_sub_id = os.environ[\"AZUREML_ARM_SUBSCRIPTION\"]\n",
"project_ws_rg = os.environ[\"AZUREML_ARM_RESOURCEGROUP\"]\n",
"project_ws_name = os.environ[\"AZUREML_ARM_WORKSPACE_NAME\"]\n",
"\n",
"# connect to the project workspace\n",
"ws_client = MLClient(\n",
" AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Initialize the feature store CRUD client\n",
"Ensure you update the `featurestore_name` to reflect what you created in part 1 of this tutorial"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550201549
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "init-fs-crud-client",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"from azure.ai.ml import MLClient\n",
"from azure.ai.ml.identity import AzureMLOnBehalfOfCredential\n",
"\n",
"# feature store\n",
"featurestore_name = (\n",
" \"<FEATURESTORE_NAME>\" # use the same name from part #1 of the tutorial\n",
")\n",
"featurestore_subscription_id = os.environ[\"AZUREML_ARM_SUBSCRIPTION\"]\n",
"featurestore_resource_group_name = os.environ[\"AZUREML_ARM_RESOURCEGROUP\"]\n",
"\n",
"# feature store ml client\n",
"fs_client = MLClient(\n",
" AzureMLOnBehalfOfCredential(),\n",
" featurestore_subscription_id,\n",
" featurestore_resource_group_name,\n",
" featurestore_name,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Initialize the feature store core sdk client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550214535
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "init-fs-core-sdk",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# feature store client\n",
"from azureml.featurestore import FeatureStoreClient\n",
"from azure.ai.ml.identity import AzureMLOnBehalfOfCredential\n",
"\n",
"featurestore = FeatureStoreClient(\n",
" credential=AzureMLOnBehalfOfCredential(),\n",
" subscription_id=featurestore_subscription_id,\n",
" resource_group_name=featurestore_resource_group_name,\n",
" name=featurestore_name,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Create compute cluster with name `cpu-cluster-fs` in the project workspace\n",
"This will be needed when we run training/batch inference jobs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550257007
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "create-compute-cluster",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"from azure.ai.ml.entities import AmlCompute\n",
"\n",
"cluster_basic = AmlCompute(\n",
" name=\"cpu-cluster-fs\",\n",
" type=\"amlcompute\",\n",
" size=\"STANDARD_F4S_V2\", # you can replace it with other supported VM SKUs\n",
" location=ws_client.workspaces.get(ws_client.workspace_name).location,\n",
" min_instances=0,\n",
" max_instances=1,\n",
" idle_time_before_scale_down=360,\n",
")\n",
"ws_client.begin_create_or_update(cluster_basic).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Step 1: Create accounts featureset locally from precomputed data\n",
"In tutorial part 1, we created a transactions featureset that had custom transformations. Now we will create an accounts featureset that will use precomputed values. \n",
"\n",
"For onboarding precomputed features, you can create a featureset spec without writing any transformation code. Featureset spec is a specification to develop and test a featureset in a fully local/dev environment without connecting to any featurestore. In this step you will create the feature set spec locally and sample the values from it. If you want to get managed featurestore capabilities, you need to register the featureset spec with a feature store using a feature asset definition (a future step in this tutorial)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 1a: Explore the source data for accounts\n",
"\n",
"##### Note\n",
" Note that the sample data used in this notebook is hosted in a public accessible blob container. It can only be read in Spark via `wasbs` driver. When you create feature sets using your own source data, please host them in adls gen2 account and use `abfss` driver in the data path. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550281756
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "explore-accts-fset-src-data",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"accounts_data_path = \"wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet\"\n",
"accounts_df = spark.read.parquet(accounts_data_path)\n",
"\n",
"display(accounts_df.head(5))"
]
},
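{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, inspect the schema of the source data. Step 1b infers the feature schema from this source (`infer_schema=True`), so it can be useful to confirm the column names and datatypes up front. This small optional check uses the standard Spark `printSchema()` API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: confirm the column names and datatypes before creating the feature set spec\n",
"accounts_df.printSchema()"
]
},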
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 1b: Create `accounts` feature set spec in local from these precomputed features\n",
"Note that we do not need any transformation code here since we are referencing precomputed features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550310950
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "create-accts-fset-spec",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"from azureml.featurestore import create_feature_set_spec, FeatureSetSpec\n",
"from azureml.featurestore.contracts import (\n",
" DateTimeOffset,\n",
" Column,\n",
" ColumnType,\n",
" SourceType,\n",
" TimestampColumn,\n",
")\n",
"from azureml.featurestore.feature_source import ParquetFeatureSource\n",
"\n",
"\n",
"accounts_featureset_spec = create_feature_set_spec(\n",
" source=ParquetFeatureSource(\n",
" path=\"wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet\",\n",
" timestamp_column=TimestampColumn(name=\"timestamp\"),\n",
" ),\n",
" index_columns=[Column(name=\"accountID\", type=ColumnType.string)],\n",
" # account profiles in the source are updated once a year. set temporal_join_lookback to 365 days\n",
" temporal_join_lookback=DateTimeOffset(days=365, hours=0, minutes=0),\n",
" infer_schema=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"active-ipynb"
]
},
"outputs": [],
"source": [
"# Generate a spark dataframe from the feature set specification\n",
"accounts_fset_df = accounts_featureset_spec.to_spark_dataframe()\n",
"# display few records\n",
"display(accounts_fset_df.head(5))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 1c: Export as feature set spec\n",
"In order to register the feature set spec with the feature store, it needs to be saved in a specific format. \n",
"Action: After running the below cell, please inspect the generated `accounts` FeatureSetSpec: Open this file from the file tree to see the spec: `featurestore/featuresets/accounts/spec/FeatureSetSpec.yaml`\n",
"\n",
"Spec contains these important elements:\n",
"\n",
"1. `source`: reference to a storage. In this case a parquet file in a blob storage.\n",
"1. `features`: list of features and their datatypes. If you provide transformation code (see Day 2 section), the code has to return a dataframe that maps to the features and datatypes. In case where you do not provide transformation code (in this case of `accounts` because it is precomputed), the system will build the query to to map these to the source \n",
"1. `index_columns`: the join keys required to access values from the feature set\n",
"\n",
"Learn more about it in the [top level feature store entities document](fs-concepts-todo) and the [feature set spec yaml reference](reference-yaml-featureset-spec.md).\n",
"\n",
"The additional benefit of persisting it is that it can be source controlled."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550317763
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "dump-accts-fset-spec",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"import os\n",
"\n",
"# create a new folder to dump the feature set spec\n",
"accounts_featureset_spec_folder = root_dir + \"/featurestore/featuresets/accounts/spec\"\n",
"\n",
"# check if the folder exists, create one if not\n",
"if not os.path.exists(accounts_featureset_spec_folder):\n",
" os.makedirs(accounts_featureset_spec_folder)\n",
"\n",
"accounts_featureset_spec.dump(accounts_featureset_spec_folder, overwrite=True)"
]
},
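{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check, print the dumped spec file to verify the `source`, `features`, and `index_columns` elements described above. This sketch assumes the dump wrote a file named `FeatureSetSpec.yaml` into the spec folder, matching the path referenced earlier in this step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: print the dumped spec to verify source, features and index_columns\n",
"# (assumes the file name FeatureSetSpec.yaml referenced earlier in this step)\n",
"spec_file = os.path.join(accounts_featureset_spec_folder, \"FeatureSetSpec.yaml\")\n",
"with open(spec_file) as f:\n",
"    print(f.read())"
]
},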
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Step 2: Experiment with unregistered features locally and register with feature store when ready\n",
"When you are developing features, you might want to test/validate locally before registering with the feature store or running training pipelines in the cloud. In this step you will generate training data for the ML model from combination of features from a local unregistered feature set (`accounts`) and feature set registered in the feature store (`transactions`)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 2a: Select features for model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550328371
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "select-unreg-features-for-model",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# get the registered transactions feature set, version 1\n",
"transactions_featureset = featurestore.feature_sets.get(\"transactions\", \"1\")\n",
"# Notice that account feature set spec is in your local dev environment (this notebook): not registered with feature store yet\n",
"features = [\n",
" accounts_featureset_spec.get_feature(\"accountAge\"),\n",
" accounts_featureset_spec.get_feature(\"numPaymentRejects1dPerUser\"),\n",
" transactions_featureset.get_feature(\"transaction_amount_7d_sum\"),\n",
" transactions_featureset.get_feature(\"transaction_amount_3d_sum\"),\n",
" transactions_featureset.get_feature(\"transaction_amount_7d_avg\"),\n",
"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 2b: Generate training data locally\n",
"In this step we generate training data for illustrative purpose. You can optionally train models locally with this. In the upcoming steps in this tutorial, you will train a model in the cloud."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550539714
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "load-obs-data",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"from azureml.featurestore import get_offline_features\n",
"\n",
"# Load the observation data. To understand observatio ndata, refer to part 1 of this tutorial\n",
"observation_data_path = \"wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet\"\n",
"observation_data_df = spark.read.parquet(observation_data_path)\n",
"obs_data_timestamp_column = \"timestamp\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"name": "gen-training-data-locally",
"tags": [
"active-ipynb"
]
},
"outputs": [],
"source": [
"# generate training dataframe by using feature data and observation data\n",
"training_df = get_offline_features(\n",
" features=features,\n",
" observation_data=observation_data_df,\n",
" timestamp_column=obs_data_timestamp_column,\n",
")\n",
"\n",
"# Ignore the message that says feature set is not materialized (materialization is optional). We will enable materialization in the next part of the tutorial.\n",
"display(training_df)\n",
"# Note: display(training_df.head(5)) displays the timestamp column in a different format. You can can call training_df.show() to see correctly formatted value"
]
},
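{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal, optional local-training sketch using the dataframe generated above. It assumes the observation data contains a binary label column (named `is_fraud` here purely for illustration; adjust to your data) and that scikit-learn is available in the session environment. The cloud training flow in the later steps does not depend on this cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional local-training sketch. Assumptions: the observation data has a\n",
"# binary label column (called \"is_fraud\" here for illustration) and\n",
"# scikit-learn is available in the session environment.\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"feature_columns = [\n",
"    \"accountAge\",\n",
"    \"numPaymentRejects1dPerUser\",\n",
"    \"transaction_amount_7d_sum\",\n",
"    \"transaction_amount_3d_sum\",\n",
"    \"transaction_amount_7d_avg\",\n",
"]\n",
"label_column = \"is_fraud\"  # assumed label name; adjust to your observation data\n",
"\n",
"# convert the spark dataframe to pandas and drop rows with missing values\n",
"local_df = training_df.select(feature_columns + [label_column]).dropna().toPandas()\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    local_df[feature_columns], local_df[label_column], test_size=0.2, random_state=42\n",
")\n",
"clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)\n",
"print(\"holdout accuracy:\", clf.score(X_test, y_test))"
]
},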
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 2c: Register the `accounts` featureset with the featurestore\n",
"Once you have experimented with different feature definitions locally and sanity tested it, you can register it with the feature store.\n",
"For this you will register a featureset asset definition with the feature store.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550813095
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "reg-accts-fset",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification\n",
"\n",
"accounts_fset_config = FeatureSet(\n",
" name=\"accounts\",\n",
" version=\"1\",\n",
" description=\"accounts featureset\",\n",
" entities=[f\"azureml:account:1\"],\n",
" stage=\"Development\",\n",
" specification=FeatureSetSpecification(path=accounts_featureset_spec_folder),\n",
" tags={\"data_type\": \"nonPII\"},\n",
")\n",
"\n",
"poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)\n",
"print(poller.result())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 2d: Get registered featureset and sanity test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550826621
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "sample-accts-fset-data",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# look up the featureset by providing name and version\n",
"accounts_featureset = featurestore.feature_sets.get(\"accounts\", \"1\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"active-ipynb"
]
},
"outputs": [],
"source": [
"# get access to the feature data\n",
"accounts_feature_df = accounts_featureset.to_spark_dataframe()\n",
"display(accounts_feature_df.head(5))\n",
"# Note: Please ignore this warning: Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Step 3: Run training experiment\n",
"In this step you will select a list of features, run a training pipeline, and register the model. You can repeat this step till you are happy with the model performance."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### (Optional) Step 3a: Discover features from Feature Store UI\n",
"You have already done this in part 1 of the tutorial after registering the `transactions` feature set. Since you also have `accounts` featureset, you can browse the available features:\n",
"* Goto the [Azure ML global landing page](https://ml.azure.com/home?flight=FeatureStores).\n",
"* Click on `Feature stores` in the left nav\n",
"* You will see the list of feature stores that you have access to. Click on the feature store that you created above.\n",
"\n",
"You can see the feature sets and entity that you created. Click on the feature sets to browse the feature definitions. You can also search for feature sets across feature stores by using the global search box."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### (Optional) Step 3b: Discover features from SDK"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550851764
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "discover-features-from-sdk",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# List available feature sets\n",
"all_featuresets = featurestore.feature_sets.list()\n",
"for fs in all_featuresets:\n",
" print(fs)\n",
"\n",
"# List of versions for transactions feature set\n",
"all_transactions_featureset_versions = featurestore.feature_sets.list(\n",
" name=\"transactions\"\n",
")\n",
"for fs in all_transactions_featureset_versions:\n",
" print(fs)\n",
"\n",
"# See properties of the transactions featureset including list of features\n",
"featurestore.feature_sets.get(name=\"transactions\", version=\"1\").features"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 3c: Select features for the model and export it as a feature-retrieval spec\n",
"In the previous steps, you selected features from a combination unregistered and registered feature sets for local experimentation and testing. Now you are ready to experiment in the cloud. Saving the selected features as a feature-retrieval spec and using it in the mlops/cicd flow for training/inference increases your agility in shipping models."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"Select features for the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550863111
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "select-reg-features",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# you can select features in pythonic way\n",
"features = [\n",
" accounts_featureset.get_feature(\"accountAge\"),\n",
" transactions_featureset.get_feature(\"transaction_amount_7d_sum\"),\n",
" transactions_featureset.get_feature(\"transaction_amount_3d_sum\"),\n",
"]\n",
"\n",
"# you can also specify features in string form: featurestore:featureset:version:feature\n",
"more_features = [\n",
" f\"accounts:1:numPaymentRejects1dPerUser\",\n",
" f\"transactions:1:transaction_amount_7d_avg\",\n",
"]\n",
"\n",
"more_features = featurestore.resolve_feature_uri(more_features)\n",
"\n",
"features.extend(more_features)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"Export selected features as a feature-retrieval spec\n",
"\n",
"#### Note\n",
"Feature retrieval spec is a portable definition of list of features associated with a model. This can help streamline ML model development and operationalizing.This will be an input to the training pipeline (used to generate the training data), then will be packaged along with the model, and will be used during inference to lookup the features. It will be a glue that integrates all phases of the ML lifecycle. Changes to training/inference pipeline can be kept minimal as you experiment and deploy. \n",
"\n",
"Using feature retrieval spec and the built-in feature retrieval component is optional: you can directly use `get_offline_features()` api as shown above.\n",
"\n",
"Note that the name of the spec should be `feature_retrieval_spec.yaml` when it is packaged with the model for the system to recognize it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1696550891771
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "export-as-frspec",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# Create feature retrieval spec\n",
"feature_retrieval_spec_folder = root_dir + \"/project/fraud_model/feature_retrieval_spec\"\n",
"\n",
"# check if the folder exists, create one if not\n",
"if not os.path.exists(feature_retrieval_spec_folder):\n",
" os.makedirs(feature_retrieval_spec_folder)\n",
"\n",
"featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)"
]
},
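{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick optional check, print the generated spec. The note above states that the file should be named `feature_retrieval_spec.yaml` when packaged with the model; this sketch assumes `generate_feature_retrieval_spec()` writes a file with that same name into the target folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: inspect the generated feature retrieval spec\n",
"# (assumes the generated file uses the feature_retrieval_spec.yaml name noted above)\n",
"retrieval_spec_file = os.path.join(\n",
"    feature_retrieval_spec_folder, \"feature_retrieval_spec.yaml\"\n",
")\n",
"with open(retrieval_spec_file) as f:\n",
"    print(f.read())"
]
},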
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Step 4: Train in the cloud using pipelines and register model if satisfactory\n",
"In this step you will manually trigger the training pipeline. In a production scenario, this could be triggered by a ci/cd pipeline based on changes to the feature-retrieval spec in the source repository."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 4a: Run the training pipeline\n",
"The training pipeline has the following steps:\n",
"\n",
"1. Feature retrieval step: This is a built-in component takes as input the feature retrieval spec, the observation data and timestamp column name. It then generates the training data as output. It runs this as a managed spark job.\n",
"1. Training step: This step trains the model based on the training data and generates a model (not registered yet)\n",
"1. Evaluation step: This step validates whether model performance/quailty is within threshold (in our case it is a placeholder/dummy step for illustration purpose)\n",
"1. Register model step: This step registers the model\n",
"\n",
"Note: In part 2 of this tutorial you ran a backfill job to materialize data for `transactions` feature set. Feature retrieval step will read feature values from offline store for this feature set. The behavior will same even if you use `get_offline_features()` api."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1683429020656
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "run-training-pipeline",
"nteract": {
"transient": {
"deleting": false
}
},
"tags": [
"active-ipynb"
]
},
"outputs": [],
"source": [
"from azure.ai.ml import load_job # will be used later\n",
"\n",
"training_pipeline_path = (\n",
" root_dir + \"/project/fraud_model/pipelines/training_pipeline.yaml\"\n",
")\n",
"training_pipeline_definition = load_job(source=training_pipeline_path)\n",
"training_pipeline_job = ws_client.jobs.create_or_update(training_pipeline_definition)\n",
"ws_client.jobs.stream(training_pipeline_job.name)\n",
"# Note: First time it runs, each step in pipeline can take ~ 15 mins. However subsequent runs can be faster (assuming spark pool is warm - default timeout is 30 mins)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Inspect the training pipeline and the model\n",
"Open the above pipeline run \"web view\" in new window to see the steps in the pipeline.\n"
]
},
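{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The snippet below prints a direct link to the run. `studio_url` is a property exposed on azure-ai-ml job objects; if it is empty for your job, use the link printed in the streamed output above instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# convenience: print a direct link to the pipeline run in the studio UI\n",
"# (fall back to the link in the streamed output above if this is empty)\n",
"print(training_pipeline_job.studio_url)"
]
},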
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 4b: Notice the feature retrieval spec in the model artifacts\n",
"1. In the left nav of the current workspace -> right click on `Models` -> Open in new tab or window\n",
"1. Click on `fraud_model`\n",
"1. Click on `Artifacts` in the top nav\n",
"\n",
"You can notice that the feature retrieval spec is packaged along with the model. The model registration step in the training pipeline has done this. You created feature retrieval spec during experimentation, now it has become part of the model definition. In the next tutorial you will see how this will be used during inferencing.\n"
]
},
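{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, verify this programmatically by downloading the registered model artifacts and listing the files. This sketch assumes the pipeline registered version `1` of `fraud_model`; adjust the version if your run produced a different one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: download the registered model artifacts and confirm that the\n",
"# feature retrieval spec is packaged with them.\n",
"# assumption: the training pipeline registered version \"1\" of fraud_model;\n",
"# adjust if your run produced a different version.\n",
"import os\n",
"\n",
"download_path = \"./fraud_model_artifacts\"\n",
"ws_client.models.download(\n",
"    name=\"fraud_model\", version=\"1\", download_path=download_path\n",
")\n",
"\n",
"for dirpath, _, filenames in os.walk(download_path):\n",
"    for filename in filenames:\n",
"        print(os.path.join(dirpath, filename))"
]
},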
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Step 5: View the feature set and model dependencies"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 5a: View the list of feature sets associated with the model\n",
"In the same models page, click on the `feature sets` tab. Here you can see both `transactions` and `accounts` featuresets that this model depends on."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Step 5b: View the list of models using the feature sets\n",
"1. Open the feature store UI (expalined in a previous step in this tutorial)\n",
"1. Click on `Feature sets` on the left nav\n",
"1. Click on any of the feature set -> click on `Models` tab\n",
"\n",
"You can see the list of models that are using the feature sets (determined from the feature retrieval spec when the model was registered)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Cleanup\n",
"\n",
"Tutorial \"5. Develop a feature set with custom source\" has instructions for deleting the resources."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## Next steps\n",
"* Enable recurrent materialization and run batch inference"
]
}
],
"metadata": {
"celltoolbar": "Edit Metadata",
"kernel_info": {
"name": "synapse_pyspark"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
},
"microsoft": {
"host": {
"AzureML": {
"notebookHasBeenCompleted": true
}
},
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}