{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with Data\n",
"\n",
"In this notebook you will learn how to use the AzureML SDK to:\n",
"\n",
"1. Read/write data in a job.\n",
"1. Create a data asset to share with others in your team.\n",
"1. Abstract schema for tabular data using `MLTable`.\n",
"\n",
"## Connect to Azure Machine Learning Workspace\n",
"\n",
"To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the default [default azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../jobs/configuration.ipynb) for more details on how to configure credentials and connect to a workspace.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import MLClient\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"# enter details of your AML workspace\n",
"subscription_id = \"<SUBSCRIPTION_ID>\"\n",
"resource_group = \"<RESOURCE_GROUP>\"\n",
"workspace = \"<AML_WORKSPACE_NAME>\"\n",
"\n",
"# get a handle to the workspace\n",
"ml_client = MLClient(\n",
" DefaultAzureCredential(), subscription_id, resource_group, workspace\n",
")"
]
},
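{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The client creation is lazy - it will not connect to the workspace until the first call that needs it. As an optional sanity check, you can fetch the workspace details (a minimal sketch, assuming the identifiers above have been filled in):\n",
"\n",
"```python\n",
"# optional: verify the handle by retrieving the workspace details\n",
"ws = ml_client.workspaces.get(workspace)\n",
"print(ws.name, ws.location, ws.resource_group)\n",
"```"
]
},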
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading/writing data in a job\n",
"\n",
"In this example we will use the titanic dataset in this repo - ([./sample_data/titanic.csv](./sample_data/titanic.csv)) and set-up a command that executes the following python code:\n",
"\n",
"```python\n",
"df = pd.read_csv(args.input_data)\n",
"print(df.head(10))\n",
"```\n",
"\n",
"Below is the code for submitting the command to the cloud - note that both the code *and* the data is automatically uploaded to the cloud. Note: The data is only re-uploaded on subsequent job submissions if data has changed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import command\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml import Input\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# === Note on path ===\n",
"# can be can be a local path or a cloud path. AzureML supports https://`, `abfss://`, `wasbs://` and `azureml://` URIs.\n",
"# Local paths are automatically uploaded to the default datastore in the cloud.\n",
"# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths\n",
"\n",
"inputs = {\n",
" \"input_data\": Input(type=AssetTypes.URI_FILE, path=\"./sample_data/titanic.csv\")\n",
"}\n",
"\n",
"job = command(\n",
" code=\"./src\", # local path where the code is stored\n",
" command=\"python read_data.py --input_data ${{inputs.input_data}}\",\n",
" inputs=inputs,\n",
" environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n",
" compute=\"cpu-cluster\",\n",
")\n",
"\n",
"# submit the command\n",
"returned_job = ml_client.jobs.create_or_update(job)\n",
"# get a URL for the status of the job\n",
"returned_job.studio_url"
]
},
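{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of what `./src/read_data.py` might look like (hypothetical; the actual script lives in the `./src` folder of this sample, and the argument name matches the command above):\n",
"\n",
"```python\n",
"# hypothetical sketch of ./src/read_data.py\n",
"import argparse\n",
"\n",
"import pandas as pd\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument(\"--input_data\", type=str, help=\"path to the input CSV file\")\n",
"args = parser.parse_args()\n",
"\n",
"df = pd.read_csv(args.input_data)\n",
"print(df.head(10))\n",
"```"
]
},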
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading *and* writing data in a job\n",
"\n",
"By design, you cannot *write* to `Inputs` only `Outputs`. The code below creates an `Output` that will mount your AzureML default datastore (Azure Blob) in Read-*Write* mode. The python code simply takes the CSV as import and exports it as a parquet file, i.e.\n",
"\n",
"```python\n",
"df = pd.read_csv(args.input_data)\n",
"output_path = os.path.join(args.output_folder, \"my_output.parquet\")\n",
"df.to_parquet(output_path)\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import command\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml import Input, Output\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"inputs = {\n",
" \"input_data\": Input(type=AssetTypes.URI_FILE, path=\"./sample_data/titanic.csv\")\n",
"}\n",
"\n",
"outputs = {\n",
" \"output_folder\": Output(\n",
" type=AssetTypes.URI_FOLDER,\n",
" path=f\"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/workspaceblobstore/paths/\",\n",
" )\n",
"}\n",
"\n",
"job = command(\n",
" code=\"./src\", # local path where the code is stored\n",
" command=\"python read_write_data.py --input_data ${{inputs.input_data}} --output_folder ${{outputs.output_folder}}\",\n",
" inputs=inputs,\n",
" outputs=outputs,\n",
" environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n",
" compute=\"cpu-cluster\",\n",
")\n",
"\n",
"# submit the command\n",
"returned_job = ml_client.create_or_update(job)\n",
"# get a URL for the status of the job\n",
"returned_job.studio_url"
]
},
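{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, a minimal sketch of what `./src/read_write_data.py` might contain (hypothetical; the real script is in `./src`):\n",
"\n",
"```python\n",
"# hypothetical sketch of ./src/read_write_data.py\n",
"import argparse\n",
"import os\n",
"\n",
"import pandas as pd\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument(\"--input_data\", type=str, help=\"path to the input CSV file\")\n",
"parser.add_argument(\"--output_folder\", type=str, help=\"mounted output folder (read-write)\")\n",
"args = parser.parse_args()\n",
"\n",
"df = pd.read_csv(args.input_data)\n",
"output_path = os.path.join(args.output_folder, \"my_output.parquet\")\n",
"df.to_parquet(output_path)\n",
"```"
]
},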
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Data Assets\n",
"\n",
"You can create a data asset in Azure Machine Learning, which has the following benefits:\n",
"\n",
"- Easy to share with other members of the team (no need to remember file locations)\n",
"- Versioning of the metadata (location, description, etc)\n",
"- Lineage tracking\n",
"\n",
"Below we show an example of versioning the sample data in this repo. The data is uploaded to cloud storage and registered as an asset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"try:\n",
" registered_data_asset = ml_client.data.get(name=\"titanic\", version=\"1\")\n",
" print(\"Found data asset. Will not create again\")\n",
"except Exception as ex:\n",
" my_data = Data(\n",
" path=\"./sample_data/titanic.csv\",\n",
" type=AssetTypes.URI_FILE,\n",
" description=\"Titanic Data\",\n",
" name=\"titanic\",\n",
" version=\"1\",\n",
" )\n",
" ml_client.data.create_or_update(my_data)\n",
" registered_data_asset = ml_client.data.get(name=\"titanic\", version=\"1\")\n",
" print(\"Created data asset\")"
]
},
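{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Because data assets are versioned, you can list every registered version of an asset by name - a minimal sketch:\n",
"\n",
"```python\n",
"# list all registered versions of the \"titanic\" data asset\n",
"for asset in ml_client.data.list(name=\"titanic\"):\n",
"    print(asset.name, asset.version)\n",
"```"
]
},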
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Authenticate with user identity\n",
"\n",
"When running a job on a compute cluster, you can also use your user identity to access data. To enable job to access data on behald of you, specify **identity=UserIdentity()** in job definition, as shown below. For more details, see [Accessing storage services](https://learn.microsoft.com/azure/machine-learning/how-to-identity-based-service-authentication)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import UserIdentityConfiguration\n",
"\n",
"my_job_inputs = {\"input_data\": Input(type=AssetTypes.MLTABLE, path=\"./sample_data\")}\n",
"job = command(\n",
" code=\"./src\",\n",
" command=\"python read_data.py --input_data ${{inputs.input_data}}\",\n",
" inputs=my_job_inputs,\n",
" environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n",
" compute=\"cpu-cluster\",\n",
" identity=UserIdentityConfiguration(),\n",
")"
]
},
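{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Submitting this job works the same as before - a minimal sketch:\n",
"\n",
"```python\n",
"# submit the command; data access inside the job uses your user identity\n",
"returned_job = ml_client.jobs.create_or_update(job)\n",
"returned_job.studio_url\n",
"```"
]
},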
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## MLTable\n",
"\n",
"`MLTable` is a way to abstract the schema definition for tabular data so that it is easier for consumers of the data to materialize the table into a Pandas/Dask/Spark dataframe. [A more detailed explanation and motivation is provided on docs.microsoft.com.](https://docs.microsoft.com/azure/machine-learning/concept-data#mltable).\n",
"\n",
"The ideal scenarios to use `MLTable` are:\n",
"\n",
"- The schema of your data is complex and/or changes frequently.\n",
"- You only need a subset of data (for example: a sample of rows or files, specific columns, etc).\n",
"- AutoML jobs requiring tabular data.\n",
"\n",
"If your scenario does not fit the above then it is likely that URIs are a more suitable type.\n",
"\n",
"### The `MLTable` file\n",
"\n",
"The `MLTable` file defines the schema for tabular data. Below is a sample:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!cat ./sample-mltable/MLTable"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We recommend that you co-locate your `MLTable` file with the underlying data (i.e. the `MLTable` file should be in the same (or parent) directory. You can can load an `MLTable` artefact using the `mltable` library - below below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import mltable\n",
"\n",
"# Note: the uri below can be a local folder or folder located in cloud storage. The folder must contain a valid MLTable file.\n",
"tbl = mltable.load(uri=\"./sample-mltable\")\n",
"tbl.to_pandas_dataframe()"
]
},
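{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also author an `MLTable` programmatically with the `mltable` library instead of writing the file by hand. A minimal sketch, assuming the titanic CSV in `./sample_data` and a hypothetical output folder `./my-mltable`:\n",
"\n",
"```python\n",
"import mltable\n",
"\n",
"# define the table from a delimited (CSV) file\n",
"tbl = mltable.from_delimited_files(paths=[{\"file\": \"./sample_data/titanic.csv\"}])\n",
"# optionally keep only the first 10 rows\n",
"tbl = tbl.take(10)\n",
"# write the MLTable file into ./my-mltable\n",
"tbl.save(\"./my-mltable\")\n",
"```"
]
},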
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Data from external sources\n",
"\n",
"You can create a data asset in Azure Machine Learning by importing data from external sources like Snowflake or Amazon S3 and caching it, which has the following benefits:\n",
"\n",
"- All the benefits of caching including faster and more reliable access of data for the training jobs\n",
"- Less points of failure than connecting to data in external sources directly\n",
"- Easy to share with other members of the team (no need to remember file locations)\n",
"- Versioning of the metadata (location, description, etc) even for data from SQL/ relational database \n",
"- Lineage tracking\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Firstly, you need to create a workspace connection with details on how to connect to the external source. \n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a connection to Snowflake DB when you want to import data from Snowflake"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import MLClient\n",
"from azure.ai.ml.entities import WorkspaceConnection\n",
"from azure.ai.ml.entities import UsernamePasswordConfiguration\n",
"\n",
"name = \"my_snowflakedb_connection\"\n",
"\n",
"target = \"jdbc:snowflake://<myaccount>.snowflakecomputing.com/?db=<mydb>&warehouse=<mywarehouse>&role=<myrole>\"\n",
"# add the Snowflake account, database, warehouse name and role name here. If no role name provided it will default to PUBLIC\n",
"\n",
"wps_connection = WorkspaceConnection(\n",
" name=name,\n",
" type=\"snowflake\",\n",
" target=target,\n",
" credentials=UsernamePasswordConfiguration(username=\"XXXXX\", password=\"XXXXXX\"),\n",
")\n",
"ml_client.connections.create_or_update(workspace_connection=wps_connection)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a connection to Azure SQL DB when you want to import data from Azure SQL"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import MLClient\n",
"from azure.ai.ml.entities import WorkspaceConnection\n",
"from azure.ai.ml.entities import UsernamePasswordConfiguration\n",
"\n",
"name = \"my_sqldb_connection\"\n",
"target = \"Server=tcp:<myservername>,<port>;Database=<mydatabase>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30\"\n",
"# add the sql servername, port address and database\n",
"\n",
"wps_connection = WorkspaceConnection(\n",
" name=name,\n",
" type=\"azure_sql_db\",\n",
" target=target,\n",
" credentials=UsernamePasswordConfiguration(username=\"XXXXX\", password=\"XXXXXX\"),\n",
")\n",
"ml_client.connections.create_or_update(workspace_connection=wps_connection)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a connection to Amazon S3 when you want to import data from S3 bucket"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import MLClient\n",
"from azure.ai.ml.entities import WorkspaceConnection\n",
"from azure.ai.ml.entities import AccessKeyConfiguration\n",
"\n",
"name = \"my_s3_connection\"\n",
"target = \"<mybucket>\" # add the s3 bucket details\n",
"wps_connection = WorkspaceConnection(\n",
" name=name,\n",
" type=\"s3\",\n",
" target=target,\n",
" credentials=AccessKeyConfiguration(\n",
" access_key_id=\"XXXXXX\", secret_access_key=\"XXXXXXXX\"\n",
" ),\n",
")\n",
"ml_client.connections.create_or_update(workspace_connection=wps_connection)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the connection is created, you need to initiate an Import job"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is an example of importing data from Snowflake"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import DataImport\n",
"from azure.ai.ml.data_transfer import Database\n",
"from azure.ai.ml import MLClient\n",
"\n",
"# Supported connections include:\n",
"# Connection: azureml:<workspace_connection_name>\n",
"# Supported paths include:\n",
"# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n",
"\n",
"data_import = DataImport(\n",
" name=\"snowflake_sample\",\n",
" source=Database(\n",
" connection=\"azureml:my_snowflakedb_connection\",\n",
" query=\"select * from my_sample_table\",\n",
" ),\n",
" path=\"azureml://datastores/workspaceblobstore/paths/snowflake/${{name}}\",\n",
")\n",
"ml_client.data.import_data(data_import=data_import)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is an example of importing data from Azure SQL"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import DataImport\n",
"from azure.ai.ml.data_transfer import Database\n",
"from azure.ai.ml import MLClient\n",
"\n",
"# Supported connections include:\n",
"# Connection: azureml:<workspace_connection_name>\n",
"# Supported paths include:\n",
"# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n",
"\n",
"data_import = DataImport(\n",
" name=\"azuresql_sample\",\n",
" source=Database(\n",
" connection=\"azureml:my_sqldb_connection\", query=\"select * from my_table\"\n",
" ),\n",
" path=\"azureml://datastores/workspaceblobstore/paths/azuresql/${{name}}\",\n",
")\n",
"ml_client.data.import_data(data_import=data_import)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is an example of importing data from Amazon S3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import DataImport\n",
"from azure.ai.ml.data_transfer import FileSystem\n",
"from azure.ai.ml import MLClient\n",
"\n",
"# Supported connections include:\n",
"# Connection: azureml:<workspace_connection_name>\n",
"# Supported paths include:\n",
"# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n",
"\n",
"data_import = DataImport(\n",
" name=\"s3_sample\",\n",
" source=FileSystem(\n",
" connection=\"azureml:my_s3_connection\", path=\"myfiles/titanic.csv\"\n",
" ),\n",
" path=\"azureml://datastores/workspaceblobstore/paths/s3/${{name}}\",\n",
")\n",
"ml_client.data.import_data(data_import=data_import)"
]
},
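{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`import_data` runs as a job. Once it completes, the imported data is registered as a data asset that you can retrieve like any other - a minimal sketch, using the latest version:\n",
"\n",
"```python\n",
"# retrieve the imported data asset once the import job has completed\n",
"imported_asset = ml_client.data.get(name=\"s3_sample\", label=\"latest\")\n",
"print(imported_asset.id, imported_asset.path)\n",
"```"
]
},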
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Importing data on a Schedule. \n",
"You can import data on a schedule created on a recurrence trigger or a cron trigger"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is an example of importing data from Snowflake on Recurrence trigger"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.data_transfer import Database\n",
"from azure.ai.ml.constants import TimeZone\n",
"from azure.ai.ml.entities import (\n",
" ImportDataSchedule,\n",
" RecurrenceTrigger,\n",
" RecurrencePattern,\n",
")\n",
"from datetime import datetime, timedelta\n",
"\n",
"source = Database(connection=\"azureml:my_sf_connection\", query=\"select * from my_table\")\n",
"\n",
"path = \"azureml://datastores/workspaceblobstore/paths/snowflake/schedule/${{name}}\"\n",
"\n",
"\n",
"my_data = DataImport(\n",
" type=\"mltable\", source=source, path=path, name=\"my_schedule_sfds_test\"\n",
")\n",
"\n",
"schedule_name = \"my_simple_sdk_create_schedule_recurrence\"\n",
"\n",
"schedule_start_time = datetime.utcnow()\n",
"\n",
"recurrence_trigger = RecurrenceTrigger(\n",
" frequency=\"day\",\n",
" interval=1,\n",
" schedule=RecurrencePattern(hours=1, minutes=[0, 1]),\n",
" start_time=schedule_start_time,\n",
" time_zone=TimeZone.UTC,\n",
")\n",
"\n",
"\n",
"import_schedule = ImportDataSchedule(\n",
" name=schedule_name, trigger=recurrence_trigger, import_data=my_data\n",
")\n",
"\n",
"ml_client.schedules.begin_create_or_update(import_schedule).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a similar example of creating an import data on a schedule - this time it is a Cron Trigger"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import CronTrigger, ImportDataSchedule\n",
"\n",
"source = Database(connection=\"azureml:my_sf_connection\", query=\"select * from my_table\")\n",
"\n",
"path = \"azureml://datastores/workspaceblobstore/paths/snowflake/schedule/${{name}}\"\n",
"\n",
"\n",
"my_data = DataImport(\n",
" type=\"mltable\", source=source, path=path, name=\"my_schedule_sfds_test\"\n",
")\n",
"\n",
"schedule_name = \"my_simple_sdk_create_schedule_cron\"\n",
"start_time = datetime.utcnow()\n",
"end_time = start_time + timedelta(days=7) # Set end_time to 7 days later\n",
"\n",
"cron_trigger = CronTrigger(\n",
" expression=\"15 10 * * 1\",\n",
" start_time=start_time,\n",
" end_time=end_time,\n",
")\n",
"import_schedule = ImportDataSchedule(\n",
" name=schedule_name, trigger=cron_trigger, import_data=my_data\n",
")\n",
"ml_client.schedules.begin_create_or_update(import_schedule).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"NOTE: The import schedule is a schedule, so all the other CRUD operations of Schedule are available on this as well."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Disable the schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ml_client.schedules.begin_disable(name=schedule_name).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the detail of the schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"created_schedule = ml_client.schedules.get(name=schedule_name)\n",
"[created_schedule.name]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"List schedules in a workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schedules = ml_client.schedules.list()\n",
"[s.name for s in schedules]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Enable a schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ml_client.schedules.begin_enable(name=schedule_name).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Update a schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Update trigger expression\n",
"import_schedule.trigger.expression = \"10 10 * * 1\"\n",
"import_schedule = ml_client.schedules.begin_create_or_update(\n",
" schedule=import_schedule\n",
").result()\n",
"print(import_schedule)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Delete the schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Only disabled schedules can be deleted\n",
"ml_client.schedules.begin_disable(name=schedule_name).result()\n",
"ml_client.schedules.begin_delete(name=schedule_name).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Management on Workspace managed datastore:\n",
"\n",
"Data import can be performed to an AzureML managed HOBO storage called \"workspacemanageddatastore\" by specifying \n",
"path: azureml://datastores/workspacemanageddatastore\n",
"The datastore will be automatically back-filled if not present.\n",
"\n",
"When done so, it comes with the added benefit of data life cycle management.\n",
"\n",
"The following shows a simple data import on to the workspacemanageddatastore. Same can be done using the schedules defined above - "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import DataImport\n",
"from azure.ai.ml.data_transfer import Database\n",
"from azure.ai.ml import MLClient\n",
"\n",
"# Supported connections include:\n",
"# Connection: azureml:<workspace_connection_name>\n",
"# Supported paths include:\n",
"# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n",
"\n",
"data_import = DataImport(\n",
" name=\"my_sf_managedasset\",\n",
" source=Database(\n",
" connection=\"azureml:my_snowflakedb_connection\",\n",
" query=\"select * from my_sample_table\",\n",
" ),\n",
" path=\"azureml://datastores/workspacemanageddatastore\",\n",
")\n",
"ml_client.data.import_data(data_import=data_import)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Following are the examples of doing data lifecycle management aka altering the auto_delete_settings"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Get imported data asset details:\n",
"```python\n",
"\n",
"\n",
"# Get data asset details\n",
"name = \"my_sf_managedasset\"\n",
"version = \"1\"\n",
"my_data = ml_client.data.get(name=name, version=version)\n",
"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Update auto delete settings - "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# update auto delete setting\n",
"name = \"my_sf_managedasset\"\n",
"version = \"1\"\n",
"my_data = ml_client.data.get(name=name, version=version)\n",
"my_data.auto_delete_setting = AutoDeleteSetting(\n",
" condition=\"created_greater_than\", value=\"45d\"\n",
")\n",
"my_data = ml_client.data.create_or_update(my_data)\n",
"print(\"Update auto delete setting:\", my_data)\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Update auto delete settings - "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# update auto delete setting\n",
"name = \"my_sf_managedasset\"\n",
"version = \"1\"\n",
"my_data = ml_client.data.get(name=name, version=version)\n",
"my_data.auto_delete_setting = AutoDeleteSetting(\n",
" condition=\"last_accessed_greater_than\", value=\"30d\"\n",
")\n",
"my_data = ml_client.data.create_or_update(my_data)\n",
"print(\"Update auto delete setting:\", my_data)\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Delete auto delete settings - "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# remove auto delete setting\n",
"name = \"my_sf_managedasset\"\n",
"version = \"1\"\n",
"my_data = ml_client.data.get(name=name, version=version)\n",
"my_data.auto_delete_setting = None\n",
"my_data = ml_client.data.create_or_update(my_data)\n",
"print(\"Remove auto delete setting:\", my_data)\n",
"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
 Note: Whilst">
"> Note: Whilst the earlier example registered a local file, remember that `path` also supports cloud storage (`https`, `abfss`, `wasbs` protocols). Therefore, if you want to register data in a cloud location, just specify the path with any of the supported protocols.\n",
"\n",
"### Consume data assets in a job\n",
"\n",
"Below shows how to consume a data asset in the job:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import command, Input, Output\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"registered_data_asset = ml_client.data.get(name=\"titanic\", version=\"1\")\n",
"\n",
"my_job_inputs = {\n",
" \"input_data\": Input(type=AssetTypes.URI_FILE, path=registered_data_asset.id)\n",
"}\n",
"\n",
"job = command(\n",
" code=\"./src\",\n",
" command=\"python read_data.py --input_data ${{inputs.input_data}}\",\n",
" inputs=my_job_inputs,\n",
" environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n",
" compute=\"cpu-cluster\",\n",
")\n",
"\n",
"# submit the command\n",
"returned_job = ml_client.create_or_update(job)\n",
"# get a URL for the status of the job\n",
"returned_job.studio_url"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read an MLTable in a job\n",
"\n",
"#### Create an environment\n",
"\n",
"Firstly, you need to create an environment that contains the mltable Python Library:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import Environment\n",
"\n",
"env_docker_conda = Environment(\n",
" image=\"mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04\",\n",
" conda_file=\"env-mltable.yml\",\n",
" name=\"mltable\",\n",
" description=\"Environment created for consuming MLTable.\",\n",
")\n",
"\n",
"ml_client.environments.create_or_update(env_docker_conda)"
]
},
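{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The job will run `read_mltable.py` from `./src`, which materializes the `MLTable` input into a Pandas dataframe. A minimal sketch of what that script might look like (hypothetical; the real script is in `./src`):\n",
"\n",
"```python\n",
"# hypothetical sketch of ./src/read_mltable.py\n",
"import argparse\n",
"\n",
"import mltable\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument(\"--input_data\", type=str, help=\"path to the folder containing the MLTable file\")\n",
"args = parser.parse_args()\n",
"\n",
"tbl = mltable.load(args.input_data)\n",
"df = tbl.to_pandas_dataframe()\n",
"print(df.head(10))\n",
"```"
]
},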
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import command\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml import Input\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"inputs = {\"input_data\": Input(type=AssetTypes.MLTABLE, path=\"./sample-mltable\")}\n",
"\n",
"job = command(\n",
" code=\"./src\", # local path where the code is stored\n",
" command=\"python read_mltable.py --input_data ${{inputs.input_data}}\",\n",
" inputs=inputs,\n",
" environment=env_docker_conda,\n",
" compute=\"cpu-cluster\",\n",
")\n",
"\n",
"# submit the command\n",
"returned_job = ml_client.jobs.create_or_update(job)\n",
"# get a URL for the status of the job\n",
"returned_job.studio_url"
]
}
],
"metadata": {
"description": {
"description": "Read, write and register a data asset"
},
"kernelspec": {
"display_name": "Python 3.10 - SDK V2",
"language": "python",
"name": "python310-sdkv2"
},
"language_info": {
"name": "python",
"version": ""
},
"vscode": {
"interpreter": {
"hash": "0ece9b0d22cc202945275ade981e664a4c51236f9f70f1a68cccb779b759da7e"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}