sdk/python/assets/data/data.ipynb

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Working with Data\n", "\n", "In this notebook you will learn how to use the AzureML SDK to:\n", "\n", "1. Read/write data in a job.\n", "1. Create a data asset to share with others in your team.\n", "1. Abstract schema for tabular data using `MLTable`.\n", "\n", "## Connect to Azure Machine Learning Workspace\n", "\n", "To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the default [default azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../jobs/configuration.ipynb) for more details on how to configure credentials and connect to a workspace.\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import MLClient\n", "from azure.identity import DefaultAzureCredential\n", "\n", "# enter details of your AML workspace\n", "subscription_id = \"<SUBSCRIPTION_ID>\"\n", "resource_group = \"<RESOURCE_GROUP>\"\n", "workspace = \"<AML_WORKSPACE_NAME>\"\n", "\n", "# get a handle to the workspace\n", "ml_client = MLClient(\n", " DefaultAzureCredential(), subscription_id, resource_group, workspace\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Reading/writing data in a job\n", "\n", "In this example we will use the titanic dataset in this repo - ([./sample_data/titanic.csv](./sample_data/titanic.csv)) and set-up a command that executes the following python code:\n", "\n", "```python\n", "df = pd.read_csv(args.input_data)\n", "print(df.head(10))\n", "```\n", "\n", "Below is the code for submitting the command to the cloud - note that both the code *and* the data is automatically uploaded to the cloud. Note: The data is only re-uploaded on subsequent job submissions if data has changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import command\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml import Input\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "# === Note on path ===\n", "# can be can be a local path or a cloud path. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import command\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml import Input\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "# === Note on path ===\n", "# The path can be a local path or a cloud path. AzureML supports `https://`, `abfss://`, `wasbs://` and `azureml://` URIs.\n", "# Local paths are automatically uploaded to the default datastore in the cloud.\n", "# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths\n", "\n", "inputs = {\n", " \"input_data\": Input(type=AssetTypes.URI_FILE, path=\"./sample_data/titanic.csv\")\n", "}\n", "\n", "job = command(\n", " code=\"./src\", # local path where the code is stored\n", " command=\"python read_data.py --input_data ${{inputs.input_data}}\",\n", " inputs=inputs,\n", " environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n", " compute=\"cpu-cluster\",\n", ")\n", "\n", "# submit the command\n", "returned_job = ml_client.jobs.create_or_update(job)\n", "# get a URL for the status of the job\n", "returned_job.studio_url" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Reading *and* writing data in a job\n", "\n", "By design, you cannot *write* to `Inputs`, only to `Outputs`. The code below creates an `Output` that will mount your AzureML default datastore (Azure Blob) in read-*write* mode. The python code simply takes the CSV as input and exports it as a parquet file, i.e.\n", "\n", "```python\n", "df = pd.read_csv(args.input_data)\n", "output_path = os.path.join(args.output_folder, \"my_output.parquet\")\n", "df.to_parquet(output_path)\n", "```" ] },
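{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Again as a rough sketch, `./src/read_write_data.py` might look like the following (the `argparse` boilerplate is an assumption - check the script in `./src` for the actual implementation):\n", "\n", "```python\n", "import argparse\n", "import os\n", "\n", "import pandas as pd\n", "\n", "# parse the input/output arguments passed by the command job\n", "parser = argparse.ArgumentParser()\n", "parser.add_argument(\"--input_data\", type=str)\n", "parser.add_argument(\"--output_folder\", type=str)\n", "args = parser.parse_args()\n", "\n", "# read the CSV input and write a parquet file to the mounted output folder\n", "df = pd.read_csv(args.input_data)\n", "output_path = os.path.join(args.output_folder, \"my_output.parquet\")\n", "df.to_parquet(output_path)\n", "```" ] },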
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import command\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml import Input, Output\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "inputs = {\n", " \"input_data\": Input(type=AssetTypes.URI_FILE, path=\"./sample_data/titanic.csv\")\n", "}\n", "\n", "outputs = {\n", " \"output_folder\": Output(\n", " type=AssetTypes.URI_FOLDER,\n", " path=f\"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/workspaceblobstore/paths/\",\n", " )\n", "}\n", "\n", "job = command(\n", " code=\"./src\", # local path where the code is stored\n", " command=\"python read_write_data.py --input_data ${{inputs.input_data}} --output_folder ${{outputs.output_folder}}\",\n", " inputs=inputs,\n", " outputs=outputs,\n", " environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n", " compute=\"cpu-cluster\",\n", ")\n", "\n", "# submit the command\n", "returned_job = ml_client.create_or_update(job)\n", "# get a URL for the status of the job\n", "returned_job.studio_url" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Create Data Assets\n", "\n", "You can create a data asset in Azure Machine Learning, which has the following benefits:\n", "\n", "- Easy to share with other members of the team (no need to remember file locations)\n", "- Versioning of the metadata (location, description, etc)\n", "- Lineage tracking\n", "\n", "Below we show an example of versioning the sample data in this repo. The data is uploaded to cloud storage and registered as an asset." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import Data\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "try:\n", " registered_data_asset = ml_client.data.get(name=\"titanic\", version=\"1\")\n", " print(\"Found data asset. Will not create again\")\n", "except Exception as ex:\n", " my_data = Data(\n", " path=\"./sample_data/titanic.csv\",\n", " type=AssetTypes.URI_FILE,\n", " description=\"Titanic Data\",\n", " name=\"titanic\",\n", " version=\"1\",\n", " )\n", " ml_client.data.create_or_update(my_data)\n", " registered_data_asset = ml_client.data.get(name=\"titanic\", version=\"1\")\n", " print(\"Created data asset\")" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Authenticate with user identity\n", "\n", "When running a job on a compute cluster, you can also use your user identity to access data. To enable the job to access data on your behalf, specify **identity=UserIdentityConfiguration()** in the job definition, as shown below. For more details, see [Accessing storage services](https://learn.microsoft.com/azure/machine-learning/how-to-identity-based-service-authentication)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import UserIdentityConfiguration\n", "\n", "my_job_inputs = {\"input_data\": Input(type=AssetTypes.MLTABLE, path=\"./sample_data\")}\n", "job = command(\n", " code=\"./src\",\n", " command=\"python read_data.py --input_data ${{inputs.input_data}}\",\n", " inputs=my_job_inputs,\n", " environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n", " compute=\"cpu-cluster\",\n", " identity=UserIdentityConfiguration(),\n", ")" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## MLTable\n", "\n", "`MLTable` is a way to abstract the schema definition for tabular data so that it is easier for consumers of the data to materialize the table into a Pandas/Dask/Spark dataframe. [A more detailed explanation and motivation is provided on docs.microsoft.com.](https://docs.microsoft.com/azure/machine-learning/concept-data#mltable).\n", "\n", "The ideal scenarios to use `MLTable` are:\n", "\n", "- The schema of your data is complex and/or changes frequently.\n", "- You only need a subset of data (for example: a sample of rows or files, specific columns, etc).\n", "- AutoML jobs requiring tabular data.\n", "\n", "If your scenario does not fit the above then it is likely that URIs are a more suitable type.\n", "\n", "### The `MLTable` file\n", "\n", "The `MLTable` file defines the schema for tabular data. Below is a sample:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./sample-mltable/MLTable" ] },
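{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The cell above prints the actual contents of the sample file. For reference, a minimal `MLTable` file for a delimited-text source typically looks something like this (a sketch only - the file name and options below are illustrative, and the sample file in this repo may differ):\n", "\n", "```yaml\n", "paths:\n", "  - file: ./data.csv\n", "transformations:\n", "  - read_delimited:\n", "      delimiter: ','\n", "      header: all_files_same_headers\n", "```" ] },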
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We recommend that you co-locate your `MLTable` file with the underlying data (i.e. the `MLTable` file should be in the same (or a parent) directory). You can load an `MLTable` artefact using the `mltable` library, as shown below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mltable\n", "\n", "# Note: the uri below can be a local folder or a folder located in cloud storage. The folder must contain a valid MLTable file.\n", "tbl = mltable.load(uri=\"./sample-mltable\")\n", "tbl.to_pandas_dataframe()" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Import Data from external sources\n", "\n", "You can create a data asset in Azure Machine Learning by importing data from external sources like Snowflake or Amazon S3 and caching it, which has the following benefits:\n", "\n", "- All the benefits of caching, including faster and more reliable access to data for training jobs\n", "- Fewer points of failure than connecting to data in external sources directly\n", "- Easy to share with other members of the team (no need to remember file locations)\n", "- Versioning of the metadata (location, description, etc), even for data from SQL/relational databases\n", "- Lineage tracking\n", "\n" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Firstly, you need to create a workspace connection with details on how to connect to the external source. \n" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Creating a connection to Snowflake DB when you want to import data from Snowflake" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import MLClient\n", "from azure.ai.ml.entities import WorkspaceConnection\n", "from azure.ai.ml.entities import UsernamePasswordConfiguration\n", "\n", "name = \"my_snowflakedb_connection\"\n", "\n", "target = \"jdbc:snowflake://<myaccount>.snowflakecomputing.com/?db=<mydb>&warehouse=<mywarehouse>&role=<myrole>\"\n", "# add the Snowflake account, database, warehouse name and role name here. If no role name is provided, it defaults to PUBLIC\n", "\n", "wps_connection = WorkspaceConnection(\n", " name=name,\n", " type=\"snowflake\",\n", " target=target,\n", " credentials=UsernamePasswordConfiguration(username=\"XXXXX\", password=\"XXXXXX\"),\n", ")\n", "ml_client.connections.create_or_update(workspace_connection=wps_connection)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Creating a connection to Azure SQL DB when you want to import data from Azure SQL" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import MLClient\n", "from azure.ai.ml.entities import WorkspaceConnection\n", "from azure.ai.ml.entities import UsernamePasswordConfiguration\n", "\n", "name = \"my_sqldb_connection\"\n", "target = \"Server=tcp:<myservername>,<port>;Database=<mydatabase>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30\"\n", "# add the SQL server name, port and database\n", "\n", "wps_connection = WorkspaceConnection(\n", " name=name,\n", " type=\"azure_sql_db\",\n", " target=target,\n", " credentials=UsernamePasswordConfiguration(username=\"XXXXX\", password=\"XXXXXX\"),\n", ")\n", "ml_client.connections.create_or_update(workspace_connection=wps_connection)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Creating a connection to Amazon S3 when you want to import data from an S3 bucket" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import MLClient\n", "from azure.ai.ml.entities import WorkspaceConnection\n", "from azure.ai.ml.entities import AccessKeyConfiguration\n", "\n", "name = \"my_s3_connection\"\n", "target = \"<mybucket>\" # add the s3 bucket details\n", "wps_connection = WorkspaceConnection(\n", " name=name,\n", " type=\"s3\",\n", " target=target,\n", " credentials=AccessKeyConfiguration(\n", " access_key_id=\"XXXXXX\", secret_access_key=\"XXXXXXXX\"\n", " ),\n", ")\n", "ml_client.connections.create_or_update(workspace_connection=wps_connection)" ] },
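{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, you can confirm that a connection was registered by retrieving it (credentials are typically scrubbed from the response). A quick sanity check:\n", "\n", "```python\n", "# retrieve a single connection by name\n", "conn = ml_client.connections.get(name=\"my_s3_connection\")\n", "print(conn.name, conn.type)\n", "\n", "# list all connections in the workspace\n", "for c in ml_client.connections.list():\n", "    print(c.name)\n", "```" ] },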
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Once the connection is created, you need to initiate an import job" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example of importing data from Snowflake" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import DataImport\n", "from azure.ai.ml.data_transfer import Database\n", "from azure.ai.ml import MLClient\n", "\n", "# Supported connections include:\n", "# Connection: azureml:<workspace_connection_name>\n", "# Supported paths include:\n", "# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n", "\n", "data_import = DataImport(\n", " name=\"snowflake_sample\",\n", " source=Database(\n", " connection=\"azureml:my_snowflakedb_connection\",\n", " query=\"select * from my_sample_table\",\n", " ),\n", " path=\"azureml://datastores/workspaceblobstore/paths/snowflake/${{name}}\",\n", ")\n", "ml_client.data.import_data(data_import=data_import)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example of importing data from Azure SQL" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import DataImport\n", "from azure.ai.ml.data_transfer import Database\n", "from azure.ai.ml import MLClient\n", "\n", "# Supported connections include:\n", "# Connection: azureml:<workspace_connection_name>\n", "# Supported paths include:\n", "# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n", "\n", "data_import = DataImport(\n", " name=\"azuresql_sample\",\n", " source=Database(\n", " connection=\"azureml:my_sqldb_connection\", query=\"select * from my_table\"\n", " ),\n", " path=\"azureml://datastores/workspaceblobstore/paths/azuresql/${{name}}\",\n", ")\n", "ml_client.data.import_data(data_import=data_import)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example of importing data from Amazon S3" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import DataImport\n", "from azure.ai.ml.data_transfer import FileSystem\n", "from azure.ai.ml import MLClient\n", "\n", "# Supported connections include:\n", "# Connection: azureml:<workspace_connection_name>\n", "# Supported paths include:\n", "# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n", "\n", "data_import = DataImport(\n", " name=\"s3_sample\",\n", " source=FileSystem(\n", " connection=\"azureml:my_s3_connection\", path=\"myfiles/titanic.csv\"\n", " ),\n", " path=\"azureml://datastores/workspaceblobstore/paths/s3/${{name}}\",\n", ")\n", "ml_client.data.import_data(data_import=data_import)" ] },
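{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Each import materializes the external data as a versioned data asset in the workspace. As a sketch (this assumes the import job above has finished), you can then retrieve the latest version:\n", "\n", "```python\n", "# fetch the imported data asset once the import job completes\n", "imported = ml_client.data.get(name=\"snowflake_sample\", label=\"latest\")\n", "print(imported.id)\n", "print(imported.path)\n", "```" ] },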
\n", "You can import data on a schedule created on a recurrence trigger or a cron trigger" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example of importing data from Snowflake on Recurrence trigger" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.data_transfer import Database\n", "from azure.ai.ml.constants import TimeZone\n", "from azure.ai.ml.entities import (\n", " ImportDataSchedule,\n", " RecurrenceTrigger,\n", " RecurrencePattern,\n", ")\n", "from datetime import datetime, timedelta\n", "\n", "source = Database(connection=\"azureml:my_sf_connection\", query=\"select * from my_table\")\n", "\n", "path = \"azureml://datastores/workspaceblobstore/paths/snowflake/schedule/${{name}}\"\n", "\n", "\n", "my_data = DataImport(\n", " type=\"mltable\", source=source, path=path, name=\"my_schedule_sfds_test\"\n", ")\n", "\n", "schedule_name = \"my_simple_sdk_create_schedule_recurrence\"\n", "\n", "schedule_start_time = datetime.utcnow()\n", "\n", "recurrence_trigger = RecurrenceTrigger(\n", " frequency=\"day\",\n", " interval=1,\n", " schedule=RecurrencePattern(hours=1, minutes=[0, 1]),\n", " start_time=schedule_start_time,\n", " time_zone=TimeZone.UTC,\n", ")\n", "\n", "\n", "import_schedule = ImportDataSchedule(\n", " name=schedule_name, trigger=recurrence_trigger, import_data=my_data\n", ")\n", "\n", "ml_client.schedules.begin_create_or_update(import_schedule).result()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is a similar example of creating an import data on a schedule - this time it is a Cron Trigger" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import CronTrigger, ImportDataSchedule\n", "\n", "source = Database(connection=\"azureml:my_sf_connection\", query=\"select * from my_table\")\n", "\n", "path = \"azureml://datastores/workspaceblobstore/paths/snowflake/schedule/${{name}}\"\n", "\n", "\n", "my_data = DataImport(\n", " type=\"mltable\", source=source, path=path, name=\"my_schedule_sfds_test\"\n", ")\n", "\n", "schedule_name = \"my_simple_sdk_create_schedule_cron\"\n", "start_time = datetime.utcnow()\n", "end_time = start_time + timedelta(days=7) # Set end_time to 7 days later\n", "\n", "cron_trigger = CronTrigger(\n", " expression=\"15 10 * * 1\",\n", " start_time=start_time,\n", " end_time=end_time,\n", ")\n", "import_schedule = ImportDataSchedule(\n", " name=schedule_name, trigger=cron_trigger, import_data=my_data\n", ")\n", "ml_client.schedules.begin_create_or_update(import_schedule).result()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "NOTE: The import schedule is a schedule, so all the other CRUD operations of Schedule are available on this as well." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Disable the schedule" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_client.schedules.begin_disable(name=schedule_name).result()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Check the detail of the schedule" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "created_schedule = ml_client.schedules.get(name=schedule_name)\n", "[created_schedule.name]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "List schedules in a workspace" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schedules = ml_client.schedules.list()\n", "[s.name for s in schedules]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Enable a schedule" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_client.schedules.begin_enable(name=schedule_name).result()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Update a schedule" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Update trigger expression\n", "import_schedule.trigger.expression = \"10 10 * * 1\"\n", "import_schedule = ml_client.schedules.begin_create_or_update(\n", " schedule=import_schedule\n", ").result()\n", "print(import_schedule)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Delete the schedule" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Only disabled schedules can be deleted\n", "ml_client.schedules.begin_disable(name=schedule_name).result()\n", "ml_client.schedules.begin_delete(name=schedule_name).result()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Data Management on Workspace managed datastore:\n", "\n", "Data import can be performed to an AzureML managed HOBO storage called \"workspacemanageddatastore\" by specifying \n", "path: azureml://datastores/workspacemanageddatastore\n", "The datastore will be automatically back-filled if not present.\n", "\n", "When done so, it comes with the added benefit of data life cycle management.\n", "\n", "The following shows a simple data import on to the workspacemanageddatastore. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import DataImport\n", "from azure.ai.ml.data_transfer import Database\n", "from azure.ai.ml import MLClient\n", "\n", "# Supported connections include:\n", "# Connection: azureml:<workspace_connection_name>\n", "# Supported paths include:\n", "# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n", "\n", "data_import = DataImport(\n", " name=\"my_sf_managedasset\",\n", " source=Database(\n", " connection=\"azureml:my_snowflakedb_connection\",\n", " query=\"select * from my_sample_table\",\n", " ),\n", " path=\"azureml://datastores/workspacemanageddatastore\",\n", ")\n", "ml_client.data.import_data(data_import=data_import)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following examples show data lifecycle management, i.e. altering the auto delete settings" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Get imported data asset details:\n", "```python\n", "# Get data asset details\n", "name = \"my_sf_managedasset\"\n", "version = \"1\"\n", "my_data = ml_client.data.get(name=name, version=version)\n", "```" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Update the auto delete setting, based on creation time - " ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "# update auto delete setting\n", "name = \"my_sf_managedasset\"\n", "version = \"1\"\n", "my_data = ml_client.data.get(name=name, version=version)\n", "my_data.auto_delete_setting = AutoDeleteSetting(\n", " condition=\"created_greater_than\", value=\"45d\"\n", ")\n", "my_data = ml_client.data.create_or_update(my_data)\n", "print(\"Update auto delete setting:\", my_data)\n", "```" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Update the auto delete setting, based on last access time - " ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "# update auto delete setting\n", "name = \"my_sf_managedasset\"\n", "version = \"1\"\n", "my_data = ml_client.data.get(name=name, version=version)\n", "my_data.auto_delete_setting = AutoDeleteSetting(\n", " condition=\"last_accessed_greater_than\", value=\"30d\"\n", ")\n", "my_data = ml_client.data.create_or_update(my_data)\n", "print(\"Update auto delete setting:\", my_data)\n", "```" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Remove the auto delete setting - " ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "# remove auto delete setting\n", "name = \"my_sf_managedasset\"\n", "version = \"1\"\n", "my_data = ml_client.data.get(name=name, version=version)\n", "my_data.auto_delete_setting = None\n", "my_data = ml_client.data.create_or_update(my_data)\n", "print(\"Remove auto delete setting:\", my_data)\n", "```" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> Note: Whilst the earlier data asset example used a local file, remember that `path` also supports cloud storage (`https`, `abfss`, `wasbs` protocols). Therefore, if you want to register data in a cloud location, just specify the path with any of the supported protocols." ] },
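{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For example, a sketch of registering a data asset directly from cloud storage (the storage account, container and file below are placeholders - substitute a URI you can access):\n", "\n", "```python\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "# hypothetical cloud location - replace with your own\n", "cloud_data = Data(\n", "    path=\"wasbs://<container>@<account>.blob.core.windows.net/titanic.csv\",\n", "    type=AssetTypes.URI_FILE,\n", "    description=\"Titanic data registered from cloud storage\",\n", "    name=\"titanic-cloud\",\n", "    version=\"1\",\n", ")\n", "ml_client.data.create_or_update(cloud_data)\n", "```" ] },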
my_data)\n", "\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "> Note: Whilst the above example shows a local file. Remember that `path` supports cloud storage (`https`, `abfss`, `wasbs` protocols). Therefore, if you want to register data in a cloud location just specify the path with any of the supported protocols.\n", "\n", "### Consume data assets in a job\n", "\n", "Below shows how to consume a data asset in the job:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import command, Input, Output\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "registered_data_asset = ml_client.data.get(name=\"titanic\", version=\"1\")\n", "\n", "my_job_inputs = {\n", " \"input_data\": Input(type=AssetTypes.URI_FILE, path=registered_data_asset.id)\n", "}\n", "\n", "job = command(\n", " code=\"./src\",\n", " command=\"python read_data.py --input_data ${{inputs.input_data}}\",\n", " inputs=my_job_inputs,\n", " environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n", " compute=\"cpu-cluster\",\n", ")\n", "\n", "# submit the command\n", "returned_job = ml_client.create_or_update(job)\n", "# get a URL for the status of the job\n", "returned_job.studio_url" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Read an MLTable in a job\n", "\n", "#### Create an environment\n", "\n", "Firstly, you need to create an environment that contains the mltable Python Library:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import Environment\n", "\n", "env_docker_conda = Environment(\n", " image=\"mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04\",\n", " conda_file=\"env-mltable.yml\",\n", " name=\"mltable\",\n", " description=\"Environment created for consuming MLTable.\",\n", ")\n", "\n", "ml_client.environments.create_or_update(env_docker_conda)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Create a job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import command\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml import Input\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "inputs = {\"input_data\": Input(type=AssetTypes.MLTABLE, path=\"./sample-mltable\")}\n", "\n", "job = command(\n", " code=\"./src\", # local path where the code is stored\n", " command=\"python read_mltable.py --input_data ${{inputs.input_data}}\",\n", " inputs=inputs,\n", " environment=env_docker_conda,\n", " compute=\"cpu-cluster\",\n", ")\n", "\n", "# submit the command\n", "returned_job = ml_client.jobs.create_or_update(job)\n", "# get a URL for the status of the job\n", "returned_job.studio_url" ] } ], "metadata": { "description": { "description": "Read, write and register a data asset" }, "kernelspec": { "display_name": "Python 3.10 - SDK V2", "language": "python", "name": "python310-sdkv2" }, "language_info": { "name": "python", "version": "" }, "vscode": { "interpreter": { "hash": "0ece9b0d22cc202945275ade981e664a4c51236f9f70f1a68cccb779b759da7e" } } }, "nbformat": 4, "nbformat_minor": 4 }