{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataset Versioning in Azure Machine Learning\n",
"\n",
"In this notebook, you will:\n",
"1. Compute a hash for a dataset.\n",
"2. Check if a dataset with the same hash already exists in Azure ML.\n",
"3. If not, upload the dataset to Azure Blob Storage and register it as an asset with a hash tag.\n",
"4. If it exists, retrieve the asset name, version, and tag.\n",
"\n",
"> **Note**: Ensure you update the configuration values before running the notebook. \n",
"> **Note**: If you encounter any issues, refer to the troubleshooting section at the end.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install azure-ai-ml\n",
"!pip install azure-identity\n",
"!pip install azure-storage-blob"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import MLClient\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.constants import AssetTypes\n",
"import hashlib\n",
"import os\n",
"from azure.identity import DefaultAzureCredential\n",
"from azure.storage.blob import BlobServiceClient"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subscription_id = \"<SUBSCRIPTION_ID>\"\n",
"resource_group = \"<RESOURCE_GROUP>\"\n",
"workspace = \"<AML_WORKSPACE_NAME>\"\n",
"\n",
"asset_name = \"asset_name\" # Replace with the name you want to give your asset.\n",
"asset_description = \"asset_description\" # Provide a description of your asset.\n",
"asset_path = os.path.abspath(\n",
" \"./sample_data/\"\n",
") # Provide the absolute path to your local DATA FOLDERS.\n",
"asset_type = \"dataset\"\n",
"\n",
"enable_blob_upload = (\n",
" False # Set to True if you want to upload the asset to Azure Blob Storage.\n",
")\n",
"azure_storage_account_name = \"azure_storage_account_name\"\n",
"container_name = \"container_name\"\n",
"assets_folder = \"assets_folder\" # Provide a unique folder path to store the assets in Azure Blob Storage."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to Azure Machine Learning Workspace\n",
"\n",
"Connect to the Azure ML workspace using `MLClient`. Ensure the configuration values in the previous cell are correct before running this code.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize the Azure ML client\n",
"ml_client = MLClient(\n",
" DefaultAzureCredential(),\n",
" subscription_id,\n",
" resource_group,\n",
" workspace,\n",
")\n",
"\n",
"if enable_blob_upload:\n",
" # Initialize the Azure Blob client\n",
" blob_client = BlobServiceClient(\n",
" account_url=f\"https://{azure_storage_account_name}.blob.core.windows.net\",\n",
" credential=DefaultAzureCredential(),\n",
" ).get_container_client(container_name)"
]
},
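{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check before proceeding, you can fetch the workspace details to confirm the client is connected (a minimal sketch, assuming the configuration values above are valid):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: uncomment to verify the workspace connection\n",
"# ws = ml_client.workspaces.get(workspace)\n",
"# print(f\"Connected to workspace '{ws.name}' in {ws.location}\")"
]
},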
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compute Hash for Dataset\n",
"\n",
"Compute a hash for the dataset to identify its uniqueness. This will help in checking if the dataset already exists in Azure ML.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compute the hash of the asset folder\n",
"hash_algo = hashlib.sha256()\n",
"for root, _, files in os.walk(asset_path):\n",
" for file in sorted(files): # Sort files for consistent hash\n",
" file_path = os.path.join(root, file)\n",
" with open(file_path, \"rb\") as f:\n",
" for chunk in iter(lambda: f.read(4096), b\"\"):\n",
" hash_algo.update(chunk)\n",
"asset_hash = hash_algo.hexdigest()\n",
"print(f\"Computed hash: {asset_hash}\")"
]
},
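{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**: The hash above covers file contents only, so two folders with identical contents but different file names or layouts produce the same hash. If renames and moves should also produce a new version, a path-aware variant can be sketched as below (the helper name `compute_path_aware_hash` is illustrative, not part of any SDK):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def compute_path_aware_hash(folder):\n",
"    # Mix each file's relative path into the digest before its contents,\n",
"    # so renaming or moving a file changes the resulting hash.\n",
"    algo = hashlib.sha256()\n",
"    for root, dirs, files in os.walk(folder):\n",
"        dirs.sort()  # Deterministic traversal order\n",
"        for name in sorted(files):\n",
"            file_path = os.path.join(root, name)\n",
"            rel_path = os.path.relpath(file_path, folder).replace(os.sep, \"/\")\n",
"            algo.update(rel_path.encode(\"utf-8\"))\n",
"            with open(file_path, \"rb\") as f:\n",
"                for chunk in iter(lambda: f.read(4096), b\"\"):\n",
"                    algo.update(chunk)\n",
"    return algo.hexdigest()\n",
"\n",
"# Example: compute_path_aware_hash(asset_path)"
]
},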
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check for Existing Asset in Azure ML\n",
"\n",
"Check if a dataset with the same hash already exists in the Azure ML workspace.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check if an asset with this hash already exists in Azure ML\n",
"asset_exists = False\n",
"existing_asset_info = None\n",
"for asset in ml_client.data.list():\n",
" for asset_version_info in ml_client.data.list(name=asset.name):\n",
" if asset_version_info.tags.get(\"hash\") == asset_hash:\n",
" asset_exists = True\n",
" existing_asset_info = {\n",
" \"asset_name\": asset_version_info.name,\n",
" \"asset_version\": asset_version_info.version,\n",
" }\n",
" break\n",
" if asset_exists:\n",
" break\n",
"\n",
"if asset_exists:\n",
" print(f\"Asset with hash {asset_hash} already exists in the workspace.\")\n",
" print(\n",
" f\"Asset name: {existing_asset_info['asset_name']}, version: {existing_asset_info['asset_version']}\"\n",
" )\n",
"else:\n",
" print(\n",
" \"No existing asset found with the same hash. Uploading and registering the asset.\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload to Azure Blob Storage (If Not Exists)\n",
"\n",
"If the dataset doesn't already exist in Azure ML, upload it to Azure Blob Storage.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not asset_exists:\n",
" # Determine the latest version number\n",
" try:\n",
" latest = ml_client.data._get_latest_version(asset_name)\n",
" latest_version = str(int(latest.version) + 1) if latest else \"1\"\n",
" except Exception as e:\n",
" print(f\"Error getting latest version: {e}, setting it to 1.\")\n",
" latest_version = \"1\"\n",
"\n",
" # Upload files to Azure Blob Storage\n",
" unique_folder_path = f\"{asset_name}_{latest_version}\"\n",
" print(f\"Uploading files from {asset_path} to {unique_folder_path}\")\n",
" if enable_blob_upload:\n",
" for root, _, files in os.walk(asset_path):\n",
" for file_name in files:\n",
" file_path = os.path.join(root, file_name)\n",
" blob_path = os.path.join(\n",
" unique_folder_path, os.path.relpath(file_path, asset_path)\n",
" )\n",
" blob_client_instance = blob_client.get_blob_client(blob_path)\n",
" with open(file_path, \"rb\") as data:\n",
" blob_client_instance.upload_blob(data, overwrite=True)\n",
" print(f\"Uploaded {file_path} to {blob_path}\")"
]
},
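{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**: `ml_client.data._get_latest_version` is a private SDK method and may change between releases. The same lookup can be sketched with only the public `list` API, assuming the asset's versions are numeric strings (the helper name `get_next_version` is illustrative):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_next_version(client, name):\n",
"    # Enumerate registered versions via the public API and increment the maximum;\n",
"    # fall back to \"1\" if the asset does not exist yet.\n",
"    try:\n",
"        versions = [int(v.version) for v in client.data.list(name=name)]\n",
"        return str(max(versions) + 1) if versions else \"1\"\n",
"    except Exception:\n",
"        return \"1\"\n",
"\n",
"# Example: get_next_version(ml_client, asset_name)"
]
},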
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register the Dataset in Azure ML\n",
"\n",
"After uploading to Blob Storage, register the dataset as an asset in Azure ML.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not asset_exists:\n",
" # Register the asset in Azure ML\n",
" blob_url = f\"https://{azure_storage_account_name}.blob.core.windows.net/{container_name}/{unique_folder_path}\"\n",
" data_asset = Data(\n",
" path=blob_url,\n",
" type=AssetTypes.URI_FOLDER,\n",
" description=asset_description,\n",
" name=asset_name,\n",
" tags={\"hash\": asset_hash}, # Tagging the asset with the computed hash\n",
" )\n",
"\n",
" registered_asset = ml_client.data.create_or_update(data_asset)\n",
" print(\n",
" f\"New {asset_type} registered in the workspace: {asset_name} with version {registered_asset.version}\"\n",
" )\n",
" existing_asset_info = {\n",
" f\"asset_name\": registered_asset.name,\n",
" f\"asset_version\": registered_asset.version,\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results\n",
"\n",
"If the asset was uploaded and registered, or if it already existed, the asset name and version are displayed below.\n",
"\n",
"**You can use these results to add a tag to the AML job, creating a link to the dataset used.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if existing_asset_info:\n",
" print(\n",
" f\"Asset name: {existing_asset_info['asset_name']}, version: {existing_asset_info['asset_version']} found\"\n",
" )\n",
"else:\n",
" print(\"No action taken.\")"
]
},
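{
"cell_type": "markdown",
"metadata": {},
"source": [
"The linking step mentioned above can be sketched as follows, using `command` from `azure.ai.ml` to attach the asset name and version as job tags. This is a minimal sketch; the environment and compute names are placeholders you would replace with your own:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: uncomment and fill in the placeholders to tag a job with the dataset info\n",
"# from azure.ai.ml import command\n",
"#\n",
"# job = command(\n",
"#     command=\"echo 'training...'\",\n",
"#     environment=\"<ENVIRONMENT_NAME>@latest\",  # Placeholder\n",
"#     compute=\"<COMPUTE_NAME>\",  # Placeholder\n",
"#     tags={\n",
"#         \"dataset_name\": existing_asset_info[\"asset_name\"],\n",
"#         \"dataset_version\": existing_asset_info[\"asset_version\"],\n",
"#     },\n",
"# )\n",
"# ml_client.jobs.create_or_update(job)"
]
},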
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Troubleshooting\n",
"\n",
"1. **Permission Issues:** \n",
" - Ensure that your Azure Active Directory (AAD) account has the appropriate permissions:\n",
" - **Azure Machine Learning:** 'Contributor' or 'Owner' role for the resource group containing the Azure ML workspace.\n",
" - **Storage Account:** 'Storage Blob Data Contributor' role for the specific storage account used in your operations.\n",
"\n",
"2. **Path Error:** \n",
" - The dataset path (`asset_path`) must be an absolute path. Modify this path based on your environment to point to the correct location of the data files.\n",
"\n",
"3. **Asset Registration Issues:** \n",
" - If you encounter an error stating that an asset already exists, ensure that you have a unique asset name or update the existing asset version if necessary.\n",
"\n",
"4. **Tagging:** \n",
" - Each dataset is tagged with its computed hash. This tag is used to verify the uniqueness of the dataset in Azure ML. \n",
" Make sure to include the hash tag when registering a new dataset.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}