# Dataset Versioning in Azure Machine Learning

In this notebook, you will:
1. Compute a hash for a dataset.
2. Check whether a dataset with the same hash already exists in Azure ML.
3. If not, upload the dataset to Azure Blob Storage and register it as a data asset tagged with the hash.
4. If it does, retrieve the existing asset's name and version.

> **Note**: Ensure you update the configuration values before running the notebook.  
> **Note**: If you encounter any issues, refer to the troubleshooting section at the end.


In [None]:
# %pip installs into the environment of the running kernel
%pip install azure-ai-ml azure-identity azure-storage-blob

In [None]:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import hashlib
import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

In [None]:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

asset_name = "asset_name"  # Replace with the name you want to give your asset.
asset_description = "asset_description"  # Provide a description of your asset.
asset_path = os.path.abspath(
    "./sample_data/"
)  # Path to your local data folder; os.path.abspath converts it to an absolute path.
asset_type = "dataset"

# Set to True if you want to upload the asset to Azure Blob Storage.
enable_blob_upload = False
azure_storage_account_name = "azure_storage_account_name"
container_name = "container_name"
assets_folder = "assets_folder"  # Provide a unique folder path to store the assets in Azure Blob Storage.
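
Because the notebook depends on the values above being filled in, a quick check for leftover placeholders can catch mistakes before any Azure call is made. The following is a minimal sketch; `find_unset_config` and `example_config` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import re

# Hypothetical helper (not part of the notebook): flags values that are empty
# or still look like "<PLACEHOLDER>" templates.
_PLACEHOLDER = re.compile(r"^<[A-Z0-9_]+>$")


def find_unset_config(config: dict) -> list:
    """Return the sorted names of entries that are empty or still placeholders."""
    return sorted(
        name
        for name, value in config.items()
        if not value or _PLACEHOLDER.match(str(value))
    )


# Illustrative values only; in the notebook you would pass the real variables.
example_config = {
    "subscription_id": "<SUBSCRIPTION_ID>",
    "resource_group": "my-resource-group",
    "workspace": "",
}
print(find_unset_config(example_config))
```

Any name the helper returns still needs a real value before you continue.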

## Connect to Azure Machine Learning Workspace

Connect to the Azure ML workspace using `MLClient`. Ensure the configuration values in the previous cell are correct before running this code.


In [None]:
# Create one credential and reuse it for both clients
credential = DefaultAzureCredential()

# Initialize the Azure ML client
ml_client = MLClient(
    credential,
    subscription_id,
    resource_group,
    workspace,
)

if enable_blob_upload:
    # Initialize a container client for the target blob container
    blob_client = BlobServiceClient(
        account_url=f"https://{azure_storage_account_name}.blob.core.windows.net",
        credential=credential,
    ).get_container_client(container_name)

## Compute Hash for Dataset

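Compute a SHA-256 hash over the contents of every file in the dataset folder. Two folders with identical file contents produce the same hash, which makes it possible to check whether the dataset is already registered in Azure ML.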


In [None]:
# Compute the hash of the asset folder
hash_algo = hashlib.sha256()
for root, dirs, files in os.walk(asset_path):
    dirs.sort()  # Visit subfolders in a fixed order so the hash is reproducible
    for file in sorted(files):  # Sort files for consistent hash
        file_path = os.path.join(root, file)
        with open(file_path, "rb") as f:
            # Read in chunks so large files are not loaded into memory at once
            for chunk in iter(lambda: f.read(4096), b""):
                hash_algo.update(chunk)
asset_hash = hash_algo.hexdigest()
print(f"Computed hash: {asset_hash}")
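
The hash in the cell above covers file contents only, so renaming or moving a file does not change it. If that matters for your use case, one possible variant also mixes each file's relative path into the digest. This is a sketch of that variant, not the notebook's method; `hash_folder` and `_make_folder` are hypothetical names, and the hashes it produces are not comparable with the ones computed above.

```python
import hashlib
import os
import tempfile


def hash_folder(folder: str) -> str:
    """SHA-256 over the relative path and contents of every file in `folder`."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(folder):
        dirs.sort()  # fixed traversal order
        for name in sorted(files):
            path = os.path.join(root, name)
            # Normalise separators so the hash is the same on every OS
            rel = os.path.relpath(path, folder).replace(os.sep, "/")
            digest.update(rel.encode("utf-8"))
            digest.update(b"\0")  # separator between path and contents
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()


def _make_folder(file_name: str) -> str:
    """Create a temporary folder holding one small file (demo data only)."""
    folder = tempfile.mkdtemp()
    with open(os.path.join(folder, file_name), "wb") as f:
        f.write(b"x,y\n1,2\n")
    return folder


hash_a = hash_folder(_make_folder("train.csv"))
hash_b = hash_folder(_make_folder("train.csv"))
hash_c = hash_folder(_make_folder("test.csv"))
print(hash_a == hash_b)  # same file name and contents
print(hash_a == hash_c)  # same contents, different file name
```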

## Check for Existing Asset in Azure ML

Check if a dataset with the same hash already exists in the Azure ML workspace.


In [None]:
# Check if an asset with this hash already exists in Azure ML
asset_exists = False
existing_asset_info = None
for asset in ml_client.data.list():
    # List every version registered under this asset name
    for asset_version_info in ml_client.data.list(name=asset.name):
        # Guard against assets that have no tags at all
        if (asset_version_info.tags or {}).get("hash") == asset_hash:
            asset_exists = True
            existing_asset_info = {
                "asset_name": asset_version_info.name,
                "asset_version": asset_version_info.version,
            }
            break
    if asset_exists:
        break

if asset_exists:
    print(f"Asset with hash {asset_hash} already exists in the workspace.")
    print(
        f"Asset name: {existing_asset_info['asset_name']}, version: {existing_asset_info['asset_version']}"
    )
else:
    print(
        "No existing asset found with the same hash. Uploading and registering the asset."
    )
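
The matching logic in the cell above can be tried without a workspace. The sketch below runs the same tag comparison against stand-in objects; `find_by_hash` and `fake_versions` are hypothetical and only mimic the `name`, `version`, and `tags` attributes that the cell reads from each listed asset.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def find_by_hash(versions: Iterable, target_hash: str) -> Optional[dict]:
    """Return name/version of the first item whose 'hash' tag matches, else None."""
    for item in versions:
        tags = getattr(item, "tags", None) or {}  # tolerate missing tags
        if tags.get("hash") == target_hash:
            return {"asset_name": item.name, "asset_version": item.version}
    return None


# Stand-ins for the objects returned when listing data assets (illustrative only)
fake_versions = [
    SimpleNamespace(name="sales", version="1", tags={"hash": "aaa"}),
    SimpleNamespace(name="sales", version="2", tags=None),
    SimpleNamespace(name="sales", version="3", tags={"hash": "bbb"}),
]

match = find_by_hash(fake_versions, "bbb")
print(match)
print(find_by_hash(fake_versions, "zzz"))
```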

## Upload to Azure Blob Storage (If Not Exists)

If the dataset doesn't already exist in Azure ML, upload it to Azure Blob Storage. The upload runs only when `enable_blob_upload` is `True`.


In [None]:
if not asset_exists:
    # Determine the next version number
    try:
        latest = ml_client.data.get(name=asset_name, label="latest")
        latest_version = str(int(latest.version) + 1) if latest else "1"
    except Exception as e:
        # Also reached when no asset with this name has been registered yet
        print(f"Could not get the latest version ({e}); starting at version 1.")
        latest_version = "1"

    # Upload files to Azure Blob Storage
    unique_folder_path = f"{asset_name}_{latest_version}"
    if enable_blob_upload:
        print(f"Uploading files from {asset_path} to {unique_folder_path}")
        for root, _, files in os.walk(asset_path):
            for file_name in files:
                file_path = os.path.join(root, file_name)
                # Blob names use forward slashes, even when running on Windows
                relative_path = os.path.relpath(file_path, asset_path)
                blob_path = f"{unique_folder_path}/{relative_path.replace(os.sep, '/')}"
                blob_client_instance = blob_client.get_blob_client(blob_path)
                with open(file_path, "rb") as data:
                    blob_client_instance.upload_blob(data, overwrite=True)
                print(f"Uploaded {file_path} to {blob_path}")
    else:
        print("enable_blob_upload is False; skipping the upload.")
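
Blob names are separated by forward slashes regardless of the local operating system, so building them with OS-specific path joins can produce backslashes on Windows. A small sketch of an OS-independent way to derive the blob name follows; `to_blob_path` is a hypothetical helper and the paths are illustrative.

```python
import posixpath
from pathlib import PurePath


def to_blob_path(prefix: str, file_path: str, base_dir: str) -> str:
    """Map a local file path to a forward-slash blob name under `prefix`."""
    relative = PurePath(file_path).relative_to(base_dir)
    # Join the individual path components with "/" no matter the local OS
    return posixpath.join(prefix, *relative.parts)


print(to_blob_path("asset_name_1", "/data/sample/sub/a.csv", "/data/sample"))
```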

## Register the Dataset in Azure ML

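After uploading to Blob Storage, register the dataset as an asset in Azure ML. The asset is registered against the Blob Storage URL even when `enable_blob_upload` is `False`, so in that case make sure the data already exists at that location.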


In [None]:
if not asset_exists:
    # Register the asset in Azure ML
    blob_url = f"https://{azure_storage_account_name}.blob.core.windows.net/{container_name}/{unique_folder_path}"
    data_asset = Data(
        path=blob_url,
        type=AssetTypes.URI_FOLDER,
        description=asset_description,
        name=asset_name,
        tags={"hash": asset_hash},  # Tagging the asset with the computed hash
    )

    registered_asset = ml_client.data.create_or_update(data_asset)
    print(
        f"New {asset_type} registered in the workspace: {asset_name} with version {registered_asset.version}"
    )
    existing_asset_info = {
        "asset_name": registered_asset.name,
        "asset_version": registered_asset.version,
    }

## Results

If the asset was uploaded and registered, or if it already existed, the asset name and version are displayed below.

**You can use these results to add a tag to the AML job, creating a link to the dataset used.**

In [None]:
if existing_asset_info:
    print(
        f"Asset name: {existing_asset_info['asset_name']}, version: {existing_asset_info['asset_version']}"
    )
else:
    print("No action taken.")
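
To link a job to the dataset it used, the name and version reported above can be turned into a small tag dictionary. This is a minimal sketch: `dataset_job_tags` is a hypothetical helper, and the tag keys `dataset_name` and `dataset_version` are naming assumptions, not an Azure ML convention.

```python
def dataset_job_tags(asset_info: dict) -> dict:
    """Build job tags that point back to the registered dataset version."""
    return {
        "dataset_name": str(asset_info["asset_name"]),
        "dataset_version": str(asset_info["asset_version"]),
    }


# Illustrative values; in the notebook you would pass existing_asset_info
example_info = {"asset_name": "asset_name", "asset_version": "3"}
job_tags = dataset_job_tags(example_info)
print(job_tags)
```

A dictionary like this could then be supplied as the `tags` argument when you define the job, for example with `command(...)` from `azure.ai.ml`.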

## Troubleshooting

1. **Permission Issues:**  
   - Ensure that your Microsoft Entra ID (formerly Azure Active Directory) account has the appropriate permissions:
     - **Azure Machine Learning:** 'Contributor' or 'Owner' role for the resource group containing the Azure ML workspace.
     - **Storage Account:** 'Storage Blob Data Contributor' role for the specific storage account used in your operations.

2. **Path Error:**  
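   - The dataset path (`asset_path`) must point to the folder that contains your data files. The configuration cell converts it to an absolute path with `os.path.abspath`, so a relative path is resolved against the notebook's working directory. Adjust it to match your environment.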

3. **Asset Registration Issues:**  
   - If you encounter an error stating that an asset already exists, ensure that you have a unique asset name or update the existing asset version if necessary.

4. **Tagging:**  
   - Each dataset is tagged with its computed hash, and the existence check relies on this tag. Always include the `hash` tag when registering a new dataset; otherwise the notebook cannot detect it as a duplicate later.
