## 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section, we will connect to the workspace in which the job will be run.

### 1.1 Import the required libraries

In [None]:
## Import required libraries

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    OnlineRequestSettings,
    ProbeSettings,
)


### 1.2 Configure credential
We use `DefaultAzureCredential` to get access to the workspace. `DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.

If it does not work for you, see the configure credential example and the azure-identity reference documentation for other available credentials.

In [None]:
## Get credential to access workspace/registry assets

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

### 1.3 Get a handle to the workspace and the registry

We use the config file to connect to a workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](https://aka.ms/azureml-workspace-configuration).

If the config file is not available, replace the following placeholders in the code below with your own values:
- SUBSCRIPTION_ID
- RESOURCE_GROUP
- WORKSPACE_NAME

In [None]:
# Get a handle to workspace
try:
    ml_client_ws = MLClient.from_config(credential=credential)
except Exception:
    # No usable config file was found: fall back to explicit workspace details
    ml_client_ws = MLClient(
        credential,
        subscription_id="<SUBSCRIPTION_ID>",
        resource_group_name="<RESOURCE_GROUP>",
        workspace_name="<WORKSPACE_NAME>",
    )
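
`MLClient.from_config` reads the workspace details from a `config.json` file. The following is a minimal sketch of what such a file contains and how it could be written. The helper name `write_workspace_config` is hypothetical, the values are placeholders, and the file is written to a temporary directory so that no real configuration is overwritten.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values; replace them with your own subscription, resource group and workspace.
workspace_config = {
    "subscription_id": "<SUBSCRIPTION_ID>",
    "resource_group": "<RESOURCE_GROUP>",
    "workspace_name": "<WORKSPACE_NAME>",
}


def write_workspace_config(directory: Path, config: dict) -> Path:
    """Write config.json into `directory` and return the path of the file."""
    config_path = directory / "config.json"
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


# Demonstration in a temporary directory; read the file back to check its content.
with tempfile.TemporaryDirectory() as tmp:
    path = write_workspace_config(Path(tmp), workspace_config)
    loaded = json.loads(path.read_text())

print(loaded)
```

In practice the file would be placed in the notebook's working directory (or one of its parent directories) instead of a temporary one.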

## 2. Select the model that needs to be deployed 

### 2.1 Models can be selected from a registry, a workspace, or the local system

The paths that SDK v2 supports are listed [here](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-models?view=azureml-api-2&tabs=cli%2Cuse-local#supported-paths).

Examples of passing a model from a registry, a workspace, or the local system:

- Registry - "azureml://registries/nvidia-ai/models/Nemotron-3-8B-4k/versions/3"

- Workspace - "azureml://locations/westus3/workspaces/f713c34a-3dbd-45ab-b91f-843ab890ce2f/models/GPT-2B/versions/1" 

- Local path - `Model(path="./path/to/local_file", name="<MODEL_NAME>", version="<MODEL_VERSION>", type="triton_model")`

#### Note: The model type should be "triton_model".

In [None]:
## Select the model from the nvidia-ai registry

model = "azureml://registries/nvidia-ai/models/Nemotron-3-8B-QA-4k/versions/1"
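
A registry model URI encodes the registry name, the model name, and the version. The sketch below splits such a URI into its parts, which can help catch typos before a deployment is started. The helper `parse_registry_model_uri` and its regular expression are illustrative assumptions, not part of the Azure SDK.

```python
import re
from typing import Dict

# Hypothetical pattern for URIs of the form
# azureml://registries/<registry>/models/<name>/versions/<version>
REGISTRY_MODEL_URI = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)"
    r"/models/(?P<name>[^/]+)"
    r"/versions/(?P<version>[^/]+)$"
)


def parse_registry_model_uri(uri: str) -> Dict[str, str]:
    """Return the registry, name and version encoded in a registry model URI."""
    match = REGISTRY_MODEL_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a registry model URI: {uri!r}")
    return match.groupdict()


parts = parse_registry_model_uri(
    "azureml://registries/nvidia-ai/models/Nemotron-3-8B-QA-4k/versions/1"
)
print(parts)
```

Workspace URIs and local paths use a different shape, so the helper rejects them with a `ValueError`.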

## 3. Select the Environment to deploy the model. 

We provide an environment in the nvidia-ai registry that supports deploying nvidia-triton models. You can also create your own [environment](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?view=azureml-api-2&tabs=cli) to support the deployment.

In [None]:
environment = "azureml://registries/nvidia-ai/environments/nemo-inference/labels/latest"

## 4. Create the endpoint in workspace for deploying the model


In [None]:
endpoint_name = "nvidia-model-endpoint-test"
endpoint = ManagedOnlineEndpoint(name=endpoint_name, auth_mode="aml_token")
ml_client_ws.online_endpoints.begin_create_or_update(endpoint).wait()
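
Endpoint names can collide with existing endpoints, so a common pattern is to append a timestamp to a fixed prefix. The sketch below does that with a hypothetical helper, `make_endpoint_name`. The naming rules it enforces (at most 32 characters, starting with a letter, containing only letters, digits, and hyphens) are an assumption and should be checked against the current Azure documentation.

```python
import re
import time

MAX_ENDPOINT_NAME_LENGTH = 32  # assumed limit, see the note above


def make_endpoint_name(prefix: str, max_len: int = MAX_ENDPOINT_NAME_LENGTH) -> str:
    """Build '<prefix>-<unix timestamp>', trimming the prefix so the name fits max_len."""
    suffix = str(int(time.time()))
    # Leave room for the hyphen and the timestamp suffix.
    trimmed = prefix[: max_len - len(suffix) - 1].rstrip("-")
    name = f"{trimmed}-{suffix}"
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9-]*", name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


unique_endpoint_name = make_endpoint_name("nvidia-model-endpoint-test")
print(unique_endpoint_name)
```

The resulting string could be used in place of the fixed `endpoint_name` when the fixed name is already taken.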

## 5. Create the deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing. We will create a deployment for our endpoint using the `ManagedOnlineDeployment` class. This class allows the user to configure the following key aspects:

- `name` - Name of the deployment.
- `endpoint_name` - Name of the endpoint to create the deployment under.
- `model` - The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.
- `instance_type` - The VM size to use for the deployment. The NVIDIA models support the following instance types:
    - Standard_ND96asr_v4
    - Standard_ND96amsr_A100_v4
    - Standard_ND96amsr_v4
    - Standard_NC24ads_A100_v4
    - Standard_NC48ads_A100_v4
    - Standard_NC96ads_A100_v4
- `instance_count` - The number of instances to use for the deployment.

### 5.1 Deployment settings

In [None]:
deployment_name = "blue"

request_timeout_ms = max_queue_wait_ms = 10000
max_concurrent_requests_per_instance = 32
failure_threshold = 119
success_threshold = 1
timeout = 300
period = 300
initial_delay = 500
instance_count = 1
instance_type = "Standard_ND96amsr_A100_v4"


####################################################################################################
request_settings = OnlineRequestSettings(
    max_concurrent_requests_per_instance=max_concurrent_requests_per_instance,
    request_timeout_ms=request_timeout_ms,
    max_queue_wait_ms=max_queue_wait_ms,
)
liveness_probe_settings = ProbeSettings(
    failure_threshold=failure_threshold,
    timeout=timeout,
    period=period,
    initial_delay=initial_delay,
)
readiness_probe_settings = ProbeSettings(
    failure_threshold=failure_threshold,
    success_threshold=success_threshold,
    timeout=timeout,
    period=period,
    initial_delay=initial_delay,
)
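
The probe values above determine how long a container is given to become healthy. As a rough sketch, assuming one probe attempt every `period` seconds after `initial_delay` and ignoring the per-probe `timeout`, the window can be estimated as follows. The function name `probe_window_seconds` is illustrative.

```python
def probe_window_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Estimate the seconds that elapse before a probe is considered failed."""
    return initial_delay + period * failure_threshold


# Values taken from the deployment settings above.
window = probe_window_seconds(initial_delay=500, period=300, failure_threshold=119)
print(f"{window} s (about {window / 3600:.1f} h)")
```

With the settings used here the estimate is roughly ten hours, which leaves ample time for large models to load.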

### 5.2 Create deployment

In [None]:
## Create Deployment

deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    environment=environment,
    model=model,
    instance_type=instance_type,
    instance_count=instance_count,
    request_settings=request_settings,
    liveness_probe=liveness_probe_settings,
    readiness_probe=readiness_probe_settings,
    environment_variables={
        # Environment variable values are passed as strings
        "NVTE_FLASH_ATTN": "0",
        "NVTE_FUSED_ATTN": "0",
        "NVTE_MASKED_SOFTMAX_FUSION": "0",
    },
)

ml_client_ws.online_deployments.begin_create_or_update(deployment).wait()

## 6. Inferencing against the endpoint

### 6.1 Uncomment the lines below to install the libraries if they are not already installed

In [None]:
# ! pip install tritonclient==2.39.0
# ! pip install gevent==23.9.1
# ! pip install geventhttpclient

### 6.2 Import the libraries required to do inferencing

In [None]:
import os
import re
from functools import partial
from operator import is_not
from typing import List

import gevent.ssl

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

### 6.3 Define functions to support inferencing

In [None]:
RANDOM_SEED = 0


def prepare_tensor(name, array):
    """Wrap a NumPy array in a Triton InferInput with matching shape and dtype."""
    t = httpclient.InferInput(name, array.shape, np_to_triton_dtype(array.dtype))
    t.set_data_from_numpy(array)
    return t


def generate_inputs(
    prompt: List[List[str]],
    tokens: int = 300,
    temperature: float = 1.0,
    top_k: int = 1,
    top_p: float = 0,
    beam_width: int = 1,
    repetition_penalty: float = 1,
    length_penalty: float = 1.0,
    stream: bool = False,
) -> List[httpclient.InferInput]:
    """Create the list of input tensors for the Triton inference server."""
    query = np.array(prompt).astype(object)
    request_output_len = np.array([tokens]).astype(np.uint32).reshape((1, -1))
    runtime_top_k = np.array([top_k]).astype(np.uint32).reshape((1, -1))
    runtime_top_p = np.array([top_p]).astype(np.float32).reshape((1, -1))
    temperature_array = np.array([temperature]).astype(np.float32).reshape((1, -1))
    len_penalty = np.array([length_penalty]).astype(np.float32).reshape((1, -1))
    repetition_penalty_array = (
        np.array([repetition_penalty]).astype(np.float32).reshape((1, -1))
    )
    random_seed = np.array([RANDOM_SEED]).astype(np.uint64).reshape((1, -1))
    beam_width_array = np.array([beam_width]).astype(np.uint32).reshape((1, -1))
    streaming_data = np.array([[stream]], dtype=bool)

    inputs = [
        prepare_tensor("text_input", query),
        prepare_tensor("max_tokens", request_output_len),
        prepare_tensor("top_k", runtime_top_k),
        prepare_tensor("top_p", runtime_top_p),
        prepare_tensor("temperature", temperature_array),
        prepare_tensor("length_penalty", len_penalty),
        prepare_tensor("repetition_penalty", repetition_penalty_array),
        prepare_tensor("random_seed", random_seed),
        prepare_tensor("beam_width", beam_width_array),
        prepare_tensor("stream", streaming_data),
    ]
    return inputs

### 6.4 Set the prompt for the model type and create the input tensors to invoke the endpoint

In [None]:
PROMPT_TEMPLATE_QA = (
    "System: This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.\n"
    "{context}\n"
    "User: Please give a full and complete answer for the question. {question}\n"
    "Assistant:\n"
)

PROMPT_TEMPLATE_CHAT_STEERLM = (
    "<extra_id_0>System\n"
    "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    "<extra_id_1>User\n"
    "{prompt}\n"
    "<extra_id_1>Assistant\n"
    "<extra_id_2>quality:4,understanding:4,correctness:4,coherence:4,complexity:4,verbosity:4,toxicity:0,humor:0,creativity:0,violence:0,helpfulness:4,not_appropriate:0,hate_speech:0,sexual_content:0,fails_task:0,political_content:0,moral_judgement:0,lang:en\n"
)

PROMPT_TEMPLATE_CHAT_RLHF_SFT = (
    "<extra_id_0>System\n"
    "{system}\n"
    "<extra_id_1>User\n"
    "{prompt}\n"
    "<extra_id_1>Assistant\n"
)

In [None]:
context = "Climate change refers to long-term shifts in temperatures and weather patterns. Such shifts can be natural, due to changes in the sun’s activity or large volcanic eruptions. But since the 1800s, human activities have been the main driver of climate change, primarily due to the burning of fossil fuels like coal, oil and gas."
question = "Who is the fastest water animal?"

prompt = PROMPT_TEMPLATE_QA.format(context=context, question=question)

inputs = generate_inputs(
    [[prompt]],
    tokens=100,
    temperature=0.2,
    top_k=1,
    top_p=0,
    beam_width=1,
    repetition_penalty=1.0,
    length_penalty=1.0,
)
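
Each template expects a specific set of placeholders, and `str.format` raises a bare `KeyError` when one is missing. The sketch below lists the placeholders of a template and reports all missing ones at once before formatting. The helpers `template_fields` and `fill_template` and the shortened demo template are illustrative stand-ins, not the templates defined above.

```python
from string import Formatter
from typing import Set


def template_fields(template: str) -> Set[str]:
    """Return the names of all {placeholders} used in a format string."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def fill_template(template: str, **values: str) -> str:
    """Format the template after checking that every placeholder has a value."""
    missing = template_fields(template) - set(values)
    if missing:
        raise KeyError(f"Missing template values: {sorted(missing)}")
    return template.format(**values)


# Shortened stand-in for a context-based question answering template.
demo_template = "{context}\nUser: {question}\nAssistant:\n"
filled = fill_template(
    demo_template,
    context="Water boils at 100 degrees Celsius at sea level.",
    question="At what temperature does water boil?",
)
print(filled)
```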

### 6.5 Get the endpoint access token and set up the HTTP client for inferencing

In [None]:
endpoint = ml_client_ws.online_endpoints.get(name=endpoint_name)
keys = ml_client_ws.online_endpoints.get_keys(endpoint_name)

# The endpoint uses auth_mode="aml_token", so get_keys returns an access token
api_key = keys.access_token
url = endpoint.scoring_uri.replace("https://", "")
client = httpclient.InferenceServerClient(
    url=url,
    ssl=True,
    ssl_context_factory=gevent.ssl._create_default_https_context,
    concurrency=1000,
)

headers = {
    "Content-Type": "application/json",
    "Authorization": ("Bearer " + api_key),
    "azureml-model-deployment": deployment_name,
}
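
The Triton HTTP client takes the host without the URL scheme, and every request carries the bearer token and the target deployment. The sketch below isolates those two steps in small helpers so they can be checked without a live endpoint. The helper names and the example host are hypothetical.

```python
from typing import Dict
from urllib.parse import urlsplit


def triton_url_from_scoring_uri(scoring_uri: str) -> str:
    """Drop the 'https://' scheme and keep the host and path of the scoring URI."""
    parts = urlsplit(scoring_uri)
    if parts.scheme != "https":
        raise ValueError(f"Expected an https scoring URI, got: {scoring_uri!r}")
    return parts.netloc + parts.path


def build_headers(token: str, deployment: str) -> Dict[str, str]:
    """Build the request headers used for calls to the online endpoint."""
    return {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + token,
        "azureml-model-deployment": deployment,
    }


# Hypothetical values for illustration only.
demo_url = triton_url_from_scoring_uri("https://example-endpoint.westus3.inference.ml.azure.com/")
demo_headers = build_headers("<ACCESS_TOKEN>", "blue")
print(demo_url)
```

Unlike a plain string replacement, the helper fails early when the scoring URI does not use HTTPS.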

### 6.6 Invoke the endpoint for inferencing the model

In [None]:
# Check status of triton server
health_ctx = client.is_server_ready(headers=headers)
print("Is server ready - {}".format(health_ctx))

# Check status of model
model_name = "ensemble"
status_ctx = client.is_model_ready(model_name, "1", headers)
print("Is model ready - {}".format(status_ctx))

result = client.infer(model_name, inputs=inputs, headers=headers)
result_str = "".join(
    [val.decode("utf-8") for val in result.as_numpy("text_output").tolist()]
)

print(result_str)
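
Depending on the output shape, converting the result array to a list may yield a flat or a nested list of byte strings. The sketch below decodes either form, using plain Python lists in place of a real Triton response. The helpers `iter_bytes` and `decode_text_output` are illustrative assumptions.

```python
from typing import Any, Iterator


def iter_bytes(values: Any) -> Iterator[bytes]:
    """Yield every bytes object found in an arbitrarily nested list or tuple."""
    if isinstance(values, (bytes, bytearray)):
        yield bytes(values)
    elif isinstance(values, (list, tuple)):
        for item in values:
            yield from iter_bytes(item)
    else:
        raise TypeError(f"Unsupported output element: {type(values).__name__}")


def decode_text_output(values: Any) -> str:
    """Decode all byte strings as UTF-8 and join them into one string."""
    return "".join(chunk.decode("utf-8") for chunk in iter_bytes(values))


# Stand-in for result.as_numpy("text_output").tolist()
decoded = decode_text_output([[b"Hello, "], [b"world"]])
print(decoded)
```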