{
"cells": [
{
"cell_type": "markdown",
"id": "7b29bbb0",
"metadata": {},
"source": [
"## 1. Connect to Azure Machine Learning Workspace\n",
"\n",
"The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section, we will connect to the workspace in which the job will be run.\n",
"\n",
"### 1.1 Import the required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "744d2ab0",
"metadata": {},
"outputs": [],
"source": [
"## Import required libraries\n",
"\n",
"from azure.ai.ml import MLClient\n",
"from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n",
"\n",
"from azure.ai.ml.entities import (\n",
" ManagedOnlineEndpoint,\n",
" ManagedOnlineDeployment,\n",
" OnlineRequestSettings,\n",
" ProbeSettings,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "15546847",
"metadata": {},
"source": [
"\n",
"### 1.2 Configure credential\n",
"We are using DefaultAzureCredential to get access to the workspace. DefaultAzureCredential should be capable of handling most Azure SDK authentication scenarios.\n",
"\n",
"Reference for more available credentials if it does not work for you: configure credential example, azure-identity reference doc."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2bde1642",
"metadata": {},
"outputs": [],
"source": [
"## Get credential to access workspace/registry assets\n",
"\n",
"try:\n",
" credential = DefaultAzureCredential()\n",
" # Check if given credential can get token successfully.\n",
" credential.get_token(\"https://management.azure.com/.default\")\n",
"except Exception as ex:\n",
" # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential not work\n",
" credential = InteractiveBrowserCredential()"
]
},
{
"cell_type": "markdown",
"id": "df25f48f",
"metadata": {},
"source": [
"### 1.3 Get a handle to the workspace and the registry\n",
"\n",
"We use the config file to connect to a workspace. The Azure ML workspace should be configured with a computer cluster. [Check this notebook for configure a workspace](https://aka.ms/azureml-workspace-configuration)\n",
"\n",
"If config file is not available user can update following parameters in place holders\n",
"- SUBSCRIPTION_ID\n",
"- RESOURCE_GROUP\n",
"- WORKSPACE_NAME"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "845a3e9e",
"metadata": {},
"outputs": [],
"source": [
"# Get a handle to workspace\n",
"try:\n",
" ml_client_ws = MLClient.from_config(credential=credential)\n",
"except:\n",
" ml_client_ws = MLClient(\n",
" credential,\n",
" subscription_id=\"<SUBSCRIPTION_ID>\",\n",
" resource_group_name=\"<RESOURCE_GROUP>\",\n",
" workspace_name=\"<WORKSPACE_NAME>\",\n",
" )"
]
},
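{
"cell_type": "markdown",
"id": "a1f2c3d4",
"metadata": {},
"source": [
"As a quick sanity check, you can print the workspace the client is connected to."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2e3d4c5",
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: confirm which workspace the client points at\n",
"print(ml_client_ws.workspace_name)"
]
},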
{
"cell_type": "markdown",
"id": "bfc07de3",
"metadata": {},
"source": [
"## 2. Select the model that needs to be deployed "
]
},
{
"cell_type": "markdown",
"id": "51213074",
"metadata": {},
"source": [
"### 2.1 Models can be selected either from registry, workspace or from local system. \n",
"\n",
"Please check the different paths that sdk-v2 supports [here.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-models?view=azureml-api-2&tabs=cli%2Cuse-local#supported-paths)\n",
"\n",
"Examples for passing model from different workspace, registry, local system\n",
"\n",
"- Registry - \"azureml://registries/nvidia-ai/models/Nemotron-3-8B-4k/versions/3\"\n",
"\n",
"- Workspace - \"azureml://locations/westus3/workspaces/f713c34a-3dbd-45ab-b91f-843ab890ce2f/models/GPT-2B/versions/1\" \n",
"\n",
"- Local path - Model(\n",
" path = \"./path/to/local_file\", \n",
" name = \"<MODEL_NAME>, \n",
" version = \"<MODEL_VERSION>\", \n",
" type = \"triton_model\"\n",
" )\n",
" \n",
" \n",
" \n",
" #### Note:- Type of model should be \"triton_model\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01924f1f",
"metadata": {},
"outputs": [],
"source": [
"### Selecting the model fron nvidia-ai registy\n",
"\n",
"model = \"azureml://registries/nvidia-ai/models/Nemotron-3-8B-QA-4k/versions/1\""
]
},
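{
"cell_type": "markdown",
"id": "c4d5e6f7",
"metadata": {},
"source": [
"If your Triton model lives on the local file system instead, the sketch below registers it in the workspace first. The `path`, `name`, and `version` values are placeholders; uncomment and fill them in to use it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5e6f7a8",
"metadata": {},
"outputs": [],
"source": [
"## Optional sketch: register a local Triton model (path/name/version are placeholders)\n",
"\n",
"# from azure.ai.ml.entities import Model\n",
"# from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# local_model = Model(\n",
"#     path=\"./path/to/local_file\",\n",
"#     name=\"<MODEL_NAME>\",\n",
"#     version=\"<MODEL_VERSION>\",\n",
"#     type=AssetTypes.TRITON_MODEL,  # resolves to \"triton_model\"\n",
"# )\n",
"# model = ml_client_ws.models.create_or_update(local_model).id"
]
},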
{
"cell_type": "markdown",
"id": "59714468",
"metadata": {},
"source": [
"## 3. Select the Environment to deploy the model. \n",
"\n",
"We have provided the environment to support the deployment of nvidia-triton models in the nvidia-ai registry.User can create their own [environment](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?view=azureml-api-2&tabs=cli) to support deployment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "956c3611",
"metadata": {},
"outputs": [],
"source": [
"environment = \"azureml://registries/nvidia-ai/environments/nemo-inference/labels/latest\""
]
},
{
"cell_type": "markdown",
"id": "2b407659",
"metadata": {},
"source": [
"## 4. Create the endpoint in workspace for deploying the model\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95b3e2c4",
"metadata": {},
"outputs": [],
"source": [
"endpoint_name = \"nvidia-model-endpoint-test\"\n",
"endpoint = ManagedOnlineEndpoint(name=endpoint_name, auth_mode=\"aml_token\")\n",
"ml_client_ws.online_endpoints.begin_create_or_update(endpoint).wait()"
]
},
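{
"cell_type": "markdown",
"id": "e6f7a8b9",
"metadata": {},
"source": [
"Endpoint creation can take a few minutes. You can confirm it provisioned successfully before creating the deployment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7a8b9c0",
"metadata": {},
"outputs": [],
"source": [
"# Confirm the endpoint provisioned successfully\n",
"endpoint = ml_client_ws.online_endpoints.get(name=endpoint_name)\n",
"print(endpoint.provisioning_state)  # expect \"Succeeded\""
]
},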
{
"cell_type": "markdown",
"id": "4cbc2f5b",
"metadata": {},
"source": [
"## 5. Create the deployment\n",
"\n",
"A deployment is a set of resources required for hosting the model that does the actual inferencing. We will create a deployment for our endpoint using the `ManagedOnlineDeployment` class. This class allows user to configure the following key aspects.\n",
"\n",
"- `name` - Name of the deployment.\n",
"- `endpoint_name` - Name of the endpoint to create the deployment under.\n",
"- `model` - The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.\n",
"- `instance_type` - The VM size to use for the deployment. Compute Instance that Nvidia-models supports are \n",
" Standard_ND96asr_v4, \n",
" Standard_ND96amsr_A100_v, \n",
" Standard_ND96amsr_v4, \n",
" Standard_NC24ads_A100_v4, \n",
" Standard_NC48ads_A100_v4, \n",
" Standard_NC96ads_A100_v4\n",
"- `instance_count` - The number of instances to use for the deployment"
]
},
{
"cell_type": "markdown",
"id": "aa03d547",
"metadata": {},
"source": [
"### 5.1 Deployment settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "913f6ea4",
"metadata": {},
"outputs": [],
"source": [
"deployment_name = \"blue\"\n",
"\n",
"request_timeout_ms = max_queue_wait_ms = 10000\n",
"max_concurrent_requests_per_instance = 32\n",
"failure_threshold = 119\n",
"success_threshold = 1\n",
"timeout = 300\n",
"period = 300\n",
"initial_delay = 500\n",
"instance_count = 1\n",
"instance_type = \"Standard_ND96amsr_A100_v4\"\n",
"\n",
"\n",
"####################################################################################################\n",
"request_settings = OnlineRequestSettings(\n",
" max_concurrent_requests_per_instance=max_concurrent_requests_per_instance,\n",
" request_timeout_ms=request_timeout_ms,\n",
" max_queue_wait_ms=max_queue_wait_ms,\n",
")\n",
"liveness_probe_settings = ProbeSettings(\n",
" failure_threshold=failure_threshold,\n",
" timeout=timeout,\n",
" period=period,\n",
" initial_delay=initial_delay,\n",
")\n",
"readiness_probe_settings = ProbeSettings(\n",
" failure_threshold=failure_threshold,\n",
" success_threshold=success_threshold,\n",
" timeout=timeout,\n",
" period=period,\n",
" initial_delay=initial_delay,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "046b4d45",
"metadata": {},
"source": [
"### 5.2 Create deployment"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e11dc7c2",
"metadata": {},
"outputs": [],
"source": [
"## Create Deployment\n",
"\n",
"deployment = ManagedOnlineDeployment(\n",
" name=deployment_name,\n",
" endpoint_name=endpoint_name,\n",
" environment=environment,\n",
" model=model,\n",
" instance_type=instance_type,\n",
" instance_count=instance_count,\n",
" request_settings=request_settings,\n",
" liveness_probe=liveness_probe_settings,\n",
" readiness_probe=readiness_probe_settings,\n",
" environment_variables={\n",
" \"NVTE_FLASH_ATTN\": 0,\n",
" \"NVTE_FUSED_ATTN\": 0,\n",
" \"NVTE_MASKED_SOFTMAX_FUSION\": 0,\n",
" },\n",
")\n",
"\n",
"ml_client_ws.online_deployments.begin_create_or_update(deployment).wait()"
]
},
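{
"cell_type": "markdown",
"id": "a8b9c0d1",
"metadata": {},
"source": [
"### 5.3 Verify the deployment (optional)\n",
"\n",
"Once the deployment finishes, you can route traffic to it and inspect the container logs. Routing traffic is optional here because the inference calls below target the deployment directly via the `azureml-model-deployment` header."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9c0d1e2",
"metadata": {},
"outputs": [],
"source": [
"## Optional: route traffic to the new deployment and check its logs\n",
"\n",
"endpoint = ml_client_ws.online_endpoints.get(name=endpoint_name)\n",
"endpoint.traffic = {deployment_name: 100}\n",
"ml_client_ws.online_endpoints.begin_create_or_update(endpoint).wait()\n",
"\n",
"logs = ml_client_ws.online_deployments.get_logs(\n",
"    name=deployment_name, endpoint_name=endpoint_name, lines=50\n",
")\n",
"print(logs)"
]
},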
{
"cell_type": "markdown",
"id": "19617757",
"metadata": {},
"source": [
"## 6. Inferencing against the endpoint"
]
},
{
"cell_type": "markdown",
"id": "1db7254a",
"metadata": {},
"source": [
"### 6.1 Comment out below lines to install the libraries if not installed in the system"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a46d36cb",
"metadata": {},
"outputs": [],
"source": [
"# ! pip install tritonclient==2.39.0\n",
"# ! pip install gevent==23.9.1\n",
"# ! pip install geventhttpclient"
]
},
{
"cell_type": "markdown",
"id": "a1062e32",
"metadata": {},
"source": [
"### 6.2 Import the libraries required to do inferencing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73563c8e",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"from functools import partial\n",
"from operator import is_not\n",
"from typing import List\n",
"import re\n",
"import gevent.ssl\n",
"\n",
"import numpy as np\n",
"import tritonclient.http as httpclient\n",
"from tritonclient.utils import np_to_triton_dtype"
]
},
{
"cell_type": "markdown",
"id": "234e4e40",
"metadata": {},
"source": [
"### 6.3 Define functions to support inferencing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e370b0be",
"metadata": {},
"outputs": [],
"source": [
"RANDOM_SEED = 0\n",
"\n",
"\n",
"def prepare_tensor(name, input):\n",
" t = httpclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))\n",
" t.set_data_from_numpy(input)\n",
" return t\n",
"\n",
"\n",
"def generate_inputs(\n",
" prompt: str,\n",
" tokens: int = 300,\n",
" temperature: float = 1.0,\n",
" top_k: float = 1,\n",
" top_p: float = 0,\n",
" beam_width: int = 1,\n",
" repetition_penalty: float = 1,\n",
" length_penalty: float = 1.0,\n",
" stream: bool = False,\n",
") -> httpclient.InferInput:\n",
" \"\"\"Create the input for the triton inference server.\"\"\"\n",
" query = np.array(prompt).astype(object)\n",
" request_output_len = np.array([tokens]).astype(np.uint32).reshape((1, -1))\n",
" runtime_top_k = np.array([top_k]).astype(np.uint32).reshape((1, -1))\n",
" runtime_top_p = np.array([top_p]).astype(np.float32).reshape((1, -1))\n",
" temperature_array = np.array([temperature]).astype(np.float32).reshape((1, -1))\n",
" len_penalty = np.array([length_penalty]).astype(np.float32).reshape((1, -1))\n",
" repetition_penalty_array = (\n",
" np.array([repetition_penalty]).astype(np.float32).reshape((1, -1))\n",
" )\n",
" random_seed = np.array([RANDOM_SEED]).astype(np.uint64).reshape((1, -1))\n",
" beam_width_array = np.array([beam_width]).astype(np.uint32).reshape((1, -1))\n",
" streaming_data = np.array([[stream]], dtype=bool)\n",
"\n",
" inputs = [\n",
" prepare_tensor(\"text_input\", query),\n",
" prepare_tensor(\"max_tokens\", request_output_len),\n",
" prepare_tensor(\"top_k\", runtime_top_k),\n",
" prepare_tensor(\"top_p\", runtime_top_p),\n",
" prepare_tensor(\"temperature\", temperature_array),\n",
" prepare_tensor(\"length_penalty\", len_penalty),\n",
" prepare_tensor(\"repetition_penalty\", repetition_penalty_array),\n",
" prepare_tensor(\"random_seed\", random_seed),\n",
" prepare_tensor(\"beam_width\", beam_width_array),\n",
" prepare_tensor(\"stream\", streaming_data),\n",
" ]\n",
" return inputs"
]
},
{
"cell_type": "markdown",
"id": "680e72ed",
"metadata": {},
"source": [
"### 6.4 Set the Prompt as per the model type and create input tensor to invoke the endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a14e77b6",
"metadata": {},
"outputs": [],
"source": [
"PROMPT_TEMPLATE_QA = (\n",
" \"System: This is a chat between a user and an artificial intelligence assistant.\"\n",
" \"The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.\\n\"\n",
" \"{context}\\n\"\n",
" \"User: Please give a full and complete answer for the question. {question}\\n\"\n",
" \"Assistant:\\n\"\n",
")\n",
"\n",
"PROMPT_TEMPLATE_CHAT_STEERLM = (\n",
" \"<extra_id_0>System\\n\"\n",
" \"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\\n\"\n",
" \"<extra_id_1>User\\n\"\n",
" \"{prompt}\\n\"\n",
" \"<extra_id_1>Assistant\\n\"\n",
" \"<extra_id_2>quality:4,understanding:4,correctness:4,coherence:4,complexity:4,verbosity:4,toxicity:0,humor:0,creativity:0,violence:0,helpfulness:4,not_appropriate:0,hate_speech:0,sexual_content:0,fails_task:0,political_content:0,moral_judgement:0,lang:en\\n\"\n",
")\n",
"\n",
"PROMPT_TEMPLATE_CHAT_RLHF_SFT = (\n",
" \"<extra_id_0>System\\n\"\n",
" \"{system}\\n\"\n",
" \"<extra_id_1>User{prompt}\\n\"\n",
" \"<extra_id_1>Assistant\\n\"\n",
")"
]
},
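{
"cell_type": "markdown",
"id": "c0d1e2f3",
"metadata": {},
"source": [
"The QA template matches the Nemotron-3-8B-QA-4k model deployed above and is used below. The two chat templates apply only if you deploy a chat (SteerLM) or RLHF/SFT variant instead; a sketch of formatting them follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1e2f3a4",
"metadata": {},
"outputs": [],
"source": [
"## Optional sketch: formatting the chat templates (only for chat model variants)\n",
"\n",
"# steerlm_prompt = PROMPT_TEMPLATE_CHAT_STEERLM.format(prompt=\"What is climate change?\")\n",
"# rlhf_sft_prompt = PROMPT_TEMPLATE_CHAT_RLHF_SFT.format(\n",
"#     system=\"You are a helpful assistant.\",\n",
"#     prompt=\"What is climate change?\",\n",
"# )"
]
},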
{
"cell_type": "code",
"execution_count": null,
"id": "2950e215",
"metadata": {},
"outputs": [],
"source": [
"context = \"Climate change refers to long-term shifts in temperatures and weather patterns. Such shifts can be natural, due to changes in the sun’s activity or large volcanic eruptions. But since the 1800s, human activities have been the main driver of climate change, primarily due to the burning of fossil fuels like coal, oil and gas.\"\n",
"question = \"Who is the fastest water animal?\"\n",
"\n",
"prompt = PROMPT_TEMPLATE_QA.format(context=context, question=question)\n",
"\n",
"inputs = generate_inputs(\n",
" [[prompt]],\n",
" tokens=100,\n",
" temperature=0.2,\n",
" top_k=1,\n",
" top_p=0,\n",
" beam_width=1,\n",
" repetition_penalty=1.0,\n",
" length_penalty=1.0,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8a7791d4",
"metadata": {},
"source": [
"### 6.5 Get the endpoint api_key and set up http Client for Inferencing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3da29869",
"metadata": {},
"outputs": [],
"source": [
"endpoint = ml_client_ws.online_endpoints.get(name=endpoint_name)\n",
"keys = ml_client_ws.online_endpoints.get_keys(endpoint_name)\n",
"\n",
"api_key = keys.__dict__[\"access_token\"]\n",
"url = endpoint.scoring_uri.replace(\"https://\", \"\")\n",
"client = httpclient.InferenceServerClient(\n",
" url=url,\n",
" ssl=True,\n",
" ssl_context_factory=gevent.ssl._create_default_https_context,\n",
" concurrency=1000,\n",
")\n",
"\n",
"headers = {\n",
" \"Content-Type\": \"application/json\",\n",
" \"Authorization\": (\"Bearer \" + api_key),\n",
" \"azureml-model-deployment\": deployment_name,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "90d1bfd8",
"metadata": {},
"source": [
"### 6.6 Invoke the endpoint for inferencing the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46c104e8",
"metadata": {},
"outputs": [],
"source": [
"# Check status of triton server\n",
"health_ctx = client.is_server_ready(headers=headers)\n",
"print(\"Is server ready - {}\".format(health_ctx))\n",
"\n",
"# Check status of model\n",
"model_name = \"ensemble\"\n",
"status_ctx = client.is_model_ready(model_name, \"1\", headers)\n",
"print(\"Is model ready - {}\".format(status_ctx))\n",
"\n",
"result = client.infer(model_name, inputs=inputs, headers=headers)\n",
"result_str = \"\".join(\n",
" [val.decode(\"utf-8\") for val in result.as_numpy(\"text_output\").tolist()]\n",
")\n",
"\n",
"print(result_str)"
]
}
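,
{
"cell_type": "markdown",
"id": "e2f3a4b5",
"metadata": {},
"source": [
"## 7. Clean up resources\n",
"\n",
"Delete the endpoint (which also deletes its deployments) when you are done, to avoid ongoing charges for the GPU instances. The line is commented out to prevent accidental deletion."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a4b5c6",
"metadata": {},
"outputs": [],
"source": [
"# ml_client_ws.online_endpoints.begin_delete(name=endpoint_name).wait()"
]
}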
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}