{
"cells": [
{
"cell_type": "markdown",
"id": "796cf06d-e4d3-422e-8301-a961e9520f52",
"metadata": {},
"source": [
"# Open Source LLM serving using the Azure ML Python SDK\n",
"\n",
"[Note] Please use `Python 3.10 - SDK v2 (azureml_py310_sdkv2)` conda environment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "538d1c11-a30d-4342-85d5-ba59e1890b76",
"metadata": {},
"outputs": [],
"source": [
"%store -r job_name\n",
"try:\n",
" job_name\n",
"except NameError:\n",
" print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")\n",
" print(\"[ERROR] Please run the previous notebook (model training) again.\")\n",
" print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")"
]
},
{
"cell_type": "markdown",
"id": "bb139b1d-500c-4ef2-9af3-728f2a5ea05f",
"metadata": {},
"source": [
"## 1. Load config file\n",
"\n",
"---\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44ef0e90",
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import os, sys\n",
"lab_prep_dir = os.getcwd().split(\"slm-innovator-lab\")[0] + \"slm-innovator-lab/0_lab_preparation\"\n",
"sys.path.append(os.path.abspath(lab_prep_dir))\n",
"\n",
"from common import check_kernel\n",
"check_kernel()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5234c47-b3e5-4218-8a98-3988c8991643",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import yaml\n",
"from datetime import datetime\n",
"snapshot_date = datetime.now().strftime(\"%Y-%m-%d\")\n",
"\n",
"with open('config.yml') as f:\n",
" d = yaml.load(f, Loader=yaml.FullLoader)\n",
" \n",
"AZURE_SUBSCRIPTION_ID = d['config']['AZURE_SUBSCRIPTION_ID']\n",
"AZURE_RESOURCE_GROUP = d['config']['AZURE_RESOURCE_GROUP']\n",
"AZURE_WORKSPACE = d['config']['AZURE_WORKSPACE']\n",
"AZURE_DATA_NAME = d['config']['AZURE_DATA_NAME'] \n",
"DATA_DIR = d['config']['DATA_DIR']\n",
"CLOUD_DIR = d['config']['CLOUD_DIR']\n",
"HF_MODEL_NAME_OR_PATH = d['config']['HF_MODEL_NAME_OR_PATH']\n",
"IS_DEBUG = d['config']['IS_DEBUG']\n",
"\n",
"azure_env_name = d['serve']['azure_env_name']\n",
"azure_model_name = d['serve']['azure_model_name']\n",
"azure_endpoint_name = d['serve']['azure_endpoint_name']\n",
"azure_deployment_name = d['serve']['azure_deployment_name']\n",
"azure_serving_cluster_size = d['serve']['azure_serving_cluster_size']"
]
},
{
"cell_type": "markdown",
"id": "d9843e0f-3cf1-4e86-abb7-a49919fac8d4",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 2. Serving preparation\n",
"\n",
"---\n",
"\n",
"### 2.1. Configure workspace details\n",
"\n",
"To connect to a workspace, we need identifying parameters - a subscription, a resource group, and a workspace name. We will use these details in the MLClient from azure.ai.ml to get a handle on the Azure Machine Learning workspace we need. We will use the default Azure authentication for this hands-on.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ccb4a273-ba31-4f47-a2fd-dc8cdea390f6",
"metadata": {},
"outputs": [],
"source": [
"# import required libraries\n",
"import time\n",
"from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n",
"from azure.ai.ml import MLClient, Input\n",
"from azure.ai.ml import command\n",
"from azure.ai.ml.entities import Model\n",
"from azure.ai.ml.constants import AssetTypes\n",
"from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError\n",
"\n",
"credential = DefaultAzureCredential()\n",
"ml_client = None\n",
"try:\n",
" ml_client = MLClient.from_config(credential)\n",
"except Exception as ex:\n",
" print(ex)\n",
" ml_client = MLClient(credential, AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_WORKSPACE)"
]
},
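{
"cell_type": "markdown",
"id": "b1a2c3d4-1111-4e22-9a33-445566778899",
"metadata": {},
"source": [
"As a quick verification that the handle works, you can fetch the workspace metadata. This is a minimal sketch; `ml_client.workspace_name` resolves to the workspace the client was configured with.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2b3d4e5-2222-4f33-8b44-556677889900",
"metadata": {},
"outputs": [],
"source": [
"# Verify the handle by fetching workspace metadata\n",
"ws = ml_client.workspaces.get(ml_client.workspace_name)\n",
"print(f\"Workspace: {ws.name} (location: {ws.location})\")"
]
},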
{
"cell_type": "markdown",
"id": "ba2a377b-d10c-413e-a67b-2c11a3cff7fd",
"metadata": {},
"source": [
"### 2.2. Create model asset\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73532c39-3fdd-40a7-b2be-9f5a2f22443a",
"metadata": {},
"outputs": [],
"source": [
"def get_or_create_model_asset(ml_client, model_name, job_name, model_dir=\"outputs\", model_type=\"custom_model\", update=False):\n",
" \n",
" try:\n",
" latest_model_version = max([int(m.version) for m in ml_client.models.list(name=model_name)])\n",
" if update:\n",
" raise ResourceExistsError('Found Model asset, but will update the Model.')\n",
" else:\n",
" model_asset = ml_client.models.get(name=model_name, version=latest_model_version)\n",
" print(f\"Found Model asset: {model_name}. Will not create again\")\n",
" except (ResourceNotFoundError, ResourceExistsError) as e:\n",
" print(f\"Exception: {e}\") \n",
" model_path = f\"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}/\" \n",
" run_model = Model(\n",
" name=model_name, \n",
" path=model_path,\n",
" description=\"Model created from run.\",\n",
" type=model_type # mlflow_model, custom_model, triton_model\n",
" )\n",
" model_asset = ml_client.models.create_or_update(run_model)\n",
" print(f\"Created Model asset: {model_name}\")\n",
"\n",
" return model_asset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9aaa9671-fc98-4e5e-a70c-7771caa1c7d6",
"metadata": {},
"outputs": [],
"source": [
"model_dir = d['train']['model_dir']\n",
"model = get_or_create_model_asset(ml_client, azure_model_name, job_name, model_dir, model_type=\"custom_model\", update=False)"
]
},
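{
"cell_type": "markdown",
"id": "c4a1e2b7-5d3f-4a88-9b16-0f2e3d4c5a61",
"metadata": {},
"source": [
"To double-check what was registered, you can list every version of the model asset. A minimal sketch using the same `ml_client.models.list()` call as above:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5b2f3c8-6e4a-4b99-8c27-1a3f4e5d6b72",
"metadata": {},
"outputs": [],
"source": [
"# List all registered versions of the model asset for verification\n",
"for m in ml_client.models.list(name=azure_model_name):\n",
"    print(f\"{m.name} (version {m.version})\")"
]
},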
{
"cell_type": "markdown",
"id": "b0df561a-7846-4450-a2dd-4af6396b1719",
"metadata": {},
"source": [
"### 2.3. Create AzureML environment\n",
"\n",
"Azure ML defines containers (called environment asset) in which your code will run. We can use the built-in environment or build a custom environment (Docker container, conda). This hands-on uses Docker container.\n"
]
},
{
"cell_type": "markdown",
"id": "ec0b27b9-6a00-42ff-a3e4-ceb5fbd9d2cf",
"metadata": {},
"source": [
"#### Docker environment\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67b4adcf-b53b-4b71-9826-a31663236a2d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%writefile {CLOUD_DIR}/serve/Dockerfile\n",
"FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2\n",
"\n",
"# Install pip dependencies\n",
"COPY requirements.txt .\n",
"RUN pip install -r requirements.txt --no-cache-dir\n",
"\n",
"# Inference requirements\n",
"COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/\n",
"\n",
"RUN /var/requirements/install_system_requirements.sh && \\\n",
" cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \\\n",
" cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \\\n",
" ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \\\n",
" rm -f /etc/nginx/sites-enabled/default\n",
"ENV SVDIR=/var/runit\n",
"ENV WORKER_TIMEOUT=400\n",
"EXPOSE 5001 8883 8888\n",
"\n",
"# support Deepspeed launcher requirement of passwordless ssh login\n",
"RUN apt-get update\n",
"RUN apt-get install -y openssh-server openssh-client\n",
"\n",
"RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbf91c6a-cf93-453b-9281-ac64f239eacf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%writefile {CLOUD_DIR}/serve/requirements.txt\n",
"azureml-core==1.58.0\n",
"azureml-dataset-runtime==1.58.0\n",
"azureml-defaults==1.58.0\n",
"azure-ml==0.0.1\n",
"azure-ml-component==0.9.18.post2\n",
"azureml-mlflow==1.58.0\n",
"azureml-contrib-services==1.58.0\n",
"azureml-contrib-services==1.58.0\n",
"azureml-automl-common-tools==1.58.0\n",
"torch-tb-profiler==0.4.3\n",
"azureml-inference-server-http~=1.3\n",
"inference-schema==1.8.0\n",
"MarkupSafe==3.0.2\n",
"regex\n",
"pybind11\n",
"bitsandbytes==0.44.1\n",
"transformers==4.46.1\n",
"peft==0.13.2\n",
"accelerate==1.1.0\n",
"datasets\n",
"scipy\n",
"azure-identity\n",
"packaging==24.1\n",
"timm==1.0.11\n",
"einops"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7b19bc6-dc43-4948-8c71-97a033a5f5a3",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import Environment, BuildContext\n",
"\n",
"def get_or_create_docker_environment_asset(ml_client, env_name, docker_dir, update=False):\n",
" \n",
" try:\n",
" latest_env_version = max([int(e.version) for e in ml_client.environments.list(name=env_name)])\n",
" if update:\n",
" raise ResourceExistsError('Found Environment asset, but will update the Environment.')\n",
" else:\n",
" env_asset = ml_client.environments.get(name=env_name, version=latest_env_version)\n",
" print(f\"Found Environment asset: {env_name}. Will not create again\")\n",
" except (ResourceNotFoundError, ResourceExistsError) as e:\n",
" print(f\"Exception: {e}\")\n",
" env_docker_image = Environment(\n",
" build=BuildContext(path=docker_dir),\n",
" name=env_name,\n",
" description=\"Environment created from a Docker context.\",\n",
" )\n",
" env_asset = ml_client.environments.create_or_update(env_docker_image)\n",
" print(f\"Created Environment asset: {env_name}\")\n",
" \n",
" return env_asset\n",
"\n",
"env = get_or_create_docker_environment_asset(ml_client, azure_env_name, f\"{CLOUD_DIR}/serve\", update=False)"
]
},
{
"cell_type": "markdown",
"id": "e496e364-c8ed-43c1-a248-2530c1eb44a4",
"metadata": {},
"source": [
"### 2.4. Serving script\n",
"\n",
"If you are not serving with MLflow but with a custom model, you are free to write your own code.The `score.py` example below shows how to write the code.\n",
"\n",
"- `init()`: This function is the place to write logic for global initialization operations like loading the LLM model.\n",
"- `run()`: Inference logic is called for every invocation of the endpoint.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7be0f024-66c1-4ddf-9b54-8fcf45abd5e1",
"metadata": {},
"outputs": [],
"source": [
"%%writefile src_serve/score.py\n",
"import os\n",
"import re\n",
"import json\n",
"import torch\n",
"import base64\n",
"import logging\n",
"\n",
"from io import BytesIO\n",
"from PIL import Image\n",
"from transformers import AutoTokenizer, AutoProcessor, BitsAndBytesConfig, get_scheduler\n",
"from transformers import AutoModelForCausalLM, AutoProcessor\n",
"from PIL import Image, ImageDraw, ImageFont\n",
"\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"\n",
"def run_example_base64(task_prompt, text_input, base64_image, params):\n",
" \n",
" max_new_tokens = params[\"max_new_tokens\"]\n",
" num_beams = params[\"num_beams\"]\n",
" \n",
" image = Image.open(BytesIO(base64.b64decode(base64_image)))\n",
" prompt = task_prompt + text_input\n",
"\n",
" # Ensure the image is in RGB mode\n",
" if image.mode != \"RGB\":\n",
" image = image.convert(\"RGB\")\n",
"\n",
" inputs = processor(text=prompt, images=image, return_tensors=\"pt\").to(device)\n",
" generated_ids = model.generate(\n",
" input_ids=inputs[\"input_ids\"],\n",
" pixel_values=inputs[\"pixel_values\"],\n",
" max_new_tokens=max_new_tokens,\n",
" num_beams=num_beams\n",
" )\n",
" generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]\n",
" parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))\n",
" return parsed_answer\n",
"\n",
"\n",
"def init():\n",
" \"\"\"\n",
" This function is called when the container is initialized/started, typically after create/update of the deployment.\n",
" You can write the logic here to perform init operations like caching the model in memory\n",
" \"\"\"\n",
" global model\n",
" global processor\n",
" # AZUREML_MODEL_DIR is an environment variable created during deployment.\n",
" # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)\n",
" # Please provide your model's folder name if there is one\n",
" model_name_or_path = os.path.join(\n",
" os.getenv(\"AZUREML_MODEL_DIR\"), \"outputs\"\n",
" )\n",
" \n",
" model_kwargs = dict(\n",
" trust_remote_code=True,\n",
" revision=\"refs/pr/6\", \n",
" device_map=device\n",
" )\n",
" \n",
" processor_kwargs = dict(\n",
" trust_remote_code=True,\n",
" revision=\"refs/pr/6\"\n",
" )\n",
" \n",
" model = AutoModelForCausalLM.from_pretrained(model_name_or_path, **model_kwargs)\n",
" processor = AutoProcessor.from_pretrained(model_name_or_path, **processor_kwargs) \n",
"\n",
" logging.info(\"Loaded model.\")\n",
" \n",
"def run(json_data: str):\n",
" logging.info(\"Request received\")\n",
" data = json.loads(json_data)\n",
" task_prompt = data[\"task_prompt\"]\n",
" text_input = data[\"text_input\"]\n",
" base64_image = data[\"image_input\"]\n",
" params = data['params']\n",
"\n",
" generated_text = run_example_base64(task_prompt, text_input, base64_image, params)\n",
" json_result = {\"result\": str(generated_text)}\n",
" \n",
" return json_result "
]
},
{
"cell_type": "markdown",
"id": "aafb187e-370f-4481-9d82-a38ae982c1e3",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 3. Serving\n",
"\n",
"---\n",
"\n",
"### 3.1. Create endpoint\n",
"\n",
"Create an endpoint. This process does not provision a GPU cluster yet.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77c22433-4ab8-4db9-956b-7f437b86dfa6",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"from azure.ai.ml.entities import (\n",
" ManagedOnlineEndpoint,\n",
" IdentityConfiguration,\n",
" ManagedIdentityConfiguration,\n",
")\n",
"\n",
"# Check if the endpoint already exists in the workspace\n",
"try:\n",
" endpoint = ml_client.online_endpoints.get(azure_endpoint_name)\n",
" print(\"---Endpoint already exists---\")\n",
"except:\n",
" # Create an online endpoint if it doesn't exist\n",
"\n",
" # Define the endpoint\n",
" endpoint = ManagedOnlineEndpoint(\n",
" name=azure_endpoint_name,\n",
" description=f\"Test endpoint for {model.name}\",\n",
" # identity=IdentityConfiguration(\n",
" # type=\"user_assigned\",\n",
" # user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],\n",
" # )\n",
" # if uai_id != \"\"\n",
" # else None,\n",
" )\n",
"\n",
"# Trigger the endpoint creation\n",
"try:\n",
" ml_client.begin_create_or_update(endpoint).wait()\n",
" print(\"\\n---Endpoint created successfully---\\n\")\n",
"except Exception as err:\n",
" raise RuntimeError(\n",
" f\"Endpoint creation failed. Detailed Response:\\n{err}\"\n",
" ) from err"
]
},
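{
"cell_type": "markdown",
"id": "e6c3a4d9-7f5b-4caa-9d38-2b4a5f6e7c83",
"metadata": {},
"source": [
"Before creating the deployment, it can help to confirm the endpoint reached a healthy state. The check below is a minimal sketch; `provisioning_state` and `scoring_uri` are standard attributes of `ManagedOnlineEndpoint`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7d4b5ea-8a6c-4dbb-8e49-3c5b6a7f8d94",
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: print the endpoint's provisioning state and scoring URI\n",
"endpoint = ml_client.online_endpoints.get(azure_endpoint_name)\n",
"print(f\"Provisioning state: {endpoint.provisioning_state}\")\n",
"print(f\"Scoring URI: {endpoint.scoring_uri}\")"
]
},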
{
"cell_type": "markdown",
"id": "7f8a05af-0aca-4d36-9c0f-bfa4dcc6203b",
"metadata": {},
"source": [
"### 3.2. Create Deployment\n",
"\n",
"Create a Deployment. This takes a lot of time as GPU clusters must be provisioned and the serving environment must be built.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3afaa8b-5af1-49d1-990f-414da0effe8a",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"from azure.ai.ml.entities import ( \n",
" OnlineRequestSettings,\n",
" CodeConfiguration,\n",
" ManagedOnlineDeployment,\n",
" ProbeSettings,\n",
" Environment\n",
")\n",
"\n",
"deployment = ManagedOnlineDeployment(\n",
" name=azure_deployment_name,\n",
" endpoint_name=azure_endpoint_name,\n",
" model=model,\n",
" instance_type=azure_serving_cluster_size,\n",
" instance_count=1,\n",
" #code_configuration=code_configuration,\n",
" environment = env,\n",
" scoring_script=\"score.py\",\n",
" code_path=\"./src_serve\",\n",
" #environment_variables=deployment_env_vars,\n",
" request_settings=OnlineRequestSettings(max_concurrent_requests_per_instance=3,\n",
" request_timeout_ms=90000, max_queue_wait_ms=60000),\n",
" liveness_probe=ProbeSettings(\n",
" failure_threshold=30,\n",
" success_threshold=1,\n",
" period=100,\n",
" initial_delay=500,\n",
" ),\n",
" readiness_probe=ProbeSettings(\n",
" failure_threshold=30,\n",
" success_threshold=1,\n",
" period=100,\n",
" initial_delay=500,\n",
" ),\n",
")\n",
"\n",
"# Trigger the deployment creation\n",
"try:\n",
" ml_client.begin_create_or_update(deployment).wait()\n",
" print(\"\\n---Deployment created successfully---\\n\")\n",
"except Exception as err:\n",
" raise RuntimeError(\n",
" f\"Deployment creation failed. Detailed Response:\\n{err}\"\n",
" ) from err\n",
" \n",
"endpoint.traffic = {azure_deployment_name: 100}\n",
"endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint) "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7bc7b788-01f1-47d8-8142-5fdfb0014063",
"metadata": {},
"outputs": [],
"source": [
"endpoint_poller.result()"
]
},
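{
"cell_type": "markdown",
"id": "a8e5c6fb-9b7d-4ecc-9f5a-4d6c7b8a9ea5",
"metadata": {},
"source": [
"If the deployment fails or is slow to become ready, the container logs are the first place to look. The cell below is a minimal sketch; `get_logs()` is part of the `azure-ai-ml` SDK, and `lines=50` is an arbitrary choice.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9f6d7ac-ab8e-4fdd-8a6b-5e7d8c9bafb6",
"metadata": {},
"outputs": [],
"source": [
"# Fetch the most recent container logs from the deployment for troubleshooting\n",
"logs = ml_client.online_deployments.get_logs(\n",
"    name=azure_deployment_name,\n",
"    endpoint_name=azure_endpoint_name,\n",
"    lines=50,\n",
")\n",
"print(logs)"
]
},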
{
"cell_type": "markdown",
"id": "72f437f6-153a-42d5-ab22-0011d0fe2481",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 4. Test\n",
"\n",
"---\n",
"\n",
"### 4.1. Invocation\n",
"\n",
"Try calling the endpoint.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "adf37c63-4d1e-44a3-8c94-268ffc716da9",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import base64\n",
"\n",
"\n",
"with open('./test_images/DocumentVQA_Test_01.jpg', 'rb') as img:\n",
" base64_img = base64.b64encode(img.read()).decode('utf-8')\n",
" \n",
"sample = {\n",
" \"task_prompt\": \"DocVQA\",\n",
" \"image_input\": base64_img,\n",
" \"text_input\": \"What do you see in this image\", \n",
" \"params\": {\n",
" \"max_new_tokens\": 512,\n",
" \"num_beams\": 3\n",
" }\n",
"}\n",
"\n",
"test_src_dir = \"./inference-test\"\n",
"os.makedirs(test_src_dir, exist_ok=True)\n",
"print(f\"test script directory: {test_src_dir}\")\n",
"sample_data_path = os.path.join(test_src_dir, \"sample-request.json\")\n",
"\n",
"with open(sample_data_path, \"w\") as f:\n",
" json.dump(sample, f)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53b3079d-a412-497d-aa37-795a1683ac78",
"metadata": {},
"outputs": [],
"source": [
"result = ml_client.online_endpoints.invoke(\n",
" endpoint_name=azure_endpoint_name,\n",
" deployment_name=azure_deployment_name,\n",
" request_file=sample_data_path,\n",
")\n",
"\n",
"result_json = json.loads(result)\n",
"print(result_json['result'])"
]
},
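{
"cell_type": "markdown",
"id": "cab7e8bd-bc9f-4aee-9b7c-6f8e9dacbac7",
"metadata": {},
"source": [
"The endpoint can also be called over plain REST, which is how most client applications will consume it. The sketch below assumes the endpoint uses the default key-based authentication; the `azureml-model-deployment` header routes the request to a specific deployment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbc8f9ce-cdaa-4bff-8c8d-7a9fae0bcbd8",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# Assumes key-based auth, the default for managed online endpoints\n",
"scoring_uri = ml_client.online_endpoints.get(azure_endpoint_name).scoring_uri\n",
"api_key = ml_client.online_endpoints.get_keys(azure_endpoint_name).primary_key\n",
"\n",
"headers = {\n",
"    \"Content-Type\": \"application/json\",\n",
"    \"Authorization\": f\"Bearer {api_key}\",\n",
"    \"azureml-model-deployment\": azure_deployment_name,  # route to this deployment\n",
"}\n",
"\n",
"with open(sample_data_path) as f:\n",
"    payload = f.read()\n",
"\n",
"# timeout matches the 90-second request_timeout_ms set on the deployment\n",
"response = requests.post(scoring_uri, data=payload, headers=headers, timeout=90)\n",
"print(response.json()[\"result\"])"
]
},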
{
"cell_type": "markdown",
"id": "6a9d96bd-75da-4c10-923c-edad899fc4d3",
"metadata": {},
"source": [
"### 4.2. LLM latency/throughput benchmarking\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20bf4fff-915b-45a8-9f46-bff0e80373df",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from time import perf_counter\n",
"\n",
"def benchmark_latency(endpoint_name, deployment_name, sample_data_path, num_warmups=1, num_infers=5):\n",
" print(f\"Measuring latency for Endpoint '{endpoint_name}' and Deployment '{deployment_name}', num_infers={num_infers}\")\n",
"\n",
" latencies = []\n",
" # warm up\n",
" for _ in range(num_warmups):\n",
" result = ml_client.online_endpoints.invoke(\n",
" endpoint_name=endpoint_name,\n",
" deployment_name=deployment_name,\n",
" request_file=sample_data_path,\n",
" ) \n",
" \n",
" begin = time.time() \n",
" # Timed run\n",
" for _ in range(num_infers):\n",
" start_time = perf_counter()\n",
" result = ml_client.online_endpoints.invoke(\n",
" endpoint_name=endpoint_name,\n",
" deployment_name=deployment_name,\n",
" request_file=sample_data_path,\n",
" )\n",
" latency = perf_counter() - start_time\n",
" latencies.append(latency)\n",
" end = time.time() \n",
" \n",
" # Compute run statistics\n",
" duration = end - begin \n",
" time_avg_sec = np.mean(latencies)\n",
" time_std_sec = np.std(latencies)\n",
" time_p95_sec = np.percentile(latencies, 95)\n",
" time_p99_sec = np.percentile(latencies, 99)\n",
" \n",
" # Metrics\n",
" metrics = {\n",
" 'duration': duration,\n",
" 'avg_sec': time_avg_sec,\n",
" 'std_sec': time_std_sec, \n",
" 'p95_sec': time_p95_sec,\n",
" 'p99_sec': time_p99_sec \n",
" }\n",
" \n",
" return metrics\n",
"\n",
"def benchmark_latency_multicore(endpoint_name, deployment_name, sample_data_path, num_warmups=1, num_infers=5, num_threads=2):\n",
" import time\n",
" import concurrent.futures\n",
"\n",
" # Warmup\n",
" for _ in range(num_warmups):\n",
" result = ml_client.online_endpoints.invoke(\n",
" endpoint_name=endpoint_name,\n",
" deployment_name=deployment_name,\n",
" request_file=sample_data_path,\n",
" ) \n",
" \n",
" latencies = []\n",
"\n",
" # Thread task: Each of these thread tasks executes in a serial loop for a single model.\n",
" # Multiple of these threads are launched to achieve parallelism.\n",
" def task(model):\n",
" for _ in range(num_infers):\n",
" start = time.time()\n",
" result = ml_client.online_endpoints.invoke(\n",
" endpoint_name=endpoint_name,\n",
" deployment_name=deployment_name,\n",
" request_file=sample_data_path,\n",
" ) \n",
" finish = time.time()\n",
" latencies.append(finish - start)\n",
" \n",
" # Submit tasks\n",
" begin = time.time()\n",
" with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as pool:\n",
" for i in range(num_threads):\n",
" pool.submit(task, model)\n",
" end = time.time()\n",
"\n",
" # Compute metrics\n",
" duration = end - begin\n",
" inferences = len(latencies)\n",
" throughput = inferences / duration\n",
" avg_latency = sum(latencies) / len(latencies)\n",
" \n",
" # Compute run statistics\n",
" time_avg_sec = np.mean(latencies)\n",
" time_std_sec = np.std(latencies)\n",
" time_p95_sec = np.percentile(latencies, 95)\n",
" time_p99_sec = np.percentile(latencies, 99)\n",
" \n",
" time_std_sec = np.std(latencies)\n",
" time_p95_sec = np.percentile(latencies, 95)\n",
" time_p99_sec = np.percentile(latencies, 99)\n",
"\n",
" # Metrics\n",
" metrics = {\n",
" 'threads': num_threads,\n",
" 'duration': duration,\n",
" 'throughput': throughput,\n",
" 'avg_sec': avg_latency,\n",
" 'std_sec': time_std_sec, \n",
" 'p95_sec': time_p95_sec,\n",
" 'p99_sec': time_p99_sec \n",
" }\n",
" \n",
" return metrics"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8827fb45-c7ab-4aa8-b9f4-1c4e610d695a",
"metadata": {},
"outputs": [],
"source": [
"benchmark_result = benchmark_latency(azure_endpoint_name, azure_deployment_name, sample_data_path, num_warmups=1, num_infers=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c62e9289-fe65-4cbf-abc8-f19a4f486bde",
"metadata": {},
"outputs": [],
"source": [
"print(benchmark_result)"
]
},
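{
"cell_type": "markdown",
"id": "ecd9aadf-deab-4caa-9d9e-8baabf1cdce9",
"metadata": {},
"source": [
"`benchmark_latency_multicore()` measures throughput under concurrent load but has not been invoked yet; the cell below is a usage sketch. `num_threads=3` matches the `max_concurrent_requests_per_instance=3` set on the deployment, so requests should not queue on the server side.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fdeabbea-efbc-4dbb-8eaf-9cbbca2dedfa",
"metadata": {},
"outputs": [],
"source": [
"# Throughput under concurrency: 3 threads saturate the deployment's request slots\n",
"multicore_result = benchmark_latency_multicore(\n",
"    azure_endpoint_name, azure_deployment_name, sample_data_path,\n",
"    num_warmups=1, num_infers=5, num_threads=3\n",
")\n",
"print(multicore_result)"
]
},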
{
"cell_type": "markdown",
"id": "5834a237-e751-446a-ac21-7272c29b0c2c",
"metadata": {},
"source": [
"## Clean up\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59508aad",
"metadata": {},
"outputs": [],
"source": [
"!rm -rf {test_src_dir}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc72663d-d773-435b-871f-cc51b1e51763",
"metadata": {},
"outputs": [],
"source": [
"ml_client.online_endpoints.begin_delete(azure_endpoint_name)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py312-dev",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
},
"microsoft": {
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}