2_slm-fine-tuning-mlstudio/phi/2_serving_llm_phi.ipynb (871 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "796cf06d-e4d3-422e-8301-a961e9520f52",
"metadata": {},
"source": [
"# vLLM/SGLang serving using the Azure ML Python SDK\n",
"\n",
"vLLM/SGLang is a high-throughput, memory-efficient engine for serving LLMs. It utilizes PagedAttention for optimized memory management, enabling continuous batching and fast inference. It supports Hugging Face models, multiple decoding algorithms, and distributed inference techniques, making it a scalable and flexible solution.\n",
"\n",
"- vLLM: https://docs.vllm.ai/en/latest/\n",
"- SGLang: https://docs.sglang.ai/\n",
"\n",
"[Note] Please use `Python 3.10 - SDK v2 (azureml_py310_sdkv2)` conda environment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "840136a2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import os, sys\n",
"lab_prep_dir = os.getcwd().split(\"slm-innovator-lab\")[0] + \"slm-innovator-lab/0_lab_preparation\"\n",
"sys.path.append(os.path.abspath(lab_prep_dir))\n",
"\n",
"from common import check_kernel\n",
"check_kernel()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "538d1c11-a30d-4342-85d5-ba59e1890b76",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%store -r job_name\n",
"try:\n",
" job_name\n",
"except NameError:\n",
" print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")\n",
" print(\"[ERROR] Please run the previous notebook (model training) again.\")\n",
" print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")"
]
},
{
"cell_type": "markdown",
"id": "bb139b1d-500c-4ef2-9af3-728f2a5ea05f",
"metadata": {},
"source": [
"## 1. Load config file\n",
"\n",
"---\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5234c47-b3e5-4218-8a98-3988c8991643",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import yaml\n",
"from logger import logger\n",
"from datetime import datetime\n",
"\n",
"snapshot_date = datetime.now().strftime(\"%Y-%m-%d\")\n",
"\n",
"with open(\"config.yml\") as f:\n",
" d = yaml.load(f, Loader=yaml.FullLoader)\n",
"\n",
"AZURE_SUBSCRIPTION_ID = d[\"config\"][\"AZURE_SUBSCRIPTION_ID\"]\n",
"AZURE_RESOURCE_GROUP = d[\"config\"][\"AZURE_RESOURCE_GROUP\"]\n",
"AZURE_WORKSPACE = d[\"config\"][\"AZURE_WORKSPACE\"]\n",
"AZURE_DATA_NAME = d[\"config\"][\"AZURE_DATA_NAME\"]\n",
"DATA_DIR = d[\"config\"][\"DATA_DIR\"]\n",
"CLOUD_DIR = d[\"config\"][\"CLOUD_DIR\"]\n",
"HF_MODEL_NAME_OR_PATH = d[\"config\"][\"HF_MODEL_NAME_OR_PATH\"]\n",
"IS_DEBUG = d[\"config\"][\"IS_DEBUG\"]\n",
"\n",
"azure_env_name = d[\"serve\"][\"azure_env_name\"]\n",
"azure_model_name = d[\"serve\"][\"azure_model_name\"]\n",
"azure_endpoint_name = d[\"serve\"][\"azure_endpoint_name\"]\n",
"azure_deployment_name = d[\"serve\"][\"azure_deployment_name\"]\n",
"azure_serving_cluster_size = d[\"serve\"][\"azure_serving_cluster_size\"]\n",
"port = d[\"serve\"][\"port\"]\n",
"engine = d[\"serve\"][\"engine\"]\n",
"\n",
"logger.info(\"===== 0. Azure ML Deployment Info =====\")\n",
"logger.info(f\"AZURE_SUBSCRIPTION_ID={AZURE_SUBSCRIPTION_ID}\")\n",
"logger.info(f\"AZURE_RESOURCE_GROUP={AZURE_RESOURCE_GROUP}\")\n",
"logger.info(f\"AZURE_WORKSPACE={AZURE_WORKSPACE}\")\n",
"logger.info(f\"AZURE_DATA_NAME={AZURE_DATA_NAME}\")\n",
"logger.info(f\"DATA_DIR={DATA_DIR}\")\n",
"logger.info(f\"CLOUD_DIR={CLOUD_DIR}\")\n",
"logger.info(f\"HF_MODEL_NAME_OR_PATH={HF_MODEL_NAME_OR_PATH}\")\n",
"logger.info(f\"IS_DEBUG={IS_DEBUG}\")\n",
"\n",
"logger.info(f\"azure_env_name={azure_env_name}\")\n",
"logger.info(f\"azure_model_name={azure_model_name}\")\n",
"logger.info(f\"azure_endpoint_name={azure_endpoint_name}\")\n",
"logger.info(f\"azure_deployment_name={azure_deployment_name}\")\n",
"logger.info(f\"azure_serving_cluster_size={azure_serving_cluster_size}\")\n",
"logger.info(f\"port={port}\")\n",
"logger.info(f\"engine={engine}\")"
]
},
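{
"cell_type": "markdown",
"id": "sketch-config-check-md",
"metadata": {},
"source": [
"The cell above indexes `d` directly, so a key missing from `config.yml` surfaces as a bare `KeyError`. Below is a minimal sketch of an up-front check, assuming the section and key names read above. `REQUIRED_KEYS`, `find_missing_keys`, and the sample dictionary are hypothetical names introduced for illustration; they are not part of the lab code.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-config-check",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper: report which required keys are absent from a loaded config.\n",
"REQUIRED_KEYS = {\n",
"    \"config\": [\"AZURE_SUBSCRIPTION_ID\", \"AZURE_RESOURCE_GROUP\", \"AZURE_WORKSPACE\"],\n",
"    \"serve\": [\"azure_endpoint_name\", \"azure_deployment_name\", \"port\", \"engine\"],\n",
"}\n",
"\n",
"\n",
"def find_missing_keys(cfg, required=REQUIRED_KEYS):\n",
"    \"\"\"Return dotted names such as 'serve.port' for every missing key.\"\"\"\n",
"    missing = []\n",
"    for section, keys in required.items():\n",
"        section_values = cfg.get(section) or {}\n",
"        missing.extend(f\"{section}.{key}\" for key in keys if key not in section_values)\n",
"    return missing\n",
"\n",
"\n",
"# Illustrative input only; in this notebook you would pass the loaded dictionary `d`.\n",
"sample_cfg = {\"config\": {\"AZURE_WORKSPACE\": \"my-workspace\"}, \"serve\": {\"port\": 8000}}\n",
"print(find_missing_keys(sample_cfg))"
]
},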
{
"cell_type": "markdown",
"id": "d9843e0f-3cf1-4e86-abb7-a49919fac8d4",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 2. Serving preparation\n",
"\n",
"---\n",
"\n",
"### 2.1. Configure workspace details\n",
"\n",
"To connect to a workspace, we need identifying parameters - a subscription, a resource group, and a workspace name. We will use these details in the MLClient from azure.ai.ml to get a handle on the Azure Machine Learning workspace we need. We will use the default Azure authentication for this hands-on.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ccb4a273-ba31-4f47-a2fd-dc8cdea390f6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# import required libraries\n",
"import time\n",
"from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n",
"from azure.ai.ml import MLClient, Input\n",
"from azure.ai.ml import command\n",
"from azure.ai.ml.entities import Model\n",
"from azure.ai.ml.constants import AssetTypes\n",
"from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError\n",
"\n",
"logger.info(f\"===== 2. Serving preparation =====\")\n",
"logger.info(f\"Calling DefaultAzureCredential.\")\n",
"credential = DefaultAzureCredential()\n",
"ml_client = MLClient(\n",
" credential, AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_WORKSPACE\n",
")"
]
},
{
"cell_type": "markdown",
"id": "ba2a377b-d10c-413e-a67b-2c11a3cff7fd",
"metadata": {},
"source": [
"### 2.2. Create model asset\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73532c39-3fdd-40a7-b2be-9f5a2f22443a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def get_or_create_model_asset(\n",
" ml_client,\n",
" model_name,\n",
" job_name=None,\n",
" model_dir=\"outputs\",\n",
" model_type=\"custom_model\",\n",
" update=False,\n",
"):\n",
" try:\n",
" latest_model_version = max(\n",
" [int(m.version) for m in ml_client.models.list(name=model_name)]\n",
" )\n",
" if update:\n",
" raise ResourceExistsError(\"Found Model asset, but will update the Model.\")\n",
" else:\n",
" model_asset = ml_client.models.get(\n",
" name=model_name, version=latest_model_version\n",
" )\n",
" logger.info(f\"Found Model asset: {model_name}. Will not create again\")\n",
" except (ResourceNotFoundError, ResourceExistsError) as e:\n",
" logger.info(f\"Exception: {e}\")\n",
" if job_name is None:\n",
" model_path = model_dir\n",
" else:\n",
" model_path = (\n",
" f\"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}/\"\n",
" )\n",
" run_model = Model(\n",
" name=model_name,\n",
" path=model_path,\n",
" description=\"Model created from run.\",\n",
" type=model_type, # mlflow_model, custom_model, triton_model\n",
" )\n",
" model_asset = ml_client.models.create_or_update(run_model)\n",
" logger.info(f\"Created Model asset: {model_name}\")\n",
"\n",
" return model_asset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9aaa9671-fc98-4e5e-a70c-7771caa1c7d6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"model_dir = d[\"train\"][\"model_dir\"]\n",
"model = get_or_create_model_asset(\n",
" ml_client,\n",
" azure_model_name,\n",
" job_name,\n",
" model_dir,\n",
" model_type=\"custom_model\",\n",
" update=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b0df561a-7846-4450-a2dd-4af6396b1719",
"metadata": {},
"source": [
"### 2.3. Create AzureML environment\n",
"\n",
"Azure ML defines containers (called environment asset) in which your code will run. We can use the built-in environment or build a custom environment (Docker container, conda). This hands-on uses Docker container.\n"
]
},
{
"cell_type": "markdown",
"id": "08ac357a-3944-4702-a338-b9f4b67dadc9",
"metadata": {},
"source": [
"#### Docker environment\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a65ce80",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import Environment, BuildContext\n",
"\n",
"\n",
"def get_or_create_docker_environment_asset(\n",
" ml_client,\n",
" env_name,\n",
" docker_dir,\n",
" dockerfile_path=\"Dockerfile\",\n",
" inference_config=None,\n",
" update=False,\n",
"):\n",
"\n",
" try:\n",
" latest_env_version = max(\n",
" [int(e.version) for e in ml_client.environments.list(name=env_name)]\n",
" )\n",
" if update:\n",
" raise ResourceExistsError(\n",
" \"Found Environment asset, but will update the Environment.\"\n",
" )\n",
" else:\n",
" env_asset = ml_client.environments.get(\n",
" name=env_name, version=latest_env_version\n",
" )\n",
" logger.info(f\"Found Environment asset: {env_name}. Will not create again\")\n",
" except (ResourceNotFoundError, ResourceExistsError) as e:\n",
" logger.info(f\"Exception: {e}\")\n",
" env_docker_image = Environment(\n",
" build=BuildContext(dockerfile_path=dockerfile_path, path=docker_dir),\n",
" name=env_name,\n",
" description=\"Environment created from a Docker context.\",\n",
" inference_config=inference_config,\n",
" )\n",
" env_asset = ml_client.environments.create_or_update(env_docker_image)\n",
" logger.info(f\"Created Environment asset: {env_name}\")\n",
"\n",
" return env_asset\n",
"\n",
"\n",
"inference_config = {\n",
" \"liveness_route\": {\n",
" \"port\": port,\n",
" \"path\": \"/health\",\n",
" },\n",
" \"readiness_route\": {\n",
" \"port\": port,\n",
" \"path\": \"/health\",\n",
" },\n",
" \"scoring_route\": {\n",
" \"port\": port,\n",
" \"path\": \"/\",\n",
" },\n",
"}\n",
"\n",
"if engine == \"vllm\":\n",
" dockerfile_path = \"Dockerfile.vllm\"\n",
"elif engine == \"sglang\":\n",
" dockerfile_path = \"Dockerfile.sglang\"\n",
"else:\n",
" dockerfile_path = \"Dockerfile\"\n",
"\n",
"env = get_or_create_docker_environment_asset(\n",
" ml_client,\n",
" azure_env_name,\n",
" f\"{CLOUD_DIR}/serve\",\n",
" dockerfile_path=dockerfile_path,\n",
" inference_config=inference_config,\n",
" update=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "aafb187e-370f-4481-9d82-a38ae982c1e3",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 3. Serving\n",
"\n",
"---\n",
"\n",
"### 3.1. Create endpoint\n",
"\n",
"Create an endpoint. This process does not provision a GPU cluster yet.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77c22433-4ab8-4db9-956b-7f437b86dfa6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from azure.ai.ml.entities import (\n",
" ManagedOnlineEndpoint,\n",
" IdentityConfiguration,\n",
" ManagedIdentityConfiguration,\n",
")\n",
"\n",
"logger.info(f\"===== 3. Serving =====\")\n",
"\n",
"t0 = time.time()\n",
"\n",
"# Check if the endpoint already exists in the workspace\n",
"try:\n",
" endpoint = ml_client.online_endpoints.get(azure_endpoint_name)\n",
" logger.info(\"---Endpoint already exists---\")\n",
"except:\n",
" # Create an online endpoint if it doesn't exist\n",
"\n",
" # Define the endpoint\n",
" endpoint = ManagedOnlineEndpoint(\n",
" name=azure_endpoint_name,\n",
" description=f\"Test endpoint for {model.name}\",\n",
" # identity=IdentityConfiguration(\n",
" # type=\"user_assigned\",\n",
" # user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],\n",
" # )\n",
" # if uai_id != \"\"\n",
" # else None,\n",
" )\n",
"\n",
"# Trigger the endpoint creation\n",
"try:\n",
" ml_client.begin_create_or_update(endpoint).wait()\n",
" logger.info(\"\\n---Endpoint created successfully---\\n\")\n",
"except Exception as err:\n",
" raise RuntimeError(f\"Endpoint creation failed. Detailed Response:\\n{err}\") from err\n",
"\n",
"t1 = time.time()\n",
"\n",
"from humanfriendly import format_timespan\n",
"\n",
"timespan = format_timespan(t1 - t0)\n",
"logger.info(f\"Creating Endpoint took {timespan}\")"
]
},
{
"cell_type": "markdown",
"id": "7f8a05af-0aca-4d36-9c0f-bfa4dcc6203b",
"metadata": {},
"source": [
"### 3.2. Create Deployment\n",
"\n",
"Create a Deployment. This takes a lot of time as GPU clusters must be provisioned and the serving environment must be built.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3fd8cf53",
"metadata": {},
"outputs": [],
"source": [
"env_vars = {\n",
" \"MODEL_NAME\": \"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\",\n",
" \"SGLANG_ARGS\": \"--mem-fraction-static 0.8 --chunked-prefill-size 4096\",\n",
" \"HUGGING_FACE_HUB_TOKEN\": \"[YOUR-HF-TOKEN]\",\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f1c99ba",
"metadata": {},
"outputs": [],
"source": [
"HF_TOKEN = d[\"config\"][\"HF_TOKEN\"]\n",
"HF_MODEL_NAME_OR_PATH = d[\"config\"][\"HF_MODEL_NAME_OR_PATH\"]\n",
"finetune_model_name = \"phi-finetune\"\n",
"\n",
"env_vars = {\n",
" \"MODEL_NAME\": HF_MODEL_NAME_OR_PATH,\n",
" # MODEL_NAME\": \"/models/outputs\",\n",
" \"VLLM_ARGS\": f\"--max-model-len 32768 --enable-lora --lora-modules {finetune_model_name}=/models/outputs\",\n",
" \"HUGGING_FACE_HUB_TOKEN\": HF_TOKEN,\n",
"}\n",
"deployment_env_vars = {**env_vars}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3afaa8b-5af1-49d1-990f-414da0effe8a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"import time\n",
"from azure.ai.ml.entities import ( \n",
" OnlineRequestSettings,\n",
" CodeConfiguration,\n",
" ManagedOnlineDeployment,\n",
" ProbeSettings,\n",
" Environment\n",
")\n",
"\n",
"t0 = time.time()\n",
"deployment = ManagedOnlineDeployment(\n",
" name=azure_deployment_name,\n",
" endpoint_name=azure_endpoint_name,\n",
" model=model,\n",
" model_mount_path=\"/models\",\n",
" instance_type=azure_serving_cluster_size,\n",
" instance_count=1,\n",
" environment_variables=deployment_env_vars, \n",
" environment=env,\n",
" # scoring_script=\"score.py\",\n",
" # code_path=\"./src_serve\",\n",
" request_settings=OnlineRequestSettings(\n",
" max_concurrent_requests_per_instance=2,\n",
" request_timeout_ms=20000, \n",
" max_queue_wait_ms=60000\n",
" ),\n",
" liveness_probe=ProbeSettings(\n",
" failure_threshold=5,\n",
" success_threshold=1,\n",
" timeout=10,\n",
" period=30,\n",
" initial_delay=120\n",
" ),\n",
" readiness_probe=ProbeSettings(\n",
" failure_threshold=30,\n",
" success_threshold=1,\n",
" timeout=2,\n",
" period=10,\n",
" initial_delay=120,\n",
" ),\n",
")\n",
"\n",
"# Trigger the deployment creation\n",
"try:\n",
" ml_client.begin_create_or_update(deployment).wait()\n",
" logger.info(\"\\n---Deployment created successfully---\\n\")\n",
"except Exception as err:\n",
" raise RuntimeError(\n",
" f\"Deployment creation failed. Detailed Response:\\n{err}\"\n",
" ) from err\n",
" \n",
"endpoint.traffic = {azure_deployment_name: 100}\n",
"endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)\n",
"\n",
"t1 = time.time()\n",
"timespan = format_timespan(t1 - t0)\n",
"logger.info(f\"Creating deployment took {timespan}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7bc7b788-01f1-47d8-8142-5fdfb0014063",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"endpoint_results = endpoint_poller.result()"
]
},
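{
"cell_type": "markdown",
"id": "sketch-probe-budget-md",
"metadata": {},
"source": [
"The probe settings in the deployment cell bound how long the container has to load the model before it is treated as failed. As a rough estimate, assuming the probes behave like Kubernetes-style probes (first check after `initial_delay`, then one check every `period` seconds until `failure_threshold` consecutive failures), the startup budget is about `initial_delay + period * failure_threshold`. `probe_budget_sec` is a hypothetical helper for this arithmetic, and it ignores the per-probe `timeout`.\n",
"\n",
"If model loading takes longer than the smaller of the two budgets, raising `initial_delay` or `failure_threshold` is one option to consider.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-probe-budget",
"metadata": {},
"outputs": [],
"source": [
"def probe_budget_sec(initial_delay, period, failure_threshold):\n",
"    \"\"\"Approximate seconds before a probe that never succeeds is declared failed.\"\"\"\n",
"    return initial_delay + period * failure_threshold\n",
"\n",
"\n",
"# Values copied from the ProbeSettings in the deployment cell.\n",
"liveness = probe_budget_sec(initial_delay=120, period=30, failure_threshold=5)\n",
"readiness = probe_budget_sec(initial_delay=120, period=10, failure_threshold=30)\n",
"print(f\"liveness budget ~{liveness}s, readiness budget ~{readiness}s\")"
]
},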
{
"cell_type": "markdown",
"id": "72f437f6-153a-42d5-ab22-0011d0fe2481",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 4. Test\n",
"\n",
"---\n",
"\n",
"### 4.1. Invocation\n",
"\n",
"Try calling the endpoint.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "adf37c63-4d1e-44a3-8c94-268ffc716da9",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"url = os.path.join(endpoint_results.scoring_uri, \"v1\")\n",
"endpoint_name = (\n",
" endpoint_results.name if azure_endpoint_name is None else azure_endpoint_name\n",
")\n",
"keys = ml_client.online_endpoints.get_keys(name=endpoint_name)\n",
"primary_key = keys.primary_key # You can paste [YOUR Azure ML API KEY] here\n",
"llm = OpenAI(base_url=url, api_key=primary_key)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53b3079d-a412-497d-aa37-795a1683ac78",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Create your prompt\n",
"system_message = \"\"\"\n",
"You are an AI assistant that helps customers find information. As an assistant, you respond to questions in a concise and unique manner.\n",
"You can use Markdown to answer simply and concisely, and add a personal touch with appropriate emojis.\n",
"\n",
"Add a witty joke starting with \"By the way,\" at the end of your response. Do not mention the customer's name in the joke part.\n",
"The joke should be related to the specific question asked.\n",
"For example, if the question is about tents, the joke should be specifically related to tents.\n",
"\n",
"Use the given context to provide a more personalized response. Write each sentence on a new line:\n",
"\"\"\"\n",
"context = \"\"\"\n",
" The Alpine Explorer Tent features a detachable partition to ensure privacy, \n",
" numerous mesh windows and adjustable vents for ventilation, and a waterproof design. \n",
" It also includes a built-in gear loft for storing outdoor essentials. \n",
" In short, it offers a harmonious blend of privacy, comfort, and convenience, making it a second home in nature!\n",
"\"\"\"\n",
"question = \"What are features of the Alpine Explorer Tent?\"\n",
"\n",
"user_message = f\"\"\"\n",
"Context: {context}\n",
"Question: {question}\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "da55f893",
"metadata": {},
"source": [
"Simple API call"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8041cae",
"metadata": {},
"outputs": [],
"source": [
"# Simple API Call\n",
"response = llm.chat.completions.create(\n",
" model=finetune_model_name,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\", \"content\": user_message},\n",
" ],\n",
" temperature=0.7,\n",
" max_tokens=200,\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "737676e1",
"metadata": {},
"source": [
"Streaming API Call"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe9dbd6a",
"metadata": {},
"outputs": [],
"source": [
"response = llm.chat.completions.create(\n",
" model=finetune_model_name,\n",
" messages=[\n",
" {\"role\": \"saystem\", \"content\": system_message},\n",
" {\"role\": \"user\", \"content\": user_message},\n",
" ],\n",
" temperature=0.7,\n",
" max_tokens=200,\n",
" stream=True, # Stream the response\n",
")\n",
"\n",
"print(\"Streaming response:\")\n",
"for chunk in response:\n",
" delta = chunk.choices[0].delta\n",
" if hasattr(delta, \"content\"):\n",
" print(delta.content, end=\"\", flush=True)"
]
},
{
"cell_type": "markdown",
"id": "dcfc683c",
"metadata": {},
"source": [
"Another method"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94cf7ccd",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"completions_url = os.path.join(endpoint_results.scoring_uri, \"v1/completions\")\n",
"headers = {\"Content-Type\": \"application/json\", \"Authorization\": f\"Bearer {primary_key}\"}\n",
"data = {\n",
" \"model\": finetune_model_name,\n",
" \"prompt\": \"San Francisco is a \",\n",
" \"max_tokens\": 200,\n",
" \"temperature\": 0.7,\n",
"}\n",
"\n",
"response = requests.post(completions_url, headers=headers, json=data)\n",
"print(response.json())"
]
},
{
"cell_type": "markdown",
"id": "6a9d96bd-75da-4c10-923c-edad899fc4d3",
"metadata": {},
"source": [
"### 4.2. LLM latency/throughput benchmarking\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20bf4fff-915b-45a8-9f46-bff0e80373df",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from time import perf_counter\n",
"\n",
"\n",
"def simple_llm_benchmark(\n",
" llm: OpenAI,\n",
" messages: list,\n",
" model_path: str = \"microsoft/Phi-4-mini-instruct\",\n",
" num_warmups: int = 1,\n",
" num_infers: int = 5,\n",
" **params: dict,\n",
") -> dict:\n",
"\n",
" print(\"=== Measuring latency ===\")\n",
" print(f\"model_path={model_path}, num_infers={num_infers}, params={params}\")\n",
"\n",
" latencies = []\n",
" # Warm up\n",
" for _ in range(num_warmups):\n",
" response = llm.chat.completions.create(\n",
" model=model_path,\n",
" messages=messages,\n",
" **params,\n",
" )\n",
" print(\"=== Warmup done. Start Benchmarking... ===\")\n",
" begin = time.time()\n",
" # Timed run\n",
" for curr_infer in range(num_infers):\n",
" start_time = perf_counter()\n",
" if (curr_infer % 5) == 0:\n",
" print(f\"Inferring {curr_infer}th...\")\n",
" response = llm.chat.completions.create(\n",
" model=model_path,\n",
" messages=messages,\n",
" **params,\n",
" )\n",
" latency = perf_counter() - start_time\n",
" latencies.append(latency)\n",
" end = time.time()\n",
"\n",
" # Compute run statistics\n",
" duration = end - begin\n",
" time_avg_sec = np.mean(latencies)\n",
" time_std_sec = np.std(latencies)\n",
" time_p95_sec = np.percentile(latencies, 95)\n",
" time_p99_sec = np.percentile(latencies, 99)\n",
"\n",
" # Metrics\n",
" metrics = {\n",
" \"duration\": duration,\n",
" \"avg_sec\": time_avg_sec,\n",
" \"std_sec\": time_std_sec,\n",
" \"p95_sec\": time_p95_sec,\n",
" \"p99_sec\": time_p99_sec,\n",
" }\n",
"\n",
" return metrics"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8827fb45-c7ab-4aa8-b9f4-1c4e610d695a",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\", \"content\": user_message},\n",
"]\n",
"params = {\n",
" \"max_tokens\": 100,\n",
" \"temperature\": 0.5,\n",
"}\n",
"\n",
"metrics = simple_llm_benchmark(\n",
" llm,\n",
" messages,\n",
" model_path=finetune_model_name,\n",
" num_warmups=1,\n",
" num_infers=10,\n",
" **params\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c62e9289-fe65-4cbf-abc8-f19a4f486bde",
"metadata": {},
"outputs": [],
"source": [
"import pprint\n",
"\n",
"pprint.pprint(metrics)"
]
},
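{
"cell_type": "markdown",
"id": "sketch-throughput-md",
"metadata": {},
"source": [
"The benchmark above reports latency only. Below is a minimal sketch of deriving throughput from the same kind of sequential run, assuming each response exposes `usage.completion_tokens` (part of the OpenAI response schema; whether the serving engine fills it in is an assumption here). `summarize_throughput` and the sample numbers are hypothetical and for illustration only; they are not measured results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-throughput",
"metadata": {},
"outputs": [],
"source": [
"from statistics import mean\n",
"\n",
"\n",
"def summarize_throughput(latencies_sec, completion_tokens):\n",
"    \"\"\"Return requests/sec and generated tokens/sec for sequential requests.\"\"\"\n",
"    if not latencies_sec or len(latencies_sec) != len(completion_tokens):\n",
"        raise ValueError(\"Need at least one sample and one token count per latency.\")\n",
"    total_time = sum(latencies_sec)\n",
"    return {\n",
"        \"requests_per_sec\": len(latencies_sec) / total_time,\n",
"        \"tokens_per_sec\": sum(completion_tokens) / total_time,\n",
"        \"avg_tokens_per_request\": mean(completion_tokens),\n",
"    }\n",
"\n",
"\n",
"# Hypothetical latencies (seconds) and completion token counts, one pair per request.\n",
"print(summarize_throughput([2.0, 2.5, 1.5], [100, 100, 100]))"
]
},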
{
"cell_type": "markdown",
"id": "5834a237-e751-446a-ac21-7272c29b0c2c",
"metadata": {},
"source": [
"## Clean up\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59508aad",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!rm -rf {test_src_dir}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "caed4c32",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"......................................................................."
]
}
],
"source": [
"ml_client.online_endpoints.begin_delete(azure_endpoint_name)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py312-dev",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
},
"microsoft": {
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}