{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "8v1fgX5Nd3gb"
},
"outputs": [],
"source": [
"# Copyright 2025 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xaG_TQiBd3gb"
},
"source": [
"# Vertex AI Model Garden - Phi-4 (Deployment)\n",
"\n",
"<table><tbody><tr>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_phi4_deployment.ipynb\">\n",
" <img alt=\"Google Cloud Colab Enterprise logo\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" width=\"32px\"><br> Run in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_phi4_deployment.ipynb\">\n",
" <img alt=\"GitHub logo\" src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" width=\"32px\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</tr></tbody></table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iJs8Mk6Vd3gb"
},
"source": [
"## Overview\n",
"\n",
"This notebook demonstrates deploying prebuilt [Phi-4 models](https://huggingface.co/collections/microsoft/phi-4-677e9380e514feb5577a40e4) with [vLLM](https://github.com/vllm-project/vllm) and [HexLLM](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-hex-llm?hl=en) to improve serving throughput.\n",
"\n",
"\n",
"### Objective\n",
"\n",
"- Download and deploy prebuilt Phi-4 models\n",
"- Deploy Phi-4 with [vLLM](https://github.com/vllm-project/vllm) to improve serving throughput\n",
"- Deploy Phi-4 with [HexLLM](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-hex-llm?hl=en)\n",
"\n",
"### Costs\n",
"\n",
"This tutorial uses billable components of Google Cloud:\n",
"\n",
"* Vertex AI\n",
"* Cloud Storage\n",
"\n",
"Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Cj84x0OUd3gb"
},
"source": [
"## Before you begin"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "0QATZfrLd3gb"
},
"outputs": [],
"source": [
"# @title Setup Google Cloud project\n",
"\n",
"# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
"\n",
"# @markdown 2. **[Optional]** [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs. Set the BUCKET_URI for the experiment environment. The specified Cloud Storage bucket (`BUCKET_URI`) should be located in the same region as where the notebook was launched. Note that a multi-region bucket (eg. \"us\") is not considered a match for a single region covered by the multi-region range (eg. \"us-central1\"). If not set, a unique GCS bucket will be created instead.\n",
"\n",
"BUCKET_URI = \"gs://\" # @param {type:\"string\"}\n",
"\n",
"# @markdown 3. **[Optional]** Set region. If not set, the region will be set automatically according to Colab Enterprise environment.\n",
"\n",
"REGION = \"\" # @param {type:\"string\"}\n",
"\n",
"# @markdown 4. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request for quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n",
"\n",
"# @markdown > | Machine Type | Accelerator Type | Recommended Regions |\n",
"# @markdown | ----------- | ----------- | ----------- |\n",
"# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1, us-east4 |\n",
"# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n",
"# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n",
"# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, europe-west4, us-west1, asia-southeast1 |\n",
"\n",
"# Import the necessary packages\n",
"import datetime\n",
"import importlib\n",
"import os\n",
"import uuid\n",
"from typing import Tuple\n",
"\n",
"from google.cloud import aiplatform\n",
"\n",
"! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git\n",
"\n",
"models, endpoints = {}, {}\n",
"\n",
"common_util = importlib.import_module(\n",
" \"vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util\"\n",
")\n",
"\n",
"# Get the default cloud project id.\n",
"PROJECT_ID = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n",
"\n",
"# Get the default region for launching jobs.\n",
"if not REGION:\n",
" if not os.environ.get(\"GOOGLE_CLOUD_REGION\"):\n",
" raise ValueError(\n",
" \"REGION must be set. See\"\n",
" \" https://cloud.google.com/vertex-ai/docs/general/locations for\"\n",
" \" available cloud locations.\"\n",
" )\n",
" REGION = os.environ[\"GOOGLE_CLOUD_REGION\"]\n",
"\n",
"# Enable the Vertex AI API and Compute Engine API, if not already.\n",
"print(\"Enabling Vertex AI API and Compute Engine API.\")\n",
"! gcloud services enable aiplatform.googleapis.com compute.googleapis.com\n",
"\n",
"# Cloud Storage bucket for storing the experiment artifacts.\n",
"# A unique GCS bucket will be created for the purpose of this notebook. If you\n",
"# prefer using your own GCS bucket, change the value yourself below.\n",
"now = datetime.datetime.now().strftime(\"%Y%m%d%H%M%S\")\n",
"BUCKET_NAME = \"/\".join(BUCKET_URI.split(\"/\")[:3])\n",
"\n",
"if BUCKET_URI is None or BUCKET_URI.strip() == \"\" or BUCKET_URI == \"gs://\":\n",
" BUCKET_URI = f\"gs://{PROJECT_ID}-tmp-{now}-{str(uuid.uuid4())[:4]}\"\n",
" BUCKET_NAME = \"/\".join(BUCKET_URI.split(\"/\")[:3])\n",
" ! gsutil mb -l {REGION} {BUCKET_URI}\n",
"else:\n",
" assert BUCKET_URI.startswith(\"gs://\"), \"BUCKET_URI must start with `gs://`.\"\n",
" shell_output = ! gsutil ls -Lb {BUCKET_NAME} | grep \"Location constraint:\" | sed \"s/Location constraint://\"\n",
" bucket_region = shell_output[0].strip().lower()\n",
" if bucket_region != REGION:\n",
" raise ValueError(\n",
" \"Bucket region %s is different from notebook region %s\"\n",
" % (bucket_region, REGION)\n",
" )\n",
"print(f\"Using this GCS Bucket: {BUCKET_URI}\")\n",
"\n",
"STAGING_BUCKET = os.path.join(BUCKET_URI, \"temporal\")\n",
"MODEL_BUCKET = os.path.join(BUCKET_URI, \"phi4\")\n",
"\n",
"\n",
"# Initialize Vertex AI API.\n",
"print(\"Initializing Vertex AI API.\")\n",
"aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)\n",
"\n",
"# Gets the default SERVICE_ACCOUNT.\n",
"shell_output = ! gcloud projects describe $PROJECT_ID\n",
"project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n",
"SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n",
"print(\"Using this default Service Account:\", SERVICE_ACCOUNT)\n",
"\n",
"\n",
"# Provision permissions to the SERVICE_ACCOUNT with the GCS bucket\n",
"! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.admin $BUCKET_NAME\n",
"\n",
"! gcloud config set project $PROJECT_ID\n",
"! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role=\"roles/storage.admin\"\n",
"! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role=\"roles/aiplatform.user\""
]
},
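{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "sa-roles-check"
},
"outputs": [],
"source": [
"# @title [Optional] Verify service account roles\n",
"\n",
"# @markdown This optional cell is a minimal sketch, assuming the `gcloud` CLI is available in this environment (it is in Colab Enterprise): it lists the roles currently bound to the default compute service account so you can confirm that the IAM bindings granted above took effect.\n",
"\n",
"! gcloud projects get-iam-policy {PROJECT_ID} --flatten=\"bindings[].members\" --filter=\"bindings.members:serviceAccount:{SERVICE_ACCOUNT}\" --format=\"table(bindings.role)\"\n",
"\n",
"# @markdown Click \"Show Code\" to see more details."
]
},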
{
"cell_type": "markdown",
"metadata": {
"id": "czbg_Jfed3gb"
},
"source": [
"## Deploy prebuilt Phi-4 models with vLLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "I-xYEPgVd3gb"
},
"outputs": [],
"source": [
"# @title Deploy\n",
"\n",
"# @markdown This section uploads prebuilt the Phi-4 model to Model Registry and deploys it to a Vertex AI Endpoint.\n",
"\n",
"# @markdown The Phi-4 model may take 15-30 minutes to deploy.\n",
"\n",
"\n",
"# @markdown | Model Version | Default Max Model Length | Available GPU configurations |\n",
"# @markdown |----------------------------|------------------|-----------------------------|\n",
"# @markdown | Phi-4 | 16384 | 1 NVIDIA_A100 80GB a2-ultragpu-1g, 2 NVIDIA_L4 g2-standard-24 |\n",
"\n",
"# The pre-built serving docker images.\n",
"VLLM_DOCKER_URI = \"us-docker.pkg.dev/deeplearning-platform-release/vertex-model-garden/vllm-inference.cu121.0-6.ubuntu2204.py310\"\n",
"\n",
"MODEL_ID = \"Phi-4\"\n",
"model_path_prefix = \"microsoft\"\n",
"model_id = os.path.join(model_path_prefix, MODEL_ID)\n",
"\n",
"accelerator_type = \"NVIDIA_L4\" # @param [\"NVIDIA_L4\", \"NVIDIA_A100_80GB\"] {isTemplate: true}\n",
"machine_type = None\n",
"vllm_dtype = \"bfloat16\"\n",
"accelerator_count = None\n",
"max_model_len = None\n",
"gpu_memory_utilization = None\n",
"enable_trust_remote_code = False\n",
"\n",
"if \"Phi-4\" == MODEL_ID:\n",
" max_model_len = 16384\n",
" if accelerator_type == \"NVIDIA_L4\":\n",
" accelerator_count = 2\n",
" machine_type = \"g2-standard-24\"\n",
" gpu_memory_utilization = 0.85\n",
" elif accelerator_type == \"NVIDIA_A100_80GB\":\n",
" accelerator_count = 1\n",
" machine_type = \"a2-ultragpu-1g\"\n",
" gpu_memory_utilization = 0.85\n",
" else:\n",
" raise ValueError(\n",
" \"Recommended machine settings not found for accelerator type: %s\"\n",
" % accelerator_type\n",
" )\n",
"else:\n",
" raise ValueError(\"Invalid model id: %s\" % MODEL_ID)\n",
"\n",
"common_util.check_quota(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" accelerator_type=accelerator_type,\n",
" accelerator_count=accelerator_count,\n",
" is_for_training=False,\n",
")\n",
"\n",
"\n",
"def deploy_model_vllm(\n",
" model_name: str,\n",
" model_id: str,\n",
" publisher: str,\n",
" publisher_model_id: str,\n",
" service_account: str,\n",
" base_model_id: str = None,\n",
" machine_type: str = \"g2-standard-8\",\n",
" accelerator_type: str = \"NVIDIA_L4\",\n",
" accelerator_count: int = 1,\n",
" gpu_memory_utilization: float = 0.9,\n",
" max_model_len: int = 4096,\n",
" dtype: str = \"auto\",\n",
" enable_trust_remote_code: bool = False,\n",
" enforce_eager: bool = False,\n",
" enable_lora: bool = False,\n",
" enable_chunked_prefill: bool = False,\n",
" enable_prefix_cache: bool = False,\n",
" host_prefix_kv_cache_utilization_target: float = 0.0,\n",
" max_loras: int = 1,\n",
" max_cpu_loras: int = 8,\n",
" use_dedicated_endpoint: bool = False,\n",
" max_num_seqs: int = 256,\n",
" model_type: str = None,\n",
" enable_llama_tool_parser: bool = False,\n",
") -> Tuple[aiplatform.Model, aiplatform.Endpoint]:\n",
" \"\"\"Deploys trained models with vLLM into Vertex AI.\"\"\"\n",
" endpoint = aiplatform.Endpoint.create(\n",
" display_name=f\"{model_name}-endpoint\",\n",
" dedicated_endpoint_enabled=use_dedicated_endpoint,\n",
" )\n",
"\n",
" if not base_model_id:\n",
" base_model_id = model_id\n",
"\n",
" # See https://docs.vllm.ai/en/latest/models/engine_args.html for a list of possible arguments with descriptions.\n",
" vllm_args = [\n",
" \"python\",\n",
" \"-m\",\n",
" \"vllm.entrypoints.api_server\",\n",
" \"--host=0.0.0.0\",\n",
" \"--port=8080\",\n",
" f\"--model={model_id}\",\n",
" f\"--tensor-parallel-size={accelerator_count}\",\n",
" \"--swap-space=16\",\n",
" f\"--gpu-memory-utilization={gpu_memory_utilization}\",\n",
" f\"--max-model-len={max_model_len}\",\n",
" f\"--dtype={dtype}\",\n",
" f\"--max-loras={max_loras}\",\n",
" f\"--max-cpu-loras={max_cpu_loras}\",\n",
" f\"--max-num-seqs={max_num_seqs}\",\n",
" \"--disable-log-stats\",\n",
" ]\n",
"\n",
" if enable_trust_remote_code:\n",
" vllm_args.append(\"--trust-remote-code\")\n",
"\n",
" if enforce_eager:\n",
" vllm_args.append(\"--enforce-eager\")\n",
"\n",
" if enable_lora:\n",
" vllm_args.append(\"--enable-lora\")\n",
"\n",
" if enable_chunked_prefill:\n",
" vllm_args.append(\"--enable-chunked-prefill\")\n",
"\n",
" if enable_prefix_cache:\n",
" vllm_args.append(\"--enable-prefix-caching\")\n",
"\n",
" if 0 < host_prefix_kv_cache_utilization_target < 1:\n",
" vllm_args.append(\n",
" f\"--host-prefix-kv-cache-utilization-target={host_prefix_kv_cache_utilization_target}\"\n",
" )\n",
"\n",
" if model_type:\n",
" vllm_args.append(f\"--model-type={model_type}\")\n",
"\n",
" if enable_llama_tool_parser:\n",
" vllm_args.append(\"--enable-auto-tool-choice\")\n",
" vllm_args.append(\"--tool-call-parser=vertex-llama-3\")\n",
"\n",
" env_vars = {\n",
" \"MODEL_ID\": base_model_id,\n",
" \"DEPLOY_SOURCE\": \"notebook\",\n",
" }\n",
"\n",
" # HF_TOKEN is not a compulsory field and may not be defined.\n",
" try:\n",
" if HF_TOKEN:\n",
" env_vars[\"HF_TOKEN\"] = HF_TOKEN\n",
" except NameError:\n",
" pass\n",
"\n",
" model = aiplatform.Model.upload(\n",
" display_name=model_name,\n",
" serving_container_image_uri=VLLM_DOCKER_URI,\n",
" serving_container_args=vllm_args,\n",
" serving_container_ports=[8080],\n",
" serving_container_predict_route=\"/generate\",\n",
" serving_container_health_route=\"/ping\",\n",
" serving_container_environment_variables=env_vars,\n",
" serving_container_shared_memory_size_mb=(16 * 1024), # 16 GB\n",
" serving_container_deployment_timeout=7200,\n",
" model_garden_source_model_name=(\n",
" f\"publishers/{publisher}/models/{publisher_model_id}\"\n",
" ),\n",
" )\n",
" print(\n",
" f\"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s).\"\n",
" )\n",
" model.deploy(\n",
" endpoint=endpoint,\n",
" machine_type=machine_type,\n",
" accelerator_type=accelerator_type,\n",
" accelerator_count=accelerator_count,\n",
" deploy_request_timeout=1800,\n",
" service_account=service_account,\n",
" system_labels={\n",
" \"NOTEBOOK_NAME\": \"model_garden_phi4_deployment.ipynb\",\n",
" \"NOTEBOOK_ENVIRONMENT\": common_util.get_deploy_source(),\n",
" },\n",
" )\n",
" print(\"endpoint_name:\", endpoint.name)\n",
"\n",
" return model, endpoint\n",
"\n",
"\n",
"# @markdown Set `use_dedicated_endpoint` to False if you don't want to use [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint).\n",
"use_dedicated_endpoint = True # @param {type:\"boolean\"}\n",
"\n",
"\n",
"models[\"vllm_gpu\"], endpoints[\"vllm_gpu\"] = deploy_model_vllm(\n",
" model_name=common_util.get_job_name_with_datetime(prefix=MODEL_ID),\n",
" model_id=model_id,\n",
" publisher=\"microsoft\",\n",
" publisher_model_id=\"phi-4\",\n",
" service_account=SERVICE_ACCOUNT,\n",
" machine_type=machine_type,\n",
" accelerator_type=accelerator_type,\n",
" accelerator_count=accelerator_count,\n",
" max_model_len=max_model_len,\n",
" gpu_memory_utilization=gpu_memory_utilization,\n",
" dtype=vllm_dtype,\n",
" enable_trust_remote_code=enable_trust_remote_code,\n",
" use_dedicated_endpoint=use_dedicated_endpoint,\n",
")\n",
"\n",
"# @markdown Click \"Show Code\" to see more details."
]
},
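{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "inspect-vllm-deployment"
},
"outputs": [],
"source": [
"# @title [Optional] Inspect the deployment\n",
"\n",
"# @markdown This optional cell is a minimal sketch that uses only the Vertex AI SDK calls already imported above: it looks up the endpoint created in the previous cell by its display name and lists the models deployed on it. This is handy if you return to the notebook later and want to reuse an existing endpoint instead of deploying again.\n",
"\n",
"# Find endpoints whose display name matches the endpoint created above.\n",
"endpoint_display_name = endpoints[\"vllm_gpu\"].display_name\n",
"matched_endpoints = aiplatform.Endpoint.list(\n",
"    filter=f'display_name=\"{endpoint_display_name}\"'\n",
")\n",
"\n",
"for ep in matched_endpoints:\n",
"    print(\"Endpoint:\", ep.resource_name)\n",
"    # Each deployed model record carries the model display name and the deployment ID.\n",
"    for deployed_model in ep.list_models():\n",
"        print(\"  Deployed model:\", deployed_model.display_name, deployed_model.id)\n",
"\n",
"# @markdown Click \"Show Code\" to see more details."
]
},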
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "eksglm0Yd3gb"
},
"outputs": [],
"source": [
"# @title Predict\n",
"\n",
"# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts. Sampling parameters supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/dev/sampling_params.html).\n",
"\n",
"# @markdown Example:\n",
"\n",
"# @markdown ```\n",
"# @markdown Human: What is a car?\n",
"# @markdown Assistant: A car, or a motor car, is a road-connected human-transportation system used to move people or goods from one place to another. The term also encompasses a wide range of vehicles, including motorboats, trains, and aircrafts. Cars typically have four wheels, a cabin for passengers, and an engine or motor. They have been around since the early 19th century and are now one of the most popular forms of transportation, used for daily commuting, shopping, and other purposes.\n",
"# @markdown ```\n",
"# @markdown Additionally, you can moderate the generated text with Vertex AI. See [Moderate text documentation](https://cloud.google.com/natural-language/docs/moderating-text) for more details.\n",
"\n",
"# Loads an existing endpoint instance using the endpoint name:\n",
"# - Using `endpoint_name = endpoint.name` allows us to get the\n",
"# endpoint name of the endpoint `endpoint` created in the cell\n",
"# above.\n",
"# - Alternatively, you can set `endpoint_name = \"1234567890123456789\"` to load\n",
"# an existing endpoint with the ID 1234567890123456789.\n",
"# You may uncomment the code below to load an existing endpoint.\n",
"\n",
"# endpoint_name = \"\" # @param {type:\"string\"}\n",
"# aip_endpoint_name = (\n",
"# f\"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}\"\n",
"# )\n",
"# endpoint = aiplatform.Endpoint(aip_endpoint_name)\n",
"\n",
"prompt = \"What is a car?\" # @param {type: \"string\"}\n",
"# @markdown If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, by lowering `max_tokens`.\n",
"max_tokens = 50 # @param {type:\"integer\"}\n",
"temperature = 1.0 # @param {type:\"number\"}\n",
"top_p = 1.0 # @param {type:\"number\"}\n",
"top_k = 1 # @param {type:\"integer\"}\n",
"# @markdown Set `raw_response` to `True` to obtain the raw model output. Set `raw_response` to `False` to apply additional formatting in the structure of `\"Prompt:\\n{prompt.strip()}\\nOutput:\\n{output}\"`.\n",
"raw_response = False # @param {type:\"boolean\"}\n",
"\n",
"# Overrides parameters for inferences.\n",
"instances = [\n",
" {\n",
" \"prompt\": prompt,\n",
" \"max_tokens\": max_tokens,\n",
" \"temperature\": temperature,\n",
" \"top_p\": top_p,\n",
" \"top_k\": top_k,\n",
" \"raw_response\": raw_response,\n",
" },\n",
"]\n",
"response = endpoints[\"vllm_gpu\"].predict(\n",
" instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n",
")\n",
"\n",
"for prediction in response.predictions:\n",
" print(prediction)\n",
"\n",
"# @markdown Click \"Show Code\" to see more details."
]
},
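{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "moderate-generated-text"
},
"outputs": [],
"source": [
"# @title [Optional] Moderate the generated text\n",
"\n",
"# @markdown This optional cell is a minimal sketch of the moderation step mentioned above. It assumes the `google-cloud-language` client library can be installed and that the Cloud Natural Language API can be enabled in your project; it scores the text returned by the endpoint against the Moderate Text categories.\n",
"\n",
"! pip install --quiet google-cloud-language\n",
"! gcloud services enable language.googleapis.com\n",
"\n",
"from google.cloud import language_v2\n",
"\n",
"# Use the first prediction from the request above as the text to moderate.\n",
"text_to_check = str(response.predictions[0])\n",
"\n",
"language_client = language_v2.LanguageServiceClient()\n",
"document = language_v2.Document(\n",
"    content=text_to_check,\n",
"    type_=language_v2.Document.Type.PLAIN_TEXT,\n",
")\n",
"moderation = language_client.moderate_text(document=document)\n",
"\n",
"# Print each moderation category with its confidence score.\n",
"for category in moderation.moderation_categories:\n",
"    print(f\"{category.name}: {category.confidence:.3f}\")\n",
"\n",
"# @markdown Click \"Show Code\" to see more details."
]
},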
{
"cell_type": "markdown",
"metadata": {
"id": "OHKQj8V8d3gb"
},
"source": [
"## Deploy prebuilt Phi-4 models with HexLLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "5kkOzZ_jd3gb"
},
"outputs": [],
"source": [
"# @title Deploy\n",
"\n",
"# @markdown This section uploads prebuilt Phi-4 models to Model Registry and deploys it to a Vertex AI Endpoint. It takes 15 minutes to 1 hour to finish depending on the size of the model.\n",
"\n",
"# @markdown Select one of the four model variations.\n",
"MODEL_ID = \"Phi-4\"\n",
"TPU_DEPLOYMENT_REGION = \"us-west1\" # @param [\"us-west1\"] {isTemplate:true}\n",
"model_path_prefix = \"microsoft\"\n",
"model_id = os.path.join(model_path_prefix, MODEL_ID)\n",
"\n",
"# The pre-built serving docker images.\n",
"HEXLLM_DOCKER_URI = \"us-docker.pkg.dev/vertex-ai-restricted/vertex-vision-model-garden-dockers/hex-llm-serve:20241210_2323_RC00\"\n",
"\n",
"# @markdown Find Vertex AI prediction TPUv5e machine types in\n",
"# @markdown https://cloud.google.com/vertex-ai/docs/predictions/use-tpu#deploy_a_model.\n",
"\n",
"# @markdown | Model Version | Default Max Model Length | Default TPU configuration |\n",
"# @markdown |----------------------------|------------------|-----------------------------|\n",
"# @markdown | Phi-4 | 16384 | 4 TPU_V5e ct5lp-hightpu-4t |\n",
"\n",
"\n",
"# Note: 1 TPU V5 chip has only one core.\n",
"tpu_type = \"TPU_V5e\"\n",
"\n",
"if \"Phi-4\" in MODEL_ID:\n",
" tpu_count = 4\n",
" tpu_topo = \"4x4\"\n",
" max_model_len = 16384\n",
" machine_type = \"ct5lp-hightpu-4t\"\n",
"else:\n",
" raise ValueError(f\"Unsupported MODEL_ID: {MODEL_ID}\")\n",
"\n",
"common_util.check_quota(\n",
" project_id=PROJECT_ID,\n",
" region=TPU_DEPLOYMENT_REGION,\n",
" accelerator_type=tpu_type,\n",
" accelerator_count=tpu_count,\n",
" is_for_training=False,\n",
")\n",
"\n",
"# Server parameters.\n",
"tensor_parallel_size = tpu_count\n",
"\n",
"# Fraction of HBM memory allocated for KV cache after model loading. A larger value improves throughput but gives higher risk of TPU out-of-memory errors with long prompts.\n",
"hbm_utilization_factor = 0.85\n",
"\n",
"max_running_seqs = 256\n",
"\n",
"# Endpoint configurations.\n",
"min_replica_count = 1\n",
"max_replica_count = 1\n",
"\n",
"\n",
"def deploy_model_hexllm(\n",
" model_name: str,\n",
" model_id: str,\n",
" publisher: str,\n",
" publisher_model_id: str,\n",
" service_account: str = None,\n",
" base_model_id: str = None,\n",
" data_parallel_size: int = 1,\n",
" tensor_parallel_size: int = 1,\n",
" machine_type: str = \"ct5lp-hightpu-1t\",\n",
" tpu_topology: str = \"1x1\",\n",
" disagg_topology: str = None,\n",
" hbm_utilization_factor: float = 0.6,\n",
" max_running_seqs: int = 256,\n",
" max_model_len: int = 4096,\n",
" enable_prefix_cache_hbm: bool = False,\n",
" endpoint_id: str = \"\",\n",
" min_replica_count: int = 1,\n",
" max_replica_count: int = 1,\n",
" use_dedicated_endpoint: bool = False,\n",
") -> Tuple[aiplatform.Model, aiplatform.Endpoint]:\n",
" \"\"\"Deploys models with Hex-LLM on TPU in Vertex AI.\"\"\"\n",
" if endpoint_id:\n",
" aip_endpoint_name = (\n",
" f\"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_id}\"\n",
" )\n",
" endpoint = aiplatform.Endpoint(aip_endpoint_name)\n",
" else:\n",
" endpoint = aiplatform.Endpoint.create(\n",
" display_name=f\"{model_name}-endpoint\",\n",
" location=TPU_DEPLOYMENT_REGION,\n",
" dedicated_endpoint_enabled=use_dedicated_endpoint,\n",
" )\n",
"\n",
" if not base_model_id:\n",
" base_model_id = model_id\n",
"\n",
" if not tensor_parallel_size:\n",
" tensor_parallel_size = int(machine_type[-2])\n",
"\n",
" num_hosts = int(tpu_topology.split(\"x\")[0])\n",
"\n",
" # Learn more about the supported arguments and environment variables at https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-hex-llm#config-server.\n",
" hexllm_args = [\n",
" \"--host=0.0.0.0\",\n",
" \"--port=7080\",\n",
" f\"--model={model_id}\",\n",
" f\"--data_parallel_size={data_parallel_size}\",\n",
" f\"--tensor_parallel_size={tensor_parallel_size}\",\n",
" f\"--num_hosts={num_hosts}\",\n",
" f\"--hbm_utilization_factor={hbm_utilization_factor}\",\n",
" f\"--max_running_seqs={max_running_seqs}\",\n",
" f\"--max_model_len={max_model_len}\",\n",
" ]\n",
" if disagg_topology:\n",
" hexllm_args.append(f\"--disagg_topo={disagg_topology}\")\n",
" if enable_prefix_cache_hbm and not disagg_topology:\n",
" hexllm_args.append(\"--enable_prefix_cache_hbm\")\n",
"\n",
" env_vars = {\n",
" \"MODEL_ID\": base_model_id,\n",
" \"HEX_LLM_LOG_LEVEL\": \"info\",\n",
" \"DEPLOY_SOURCE\": \"notebook\",\n",
" }\n",
"\n",
" # HF_TOKEN is not a compulsory field and may not be defined.\n",
" try:\n",
" if HF_TOKEN:\n",
" env_vars.update({\"HF_TOKEN\": HF_TOKEN})\n",
" except:\n",
" pass\n",
"\n",
" model = aiplatform.Model.upload(\n",
" display_name=model_name,\n",
" serving_container_image_uri=HEXLLM_DOCKER_URI,\n",
" serving_container_command=[\"python\", \"-m\", \"hex_llm.server.api_server\"],\n",
" serving_container_args=hexllm_args,\n",
" serving_container_ports=[7080],\n",
" serving_container_predict_route=\"/generate\",\n",
" serving_container_health_route=\"/ping\",\n",
" serving_container_environment_variables=env_vars,\n",
" serving_container_shared_memory_size_mb=(16 * 1024), # 16 GB\n",
" serving_container_deployment_timeout=7200,\n",
" location=TPU_DEPLOYMENT_REGION,\n",
" model_garden_source_model_name=(\n",
" f\"publishers/{publisher}/models/{publisher_model_id}\"\n",
" ),\n",
" )\n",
"\n",
" model.deploy(\n",
" endpoint=endpoint,\n",
" machine_type=machine_type,\n",
" tpu_topology=tpu_topology if num_hosts > 1 else None,\n",
" deploy_request_timeout=1800,\n",
" service_account=service_account,\n",
" min_replica_count=min_replica_count,\n",
" max_replica_count=max_replica_count,\n",
" system_labels={\n",
" \"NOTEBOOK_NAME\": \"model_garden_phi4_deployment.ipynb\",\n",
" \"NOTEBOOK_ENVIRONMENT\": common_util.get_deploy_source(),\n",
" },\n",
" )\n",
" return model, endpoint\n",
"\n",
"\n",
"# @markdown Set `use_dedicated_endpoint` to False if you don't want to use [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint).\n",
"use_dedicated_endpoint = True # @param {type:\"boolean\"}\n",
"\n",
"\n",
"models[\"hexllm_tpu\"], endpoints[\"hexllm_tpu\"] = deploy_model_hexllm(\n",
" model_name=common_util.get_job_name_with_datetime(prefix=MODEL_ID),\n",
" model_id=model_id,\n",
" publisher=\"microsoft\",\n",
" publisher_model_id=\"phi-4\",\n",
" service_account=SERVICE_ACCOUNT,\n",
" tensor_parallel_size=tensor_parallel_size,\n",
" machine_type=machine_type,\n",
" tpu_topology=tpu_topo,\n",
" hbm_utilization_factor=hbm_utilization_factor,\n",
" max_running_seqs=max_running_seqs,\n",
" max_model_len=max_model_len,\n",
" min_replica_count=min_replica_count,\n",
" max_replica_count=max_replica_count,\n",
" use_dedicated_endpoint=use_dedicated_endpoint,\n",
")"
]
},
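{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "reuse-existing-endpoint"
},
"outputs": [],
"source": [
"# @title [Optional] Reuse an existing endpoint\n",
"\n",
"# @markdown This optional cell is a minimal sketch built on the `endpoint_id` argument of `deploy_model_hexllm` defined above: if you already have a Vertex AI endpoint in `TPU_DEPLOYMENT_REGION`, enter its numeric ID below to deploy onto it instead of creating a new endpoint. Leave the field empty to skip this cell.\n",
"\n",
"existing_endpoint_id = \"\"  # @param {type:\"string\"}\n",
"\n",
"if existing_endpoint_id:\n",
"    models[\"hexllm_tpu\"], endpoints[\"hexllm_tpu\"] = deploy_model_hexllm(\n",
"        model_name=common_util.get_job_name_with_datetime(prefix=MODEL_ID),\n",
"        model_id=model_id,\n",
"        publisher=\"microsoft\",\n",
"        publisher_model_id=\"phi-4\",\n",
"        service_account=SERVICE_ACCOUNT,\n",
"        tensor_parallel_size=tensor_parallel_size,\n",
"        machine_type=machine_type,\n",
"        tpu_topology=tpu_topo,\n",
"        hbm_utilization_factor=hbm_utilization_factor,\n",
"        max_running_seqs=max_running_seqs,\n",
"        max_model_len=max_model_len,\n",
"        min_replica_count=min_replica_count,\n",
"        max_replica_count=max_replica_count,\n",
"        endpoint_id=existing_endpoint_id,\n",
"        use_dedicated_endpoint=use_dedicated_endpoint,\n",
"    )\n",
"else:\n",
"    print(\"No existing endpoint ID provided; skipping.\")\n",
"\n",
"# @markdown Click \"Show Code\" to see more details."
]
},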
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "zxsr8p5Md3gb"
},
"outputs": [],
"source": [
"# @title Predict\n",
"\n",
"# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts based on your `template`. Note that the first few prompts will take longer to execute.\n",
"\n",
"# @markdown Additionally, you can moderate the generated text with Vertex AI. See [Moderate text documentation](https://cloud.google.com/natural-language/docs/moderating-text) for more details.\n",
"\n",
"# @markdown Example:\n",
"\n",
"# @markdown ```\n",
"# @markdown > What is a car?\n",
"# @markdown > A car is a four-wheeled vehicle designed for the transportation of passengers and their belongings.\n",
"# @markdown ```\n",
"\n",
"# @markdown Additionally, you can moderate the generated text with Vertex AI. See [Moderate text documentation](https://cloud.google.com/natural-language/docs/moderating-text) for more details.\n",
"\n",
"# Loads an existing endpoint instance using the endpoint name:\n",
"# - Using `endpoint_name = endpoint.name` allows us to get the endpoint\n",
"# name of the endpoint `endpoint` created in the cell above.\n",
"# - Alternatively, you can set `endpoint_name = \"1234567890123456789\"` to load\n",
"# an existing endpoint with the ID 1234567890123456789.\n",
"# You may uncomment the code below to load an existing endpoint:\n",
"# endpoint_name = endpoint_without_peft.name\n",
"# # endpoint_name = \"\" # @param {type:\"string\"}\n",
"# aip_endpoint_name = (\n",
"# f\"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}\"\n",
"# )\n",
"# endpoint = aiplatform.Endpoint(aip_endpoint_name)\n",
"\n",
"prompt = \"What is a car?\" # @param {type: \"string\"}\n",
"# @markdown If you encounter the issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, such as set `max_tokens` as 20.\n",
"max_tokens = 50 # @param {type: \"integer\"}\n",
"temperature = 1.0 # @param {type: \"number\"}\n",
"top_p = 1.0 # @param {type: \"number\"}\n",
"top_k = 1 # @param {type: \"integer\"}\n",
"\n",
"# Overrides parameters for inferences.\n",
"instances = [\n",
" {\n",
" \"prompt\": prompt,\n",
" \"max_tokens\": max_tokens,\n",
" \"temperature\": temperature,\n",
" \"top_p\": top_p,\n",
" \"top_k\": top_k,\n",
" },\n",
"]\n",
"response = endpoints[\"hexllm_tpu\"].predict(\n",
" instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n",
")\n",
"\n",
"for prediction in response.predictions:\n",
" print(prediction)\n",
"\n",
"# @markdown Click \"Show Code\" to see more details."
]
},
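{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "batch-predict-hexllm"
},
"outputs": [],
"source": [
"# @title [Optional] Send multiple prompts in one request\n",
"\n",
"# @markdown This optional cell is a minimal sketch showing that the endpoint accepts several instances per request: each prompt in the list below gets its own prediction in the response. The prompt list is an illustrative placeholder; replace it with your own prompts.\n",
"\n",
"batch_prompts = [\"What is a car?\", \"What is a bicycle?\"]  # Illustrative prompts.\n",
"\n",
"# Build one instance per prompt, reusing the sampling parameters set above.\n",
"batch_instances = [\n",
"    {\n",
"        \"prompt\": p,\n",
"        \"max_tokens\": max_tokens,\n",
"        \"temperature\": temperature,\n",
"        \"top_p\": top_p,\n",
"        \"top_k\": top_k,\n",
"    }\n",
"    for p in batch_prompts\n",
"]\n",
"\n",
"batch_response = endpoints[\"hexllm_tpu\"].predict(\n",
"    instances=batch_instances, use_dedicated_endpoint=use_dedicated_endpoint\n",
")\n",
"\n",
"for prediction in batch_response.predictions:\n",
"    print(prediction)\n",
"\n",
"# @markdown Click \"Show Code\" to see more details."
]
},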
{
"cell_type": "markdown",
"metadata": {
"id": "cwiAYVHyd3gb"
},
"source": [
"## Clean up resources"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "tN9dgpYfd3gb"
},
"outputs": [],
"source": [
"# @title Delete the models and endpoints\n",
"# @markdown Delete the experiment models and endpoints to recycle the resources\n",
"# @markdown and avoid unnecessary continuous charges that may incur.\n",
"\n",
"# Undeploy model and delete endpoint.\n",
"for endpoint in endpoints.values():\n",
" endpoint.delete(force=True)\n",
"\n",
"# Delete models.\n",
"for model in models.values():\n",
" model.delete()\n",
"\n",
"delete_bucket = False # @param {type:\"boolean\"}\n",
"if delete_bucket:\n",
" ! gsutil -m rm -r $BUCKET_NAME"
]
}
],
"metadata": {
"colab": {
"name": "model_garden_phi4_deployment.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}