{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "7d9bbf86da5e"
},
"outputs": [],
"source": [
"# Copyright 2025 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EED-Xb0GP_IZ"
},
"source": [
"# Vertex AI Model Garden - Llama Guard\n",
"\n",
"<table><tbody><tr>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/instances\">\n",
" <img alt=\"Workbench logo\" src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" width=\"32px\"><br> Run in Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_llama_guard_deployment.ipynb\">\n",
" <img alt=\"Google Cloud Colab Enterprise logo\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" width=\"32px\"><br> Run in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_llama_guard_deployment.ipynb\">\n",
" <img alt=\"GitHub logo\" src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" width=\"32px\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</tr></tbody></table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3de7470326a2"
},
"source": [
"## Overview\n",
"\n",
"This notebook demonstrates downloading and deploying [Llama Guard models](https://huggingface.co/meta-llama) with [vLLM](https://github.com/vllm-project/vllm) on GPU, and demonstrates using the Llama Guard model to safeguard LLM inputs and outputs with the Vertex Llama API service.\n",
"\n",
"### Objective\n",
"\n",
"- Download and deploy Llama Guard models with [vLLM](https://github.com/vllm-project/vllm) on GPU\n",
"- Use the Llama Guard models to safeguard LLM inputs and outputs with the Vertex Llama 3.1 API service\n",
"- Use the Llama Guard models to safeguard LLM vision inputs and outputs with the Vertex Llama 3.2 API service\n",
"- Use the Llama Guard models to safeguard LLM vision inputs and outputs with the Vertex Llama 4 API service\n",
"\n",
"### File a bug\n",
"\n",
"File a bug on [GitHub](https://github.com/GoogleCloudPlatform/vertex-ai-samples/issues/new) if you encounter any issue with the notebook.\n",
"\n",
"### Costs\n",
"\n",
"This tutorial uses billable components of Google Cloud:\n",
"\n",
"* Vertex AI\n",
"* Cloud Storage\n",
"\n",
"Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "264c07757582"
},
"source": [
"## Before you begin"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "p0F2TOGoI72D"
},
"outputs": [],
"source": [
"# @title Request for quota\n",
"\n",
"# @markdown By default, the quota for A100_80GB and H100 deployment `Custom model serving per region` is 0. You need to request quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n",
"\n",
"# @markdown For better chance to get resources, we recommend to request A100_80GB quota in the regions `us-central1, us-east1`, and request H100 quota in the regions `us-central1, us-west1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "9I36hYfmI72D"
},
"outputs": [],
"source": [
"# @title Setup Google Cloud project\n",
"\n",
"# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
"\n",
"# @markdown 2. **[Optional]** Set region. If not set, the region will be set automatically according to Colab Enterprise environment.\n",
"\n",
"REGION = \"\" # @param {type:\"string\"}\n",
"\n",
"# @markdown 3. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request for quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n",
"\n",
"# @markdown > | Machine Type | Accelerator Type | Recommended Regions |\n",
"# @markdown | ----------- | ----------- | ----------- |\n",
"# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1, us-east4 |\n",
"# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n",
"# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n",
"# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, europe-west4, us-west1, asia-southeast1 |\n",
"\n",
"# Import the necessary packages\n",
"\n",
"# Upgrade Vertex AI SDK.\n",
"! pip3 install --upgrade --quiet 'google-cloud-aiplatform>=1.84.0'\n",
"! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git\n",
"\n",
"import importlib\n",
"import os\n",
"import re\n",
"from typing import Tuple\n",
"\n",
"from google.cloud import aiplatform\n",
"\n",
"if os.environ.get(\"VERTEX_PRODUCT\") != \"COLAB_ENTERPRISE\":\n",
" ! pip install --upgrade tensorflow\n",
"! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git\n",
"\n",
"common_util = importlib.import_module(\n",
" \"vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util\"\n",
")\n",
"\n",
"LABEL = \"vllm_gpu\"\n",
"models, endpoints = {}, {}\n",
"\n",
"\n",
"# Get the default cloud project id.\n",
"PROJECT_ID = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n",
"\n",
"# Get the default region for launching jobs.\n",
"if not REGION:\n",
" REGION = os.environ[\"GOOGLE_CLOUD_REGION\"]\n",
"\n",
"# Initialize Vertex AI API.\n",
"print(\"Initializing Vertex AI API.\")\n",
"aiplatform.init(project=PROJECT_ID, location=REGION)\n",
"\n",
"! gcloud config set project $PROJECT_ID\n",
"import vertexai\n",
"\n",
"vertexai.init(\n",
" project=PROJECT_ID,\n",
" location=REGION,\n",
")\n",
"\n",
"# @markdown # Access Llama Guard models on Vertex AI\n",
"# @markdown The original models from Meta are converted into the Hugging Face format for serving in Vertex AI.\n",
"# @markdown Accept the model agreement to access the models:\n",
"# @markdown 1. Open the [Llama Guard model card](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-guard) from [Vertex AI Model Garden](https://cloud.google.com/model-garden).\n",
"# @markdown 2. Review and accept the agreement in the pop-up window on the model card page. If you have previously accepted the model agreement, there will not be a pop-up window on the model card page and this step is not needed.\n",
"# @markdown 3. After accepting the agreement, a `gs://` URI containing Llama Guard pretrained and finetuned models will be shared.\n",
"# @markdown 4. Paste the URI in the `VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD` field below.\n",
"# @markdown 5. The Llama Guard models will be copied into `BUCKET_URI`.\n",
"\n",
"\n",
"VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD = \"\" # @param {type:\"string\", isTemplate:true}\n",
"assert (\n",
" VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD\n",
"), \"Click the agreement in Vertex AI Model Garden at https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-guard, and get the GCS path of Llama Guard model artifacts.\"\n",
"parsed_gcs_url = re.search(\"gs://.*?(?=[ ]|$)\", VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD)\n",
"if parsed_gcs_url:\n",
" VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD = parsed_gcs_url.group()\n",
"assert VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD.startswith(\n",
" \"gs://\"\n",
"), \"VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD is expected to be a GCS URI and must start with `gs://`.\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "z-XybZjtgF9M"
},
"source": [
"## Deploy Llama Guard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "kRiRTAMxxUoq"
},
"outputs": [],
"source": [
"# @title Select the model variants\n",
"\n",
"# @markdown Select one of the three model variations.\n",
"\n",
"base_model_name = \"Llama-Guard-4-12B\" # @param [\"Llama-Guard-4-12B\", \"Llama-Guard-3-8B\", \"Llama-Guard-3-1B\", \"Llama-Guard-3-11B-Vision\"] {allow-input: true, isTemplate: true}\n",
"model_id = os.path.join(VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD, base_model_name)\n",
"hf_model_id = \"meta-llama/\" + base_model_name\n",
"version_id = base_model_name.lower()\n",
"PUBLISHER_MODEL_NAME = f\"publishers/meta/models/llama-guard@{version_id}\"\n",
"\n",
"# The pre-built serving docker images.\n",
"VLLM_DOCKER_URI = \"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250429_0916_RC01\"\n",
"\n",
"# @markdown Set use_dedicated_endpoint to False if you don't want to use [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint). Note that [dedicated endpoint does not support VPC Service Controls](https://cloud.google.com/vertex-ai/docs/predictions/choose-endpoint-type), uncheck the box if you are using VPC-SC.\n",
"use_dedicated_endpoint = True # @param {type:\"boolean\"}\n",
"\n",
"# @markdown Find Vertex AI prediction supported accelerators and regions at https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.\n",
"if \"3-1B\" in base_model_name or \"3-8B\" in base_model_name:\n",
" accelerator_type = \"NVIDIA_L4\"\n",
" machine_type = \"g2-standard-12\"\n",
" accelerator_count = 1\n",
" max_num_seqs = 256\n",
"elif \"3-11B\" in base_model_name or \"4-12B\" in base_model_name:\n",
" accelerator_type = \"NVIDIA_TESLA_A100\"\n",
" machine_type = \"a2-highgpu-1g\"\n",
" accelerator_count = 1\n",
" max_num_seqs = 12\n",
"else:\n",
" raise ValueError(f\"Recommended GPU setting not found for: {base_model_name}.\")\n",
"\n",
"common_util.check_quota(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" accelerator_type=accelerator_type,\n",
" accelerator_count=accelerator_count,\n",
" is_for_training=False,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "2DiRl36FzauJ"
},
"outputs": [],
"source": [
"# @title [Option 1] Deploy with Model Garden SDK\n",
"\n",
"# @markdown Deploy with Gen AI model-centric SDK. This section uploads the prebuilt model to Model Registry and deploys it to a Vertex AI Endpoint. It takes 15 minutes to 1 hour to finish depending on the size of the model. See [use open models with Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-open-models) for documentation on other use cases.\n",
"from vertexai.preview import model_garden\n",
"\n",
"model = model_garden.OpenModel(PUBLISHER_MODEL_NAME)\n",
"endpoints[LABEL] = model.deploy(\n",
" machine_type=machine_type,\n",
" accelerator_type=accelerator_type,\n",
" accelerator_count=accelerator_count,\n",
" use_dedicated_endpoint=use_dedicated_endpoint,\n",
" accept_eula=True, # Accept the End User License Agreement (EULA) on the model card before deploy. Otherwise, the deployment will be forbidden.\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "E8OiHHNNE_wj"
},
"outputs": [],
"source": [
"# @title [Option 2] Deploy with customized configurations\n",
"\n",
"# @markdown This section uploads Llama Guard models to Model Registry and deploys it to a Vertex AI Endpoint. It takes 15 minutes to 1 hour to finish depending on the size of the model.\n",
"\n",
"gpu_memory_utilization = 0.9\n",
"max_model_len = 4096\n",
"\n",
"\n",
"def deploy_model_vllm(\n",
" model_name: str,\n",
" model_id: str,\n",
" publisher: str,\n",
" publisher_model_id: str,\n",
" base_model_id: str = None,\n",
" machine_type: str = \"g2-standard-8\",\n",
" accelerator_type: str = \"NVIDIA_L4\",\n",
" accelerator_count: int = 1,\n",
" gpu_memory_utilization: float = 0.9,\n",
" max_model_len: int = 4096,\n",
" dtype: str = \"auto\",\n",
" enable_trust_remote_code: bool = False,\n",
" enforce_eager: bool = False,\n",
" enable_lora: bool = False,\n",
" enable_chunked_prefill: bool = False,\n",
" enable_prefix_cache: bool = False,\n",
" host_prefix_kv_cache_utilization_target: float = 0.0,\n",
" max_loras: int = 1,\n",
" max_cpu_loras: int = 8,\n",
" use_dedicated_endpoint: bool = False,\n",
" max_num_seqs: int = 256,\n",
" model_type: str = None,\n",
" enable_llama_tool_parser: bool = False,\n",
") -> Tuple[aiplatform.Model, aiplatform.Endpoint]:\n",
" \"\"\"Deploys trained models with vLLM into Vertex AI.\"\"\"\n",
" endpoint = aiplatform.Endpoint.create(\n",
" display_name=f\"{model_name}-endpoint\",\n",
" dedicated_endpoint_enabled=use_dedicated_endpoint,\n",
" )\n",
"\n",
" if not base_model_id:\n",
" base_model_id = model_id\n",
"\n",
" # See https://docs.vllm.ai/en/latest/models/engine_args.html for a list of possible arguments with descriptions.\n",
" vllm_args = [\n",
" \"python\",\n",
" \"-m\",\n",
" \"vllm.entrypoints.api_server\",\n",
" \"--host=0.0.0.0\",\n",
" \"--port=8080\",\n",
" f\"--model={model_id}\",\n",
" f\"--tensor-parallel-size={accelerator_count}\",\n",
" \"--swap-space=16\",\n",
" f\"--gpu-memory-utilization={gpu_memory_utilization}\",\n",
" f\"--max-model-len={max_model_len}\",\n",
" f\"--dtype={dtype}\",\n",
" f\"--max-loras={max_loras}\",\n",
" f\"--max-cpu-loras={max_cpu_loras}\",\n",
" f\"--max-num-seqs={max_num_seqs}\",\n",
" \"--disable-log-stats\",\n",
" ]\n",
"\n",
" if enable_trust_remote_code:\n",
" vllm_args.append(\"--trust-remote-code\")\n",
"\n",
" if enforce_eager:\n",
" vllm_args.append(\"--enforce-eager\")\n",
"\n",
" if enable_lora:\n",
" vllm_args.append(\"--enable-lora\")\n",
"\n",
" if enable_chunked_prefill:\n",
" vllm_args.append(\"--enable-chunked-prefill\")\n",
"\n",
" if enable_prefix_cache:\n",
" vllm_args.append(\"--enable-prefix-caching\")\n",
"\n",
" if 0 < host_prefix_kv_cache_utilization_target < 1:\n",
" vllm_args.append(\n",
" f\"--host-prefix-kv-cache-utilization-target={host_prefix_kv_cache_utilization_target}\"\n",
" )\n",
"\n",
" if model_type:\n",
" vllm_args.append(f\"--model-type={model_type}\")\n",
"\n",
" if enable_llama_tool_parser:\n",
" vllm_args.append(\"--enable-auto-tool-choice\")\n",
" vllm_args.append(\"--tool-call-parser=vertex-llama-3\")\n",
"\n",
" env_vars = {\n",
" \"MODEL_ID\": base_model_id,\n",
" \"DEPLOY_SOURCE\": \"notebook\",\n",
" }\n",
"\n",
" # HF_TOKEN is not a compulsory field and may not be defined.\n",
" try:\n",
" if HF_TOKEN:\n",
" env_vars[\"HF_TOKEN\"] = HF_TOKEN\n",
" except NameError:\n",
" pass\n",
"\n",
" model = aiplatform.Model.upload(\n",
" display_name=model_name,\n",
" serving_container_image_uri=VLLM_DOCKER_URI,\n",
" serving_container_args=vllm_args,\n",
" serving_container_ports=[8080],\n",
" serving_container_predict_route=\"/generate\",\n",
" serving_container_health_route=\"/ping\",\n",
" serving_container_environment_variables=env_vars,\n",
" serving_container_shared_memory_size_mb=(16 * 1024), # 16 GB\n",
" serving_container_deployment_timeout=7200,\n",
" model_garden_source_model_name=(\n",
" f\"publishers/{publisher}/models/{publisher_model_id}\"\n",
" ),\n",
" )\n",
" print(\n",
" f\"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s).\"\n",
" )\n",
" model.deploy(\n",
" endpoint=endpoint,\n",
" machine_type=machine_type,\n",
" accelerator_type=accelerator_type,\n",
" accelerator_count=accelerator_count,\n",
" deploy_request_timeout=1800,\n",
" system_labels={\n",
" \"NOTEBOOK_NAME\": \"model_garden_llama_guard_deployment.ipynb\",\n",
" \"NOTEBOOK_ENVIRONMENT\": common_util.get_deploy_source(),\n",
" },\n",
" )\n",
" print(\"endpoint_name:\", endpoint.name)\n",
"\n",
" return model, endpoint\n",
"\n",
"\n",
"models[LABEL], endpoints[LABEL] = deploy_model_vllm(\n",
" model_name=common_util.get_job_name_with_datetime(prefix=\"llama-guard\"),\n",
" model_id=model_id,\n",
" publisher=\"meta\",\n",
" publisher_model_id=\"llama-guard\",\n",
" base_model_id=hf_model_id,\n",
" machine_type=machine_type,\n",
" accelerator_type=accelerator_type,\n",
" accelerator_count=accelerator_count,\n",
" gpu_memory_utilization=gpu_memory_utilization,\n",
" max_model_len=max_model_len,\n",
" enforce_eager=False,\n",
" use_dedicated_endpoint=use_dedicated_endpoint,\n",
" max_num_seqs=max_num_seqs,\n",
" enable_llama_tool_parser=False,\n",
")\n",
"# @markdown Click \"Show Code\" to see more details."
]
},
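{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "c4e7a91b20d3"
},
"outputs": [],
"source": [
"# @title (Optional) Reconnect to an existing endpoint\n",
"\n",
"# @markdown If the kernel restarts after deployment, you can reuse the endpoint printed above instead of redeploying. This cell is a minimal sketch; `ENDPOINT_ID` is a placeholder that you must replace with the `endpoint_name` value printed by the deployment cell.\n",
"\n",
"ENDPOINT_ID = \"\"  # @param {type:\"string\"}\n",
"\n",
"if ENDPOINT_ID:\n",
"    # Re-attach to the already-deployed endpoint without uploading or deploying again.\n",
"    endpoints[LABEL] = aiplatform.Endpoint(\n",
"        endpoint_name=ENDPOINT_ID,\n",
"        project=PROJECT_ID,\n",
"        location=REGION,\n",
"    )\n",
"    print(\"Reusing endpoint:\", endpoints[LABEL].resource_name)"
]
},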
{
"cell_type": "markdown",
"metadata": {
"id": "192a021iB_DE"
},
"source": [
"## Use the Llama Guard models to safeguard LLM inputs and outputs with the Vertex Llama 3.1 API service\n",
"\n",
"We use [meta-llama/Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B) to safeguard input and output conversations with the [Llama 3.1 405B Instruct model API service on Vertex](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-3.1-405b-instruct-maas).\n",
"\n",
"Llama Guard 3 builds on the capabilities introduced with Llama Guard 2, adding three new categories, Defamation, Elections and Code Interpreter Abuse. Additionally this model is multilingual and a new prompt format is introduced, making Llama Guard 3’s prompt format consistent with Llama 3+ Instruct models.\n",
"\n",
"This section references [LlamaGuard.ipynb](https://colab.research.google.com/drive/16s0tlCSEDtczjPzdIK3jq0Le5LlnSYGf?usp=sharing) from [https://huggingface.co/meta-llama/LlamaGuard-7b](https://huggingface.co/meta-llama/LlamaGuard-7b)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "fHC7INgjB_DF"
},
"outputs": [],
"source": [
"!pip install --upgrade --quiet openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "ajjcGNzhB_DF"
},
"outputs": [],
"source": [
"import google.auth\n",
"import openai\n",
"\n",
"# @markdown Set up the Llama 3.1 405B Instruct model API service.\n",
"\n",
"# Programmatically get an access token\n",
"creds, _ = google.auth.default(\n",
" scopes=[\"https://www.googleapis.com/auth/cloud-platform\"]\n",
")\n",
"auth_req = google.auth.transport.requests.Request()\n",
"creds.refresh(auth_req)\n",
"# Note: the credential lives for 1 hour by default (https://cloud.google.com/docs/authentication/token-types#at-lifetime); after expiration, it must be refreshed.\n",
"\n",
"client = openai.OpenAI(\n",
" base_url=f\"https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi\",\n",
" api_key=creds.token,\n",
")\n",
"LLAMA3_405B_INSTRUCT = \"meta/llama-3.1-405b-instruct-maas\""
]
},
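{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "f3a9c27d41b0"
},
"outputs": [],
"source": [
"# @title (Optional) Refresh the access token\n",
"\n",
"# @markdown The access token obtained above expires after about an hour. This cell is a minimal sketch of refreshing the credentials and rebuilding the OpenAI client when the token is no longer valid; `get_openai_client` is a hypothetical helper, not part of the Vertex AI SDK or the Llama API service.\n",
"\n",
"\n",
"def get_openai_client() -> openai.OpenAI:\n",
"    \"\"\"Returns an OpenAI client backed by a fresh Google Cloud access token.\"\"\"\n",
"    # Refresh the cached credentials when the current token is missing or expired.\n",
"    if not creds.valid:\n",
"        creds.refresh(google.auth.transport.requests.Request())\n",
"    return openai.OpenAI(\n",
"        base_url=f\"https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/us-central1/endpoints/openapi\",\n",
"        api_key=creds.token,\n",
"    )\n",
"\n",
"\n",
"# Rebuild the client, for example after the previous token has expired.\n",
"client = get_openai_client()"
]
},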
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "NvSfBcUUB_DF"
},
"outputs": [],
"source": [
"# @markdown Define input message in conversation and get output message from model.\n",
"\n",
"message_role = \"user\" # @param {type: \"string\"}\n",
"message_content = \"What is a car?\" # @param {type: \"string\"}\n",
"\n",
"messages = [\n",
" {\n",
" \"role\": message_role,\n",
" \"content\": message_content,\n",
" }\n",
"]\n",
"print(\"Conversation [turn 1]:\", messages)\n",
"\n",
"response = client.chat.completions.create(\n",
" model=LLAMA3_405B_INSTRUCT,\n",
" messages=messages,\n",
")\n",
"print(\"Response:\", response)\n",
"\n",
"messages.append(\n",
" {\n",
" \"role\": response.choices[0].message.role,\n",
" \"content\": response.choices[0].message.content,\n",
" }\n",
")\n",
"print(\"Conversation [turn 2]:\", messages)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "Y7-ym3GlB_DG"
},
"outputs": [],
"source": [
"# @markdown Use Llama Guard to classify the conversation: safe versus unsafe.\n",
"# @markdown Classification is performed on the last turn of the conversation.\n",
"# @markdown If the content is safe, the model will return `safe`. If the content is unsafe, the model will return `unsafe` and additionally the list of offending categories as a comma-separated list in a new line.\n",
"# @markdown Set `\"@requestFormat\": \"chatCompletions\"` to use the OpenAI chat completions format.\n",
"\n",
"instances = [\n",
" {\n",
" \"messages\": messages,\n",
" \"@requestFormat\": \"chatCompletions\",\n",
" },\n",
"]\n",
"response = endpoints[\"vllm_gpu\"].predict(\n",
" instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n",
")\n",
"\n",
"prediction = response.predictions[\"choices\"][0][\"message\"][\"content\"]\n",
"print(\"Llama Guard prediction:\", prediction)"
]
},
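{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "d8b1f6e92a47"
},
"outputs": [],
"source": [
"# @title Parse the Llama Guard verdict\n",
"\n",
"# @markdown The prediction above is plain text: the first line is `safe` or `unsafe`, and, when the content is unsafe, a second line lists the violated category codes. This cell is a minimal sketch for turning that text into structured fields; `parse_llama_guard_output` is a hypothetical helper, not part of the Vertex AI SDK.\n",
"\n",
"\n",
"def parse_llama_guard_output(text: str) -> tuple[bool, list[str]]:\n",
"    \"\"\"Parses a Llama Guard response into (is_safe, violated_categories).\"\"\"\n",
"    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]\n",
"    is_safe = bool(lines) and lines[0].lower() == \"safe\"\n",
"    # When unsafe, the second line holds comma-separated category codes, e.g. \"S1,S10\".\n",
"    categories = lines[1].split(\",\") if not is_safe and len(lines) > 1 else []\n",
"    return is_safe, [c.strip() for c in categories]\n",
"\n",
"\n",
"is_safe, categories = parse_llama_guard_output(prediction)\n",
"print(\"Is safe:\", is_safe, \"| Violated categories:\", categories)"
]
},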
{
"cell_type": "markdown",
"metadata": {
"id": "h_LgDtVyO13s"
},
"source": [
"## Use the Llama Guard models to safeguard LLM vision inputs and outputs with the Vertex Llama 3.2 API service\n",
"\n",
"We use [meta-llama/Llama-Guard-3-11B-Vision](https://huggingface.co/meta-llama/Llama-Guard-3-11B-Vision) to safeguard input and output conversations with the [Llama 3.2 90B-Vision-Instruct model API service on Vertex](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-3.2-90b-vision-instruct-maas)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "4hgRrEuqO13s"
},
"outputs": [],
"source": [
"!pip install --upgrade --quiet openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "zw5BkBd4O13s"
},
"outputs": [],
"source": [
"import google.auth\n",
"import openai\n",
"\n",
"# @markdown Set up the Llama 3.2 90B-Vision-Instruct model API service.\n",
"\n",
"# Programmatically get an access token\n",
"creds, _ = google.auth.default(\n",
" scopes=[\"https://www.googleapis.com/auth/cloud-platform\"]\n",
")\n",
"auth_req = google.auth.transport.requests.Request()\n",
"creds.refresh(auth_req)\n",
"# Note: the credential lives for 1 hour by default (https://cloud.google.com/docs/authentication/token-types#at-lifetime); after expiration, it must be refreshed.\n",
"\n",
"client = openai.OpenAI(\n",
" base_url=f\"https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi\",\n",
" api_key=creds.token,\n",
")\n",
"LLAMA3_90B_VISION_INSTRUCT = \"meta/llama-3.2-90b-vision-instruct-maas\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "3xX8VqWFO13s"
},
"outputs": [],
"source": [
"# @markdown Define input message in conversation and get output message from model.\n",
"\n",
"user_image = \"https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/The_Blue_Marble_%28remastered%29.jpg/580px-The_Blue_Marble_%28remastered%29.jpg\" # @param {type: \"string\"}\n",
"user_message = \"What is in the image?\" # @param {type: \"string\"}\n",
"\n",
"messages = [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\"url\": user_image},\n",
" },\n",
" {\"type\": \"text\", \"text\": user_message},\n",
" ],\n",
" }\n",
"]\n",
"\n",
"print(\"Conversation [turn 1]:\", messages)\n",
"\n",
"response = client.chat.completions.create(\n",
" model=LLAMA3_90B_VISION_INSTRUCT,\n",
" messages=messages,\n",
")\n",
"print(\"Response:\", response)\n",
"\n",
"messages.append(\n",
" {\n",
" \"role\": response.choices[0].message.role,\n",
" \"content\": response.choices[0].message.content,\n",
" }\n",
")\n",
"\n",
"print(\"Conversation [turn 2]:\", messages)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "6zhDnfAcO13s"
},
"outputs": [],
"source": [
"# @markdown Use Llama Guard to classify the conversation: safe versus unsafe.\n",
"# @markdown Classification is performed on the last turn of the conversation.\n",
"# @markdown If the content is safe, the model will return `safe`. If the content is unsafe, the model will return `unsafe` and additionally the list of offending categories as a comma-separated list in a new line.\n",
"# @markdown Set `\"@requestFormat\": \"chatCompletions\"` to use the OpenAI chat completions format.\n",
"\n",
"instances = [\n",
" {\n",
" \"messages\": messages,\n",
" \"@requestFormat\": \"chatCompletions\",\n",
" },\n",
"]\n",
"response = endpoints[\"vllm_gpu\"].predict(\n",
" instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n",
")\n",
"\n",
"prediction = response.predictions[\"choices\"][0][\"message\"][\"content\"]\n",
"print(\"Llama Guard prediction:\", prediction)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h_LgDtVyO13s"
},
"source": [
"## Use the Llama Guard models to safeguard LLM vision inputs and outputs with the Vertex Llama 4 API service\n",
"\n",
"We use [meta-llama/Llama-Guard-4-12B](https://huggingface.co/meta-llama/Llama-Guard-4-12B) to safeguard input and output conversations with the [Llama 4 model API service on Vertex](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-4-maverick-17b-128e-instruct-maas)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "4hgRrEuqO13s"
},
"outputs": [],
"source": [
"!pip install --upgrade --quiet openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "zw5BkBd4O13s"
},
"outputs": [],
"source": [
"import google.auth\n",
"import openai\n",
"\n",
"# @markdown Set up the Llama 4 model API service.\n",
"\n",
"# Programmatically get an access token\n",
"creds, _ = google.auth.default(\n",
" scopes=[\"https://www.googleapis.com/auth/cloud-platform\"]\n",
")\n",
"auth_req = google.auth.transport.requests.Request()\n",
"creds.refresh(auth_req)\n",
"# Note: the credential lives for 1 hour by default (https://cloud.google.com/docs/authentication/token-types#at-lifetime); after expiration, it must be refreshed.\n",
"\n",
"client = openai.OpenAI(\n",
" base_url=f\"https://us-east5-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi\",\n",
" api_key=creds.token,\n",
")\n",
"LLAMA4_MODEL_ID = \"meta/llama-4-scout-17b-16e-instruct-maas\" # @param [\"meta/llama-4-scout-17b-16e-instruct-maas\", \"meta/llama-4-maverick-17b-128e-instruct-maas\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "3xX8VqWFO13s"
},
"outputs": [],
"source": [
"# @markdown Define input message in conversation and get output message from model.\n",
"\n",
"user_image = \"https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/The_Blue_Marble_%28remastered%29.jpg/580px-The_Blue_Marble_%28remastered%29.jpg\" # @param {type: \"string\"}\n",
"user_message = \"What is in the image?\" # @param {type: \"string\"}\n",
"\n",
"messages = [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\"url\": user_image},\n",
" },\n",
" {\"type\": \"text\", \"text\": user_message},\n",
" ],\n",
" }\n",
"]\n",
"\n",
"print(\"Conversation [turn 1]:\", messages)\n",
"\n",
"response = client.chat.completions.create(\n",
" model=LLAMA4_MODEL_ID,\n",
" messages=messages,\n",
")\n",
"print(\"Response:\", response)\n",
"\n",
"messages.append(\n",
" {\n",
" \"role\": response.choices[0].message.role,\n",
" \"content\": response.choices[0].message.content,\n",
" }\n",
")\n",
"\n",
"print(\"Conversation [turn 2]:\", messages)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "6zhDnfAcO13s"
},
"outputs": [],
"source": [
"# @markdown Use Llama Guard to classify the conversation: safe versus unsafe.\n",
"# @markdown Classification is performed on the last turn of the conversation.\n",
"# @markdown If the content is safe, the model will return `safe`. If the content is unsafe, the model will return `unsafe` and additionally the list of offending categories as a comma-separated list in a new line.\n",
"# @markdown Set `\"@requestFormat\": \"chatCompletions\"` to use the OpenAI chat completions format.\n",
"\n",
"instances = [\n",
" {\n",
" \"messages\": messages,\n",
" \"@requestFormat\": \"chatCompletions\",\n",
" },\n",
"]\n",
"response = endpoints[\"vllm_gpu\"].predict(\n",
" instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n",
")\n",
"\n",
"prediction = response.predictions[\"choices\"][0][\"message\"][\"content\"]\n",
"print(\"Llama Guard prediction:\", prediction)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "956x4r7rsrza"
},
"source": [
"## Clean up resources"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "911406c1561e"
},
"outputs": [],
"source": [
"# @title Delete the models and endpoints\n",
"# @markdown Delete the experiment models and endpoints to recycle the resources\n",
"# @markdown and avoid unnecessary continuous charges that may incur.\n",
"\n",
"# Undeploy model and delete endpoint.\n",
"for endpoint in endpoints.values():\n",
" endpoint.delete(force=True)\n",
"\n",
"# Delete models.\n",
"for model in models.values():\n",
" model.delete()"
]
}
],
"metadata": {
"colab": {
"name": "model_garden_llama_guard_deployment.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}