notebooks/community/model_garden/model_garden_llama_guard_deployment.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "7d9bbf86da5e" }, "outputs": [], "source": [ "# Copyright 2025 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "EED-Xb0GP_IZ" }, "source": [ "# Vertex AI Model Garden - Llama Guard\n", "\n", "<table><tbody><tr>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/instances\">\n", " <img alt=\"Workbench logo\" src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" width=\"32px\"><br> Run in Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_llama_guard_deployment.ipynb\">\n", " <img alt=\"Google Cloud Colab Enterprise logo\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" width=\"32px\"><br> Run in Colab Enterprise\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_llama_guard_deployment.ipynb\">\n", " <img alt=\"GitHub logo\" src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" width=\"32px\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</tr></tbody></table>" ] }, { "cell_type": "markdown", "metadata": { "id": "3de7470326a2" }, "source": [ "## Overview\n", "\n", "This notebook demonstrates downloading and deploying [Llama Guard models](https://huggingface.co/meta-llama) with [vLLM](https://github.com/vllm-project/vllm) on GPU, and demonstrates using the Llama Guard model to safeguard LLM inputs and outputs with the Vertex Llama API service.\n", "\n", "### Objective\n", "\n", "- Download and deploy Llama Guard models with [vLLM](https://github.com/vllm-project/vllm) on GPU\n", "- Use the Llama Guard models to safeguard LLM inputs and outputs with the Vertex Llama 3.1 API service\n", "- Use the Llama Guard models to safeguard LLM vision inputs and outputs with the Vertex Llama 3.2 API service\n", "- Use the Llama Guard models to safeguard LLM vision inputs and outputs with the Vertex Llama 4 API service\n", "\n", "### File a bug\n", "\n", "File a bug on [GitHub](https://github.com/GoogleCloudPlatform/vertex-ai-samples/issues/new) if you encounter any issue with the notebook.\n", "\n", "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* Cloud Storage\n", "\n", "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing 
Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "264c07757582" }, "source": [ "## Before you begin" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "p0F2TOGoI72D" }, "outputs": [], "source": [ "# @title Request quota\n", "\n", "# @markdown By default, the quota for A100_80GB and H100 deployment `Custom model serving per region` is 0. You need to request quota by following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n", "\n", "# @markdown For a better chance of getting resources, we recommend requesting A100_80GB quota in the regions `us-central1, us-east1`, and H100 quota in the regions `us-central1, us-west1`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "9I36hYfmI72D" }, "outputs": [], "source": [ "# @title Setup Google Cloud project\n", "\n", "# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", "\n", "# @markdown 2. **[Optional]** Set region. If not set, the region will be set automatically according to the Colab Enterprise environment.\n", "\n", "REGION = \"\" # @param {type:\"string\"}\n", "\n", "# @markdown 3. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in the selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request quota by following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n", "\n", "# @markdown > | Machine Type | Accelerator Type | Recommended Regions |\n", "# @markdown | ----------- | ----------- | ----------- |\n", "# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1 |\n", "# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n", "# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n", "# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, europe-west4, us-west1, asia-southeast1 |\n", "\n", "# Import the necessary packages\n", "\n", "# Upgrade Vertex AI SDK.\n", "! pip3 install --upgrade --quiet 'google-cloud-aiplatform>=1.84.0'\n", "\n", "import importlib\n", "import os\n", "import re\n", "from typing import Tuple\n", "\n", "from google.cloud import aiplatform\n", "\n", "if os.environ.get(\"VERTEX_PRODUCT\") != \"COLAB_ENTERPRISE\":\n", " ! pip install --upgrade tensorflow\n", "! 
git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git\n", "\n", "common_util = importlib.import_module(\n", " \"vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util\"\n", ")\n", "\n", "LABEL = \"vllm_gpu\"\n", "models, endpoints = {}, {}\n", "\n", "\n", "# Get the default cloud project id.\n", "PROJECT_ID = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n", "\n", "# Get the default region for launching jobs.\n", "if not REGION:\n", " REGION = os.environ[\"GOOGLE_CLOUD_REGION\"]\n", "\n", "# Initialize Vertex AI API.\n", "print(\"Initializing Vertex AI API.\")\n", "aiplatform.init(project=PROJECT_ID, location=REGION)\n", "\n", "! gcloud config set project $PROJECT_ID\n", "import vertexai\n", "\n", "vertexai.init(\n", " project=PROJECT_ID,\n", " location=REGION,\n", ")\n", "\n", "# @markdown # Access Llama Guard models on Vertex AI\n", "# @markdown The original models from Meta are converted into the Hugging Face format for serving in Vertex AI.\n", "# @markdown Accept the model agreement to access the models:\n", "# @markdown 1. Open the [Llama Guard model card](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-guard) from [Vertex AI Model Garden](https://cloud.google.com/model-garden).\n", "# @markdown 2. Review and accept the agreement in the pop-up window on the model card page. If you have previously accepted the model agreement, there will not be a pop-up window on the model card page and this step is not needed.\n", "# @markdown 3. After accepting the agreement, a `gs://` URI containing Llama Guard pretrained and finetuned models will be shared.\n", "# @markdown 4. Paste the URI in the `VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD` field below.\n", "# @markdown 5. The Llama Guard models will be copied into `BUCKET_URI`.\n", "\n", "\n", "VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD = \"\" # @param {type:\"string\", isTemplate:true}\n", "assert (\n", " VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD\n", "), \"Click the agreement in Vertex AI Model Garden at https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-guard, and get the GCS path of Llama Guard model artifacts.\"\n", "parsed_gcs_url = re.search(\"gs://.*?(?=[ ]|$)\", VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD)\n", "if parsed_gcs_url:\n", " VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD = parsed_gcs_url.group()\n", "assert VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD.startswith(\n", " \"gs://\"\n", "), \"VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD is expected to be a GCS URI and must start with `gs://`.\"" ] }, { "cell_type": "markdown", "metadata": { "id": "z-XybZjtgF9M" }, "source": [ "## Deploy Llama Guard" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "kRiRTAMxxUoq" }, "outputs": [], "source": [ "# @title Select the model variant\n", "\n", "# @markdown Select one of the four model variants.\n", "\n", "base_model_name = \"Llama-Guard-4-12B\" # @param [\"Llama-Guard-4-12B\", \"Llama-Guard-3-8B\", \"Llama-Guard-3-1B\", \"Llama-Guard-3-11B-Vision\"] {allow-input: true, isTemplate: true}\n", "model_id = os.path.join(VERTEX_AI_MODEL_GARDEN_LLAMA_GUARD, base_model_name)\n", "hf_model_id = \"meta-llama/\" + base_model_name\n", "version_id = base_model_name.lower()\n", "PUBLISHER_MODEL_NAME = f\"publishers/meta/models/llama-guard@{version_id}\"\n", "\n", "# The pre-built serving docker images.\n", "VLLM_DOCKER_URI = \"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250429_0916_RC01\"\n", "\n", "# @markdown Set 
use_dedicated_endpoint to False if you don't want to use a [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint). Note that [dedicated endpoints do not support VPC Service Controls](https://cloud.google.com/vertex-ai/docs/predictions/choose-endpoint-type); uncheck the box if you are using VPC-SC.\n", "use_dedicated_endpoint = True # @param {type:\"boolean\"}\n", "\n", "# @markdown Find the accelerators and regions supported for Vertex AI prediction at https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.\n", "if \"3-1B\" in base_model_name or \"3-8B\" in base_model_name:\n", " accelerator_type = \"NVIDIA_L4\"\n", " machine_type = \"g2-standard-12\"\n", " accelerator_count = 1\n", " max_num_seqs = 256\n", "elif \"3-11B\" in base_model_name or \"4-12B\" in base_model_name:\n", " accelerator_type = \"NVIDIA_TESLA_A100\"\n", " machine_type = \"a2-highgpu-1g\"\n", " accelerator_count = 1\n", " max_num_seqs = 12\n", "else:\n", " raise ValueError(f\"Recommended GPU setting not found for: {base_model_name}.\")\n", "\n", "common_util.check_quota(\n", " project_id=PROJECT_ID,\n", " region=REGION,\n", " accelerator_type=accelerator_type,\n", " accelerator_count=accelerator_count,\n", " is_for_training=False,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "2DiRl36FzauJ" }, "outputs": [], "source": [ "# @title [Option 1] Deploy with Model Garden SDK\n", "\n", "# @markdown Deploy with the Gen AI model-centric SDK. This section uploads the prebuilt model to Model Registry and deploys it to a Vertex AI Endpoint. It takes 15 minutes to 1 hour to finish, depending on the size of the model. See [use open models with Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-open-models) for documentation on other use cases.\n", "from vertexai.preview import model_garden\n", "\n", "model = model_garden.OpenModel(PUBLISHER_MODEL_NAME)\n", "endpoints[LABEL] = model.deploy(\n", " machine_type=machine_type,\n", " accelerator_type=accelerator_type,\n", " accelerator_count=accelerator_count,\n", " use_dedicated_endpoint=use_dedicated_endpoint,\n", " accept_eula=True, # Accept the End User License Agreement (EULA) on the model card before deploying. Otherwise, the deployment will be forbidden.\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "E8OiHHNNE_wj" }, "outputs": [], "source": [ "# @title [Option 2] Deploy with customized configurations\n", "\n", "# @markdown This section uploads the selected Llama Guard model to Model Registry and deploys it to a Vertex AI Endpoint. 
It takes 15 minutes to 1 hour to finish depending on the size of the model.\n", "\n", "gpu_memory_utilization = 0.9\n", "max_model_len = 4096\n", "\n", "\n", "def deploy_model_vllm(\n", " model_name: str,\n", " model_id: str,\n", " publisher: str,\n", " publisher_model_id: str,\n", " base_model_id: str = None,\n", " machine_type: str = \"g2-standard-8\",\n", " accelerator_type: str = \"NVIDIA_L4\",\n", " accelerator_count: int = 1,\n", " gpu_memory_utilization: float = 0.9,\n", " max_model_len: int = 4096,\n", " dtype: str = \"auto\",\n", " enable_trust_remote_code: bool = False,\n", " enforce_eager: bool = False,\n", " enable_lora: bool = False,\n", " enable_chunked_prefill: bool = False,\n", " enable_prefix_cache: bool = False,\n", " host_prefix_kv_cache_utilization_target: float = 0.0,\n", " max_loras: int = 1,\n", " max_cpu_loras: int = 8,\n", " use_dedicated_endpoint: bool = False,\n", " max_num_seqs: int = 256,\n", " model_type: str = None,\n", " enable_llama_tool_parser: bool = False,\n", ") -> Tuple[aiplatform.Model, aiplatform.Endpoint]:\n", " \"\"\"Deploys trained models with vLLM into Vertex AI.\"\"\"\n", " endpoint = aiplatform.Endpoint.create(\n", " display_name=f\"{model_name}-endpoint\",\n", " dedicated_endpoint_enabled=use_dedicated_endpoint,\n", " )\n", "\n", " if not base_model_id:\n", " base_model_id = model_id\n", "\n", " # See https://docs.vllm.ai/en/latest/models/engine_args.html for a list of possible arguments with descriptions.\n", " vllm_args = [\n", " \"python\",\n", " \"-m\",\n", " \"vllm.entrypoints.api_server\",\n", " \"--host=0.0.0.0\",\n", " \"--port=8080\",\n", " f\"--model={model_id}\",\n", " f\"--tensor-parallel-size={accelerator_count}\",\n", " \"--swap-space=16\",\n", " f\"--gpu-memory-utilization={gpu_memory_utilization}\",\n", " f\"--max-model-len={max_model_len}\",\n", " f\"--dtype={dtype}\",\n", " f\"--max-loras={max_loras}\",\n", " f\"--max-cpu-loras={max_cpu_loras}\",\n", " f\"--max-num-seqs={max_num_seqs}\",\n", " \"--disable-log-stats\",\n", " ]\n", "\n", " if enable_trust_remote_code:\n", " vllm_args.append(\"--trust-remote-code\")\n", "\n", " if enforce_eager:\n", " vllm_args.append(\"--enforce-eager\")\n", "\n", " if enable_lora:\n", " vllm_args.append(\"--enable-lora\")\n", "\n", " if enable_chunked_prefill:\n", " vllm_args.append(\"--enable-chunked-prefill\")\n", "\n", " if enable_prefix_cache:\n", " vllm_args.append(\"--enable-prefix-caching\")\n", "\n", " if 0 < host_prefix_kv_cache_utilization_target < 1:\n", " vllm_args.append(\n", " f\"--host-prefix-kv-cache-utilization-target={host_prefix_kv_cache_utilization_target}\"\n", " )\n", "\n", " if model_type:\n", " vllm_args.append(f\"--model-type={model_type}\")\n", "\n", " if enable_llama_tool_parser:\n", " vllm_args.append(\"--enable-auto-tool-choice\")\n", " vllm_args.append(\"--tool-call-parser=vertex-llama-3\")\n", "\n", " env_vars = {\n", " \"MODEL_ID\": base_model_id,\n", " \"DEPLOY_SOURCE\": \"notebook\",\n", " }\n", "\n", " # HF_TOKEN is not a compulsory field and may not be defined.\n", " try:\n", " if HF_TOKEN:\n", " env_vars[\"HF_TOKEN\"] = HF_TOKEN\n", " except NameError:\n", " pass\n", "\n", " model = aiplatform.Model.upload(\n", " display_name=model_name,\n", " serving_container_image_uri=VLLM_DOCKER_URI,\n", " serving_container_args=vllm_args,\n", " serving_container_ports=[8080],\n", " serving_container_predict_route=\"/generate\",\n", " serving_container_health_route=\"/ping\",\n", " serving_container_environment_variables=env_vars,\n", " 
serving_container_shared_memory_size_mb=(16 * 1024), # 16 GB\n", " serving_container_deployment_timeout=7200,\n", " model_garden_source_model_name=(\n", " f\"publishers/{publisher}/models/{publisher_model_id}\"\n", " ),\n", " )\n", " print(\n", " f\"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s).\"\n", " )\n", " model.deploy(\n", " endpoint=endpoint,\n", " machine_type=machine_type,\n", " accelerator_type=accelerator_type,\n", " accelerator_count=accelerator_count,\n", " deploy_request_timeout=1800,\n", " system_labels={\n", " \"NOTEBOOK_NAME\": \"model_garden_llama_guard_deployment.ipynb\",\n", " \"NOTEBOOK_ENVIRONMENT\": common_util.get_deploy_source(),\n", " },\n", " )\n", " print(\"endpoint_name:\", endpoint.name)\n", "\n", " return model, endpoint\n", "\n", "\n", "models[LABEL], endpoints[LABEL] = deploy_model_vllm(\n", " model_name=common_util.get_job_name_with_datetime(prefix=\"llama-guard\"),\n", " model_id=model_id,\n", " publisher=\"meta\",\n", " publisher_model_id=\"llama-guard\",\n", " base_model_id=hf_model_id,\n", " machine_type=machine_type,\n", " accelerator_type=accelerator_type,\n", " accelerator_count=accelerator_count,\n", " gpu_memory_utilization=gpu_memory_utilization,\n", " max_model_len=max_model_len,\n", " enforce_eager=False,\n", " use_dedicated_endpoint=use_dedicated_endpoint,\n", " max_num_seqs=max_num_seqs,\n", " enable_llama_tool_parser=False,\n", ")\n", "# @markdown Click \"Show Code\" to see more details." ] }, { "cell_type": "markdown", "metadata": { "id": "192a021iB_DE" }, "source": [ "## Use the Llama Guard models to safeguard LLM inputs and outputs with the Vertex Llama 3.1 API service\n", "\n", "We use [meta-llama/Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B) to safeguard input and output conversations with the [Llama 3.1 405B Instruct model API service on Vertex](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-3.1-405b-instruct-maas).\n", "\n", "Llama Guard 3 builds on the capabilities introduced with Llama Guard 2, adding three new categories: Defamation, Elections, and Code Interpreter Abuse. Additionally, this model is multilingual and introduces a new prompt format, making Llama Guard 3’s prompt format consistent with the Llama 3+ Instruct models.\n", "\n", "This section references [LlamaGuard.ipynb](https://colab.research.google.com/drive/16s0tlCSEDtczjPzdIK3jq0Le5LlnSYGf?usp=sharing) from [https://huggingface.co/meta-llama/LlamaGuard-7b](https://huggingface.co/meta-llama/LlamaGuard-7b)." 
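, "\n", "\n", "As an illustration (this mirrors the description in the classification cells below), the Llama Guard verdict is a short text completion: `safe` for a benign last turn, or `unsafe` followed on the next line by the offending categories (for Llama Guard 3, codes such as `S1`), for example:\n", "\n", "```text\n", "unsafe\n", "S1\n", "```"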
] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "fHC7INgjB_DF" }, "outputs": [], "source": [ "!pip install --upgrade --quiet openai" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "ajjcGNzhB_DF" }, "outputs": [], "source": [ "import google.auth\n", "import openai\n", "\n", "# @markdown Set up the Llama 3.1 405B Instruct model API service.\n", "\n", "# Programmatically get an access token\n", "creds, _ = google.auth.default(\n", " scopes=[\"https://www.googleapis.com/auth/cloud-platform\"]\n", ")\n", "auth_req = google.auth.transport.requests.Request()\n", "creds.refresh(auth_req)\n", "# Note: the credential lives for 1 hour by default (https://cloud.google.com/docs/authentication/token-types#at-lifetime); after expiration, it must be refreshed.\n", "\n", "client = openai.OpenAI(\n", " base_url=f\"https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi\",\n", " api_key=creds.token,\n", ")\n", "LLAMA3_405B_INSTRUCT = \"meta/llama-3.1-405b-instruct-maas\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "NvSfBcUUB_DF" }, "outputs": [], "source": [ "# @markdown Define input message in conversation and get output message from model.\n", "\n", "message_role = \"user\" # @param {type: \"string\"}\n", "message_content = \"What is a car?\" # @param {type: \"string\"}\n", "\n", "messages = [\n", " {\n", " \"role\": message_role,\n", " \"content\": message_content,\n", " }\n", "]\n", "print(\"Conversation [turn 1]:\", messages)\n", "\n", "response = client.chat.completions.create(\n", " model=LLAMA3_405B_INSTRUCT,\n", " messages=messages,\n", ")\n", "print(\"Response:\", response)\n", "\n", "messages.append(\n", " {\n", " \"role\": response.choices[0].message.role,\n", " \"content\": response.choices[0].message.content,\n", " }\n", ")\n", "print(\"Conversation [turn 2]:\", messages)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "Y7-ym3GlB_DG" }, "outputs": [], "source": [ "# @markdown Use Llama Guard to classify the conversation: safe versus unsafe.\n", "# @markdown Classification is performed on the last turn of the conversation.\n", "# @markdown If the content is safe, the model will return `safe`. If the content is unsafe, the model will return `unsafe` and additionally the list of offending categories as a comma-separated list in a new line.\n", "# @markdown Set `\"@requestFormat\": \"chatCompletions\"` to use the OpenAI chat completions format.\n", "\n", "instances = [\n", " {\n", " \"messages\": messages,\n", " \"@requestFormat\": \"chatCompletions\",\n", " },\n", "]\n", "response = endpoints[\"vllm_gpu\"].predict(\n", " instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n", ")\n", "\n", "prediction = response.predictions[\"choices\"][0][\"message\"][\"content\"]\n", "print(\"Llama Guard prediction:\", prediction)" ] }, { "cell_type": "markdown", "metadata": { "id": "h_LgDtVyO13s" }, "source": [ "## Use the Llama Guard models to safeguard LLM vision inputs and outputs with the Vertex Llama 3.2 API service\n", "\n", "We use [meta-llama/Llama-Guard-3-11B-Vision](https://huggingface.co/meta-llama/Llama-Guard-3-11B-Vision) to safeguard input and output conversations with the [Llama 3.2 90B-Vision-Instruct model API service on Vertex](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-3.2-90b-vision-instruct-maas)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "4hgRrEuqO13s" }, "outputs": [], "source": [ "!pip install --upgrade --quiet openai" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "zw5BkBd4O13s" }, "outputs": [], "source": [ "import google.auth\n", "import openai\n", "\n", "# @markdown Set up the Llama 3.2 90B-Vision-Instruct model API service.\n", "\n", "# Programmatically get an access token\n", "creds, _ = google.auth.default(\n", " scopes=[\"https://www.googleapis.com/auth/cloud-platform\"]\n", ")\n", "auth_req = google.auth.transport.requests.Request()\n", "creds.refresh(auth_req)\n", "# Note: the credential lives for 1 hour by default (https://cloud.google.com/docs/authentication/token-types#at-lifetime); after expiration, it must be refreshed.\n", "\n", "client = openai.OpenAI(\n", " base_url=f\"https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi\",\n", " api_key=creds.token,\n", ")\n", "LLAMA3_90B_VISION_INSTRUCT = \"meta/llama-3.2-90b-vision-instruct-maas\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "3xX8VqWFO13s" }, "outputs": [], "source": [ "# @markdown Define input message in conversation and get output message from model.\n", "\n", "user_image = \"https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/The_Blue_Marble_%28remastered%29.jpg/580px-The_Blue_Marble_%28remastered%29.jpg\" # @param {type: \"string\"}\n", "user_message = \"What is in the image?\" # @param {type: \"string\"}\n", "\n", "messages = [\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\"url\": user_image},\n", " },\n", " {\"type\": \"text\", \"text\": user_message},\n", " ],\n", " }\n", "]\n", "\n", "print(\"Conversation [turn 1]:\", messages)\n", "\n", "response = client.chat.completions.create(\n", " model=LLAMA3_90B_VISION_INSTRUCT,\n", " messages=messages,\n", ")\n", "print(\"Response:\", response)\n", "\n", "messages.append(\n", " {\n", " \"role\": response.choices[0].message.role,\n", " \"content\": response.choices[0].message.content,\n", " }\n", ")\n", "\n", "print(\"Conversation [turn 2]:\", messages)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "6zhDnfAcO13s" }, "outputs": [], "source": [ "# @markdown Use Llama Guard to classify the conversation: safe versus unsafe.\n", "# @markdown Classification is performed on the last turn of the conversation.\n", "# @markdown If the content is safe, the model will return `safe`. 
If the content is unsafe, the model will return `unsafe` and additionally the list of offending categories as a comma-separated list in a new line.\n", "# @markdown Set `\"@requestFormat\": \"chatCompletions\"` to use the OpenAI chat completions format.\n", "\n", "instances = [\n", " {\n", " \"messages\": messages,\n", " \"@requestFormat\": \"chatCompletions\",\n", " },\n", "]\n", "response = endpoints[\"vllm_gpu\"].predict(\n", " instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n", ")\n", "\n", "prediction = response.predictions[\"choices\"][0][\"message\"][\"content\"]\n", "print(\"Llama Guard prediction:\", prediction)" ] }, { "cell_type": "markdown", "metadata": { "id": "h_LgDtVyO13s" }, "source": [ "## Use the Llama Guard models to safeguard LLM vision inputs and outputs with the Vertex Llama 4 API service\n", "\n", "We use [meta-llama/Llama-Guard-4-12B](https://huggingface.co/meta-llama/Llama-Guard-4-12B) to safeguard input and output conversations with the [Llama 4 model API service on Vertex](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama-4-maverick-17b-128e-instruct-maas)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "4hgRrEuqO13s" }, "outputs": [], "source": [ "!pip install --upgrade --quiet openai" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "zw5BkBd4O13s" }, "outputs": [], "source": [ "import google.auth\n", "import openai\n", "\n", "# @markdown Set up the Llama 4 model API service.\n", "\n", "# Programmatically get an access token\n", "creds, _ = google.auth.default(\n", " scopes=[\"https://www.googleapis.com/auth/cloud-platform\"]\n", ")\n", "auth_req = google.auth.transport.requests.Request()\n", "creds.refresh(auth_req)\n", "# Note: the credential lives for 1 hour by default (https://cloud.google.com/docs/authentication/token-types#at-lifetime); after expiration, it must be refreshed.\n", "\n", "client = openai.OpenAI(\n", " base_url=f\"https://us-east5-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi\",\n", " api_key=creds.token,\n", ")\n", "LLAMA4_MODEL_ID = \"meta/llama-4-scout-17b-16e-instruct-maas\" # @param [\"meta/llama-4-scout-17b-16e-instruct-maas\", \"meta/llama-4-maverick-17b-128e-instruct-maas\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "3xX8VqWFO13s" }, "outputs": [], "source": [ "# @markdown Define input message in conversation and get output message from model.\n", "\n", "user_image = \"https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/The_Blue_Marble_%28remastered%29.jpg/580px-The_Blue_Marble_%28remastered%29.jpg\" # @param {type: \"string\"}\n", "user_message = \"What is in the image?\" # @param {type: \"string\"}\n", "\n", "messages = [\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\"url\": user_image},\n", " },\n", " {\"type\": \"text\", \"text\": user_message},\n", " ],\n", " }\n", "]\n", "\n", "print(\"Conversation [turn 1]:\", messages)\n", "\n", "response = client.chat.completions.create(\n", " model=LLAMA4_MODEL_ID,\n", " messages=messages,\n", ")\n", "print(\"Response:\", response)\n", "\n", "messages.append(\n", " {\n", " \"role\": response.choices[0].message.role,\n", " \"content\": response.choices[0].message.content,\n", " }\n", ")\n", "\n", "print(\"Conversation [turn 2]:\", messages)" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": { "cellView": "form", "id": "6zhDnfAcO13s" }, "outputs": [], "source": [ "# @markdown Use Llama Guard to classify the conversation: safe versus unsafe.\n", "# @markdown Classification is performed on the last turn of the conversation.\n", "# @markdown If the content is safe, the model will return `safe`. If the content is unsafe, the model will return `unsafe` and additionally the list of offending categories as a comma-separated list in a new line.\n", "# @markdown Set `\"@requestFormat\": \"chatCompletions\"` to use the OpenAI chat completions format.\n", "\n", "instances = [\n", " {\n", " \"messages\": messages,\n", " \"@requestFormat\": \"chatCompletions\",\n", " },\n", "]\n", "response = endpoints[\"vllm_gpu\"].predict(\n", " instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n", ")\n", "\n", "prediction = response.predictions[\"choices\"][0][\"message\"][\"content\"]\n", "print(\"Llama Guard prediction:\", prediction)" ] }, { "cell_type": "markdown", "metadata": { "id": "956x4r7rsrza" }, "source": [ "## Clean up resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "911406c1561e" }, "outputs": [], "source": [ "# @title Delete the models and endpoints\n", "# @markdown Delete the experiment models and endpoints to recycle the resources\n", "# @markdown and avoid unnecessary continuous charges that may incur.\n", "\n", "# Undeploy model and delete endpoint.\n", "for endpoint in endpoints.values():\n", " endpoint.delete(force=True)\n", "\n", "# Delete models.\n", "for model in models.values():\n", " model.delete()" ] } ], "metadata": { "colab": { "name": "model_garden_llama_guard_deployment.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }