notebooks/community/model_garden/model_garden_deployment_tutorial.ipynb (422 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "GyJPhQ2_Om3X" }, "outputs": [], "source": [ "# Copyright 2025 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "L1BRnURwUb87" }, "source": [ "# Vertex AI Model Garden - Deployment Tutorial\n", "\n", "<table><tbody><tr>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/instances\">\n", " <img alt=\"Workbench logo\" src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" width=\"32px\"><br> Run in Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_deployment_tutorial.ipynb\">\n", " <img alt=\"Google Cloud Colab Enterprise logo\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" width=\"32px\"><br> Run in Colab Enterprise\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_deployment_tutorial.ipynb\">\n", " <img alt=\"GitHub logo\" src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" width=\"32px\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</tr></tbody></table>" ] }, { "cell_type": "markdown", "metadata": { "id": "K4W8U0MQUb87" }, "source": [ "## Overview\n", "\n", "You can deploy open models (including Hugging Face models) by using [Google Gen AI SDK or Google Cloud CLI](https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/use-models#deploy_a_model).\n", "\n", "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* Cloud Storage\n", "\n", "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "x7Z25I3xUb87" }, "source": [ "## Before you begin" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "WnEYdVEKUb87" }, "outputs": [], "source": [ "# @title Setup Google Cloud project\n", "\n", "# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", "\n", "# @markdown 2. **[Optional]** Set region. If not set, the region will be set automatically according to Colab Enterprise environment.\n", "\n", "REGION = \"\" # @param {type:\"string\"}\n", "\n", "# @markdown 3. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request for quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n", "\n", "# @markdown > | Machine Type | Accelerator Type | Recommended Regions |\n", "# @markdown | ----------- | ----------- | ----------- |\n", "# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1, us-east4 |\n", "# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n", "# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n", "# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, europe-west4, us-west1, asia-southeast1 |\n", "\n", "# Upgrade Vertex AI SDK.\n", "! pip3 install --upgrade --quiet 'google-cloud-aiplatform>=1.84.0'\n", "\n", "# Import the necessary packages\n", "import os\n", "\n", "from google.cloud import aiplatform\n", "\n", "# Upgrade Vertex AI SDK.\n", "if os.environ.get(\"VERTEX_PRODUCT\") != \"COLAB_ENTERPRISE\":\n", " ! pip install --upgrade tensorflow\n", "! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git\n", "\n", "common_util = importlib.import_module(\n", " \"vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util\"\n", ")\n", "\n", "models, endpoints = {}, {}\n", "LABEL = \"my-endpoint\"\n", "\n", "\n", "# Get the default cloud project id.\n", "PROJECT_ID = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n", "\n", "# Get the default region for launching jobs.\n", "if not REGION:\n", " REGION = os.environ[\"GOOGLE_CLOUD_REGION\"]\n", "\n", "# Initialize Vertex AI API.\n", "print(\"Initializing Vertex AI API.\")\n", "aiplatform.init(project=PROJECT_ID, location=REGION)\n", "\n", "! gcloud config set project $PROJECT_ID\n", "import vertexai\n", "\n", "vertexai.init(\n", " project=PROJECT_ID,\n", " location=REGION,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "2NRInIKZ3nSJ" }, "source": [ "## Deploy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "xpIv3skP3nSK" }, "outputs": [], "source": [ "# @title Choose the model to deploy\n", "from vertexai.preview import model_garden\n", "\n", "# @markdown List all deployable models and then get the ID of the model to deploy.\n", "\n", "# @markdown You can also use the Hugging Face model ID.\n", "list_hf_models = False # @param {type:\"boolean\"}\n", "\n", "# @markdown You can also filter by model name.\n", "model_filter = \"gemma\" # @param {type:\"string\"}\n", "\n", "model_garden.list_deployable_models(\n", " list_hf_models=list_hf_models, model_filter=model_filter\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "uaLkN2mta5T2" }, "outputs": [], "source": [ "# @markdown Select a model from the list above.\n", "MODEL_ID = \"google/gemma3@gemma-3-1b-it\" # @param [\"google/gemma3@gemma-3-1b-it\", \"google/gemma-3-1b-it\"] {isTemplate:true}\n", "\n", "# @markdown Follow the [Hugging Face documentation](https://huggingface.co/docs/hub/en/security-tokens) to create a **read** access token and put it in the `HF_TOKEN` field below.\n", "HF_TOKEN = \"\" # @param {type:\"string\", isTemplate: true}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "fR9wl0-Na5T2" }, "outputs": [], "source": [ "# @markdown Set use_dedicated_endpoint to False if you don't want to use [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint). Note that [dedicated endpoint does not support VPC Service Controls](https://cloud.google.com/vertex-ai/docs/predictions/choose-endpoint-type), uncheck the box if you are using VPC-SC.\n", "use_dedicated_endpoint = True # @param {type:\"boolean\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "Fw9W4TrOUb87" }, "outputs": [], "source": [ "# @title Deploy with Model Garden SDK\n", "\n", "# @markdown Deploy with Gen AI model-centric SDK. This section uploads the prebuilt model to Model Registry and deploys it to a Vertex AI Endpoint. It takes 15 minutes to 1 hour to finish depending on the size of the model. See [use open models with Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-open-models) for documentation on other use cases.\n", "from vertexai.preview import model_garden\n", "\n", "model = model_garden.OpenModel(MODEL_ID)\n", "endpoints[LABEL] = model.deploy(\n", " hugging_face_access_token=HF_TOKEN,\n", " use_dedicated_endpoint=use_dedicated_endpoint,\n", " accept_eula=True, # Accept the End User License Agreement (EULA) on the model card before deploy. Otherwise, the deployment will be forbidden.\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "rKjAOh7FUb87" }, "outputs": [], "source": [ "# @title Raw predict\n", "\n", "# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts. Sampling parameters supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/dev/sampling_params.html).\n", "\n", "# @markdown Example:\n", "\n", "# @markdown ```\n", "# @markdown Human: What is a car?\n", "# @markdown Assistant: A car, or a motor car, is a road-connected human-transportation system used to move people or goods from one place to another. The term also encompasses a wide range of vehicles, including motorboats, trains, and aircrafts. Cars typically have four wheels, a cabin for passengers, and an engine or motor. They have been around since the early 19th century and are now one of the most popular forms of transportation, used for daily commuting, shopping, and other purposes.\n", "# @markdown ```\n", "# @markdown Additionally, you can moderate the generated text with Vertex AI. See [Moderate text documentation](https://cloud.google.com/natural-language/docs/moderating-text) for more details.\n", "\n", "# Loads an existing endpoint instance using the endpoint name:\n", "# - Using `endpoint_name = endpoint.name` allows us to get the\n", "# endpoint name of the endpoint `endpoint` created in the cell\n", "# above.\n", "# - Alternatively, you can set `endpoint_name = \"1234567890123456789\"` to load\n", "# an existing endpoint with the ID 1234567890123456789.\n", "# You may uncomment the code below to load an existing endpoint.\n", "\n", "# endpoint_name = \"\" # @param {type:\"string\"}\n", "# aip_endpoint_name = (\n", "# f\"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}\"\n", "# )\n", "# endpoint = aiplatform.Endpoint(aip_endpoint_name)\n", "\n", "prompt = \"What is a car?\" # @param {type: \"string\"}\n", "# @markdown If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, by lowering `max_tokens`.\n", "max_tokens = 50 # @param {type:\"integer\"}\n", "temperature = 1.0 # @param {type:\"number\"}\n", "top_p = 1.0 # @param {type:\"number\"}\n", "top_k = 1 # @param {type:\"integer\"}\n", "# @markdown Set `raw_response` to `True` to obtain the raw model output. Set `raw_response` to `False` to apply additional formatting in the structure of `\"Prompt:\\n{prompt.strip()}\\nOutput:\\n{output}\"`.\n", "raw_response = False # @param {type:\"boolean\"}\n", "\n", "# Overrides parameters for inferences.\n", "instances = [\n", " {\n", " \"prompt\": prompt,\n", " \"max_tokens\": max_tokens,\n", " \"temperature\": temperature,\n", " \"top_p\": top_p,\n", " \"top_k\": top_k,\n", " \"raw_response\": raw_response,\n", " },\n", "]\n", "response = endpoints[\"my-endpoint\"].predict(\n", " instances=instances, use_dedicated_endpoint=use_dedicated_endpoint\n", ")\n", "\n", "for prediction in response.predictions:\n", " print(prediction)\n", "\n", "# @markdown Click \"Show Code\" to see more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "wxe8GIwpUb87" }, "outputs": [], "source": [ "# @title Chat completion\n", "\n", "if use_dedicated_endpoint:\n", " DEDICATED_ENDPOINT_DNS = endpoints[\n", " \"my-endpoint\"\n", " ].gca_resource.dedicated_endpoint_dns\n", "ENDPOINT_RESOURCE_NAME = endpoints[\"my-endpoint\"].resource_name\n", "\n", "# @title Chat Completions Inference\n", "\n", "# @markdown Once deployment succeeds, you can send requests to the endpoint using the OpenAI SDK.\n", "\n", "# @markdown First you will need to install the SDK and some auth-related dependencies.\n", "\n", "! pip install -qU openai google-auth requests\n", "\n", "# @markdown Next fill out some request parameters:\n", "\n", "user_message = \"How is your day going?\" # @param {type: \"string\"}\n", "# @markdown If you encounter the issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, such as set `max_tokens` as 20.\n", "max_tokens = 50 # @param {type: \"integer\"}\n", "temperature = 1.0 # @param {type: \"number\"}\n", "stream = False # @param {type: \"boolean\"}\n", "\n", "# @markdown Now we can send a request.\n", "\n", "import google.auth\n", "import openai\n", "\n", "creds, project = google.auth.default()\n", "auth_req = google.auth.transport.requests.Request()\n", "creds.refresh(auth_req)\n", "\n", "BASE_URL = (\n", " f\"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}\"\n", ")\n", "try:\n", " if use_dedicated_endpoint:\n", " BASE_URL = f\"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}\"\n", "except NameError:\n", " pass\n", "\n", "client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)\n", "\n", "model_response = client.chat.completions.create(\n", " model=\"\",\n", " messages=[{\"role\": \"user\", \"content\": user_message}],\n", " temperature=temperature,\n", " max_tokens=max_tokens,\n", " stream=stream,\n", ")\n", "\n", "if stream:\n", " usage = None\n", " contents = []\n", " for chunk in model_response:\n", " if chunk.usage is not None:\n", " usage = chunk.usage\n", " continue\n", " print(chunk.choices[0].delta.content, end=\"\")\n", " contents.append(chunk.choices[0].delta.content)\n", " print(f\"\\n\\n{usage}\")\n", "else:\n", " print(model_response)\n", "\n", "# @markdown Click \"Show Code\" to see more details." ] }, { "cell_type": "markdown", "metadata": { "id": "cqHoUEEMXNAH" }, "source": [ "## Clean up resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "5qKldg25Ub87" }, "outputs": [], "source": [ "# @title Delete the models and endpoints\n", "\n", "# @markdown Delete the experiment models and endpoints to recycle the resources\n", "# @markdown and avoid unnecessary continuous charges that may incur.\n", "\n", "# Undeploy model and delete endpoint.\n", "for endpoint in endpoints.values():\n", " endpoint.delete(force=True)\n", "\n", "# Delete models.\n", "for model in models.values():\n", " model.delete()" ] } ], "metadata": { "colab": { "name": "model_garden_deployment_tutorial.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }