In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Model Garden - Using Spot VMs and Reservations to Deploy a Vertex AI Llama-3.1 Endpoint

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_reservations_spotvm.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_reservations_spotvm.ipynb">
      <img alt="GitHub logo" src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>

### Overview

This notebook is a step-by-step guide to using [Spot VMs](https://cloud.google.com/compute/docs/instances/spot) and [Reservations](https://cloud.google.com/vertex-ai/docs/predictions/use-reservations#get-predictions) to deploy a fully managed Vertex AI endpoint. A reservation is a dedicated pool of compute resources that can help ensure cost stability and resource availability for your inference workloads. You will learn how to create, view, and manage reservations, and how to control whether and how your endpoints consume them, so that you can balance cost against reliability for your workload.

This tutorial will cover how to:
1. **Deploy an Endpoint Using a Spot VM:** Deploy endpoints on preemptible resources (if capacity is available).
2. **Create a Single-Project Reservation:** Establish a dedicated pool of compute resources reserved solely for your current project.
3. **Grant Permissions to Google Cloud Services:** Ensure that the necessary Google Cloud services can access and utilize these reservations securely and transparently.
4. **Deploy a Vertex AI Endpoint Using Reservations:** Harness the full potential of reserved resources to deploy an endpoint that benefits from predictable performance and cost stability.

Upon completion, you will not only know how to set up and use reservations for a Vertex AI endpoint, but you will also possess the insights needed to adapt these techniques to a variety of production scenarios. For an even broader understanding of reservations and to explore additional reservation configurations, refer to the [Compute Engine Reservations Overview](https://cloud.google.com/compute/docs/instances/reservations-overview).


### Objective

In this tutorial, we will utilize the `Meta-Llama-3.1-8B` model running on [vLLM](https://github.com/vllm-project/vllm) as a concrete example. This allows you to experiment with state-of-the-art language modeling within a well-defined environment. Throughout the process, we will delve into every aspect of setting up, managing, and cleaning up a complete end-to-end deployment pipeline using Vertex AI and reservations.

By following along, you will learn how to:

- **Set Up a Google Cloud Project:** Configure your environment and ensure that all prerequisites—such as APIs, billing, and IAM roles—are properly in place.
- **Configure Deployment Utilities:** Prepare and manage essential tools and scripts that streamline endpoint creation, testing, and maintenance.
- **Deploy an Endpoint Using a Spot VM:** Achieve cost savings by running inference workloads on preemptible resources while maintaining service integrity.
- **Create and Manage Reservations:** Establish a dedicated pool of compute resources, ensuring that your endpoints can maintain consistent performance without competing for capacity.
- **View and Verify Reservations:** Inspect your reservations to confirm that resources are correctly allocated and ready for consumption by your endpoints.
- **Consume Reservations as Instances:** Utilize the reserved resources to run your endpoints, guaranteeing predictable performance and capacity.
- **Deploy Endpoints with `ANY_RESERVATION` and `SPECIFIC_RESERVATION` Policies:** Gain granular control over how your endpoints source their compute resources, whether by tapping into any available reservation or by targeting a specific one for tighter control.
- **Delete Reservations and Endpoints:** Cleanly remove resources to maintain a tidy and cost-efficient environment, ensuring that unused capacity does not incur ongoing costs.

By the end of this tutorial, you will have developed a thorough, in-depth understanding of how to combine Spot VMs, reservations, and Vertex AI to create a flexible, efficient, and cost-conscious inference infrastructure.


### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

In [None]:
# @title # Setup Google Cloud Project and Shared Reservation
#
# @markdown 1. **Enable Billing for Your Project:** Confirm that billing is active for your chosen Google Cloud project. Without an active billing account, resources such as GPUs and Spot VMs cannot be provisioned. If you haven’t done this yet, follow the instructions here: [Enable Billing for Your Project](https://cloud.google.com/billing/docs/how-to/modify-project).
#
# @markdown 2. **Set Deployment `PROJECT_ID` and `REGION`:**  When setting up your environment, ensure that your Vertex AI endpoint and any associated reservations are in the same region and under projects that belong to the same organization. This alignment helps streamline IAM policies and resource sharing.
#
# @markdown 3. **Shared Reservation Requirement:** If you plan to use a reservation created in a separate project (e.g., `SHARED_PROJECT_ID`) within the same organization, you must grant appropriate permissions to the Vertex AI per-product, per-project service accounts (P4SAs, also known as service agents) from both projects. This ensures your endpoint in `PROJECT_ID` can use a reservation from `SHARED_PROJECT_ID`.
#
# @markdown   - The P4SA from the primary project hosting the endpoint (in `PROJECT_ID`) must have `roles/compute.viewer` in that project.
# @markdown   - The P4SA from the project where the reservation resides (`SHARED_PROJECT_ID`) must also be granted `roles/compute.viewer` in that shared project.
# @markdown   - This cross-project permission enables your endpoint’s underlying infrastructure to "see" and utilize the reservation capacity in the shared project.
#
# @markdown 4. **Recommended Regions for Specialized GPUs:**
# @markdown If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus).
#
# @markdown > | Machine Type        | Accelerator Type       | Recommended Regions |
# @markdown > | ------------------- | ---------------------- | ------------------ |
# @markdown > | a2-ultragpu-1g      | 1 NVIDIA_A100_80GB     | us-central1, us-east4, europe-west4, asia-southeast1 |
# @markdown > | a3-highgpu-2g       | 2 NVIDIA_H100_80GB     | us-west1, asia-southeast1, europe-west4 |
# @markdown > | a3-highgpu-4g       | 4 NVIDIA_H100_80GB     | us-west1, asia-southeast1, europe-west4 |
# @markdown > | a3-highgpu-8g       | 8 NVIDIA_H100_80GB     | us-central1, us-east5, europe-west4, us-west1, asia-southeast1 |

PROJECT_ID = ""  # @param {type:"string"}
SHARED_PROJECT_ID = ""  # @param {type:"string"}

BUCKET_URI = "gs://"  # @param {type:"string"}

REGION = ""  # @param {type:"string"}

# Import the necessary packages

! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git

import datetime
import importlib
import os
import uuid
from typing import Tuple

from google.cloud import aiplatform

common_util = importlib.import_module(
    "vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util"
)

models, endpoints = {}, {}

# Enable the Vertex AI API and Compute Engine API, if not already.
print("Enabling Vertex AI API and Compute Engine API.")
! gcloud services enable aiplatform.googleapis.com compute.googleapis.com

# Cloud Storage bucket for storing the experiment artifacts.
# A unique GCS bucket will be created for the purpose of this notebook. If you
# prefer using your own GCS bucket, change the value yourself below.
now = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET_NAME = "/".join(BUCKET_URI.split("/")[:3])

if BUCKET_URI is None or BUCKET_URI.strip() == "" or BUCKET_URI == "gs://":
    BUCKET_URI = f"gs://{PROJECT_ID}-tmp-{now}-{str(uuid.uuid4())[:4]}"
    BUCKET_NAME = "/".join(BUCKET_URI.split("/")[:3])
    ! gsutil mb -l {REGION} {BUCKET_URI}
else:
    assert BUCKET_URI.startswith("gs://"), "BUCKET_URI must start with `gs://`."
    shell_output = ! gsutil ls -Lb {BUCKET_NAME} | grep "Location constraint:" | sed "s/Location constraint://"
    bucket_region = shell_output[0].strip().lower()
    if bucket_region != REGION:
        raise ValueError(
            "Bucket region %s is different from notebook region %s"
            % (bucket_region, REGION)
        )
print(f"Using this GCS Bucket: {BUCKET_URI}")

STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")

# Initialize Vertex AI API.
print("Initializing Vertex AI API.")
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

# Get the default Compute Engine service account.
# Ask gcloud for the project number directly instead of parsing the YAML output.
shell_output = ! gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
project_number = shell_output[0].strip()
SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"
print("Using this default Service Account:", SERVICE_ACCOUNT)

# Get the P4SA email for the current project
P4SA_SERVICE_ACCOUNT = (
    f"service-{project_number}@gcp-sa-aiplatform.iam.gserviceaccount.com"
)
print("Current P4SA Service Account:", P4SA_SERVICE_ACCOUNT)

# Get the P4SA email for the shared project
shell_output = ! gcloud projects describe $SHARED_PROJECT_ID --format="value(projectNumber)"
shared_project_number = shell_output[0].strip()
SHARED_P4SA_SERVICE_ACCOUNT = (
    f"service-{shared_project_number}@gcp-sa-aiplatform.iam.gserviceaccount.com"
)
print("Shared P4SA Service Account:", SHARED_P4SA_SERVICE_ACCOUNT)

# grant compute.viewer role to the current P4SA
command = f"gcloud projects add-iam-policy-binding {PROJECT_ID} --member=serviceAccount:{P4SA_SERVICE_ACCOUNT} --role=roles/compute.viewer"
! {command}
command = f"gcloud projects add-iam-policy-binding {SHARED_PROJECT_ID} --member=serviceAccount:{P4SA_SERVICE_ACCOUNT} --role=roles/compute.viewer"
! {command}

# grant compute.viewer role to the shared P4SA
command = f"gcloud projects add-iam-policy-binding {PROJECT_ID} --member=serviceAccount:{SHARED_P4SA_SERVICE_ACCOUNT} --role=roles/compute.viewer"
! {command}
command = f"gcloud projects add-iam-policy-binding {SHARED_PROJECT_ID} --member=serviceAccount:{SHARED_P4SA_SERVICE_ACCOUNT} --role=roles/compute.viewer"
! {command}

! gcloud config set project $PROJECT_ID

print(f"Using Project ID: {PROJECT_ID}")
print(f"Using Shared Project ID: {SHARED_PROJECT_ID}")
print(f"Using Region: {REGION}")
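The service-account wiring in the setup cell can be summarized in a small sketch with no side effects. The helper names `vertex_p4sa_email` and `compute_viewer_binding_commands` are hypothetical and are not part of the notebook's utilities; the email format and the `gcloud` command mirror the ones used in the cell. The sketch only builds command strings, so you can review them before running anything. The project IDs and numbers are placeholders.

```python
from typing import List


def vertex_p4sa_email(project_number: str) -> str:
    """Returns the Vertex AI P4SA email for a given project number."""
    return f"service-{project_number}@gcp-sa-aiplatform.iam.gserviceaccount.com"


def compute_viewer_binding_commands(
    project_ids: List[str], service_accounts: List[str]
) -> List[str]:
    """Builds one IAM binding command per (project, service account) pair."""
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member=serviceAccount:{service_account} "
        "--role=roles/compute.viewer"
        for project_id in project_ids
        for service_account in service_accounts
    ]


# Hypothetical project IDs and project numbers, for illustration only.
commands = compute_viewer_binding_commands(
    project_ids=["my-project", "my-shared-project"],
    service_accounts=[
        vertex_p4sa_email("111111111111"),
        vertex_p4sa_email("222222222222"),
    ],
)
for command in commands:
    print(command)
```

Two projects and two P4SAs yield four bindings, which matches the four `add-iam-policy-binding` calls in the setup cell.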

In [None]:
# @title **Llama-3.1 vLLM Endpoint Deployment Utility Functions**

# @markdown This section introduces utility functions to facilitate deploying the `Llama-3.1-8B` model to a Vertex AI endpoint using the [vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html) runtime environment. The tools provided here will help streamline the process of loading models, configuring serving parameters, and integrating seamlessly with Vertex AI predictions. By abstracting away much of the complexity, these utilities empower you to:

# @markdown - **Provision Efficient Inference Runtimes:**
# @markdown   Take advantage of vLLM’s optimized serving environment, which is designed to handle large language model inference at scale. The library’s focus on latency reduction, memory efficiency, and throughput enables you to achieve superior performance with fewer system resources.

# @markdown - **Customize Model Behavior for Your Use Case:**
# @markdown   Adjust parameters such as prompt handling, token generation strategies, and memory management policies directly through the utility functions, ensuring that your deployment meets the unique requirements of your application—be it real-time dialogue systems, multi-language support, summarization tasks, or any other LLM-powered workflow.

# @markdown ### **About Meta’s Llama 3.1 Collections**

# @markdown [Meta’s Llama 3.1 series](https://huggingface.co/meta-llama/Llama-3.1-8B) comprises a set of cutting-edge multilingual LLMs, pretrained and instruction-tuned to excel in a wide range of tasks. They support robust conversational capabilities, succinct summarization, and intelligent retrieval from diverse data sources. Whether you’re building interactive chatbots, processing large volumes of content, or generating domain-specific reports, these models serve as a powerful starting point.

# @markdown ### **Hugging Face User Access Tokens**

# @markdown To access the Llama 3.1 models and other resources hosted on Hugging Face, you will need a **Read Access Token**. This token ensures you have the necessary permissions to download models and related artifacts while maintaining proper security controls.

# @markdown **Follow these steps to obtain and use your Hugging Face token:**

# @markdown 1. **Generate a Read Access Token:**
# @markdown    - Visit your [Hugging Face account settings](https://huggingface.co/settings/tokens).
# @markdown    - Click on **Create new token**, assign it a **Read** role (no more permissions than necessary), and generate the token.
# @markdown    - Store this token in a safe and secure location, as it provides direct access to the resources you’ll need.

# @markdown 2. **Use the Token for Authentication:**
# @markdown    - Within this notebook or your scripting environment, supply the token to authenticate yourself.
# @markdown    - This ensures that any model downloads or asset retrieval from private or restricted repositories occur smoothly and securely.

# @markdown Maintaining minimal, read-only permissions helps prevent accidental exposures or misuse of your credentials. For more details on configuring Hugging Face tokens, refer to the platform’s official documentation and best practices.

# @markdown **Provide Hugging Face TOKEN to Download Models:**

HF_TOKEN = ""  # @param {type:"string"}

# The pre-built serving docker images.
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241001_0916_RC00"


def deploy_model_vllm(
    model_name: str,
    model_id: str,
    service_account: str,
    base_model_id: str = None,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    gpu_memory_utilization: float = 0.9,
    max_model_len: int = 4096,
    dtype: str = "auto",
    enable_trust_remote_code: bool = False,
    enforce_eager: bool = False,
    enable_lora: bool = False,
    max_loras: int = 1,
    max_cpu_loras: int = 8,
    use_dedicated_endpoint: bool = False,
    max_num_seqs: int = 256,
    model_type: str = None,
    reservation_name: str = None,
    reservation_affinity_type: str = None,
    reservation_project: str = None,
    reservation_zone: str = None,
    is_spot: bool = False,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys trained models with vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(
        display_name=f"{model_name}-endpoint",
        dedicated_endpoint_enabled=use_dedicated_endpoint,
    )

    if not base_model_id:
        base_model_id = model_id

    # See https://docs.vllm.ai/en/latest/models/engine_args.html for a list of possible arguments with descriptions.
    vllm_args = [
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--gpu-memory-utilization={gpu_memory_utilization}",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        f"--max-loras={max_loras}",
        f"--max-cpu-loras={max_cpu_loras}",
        f"--max-num-seqs={max_num_seqs}",
        "--disable-log-stats",
    ]

    if enable_trust_remote_code:
        vllm_args.append("--trust-remote-code")

    if enforce_eager:
        vllm_args.append("--enforce-eager")

    if enable_lora:
        vllm_args.append("--enable-lora")

    if model_type:
        vllm_args.append(f"--model-type={model_type}")

    env_vars = {
        "MODEL_ID": base_model_id,
        "DEPLOY_SOURCE": "notebook",
    }

    # HF_TOKEN is not a compulsory field and may not be defined.
    try:
        if HF_TOKEN:
            env_vars["HF_TOKEN"] = HF_TOKEN
    except NameError:
        pass

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
    )
    print(
        f"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s)."
    )

    deploy_args = {
        "endpoint": endpoint,
        "machine_type": machine_type,
        "accelerator_type": accelerator_type,
        "accelerator_count": accelerator_count,
        "deploy_request_timeout": 1800,
        "service_account": service_account,
    }

    if is_spot:
        deploy_args["min_replica_count"] = 1
        deploy_args["max_replica_count"] = 1
        deploy_args["spot"] = True
        deploy_args["sync"] = True

    if reservation_affinity_type:
        deploy_args["reservation_affinity_type"] = reservation_affinity_type

    if reservation_name:
        deploy_args[
            "reservation_affinity_key"
        ] = "compute.googleapis.com/reservation-name"
        deploy_args["reservation_affinity_values"] = [
            f"projects/{reservation_project}/zones/{reservation_zone}/reservations/{reservation_name}"
        ]

    model.deploy(**deploy_args)

    print("endpoint_name:", endpoint.name)

    return model, endpoint
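`deploy_model_vllm` formats the reservation path even when `reservation_project` or `reservation_zone` is left as `None`, which would produce a path containing the literal text `None`. A minimal sketch of a stricter helper follows. The function name `build_reservation_deploy_args` is hypothetical; the dictionary keys and the path format are copied from the function above. Whether the stricter validation suits your workflow is a judgment call.

```python
from typing import Any, Dict, Optional

RESERVATION_KEY = "compute.googleapis.com/reservation-name"


def build_reservation_deploy_args(
    affinity_type: Optional[str] = None,
    reservation_name: Optional[str] = None,
    reservation_project: Optional[str] = None,
    reservation_zone: Optional[str] = None,
) -> Dict[str, Any]:
    """Builds the reservation-related keyword arguments for `model.deploy`."""
    args: Dict[str, Any] = {}
    if affinity_type:
        args["reservation_affinity_type"] = affinity_type

    # A specific reservation needs all three parts of the resource path.
    if affinity_type == "SPECIFIC_RESERVATION" or reservation_name:
        required = {
            "reservation_name": reservation_name,
            "reservation_project": reservation_project,
            "reservation_zone": reservation_zone,
        }
        missing = [label for label, value in required.items() if not value]
        if missing:
            raise ValueError(f"Missing reservation fields: {', '.join(missing)}")
        args["reservation_affinity_key"] = RESERVATION_KEY
        args["reservation_affinity_values"] = [
            f"projects/{reservation_project}/zones/{reservation_zone}"
            f"/reservations/{reservation_name}"
        ]
    return args


# Hypothetical reservation, for illustration only.
print(
    build_reservation_deploy_args(
        affinity_type="SPECIFIC_RESERVATION",
        reservation_name="my-reservation",
        reservation_project="my-project",
        reservation_zone="us-central1-a",
    )
)
```

The returned dictionary could be merged into `deploy_args` with `deploy_args.update(...)` before calling `model.deploy(**deploy_args)`.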

### Spot VM

In [None]:
# @title Spot VM Vertex AI Endpoint Deployment

# @markdown **What are Spot VMs?**  [Spot VMs](https://cloud.google.com/compute/docs/instances/spot) are spare compute instances offered by Google Cloud at significantly discounted rates. Unlike standard on-demand VMs, Spot VMs provide lower prices—often as much as 60-91% off the regular cost for most machine types and GPUs—making them extremely cost-effective for certain workloads. However, the trade-off is that these VMs can be preempted (stopped) by Google Cloud at any time if the capacity is needed elsewhere. As a result, Spot VMs are best suited for workloads that are resilient to interruptions.

# @markdown **Stockouts and Resource Availability:**  Even with the correct quotas, you may encounter **stockouts**, which occur when the requested resources (such as a specific VM family, shape, or disk type) are temporarily unavailable. This situation can lead to delays or increased costs if you opt for alternative resource configurations. For more insights into handling capacity constraints and stockouts, refer to the [Capacity, Quota, and Stockouts resource guide](https://www.googlecloudcommunity.com/gc/Community-Blogs/Managing-Capacity-Quota-and-Stockouts-in-the-Cloud-Concepts-and/ba-p/464770#toc-hId-1635110264).

# @markdown **Mitigating Stockouts with Spot VMs and Reservations:**  If a particular VM type or resource is experiencing shortages, consider alternative strategies:

# @markdown - **Use Spot VMs:**  Spot VMs fill idle capacity at discounted prices. If a stockout prevents you from acquiring standard VMs, a Spot VM can serve as a cost-effective fallback when spare capacity exists, although Spot capacity itself is not guaranteed. While preemptions can occur, if your model inference or training jobs can tolerate being paused or restarted, this approach can greatly reduce compute costs.

# @markdown - **Use Reservations:**  Another way to ensure predictable resource availability is to use [Reservations](https://cloud.google.com/vertex-ai/docs/predictions/use-reservations#get-predictions), which guarantee that certain resources remain allocated for your workloads. Although not as cost-effective as Spot VMs, reservations can alleviate the uncertainty caused by stockouts, ensuring that you always have enough capacity for your deployments.

# @markdown **When to Choose Spot VMs:**  Spot VMs are ideal for jobs and tasks that are:

# @markdown - **Fault-Tolerant and Interruptible:**  If your workloads can handle interruptions—such as batch processing jobs that can resume after a delay or distributed training jobs that can adjust to losing certain workers—a Spot VM’s lower cost can result in significant savings over time.

# @markdown - **Not Strictly Latency-Sensitive:**  For workflows where occasional preemptions do not severely impact business outcomes, Spot VMs are a strategic choice.

# @markdown **Cost and Billing Model:**  One attractive aspect of Spot VMs is that you're billed only for the actual compute time used. You do not pay for time spent waiting in a job queue or for time lost due to preemptions. If a Spot VM is interrupted, you’re not charged for the downtime, which makes the overall cost model more favorable for price-sensitive workloads.

# @markdown **Learn More:**  To understand the pricing structure, including discounts and comparisons to standard on-demand VMs, see the [Spot VM Pricing Guide](https://cloud.google.com/compute/docs/instances/spot#pricing).

# @markdown For strategies on handling preemptions within Vertex AI, including how to design your workflows and code to gracefully manage interruptions, consult the [Preemption Handling Documentation](https://cloud.google.com/vertex-ai/docs/predictions/use-spot-vms#preemption-handling).

# @markdown By combining Spot VMs with robust workflow planning and reservations, you can strike an optimal balance between cost savings, reliability, and performance in your Vertex AI endpoint deployments.

# @markdown **Set the model to deploy:**

base_model_name = "Meta-Llama-3.1-8B"  # @param ["Meta-Llama-3.1-8B"]
hf_model_id = "meta-llama/" + base_model_name

if "8b" in base_model_name.lower():
    accelerator_type = "NVIDIA_L4"
    machine_type = "g2-standard-12"
    accelerator_count = 1
    max_loras = 5
else:
    # `accelerator_type` is not defined on this branch, so report only the model name.
    raise ValueError(
        f"Recommended GPU setting not found for: {base_model_name}."
    )

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

gpu_memory_utilization = 0.95
max_model_len = 8192  # Maximum context length.

models["vllm_gpu_spotvm"], endpoints["vllm_gpu_spotvm"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix="llama3_1-serve-spotvm"),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=gpu_memory_utilization,
    max_model_len=max_model_len,
    max_loras=max_loras,
    enforce_eager=True,
    enable_lora=True,
    use_dedicated_endpoint=False,
    model_type="llama3.1",
    is_spot=True,
)
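To make the quoted 60-91% Spot discount range concrete, the sketch below converts an on-demand hourly price into the implied range of Spot prices. The function name `spot_hourly_price_range` and the example price are hypothetical; actual Spot prices vary by machine type and region and should be taken from the pricing pages linked above.

```python
from typing import Tuple


def spot_hourly_price_range(
    on_demand_hourly_usd: float,
    min_discount: float = 0.60,
    max_discount: float = 0.91,
) -> Tuple[float, float]:
    """Returns the (lowest, highest) Spot hourly price implied by a discount range."""
    if not 0.0 <= min_discount <= max_discount <= 1.0:
        raise ValueError("Discounts must satisfy 0 <= min <= max <= 1.")
    # The largest discount gives the lowest price, and vice versa.
    lowest = round(on_demand_hourly_usd * (1.0 - max_discount), 4)
    highest = round(on_demand_hourly_usd * (1.0 - min_discount), 4)
    return lowest, highest


# Hypothetical on-demand price of 1.00 USD per hour, for illustration only.
low, high = spot_hourly_price_range(1.00)
print(f"Spot price range: {low:.2f} to {high:.2f} USD per hour")
```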

In [None]:
# @title Raw predict with SpotVM Endpoint

# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts. Sampling parameters supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/dev/sampling_params.html).

# @markdown Example:

# @markdown ```
# @markdown Human: What is a car?
# @markdown Assistant:  A car, or a motor car, is a road-connected human-transportation system used to move people or goods from one place to another. The term also encompasses a wide range of vehicles, including motorboats, trains, and aircrafts. Cars typically have four wheels, a cabin for passengers, and an engine or motor. They have been around since the early 19th century and are now one of the most popular forms of transportation, used for daily commuting, shopping, and other purposes.
# @markdown ```
# @markdown Additionally, you can moderate the generated text with Vertex AI. See [Moderate text documentation](https://cloud.google.com/natural-language/docs/moderating-text) for more details.

# Loads an existing endpoint instance using the endpoint name:
# - Using `endpoint_name = endpoint.name` allows us to get the
#   endpoint name of the endpoint `endpoint` created in the cell
#   above.
# - Alternatively, you can set `endpoint_name = "1234567890123456789"` to load
#   an existing endpoint with the ID 1234567890123456789.
# You may uncomment the code below to load an existing endpoint.

# endpoint_name = ""  # @param {type:"string"}
# aip_endpoint_name = (
#     f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}"
# )
# endpoint = aiplatform.Endpoint(aip_endpoint_name)

prompt = "What is a car?"  # @param {type: "string"}
max_tokens = 50  # @param {type:"integer"}
temperature = 1.0  # @param {type:"number"}
top_p = 1.0  # @param {type:"number"}
top_k = 1  # @param {type:"integer"}
raw_response = True  # @param {type:"boolean"}

# Overrides parameters for inferences.
instances = [
    {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    },
]
response = endpoints["vllm_gpu_spotvm"].predict(instances=instances)

for prediction in response.predictions:
    print(prediction)

# @markdown Click "Show Code" to see more details.
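Because a Spot VM can be preempted, a request to the endpoint may fail while the replica is being replaced. One common client-side mitigation is to retry with exponential backoff. The helper below is a generic sketch, not part of the Vertex AI SDK; the name `call_with_retries` and its defaults are assumptions chosen for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Calls `fn`, retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Delay doubles after each failure: base, 2 * base, 4 * base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")


# Possible usage with the endpoint deployed above (left commented out so the
# sketch stays self-contained):
# response = call_with_retries(
#     lambda: endpoints["vllm_gpu_spotvm"].predict(instances=instances)
# )
```

In practice you would likely narrow `retry_on` to transient errors only, so that genuine request errors are not retried.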

### Reservations
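Reservations are zonal, while a Vertex AI endpoint is deployed to a region, and the reserved machine and accelerator configuration has to match what the deployment requests. The sketch below checks these conditions before you deploy. The function names and dictionary layout are hypothetical, and the sketch assumes both sides use the same naming for machine and accelerator types. Google Cloud zone names follow the `<region>-<letter>` pattern, which is what `region_of_zone` relies on.

```python
from typing import Any, Dict, List


def region_of_zone(zone: str) -> str:
    """Derives the region from a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, _, suffix = zone.rpartition("-")
    if not region or not suffix:
        raise ValueError(f"Not a zone name: {zone!r}")
    return region


def find_reservation_mismatches(
    reservation: Dict[str, Any], deployment: Dict[str, Any]
) -> List[str]:
    """Returns human-readable mismatches; the list is empty when consistent."""
    problems = []
    if region_of_zone(reservation["zone"]) != deployment["region"]:
        problems.append(
            f"zone {reservation['zone']} is not in region {deployment['region']}"
        )
    for field in ("machine_type", "accelerator_type", "accelerator_count"):
        if reservation[field] != deployment[field]:
            problems.append(
                f"{field}: reservation has {reservation[field]!r}, "
                f"deployment requests {deployment[field]!r}"
            )
    return problems


# Hypothetical values, for illustration only.
reservation = {
    "zone": "us-central1-a",
    "machine_type": "a2-ultragpu-1g",
    "accelerator_type": "NVIDIA_A100_80GB",
    "accelerator_count": 1,
}
deployment = {
    "region": "us-central1",
    "machine_type": "a2-ultragpu-1g",
    "accelerator_type": "NVIDIA_A100_80GB",
    "accelerator_count": 1,
}
print(find_reservation_mismatches(reservation, deployment))
```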


In [None]:
# @title Set Up Reservations for Vertex AI Predictions

# @markdown ### Why Use Reservations?

# @markdown In addition to using Spot VMs, another robust strategy to mitigate capacity issues and avoid stockouts is to leverage resource **reservations**. A reservation is a powerful Compute Engine feature that guarantees the availability of certain machine and accelerator resources within a specific zone, ensuring your Vertex AI endpoints can scale predictably. Unlike relying on transient or best-effort resources, reservations grant you a higher level of certainty that the infrastructure you need will be ready when you need it.

# @markdown **Key Advantages of Reservations:**

# @markdown 1. **Predictable Capacity:**
# @markdown By reserving capacity ahead of time, you ensure that resources (such as specific GPU types) remain available. This is particularly valuable when deploying models at scale or handling peak workloads that demand consistent performance and availability.

# @markdown 2. **Simplified Scaling and Migration:**
# @markdown Reservations facilitate scaling up your deployments without capacity surprises. They also help with planned migrations or transitioning to new hardware configurations, minimizing downtime and resource contention.

# @markdown 3. **Disaster Recovery Preparedness:**
# @markdown In the event of failures or the need for rapid failover, having pre-reserved resources enables you to spin up new endpoints quickly in the designated zone. This capability enhances the resiliency and reliability of your Vertex AI services.

# @markdown ### Current Support for GPU Reservations in Vertex AI

# @markdown It’s crucial to note that, as of now, **only GPU reservations are supported in Vertex AI**. This limitation means that if your workloads rely on GPU-accelerated inference, reservations can ensure that the required GPUs are consistently available. If you’re running GPU-intensive machine learning tasks like large language model inference, image recognition, or video processing, GPU reservations can be a game changer in maintaining stable and predictable performance.

# @markdown For more details on managing reservations, including best practices and advanced configurations, consult the [Compute Engine Reservations Overview](https://cloud.google.com/compute/docs/instances/reservations-overview).

# @markdown ### Configuring Your Deployment to Use Reservations

# @markdown You have the flexibility to configure Vertex AI endpoints to consume either a **specific reservation** or **any available reservation**, depending on your operational requirements and the degree of control you wish to maintain.

# @markdown - **`ANY_RESERVATION`:**
# @markdown If you choose this option, the endpoint will use any suitable reservation available in the specified project and region. This approach is simpler and may be suitable if you have multiple reservations and don’t need fine-grained resource management.

# @markdown - **`SPECIFIC_RESERVATION`:**
# @markdown By specifying the exact reservation name, you ensure that the endpoint always pulls from a predefined pool of resources. This method is ideal when you need strict control over the hardware configuration—such as ensuring a particular GPU type and count—or when you have distinct reservations allocated for different use cases or departments.

# @markdown ### Important Considerations

# @markdown **Same Project, Same Region, Same Zone:**
# @markdown As emphasized earlier, ensure that your Vertex AI endpoint and the reservation are in the **same region**, and that the reservation’s zone falls within that region. The reservation must live either in the **same project** as the endpoint or, for a shared reservation, in a project within the same organization (see the setup cell). This alignment prevents configuration issues when the endpoint tries to consume the reserved capacity.

# @markdown **Matching Resource Configurations:**
# @markdown The `RES_MACHINE_TYPE` and `RES_ACCELERATOR_TYPE` you specify in your Vertex AI endpoint deployment command must exactly match the configuration defined in the reservation. If these configurations differ, the endpoint may not be able to utilize the reserved capacity, leading to stockouts or fallback to alternative resource pools.

# @markdown ### Parameters for Reservation-Based Deployment

# @markdown When configuring your deployment, use the parameters below and replace their placeholders with values specific to your environment:
# @markdown - **`RES_MACHINE_TYPE`:** [The machine type](https://cloud.google.com/compute/docs/accelerator-optimized-machines) you plan to use for the endpoint (e.g., `n1-standard-4`). Ensure that this machine type aligns with what’s defined in your reservation.
# @markdown - **`RES_ACCELERATOR_TYPE`:** The type of [GPU (or other accelerators)](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec) your model requires (e.g., `nvidia-tesla-t4`). Confirm this type matches the accelerator configuration in your reservation.
# @markdown - **`RES_ACCELERATOR_COUNT`:** The number of accelerators per instance (e.g., `1`). Adjust this to match your model’s inference or training needs.
# @markdown - **`RES_PROJECT_ID`:** Your Google Cloud project ID. Reservations must be created in and consumed from this project.
# @markdown - **`RES_ZONE`:** The zone (e.g., `us-central1-a`) where your reservation is located. The zone must belong to the same region as the Vertex AI endpoint.
# @markdown - **`RESERVATION_NAME`:** The name of your GPU reservation. Refer to this by name when using `SPECIFIC_RESERVATION` mode to guarantee the endpoint consumes the correct reserved resources.

# @markdown With these parameters set to match your reservation, your Vertex AI endpoints draw on reserved capacity, which reduces the risk of resource shortages as demand fluctuates.


import time

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# Authenticate and build service
credentials = GoogleCredentials.get_application_default()
service = discovery.build("compute", "v1", credentials=credentials)


# Function to wait for operation to complete
def wait_for_zonal_operation(service, project, zone, operation, delete=False):
    print("Waiting for operation to finish...")
    while True:
        result = (
            service.zoneOperations()
            .get(project=project, zone=zone, operation=operation)
            .execute()
        )

        if result["status"] == "DONE":
            if "error" in result:
                print("Error during operation:", result["error"])
            elif delete:
                print("Reservation deleted successfully.")
            else:
                print("Reservation created successfully.")
            return result
        time.sleep(1)


def create_reservation(
    res_project_id,
    res_zone,
    res_name,
    res_machine_type,
    res_accelerator_type,
    res_accelerator_count,
    shared_project_id,
):
    """
    Create a Compute Engine reservation shared with a specific project.

    Args:
        res_project_id (str): ID of the project that owns the reservation.
        res_zone (str): Zone where the reservation will be created.
        res_name (str): Name of the reservation.
        res_machine_type (str): Machine type for the reservation.
        res_accelerator_type (str): Accelerator type for the reservation.
        res_accelerator_count (int): Number of accelerators per instance.
        shared_project_id (str): ID of the project to share the reservation with (required).

    Returns:
        dict: Final result of the operation.

    Raises:
        ValueError: If shared_project_id is empty.
    """
    # Define reservation
    reservation_body = {
        "name": res_name,
        "specificReservation": {
            "count": 1,
            "instanceProperties": {
                "machineType": res_machine_type,
                "guestAccelerators": [
                    {
                        "acceleratorType": res_accelerator_type,
                        "acceleratorCount": res_accelerator_count,
                    }
                ],
            },
        },
        "specificReservationRequired": True,
    }

    # Sharing is mandatory here: fail fast if no target project is given.
    if not shared_project_id:
        raise ValueError("shared_project_id must be provided.")
    reservation_body["shareSettings"] = {
        "shareType": "SPECIFIC_PROJECTS",
        "projectMap": {shared_project_id: {"projectId": shared_project_id}},
    }

    # Create reservation
    request = service.reservations().insert(
        project=res_project_id, zone=res_zone, body=reservation_body
    )

    response = request.execute()

    # Wait for the operation to complete
    operation_name = response["name"]
    return wait_for_zonal_operation(service, res_project_id, res_zone, operation_name)


def delete_reservation(project_id, zone, name):
    """
    Delete a reservation for a specific project in Google Cloud Platform.

    Args:
        project_id (str): Project ID.
        zone (str): Zone where the reservation exists.
        name (str): Name of the reservation to delete.

    Returns:
        dict: Final result of the operation.
    """
    # Authenticate and build service
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build("compute", "v1", credentials=credentials)

    # Delete the reservation
    request = service.reservations().delete(
        project=project_id, zone=zone, reservation=name
    )

    response = request.execute()

    # Wait for the operation to complete
    operation_name = response["name"]
    return wait_for_zonal_operation(service, project_id, zone, operation_name, True)
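
`wait_for_zonal_operation` above polls once per second with no upper bound, so a stuck operation would block the cell indefinitely. The sketch below shows one possible variant with a timeout and a simple backoff. The name `poll_until_done` and its parameters are illustrative and not part of this notebook or of any Google client library, and the fake responses only stand in for `service.zoneOperations().get(...).execute()`.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    max_interval_s: float = 15.0,
) -> Dict[str, Any]:
    """Call fetch_status() until it reports DONE or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch_status()
        if result.get("status") == "DONE":
            if "error" in result:
                raise RuntimeError(f"Operation failed: {result['error']}")
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Operation not DONE after {timeout_s} seconds")
        time.sleep(interval_s)
        # Back off gradually so long operations are not polled every second.
        interval_s = min(interval_s * 2, max_interval_s)


# Hypothetical stand-in for the Compute API responses, for illustration only.
_fake_responses = iter(
    [{"status": "PENDING"}, {"status": "RUNNING"}, {"status": "DONE"}]
)
final = poll_until_done(lambda: next(_fake_responses), timeout_s=5, interval_s=0.01)
print(final["status"])
```

With the `service` object built above, `fetch_status` could be a lambda that wraps the same `zoneOperations().get(...).execute()` call used in `wait_for_zonal_operation`.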

In [None]:
# @title Create A New Shared Reservation for `ANY_RESERVATION` Deployment Use Case.

# @markdown The **deployment machine specifications and accelerator type must match the reservation machine specifications**. If they differ, the endpoint may not be able to consume the reserved capacity.

# @markdown Provide the following arguments:

rev_names = []

reservation_zone = "a"  # @param {type:"string"}
RES_ZONE = f"{REGION}-{reservation_zone}"

RESERVATION_NAME = "shared-reservation-1"  # @param {type:"string"}
RESERVATION_NAME = f"{PROJECT_ID}-{RESERVATION_NAME}"
RES_MACHINE_TYPE = "g2-standard-12"  # @param {type:"string"}
RES_ACCELERATOR_TYPE = "nvidia-l4"  # @param {type:"string"}
RES_ACCELERATOR_COUNT = 1  # @param {type:"integer"}
rev_names.append(RESERVATION_NAME)

create_reservation(
    res_project_id=PROJECT_ID,
    res_zone=RES_ZONE,
    res_name=RESERVATION_NAME,
    res_machine_type=RES_MACHINE_TYPE,
    res_accelerator_type=RES_ACCELERATOR_TYPE,
    res_accelerator_count=RES_ACCELERATOR_COUNT,
    shared_project_id=SHARED_PROJECT_ID,
)
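
Because the cell above prefixes the reservation name with `PROJECT_ID`, the combined name can grow long. Compute Engine resource names must be 1 to 63 characters, use only lowercase letters, digits, and hyphens, start with a letter, and not end with a hyphen. The helper below is a minimal sketch for checking a name before calling `create_reservation`; `is_valid_reservation_name` is an illustrative name, not something defined elsewhere in this notebook.

```python
import re

# Compute Engine resource names: 1-63 characters, lowercase letters, digits,
# and hyphens; must start with a letter and must not end with a hyphen.
_NAME_RE = re.compile(r"[a-z]([-a-z0-9]{0,61}[a-z0-9])?")


def is_valid_reservation_name(name: str) -> bool:
    """Return True if `name` satisfies the Compute Engine naming rule."""
    return bool(_NAME_RE.fullmatch(name))


# Example values only; substitute your own project ID and reservation name.
print(is_valid_reservation_name("my-project-shared-reservation-1"))
print(is_valid_reservation_name("My_Project-shared-reservation-1"))
print(is_valid_reservation_name("a" * 64))
```

The first call should pass, while the second (uppercase and underscore) and third (64 characters) should fail.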

In [None]:
# @title Create A New Shared Reservation for `SPECIFIC_RESERVATION` Deployment Use Case.

# @markdown The **deployment machine specifications and accelerator type must match the reservation machine specifications**. If they differ, the endpoint may not be able to consume the reserved capacity.

# @markdown Provide the following arguments:

# Keep `rev_names` from the previous cell so that cleanup deletes both reservations.

reservation_zone = "a"  # @param {type:"string"}
RES_ZONE = f"{REGION}-{reservation_zone}"

RESERVATION_NAME = "shared-reservation-2"  # @param {type:"string"}
RESERVATION_NAME = f"{PROJECT_ID}-{RESERVATION_NAME}"
rev_names.append(RESERVATION_NAME)

create_reservation(
    res_project_id=PROJECT_ID,
    res_zone=RES_ZONE,
    res_name=RESERVATION_NAME,
    res_machine_type=RES_MACHINE_TYPE,
    res_accelerator_type=RES_ACCELERATOR_TYPE,
    res_accelerator_count=RES_ACCELERATOR_COUNT,
    shared_project_id=SHARED_PROJECT_ID,
)
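
The reservation's zone has to fall inside the endpoint's region. The cells above guarantee this by building `RES_ZONE` from `REGION`, but if you set the zone by hand, a quick check like the sketch below may catch mismatches early. It assumes zone names are the region name plus a single suffix (for example `us-central1-a`); `region_of` and `zone_matches_region` are illustrative helper names.

```python
def region_of(zone: str) -> str:
    """Return the region part of a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"Not a zone name: {zone!r}")
    return region


def zone_matches_region(zone: str, region: str) -> bool:
    """Return True if `zone` belongs to `region`."""
    return region_of(zone) == region


# Example values only.
print(zone_matches_region("us-central1-a", "us-central1"))
print(zone_matches_region("europe-west4-b", "us-central1"))
```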

In [None]:
# @title Retrieve Newly Created Reservation

# @markdown Viewing reservations is useful to get an overview of all the reservations in your project, or to review the configuration details of a single reservation. A shared reservation can only be viewed from its owner project.

# @markdown Note that the reservation's project and zone could be different from the `PROJECT_ID` and `REGION` used in this notebook. The call below lists reservations in `PROJECT_ID` and `RES_ZONE`.

from google.cloud import compute_v1
from google.cloud.compute_v1.services.reservations.pagers import ListPager


def list_compute_reservation(project_id: str, zone: str = "us-central1-a") -> ListPager:
    """
    Lists all compute reservations in a specified Google Cloud project and zone.
    Args:
        project_id (str): The ID of the Google Cloud project.
        zone (str): The zone of the reservations.
    Returns:
        ListPager: A pager object containing the list of reservations.
    """

    client = compute_v1.ReservationsClient()

    reservations_list = client.list(
        project=project_id,
        zone=zone,
    )

    for reservation in reservations_list:
        print("Name: ", reservation.name)
        print(
            "Machine type: ",
            reservation.specific_reservation.instance_properties.machine_type,
        )

    return reservations_list


list_compute_reservation(project_id=PROJECT_ID, zone=RES_ZONE)
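
Beyond the name and machine type, a reservation's `specific_reservation` also carries `count` and `in_use_count`, which together show how much reserved capacity is still free. The sketch below computes that from any iterable of reservation-like objects. `free_capacity` is an illustrative helper, and the `SimpleNamespace` objects are stand-ins so the example runs without credentials; in the notebook you could pass the pager returned by `list_compute_reservation` instead.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def free_capacity(reservations: Iterable) -> List[Tuple[str, int]]:
    """Return (name, unused instance count) for each reservation."""
    summary = []
    for reservation in reservations:
        spec = reservation.specific_reservation
        summary.append((reservation.name, int(spec.count) - int(spec.in_use_count)))
    return summary


# Stand-in objects that mimic the fields read above; not real API responses.
demo = [
    SimpleNamespace(
        name="res-a",
        specific_reservation=SimpleNamespace(count=1, in_use_count=0),
    ),
    SimpleNamespace(
        name="res-b",
        specific_reservation=SimpleNamespace(count=2, in_use_count=2),
    ),
]
print(free_capacity(demo))
```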

In [None]:
# @title Deploy Llama-3.1 Endpoint with `SPECIFIC_RESERVATION`

# @markdown Prior to deploying the endpoint, in the Google Cloud console, go to the [Reservations page](https://console.cloud.google.com/compute/reservations).
# @markdown - Click on the newly created reservation.
# @markdown - Enable **Share with other Google services** in the reservation basic information panel.
# @markdown - Deploy the endpoint with the `SPECIFIC_RESERVATION` created in a previous cell.
hf_model_id = "meta-llama/Meta-Llama-3.1-8B"

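# These must describe the same hardware as the reservation. Vertex AI uses
# enum-style accelerator names (`NVIDIA_L4`), while the reservation uses the
# Compute Engine name (`nvidia-l4`).
MACHINE_TYPE = "g2-standard-12"
ACCELERATOR_TYPE = "NVIDIA_L4"
ACCELERATOR_COUNT = 1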

(
    models["vllm_gpu_specific_reserve"],
    endpoints["vllm_gpu_specific_reserve"],
) = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(
        prefix=f"llama3_1-serve-specific-{RESERVATION_NAME}"
    ),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    model_type="llama3.1",
    reservation_name=RESERVATION_NAME,
    reservation_affinity_type="SPECIFIC_RESERVATION",
    reservation_project=PROJECT_ID,
    reservation_zone=RES_ZONE,
)
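
`deploy_model_vllm` is defined earlier in this notebook and is not shown in this section. For `SPECIFIC_RESERVATION`, Vertex AI identifies the reservation by its full resource path, passed under the affinity key `compute.googleapis.com/reservation-name`. The sketch below shows how the project, zone, and name combine into that path; `reservation_resource_path` is an illustrative helper, and whether `deploy_model_vllm` builds the path exactly this way is an assumption.

```python
RESERVATION_AFFINITY_KEY = "compute.googleapis.com/reservation-name"


def reservation_resource_path(project: str, zone: str, name: str) -> str:
    """Build the full resource path of a zonal Compute Engine reservation."""
    for label, value in (("project", project), ("zone", zone), ("name", name)):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"projects/{project}/zones/{zone}/reservations/{name}"


# Example values only; substitute PROJECT_ID, RES_ZONE, and RESERVATION_NAME.
path = reservation_resource_path(
    "my-project", "us-central1-a", "my-project-shared-reservation-2"
)
print(RESERVATION_AFFINITY_KEY, "->", path)
```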

In [None]:
# @title Test `SPECIFIC_RESERVATION` Endpoint with Raw Predict

# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts. Sampling parameters supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/dev/sampling_params.html).

prompt = "What is a car?"  # @param {type: "string"}
max_tokens = 50  # @param {type:"integer"}
temperature = 1.0  # @param {type:"number"}
top_p = 1.0  # @param {type:"number"}
top_k = 1  # @param {type:"integer"}
raw_response = True  # @param {type:"boolean"}

# Build the request instance from the sampling parameters set above.
instances = [
    {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    },
]
response = endpoints["vllm_gpu_specific_reserve"].predict(instances=instances)

for prediction in response.predictions:
    print(prediction)

# @markdown Click "Show Code" to see more details.
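
Invalid sampling parameters are typically rejected only after the request reaches the serving container. The sketch below validates them locally and returns a request instance with the same keys used in the cell above. The accepted ranges are assumptions modeled on vLLM's sampling parameter checks and may differ between vLLM versions; `build_instance` is an illustrative helper name.

```python
from typing import Any, Dict


def build_instance(
    prompt: str,
    max_tokens: int = 50,
    temperature: float = 1.0,
    top_p: float = 1.0,
    top_k: int = 1,
    raw_response: bool = True,
) -> Dict[str, Any]:
    """Validate sampling parameters and return one request instance."""
    if not prompt:
        raise ValueError("prompt must not be empty")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    # Assumed convention: -1 disables top-k filtering.
    if top_k != -1 and top_k < 1:
        raise ValueError("top_k must be -1 (disabled) or at least 1")
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    }


instances = [build_instance("What is a car?")]
print(instances[0]["top_k"])
```

Note that with `top_k = 1`, as in the defaults above, only the single most likely token is kept at each step, so `temperature` and `top_p` have little practical effect on the output.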

In [None]:
# @title Deploy Llama-3.1 Endpoint with `ANY_RESERVATION`
# @markdown Prior to deploying the endpoint, in the Google Cloud console, go to the [Reservations page](https://console.cloud.google.com/compute/reservations).
# @markdown - Click on the newly created reservation.
# @markdown - Enable **Share with other Google services** in the reservation basic information panel.
# @markdown - Deploy the endpoint with `ANY_RESERVATION`.

hf_model_id = "meta-llama/Meta-Llama-3.1-8B"

models["vllm_gpu_any_reserve"], endpoints["vllm_gpu_any_reserve"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(
        prefix=f"llama3_1-serve-any-{RESERVATION_NAME}"
    ),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    model_type="llama3.1",
    reservation_affinity_type="ANY_RESERVATION",
)
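
With `ANY_RESERVATION`, no reservation is named, so whether reserved capacity is used depends on the deployment's machine specification matching a reservation. The reservation cells use Compute Engine accelerator names (`nvidia-l4`), while the deployment uses Vertex AI enum names (`NVIDIA_L4`). The sketch below compares the two sets of values. The upper-case and underscore conversion holds for the names used in this notebook and is an assumption for other accelerators; both helper names are illustrative.

```python
def vertex_accelerator_name(compute_name: str) -> str:
    """Convert a Compute Engine accelerator name to the Vertex AI enum spelling."""
    return compute_name.strip().upper().replace("-", "_")


def deployment_matches_reservation(reservation: dict, deployment: dict) -> bool:
    """Return True if machine type, accelerator type, and count all agree."""
    return (
        reservation["machine_type"] == deployment["machine_type"]
        and vertex_accelerator_name(reservation["accelerator_type"])
        == deployment["accelerator_type"]
        and reservation["accelerator_count"] == deployment["accelerator_count"]
    )


# Values taken from the cells in this notebook.
reservation_spec = {
    "machine_type": "g2-standard-12",
    "accelerator_type": "nvidia-l4",
    "accelerator_count": 1,
}
deployment_spec = {
    "machine_type": "g2-standard-12",
    "accelerator_type": "NVIDIA_L4",
    "accelerator_count": 1,
}
print(deployment_matches_reservation(reservation_spec, deployment_spec))
```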

In [None]:
# @title Test `ANY_RESERVATION` Endpoint with Raw Predict

# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts. Sampling parameters supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/dev/sampling_params.html).

prompt = "What is a car?"  # @param {type: "string"}
max_tokens = 50  # @param {type:"integer"}
temperature = 1.0  # @param {type:"number"}
top_p = 1.0  # @param {type:"number"}
top_k = 1  # @param {type:"integer"}
raw_response = True  # @param {type:"boolean"}

# Build the request instance from the sampling parameters set above.
instances = [
    {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    },
]
response = endpoints["vllm_gpu_any_reserve"].predict(instances=instances)

for prediction in response.predictions:
    print(prediction)

# @markdown Click "Show Code" to see more details.

### Delete the models, endpoints and reservations


In [None]:
# @markdown  Delete the experiment models and endpoints to release their resources
# @markdown  and avoid ongoing charges.

# @markdown  If you no longer need a reservation, delete it to stop incurring charges for its reserved resources. A shared reservation can only be deleted from its owner project.

# Undeploy model and delete endpoint.
for endpoint in endpoints.values():
    endpoint.delete(force=True)

# Delete models.
for model in models.values():
    model.delete()

delete_bucket = False  # @param {type:"boolean"}
if delete_bucket:
    ! gsutil -m rm -r $BUCKET_NAME

for rev_name in rev_names:
    delete_reservation(project_id=PROJECT_ID, zone=RES_ZONE, name=rev_name)
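
In the cleanup cell above, an exception in any step (for example, an endpoint that fails to delete) stops the cell before the reservations are removed, so they would keep incurring charges. The sketch below shows one way to run every step and collect failures for review afterwards. `run_cleanup` is an illustrative helper, and the demo steps are stand-ins for the real `endpoint.delete`, `model.delete`, and `delete_reservation` calls.

```python
from typing import Callable, Dict, List, Tuple


def run_cleanup(steps: Dict[str, Callable[[], object]]) -> List[Tuple[str, Exception]]:
    """Run every cleanup step; return (label, exception) for those that failed."""
    failures = []
    for label, step in steps.items():
        try:
            step()
        except Exception as exc:  # Keep going so later resources still get deleted.
            failures.append((label, exc))
    return failures


# Stand-in steps for illustration only.
deleted = []


def _failing_step():
    raise RuntimeError("endpoint deletion failed")


failures = run_cleanup(
    {
        "endpoint": _failing_step,
        "reservation": lambda: deleted.append("res-1"),
    }
)
print([label for label, _ in failures], deleted)
```

In the notebook, each step could be a lambda wrapping one deletion call, such as `lambda: delete_reservation(project_id=PROJECT_ID, zone=RES_ZONE, name=rev_name)`.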