# **LLM Serving with Apigee**

<table align="left">
    <td style="text-align: center">
        <a href="https://colab.research.google.com/github/GoogleCloudPlatform/apigee-samples/blob/main/llm-circuit-breaking/llm_circuit_breaking.ipynb">
          <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
        </a>
      </td>
      <td style="text-align: center">
        <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapigee-samples%2Fmain%2Fllm-circuit-breaking%2Fllm_circuit_breaking.ipynb">
          <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
        </a>
      </td>    
      <td style="text-align: center">
        <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/apigee-samples/main/llm-circuit-breaking/llm_circuit_breaking.ipynb">
          <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
        </a>
      </td>
      <td style="text-align: center">
        <a href="https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-circuit-breaking/llm_circuit_breaking.ipynb">
          <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
        </a>
      </td>
</table>
<br />
<br />
<br />

# Circuit Breaking Sample

Circuit breaking with Apigee offers significant benefits for serving Large Language Models (LLMs) in Retrieval Augmented Generation (RAG) applications, particularly in preventing the dreaded `429` HTTP errors that arise from exceeding LLM endpoint quotas. By placing Apigee between the RAG application and LLM endpoints, users gain a robust mechanism for managing traffic distribution and graceful failure handling.

Imagine a scenario where multiple tenants, each with their own LLM endpoints and associated capacity limits, are accessed by a single RAG application. Without circuit breaking, a surge in traffic to a particular tenant's LLM endpoint could trigger a `429` error, disrupting the entire RAG application's functionality. Apigee acts as a traffic cop, monitoring the health of each tenant's endpoint and implementing a circuit-breaking strategy to prevent cascading failures.

To further enhance resilience, users can create priority pools, grouping together LLM endpoints with similar capabilities and quota limitations. This allows Apigee to distribute traffic evenly within a pool, effectively aggregating the individual endpoint quotas and ensuring that the combined capacity can handle the load.

![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-circuit-breaking/images/llm-circuit-breaking.png?raw=1)
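
To make the pooling idea concrete, here is a minimal sketch of how spreading requests evenly across a pool aggregates the per-endpoint quotas. It is an illustration only, not how Apigee implements load balancing. The endpoint names and the quota of 60 requests per minute are hypothetical values chosen for the example.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def pool_capacity(endpoint_quotas: Dict[str, int]) -> int:
    """Combined requests-per-minute capacity of every endpoint in one pool."""
    return sum(endpoint_quotas.values())


def round_robin(endpoints: List[str]) -> Iterator[str]:
    """Yield endpoints in a repeating order so each gets an equal share."""
    return cycle(endpoints)


# Hypothetical pool: three endpoints, each limited to 60 requests per minute.
primary_pool = {
    "tenant-a-endpoint": 60,
    "tenant-b-endpoint": 60,
    "tenant-c-endpoint": 60,
}

picker = round_robin(list(primary_pool))
assignments = [next(picker) for _ in range(6)]

print(pool_capacity(primary_pool))
print(assignments)
```

With even distribution, each endpoint sees roughly one third of the traffic, so the pool as a whole can absorb about three times the load of a single endpoint before any one quota is exhausted.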


# Circuit Breaking Benefits

1. **Improved fault tolerance**: The multi-pool architecture, coupled with circuit breaking, provides inherent fault tolerance, ensuring that the RAG application remains operational even if one or more LLM endpoints fail or experience outages.
2. **Data-driven capacity planning**: Circuit breaking provides valuable insights into endpoint performance, allowing you to monitor and adjust capacity allocations based on actual traffic patterns and usage. This enables informed capacity planning and avoids unnecessary overprovisioning.
3. **Multitenancy**: Apigee provides a unified platform for managing and routing traffic to different LLM tenants, simplifying integration and reducing development effort.
4. **Centralized monitoring and analytics**: Apigee offers comprehensive monitoring and analytics capabilities, allowing for real-time insights into LLM endpoint performance, quota usage, and failover events. This enables proactive identification and resolution of issues, enhancing operational efficiency.


# How does it work?

1. Apigee receives a request and checks the status of the primary pool. If the pool is open (in rotation), Apigee routes the request to the primary pool. If it is closed (out of rotation), Apigee routes the request to the secondary pool.
2. If the request to the primary pool fails (`429` or any other status code greater than `399`), Apigee fails over to the secondary pool and increments the error count in the circuit breaker.
3. Once a maximum of 2 errors has been detected, the primary pool is taken out of rotation and all traffic is sent to the secondary pool.
4. The primary pool is returned to rotation after a cooldown period of 2 minutes.
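
The steps above can be sketched as a small state machine. This is a simplified illustration in Python, not the Apigee proxy implementation. The class and function names are hypothetical, and the sketch assumes the error count is only reset when the cooldown elapses, which the steps above do not specify.

```python
import time
from typing import Callable, Optional, Tuple

MAX_ERRORS = 2          # from step 3: trip after 2 errors
COOLDOWN_SECONDS = 120  # from step 4: 2-minute cooldown


class PoolCircuitBreaker:
    """Tracks primary-pool errors and decides whether the pool is in rotation."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._error_count = 0
        self._tripped_at: Optional[float] = None

    def primary_in_rotation(self) -> bool:
        if self._tripped_at is None:
            return True
        if self._clock() - self._tripped_at >= COOLDOWN_SECONDS:
            # Cooldown elapsed: reset and return the primary pool to rotation.
            self._tripped_at = None
            self._error_count = 0
            return True
        return False

    def record_primary_status(self, status_code: int) -> None:
        # Any status above 399 (including 429) counts as a failure.
        if status_code > 399:
            self._error_count += 1
            if self._error_count >= MAX_ERRORS and self._tripped_at is None:
                self._tripped_at = self._clock()


def route(
    breaker: PoolCircuitBreaker,
    call_primary: Callable[[], int],
    call_secondary: Callable[[], int],
) -> Tuple[str, int]:
    """Serve one request and return (pool_name, status_code)."""
    if breaker.primary_in_rotation():
        status = call_primary()
        breaker.record_primary_status(status)
        if status <= 399:
            return "primary", status
    # Either the primary pool is out of rotation or the call failed: fail over.
    return "secondary", call_secondary()
```

In this sketch `call_primary` and `call_secondary` stand in for the actual calls to each pool and return an HTTP status code. Passing a custom `clock` makes the cooldown easy to exercise without waiting two minutes.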

# Setup

Use the following Google Cloud Shell tutorial and follow its instructions to deploy the sample.

[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.png)](https://ssh.cloud.google.com/cloudshell/open?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/apigee-samples&cloudshell_git_branch=main&cloudshell_workspace=.&cloudshell_tutorial=llm-circuit-breaking/docs/cloudshell-tutorial.md)

# Test Sample

## Install dependencies


In [None]:
!pip install -Uq google-cloud-tasks

## Authenticate your notebook environment (Colab only)
If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using Vertex AI Workbench or Colab Enterprise.

In [1]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth
    auth.authenticate_user()

## Initialize notebook variables

* **PROJECT_ID**: The default GCP project to use when making Vertex API calls.
* **LOCATION**: The default location to use when making Vertex API calls.
* **API_ENDPOINT**:  Desired API endpoint, e.g. https://apigee.iloveapimanagement.com/circuit-breaking
* **TASK_QUEUE**: After deploying the sample you'll get a task queue ID. By default this value should be `ai-queue`.

In [None]:
from google.auth import default
from google.auth.transport.requests import Request
# Define sample information
PROJECT_ID = ""  # @param {type:"string"}
LOCATION = ""  # @param {type:"string"}
API_ENDPOINT = "https://REPLACE_WITH_APIGEE_HOST/v1/samples/llm-circuit-breaking"  # @param {type:"string"}
TASK_QUEUE = "ai-queue" # @param {type:"string"}
SCOPES = ['https://www.googleapis.com/auth/cloud-platform']
MODEL = "gemini-1.5-pro"
GEMINI_SUFFIX = "/v1/projects/{project}/locations/{location}/publishers/google/models/{model}:streamGenerateContent".format(project=PROJECT_ID, location=LOCATION, model=MODEL)
LLM_REQUEST_URL=API_ENDPOINT + GEMINI_SUFFIX

credentials, project_id = default(scopes=SCOPES, quota_project_id=PROJECT_ID)
credentials.refresh(Request())
access_token = credentials.token
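
Because `PROJECT_ID` and `LOCATION` default to empty strings and `API_ENDPOINT` contains a placeholder host, it is easy to enqueue tasks against a malformed URL. The following is a minimal sketch of a helper that builds the same URL as the cell above but fails early when a value is missing. The function name `build_llm_request_url` is hypothetical and is not part of the sample.

```python
def build_llm_request_url(api_endpoint: str, project: str, location: str, model: str) -> str:
    """Build the streamGenerateContent URL, rejecting empty or placeholder values."""
    values = {
        "api_endpoint": api_endpoint,
        "project": project,
        "location": location,
        "model": model,
    }
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError(f"Missing required values: {', '.join(missing)}")
    if "REPLACE_WITH_APIGEE_HOST" in api_endpoint:
        raise ValueError("Replace the placeholder host in API_ENDPOINT first")

    suffix = (
        f"/v1/projects/{project}/locations/{location}"
        f"/publishers/google/models/{model}:streamGenerateContent"
    )
    # Avoid a double slash if the endpoint was entered with a trailing "/".
    return api_endpoint.rstrip("/") + suffix
```

For example, `build_llm_request_url(API_ENDPOINT, PROJECT_ID, LOCATION, MODEL)` could be used in place of the string concatenation once the variables are filled in.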


## Test Circuit Breaking

The following cell runs a test scenario that exceeds the Gemini quota ([model limits](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#quotas_by_region_and_model)) for a **primary** GCP project. As soon as the project quota is reached, a secondary target serves traffic without returning `429` errors to the consumer.

In [3]:
from google.cloud import tasks_v2
from google.protobuf import duration_pb2
from typing import Dict, Optional
import json

prompts = ["Why is the sky blue?",
           "What makes the sky blue?",
           "Why does the sky is blue colored?",
           "Can you explain why the sky is blue?",
           "The sky is blue, why is that?",
           "Why is the sky blue?"]

def create_http_task(
    project: str,
    location: str,
    queue: str,
    url: str,
    json_payload: Dict,
    scheduled_seconds_from_now: Optional[int] = None,
    task_id: Optional[str] = None,
    deadline_in_seconds: Optional[int] = None,
) -> tasks_v2.Task:
    client = tasks_v2.CloudTasksClient()
    task = tasks_v2.Task(
        http_request=tasks_v2.HttpRequest(
            http_method=tasks_v2.HttpMethod.POST,
            url=url,
            headers={"Content-type": "application/json",
                     "Authorization": f"Bearer {access_token}"},
            body=json.dumps(json_payload).encode(),
        ),
        name=(
            client.task_path(project, location, queue, task_id)
            if task_id is not None
            else None
        ),
    )
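    # Give each dispatched request up to 120 seconds by default,
    # or the caller-supplied deadline when one is provided.
    duration = duration_pb2.Duration()
    duration.FromSeconds(deadline_in_seconds if deadline_in_seconds is not None else 120)
    task.dispatch_deadline = duration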
    return client.create_task(
        tasks_v2.CreateTaskRequest(
            parent=client.queue_path(project, location, queue),
            task=task,
        )
    )

def invoke_model(prompt):
  request = {"contents":[{"role":"user","parts":[{"text":prompt}]}],"generationConfig":{}}
  create_http_task(PROJECT_ID, LOCATION, TASK_QUEUE, LLM_REQUEST_URL, request)

# Enqueue 15 rounds of the 6 prompts (90 tasks in total)
for _ in range(15):
  for prompt in prompts:
    invoke_model(prompt)



### Analyze target pool Gemini quotas

This sample also creates an LLM Target analytics report that allows you to:

* Understand usage patterns: See how often the Gemini quota is being reached.
* Optimize token management: Make informed decisions about quota usage and adjust pre-allocated quota.
* Plan for scalability: Forecast future demand and ensure resource availability.

To use this dashboard, from the Apigee console navigate to `Custom Reports` > `LLM Target Report`. You'll be able to drill down into token metrics that represent LLM traffic. 

**NOTE**: It might take a few minutes for the report to show data.

See sample below:

![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-circuit-breaking/images/circuit-breaking-report.png?raw=1)
