# **LLM Serving with Apigee**

<table align="left">
    <td style="text-align: center">
        <a href="https://colab.research.google.com/github/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits/llm_token_limits.ipynb">
          <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
        </a>
      </td>
      <td style="text-align: center">
        <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapigee-samples%2Fmain%2Fllm-token-limits%2Fllm_token_limits.ipynb">
          <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
        </a>
      </td>    
      <td style="text-align: center">
        <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/apigee-samples/main/llm-token-limits/llm_token_limits.ipynb">
          <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
        </a>
      </td>
      <td style="text-align: center">
        <a href="https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits/llm_token_limits.ipynb">
          <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
        </a>
      </td>
</table>
<br />
<br />
<br />

# Token Limits Sample

Every interaction with an LLM consumes tokens. LLM token management therefore plays a crucial role in maintaining platform-level control and visibility over token consumption across LLM providers and consumers.

Apigee's API Products, when applied to token consumption, allow you to manage token usage by setting limits on the number of tokens consumed per LLM consumer. This approach leverages the token usage metrics provided by an LLM, enabling real-time monitoring and enforcement of limits.

![architecture](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits/images/ai-product.png?raw=1)


# Benefits of Token Limits with AI Products

Creating Product tiers within Apigee allows for differentiated token quotas at each consumer tier. This enables you to:

* **Control resource allocation**: Prioritize resources for high-priority consumers by allocating higher token quotas to their tiers. This will also help to manage platform-wide token budgets across multiple LLM providers.
* **Tiered AI products**: By utilizing product tiers with granular token quotas, Apigee manages LLM consumption and empowers AI platform teams to control costs and provide a multi-tenant platform experience.

# How does it work?

1. A prompt request is received by an Apigee proxy.
2. Apigee identifies the consumer Application and verifies that the AI Product token quota has not been exceeded.
3. Apigee extracts token counts from the LLM response and adds them to the quota counter.
4. Apigee captures token counts as metrics for Analytics.
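
The steps above can be sketched in plain Python to make the counting logic concrete. This is a minimal illustration, not the proxy's actual implementation: the `TokenQuota` class and `handle_prompt` function are hypothetical names, the fixed-window reset is an assumption about how the 5-minute interval behaves, and reading `totalTokenCount` from the Gemini-style `usageMetadata` block is an assumption about which count is added to the quota. The tier limits are the Bronze and Silver values used later in this sample.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Tier limits from this sample: tokens allowed per 5-minute window.
TIER_LIMITS = {"bronze": 2000, "silver": 5000}
WINDOW_SECONDS = 5 * 60


@dataclass
class TokenQuota:
    """Hypothetical fixed-window token counter for one consumer app."""
    limit: int
    window_start: float = field(default_factory=time.monotonic)
    used: int = 0

    def check(self, now: Optional[float] = None) -> bool:
        """Step 2: allow the request only if the quota is not yet exhausted."""
        now = time.monotonic() if now is None else now
        # Assumption: the counter resets once the current window has elapsed.
        if now - self.window_start >= WINDOW_SECONDS:
            self.window_start = now
            self.used = 0
        return self.used < self.limit

    def record(self, llm_response: dict) -> int:
        """Step 3: add the token count reported by the LLM to the counter."""
        usage = llm_response.get("usageMetadata", {})
        tokens = usage.get("totalTokenCount", 0)
        self.used += tokens
        return tokens


def handle_prompt(quota: TokenQuota, llm_response: dict,
                  now: Optional[float] = None) -> int:
    """Return an HTTP-like status: 200 if served, 429 if the quota is spent."""
    if not quota.check(now):
        return 429
    quota.record(llm_response)
    return 200
```

In this sketch the token count is only known after the LLM responds, so the request that crosses the limit is still served and later requests in the same window are rejected with `429`.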

# Setup

Use the following GCP CloudShell tutorial. Follow the instructions to deploy the sample.

[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.png)](https://ssh.cloud.google.com/cloudshell/open?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/apigee-samples&cloudshell_git_branch=main&cloudshell_workspace=.&cloudshell_tutorial=llm-token-limits/docs/cloudshell-tutorial.md)

# Test Sample

## Install dependencies


In [None]:
!pip install -Uq langchain==0.3.18
!pip install -Uq langchain-google-vertexai==2.0.12
!pip install -Uq google-cloud-aiplatform

## Authenticate your notebook environment (Colab only)
If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using Vertex AI Workbench or Colab Enterprise.

In [1]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

## Initialize notebook variables

* **PROJECT_ID**: The default GCP project to use when making Vertex API calls.
* **LOCATION**: The default location (region) to use when making API calls.
* **API_ENDPOINT**: Desired API endpoint, e.g. https://apigee.iloveapimanagement.com/generate
* **API_KEY**: After deploying the sample you'll get 2 API keys: **Bronze Key** and **Silver Key**. First, set the value of your **Bronze Key**.

In [None]:
from langchain_google_vertexai import VertexAI
# Define project information
PROJECT_ID = ""  # @param {type:"string"}
LOCATION = ""  # @param {type:"string"}
API_ENDPOINT = "https://REPLACE_WITH_APIGEE_HOST/v1/samples/llm-token-limits"  # @param {type:"string"}
API_KEY = ""  # @param {type:"string"}
MODEL = "gemini-2.0-flash"
# Initialize LangChain
model = VertexAI(
      project=PROJECT_ID,
      location=LOCATION,
      api_endpoint=API_ENDPOINT,
      api_transport="rest",
      streaming=True,
      model_name=MODEL,
      additional_headers={"x-apikey": API_KEY})

## Test tiered AI products

Apigee allows you to create a tiered product strategy with different API access levels (e.g., Bronze, Silver, Gold) to cater to diverse user needs and limits. During the [Setup](#setup) stage you deployed 2 AI Product tiers for testing purposes.

* **Bronze AI Product**

This product enforces a 2000 token limit every 5 minutes. To test this limit, follow the steps below.

  1. Set the `API_KEY` value using your **Bronze Key** in the previous [step](#initialize-notebook-variables).
  2. Start a debug session on the **llm-token-limits-v1** proxy that was deployed during the [Setup](#setup) stage.
  3. Run the 2000 tokens every 5 minutes [scenario](#2000-tokens-every-5-minutes).
  4. Observe `HTTP 200` success codes in the debug session and explore the `Q-TokenQuota` policy flow variables `allowed.count`, `used.count` and `available.count`.
  5. Run the 5000 tokens every 5 minutes [scenario](#5000-tokens-every-5-minutes).
  6. Observe `HTTP 429` error codes in the debug session and explore the `Q-TokenQuota` policy flow variables `allowed.count`, `used.count`, `available.count` and `exceed.count`.

* **Silver AI Product**

This product enforces a 5000 token limit every 5 minutes. To test this limit, follow the steps below.

  1. Set the `API_KEY` value using your **Silver Key** in the previous [step](#initialize-notebook-variables).
  2. Start a debug session on the **llm-token-limits-v1** proxy that was deployed during the [Setup](#setup) stage.
  3. Run the 5000 tokens every 5 minutes [scenario](#5000-tokens-every-5-minutes).
  4. Observe `HTTP 200` success codes in the debug session and explore the `Q-TokenQuota` policy flow variables `allowed.count`, `used.count` and `available.count`.

## Tokens Consumption Analytics

This sample also creates a Tokens Consumption analytics dashboard that allows you to:

* Understand usage patterns: See how often tokens are being used and by which Developer App.
* Optimize token management: Make informed decisions about token usage and adjust your tiered limits.
* Plan for scalability: Forecast future demand and ensure resource availability.

To use this dashboard, from the Apigee console navigate to `Custom Reports` > `Tokens Consumption Report`. You'll be able to drill down into token metrics that represent consumption by Developer Apps and Products. See sample below:

![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits/images/token-counts.png?raw=1)
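
To illustrate the kind of drill-down the report offers, the sketch below sums token counts by a chosen dimension. The records, field names (`developer_app`, `api_product`, `total_tokens`) and the `tokens_by` helper are all hypothetical and exist only for illustration; real values come from Apigee Analytics, not from a list like this one.

```python
from collections import defaultdict

# Hypothetical records for illustration only; not output from this sample.
records = [
    {"developer_app": "app-a", "api_product": "bronze", "total_tokens": 800},
    {"developer_app": "app-a", "api_product": "bronze", "total_tokens": 700},
    {"developer_app": "app-b", "api_product": "silver", "total_tokens": 1500},
]


def tokens_by(rows, dimension):
    """Sum total_tokens for each distinct value of the given dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["total_tokens"]
    return dict(totals)


per_app = tokens_by(records, "developer_app")
per_product = tokens_by(records, "api_product")
```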

# 2000 tokens every 5 minutes

This scenario demonstrates a basic interaction with a language model. The code repeatedly asks a language model the same question, "Why is the sky blue?" but phrased in different ways. It's a simple example of how to interact with a language model. After running the scenario **only once** expect the following behavior:

* If using the **Bronze Key**, the final token count (sum of tokens from prompts and response candidates) shouldn't exceed the Bronze AI Product tokens limit of 2000 tokens every 5 minutes.
* If using the **Silver Key**, the final token count (sum of tokens from prompts and response candidates) shouldn't exceed the Silver AI Product tokens limit of 5000 tokens every 5 minutes.


In [6]:
prompts = ["Why is the sky blue?",
           "What makes the sky blue?",
           "Why is the sky colored blue?",
           "Can you explain why the sky is blue?",
           "The sky is blue, why is that?"]

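def invoke_model(prompt, model_to_invoke=model):
  # Send a single prompt through the Apigee proxy and return the model's response.
  return model_to_invoke.invoke(prompt)

# Each call consumes tokens that count against the AI Product quota.
for prompt in prompts:
  invoke_model(prompt)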

# 5000 tokens every 5 minutes

This scenario demonstrates a basic interaction with a language model. The code repeatedly asks a language model the same question, "Why is the sky blue?" but phrased in different ways to make sure the candidate responses are very extensive (high token count). After running the scenario **only once** expect the following behavior:

* If using the **Bronze Key**, the final token count (sum of tokens from prompts and response candidates) **should exceed** the Bronze AI Product tokens limit of 2000 tokens every 5 minutes. Expect `HTTP 429` error messages in the notebook, which are also visible in Apigee's debug session.
* If using the **Silver Key**, the final token count (sum of tokens from prompts and response candidates) shouldn't exceed the Silver AI Product tokens limit of 5000 tokens every 5 minutes.

In [7]:
prompts = ["Why is the sky blue? Provide a very long and detailed explanation.",
           "Furnish an exhaustive and long explanation (as long as a science magazine article) for the phenomenon of the blue sky.",
           "Can you give me a really in-depth and as long as a book chapter of why the sky is blue?",
           "Give me a super detailed and very extensive explanation (as long as the yellow pages) of why the sky is blue.",
           "Can you tell me all about why the sky is blue, and make sure it's longer than a novel?"]

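def invoke_model(prompt, model_to_invoke=model):
  # Send a single prompt through the Apigee proxy and return the model's response.
  return model_to_invoke.invoke(prompt)

# With the Bronze Key, later calls are expected to be rejected with HTTP 429.
for prompt in prompts:
  invoke_model(prompt)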
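
If a rejected call raises an exception, a plain `for` loop stops at that prompt and the remaining prompts are never sent. The sketch below is one way to keep going and summarize what happened. The `run_prompts` helper is a hypothetical name, and it assumes that a quota rejection surfaces as an exception whose message contains `429`; the concrete exception type depends on the client library, so adjust the check to what you actually observe.

```python
from typing import Callable, Dict, Iterable, List


def run_prompts(invoke: Callable[[str], object],
                prompts: Iterable[str]) -> Dict[str, List[str]]:
    """Invoke every prompt, recording failures instead of stopping at the first."""
    results: Dict[str, List[str]] = {"ok": [], "rejected": [], "failed": []}
    for prompt in prompts:
        try:
            invoke(prompt)
            results["ok"].append(prompt)
        except Exception as exc:  # exact exception type depends on the client
            # Assumption: a quota rejection mentions status 429 in its message.
            bucket = "rejected" if "429" in str(exc) else "failed"
            results[bucket].append(prompt)
    return results
```

In this notebook it could be called as `run_prompts(model.invoke, prompts)`; with the Bronze Key you would expect the later prompts to land in the `rejected` list once the quota is spent.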