In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get started with your deployed model on GKE

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fgke_model_ui_deployment_notebook.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/gke_model_ui_deployment_notebook.ipynb">
      <img alt="GitHub logo" src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>

# Overview

This notebook guides you through the first step of testing your recently
deployed model with text prompts. Depending on your deployment's inference
setup, the notebook sends prompts to either
[Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference/en/index) or
[vLLM](https://developers.googleblog.com/en/inference-with-gemma-using-dataflow-and-vllm/#:~:text=model%20frameworks%20simple.-,What%20is%20vLLM%3F,-vLLM%20is%20an),
two efficient serving frameworks for running models on GPUs.
Ready to see your deployed model respond? Run the cells below and start
experimenting with different prompts!

### Prerequisites

Before proceeding with this notebook, ensure you have already deployed a model
using the Google Cloud Console. You can find an overview of AI and Machine
Learning services on
[GKE AI/ML](https://console.cloud.google.com/kubernetes/aiml/overview).

### Objective

Enable prompt-based testing of an AI model deployed on GKE.

### GPUs

GPUs let you accelerate specific workloads running on your nodes, such as
machine learning and data processing. GKE provides a range of machine type
options for node configuration, including machine types with NVIDIA H100, L4,
and A100 GPUs.

### Understanding the Inference Frameworks

Your model is running on one of two popular and efficient serving frameworks:
vLLM or Text Generation Inference (TGI). The following sections provide a brief
overview of each to give you context on the underlying technology powering your
model.

#### TGI

TGI is a highly optimized open-source LLM serving framework that can increase
serving throughput on GPUs. TGI includes features such as:

*   Optimized transformer implementation with PagedAttention
*   Continuous batching to improve the overall serving throughput
*   Tensor parallelism and distributed serving on multiple GPUs

To learn more, refer to the
[TGI documentation](https://github.com/huggingface/text-generation-inference/blob/main/README.md).
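
To make the continuous batching idea concrete, the toy simulation below counts decoding steps for a queue of requests under two scheduling policies. It is a simplified sketch, not how TGI or vLLM schedule work internally: it assumes every request needs a known, positive number of decoding steps, ignores prefill and memory limits, and uses hypothetical function names.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i : i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request."""
    queue = deque(lengths)
    active = []  # Remaining decoding steps for each in-flight request.
    steps = 0
    while queue or active:
        # Fill any free slots before the next decoding step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that are done.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Four requests that need 2, 8, 3, and 5 decoding steps, two slots on the GPU.
requests_to_serve = [2, 8, 3, 5]
print("static:", static_batching_steps(requests_to_serve, batch_size=2))
print("continuous:", continuous_batching_steps(requests_to_serve, batch_size=2))
```

In this toy setting the short requests no longer wait for the longest request in their batch, which is where the throughput gain comes from.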

#### vLLM

vLLM is a fast, easy-to-use library for LLM inference and serving, known for
its high throughput and efficient memory use.
Key features include:

*   PagedAttention: Efficient memory management for handling long sequences and
    dynamic workloads.
*   Continuous batching: Maximizes GPU utilization by batching incoming
    requests.
*   High-throughput serving: Designed for production-level serving with low
    latency.
*   Optimized CUDA kernels.

To learn more, refer to the
[vLLM documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/vllm/use-vllm).
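
From the notebook's point of view, the main practical difference between the two servers is the shape of the request and response bodies. The sketch below mirrors the keys that the chat cell later in this notebook sends and reads (`prompt` and `predictions` for vLLM, `inputs` and `generated_text` for TGI). The helper names are hypothetical, and other server configurations may expect different fields.

```python
import json


def build_request(prompt, is_vllm, max_tokens=250, temperature=0.5):
    """Builds the JSON body; the prompt key depends on the serving framework."""
    prompt_key = "prompt" if is_vllm else "inputs"
    return {
        prompt_key: prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_text(raw_body, is_vllm):
    """Pulls the generated text out of a raw JSON response body."""
    data = json.loads(raw_body)
    if is_vllm:
        predictions = data.get("predictions")
        return predictions[0] if predictions else None
    return data.get("generated_text")


# Hand-written sample bodies, used only to exercise the helpers.
print(build_request("What is AI?", is_vllm=True))
print(extract_text('{"predictions": ["AI is ..."]}', is_vllm=True))
print(extract_text('{"generated_text": "AI is ..."}', is_vllm=False))
```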

In [None]:
# @title # Connect to Google Cloud Project
# @markdown #### Run this cell to configure your Google Cloud environment for Kubernetes (GKE) operations.
# @markdown
# @markdown #### Actions:
# @markdown 1.  **Connects to Project:** Retrieves and sets your Google Cloud project ID.
# @markdown 2.  **Enables the GKE API:** Enables the `container.googleapis.com` service.
# @markdown 3.  **Installs `kubectl`:** Installs the Kubernetes command-line tool.

import os

# Get the default cloud project id.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

# Set up gcloud.
! gcloud config set project "$PROJECT_ID"
! gcloud services enable container.googleapis.com

# Add kubectl to the set of available tools.
! mkdir -p /tools/google-cloud-sdk/.install
! gcloud components install kubectl --quiet

In [None]:
# @title # Select Cluster and Deployment { vertical-output: true }
# @markdown **Instructions:**
# @markdown
# @markdown Run this cell using the ▶ button. Then, use the interactive widgets that appear below:
# @markdown 1. **Select Cluster:** From the first dropdown, choose the GKE cluster where your model deployment is running. Note: the list contains only Autopilot clusters.
# @markdown 2. **Select Namespace:** After selecting a cluster, choose the Kubernetes *Namespace* where your deployment resides within that cluster.
# @markdown 3. **Select Deployment:** After selecting a namespace, this dropdown populates with the names of the deployments found in it.

import json
import subprocess

import ipywidgets as widgets
from IPython.display import Markdown, clear_output, display

# --- Globals and Configuration ---
DEFAULT_NAMESPACE = "default"
SELECTED_DEPLOYMENT = None
SELECTED_NAMESPACE = DEFAULT_NAMESPACE
deployment_dropdown = None
namespace_dropdown = None
cluster_dropdown = None
output_area = widgets.Output()


# --- Data Fetching Functions ---
def get_clusters(project_id):
    """Fetches autopilot GKE clusters for a given project."""
    # Broad exception handling keeps the widgets usable if gcloud fails.
    try:
        cmd = f"gcloud container clusters list --filter=autopilot.enabled=true --format=json --project={project_id}"
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, check=True, timeout=60
        )
        clusters_data = json.loads(result.stdout)
        # Create a map of cluster name to its region/location
        return {c["name"]: c["location"] for c in clusters_data}
    except Exception as e:
        # Print the error and return an empty dict so the caller can continue.
        print(f"Error getting clusters: {e}")
        return {}


# Fetch clusters immediately using PROJECT_ID assumed to be globally defined
# Note: This relies on PROJECT_ID being set *before* this cell runs.
try:
    CLUSTER_REGION_MAP = get_clusters(PROJECT_ID)
except NameError:
    print(
        "Error: PROJECT_ID variable is not defined. Please define it in a previous cell."
    )
    CLUSTER_REGION_MAP = {}  # Define as empty to prevent errors later


def get_deployments(cluster, region, namespace):
    """Fetches deployments from a specific namespace in a cluster."""
    # Note: Relies on the global PROJECT_ID set in the first cell.
    # Broad exception handling keeps the widgets usable if a command fails.
    target_namespace = namespace if namespace else DEFAULT_NAMESPACE
    try:
        # Ensure credentials for the target cluster
        cred_cmd = [
            "gcloud",
            "container",
            "clusters",
            "get-credentials",
            cluster,
            f"--location={region}",
            f"--project={PROJECT_ID}",
        ]
        subprocess.run(cred_cmd, capture_output=True, text=True, check=True, timeout=60)

        # Fetch deployments using kubectl
        kubectl_cmd = [
            "kubectl",
            "get",
            "deployments",
            f"--namespace={target_namespace}",
            "-o",
            "json",
        ]
        result = subprocess.run(
            kubectl_cmd, capture_output=True, text=True, check=True, timeout=60
        )
        deployments_data = json.loads(result.stdout)
        # Extract deployment names
        return [item["metadata"]["name"] for item in deployments_data.get("items", [])]
    except Exception as e:
        # Print the error and return an empty list so the caller can continue.
        print(f"Error fetching deployments from namespace '{target_namespace}': {e}")
        return []


def get_namespaces(cluster, region, project_id):
    """Fetches namespaces for a given cluster."""
    # Any failure is shown in output_area and signaled by returning None.
    try:
        # Ensure credentials for the target cluster
        cred_cmd = [
            "gcloud",
            "container",
            "clusters",
            "get-credentials",
            cluster,
            f"--location={region}",
            f"--project={project_id}",
        ]
        subprocess.run(cred_cmd, capture_output=True, text=True, check=True, timeout=60)

        # Fetch namespaces using kubectl
        kubectl_cmd = ["kubectl", "get", "namespaces", "-o", "json"]
        result = subprocess.run(
            kubectl_cmd, capture_output=True, text=True, check=True, timeout=60
        )
        namespaces_data = json.loads(result.stdout)
        # Extract namespace names
        all_ns = [item["metadata"]["name"] for item in namespaces_data.get("items", [])]
        return all_ns
    except Exception as e:
        # Display the error in output_area and return None.
        with output_area:
            # Clear previous output before showing error
            clear_output(wait=True)
            display(
                Markdown(
                    f"<font color='red'>Error processing namespaces for **{cluster}**: {e}</font>"
                )
            )
        return None


# --- Event Handlers ---
def on_deployment_select(change):
    """Handles changes in the deployment selection."""
    global SELECTED_DEPLOYMENT
    if change["type"] == "change" and change["name"] == "value":
        SELECTED_DEPLOYMENT = change["new"]
        with output_area:
            clear_output(wait=True)
            current_cluster = cluster_dropdown.value

            # Display context message
            if current_cluster != "Select Cluster":
                # Use SELECTED_NAMESPACE global which should be set by on_namespace_change
                # or default if namespace hasn't been selected yet.
                ns_context = SELECTED_NAMESPACE or DEFAULT_NAMESPACE
                ns_info = f"Cluster: **{current_cluster}**, Namespace: **{ns_context}**"
                display(Markdown(ns_info))

            # Display selection message if a valid deployment is chosen
            if (
                SELECTED_DEPLOYMENT
                and SELECTED_DEPLOYMENT != "Select Deployment"
                and SELECTED_DEPLOYMENT != "Loading..."
            ):
                mes = f"""Selected deployment: **{SELECTED_DEPLOYMENT}**"""
                display(Markdown(mes))


def update_deployment_dropdown(cluster_name, namespace_to_use):
    """Updates the deployment list based on cluster/namespace change."""
    global deployment_dropdown, SELECTED_DEPLOYMENT
    target_namespace = namespace_to_use if namespace_to_use else DEFAULT_NAMESPACE

    # Reset selection before fetching/updating
    SELECTED_DEPLOYMENT = None
    deployment_dropdown.disabled = True  # Disable while loading/updating
    deployment_dropdown.options = ["Loading..."]
    deployment_dropdown.value = "Loading..."

    # Clear output area and show loading context
    with output_area:
        clear_output(wait=True)
        display(Markdown(f"Cluster: **{cluster_name}**"))
        if namespace_to_use:
            display(Markdown(f"Namespace: **{namespace_to_use}**"))
        display(Markdown("Fetching deployments..."))

    # Fetch deployments (assuming CLUSTER_REGION_MAP and PROJECT_ID are available)
    region = CLUSTER_REGION_MAP.get(cluster_name)
    if not region:
        with output_area:
            clear_output(wait=True)
            display(
                Markdown(
                    f"<font color='red'>Error: Region not found for cluster {cluster_name}.</font>"
                )
            )
        deployment_dropdown.options = ["Error loading"]
        deployment_dropdown.value = "Error loading"
        return  # Stop if region is missing

    deployments = get_deployments(cluster_name, region, target_namespace)

    # Update dropdown options
    new_options = ["Select Deployment"] + deployments
    deployment_dropdown.options = new_options

    # Set final state based on results
    if deployments:
        deployment_dropdown.value = "Select Deployment"
        deployment_dropdown.disabled = False
        status_message = f"Found {len(deployments)} deployment(s) in namespace **{target_namespace}**."
    else:
        deployment_dropdown.value = "Select Deployment"  # Keep prompt
        deployment_dropdown.disabled = True  # No valid options to select
        # get_deployments returns [] both on error and when nothing is deployed,
        # and output_area already holds the loading context at this point, so
        # its contents cannot tell the two cases apart. Always report the result.
        status_message = (
            f"No deployments found in namespace **{target_namespace}**. "
            "If an error was printed, the list could not be fetched."
        )

    # Update output area with final status
    with output_area:
        clear_output(wait=True)
        display(Markdown(f"Cluster: **{cluster_name}**"))
        if namespace_to_use:
            display(Markdown(f"Namespace: **{namespace_to_use}**"))
        if status_message:
            display(Markdown(status_message))


def update_namespace_dropdown(cluster_name):
    """Updates the namespace list based on cluster change."""
    global namespace_dropdown, SELECTED_NAMESPACE
    global deployment_dropdown, SELECTED_DEPLOYMENT  # Need to reset deployment too

    # Reset namespace state and dependent deployment dropdown
    SELECTED_NAMESPACE = None  # Reset selection
    SELECTED_DEPLOYMENT = None
    namespace_dropdown.disabled = True
    namespace_dropdown.options = ["Loading..."]
    namespace_dropdown.value = "Loading..."
    deployment_dropdown.options = ["Select Deployment"]
    deployment_dropdown.value = "Select Deployment"
    deployment_dropdown.disabled = True

    # Clear output area and show loading context
    with output_area:
        clear_output(wait=True)
        display(Markdown(f"Cluster: **{cluster_name}**"))
        display(Markdown("Fetching namespaces..."))

    # Fetch namespaces (assuming CLUSTER_REGION_MAP and PROJECT_ID are available)
    region = CLUSTER_REGION_MAP.get(cluster_name)
    if not region:
        with output_area:
            clear_output(wait=True)
            display(
                Markdown(
                    f"<font color='red'>Error: Region not found for cluster {cluster_name}.</font>"
                )
            )
        namespace_dropdown.options = ["Error loading"]
        namespace_dropdown.value = "Error loading"
        return  # Stop if region is missing

    # Assuming PROJECT_ID is globally available
    namespaces = get_namespaces(cluster_name, region, PROJECT_ID)

    # Update dropdown options based on fetch result
    if namespaces is not None:  # Success (get_namespaces returns None on error)
        new_options = ["Select Namespace"] + namespaces  # Use "Select Namespace" prompt
        namespace_dropdown.options = new_options
        namespace_dropdown.value = "Select Namespace"
        namespace_dropdown.disabled = False
        status_message = (
            f"Found {len(namespaces)} namespace(s). Select one to list deployments."
        )
    else:  # Error occurred during fetch
        namespace_dropdown.options = ["Error loading"]  # Keep error state
        namespace_dropdown.value = "Error loading"
        namespace_dropdown.disabled = True
        status_message = None  # Error already displayed by get_namespaces

    # Update output area with final status
    with output_area:
        clear_output(wait=True)
        display(Markdown(f"Cluster: **{cluster_name}**"))
        if status_message:
            display(Markdown(status_message))


def on_cluster_change(change):
    """Handles cluster selection changes."""
    # Globals not strictly needed here as it calls update_namespace_dropdown which uses them
    if change["type"] == "change" and change["name"] == "value":
        cluster = change["new"]

        # Clear output area for new selection process
        with output_area:
            clear_output(wait=True)

        if cluster == "Select Cluster":
            # Reset namespace dropdown
            namespace_dropdown.options = ["Select Namespace"]  # Correct prompt
            namespace_dropdown.value = "Select Namespace"
            namespace_dropdown.disabled = True
            # Reset deployment dropdown
            deployment_dropdown.options = ["Select Deployment"]
            deployment_dropdown.value = "Select Deployment"
            deployment_dropdown.disabled = True
            # Clear globals
            global SELECTED_NAMESPACE, SELECTED_DEPLOYMENT
            SELECTED_NAMESPACE = None
            SELECTED_DEPLOYMENT = None
        else:
            # Trigger update for the namespace dropdown
            update_namespace_dropdown(cluster)


def on_namespace_change(change):
    """Handles namespace selection: fetches deployments."""
    global SELECTED_NAMESPACE, cluster_dropdown, deployment_dropdown  # Added deployment_dropdown
    if change["type"] == "change" and change["name"] == "value":
        new_namespace = change["new"]

        # Get current cluster value
        current_cluster = cluster_dropdown.value

        # Handle placeholder/loading/error values or if cluster isn't selected
        if (
            new_namespace in ["Select Namespace", "Loading...", "Error loading"]
            or current_cluster == "Select Cluster"
        ):
            SELECTED_NAMESPACE = None
            # Reset deployment dropdown state
            deployment_dropdown.options = ["Select Deployment"]
            deployment_dropdown.value = "Select Deployment"
            deployment_dropdown.disabled = True
            global SELECTED_DEPLOYMENT
            SELECTED_DEPLOYMENT = None
            # Clear output area for clean state
            with output_area:
                clear_output(wait=True)
                if current_cluster != "Select Cluster":  # Keep cluster context
                    display(Markdown(f"Cluster: **{current_cluster}**"))
                if new_namespace == "Select Namespace":
                    display(Markdown("Select a namespace to list deployments."))
            return  # Don't proceed to fetch deployments

        # Valid namespace selected
        SELECTED_NAMESPACE = new_namespace

        # Trigger update for the deployment dropdown
        if current_cluster != "Select Cluster":
            update_deployment_dropdown(current_cluster, SELECTED_NAMESPACE)


# --- Main Widget Setup ---
if CLUSTER_REGION_MAP:
    clusters_with_prompt = ["Select Cluster"] + sorted(CLUSTER_REGION_MAP)
    cluster_dropdown = widgets.Dropdown(
        options=clusters_with_prompt,
        value="Select Cluster",  # Set initial value
        description="Cluster:",
        style={"description_width": "initial"},
        layout=widgets.Layout(width="auto"),  # Auto width
    )

    namespace_dropdown = widgets.Dropdown(
        options=["Select Namespace"],  # Correct initial prompt
        value="Select Namespace",
        description="Namespace:",
        disabled=True,  # Initially disabled
        style={"description_width": "initial"},
        layout=widgets.Layout(width="auto"),
    )

    deployment_dropdown = widgets.Dropdown(
        options=["Select Deployment"],
        value="Select Deployment",
        description="Deployment:",
        disabled=True,  # Initially disabled
        style={"description_width": "initial"},
        layout=widgets.Layout(width="auto"),
    )

    # Observe changes
    cluster_dropdown.observe(on_cluster_change, names="value")
    namespace_dropdown.observe(on_namespace_change, names="value")
    deployment_dropdown.observe(on_deployment_select, names="value")

    # Display initial status and widgets
    print(
        f"Found {len(CLUSTER_REGION_MAP)} Autopilot Cluster(s) in Project '{PROJECT_ID}'.\n"
    )
    display(cluster_dropdown, namespace_dropdown, deployment_dropdown, output_area)

else:
    # Handle case where PROJECT_ID might be missing or no clusters found
    if "PROJECT_ID" not in globals() or not PROJECT_ID:
        error_message = "Error: PROJECT_ID variable is not defined or empty. Please define it in a previous cell."
    else:
        error_message = f"Error: No Autopilot clusters found or accessible in project '{PROJECT_ID}'. Check Project ID, permissions, and ensure Autopilot clusters exist."
    print(error_message)
    # Display error message using a widget for better integration in notebook
    display(widgets.HTML(f"<font color='red'>{error_message}</font>"))
    # Keep output_area widget displayed even on error for potential messages from retries etc.
    display(output_area)

In [None]:
# @title # Chat completion for text-only models { vertical-output: true}
# @markdown You may send prompts to the model server for prediction.
# @markdown
# @markdown * **user_prompt (string):** The text prompt you provide to the language model: the question or instruction (e.g., "Explain neural networks").
# @markdown * **temperature (number):** Controls the randomness of the model's output by influencing how the model selects the next token in the sequence it generates. Lower values make the output more deterministic. Typical values range from 0.2 to 1.0.
# @markdown * **max_tokens (number):** The maximum number of tokens (words or sub-word units) that the model is allowed to generate in its response.

import json
import subprocess

import ipywidgets as widgets
from IPython.display import Markdown, clear_output, display


def _run_kubectl(cmd):
    """Executes a kubectl command and returns its stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=60)
    return result.stdout.strip()


def get_deployment_pod_name(deployment, namespace):
    """Finds the running pod name for a given deployment and namespace."""
    cmd = [
        "kubectl",
        "get",
        "pods",
        "-n",
        namespace,
        "-o",
        "json",
        "-l",
        f"app={deployment}-app",
        "--field-selector=status.phase=Running",
    ]
    try:
        pods_json = _run_kubectl(cmd)
        pods = json.loads(pods_json)
        if pods.get("items"):
            return pods["items"][0]["metadata"]["name"]
        print(f"No running pods found for {deployment} in {namespace}.")
        return None
    except (
        subprocess.CalledProcessError,
        subprocess.TimeoutExpired,
        json.JSONDecodeError,
        IndexError,
        KeyError,
    ) as e:
        print(f"Error getting pod name for {deployment} in {namespace}: {e}")
        return None


def check_inference_label(pod_name, namespace):
    """Checks if the specified pod has the vLLM inference server label."""
    cmd = ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "json"]
    try:
        pod_json = _run_kubectl(cmd)
        labels = json.loads(pod_json).get("metadata", {}).get("labels", {})
        return labels.get("ai.gke.io/inference-server") == "vllm"
    except (
        subprocess.CalledProcessError,
        subprocess.TimeoutExpired,
        json.JSONDecodeError,
        KeyError,
    ) as e:
        print(f"Error checking labels for pod {pod_name} in {namespace}: {e}")
        return False


def process_response(request, pod_name, pod_endpoint, is_vllm_inference, namespace):
    """Sends a request to the pod and processes the response."""
    json_data_escaped = json.dumps(request).replace("'", "'\\''")
    curl_cmd = f"kubectl exec -n {namespace} -t {pod_name} -- curl -s -X POST http://{pod_endpoint}/generate -H \"Content-Type: application/json\" -d '{json_data_escaped}' 2> /dev/null"
    try:
        response_raw = _run_kubectl(["bash", "-c", curl_cmd])
        if not response_raw:
            return f"Error: Empty response from pod {pod_name}."
        first_line = response_raw.splitlines()[0]
        data = json.loads(first_line)

        if is_vllm_inference:
            predictions = data.get("predictions")
            if isinstance(predictions, (list, tuple)) and predictions:
                return predictions[0]
            return f"Error: Unexpected vLLM format. Raw: {first_line}"
        else:  # TGI format
            generated_text = data.get("generated_text")
            if generated_text is not None:
                return generated_text
            return f"Error: Unexpected TGI format. Raw: {first_line}"

    except json.JSONDecodeError as e:
        raw_response = (
            response_raw.splitlines()[0]
            if "response_raw" in locals() and response_raw
            else "N/A"
        )
        return f"Error decoding JSON: {e}. Raw: {raw_response}"
    except (subprocess.CalledProcessError, IndexError, KeyError, TypeError) as e:
        raw_response = (
            response_raw.splitlines()[0]
            if "response_raw" in locals() and response_raw
            else "N/A"
        )
        return f"Error processing response: {e}. Raw: {raw_response}"
    except Exception as e:
        return f"Unexpected error during response processing: {e}"


# --- Widgets Setup ---
user_prompt_widget = widgets.Textarea(
    value="What is AI?",
    description="User Prompt:",
    layout=widgets.Layout(width="95%", height="100px"),
)
temperature_widget = widgets.FloatSlider(
    value=0.50, min=0.0, max=1.0, step=0.01, description="Temperature:"
)
max_tokens_widget = widgets.IntSlider(
    value=250, min=1, max=2048, step=1, description="Max Tokens:"
)
submit_button = widgets.Button(description="Submit")
output_area_response = widgets.Output()


# --- Submit Button Logic ---
def on_submit_clicked(b):
    """Handles the submit button click event."""
    with output_area_response:
        clear_output()
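        # Guard against missing globals as well as unset or placeholder selections.
        if (
            not globals().get("SELECTED_DEPLOYMENT")
            or not globals().get("SELECTED_NAMESPACE")
            or SELECTED_DEPLOYMENT
            in ("Select Deployment", "Loading...", "Error loading")
        ):
            display(
                Markdown(
                    "**Error:** Select a cluster, namespace, and deployment in the previous cell first."
                )
            )
            return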

        print(
            f"Target: {SELECTED_DEPLOYMENT} in {SELECTED_NAMESPACE}. \n\nRequesting response..."
        )

        pod_name = get_deployment_pod_name(SELECTED_DEPLOYMENT, SELECTED_NAMESPACE)
        if not pod_name:
            display(
                Markdown(
                    f"**Error:** Could not find running pod for `{SELECTED_DEPLOYMENT}`."
                )
            )
            return

        is_vllm = check_inference_label(pod_name, SELECTED_NAMESPACE)
        request = {
            "max_tokens": max_tokens_widget.value,
            "temperature": temperature_widget.value,
            "prompt" if is_vllm else "inputs": user_prompt_widget.value,
        }
        service = f"{SELECTED_DEPLOYMENT}-service"
        endpoint_cmd = [
            "kubectl",
            "get",
            "endpoints",
            service,
            "-n",
            SELECTED_NAMESPACE,
        ]

        try:
            endpoint_output = _run_kubectl(endpoint_cmd).splitlines()
            if len(endpoint_output) < 2 or len(endpoint_output[1].split()) < 2:
                display(
                    Markdown(
                        f"**Error:** Endpoint data incomplete for service `{service}`."
                    )
                )
                print("kubectl output:\n", "\n".join(endpoint_output))
                return
            # Assumes columns NAME ENDPOINTS AGE. ENDPOINTS may list several
            # comma-separated ip:port pairs, so use the first one.
            endpoint = endpoint_output[1].split()[1].split(",")[0]
            response = process_response(
                request, pod_name, endpoint, is_vllm, SELECTED_NAMESPACE
            )
            display(Markdown(f"**Response:**\n\n{response}"))

        except subprocess.CalledProcessError as e:
            display(
                Markdown(
                    f"**Error getting endpoints for `{service}`:**\n```\n{e.stderr}\n```"
                )
            )
        except Exception as e:
            display(Markdown(f"**Unexpected Error:**\n```\n{e}\n```"))


# --- Display Widgets ---
submit_button.on_click(on_submit_clicked)
display(
    user_prompt_widget,
    temperature_widget,
    max_tokens_widget,
    submit_button,
    output_area_response,
)

# Next Steps: Integrating the GKE Service Endpoint

After successfully deploying a model on Google Kubernetes Engine (GKE) and
verifying it via a notebook, the next step is to integrate it into various
applications. This involves making HTTP requests to the service's endpoint from
your application code.

### Exposing the Service

To make your deployed model accessible to applications, you'll need to expose
its service endpoint. Google Kubernetes Engine offers several ways to do this:

1.  **Ingress:** Configure an Ingress resource to route external HTTP(S) traffic
    to your service. Set up Ingress for either an internal Load Balancer
    (accessible only within your VPC) or an external Load Balancer (accessible
    from the internet).
    [Learn more about GKE Ingress](https://cloud.google.com/kubernetes-engine/docs/concepts/ingress).
2.  **Gateway API:** A more modern and feature-rich API for managing traffic
    routing in Kubernetes. Similar to Ingress, Gateway API allows you to define
    how external and internal traffic should be directed to your services.
    [Explore GKE Gateway API](https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api).
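
Once the service is reachable through an Ingress or Gateway, application code can call it over HTTP. The following is a minimal sketch using only the Python standard library. The address is a placeholder from a documentation-only IP range, the `/generate` path and body keys mirror what this notebook sends from inside the cluster, and the function names are hypothetical. Adjust all of them to match your own setup.

```python
import json
import urllib.request

# Hypothetical address: replace with the IP or hostname your Ingress or Gateway exposes.
SERVICE_URL = "http://203.0.113.10/generate"


def build_http_request(url, prompt, prompt_key="prompt", max_tokens=250, temperature=0.5):
    """Builds a POST request. Use prompt_key="inputs" for a TGI deployment."""
    body = json.dumps(
        {prompt_key: prompt, "max_tokens": max_tokens, "temperature": temperature}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Sends the request and returns the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_http_request(SERVICE_URL, "What is AI?")
print(req.get_method(), req.full_url)
# result = send(req)  # Uncomment once the endpoint is reachable.
```

The same request can be made with the `requests` library listed under Additional Resources; the standard library is used here only to keep the sketch dependency-free.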

### Setting Up Autoscaling

Ensure your model serving can handle varying traffic by configuring the
Horizontal Pod Autoscaler (HPA). HPA automatically scales the number of Pods
based on resource utilization or custom metrics, optimizing performance and
cost.
[See how to configure HPA](https://cloud.google.com/kubernetes-engine/docs/how-to/horizontal-pod-autoscaling).
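
As a starting point, the sketch below generates an `autoscaling/v2` HorizontalPodAutoscaler manifest as JSON, which `kubectl apply -f` accepts alongside YAML. The deployment name, replica bounds, and the CPU utilization target are illustrative assumptions. For GPU-bound inference a custom metric is often a better signal than CPU, so treat the metric choice as a placeholder.

```python
import json


def build_hpa_manifest(
    deployment, namespace="default", min_replicas=1, max_replicas=3, cpu_target_percent=60
):
    """Returns an HPA manifest that scales a Deployment on average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_target_percent,
                        },
                    },
                }
            ],
        },
    }


# "my-model" is a hypothetical deployment name.
hpa_manifest = build_hpa_manifest("my-model")
print(json.dumps(hpa_manifest, indent=2))
```

Writing the printed JSON to a file such as `hpa.json` and running `kubectl apply -f hpa.json` would create the autoscaler.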

### Setting Up Monitoring

Monitor the health and performance of your deployed model using Google Cloud
Managed Service for Prometheus. Configure your model serving to expose
Prometheus metrics for comprehensive insights.
[Get started with Google Cloud Managed Prometheus](https://cloud.google.com/kubernetes-engine/docs/how-to/configure-automatic-application-monitoring).
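
If you configure scraping manually, Managed Service for Prometheus uses a `PodMonitoring` resource to select Pods and the port to scrape. The sketch below builds such a manifest as JSON. It assumes your Pods carry the `app=<deployment>-app` label that this notebook uses to find them and that your model server exposes Prometheus metrics on a port named `metrics`. Both are assumptions to verify against your deployment, and the function name is hypothetical.

```python
import json


def build_pod_monitoring(deployment, namespace="default", port="metrics", interval="30s"):
    """Returns a PodMonitoring manifest for the Pods of a deployment."""
    return {
        "apiVersion": "monitoring.googleapis.com/v1",
        "kind": "PodMonitoring",
        "metadata": {"name": f"{deployment}-monitoring", "namespace": namespace},
        "spec": {
            # Label convention assumed from this notebook's pod lookup.
            "selector": {"matchLabels": {"app": f"{deployment}-app"}},
            # Port name (or number) and scrape interval are assumptions.
            "endpoints": [{"port": port, "interval": interval}],
        },
    }


# "my-model" is a hypothetical deployment name.
pod_monitoring = build_pod_monitoring("my-model")
print(json.dumps(pod_monitoring, indent=2))
```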

### Additional Resources:

*   #### Kubernetes Documentation:

    *   Services:
        https://kubernetes.io/docs/concepts/services-networking/service/

*   #### Google Cloud Documentation:

    *   Google Kubernetes Engine (GKE):
        https://cloud.google.com/kubernetes-engine
    *   Cloud Load Balancing:
        https://cloud.google.com/load-balancing/docs/ingress
    *   Gateway API on GKE:
        https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api
    *   Learn about GPUs in GKE:
        https://cloud.google.com/kubernetes-engine/docs/concepts/gpus

*   #### Python requests Library:

    *   https://requests.readthedocs.io/en/latest/

*   #### LangChain with Google Integrations:

    *   LangChain documentation for Google integrations:
        https://python.langchain.com/docs/integrations/providers/google/