{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "Pr9TgOcV9vAXeqGiyTaTI5kS",
"metadata": {
"cellView": "form",
"id": "Pr9TgOcV9vAXeqGiyTaTI5kS"
},
"outputs": [],
"source": [
"# Copyright 2025 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"id": "M1CpgYundFwz",
"metadata": {
"id": "M1CpgYundFwz"
},
"source": [
"# Get started with your deployed model on GKE\n",
"\n",
"<table><tbody><tr>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fgke_model_ui_deployment_notebook.ipynb\">\n",
" <img alt=\"Google Cloud Colab Enterprise logo\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" width=\"32px\"><br> Run in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/gke_model_ui_deployment_notebook.ipynb\">\n",
" <img alt=\"GitHub logo\" src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" width=\"32px\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</tr></tbody></table>"
]
},
{
"cell_type": "markdown",
"id": "t2jj2XOgkS4F",
"metadata": {
"id": "t2jj2XOgkS4F"
},
"source": [
"# Overview\n",
"\n",
"This notebook will guide you through the initial step of testing your recently\n",
"deployed model with text prompts. Depending on your deployed model's inference\n",
"setup, the notebook utilizes either Text Generation Inference\n",
"[TGI](https://huggingface.co/docs/text-generation-inference/en/index) or\n",
"[vLLM](https://developers.googleblog.com/en/inference-with-gemma-using-dataflow-and-vllm/#:~:text=model%20frameworks%20simple.-,What%20is%20vLLM%3F,-vLLM%20is%20an),\n",
"two efficient serving frameworks that enhance the performance of your GPU model.\n",
"Ready to see your deployed model respond? Run the cells below and start\n",
"experimenting with different prompts!\n",
"\n",
"### Prerequisites\n",
"\n",
"Before proceeding with this notebook, ensure you have already deployed a model\n",
"using the Google Cloud Console. You can find an overview of AI and Machine\n",
"Learning services on\n",
"[GKE AI/ML](https://console.cloud.google.com/kubernetes/aiml/overview).\n",
"\n",
"### Objective\n",
"\n",
"Enable prompt-based testing of the AI model deployed on GKE\n",
"\n",
"### GPUs\n",
"\n",
"GPUs let you accelerate specific workloads running on your nodes, such as\n",
"machine learning and data processing. GKE provides a range of machine type\n",
"options for node configuration, including machine types with NVIDIA H100, L4,\n",
"and A100 GPUs.\n",
"\n",
"### Understanding the Inference Frameworks\n",
"\n",
"Your model is running on one of two popular and efficient serving frameworks:\n",
"vLLM or Text Generation Inference (TGI). The following sections provide a brief\n",
"overview of each to give you context on the underlying technology powering your\n",
"model.\n",
"\n",
"#### TGI\n",
"\n",
"TGI is a highly optimized open-source LLM serving framework that can increase\n",
"serving throughput on GPUs. TGI includes features such as:\n",
"\n",
"* Optimized transformer implementation with PagedAttention\n",
"* Continuous batching to improve the overall serving throughput\n",
"* Tensor parallelism and distributed serving on multiple GPUs\n",
"\n",
"To learn more, refer to the\n",
"[TGI documentation](https://github.com/huggingface/text-generation-inference/blob/main/README.md)\n",
"\n",
"#### vLLM\n",
"\n",
"vLLM is another fast and easy-to-use library for LLM inference and serving. It's\n",
"known for its high throughput and efficiency, and it leverages PagedAttention.\n",
"Key features include:\n",
"\n",
"* PagedAttention: Efficient memory management for handling long sequences and\n",
" dynamic workloads.\n",
"* Continuous batching: Maximizes GPU utilization by batching incoming\n",
" requests.\n",
"* High-throughput serving: Designed for production-level serving with low\n",
" latency.\n",
"* Optimized CUDA kernels.\n",
"\n",
"To learn more, refer to the\n",
"[vLLM documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/vllm/use-vllm)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "XMf-T58TkDy1",
"metadata": {
"cellView": "form",
"id": "XMf-T58TkDy1"
},
"outputs": [],
"source": [
"# @title # Connect to Google Cloud Project\n",
"# @markdown #### Run this cell to configure your Google Cloud environment for Kubernetes (GKE) operations.\n",
"# @markdown\n",
"# @markdown #### Actions:\n",
"# @markdown 1. **Connects to Project:** Retrieves and sets your Google Cloud project ID.\n",
"# @markdown 3. **Installs `kubectl`:** Installs the Kubernetes command-line tool.\n",
"\n",
"import os\n",
"\n",
"# Get the default cloud project id.\n",
"PROJECT_ID = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n",
"\n",
"# Set up gcloud.\n",
"! gcloud config set project \"$PROJECT_ID\"\n",
"! gcloud services enable container.googleapis.com\n",
"\n",
"# Add kubectl to the set of available tools.\n",
"! mkdir -p /tools/google-cloud-sdk/.install\n",
"! gcloud components install kubectl --quiet"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1oG8ymQenHyD",
"metadata": {
"cellView": "form",
"id": "1oG8ymQenHyD"
},
"outputs": [],
"source": [
"# @title # Select Cluster and Deployment { vertical-output: true }\n",
"# @markdown **Instructions:**\n",
"# @markdown\n",
"# @markdown Run this cell using the ▶ button. Then, use the interactive widgets that appear below:\n",
"# @markdown 1. **Select Cluster:** From the first dropdown, choose the GKE cluster where your model deployment is running. Note: the list only contains autopilot clusters.\n",
"# @markdown 2. **Select Namespace:** After selecting a cluster, choose the Kubernetes *Namespace* where your deployment resides within that cluster.\n",
"# @markdown 3. **Select Deployment:** After selecting a cluster, this dropdown will populate with the names of deployments found.\n",
"\n",
"import json\n",
"import subprocess\n",
"\n",
"import ipywidgets as widgets\n",
"from IPython.display import Markdown, clear_output, display\n",
"\n",
"# --- Globals and Configuration ---\n",
"DEFAULT_NAMESPACE = \"default\"\n",
"SELECTED_DEPLOYMENT = None\n",
"SELECTED_NAMESPACE = DEFAULT_NAMESPACE\n",
"deployment_dropdown = None\n",
"namespace_dropdown = None\n",
"cluster_dropdown = None\n",
"output_area = widgets.Output()\n",
"\n",
"\n",
"# --- Data Fetching Functions ---\n",
"def get_clusters(project_id):\n",
" \"\"\"Fetches autopilot GKE clusters for a given project.\"\"\"\n",
" # Note: Uses broad exception handling as per original code.\n",
" try:\n",
" cmd = f\"gcloud container clusters list --filter=autopilot.enabled=true --format=json --project={project_id}\"\n",
" result = subprocess.run(\n",
" cmd, shell=True, capture_output=True, text=True, check=True, timeout=60\n",
" )\n",
" clusters_data = json.loads(result.stdout)\n",
" # Create a map of cluster name to its region/location\n",
" return {c[\"name\"]: c[\"location\"] for c in clusters_data}\n",
" except Exception as e:\n",
" # Original code prints error and returns empty dict\n",
" print(f\"Error getting clusters: {e}\")\n",
" return {}\n",
"\n",
"\n",
"# Fetch clusters immediately using PROJECT_ID assumed to be globally defined\n",
"# Note: This relies on PROJECT_ID being set *before* this cell runs.\n",
"try:\n",
" CLUSTER_REGION_MAP = get_clusters(PROJECT_ID)\n",
"except NameError:\n",
" print(\n",
" \"Error: PROJECT_ID variable is not defined. Please define it in a previous cell.\"\n",
" )\n",
" CLUSTER_REGION_MAP = {} # Define as empty to prevent errors later\n",
"\n",
"\n",
"def get_deployments(cluster, region, namespace):\n",
" \"\"\"Fetches deployments from a specific namespace in a cluster.\"\"\"\n",
" # Note: Uses PROJECT_ID as a global variable as per original code.\n",
" # Note: Uses broad exception handling as per original code.\n",
" target_namespace = namespace if namespace else DEFAULT_NAMESPACE\n",
" try:\n",
" # Ensure credentials for the target cluster\n",
" cred_cmd = [\n",
" \"gcloud\",\n",
" \"container\",\n",
" \"clusters\",\n",
" \"get-credentials\",\n",
" cluster,\n",
" f\"--location={region}\",\n",
" f\"--project={PROJECT_ID}\",\n",
" ]\n",
" subprocess.run(cred_cmd, capture_output=True, text=True, check=True, timeout=60)\n",
"\n",
" # Fetch deployments using kubectl\n",
" kubectl_cmd = [\n",
" \"kubectl\",\n",
" \"get\",\n",
" \"deployments\",\n",
" f\"--namespace={target_namespace}\",\n",
" \"-o\",\n",
" \"json\",\n",
" ]\n",
" result = subprocess.run(\n",
" kubectl_cmd, capture_output=True, text=True, check=True, timeout=60\n",
" )\n",
" deployments_data = json.loads(result.stdout)\n",
" # Extract deployment names\n",
" return [item[\"metadata\"][\"name\"] for item in deployments_data.get(\"items\", [])]\n",
" except Exception as e:\n",
" # Original code prints error and returns empty list\n",
" print(f\"Error fetching deployments from namespace '{target_namespace}': {e}\")\n",
" return []\n",
"\n",
"\n",
"def get_namespaces(cluster, region, project_id):\n",
" \"\"\"Fetches namespaces for a given cluster.\"\"\"\n",
" # Note: Uses broad exception handling as per original code.\n",
" try:\n",
" # Ensure credentials for the target cluster\n",
" cred_cmd = [\n",
" \"gcloud\",\n",
" \"container\",\n",
" \"clusters\",\n",
" \"get-credentials\",\n",
" cluster,\n",
" f\"--location={region}\",\n",
" f\"--project={project_id}\",\n",
" ]\n",
" subprocess.run(cred_cmd, capture_output=True, text=True, check=True, timeout=60)\n",
"\n",
" # Fetch namespaces using kubectl\n",
" kubectl_cmd = [\"kubectl\", \"get\", \"namespaces\", \"-o\", \"json\"]\n",
" result = subprocess.run(\n",
" kubectl_cmd, capture_output=True, text=True, check=True, timeout=60\n",
" )\n",
" namespaces_data = json.loads(result.stdout)\n",
" # Extract namespace names\n",
" all_ns = [item[\"metadata\"][\"name\"] for item in namespaces_data.get(\"items\", [])]\n",
" return all_ns\n",
" except Exception as e:\n",
" # Original code displays error in output_area and returns None\n",
" with output_area:\n",
" # Clear previous output before showing error\n",
" clear_output(wait=True)\n",
" display(\n",
" Markdown(\n",
" f\"<font color='red'>Error processing namespaces for **{cluster}**: {e}</font>\"\n",
" )\n",
" )\n",
" return None\n",
"\n",
"\n",
"# --- Event Handlers ---\n",
"def on_deployment_select(change):\n",
" \"\"\"Handles changes in the deployment selection.\"\"\"\n",
" global SELECTED_DEPLOYMENT\n",
" if change[\"type\"] == \"change\" and change[\"name\"] == \"value\":\n",
" SELECTED_DEPLOYMENT = change[\"new\"]\n",
" with output_area:\n",
" clear_output(wait=True)\n",
" current_cluster = cluster_dropdown.value\n",
"\n",
" # Display context message\n",
" if current_cluster != \"Select Cluster\":\n",
" # Use SELECTED_NAMESPACE global which should be set by on_namespace_change\n",
" # or default if namespace hasn't been selected yet.\n",
" ns_context = SELECTED_NAMESPACE or DEFAULT_NAMESPACE\n",
" ns_info = f\"Cluster: **{current_cluster}**, Namespace: **{ns_context}**\"\n",
" display(Markdown(ns_info))\n",
"\n",
" # Display selection message if a valid deployment is chosen\n",
" if (\n",
" SELECTED_DEPLOYMENT\n",
" and SELECTED_DEPLOYMENT != \"Select Deployment\"\n",
" and SELECTED_DEPLOYMENT != \"Loading...\"\n",
" ):\n",
" mes = f\"\"\"Selected deployment: **{SELECTED_DEPLOYMENT}**\"\"\"\n",
" display(Markdown(mes))\n",
"\n",
"\n",
"def update_deployment_dropdown(cluster_name, namespace_to_use):\n",
" \"\"\"Updates the deployment list based on cluster/namespace change.\"\"\"\n",
" global deployment_dropdown, SELECTED_DEPLOYMENT\n",
" target_namespace = namespace_to_use if namespace_to_use else DEFAULT_NAMESPACE\n",
"\n",
" # Reset selection before fetching/updating\n",
" SELECTED_DEPLOYMENT = None\n",
" deployment_dropdown.disabled = True # Disable while loading/updating\n",
" deployment_dropdown.options = [\"Loading...\"]\n",
" deployment_dropdown.value = \"Loading...\"\n",
"\n",
" # Clear output area and show loading context\n",
" with output_area:\n",
" clear_output(wait=True)\n",
" display(Markdown(f\"Cluster: **{cluster_name}**\"))\n",
" if namespace_to_use:\n",
" display(Markdown(f\"Namespace: **{namespace_to_use}**\"))\n",
" display(Markdown(\"Fetching deployments...\"))\n",
"\n",
" # Fetch deployments (assuming CLUSTER_REGION_MAP and PROJECT_ID are available)\n",
" region = CLUSTER_REGION_MAP.get(cluster_name)\n",
" if not region:\n",
" with output_area:\n",
" clear_output(wait=True)\n",
" display(\n",
" Markdown(\n",
" f\"<font color='red'>Error: Region not found for cluster {cluster_name}.</font>\"\n",
" )\n",
" )\n",
" deployment_dropdown.options = [\"Error loading\"]\n",
" deployment_dropdown.value = \"Error loading\"\n",
" return # Stop if region is missing\n",
"\n",
" deployments = get_deployments(cluster_name, region, target_namespace)\n",
"\n",
" # Update dropdown options\n",
" new_options = [\"Select Deployment\"] + deployments\n",
" deployment_dropdown.options = new_options\n",
"\n",
" # Set final state based on results\n",
" if deployments:\n",
" deployment_dropdown.value = \"Select Deployment\"\n",
" deployment_dropdown.disabled = False\n",
" status_message = f\"Found {len(deployments)} deployment(s) in namespace **{target_namespace}**.\"\n",
" else:\n",
" deployment_dropdown.value = \"Select Deployment\" # Keep prompt\n",
" deployment_dropdown.disabled = True # No valid options to select\n",
" # Check if get_deployments printed an error or if it just returned empty\n",
" if not output_area.outputs: # If no error printed by get_deployments\n",
" status_message = (\n",
" f\"No deployments found in namespace **{target_namespace}**.\"\n",
" )\n",
" else:\n",
" status_message = None # Error likely already shown\n",
"\n",
" # Update output area with final status\n",
" with output_area:\n",
" clear_output(wait=True)\n",
" display(Markdown(f\"Cluster: **{cluster_name}**\"))\n",
" if namespace_to_use:\n",
" display(Markdown(f\"Namespace: **{namespace_to_use}**\"))\n",
" if status_message:\n",
" display(Markdown(status_message))\n",
"\n",
"\n",
"def update_namespace_dropdown(cluster_name):\n",
" \"\"\"Updates the namespace list based on cluster change.\"\"\"\n",
" global namespace_dropdown, SELECTED_NAMESPACE\n",
" global deployment_dropdown, SELECTED_DEPLOYMENT # Need to reset deployment too\n",
"\n",
" # Reset namespace state and dependent deployment dropdown\n",
" SELECTED_NAMESPACE = None # Reset selection\n",
" SELECTED_DEPLOYMENT = None\n",
" namespace_dropdown.disabled = True\n",
" namespace_dropdown.options = [\"Loading...\"]\n",
" namespace_dropdown.value = \"Loading...\"\n",
" deployment_dropdown.options = [\"Select Deployment\"]\n",
" deployment_dropdown.value = \"Select Deployment\"\n",
" deployment_dropdown.disabled = True\n",
"\n",
" # Clear output area and show loading context\n",
" with output_area:\n",
" clear_output(wait=True)\n",
" display(Markdown(f\"Cluster: **{cluster_name}**\"))\n",
" display(Markdown(\"Fetching namespaces...\"))\n",
"\n",
" # Fetch namespaces (assuming CLUSTER_REGION_MAP and PROJECT_ID are available)\n",
" region = CLUSTER_REGION_MAP.get(cluster_name)\n",
" if not region:\n",
" with output_area:\n",
" clear_output(wait=True)\n",
" display(\n",
" Markdown(\n",
" f\"<font color='red'>Error: Region not found for cluster {cluster_name}.</font>\"\n",
" )\n",
" )\n",
" namespace_dropdown.options = [\"Error loading\"]\n",
" namespace_dropdown.value = \"Error loading\"\n",
" return # Stop if region is missing\n",
"\n",
" # Assuming PROJECT_ID is globally available\n",
" namespaces = get_namespaces(cluster_name, region, PROJECT_ID)\n",
"\n",
" # Update dropdown options based on fetch result\n",
" if namespaces is not None: # Success (get_namespaces returns None on error)\n",
" new_options = [\"Select Namespace\"] + namespaces # Use \"Select Namespace\" prompt\n",
" namespace_dropdown.options = new_options\n",
" namespace_dropdown.value = \"Select Namespace\"\n",
" namespace_dropdown.disabled = False\n",
" status_message = (\n",
" f\"Found {len(namespaces)} namespace(s). Select one to list deployments.\"\n",
" )\n",
" else: # Error occurred during fetch\n",
" namespace_dropdown.options = [\"Error loading\"] # Keep error state\n",
" namespace_dropdown.value = \"Error loading\"\n",
" namespace_dropdown.disabled = True\n",
" status_message = None # Error already displayed by get_namespaces\n",
"\n",
" # Update output area with final status\n",
" with output_area:\n",
" clear_output(wait=True)\n",
" display(Markdown(f\"Cluster: **{cluster_name}**\"))\n",
" if status_message:\n",
" display(Markdown(status_message))\n",
"\n",
"\n",
"def on_cluster_change(change):\n",
" \"\"\"Handles cluster selection changes.\"\"\"\n",
" # Globals not strictly needed here as it calls update_namespace_dropdown which uses them\n",
" if change[\"type\"] == \"change\" and change[\"name\"] == \"value\":\n",
" cluster = change[\"new\"]\n",
"\n",
" # Clear output area for new selection process\n",
" with output_area:\n",
" clear_output(wait=True)\n",
"\n",
" if cluster == \"Select Cluster\":\n",
" # Reset namespace dropdown\n",
" namespace_dropdown.options = [\"Select Namespace\"] # Correct prompt\n",
" namespace_dropdown.value = \"Select Namespace\"\n",
" namespace_dropdown.disabled = True\n",
" # Reset deployment dropdown\n",
" deployment_dropdown.options = [\"Select Deployment\"]\n",
" deployment_dropdown.value = \"Select Deployment\"\n",
" deployment_dropdown.disabled = True\n",
" # Clear globals\n",
" global SELECTED_NAMESPACE, SELECTED_DEPLOYMENT\n",
" SELECTED_NAMESPACE = None\n",
" SELECTED_DEPLOYMENT = None\n",
" else:\n",
" # Trigger update for the namespace dropdown\n",
" update_namespace_dropdown(cluster)\n",
"\n",
"\n",
"def on_namespace_change(change):\n",
" \"\"\"Handles namespace selection: fetches deployments.\"\"\"\n",
" global SELECTED_NAMESPACE, cluster_dropdown, deployment_dropdown # Added deployment_dropdown\n",
" if change[\"type\"] == \"change\" and change[\"name\"] == \"value\":\n",
" new_namespace = change[\"new\"]\n",
"\n",
" # Get current cluster value\n",
" current_cluster = cluster_dropdown.value\n",
"\n",
" # Handle placeholder/loading/error values or if cluster isn't selected\n",
" if (\n",
" new_namespace in [\"Select Namespace\", \"Loading...\", \"Error loading\"]\n",
" or current_cluster == \"Select Cluster\"\n",
" ):\n",
" SELECTED_NAMESPACE = None\n",
" # Reset deployment dropdown state\n",
" deployment_dropdown.options = [\"Select Deployment\"]\n",
" deployment_dropdown.value = \"Select Deployment\"\n",
" deployment_dropdown.disabled = True\n",
" global SELECTED_DEPLOYMENT\n",
" SELECTED_DEPLOYMENT = None\n",
" # Clear output area for clean state\n",
" with output_area:\n",
" clear_output(wait=True)\n",
" if current_cluster != \"Select Cluster\": # Keep cluster context\n",
" display(Markdown(f\"Cluster: **{current_cluster}**\"))\n",
" if new_namespace == \"Select Namespace\":\n",
" display(Markdown(\"Select a namespace to list deployments.\"))\n",
" return # Don't proceed to fetch deployments\n",
"\n",
" # Valid namespace selected\n",
" SELECTED_NAMESPACE = new_namespace\n",
"\n",
" # Trigger update for the deployment dropdown\n",
" if current_cluster != \"Select Cluster\":\n",
" update_deployment_dropdown(current_cluster, SELECTED_NAMESPACE)\n",
"\n",
"\n",
"# --- Main Widget Setup ---\n",
"if CLUSTER_REGION_MAP:\n",
" clusters_with_prompt = [\"Select Cluster\"] + sorted(list(CLUSTER_REGION_MAP.keys()))\n",
" cluster_dropdown = widgets.Dropdown(\n",
" options=clusters_with_prompt,\n",
" value=\"Select Cluster\", # Set initial value\n",
" description=\"Cluster:\",\n",
" style={\"description_width\": \"initial\"},\n",
" layout=widgets.Layout(width=\"auto\"), # Auto width\n",
" )\n",
"\n",
" namespace_dropdown = widgets.Dropdown(\n",
" options=[\"Select Namespace\"], # Correct initial prompt\n",
" value=\"Select Namespace\",\n",
" description=\"Namespace:\",\n",
" disabled=True, # Initially disabled\n",
" style={\"description_width\": \"initial\"},\n",
" layout=widgets.Layout(width=\"auto\"),\n",
" )\n",
"\n",
" deployment_dropdown = widgets.Dropdown(\n",
" options=[\"Select Deployment\"],\n",
" value=\"Select Deployment\",\n",
" description=\"Deployment:\",\n",
" disabled=True, # Initially disabled\n",
" style={\"description_width\": \"initial\"},\n",
" layout=widgets.Layout(width=\"auto\"),\n",
" )\n",
"\n",
" # Observe changes\n",
" cluster_dropdown.observe(on_cluster_change, names=\"value\")\n",
" namespace_dropdown.observe(on_namespace_change, names=\"value\")\n",
" deployment_dropdown.observe(on_deployment_select, names=\"value\")\n",
"\n",
" # Display initial status and widgets\n",
" print(\n",
" f\"Found {len(CLUSTER_REGION_MAP)} Autopilot Cluster(s) in Project '{PROJECT_ID}'.\\n\"\n",
" )\n",
" display(cluster_dropdown, namespace_dropdown, deployment_dropdown, output_area)\n",
"\n",
"else:\n",
" # Handle case where PROJECT_ID might be missing or no clusters found\n",
" if \"PROJECT_ID\" not in globals() or not PROJECT_ID:\n",
" error_message = \"Error: PROJECT_ID variable is not defined or empty. Please define it in a previous cell.\"\n",
" else:\n",
" error_message = f\"Error: No Autopilot clusters found or accessible in project '{PROJECT_ID}'. Check Project ID, permissions, and ensure Autopilot clusters exist.\"\n",
" print(error_message)\n",
" # Display error message using a widget for better integration in notebook\n",
" display(widgets.HTML(f\"<font color='red'>{error_message}</font>\"))\n",
" # Keep output_area widget displayed even on error for potential messages from retries etc.\n",
" display(output_area)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "IKGTaN84p8rX",
"metadata": {
"cellView": "form",
"id": "IKGTaN84p8rX"
},
"outputs": [],
"source": [
"# @title # Chat completion for text-only models { vertical-output: true}\n",
"# @markdown You may send prompts to the model server for prediction.\n",
"# @markdown\n",
"# @markdown * **user_prompt (string):** This is the text prompt you provide to the language model. It's the question or instruction e (e.g., \"Explain neural networks\").\n",
"# @markdown * **temperature (number):** This parameter controls the randomness of the model's output. It influences how the model selects the next token in the sequence it generates. Typical values range from 0.2 to 1.0.\n",
"# @markdown * **max_tokens (number):** This parameter refers to the maximum number of tokens (words or sub-word units) that the model is allowed to generate in its response.\n",
"\n",
"import ipywidgets as widgets\n",
"\n",
"\n",
"def _run_kubectl(cmd):\n",
" \"\"\"Executes a kubectl command and returns its stdout.\"\"\"\n",
" result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=60)\n",
" return result.stdout.strip()\n",
"\n",
"\n",
"def get_deployment_pod_name(deployment, namespace):\n",
" \"\"\"Finds the running pod name for a given deployment and namespace.\"\"\"\n",
" cmd = [\n",
" \"kubectl\",\n",
" \"get\",\n",
" \"pods\",\n",
" \"-n\",\n",
" namespace,\n",
" \"-o\",\n",
" \"json\",\n",
" \"-l\",\n",
" f\"app={deployment}-app\",\n",
" \"--field-selector=status.phase=Running\",\n",
" ]\n",
" try:\n",
" pods_json = _run_kubectl(cmd)\n",
" pods = json.loads(pods_json)\n",
" if pods.get(\"items\"):\n",
" return pods[\"items\"][0][\"metadata\"][\"name\"]\n",
" print(f\"No running pods found for {deployment} in {namespace}.\")\n",
" return None\n",
" except (\n",
" subprocess.CalledProcessError,\n",
" json.JSONDecodeError,\n",
" IndexError,\n",
" KeyError,\n",
" ) as e:\n",
" print(f\"Error getting pod name for {deployment} in {namespace}: {e}\")\n",
" return None\n",
"\n",
"\n",
"def check_inference_label(pod_name, namespace):\n",
" \"\"\"Checks if the specified pod has the vLLM inference server label.\"\"\"\n",
" cmd = [\"kubectl\", \"get\", \"pod\", pod_name, \"-n\", namespace, \"-o\", \"json\"]\n",
" try:\n",
" pod_json = _run_kubectl(cmd)\n",
" labels = json.loads(pod_json).get(\"metadata\", {}).get(\"labels\", {})\n",
" return labels.get(\"ai.gke.io/inference-server\") == \"vllm\"\n",
" except (subprocess.CalledProcessError, json.JSONDecodeError, KeyError) as e:\n",
" print(f\"Error checking labels for pod {pod_name} in {namespace}: {e}\")\n",
" return False\n",
"\n",
"\n",
"def process_response(request, pod_name, pod_endpoint, is_vllm_inference, namespace):\n",
" \"\"\"Sends a request to the pod and processes the response.\"\"\"\n",
" json_data_escaped = json.dumps(request).replace(\"'\", \"'\\\\''\")\n",
" curl_cmd = f\"kubectl exec -n {namespace} -t {pod_name} -- curl -s -X POST http://{pod_endpoint}/generate -H \\\"Content-Type: application/json\\\" -d '{json_data_escaped}' 2> /dev/null\"\n",
" try:\n",
" response_raw = _run_kubectl([\"bash\", \"-c\", curl_cmd])\n",
" if not response_raw:\n",
" return f\"Error: Empty response from pod {pod_name}.\"\n",
" first_line = response_raw.splitlines()[0]\n",
" data = json.loads(first_line)\n",
"\n",
" if is_vllm_inference:\n",
" predictions = data.get(\"predictions\")\n",
" if isinstance(predictions, (list, tuple)) and predictions:\n",
" return predictions[0]\n",
" return f\"Error: Unexpected vLLM format. Raw: {first_line}\"\n",
" else: # TGI format\n",
" generated_text = data.get(\"generated_text\")\n",
" if generated_text is not None:\n",
" return generated_text\n",
" return f\"Error: Unexpected TGI format. Raw: {first_line}\"\n",
"\n",
" except json.JSONDecodeError as e:\n",
" raw_response = (\n",
" response_raw.splitlines()[0]\n",
" if \"response_raw\" in locals() and response_raw\n",
" else \"N/A\"\n",
" )\n",
" return f\"Error decoding JSON: {e}. Raw: {raw_response}\"\n",
" except (subprocess.CalledProcessError, IndexError, KeyError, TypeError) as e:\n",
" raw_response = (\n",
" response_raw.splitlines()[0]\n",
" if \"response_raw\" in locals() and response_raw\n",
" else \"N/A\"\n",
" )\n",
" return f\"Error processing response: {e}. Raw: {raw_response}\"\n",
" except Exception as e:\n",
" return f\"Unexpected error during response processing: {e}\"\n",
"\n",
"\n",
"# --- Widgets Setup ---\n",
"user_prompt_widget = widgets.Textarea(\n",
" value=\"What is AI?\",\n",
" description=\"User Prompt:\",\n",
" layout=widgets.Layout(width=\"95%\", height=\"100px\"),\n",
")\n",
"temperature_widget = widgets.FloatSlider(\n",
" value=0.50, min=0.0, max=1.0, step=0.01, description=\"Temperature:\"\n",
")\n",
"max_tokens_widget = widgets.IntSlider(\n",
" value=250, min=1, max=2048, step=1, description=\"Max Tokens:\"\n",
")\n",
"submit_button = widgets.Button(description=\"Submit\")\n",
"output_area_response = widgets.Output()\n",
"\n",
"\n",
"# --- Submit Button Logic ---\n",
"def on_submit_clicked(b):\n",
" \"\"\"Handles the submit button click event.\"\"\"\n",
" with output_area_response:\n",
" clear_output()\n",
" if (\n",
" \"SELECTED_DEPLOYMENT\" not in globals()\n",
" or \"SELECTED_NAMESPACE\" not in globals()\n",
" ):\n",
" display(\n",
" Markdown(\n",
" \"**Error:** `SELECTED_DEPLOYMENT` or `SELECTED_NAMESPACE` not defined.\"\n",
" )\n",
" )\n",
" return\n",
"\n",
" print(\n",
" f\"Target: {SELECTED_DEPLOYMENT} in {SELECTED_NAMESPACE}. \\n\\nRequesting response...\"\n",
" )\n",
"\n",
" pod_name = get_deployment_pod_name(SELECTED_DEPLOYMENT, SELECTED_NAMESPACE)\n",
" if not pod_name:\n",
" display(\n",
" Markdown(\n",
" f\"**Error:** Could not find running pod for `{SELECTED_DEPLOYMENT}`.\"\n",
" )\n",
" )\n",
" return\n",
"\n",
" is_vllm = check_inference_label(pod_name, SELECTED_NAMESPACE)\n",
" request = {\n",
" \"max_tokens\": max_tokens_widget.value,\n",
" \"temperature\": temperature_widget.value,\n",
" \"prompt\" if is_vllm else \"inputs\": user_prompt_widget.value,\n",
" }\n",
" service = f\"{SELECTED_DEPLOYMENT}-service\"\n",
" endpoint_cmd = [\n",
" \"kubectl\",\n",
" \"get\",\n",
" \"endpoints\",\n",
" service,\n",
" \"-n\",\n",
" SELECTED_NAMESPACE,\n",
" ]\n",
"\n",
" try:\n",
" endpoint_output = _run_kubectl(endpoint_cmd).splitlines()\n",
" if len(endpoint_output) < 2 or len(endpoint_output[1].split()) < 2:\n",
" display(\n",
" Markdown(\n",
" f\"**Error:** Endpoint data incomplete for service `{service}`.\"\n",
" )\n",
" )\n",
" print(\"kubectl output:\\n\", \"\\n\".join(endpoint_output))\n",
" return\n",
" endpoint = endpoint_output[1].split()[\n",
" 1\n",
" ] # Assumes format: NAME ENDPOINTS AGE -> service ip:port,... age\n",
" response = process_response(\n",
" request, pod_name, endpoint, is_vllm, SELECTED_NAMESPACE\n",
" )\n",
" display(Markdown(f\"**Response:**\\n\\n{response}\"))\n",
"\n",
" except subprocess.CalledProcessError as e:\n",
" display(\n",
" Markdown(\n",
" f\"**Error getting endpoints for `{service}`:**\\n```\\n{e.stderr}\\n```\"\n",
" )\n",
" )\n",
" except Exception as e:\n",
" display(Markdown(f\"**Unexpected Error:**\\n```\\n{e}\\n```\"))\n",
"\n",
"\n",
"# --- Display Widgets ---\n",
"submit_button.on_click(on_submit_clicked)\n",
"display(\n",
" user_prompt_widget,\n",
" temperature_widget,\n",
" max_tokens_widget,\n",
" submit_button,\n",
" output_area_response,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5b6ZM2K3fux0",
"metadata": {
"id": "5b6ZM2K3fux0"
},
"source": [
"# Next Steps: Integrating the GKE Service Endpoint\n",
"\n",
"After successfully deploying a model on Google Kubernetes Engine (GKE) and\n",
"verifying it via a notebook, the next step is to integrate it into various\n",
"applications. This involves making HTTP requests to the service's endpoint from\n",
"your application code.\n",
"\n",
"### Exposing the Service\n",
"\n",
"To make your deployed model accessible to applications, you'll need to expose\n",
"its service endpoint. Google Kubernetes Engine offers several ways to do this:\n",
"\n",
"1. **Ingress:** Configure an Ingress resource to route external HTTP(S) traffic\n",
" to your service. Set up Ingress for either an internal Load Balancer\n",
" (accessible only within your VPC) or an external Load Balancer (accessible\n",
" from the internet).\n",
" [Learn more about GKE Ingress](https://cloud.google.com/kubernetes-engine/docs/concepts/ingress).\n",
"2. **Gateway API:** A more modern and feature-rich API for managing traffic\n",
" routing in Kubernetes. Similar to Ingress, Gateway API allows you to define\n",
" how external and internal traffic should be directed to your services.\n",
" [Explore GKE Gateway API](https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api).\n",
"\n",
"### Setting Up Autoscaling\n",
"\n",
"Ensure your model serving can handle varying traffic by configuring the\n",
"Horizontal Pod Autoscaler (HPA). HPA automatically scales the number of Pods\n",
"based on resource utilization or custom metrics, optimizing performance and\n",
"cost.\n",
"[See how to configure HPA](https://cloud.google.com/kubernetes-engine/docs/how-to/horizontal-pod-autoscaling).\n",
"\n",
"### Setting Up Monitoring\n",
"\n",
"Monitor the health and performance of your deployed model using Google Cloud\n",
"Managed Service for Prometheus. Configure your model serving to expose\n",
"Prometheus metrics for comprehensive insights.\n",
"[Get started with Google Cloud Managed Prometheus](https://cloud.google.com/kubernetes-engine/docs/how-to/configure-automatic-application-monitoring).\n",
"\n",
"### Additional Resources:\n",
"\n",
"* #### Kubernetes Documentation:\n",
"\n",
" * Services:\n",
" https://kubernetes.io/docs/concepts/services-networking/service/\n",
"\n",
"* #### Google Cloud Documentation:\n",
"\n",
" * Google Kubernetes Engine (GKE):\n",
" https://cloud.google.com/kubernetes-engine\n",
" * Cloud Load Balancing:\n",
" https://cloud.google.com/load-balancing/docs/ingress\n",
" * Gateway API on GKE:\n",
" https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api\n",
" * Learn about GPUs in GKE:\n",
" https://cloud.google.com/kubernetes-engine/docs/concepts/gpus\n",
"\n",
"* #### Python requests Library:\n",
"\n",
" * https://requests.readthedocs.io/en/latest/\n",
"\n",
"* #### LangChain with Google Integrations:\n",
"\n",
" * The Langchain documentation is very useful:\n",
" https://python.langchain.com/docs/integrations/providers/google/"
]
}
],
"metadata": {
"colab": {
"name": "gke_model_ui_deployment_notebook.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}