# Deploy a Model to a SageMaker Endpoint and run a simple Gradio UI
In this notebook, we demonstrate how to deploy a large language model as a SageMaker endpoint and interact with it through a simple Gradio UI. This approach enables quick and easy experimentation with a wide range of models.

## 1. Setup

First, we need to install the required libraries and set up the environment.

In [None]:
%pip install -Uqq sagemaker
%pip install -Uqq "huggingface_hub[cli]"
%pip install -Uqq gradio

In [None]:
import os
import boto3
import sagemaker
from pathlib import Path
from sagemaker.djl_inference.model import DJLModel
import re

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # SageMaker session for interacting with AWS APIs
region = sess.boto_region_name  # region of the current SageMaker Studio environment
bucket = sess.default_bucket()  # default bucket name
prefix = "gradio_chatbot"  # S3 key prefix for artifacts created by this notebook
account_id = sess.account_id()  # AWS account ID

## 2. Download Model
Next, we download the model from the Hugging Face model hub. Some models are gated and require you to sign in before downloading; see the [CLI documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli) for the login commands. Here we use the publicly available `Phi-3.5-mini-instruct` model.

In [4]:
LLM_NAME = "microsoft/Phi-3.5-mini-instruct"
LLM_DIR = os.path.basename(LLM_NAME).lower()

We can use the `huggingface-cli` to download the model.

In [None]:
!huggingface-cli download $LLM_NAME --local-dir $LLM_DIR

Next we can upload the model to our S3 bucket so that we can deploy it to a SageMaker endpoint.

In [None]:
llm_s3_path = f"s3://{bucket}/{prefix}/llm/{LLM_DIR}"
!aws s3 sync $LLM_DIR $llm_s3_path

## 3. Deploy Model
Now that the model is in our S3 bucket, we can deploy it to a SageMaker endpoint. We will use the `DJLModel` class from the `sagemaker.djl_inference` module. **DJL** stands for Deep Java Library, a Java framework for deep learning that underpins the [Large Model Inference](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html) (LMI) container we will use to serve the model.

LMI allows us to deploy models without writing any inference code. We simply provide the desired deployment configuration as environment variables and pass it to the `DJLModel` class.
Below are example configurations that work with the `Phi-3.5-mini-instruct` model. You can adjust these configurations based on your model's requirements. See [here](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/deployment_guide/configurations.html) for a complete list of configurations.

In [9]:
# define inference environment for LLM
llm_env = {
    "TENSOR_PARALLEL_DEGREE": "1",  # use 1 GPU
    "OPTION_ROLLING_BATCH": "vllm", # use vLLM rolling batch
    "OPTION_MAX_ROLLING_BATCH_SIZE": "32", # max rolling batch size (controls the concurrency)
    "OPTION_DTYPE": "fp16", # load weights in fp16
    "OPTION_MAX_MODEL_LEN": "16384", # max context length in tokens for the model
    "OPTION_TRUST_REMOTE_CODE": "true", # trust remote code
    "OPTION_GPU_MEMORY_UTILIZATION": "0.95", # use 95% of GPU memory
}

# create DJLModel object for LLM
# see here for LMI version updates https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers 
sm_llm_model = DJLModel(
    model_id=llm_s3_path,
    djl_version="0.29.0",
    djl_framework="djl-lmi",
    role=role,
    env=llm_env,
)
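As a rough sanity check on whether settings like `OPTION_DTYPE` and `OPTION_MAX_MODEL_LEN` fit on a single GPU, the sketch below estimates the memory taken by the weights and by the KV cache for one full-length sequence. The helper `estimate_gpu_memory_gib` is hypothetical and not part of any library. The parameter count (about 3.8 billion) comes from the model card; the layer count, KV head count, and head dimension are assumptions that should be checked against the model's `config.json`. The estimate ignores activations and runtime overhead.

```python
def estimate_gpu_memory_gib(
    num_params: float,
    bytes_per_param: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
) -> dict:
    """Rough weight and KV-cache memory estimate for one full-length sequence."""
    gib = 1024 ** 3
    weights = num_params * bytes_per_param
    # Each token stores one key vector and one value vector per layer.
    kv_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_param
    kv_cache = kv_per_token * context_len
    return {
        "weights_gib": weights / gib,
        "kv_cache_gib": kv_cache / gib,
        "total_gib": (weights + kv_cache) / gib,
    }


# Assumed architecture values; verify them against the model's config.json.
estimate = estimate_gpu_memory_gib(
    num_params=3.8e9,
    bytes_per_param=2,   # fp16
    num_layers=32,
    num_kv_heads=32,
    head_dim=96,
    context_len=16384,   # matches OPTION_MAX_MODEL_LEN
)

for name, value in estimate.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the total stays well below the 24 GB of an A10G GPU, which leaves room for the serving engine to cache several concurrent sequences. Treat the numbers as an order-of-magnitude guide rather than a guarantee.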

In [None]:
instance_type = "ml.g5.2xlarge"
# endpoint names may only contain alphanumeric characters and hyphens
endpoint_name = sagemaker.utils.name_from_base(re.sub(r"[._]+", "-", LLM_DIR))

llm_predictor = sm_llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=1800,  # seconds to wait for the container to become healthy
)
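The `re.sub` call above is needed because SageMaker endpoint names may only contain alphanumeric characters and hyphens, while model names often include dots or underscores. A slightly more general sketch of the same idea is shown below. The helper `to_endpoint_base_name` and its `max_len` default are hypothetical choices for illustration, not part of the SageMaker SDK.

```python
import re


def to_endpoint_base_name(model_dir: str, max_len: int = 32) -> str:
    """Turn a model directory name into a base name that is safe for an endpoint."""
    # Replace every run of disallowed characters with a single hyphen.
    name = re.sub(r"[^A-Za-z0-9-]+", "-", model_dir)
    # Collapse repeated hyphens and drop leading or trailing ones.
    name = re.sub(r"-{2,}", "-", name).strip("-")
    # Trim to leave room for a unique suffix, without ending on a hyphen.
    return name[:max_len].rstrip("-")


print(to_endpoint_base_name("phi-3.5-mini-instruct"))
print(to_endpoint_base_name("my_model.v2"))
```

The result can be passed to `sagemaker.utils.name_from_base`, which appends a timestamp so that repeated deployments get distinct names.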

LMI includes a [chat completions](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/chat_input_output_schema.html) API that works with message-based payloads. It only works with models that provide a [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) as part of their tokenizer.

We can validate the deployment by sending a test request to the endpoint.

In [None]:
chat = [
  {"role": "system", "content": "You are a helpful AI assistant."},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "Can you write a Python function that parses BibTeX using only standard Python libraries?"},
]

result = llm_predictor.predict(
    {"messages": chat, "max_tokens": 2000, "temperature": 0.5}
)

response_message = result["choices"][0]["message"]

print(response_message["content"])
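Indexing straight into `result["choices"][0]["message"]` raises an unhelpful `KeyError` or `IndexError` if the endpoint returns an error payload or an empty list of choices. A minimal defensive sketch is shown below. The helper `extract_reply` is hypothetical, and the sample dictionary only mimics the fields used in this notebook rather than the full response schema.

```python
def extract_reply(result: dict) -> str:
    """Return the assistant text from a chat completions style response."""
    choices = result.get("choices") or []
    if not choices:
        raise ValueError(f"Response contains no choices: {result!r}")
    message = choices[0].get("message") or {}
    content = message.get("content")
    if content is None:
        raise ValueError(f"First choice has no message content: {result!r}")
    return content


# Hand-written sample that mirrors the fields accessed above.
sample = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(extract_reply(sample))
```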

## 4. Build chat interface
Finally, we will build a simple Gradio UI for interacting with the model. Gradio is a Python library for quickly creating UIs for your models.

Gradio provides a built-in [ChatInterface](https://www.gradio.app/docs/gradio/chatinterface) class for chat-based interfaces. To use it, we implement a chat function that takes the latest message, the history of prior messages, and any other generation parameters we wish to configure.

In [None]:
import gradio as gr
from functools import partial

In [14]:
def chat(predictor, message, history, system_prompt=None, max_tokens=100, temperature=0.5, top_p=0.99):
    """Send the conversation so far plus the latest message to the endpoint and return the reply."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})

    # history is a list of (user_message, assistant_message) pairs
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})

    messages.append({"role": "user", "content": message})

    response = predictor.predict(
        {
            "messages": messages,
            "max_tokens": int(max_tokens),  # sliders may return floats
            "temperature": temperature,
            "top_p": top_p,
        }
    )

    return response["choices"][0]["message"]["content"]
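The history loop above assumes that Gradio passes the history as `(user, assistant)` pairs. Depending on the Gradio version and the `type` setting of `ChatInterface`, the history can instead arrive as a list of dictionaries with `role` and `content` keys. The sketch below shows one way to accept both shapes. The helper `build_messages` is hypothetical, and it assumes that dictionary entries carry plain-text content.

```python
def build_messages(message, history, system_prompt=None):
    """Build a chat completions message list from either history format."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})

    for turn in history:
        if isinstance(turn, dict):
            # "messages" style history: keep only the fields the endpoint needs.
            messages.append({"role": turn["role"], "content": turn["content"]})
        else:
            # Pair style history: (user_message, assistant_message).
            user_msg, assistant_msg = turn
            messages.append({"role": "user", "content": user_msg})
            if assistant_msg is not None:
                messages.append({"role": "assistant", "content": assistant_msg})

    messages.append({"role": "user", "content": message})
    return messages


pairs = [("Hi", "Hello! How can I help?")]
dicts = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
print(build_messages("Tell me a joke", pairs, "Be brief."))
print(build_messages("Tell me a joke", pairs) == build_messages("Tell me a joke", dicts))
```

Because the helper does not touch the endpoint, it can be checked locally before the chat function is wired into the UI.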

In [None]:
chat_function = partial(chat, llm_predictor) # create a partial function with the predictor

gr.close_all()

chat_interface = gr.ChatInterface(
    chat_function,
    title="Example Chatbot",
    description=f"Example chatbot powered by {LLM_DIR}",
    additional_inputs=[
        gr.Textbox("You are a helpful AI assistant.", label="System Prompt"),
        gr.Slider(1, 4000, 500, label="Max Tokens"),
        gr.Slider(0.1, 1.0, 0.5, label="Temperature"),
        gr.Slider(0.1, 1.0, 0.99, label="Top P"),  # third positional argument is the default value
    ],
    additional_inputs_accordion="Model Settings",
)
chat_interface.chatbot.render_markdown = True
chat_interface.chatbot.height = 400

chat_interface.load()

chat_interface.queue(default_concurrency_limit=10).launch(share=True) # set to False to keep private

## 5. Clean Up

When you are finished, delete the endpoint to avoid ongoing charges.

In [16]:
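llm_predictor.delete_model()  # remove the SageMaker model resource
llm_predictor.delete_endpoint()  # remove the endpoint and its configuration
# optional: clean up the S3 bucket
# !aws s3 rm --recursive $llm_s3_path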