# Chatbot Demo

This notebook demonstrates how to create a simple chatbot interface using Llama models via Hugging Face's `transformers` library and `gradio` for the user interface.

### Main Features:
- Select from a list of pre-trained Llama models.
- Input text and receive responses from the selected model.
- Cache models to avoid reloading them multiple times.

## Log in to Hugging Face Hub

This cell imports the `login` function from `huggingface_hub` and calls it to authenticate with your Hugging Face account. Authentication is required because the Llama model repositories are gated.

In [1]:
from huggingface_hub import login
login()

## Import Required Libraries

In [2]:
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [3]:
# Pipeline device index: 0 selects the first CUDA GPU, -1 runs on the CPU.
device = 0 if torch.cuda.is_available() else -1

## Basic Usage

Below is a list of available Llama models to choose from. The dictionary `llama_models` maps each display name to the path of its repository on the Hugging Face Hub.

In [4]:
llama_models = {
    "Llama 3 70B Instruct": "meta-llama/Meta-Llama-3-70B-Instruct",
    "Llama 3 8B Instruct": "meta-llama/Meta-Llama-3-8B-Instruct",
    "Llama 3.1 70B Instruct": "meta-llama/Llama-3.1-70B-Instruct",
    "Llama 3.1 8B Instruct": "meta-llama/Llama-3.1-8B-Instruct",
    "Llama 3.2 3B Instruct": "meta-llama/Llama-3.2-3B-Instruct",
    "Llama 3.2 1B Instruct": "meta-llama/Llama-3.2-1B-Instruct",
}
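
As a small illustration of how this mapping can be used, the sketch below resolves a display name to its repository ID and falls back to the first entry when nothing is selected. The helper name `resolve_repo_id` and the shortened dictionary are illustrative assumptions, not part of the notebook.

```python
# Shortened copy of the mapping above so the example runs on its own.
llama_models = {
    "Llama 3.2 3B Instruct": "meta-llama/Llama-3.2-3B-Instruct",
    "Llama 3.2 1B Instruct": "meta-llama/Llama-3.2-1B-Instruct",
}


def resolve_repo_id(display_name=None, models=llama_models):
    """Return the Hub repository ID for a display name (hypothetical helper).

    Falls back to the first entry when no name is given, and raises a
    readable error for names that are not in the mapping.
    """
    if display_name is None:
        display_name = next(iter(models))
    try:
        return models[display_name]
    except KeyError:
        raise ValueError(
            f"Unknown model {display_name!r}; choose one of {list(models)}"
        ) from None


print(resolve_repo_id("Llama 3.2 1B Instruct"))
print(resolve_repo_id())
```

Raising a `ValueError` with the list of valid names tends to be easier to debug than a bare `KeyError` when the name comes from user input.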

### Loading Models

The `load_model` function loads the selected Llama model along with its tokenizer. It uses Hugging Face's `AutoModelForCausalLM` and `AutoTokenizer` to load the pre-trained model and returns a text-generation pipeline. Note that access must be requested separately for each Llama version before its models can be downloaded.

In [5]:
def load_model(model_name):
    """Load the model and tokenizer for a Hub repository path and wrap them in a text-generation pipeline."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=device)
    return generator

### Model Caching

Loaded pipelines are stored in the `model_cache` dictionary, keyed by model name, so that each model is loaded at most once per session.

In [6]:
model_cache = {}
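
The caching pattern can be sketched without downloading anything. In the example below, `fake_load_model`, `get_generator`, and `load_calls` are hypothetical stand-ins introduced only for illustration; the real notebook stores the pipeline returned by `load_model` instead.

```python
model_cache = {}
load_calls = []  # records every load, for demonstration only


def fake_load_model(repo_id):
    """Stand-in for `load_model`; returns a cheap object instead of a pipeline."""
    load_calls.append(repo_id)
    return f"<generator for {repo_id}>"


def get_generator(repo_id, loader=fake_load_model):
    """Return the cached generator, calling the loader only on the first request."""
    if repo_id not in model_cache:
        model_cache[repo_id] = loader(repo_id)
    return model_cache[repo_id]


first = get_generator("meta-llama/Llama-3.2-1B-Instruct")
second = get_generator("meta-llama/Llama-3.2-1B-Instruct")
print(first is second)  # True: the cached object is reused
print(len(load_calls))  # 1: the loader ran only once
```

Entries stay in the dictionary until they are removed, so every model selected during a session remains in memory.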

### Chat Generation

The `generate_chat` function generates chatbot responses using the selected Llama model. It first checks whether the chosen model is cached and loads it if not. A system prompt defines the bot's behavior, here framing it as a helpful assistant; customize it to change the nature of the responses.

The conversation history is a list of messages, each a dictionary with a `role` and a `content` key. The latest user input is appended to this list, and the model generates a response from the full conversation. The parameters `max_length`, `temperature`, and `top_p` control the length and variability of the output; note that `max_length` counts the prompt tokens as well as the generated ones. The response is appended to the history, and the updated history is returned so that the chat session can continue where it left off.

In [8]:
def generate_chat(user_input, history, model_choice):
    """Generate a chatbot response using the selected Llama model."""
    
    if model_choice not in model_cache:
        model_cache[model_choice] = load_model(llama_models[model_choice])
    generator = model_cache[model_choice]

    system_prompt = {"role": "system", "content": "You are a helpful assistant"}

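    # Start a new conversation with the system prompt when there is no
    # history yet (None or an empty list).
    if not history:
        history = [system_prompt]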
    
    history.append({"role": "user", "content": user_input})

    response = generator(
        history,
        max_length=512,  # total token budget, prompt included
        pad_token_id=generator.tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )[-1]["generated_text"][-1]["content"]  # text of the newly generated assistant message

    history.append({"role": "assistant", "content": response})
    
    return history
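
To make the message structure and the chained indexing easier to follow, here is a minimal sketch that replaces the real pipeline with a stub. It assumes the pipeline returns a one-element list whose `generated_text` holds the full conversation with the new assistant message appended, which is the shape the indexing in `generate_chat` relies on. The name `fake_generator` is hypothetical.

```python
def fake_generator(messages, **generation_kwargs):
    """Stand-in for a chat pipeline call (assumption: same output shape).

    Returns a one-element list whose "generated_text" is the input
    conversation followed by a new assistant message.
    """
    reply = {"role": "assistant", "content": f"Echo: {messages[-1]['content']}"}
    return [{"generated_text": list(messages) + [reply]}]


history = [{"role": "system", "content": "You are a helpful assistant"}]
history.append({"role": "user", "content": "Hello!"})

output = fake_generator(history, max_length=512, do_sample=True)

# Last result -> full conversation -> newest message -> its text.
response = output[-1]["generated_text"][-1]["content"]
history.append({"role": "assistant", "content": response})

for message in history:
    print(f"{message['role']}: {message['content']}")
```

Because the whole history is passed on every turn, the prompt grows with the conversation while the generation parameters stay fixed.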

## Gradio Interface

The cell below defines an interactive Gradio interface for chatting with various Llama models. It includes a centered title, a dropdown menu for selecting a Llama model, a chatbot component for displaying the conversation, and a textbox for entering user queries. Users can submit input either by pressing **Enter** or by clicking the **Submit** button; both trigger the `respond` function. This function generates a response using the selected model, maintains the chat history, and updates the chatbot component with the ongoing conversation.

In [10]:
with gr.Blocks() as demo:
    gr.Markdown("<h1><center>Chat with Llama Models</center></h1>")

    model_choice = gr.Dropdown(list(llama_models.keys()), label="Select Llama Model")

    chatbot = gr.Chatbot(label="Chatbot Interface", type="messages")
    txt_input = gr.Textbox(show_label=False, placeholder="Type your message here...")

    def respond(user_input, chat_history, model_choice):
        if model_choice is None:
            model_choice = list(llama_models.keys())[0]
        updated_history = generate_chat(user_input, chat_history, model_choice)
        return "", updated_history

    txt_input.submit(respond, [txt_input, chatbot, model_choice], [txt_input, chatbot])

    submit_btn = gr.Button("Submit")
    submit_btn.click(respond, [txt_input, chatbot, model_choice], [txt_input, chatbot])
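
The callback contract can be checked without launching the UI. In the sketch below, `fake_generate_chat` is a hypothetical stand-in for `generate_chat` and the dictionary is shortened; the `respond` function mirrors the one above. Its two return values map, in order, to the `[txt_input, chatbot]` outputs, so the empty string clears the textbox and the list becomes the new chat history.

```python
llama_models = {
    "Llama 3.2 3B Instruct": "meta-llama/Llama-3.2-3B-Instruct",
    "Llama 3.2 1B Instruct": "meta-llama/Llama-3.2-1B-Instruct",
}


def fake_generate_chat(user_input, history, model_choice):
    """Stand-in for `generate_chat` that replies without loading a model."""
    history = list(history or [])
    history.append({"role": "user", "content": user_input})
    history.append(
        {"role": "assistant", "content": f"[{model_choice}] You said: {user_input}"}
    )
    return history


def respond(user_input, chat_history, model_choice):
    """Same contract as the callback above: (new textbox value, new history)."""
    if model_choice is None:
        # No dropdown selection yet: fall back to the first model in the mapping.
        model_choice = list(llama_models.keys())[0]
    updated_history = fake_generate_chat(user_input, chat_history, model_choice)
    return "", updated_history


textbox_value, chat = respond("Hi", [], None)
print(repr(textbox_value))
print(chat[-1]["content"])
```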

### Launch the interface

The screenshot below shows the interface:
![Chatbot Interface Screenshot](../assets/gradio_chatbot_demo.png)

In [None]:
demo.launch()