# Vertex AI Model Garden - LLaMA2 (PEFT)

## Overview

This notebook demonstrates how to use an existing prediction endpoint deployed from Vertex AI Model Garden.

To find the ID of an existing prediction endpoint, navigate to [Vertex AI -> Online prediction](https://pantheon.corp.google.com/vertex-ai/online-prediction).





## (Optional) Deploy Llama2 Model into Vertex AI

The following steps deploy pre-built [LLaMA2 models](https://huggingface.co/meta-llama) with vLLM.

### Setup Google Cloud project

1. [Enable the Vertex AI API, Compute Engine API and Cloud Natural Language API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component,language.googleapis.com).

1. [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs.

1. [Create a service account](https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console) with the `Vertex AI User` and `Storage Object Admin` roles for deploying the fine-tuned model to a Vertex AI endpoint.

### Deploy Llama2

Accept the model agreement to access the models:
1. Navigate to the Vertex AI > Model Garden page in the Google Cloud console.
2. Find the LLaMA2 model card and click "VIEW DETAILS".
3. Review the agreement on the model card page.
4. After you accept the LLaMA2 agreement, a Cloud Storage bucket containing the LLaMA2 pretrained and fine-tuned models is shared with you.
5. Deploy the `Llama2-7B-chat-001` or `Llama2-13B-chat-001` model using One-click deploy.
6. Paste the link of the Cloud Storage bucket that you created earlier and assign it to `VERTEX_AI_MODEL_GARDEN_LLAMA2`.
7. Note the endpoint name. Alternatively, navigate to [Vertex AI -> Online prediction](https://pantheon.corp.google.com/vertex-ai/online-prediction) and copy the endpoint ID from there.
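
The endpoint ID can also be looked up programmatically instead of through the console. The following is a minimal sketch, assuming the `google-cloud-aiplatform` package is installed and application default credentials are configured. `format_endpoints` and `list_endpoint_ids` are hypothetical helper names, not part of this repository.

```python
from typing import Iterable, List, Tuple


def format_endpoints(endpoints: Iterable) -> List[Tuple[str, str]]:
    """Return (display_name, endpoint_id) pairs.

    Works with any objects exposing `display_name` and `name` attributes,
    as `aiplatform.Endpoint` instances do (`name` holds the endpoint ID).
    """
    return [(ep.display_name, ep.name) for ep in endpoints]


def list_endpoint_ids(project: str, region: str) -> List[Tuple[str, str]]:
    """List the endpoints deployed in the given project and region."""
    # Imported lazily so that format_endpoints stays usable without the SDK.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)
    return format_endpoints(aiplatform.Endpoint.list())
```

Calling `list_endpoint_ids("<your-project>", "us-central1")` should return one pair per deployed endpoint; the second element of each pair is the value to use as the endpoint ID.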

## Using the prediction endpoint

In [1]:
import os
import sys
import vertexai
from google.cloud import aiplatform
sys.path.append("../../common/src")
sys.path.append("../src")
os.chdir("../src")

In [2]:
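# Set the project ID for this Python process. A `!export` shell command would
# only affect a temporary subshell, so os.environ is used instead.
PROJECT_ID = "gcp-mira-demo"
os.environ["PROJECT_ID"] = PROJECT_ID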
os.environ["MODEL_GARDEN_LLAMA2_CHAT_ENDPOINT_ID"] = "3894402015861669888"
REGION = "us-central1"  # @param {type:"string"}
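
The log output of the prediction calls in this notebook addresses the endpoint by its full resource name, of the form `projects/<project>/locations/<region>/endpoints/<id>`. As a sketch of how the values set above combine into that name (`endpoint_resource_name` is a hypothetical helper, not a function from this repository):

```python
def endpoint_resource_name(project_id: str, region: str, endpoint_id: str) -> str:
    """Compose the full Vertex AI endpoint resource name."""
    return f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}"


# Values taken from the configuration cell above.
print(endpoint_resource_name("gcp-mira-demo", "us-central1", "3894402015861669888"))
```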


In [3]:
from services.llm_generate import model_garden_predict
from services.llm_generate import llm_generate

prompt = "What is a Medicaid?"

INFO: [src/config.py:49 - <module>()] Namespace File not found, setting job namespace as default
INFO: [src/config.py:84 - get_environ_flag()] ENABLE_GOOGLE_LLM = True
INFO: [src/config.py:84 - get_environ_flag()] ENABLE_GOOGLE_MODEL_GARDEN = True
INFO: [src/config.py:84 - get_environ_flag()] ENABLE_OPENAI_LLM = True
INFO: [src/config.py:84 - get_environ_flag()] ENABLE_COHERE_LLM = True
INFO: [src/config.py:191 - <module>()] LLM types loaded ['OpenAI-GPT3.5', 'OpenAI-GPT4', 'Cohere', 'VertexAI-Text', 'VertexAI-Chat', 'VertexAI-Chat', 'VertexAI-ModelGarden-LLAMA2-Chat']


Call the prediction endpoint:

In [4]:

response = await llm_generate(llm_type="VertexAI-ModelGarden-LLAMA2-Chat", prompt=prompt)
print(response)

INFO: [services/llm_generate.py:46 - llm_generate()] Generating text with an LLM given a prompt=What is a Medicaid?, llm_type=VertexAI-ModelGarden-LLAMA2-Chat
INFO: [services/llm_generate.py:117 - model_garden_predict()] Generating text using Model Garden endpoint=[projects/gcp-mira-demo/locations/us-central1/endpoints/3894402015861669888], prompt=[What is a Medicaid?], parameters=[None.
INFO: [services/llm_generate.py:140 - model_garden_predict()] Received response in 28 seconds from projects/63101149566/locations/us-central1/models/llama2-13b-chat-001-mg-one-click-deploy version=[1] with 1 prediction(s) = [Prompt:
'What is a Medicaid?'
Output:
 and 'How does it work?'

Medicaid is a government program that provides health coverage to low-income individuals and families. It is jointly funded by the federal government and each state, and is administered by each state.

To be eligible for Medicaid, you must meet certain income and asset requirements, which vary by state. Generally, you 
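
The prediction text logged above echoes the prompt under `Prompt:` and places the generated text under `Output:`. If only the generated text is needed, it can be split off. The following is a minimal sketch, assuming the returned string follows the `Prompt:` / `Output:` layout shown in the log; `extract_output` is a hypothetical helper, not part of `services.llm_generate`.

```python
def extract_output(prediction: str) -> str:
    """Return the text after the first 'Output:' marker, stripped.

    Falls back to the whole (stripped) string when the marker is absent.
    """
    marker = "Output:"
    _, found, generated = prediction.partition(marker)
    return generated.strip() if found else prediction.strip()


# Shortened sample in the same layout as the logged prediction.
sample = "Prompt:\n'What is a Medicaid?'\nOutput:\n Medicaid is a government program"
print(extract_output(sample))
```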

To customize the prediction parameters, call `model_garden_predict` directly. This is useful because performance depends to a large extent on the number of output tokens.

In [5]:
parameters = {
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 1.0,
    "top_k": 10,
}

response = await model_garden_predict(
    aip_endpoint_name=os.environ["MODEL_GARDEN_LLAMA2_CHAT_ENDPOINT_ID"],
    prompt=prompt,
    parameters=parameters,
)
print(response)

INFO: [services/llm_generate.py:117 - model_garden_predict()] Generating text using Model Garden endpoint=[projects/gcp-mira-demo/locations/us-central1/endpoints/3894402015861669888], prompt=[What is a Medicaid?], parameters=[{'max_tokens': 300, 'temperature': 0.2, 'top_p': 1.0, 'top_k': 10}.
INFO: [services/llm_generate.py:140 - model_garden_predict()] Received response in 20 seconds from projects/63101149566/locations/us-central1/models/llama2-13b-chat-001-mg-one-click-deploy version=[1] with 1 prediction(s) = [Prompt:
'What is a Medicaid?'
Output:
 and 'How does it work?'

Medicaid is a government program that provides health coverage to eligible individuals with low income and limited resources. It was created in 1965 as part of the Social Security Act and is administered by the Centers for Medicare and Medicaid Services (CMS), a division of the U.S. Department of Health and Human Services (HHS).

Medicaid is a joint federal-state program, which means that both the federal governme
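
The log lines above report 28 seconds for the call without explicit parameters and 20 seconds for the call with `max_tokens` set to 300. To compare settings more systematically, each call can be timed. The sketch below makes no assumptions about the endpoint itself: `timed` and `compare` are hypothetical helpers, and `fake_predict` is a stand-in for `model_garden_predict` so that the example runs without a deployed model.

```python
import asyncio
import time
from typing import Any, Awaitable, Dict, List, Tuple


async def timed(awaitable: Awaitable[Any]) -> Tuple[Any, float]:
    """Await `awaitable` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start


async def fake_predict(prompt: str, parameters: dict) -> str:
    """Stand-in for a real prediction call; it only sleeps briefly."""
    await asyncio.sleep(0.01)
    return f"{prompt} (max_tokens={parameters['max_tokens']})"


async def compare(prompt: str, token_limits: List[int]) -> Dict[int, float]:
    """Time one call per max_tokens value and return {limit: seconds}."""
    timings = {}
    for limit in token_limits:
        _, elapsed = await timed(fake_predict(prompt, {"max_tokens": limit}))
        timings[limit] = elapsed
    return timings
```

In a notebook cell, `await compare(prompt, [100, 300])` runs the comparison; in a plain script, wrap the call in `asyncio.run(...)`. To measure the real endpoint, replace `fake_predict` with a call to `model_garden_predict`.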