# Use Azure API with Meta Llama 3.1 - 8b Instruct NIM in Azure AI Foundry and Azure ML

This notebook shows examples of how to use the Meta Llama 3.1 8B Instruct NIM API offered by Microsoft Azure. We will cover:
* HTTP request usage for text completion and chat completion from the CLI
* HTTP request usage for chat completion, including streaming, in Python



## Prerequisites

Before we start, there are certain steps we need to take to deploy the models:

* Register for a valid Azure account with a subscription
* Make sure you have access to [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-studio/what-is-ai-studio?tabs=home)
* Create a project and resource group
* Select the NVIDIA NIM model Meta Llama 3.1 - 8b Instruct NIM from the Model catalog


![nim-models.png](nim-models.png)

Once the model is deployed successfully, you should be assigned an API endpoint and a security key for inference.


## HTTP Requests API Usage in CLI

### Basics

To use the REST API, you need an endpoint URL and the authentication key associated with that endpoint.  
Both can be acquired from the previous steps.  

In this text completion example, we use a simple curl call for illustration. There are three major components:  

* The `host-url` is your endpoint URL with the completion path appended.
* The `headers` define the content type as well as your API key.
* The `payload` or `data` holds your prompt and the model hyperparameters.

In [None]:
!curl -X POST -L https://<endpoint>.<region>.inference.ml.azure.com/v1/completions -H 'Content-Type: application/json' -H 'Authorization: Bearer your-auth-key' -d '{"prompt": "Math is a", "max_tokens": 30, "temperature": 0.7}'
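
The three components described above can also be assembled in Python and inspected before anything is sent. The following is a minimal sketch using only the standard library; the endpoint value is a placeholder, and the `Bearer` prefix on the key is an assumption that follows the Python examples later in this notebook.

```python
import json
import urllib.request

# Placeholder values (assumptions): substitute your own endpoint URL and key.
endpoint = "https://my-endpoint.my-region.inference.ml.azure.com"
api_key = "your-auth-key"

# 1. host-url: the endpoint plus the text completion path
host_url = endpoint + "/v1/completions"

# 2. headers: content type and API key
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

# 3. payload: the prompt and the sampling parameters
payload = {"prompt": "Math is a", "max_tokens": 30, "temperature": 0.7}

# Build the request object; nothing is sent until urlopen() is called on it.
request = urllib.request.Request(
    host_url,
    data=json.dumps(payload).encode("utf-8"),
    headers=headers,
    method="POST",
)

print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Passing `request` to `urllib.request.urlopen` would send the same call as the curl command.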

For chat completion, the API schema and request payload are slightly different.

The path in `host-url` changes to `/v1/chat/completions`, and the request payload now carries a list of messages, each tagged with a conversation role. Here is a sample payload:  

```
{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "Hello!",
      "role": "user"
    }
  ],
  "max_tokens": 50
}
```
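
Building this payload in code rather than by hand helps avoid JSON syntax slips. The sketch below uses a hypothetical helper, `build_chat_payload` (not part of any library), and serializes the result with `json.dumps` so the output is always valid JSON.

```python
import json


def build_chat_payload(system_prompt, user_prompt, max_tokens=50):
    """Return a chat completion request body as a Python dict.

    Hypothetical helper for illustration; the field names mirror the
    sample payload above.
    """
    return {
        "messages": [
            {"content": system_prompt, "role": "system"},
            {"content": user_prompt, "role": "user"},
        ],
        "max_tokens": max_tokens,
    }


payload = build_chat_payload("You are a helpful assistant.", "Hello!")
body = json.dumps(payload)  # JSON string, usable as the -d argument of curl
print(body)
```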

Here is a sample curl call for chat completion:

In [None]:
!curl -X POST -L https://<endpoint>.<region>.inference.ml.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: Bearer your-auth-key' -d '{"messages":[{"content":"You are a helpful assistant.","role":"system"},{"content":"What is good about Wuhan?","role":"user"}], "max_tokens": 50}'

If you compare the generation results of the text and chat completion API calls, you will notice that:  

* Text completion returns a list of `choices` for the input prompt, each containing the generated text and completion information such as `logprobs`.
* Chat completion returns a list of `choices`, each with a `message` object that holds the completion result and has the same structure as the `message` objects in the request.  
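
The difference shows up in how the generated text is read from each response. The sketch below assumes OpenAI-style field names (`text` for text completion, `message.content` for chat completion), which this notebook does not confirm. The sample responses are hand-written placeholders, not real model output.

```python
def first_completion_text(response):
    """Return the generated text of the first choice in a text completion response."""
    return response["choices"][0]["text"]  # "text" is an assumed field name


def first_chat_reply(response):
    """Return the assistant message content of the first choice in a chat response."""
    return response["choices"][0]["message"]["content"]  # assumed field names


# Hand-written placeholder responses for illustration only.
text_response = {
    "choices": [{"index": 0, "text": "placeholder completion", "logprobs": None}]
}
chat_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "placeholder reply"}}
    ]
}

print(first_completion_text(text_response))
print(first_chat_reply(chat_response))
```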




### Streaming

The API also supports streaming. With streaming, generated tokens are sent as data-only server-sent events as soon as they become available. This matters for interactive applications such as chatbots, where the user should see output right away instead of waiting for the full response.  

To use streaming, set `"stream": true` in the request payload.  
In streaming mode, the REST API response has a different format from non-streaming mode.

Here is an example: 

In [None]:
!curl -X POST -L https://<endpoint>.<region>.inference.ml.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: Bearer your-auth-key' -d '{"messages":[{"content":"You are a helpful assistant.","role":"system"},{"content":"What is good about Wuhan?","role":"user"}], "max_tokens": 500, "stream": true}'

As you can see, the result comes back as a stream of `data` objects, each containing generation details, including a `choices` entry.  
The stream is terminated by a `data: [DONE]\n\n` message.
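
A stream in this format can be reassembled line by line. The following is a minimal sketch, assuming each event carries OpenAI-style chunks where new text sits under `choices[].delta.content`; those field names are an assumption, and `iter_stream_text` is a hypothetical helper. The sample lines are hand-written, not captured output.

```python
import json


def iter_stream_text(lines):
    """Yield generated text fragments from server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        event = json.loads(data)
        for choice in event.get("choices", []):
            # "delta" and "content" are assumed field names
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events for illustration only.
sample_lines = [
    b'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Hel"}}]}',
    b"",
    b'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    b"",
    b"data: [DONE]",
]

print("".join(iter_stream_text(sample_lines)))
```

The generator accepts either `str` or `bytes` lines, so it can also consume the byte lines produced by `resp.iter_lines()` in the `requests` streaming example later in this notebook.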

## HTTP Requests API Usage in Python

Besides calling the API directly from command-line tools, you can also call it programmatically in Python.  

Here is an example that sends a chat completion request with the `requests` library:




In [None]:
import requests
import json
from urllib.parse import urljoin

base_url = "https://<endpoint>.<region>.inference.ml.azure.com/"

token = "key"
model_name = "meta/llama-3.1-8b-instruct"
url = urljoin(base_url, "v1/chat/completions")
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {token}",  # modify this token
    "Content-Type": "application/json",
}
data = {
    "messages": [
        {
            "content": "You are a polite and respectful chatbot helping people plan a vacation.",
            "role": "system",
        },
        {"content": "What should I do for a 4 day vacation in Spain?", "role": "user"},
    ],
    "model": model_name,
    "max_tokens": 500,
    "top_p": 1,
    "n": 1,
    "stream": False,
    # "stop": "\n",
    "frequency_penalty": 0.0,
}

response = requests.post(url, headers=headers, json=data)
# Pretty print the JSON response
print(json.dumps(response.json(), indent=4))

The same request can be made with only the standard library's `urllib`. Here is a quick example:

In [None]:
import urllib.request
import json

# Configure payload data sending to API endpoint
data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is good about Wuhan?"},
    ],
    "max_tokens": 500,
    "temperature": 0.9,
    "stream": False,
}

body = str.encode(json.dumps(data))

# Replace the url with your API endpoint
url = "https://your-endpoint.inference.ml.azure.com/v1/chat/completions"

# Replace this with the key for the endpoint
api_key = "your-auth-key"

if not api_key:
    raise Exception("API Key is missing")

headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {api_key}",  # modify this token
    "Content-Type": "application/json",
}
req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)
    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", "ignore"))

However, in this example `response.read()` returns the content as a single payload. Even with streaming enabled in the request, the result would not arrive as a series of data events. To build true streaming on top of the API endpoint, we use the [`requests`](https://requests.readthedocs.io/en/latest/) library instead.

### Streaming in Python

The `requests` library is a simple HTTP library for Python built on [`urllib3`](https://github.com/urllib3/urllib3). It automatically handles keep-alive and HTTP connection pooling. With the `Session` class, we can easily stream the result of our API calls.  

Here is a quick example:

In [None]:
import json
import requests

data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is good about Wuhan?"},
    ],
    "max_tokens": 500,
    "temperature": 0.9,
    "stream": True,
}


def post_stream(url):
    s = requests.Session()
    api_key = "key"
    # headers = {"Content-Type": "application/json", "Authorization": (api_key)}
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {api_key}",  # modify this token
        "Content-Type": "application/json",
    }
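    # stream=True keeps the connection open so lines can be read as they arrive
    with s.post(url, data=json.dumps(data), headers=headers, stream=True) as resp:
        print(resp.status_code)
        for line in resp.iter_lines():
            if line:  # skip the blank lines that separate events
                print(line.decode("utf-8"))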


url = "https://<endpoint>.<region>.inference.ml.azure.com/v1/chat/completions"
post_stream(url)