# How to create an Azure AI Content Safety enabled Llama 2 online endpoint (Preview)
### This notebook will walk you through the steps to create an __Azure AI Content Safety__ enabled __Llama 2__ online endpoint.
### This notebook is under preview
### The steps are:
1. Create an __Azure AI Content Safety__ resource to moderate both the request from the user and the response from the __Llama 2__ online endpoint.
2. Create a new __Azure AI Content Safety__ enabled __Llama 2__ online endpoint with a custom score.py that integrates with the __Azure AI Content Safety__ resource to moderate the request from the user and the response from the __Llama 2__ model.

For the custom score.py to authenticate successfully to the __Azure AI Content Safety__ resource, create a User Assigned Identity (UAI) and assign the appropriate roles to it. The custom score.py can then obtain the UAI's access token from the AAD server to access the Azure AI Content Safety resource. Use [this notebook](aacs-prepare-uai.ipynb) to create the UAI before step 3 below.

### 1. Prerequisites
#### 1.1 Check List:
- [x] You have created a new Python virtual environment for this notebook.
- [x] The identity you are using to execute this notebook (yourself or your VM) needs to have the __Contributor__ role on the resource group where the AML Workspace you specified is located, because this notebook will create an Azure AI Content Safety resource using that identity.

#### 1.2 Assign variables for the workspace and deployment

In [None]:
# Name of the public registry that contains the Llama 2 models
registry_name = "azureml-meta"

# Name of the Llama 2 model to be deployed
# available_llama_models_text_generation = ["Llama-2-7b", "Llama-2-13b", "Llama-2-70b"]
# available_llama_models_chat_complete = ["Llama-2-7b-chat", "Llama-2-13b-chat", "Llama-2-70b-chat"]
model_name = "Llama-2-7b"

endpoint_name = f"{model_name}-test-ep"  # Replace with your endpoint name
deployment_name = "llama"  # Replace with your deployment name, lower case only!!!
sku_name = "Standard_NC24s_v3"  # SKU (instance type). Check the model list in the parent folder (inference) for the optimal SKU for your model (default: Standard_DS2_v2)

# The severity level at which a request or response is blocked
# See the Azure AI Content Safety documentation for more details
# https://learn.microsoft.com/en-us/azure/cognitive-services/content-safety/concepts/harm-categories
content_severity_threshold = "2"

# UAI to be used for endpoint if you choose to use UAI as authentication method
uai_name = ""  # default to "aacs-uai" in prepare uai notebook
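
To make the role of `content_severity_threshold` more concrete, here is a minimal sketch of how a scoring script might compare per-category severities against it. The function name `should_block`, the category keys, and the `>=` comparison are assumptions for illustration; the actual score.py may apply a different rule. The threshold is kept as a string in this notebook because it is later passed as an environment variable, so the sketch converts it to an integer first.

```python
def should_block(category_severities, threshold):
    """Return True if any category severity reaches the threshold.

    `threshold` may arrive as a string (environment variables are strings),
    so it is converted to int before comparing.
    """
    limit = int(threshold)
    return any(severity >= limit for severity in category_severities.values())


# Hypothetical analysis results, keyed by harm category.
benign = {"Hate": 0, "SelfHarm": 0, "Sexual": 0, "Violence": 0}
violent = {"Hate": 0, "SelfHarm": 0, "Sexual": 0, "Violence": 4}

print(should_block(benign, "2"))
print(should_block(violent, "2"))
```

With a lower threshold more content is blocked; with a higher one only more severe content is blocked.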

#### 1.3 Install Dependencies (as needed)

In [None]:
# uncomment the following lines to install the required packages
# %pip install azure-identity==1.13.0
# %pip install azure-mgmt-cognitiveservices==13.4.0
# %pip install "azure-ai-ml>=1.23.1"
# %pip install azure-mgmt-msi==7.0.0
# %pip install azure-mgmt-authorization==3.0.0

#### 1.4 Get credential

In [None]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

#### 1.5 Configure workspace 

In [None]:
from azure.ai.ml import MLClient

try:
    ml_client = MLClient.from_config(credential=credential)
except Exception as ex:
    # enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"

    # get a handle to the workspace
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

subscription_id = ml_client.subscription_id
resource_group = ml_client.resource_group_name
workspace = ml_client.workspace_name

print(f"Connected to workspace {workspace}")
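
`MLClient.from_config` reads the workspace details from a `config.json` file. As a rough sketch, the file could be generated as shown below so that the fallback branch is not needed. The helper name `write_workspace_config` is hypothetical, the placeholder values must be replaced with your own, and the example writes to a temporary directory only to stay side-effect free.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three keys MLClient.from_config expects."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Write to a temporary directory for demonstration purposes.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(
        tmp, "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>"
    )
    written = json.loads(config_path.read_text())
    print(written)
```

In practice you would write the file next to the notebook (or download it from the Azure portal) rather than to a temporary directory.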

#### 1.6 Assign variables for Azure Content Safety
Currently, Azure AI Content Safety is available in a limited set of regions. The cell below prints the locations offered for each `ContentSafety` SKU.

__NOTE__: before you choose the region to deploy the Azure AI Content Safety, please be aware that your data will be transferred to the region you choose and by selecting a region outside your current location, you may be allowing the transmission of your data to regions outside your jurisdiction. It is important to note that data protection and privacy laws may vary between jurisdictions. Before proceeding, we strongly advise you to familiarize yourself with the local laws and regulations governing data transfer and ensure that you are legally permitted to transmit your data to an overseas location for processing. By continuing with the selection of a different region, you acknowledge that you have understood and accepted any potential risks associated with such data transmission. Please proceed with caution.

In [None]:
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

acs_client = CognitiveServicesManagementClient(credential, subscription_id)


# Settings for the Azure AI Content Safety (AACS) resource.
# An existing AACS resource is reused if one is found; otherwise a new one is created.
# The name of the AACS resource has to be unique.
import time

aacs_name = f"{endpoint_name}-aacs-{str(time.time()).replace('.','')}"
available_aacs_locations = ["east us", "west europe"]

# create a new Cognitive Services Account
kind = "ContentSafety"
aacs_sku_name = "S0"
aacs_location = available_aacs_locations[0]


print("Available SKUs:")
aacs_skus = acs_client.resource_skus.list()
print("SKU Name\tSKU Tier\tLocations")
for sku in aacs_skus:
    if sku.kind == "ContentSafety":
        locations = ",".join(sku.locations)
        print(sku.name + "\t" + sku.tier + "\t" + locations)

print(
    f"Choose a new Azure AI Content Safety resource in {aacs_location} with SKU {aacs_sku_name}"
)

### 2. Create Azure AI Content Safety

In [None]:
from azure.mgmt.cognitiveservices.models import Account, Sku, AccountProperties


parameters = Account(
    sku=Sku(name=aacs_sku_name),
    kind=kind,
    location=aacs_location,
    properties=AccountProperties(
        custom_sub_domain_name=aacs_name, public_network_access="Enabled"
    ),
)
# How many seconds to wait between checking the status of an async operation.
wait_time = 10


def find_acs(accounts):
    return next(
        x
        for x in accounts
        if x.kind == "ContentSafety"
        and x.location == aacs_location
        and x.sku.name == aacs_sku_name
    )


try:
    # check if AACS exists
    aacs = acs_client.accounts.get(resource_group, aacs_name)
    print(f"Found existing Azure AI content safety Account {aacs.name}.")
except Exception:
    try:
        # check if there is an existing AACS resource within same resource group
        aacs = find_acs(acs_client.accounts.list_by_resource_group(resource_group))
        print(
            f"Found existing Azure AI content safety Account {aacs.name} in resource group {resource_group}."
        )
    except Exception:
        print(f"Creating Azure AI content safety Account {aacs_name}.")
        acs_client.accounts.begin_create(resource_group, aacs_name, parameters).wait()
        print("Resource created.")
        aacs = acs_client.accounts.get(resource_group, aacs_name)


aacs_endpoint = aacs.properties.endpoint
aacs_resource_id = aacs.id
aacs_name = aacs.name
print(
    f"AACS name is {aacs.name}, use this name in UAI preparation notebook to create UAI."
)
print(f"AACS endpoint is {aacs_endpoint}")
print(f"AACS ResourceId is {aacs_resource_id}")

### 3. Create Azure AI Content Safety enabled Llama 2 online endpoint

#### 3.1 Check if Llama 2 model is available in the AML registry.

In [None]:
reg_client = MLClient(
    credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    registry_name=registry_name,
)
version_list = list(
    reg_client.models.list(model_name)
)  # list available versions of the model
llama_model = None

# If specific inference environments are tagged for the model
inference_envs_exist = False

if len(version_list) == 0:
    raise Exception(f"No model named {model_name} found in registry")
else:
    model_version = version_list[0].version
    llama_model = reg_client.models.get(model_name, model_version)
    if (
        "inference_supported_envs" in llama_model.tags
        and len(llama_model.tags["inference_supported_envs"]) >= 1
    ):
        inference_envs_exist = True
    print(
        f"Using model name: {llama_model.name}, version: {llama_model.version}, id: {llama_model.id} for inferencing"
    )
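
The cell above takes the first entry returned by the registry as the version to deploy. If you would rather not depend on the order of that list, one option is to pick the highest version explicitly. The sketch below assumes version strings are integers, and `ModelInfo` is a hypothetical stand-in for the model objects returned by the registry client.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by reg_client.models.list().
ModelInfo = namedtuple("ModelInfo", ["name", "version"])


def pick_latest(models):
    """Return the entry with the highest numeric version."""
    return max(models, key=lambda m: int(m.version))


versions = [
    ModelInfo("Llama-2-7b", "9"),
    ModelInfo("Llama-2-7b", "12"),
    ModelInfo("Llama-2-7b", "3"),
]
latest = pick_latest(versions)
print(latest.version)  # 12
```

Comparing as integers matters here: a plain string comparison would rank "9" above "12".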

#### 3.2 Check if UAI is used

In [None]:
uai_id = ""
uai_client_id = ""
if uai_name != "":
    from azure.mgmt.msi import ManagedServiceIdentityClient
    from azure.mgmt.msi.models import Identity

    msi_client = ManagedServiceIdentityClient(
        subscription_id=subscription_id,
        credential=credential,
    )
    uai_resource = msi_client.user_assigned_identities.get(resource_group, uai_name)
    uai_id = uai_resource.id
    uai_client_id = uai_resource.client_id
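
The `uai_client_id` obtained here is later passed to the deployment as the `UAI_CLIENT_ID` environment variable. As a rough illustration of how a scoring script could turn that identity into a request header, the sketch below builds a bearer `Authorization` header from any credential object that exposes `get_token`. `build_auth_header` and `FakeCredential` are hypothetical names; in a real deployment the credential could be `azure.identity.ManagedIdentityCredential(client_id=...)`, and the actual score.py may be implemented differently.

```python
from collections import namedtuple

# Scope commonly used for Azure AI services with token-based authentication.
COGNITIVE_SERVICES_SCOPE = "https://cognitiveservices.azure.com/.default"


def build_auth_header(credential, scope=COGNITIVE_SERVICES_SCOPE):
    """Return an Authorization header using a bearer token from `credential`."""
    access_token = credential.get_token(scope)
    return {"Authorization": f"Bearer {access_token.token}"}


# Stand-in credential so the sketch runs without Azure access.
FakeToken = namedtuple("FakeToken", ["token", "expires_on"])


class FakeCredential:
    def get_token(self, *scopes):
        return FakeToken(token="example-token", expires_on=0)


header = build_auth_header(FakeCredential())
print(header)
```

Because the helper only relies on `get_token`, the same code path can be exercised locally with a stand-in and in the deployment with the managed identity.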

#### 3.3 Create Llama 2 online endpoint
This step may take a few minutes.

#### Create endpoint

In [None]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

# Check if the endpoint already exists in the workspace
try:
    endpoint = ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except Exception:
    # Create an online endpoint if it doesn't exist

    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description="Test endpoint for model",
        identity=IdentityConfiguration(
            type="user_assigned",
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],
        )
        if uai_id != ""
        else None,
    )

    # Trigger the endpoint creation
    try:
        ml_client.begin_create_or_update(endpoint).wait()
        print("\n---Endpoint created successfully---\n")
    except Exception as err:
        raise RuntimeError(
            f"Endpoint creation failed. Detailed Response:\n{err}"
        ) from err

#### 3.4 Setup Deployment Parameters

We utilize an optimized __foundation-model-inference__ container for model scoring. This container is designed to deliver high throughput and low latency. In this section, we introduce several environment variables that can be adjusted to customize a deployment for either high throughput or low latency scenarios.

- __ENGINE_NAME__: Selects the inferencing framework used in the scoring script. For Llama-2 models, if ENGINE_NAME = 'mii', the container runs inference with the new [DeepSpeed-FastGen](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen). Alternatively, if ENGINE_NAME = 'vllm', the container runs inference with [vLLM](https://vllm.readthedocs.io/en/latest/), which is also the default.
- __WORKER_COUNT__: The number of workers to use for inferencing. This is used as a proxy for the number of concurrent requests that the server should handle.
- __TENSOR_PARALLEL__: The number of GPUs to use for tensor parallelism.
- __NUM_REPLICAS__: The number of model instances to load for the deployment. This is used to increase throughput by loading multiple models on multiple GPUs, if the model is small enough to fit.

`NUM_REPLICAS` and `TENSOR_PARALLEL` work hand-in-hand to determine the optimal configuration for increasing the deployment's throughput without degrading latency too much. The total number of GPUs used for inference will be `NUM_REPLICAS` * `TENSOR_PARALLEL`. For example, if `NUM_REPLICAS` = 2 and `TENSOR_PARALLEL` = 2, then 4 GPUs will be used for inference.

Ensure that the model you are deploying is small enough to fit on the number of GPUs you are using, specified by `TENSOR_PARALLEL`. For instance, if there are 4 GPUs available, and `TENSOR_PARALLEL` = 2, then the model must be small enough to fit on 2 GPUs. If the model is too large, then the deployment will fail. 

__NOTE__: 
- `NUM_REPLICAS` is currently only supported by the vLLM engine.
- DeepSpeed MII Engine is only supported on A100 / H100 GPUs.
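
The GPU arithmetic described above can be checked before deploying. The following is a small sketch, not part of the container: `gpus_required` and `check_gpu_plan` are hypothetical helpers, and the figure of 4 available GPUs is an assumption that should be replaced with the GPU count of the SKU you actually choose.

```python
def gpus_required(num_replicas, tensor_parallel):
    """Total GPUs used for inference: NUM_REPLICAS * TENSOR_PARALLEL."""
    return num_replicas * tensor_parallel


def check_gpu_plan(num_replicas, tensor_parallel, gpus_available):
    """Raise ValueError if the plan needs more GPUs than the instance has."""
    required = gpus_required(num_replicas, tensor_parallel)
    if required > gpus_available:
        raise ValueError(
            f"Plan needs {required} GPUs but only {gpus_available} are available"
        )
    return required


AVAILABLE_GPUS = 4  # Assumption: replace with the GPU count of your SKU.
print(check_gpu_plan(num_replicas=2, tensor_parallel=2, gpus_available=AVAILABLE_GPUS))  # 4
```

This only checks the GPU count. Whether the model itself fits on `TENSOR_PARALLEL` GPUs still depends on the model size and GPU memory.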


In [None]:
REQUEST_TIMEOUT_MS = 90000  # the timeout for each request in milliseconds
MAX_CONCURRENT_REQUESTS = (
    128  # the maximum number of concurrent requests supported by the endpoint
)

acs_env_vars = {
    "CONTENT_SAFETY_ACCOUNT_NAME": aacs_name,
    "CONTENT_SAFETY_ENDPOINT": aacs_endpoint,
    "CONTENT_SAFETY_THRESHOLD": content_severity_threshold,
    "SUBSCRIPTION_ID": subscription_id,
    "RESOURCE_GROUP_NAME": resource_group,
    "UAI_CLIENT_ID": uai_client_id,
}

fm_container_default_env_vars = {
    "WORKER_COUNT": MAX_CONCURRENT_REQUESTS,
    "TENSOR_PARALLEL": 2,
    "NUM_REPLICAS": 2,
}

deployment_env_vars = {**fm_container_default_env_vars, **acs_env_vars}

# Uncomment the following lines to use DeepSpeed FastGen engine (experimental)
# mii_fastgen_env_vars = {
#     "ENGINE_NAME": "mii",
#     "WORKER_COUNT": MAX_CONCURRENT_REQUESTS,
# }
# deployment_env_vars = {**mii_fastgen_env_vars, **acs_env_vars}

#### 3.5 Deploy Llama 2 model
This step may take a few minutes.

In [None]:
from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
)

# For inference environments vLLM and DS MII, the scoring script is baked into the container
code_configuration = (
    CodeConfiguration(code="./llama-files/score/default/", scoring_script="score.py")
    if not inference_envs_exist
    else None
)

deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=llama_model.id,
    instance_type=sku_name,
    instance_count=1,
    code_configuration=code_configuration,
    environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(
        request_timeout_ms=REQUEST_TIMEOUT_MS,
        max_concurrent_requests_per_instance=MAX_CONCURRENT_REQUESTS,
    ),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err
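
The probe settings used above determine roughly how long the container is given to start before it is treated as failed. As a back-of-the-envelope sketch, assuming Kubernetes-style semantics (the first probe runs after `initial_delay`, then one probe every `period` seconds, giving up after `failure_threshold` consecutive failures) and ignoring per-probe timeouts, the budget can be estimated like this. `max_startup_seconds` is a hypothetical helper.

```python
def max_startup_seconds(initial_delay, period, failure_threshold):
    """Approximate time budget before repeated probe failures mark the container as failed."""
    return initial_delay + period * failure_threshold


budget = max_startup_seconds(initial_delay=500, period=100, failure_threshold=30)
print(budget)  # 3500
print(round(budget / 60, 1))  # about 58.3 minutes
```

Large models can take a long time to download and load, which is why the values in this notebook are generous. Smaller models may work with a shorter budget.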

### 4. Test the Safety Enabled Llama 2 online endpoint.

In [None]:
import os

test_src_dir = "./safety-llama-test"
os.makedirs(test_src_dir, exist_ok=True)
print(f"test script directory: {test_src_dir}")
sample_data = os.path.join(test_src_dir, "sample-request.json")

##### Choose a request from the following 4 cells based on the Llama 2 model type you chose and the scenario you want to test

In [None]:
## For text-generation models (without -chat suffix)
## Successful response

import json

with open(sample_data, "w") as f:
    json.dump(
        {
            "input_data": {
                "input_string": [
                    "Hello",
                    "My name is John and I have a dog.",
                ],
                "parameters": {
                    "temperature": 0.6,
                    "top_p": 0.6,
                    "max_new_tokens": 256,
                    "do_sample": True,
                },
            }
        },
        f,
    )

In [None]:
## For text-generation models (without -chat suffix)
## Blocked request/response due to hateful content

import json

with open(sample_data, "w") as f:
    json.dump(
        {
            "input_data": {
                "input_string": [
                    "The dog was given a eutanasa injection due to their severed leg bleding profusely from deep lacarations to the lower extremities, exposing tisssue and nerve.",  # This line contains hateful message and endpoint will return empty string as response
                ],
                "parameters": {
                    "temperature": 0.6,
                    "top_p": 0.6,
                    "max_new_tokens": 256,
                    "do_sample": True,
                },
            }
        },
        f,
    )

In [None]:
## For chat-complete models (with "-chat" suffix)
## Successful request/response

import json

with open(sample_data, "w") as f:
    json.dump(
        {
            "input_data": {
                "input_string": [
                    {
                        "role": "user",
                        "content": "What is the tallest building in the world?",
                    },
                    {
                        "role": "assistant",
                        "content": "As of 2021, the Burj Khalifa in Dubai, United Arab Emirates is the tallest building in the world, standing at a height of 828 meters (2,722 feet). It was completed in 2010 and has 163 floors. The Burj Khalifa is not only the tallest building in the world but also holds several other records, such as the highest occupied floor, highest outdoor observation deck, elevator with the longest travel distance, and the tallest freestanding structure in the world.",
                    },
                    {"role": "user", "content": "and in Africa?"},
                    {
                        "role": "assistant",
                        "content": "In Africa, the tallest building is the Carlton Centre, located in Johannesburg, South Africa. It stands at a height of 50 floors and 223 meters (730 feet). The CarltonDefault Centre was completed in 1973 and was the tallest building in Africa for many years until the construction of the Leonardo, a 55-story skyscraper in Sandton, Johannesburg, which was completed in 2019 and stands at a height of 230 meters (755 feet). Other notable tall buildings in Africa include the Ponte City Apartments in Johannesburg, the John Hancock Center in Lagos, Nigeria, and the Alpha II Building in Abidjan, Ivory Coast",
                    },
                    {"role": "user", "content": "and in Europe?"},
                ],
                "parameters": {
                    "temperature": 0.9,
                    "top_p": 0.6,
                    "do_sample": True,
                    "max_new_tokens": 100,
                },
            }
        },
        f,
    )

In [None]:
## For chat-complete models (with "-chat" suffix)
## Blocked request due to hateful content

import json

with open(sample_data, "w") as f:
    json.dump(
        {
            "input_data": {
                "input_string": [
                    {
                        "role": "user",
                        "content": "A 51-year-old man was found dead in his car. There were blood stains on the dashboard and windscreen. At autopsy, a deep, oblique, long incised injury was found on the front of the neck. It turns out that he died by suicide.",
                    },
                ],
                "parameters": {
                    "temperature": 0.9,
                    "top_p": 0.6,
                    "do_sample": True,
                    "max_new_tokens": 100,
                },
            }
        },
        f,
    )
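
The four request cells above share the same envelope: an `input_data` object containing `input_string` and `parameters`. If you plan to try several prompts, a small helper can reduce the repetition. This is only a convenience sketch; `build_request` and `DEFAULT_PARAMETERS` are hypothetical names, and the defaults simply mirror the text-generation cells.

```python
import json

# Defaults mirror the text-generation request cells.
DEFAULT_PARAMETERS = {
    "temperature": 0.6,
    "top_p": 0.6,
    "max_new_tokens": 256,
    "do_sample": True,
}


def build_request(input_string, **overrides):
    """Wrap inputs and generation parameters in the request envelope."""
    parameters = {**DEFAULT_PARAMETERS, **overrides}
    return {"input_data": {"input_string": input_string, "parameters": parameters}}


# Text-generation style input: a list of prompt strings.
text_request = build_request(["Hello", "My name is John and I have a dog."])

# Chat style input: a list of role/content messages, with overridden parameters.
chat_request = build_request(
    [{"role": "user", "content": "What is the tallest building in the world?"}],
    temperature=0.9,
    max_new_tokens=100,
)
print(json.dumps(chat_request, indent=2))
```

The resulting dictionary can be written to `sample_data` with `json.dump`, exactly as the cells above do.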

In [None]:
ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name=deployment_name,
    request_file=sample_data,
)
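
`invoke` returns the endpoint's response as a string. Since a blocked request or response is reported as an empty string, a quick check for empty text can tell you whether moderation kicked in. The sketch below is an assumption-laden example: `looks_blocked` is a hypothetical helper, and the two sample response strings are made up for illustration, so the real response shape may differ.

```python
import json


def contains_empty_string(value):
    """Recursively check whether a parsed JSON value contains an empty string."""
    if isinstance(value, str):
        return value == ""
    if isinstance(value, dict):
        return any(contains_empty_string(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_empty_string(v) for v in value)
    return False


def looks_blocked(raw_response):
    """Return True if the raw response is empty or contains an empty string."""
    if not raw_response.strip():
        return True
    return contains_empty_string(json.loads(raw_response))


# Hypothetical response bodies, for illustration only.
allowed_example = '[{"0": "Hello! Nice to meet you."}]'
blocked_example = '[{"0": ""}]'

print(looks_blocked(allowed_example))
print(looks_blocked(blocked_example))
```

To use it with the endpoint, assign the result of `ml_client.online_endpoints.invoke(...)` to a variable and pass that string to `looks_blocked`.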