---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

---


# Hugging Face Large Model Inference - TGI

This notebook demonstrates how to deploy common large language models, such as flan-t5-xxl and LLaMa, using the Hugging Face Text Generation Inference (TGI) Deep Learning Container on Amazon SageMaker.

TGI is an open-source, high-performance inference library that can be used to deploy large language models from Hugging Face’s repository in minutes. The library includes advanced functionality like model parallelism and dynamic batching to simplify production inference with large language models like flan-t5-xxl, LLaMa, StableLM, and GPT-NeoX.

## Setup

### Install the SageMaker Python SDK

First, make sure that the latest version of the SageMaker Python SDK is installed.

In [None]:
%pip install "sagemaker>=2.163.0" boto3 --upgrade --quiet

### Set up account and role

Then, we import the SageMaker Python SDK and instantiate a `sagemaker_session`, which we use to determine the current region, the execution role, and the default S3 bucket.

In [None]:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time
from datetime import datetime, timedelta
import boto3
import json

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

## Retrieve the LLM Image URI

We use the helper function `get_huggingface_llm_image_uri()` to generate the appropriate image URI for the Hugging Face Large Language Model (LLM) inference.

The function takes a required parameter `backend` and several optional parameters. `backend` selects the inference backend and can be either "lmi" or "huggingface": "lmi" stands for the SageMaker LMI inference backend, and "huggingface" refers to the Hugging Face TGI inference backend.

In [None]:
image_uri = get_huggingface_llm_image_uri(backend="huggingface", region=region, version='0.8.2')  # or lmi
image_uri

## Create the Hugging Face Model

Next we configure the `model` object by selecting one of the models defined in `models.json` for the managed TGI container. For each model, this file sets a number of environment variables, including `HF_MODEL_ID`, which identifies the model on the Hugging Face Hub that will be deployed, and `HF_TASK`, which configures the inference task to be performed by the model.

The file also defines `SM_NUM_GPUS`, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. Here, you should set `SM_NUM_GPUS` to the number of available GPUs on your selected instance type.  
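
As a rough illustration of matching `SM_NUM_GPUS` to the instance, the sketch below compares the configured value against a small, hand-written table of GPU counts. The table `GPUS_PER_INSTANCE` and the helper `check_num_gpus` are hypothetical names introduced here, and the table covers only a few instance types; consult the instance documentation for the type you actually plan to use.

```python
# GPU counts for a few SageMaker instance types (illustrative subset,
# not exhaustive; this table is an assumption of this sketch).
GPUS_PER_INSTANCE = {
    "ml.g5.xlarge": 1,
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
    "ml.g4dn.12xlarge": 4,
}


def check_num_gpus(env, instance_type):
    """Return True if SM_NUM_GPUS equals the GPU count listed for instance_type.

    Raises KeyError if SM_NUM_GPUS is missing from env or if the instance
    type is not in the illustrative table above.
    """
    requested = int(env["SM_NUM_GPUS"])
    return requested == GPUS_PER_INSTANCE[instance_type]


# A 4-way sharded model fits a 4-GPU instance but not a 1-GPU instance.
print(check_num_gpus({"SM_NUM_GPUS": "4"}, "ml.g5.12xlarge"))
print(check_num_gpus({"SM_NUM_GPUS": "4"}, "ml.g5.2xlarge"))
```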

Additionally, we could reduce the memory footprint of the model by setting the `HF_MODEL_QUANTIZE` environment variable to `bitsandbytes` or `gptq`.

Note that to download the `starcoder` model, we need to set the `HUGGING_FACE_HUB_TOKEN` environment variable. Refer to [User access tokens](https://huggingface.co/docs/hub/security-tokens) to create an access token.
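
To make the environment variables described above concrete, here is a minimal sketch that assembles such an `env` dictionary in code. The helper `build_env` and the example model ID are assumptions for illustration; the actual entries in `models.json` may be structured differently.

```python
def build_env(model_id, num_gpus, task=None, quantize=None, hub_token=None):
    """Assemble a container environment as a dict of strings.

    Hypothetical helper: it only mirrors the variables named in the text
    (HF_MODEL_ID, HF_TASK, SM_NUM_GPUS, HF_MODEL_QUANTIZE,
    HUGGING_FACE_HUB_TOKEN).
    """
    env = {"HF_MODEL_ID": model_id, "SM_NUM_GPUS": str(num_gpus)}
    if task is not None:
        env["HF_TASK"] = task
    if quantize is not None:
        # The text names these two quantization options.
        if quantize not in ("bitsandbytes", "gptq"):
            raise ValueError(f"unsupported quantization: {quantize!r}")
        env["HF_MODEL_QUANTIZE"] = quantize
    if hub_token:
        env["HUGGING_FACE_HUB_TOKEN"] = hub_token
    return env


# Example values only; choose the model and GPU count for your own use case.
example_env = build_env("google/flan-t5-xxl", num_gpus=4, quantize="bitsandbytes")
print(example_env)
```

Environment variable values are kept as strings, which matches the `== '4'` comparison against `SM_NUM_GPUS` later in this notebook.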

In [None]:
# The `with` block closes the file automatically.
with open("models.json") as f:
    _MODEL_CONFIG_ = json.load(f)

In [None]:
%pip install ipywidgets --quiet

In [None]:
import ipywidgets as widgets

model_dropdown = widgets.Dropdown(
    options=_MODEL_CONFIG_.keys()
)
model_dropdown

In [None]:
from getpass import getpass

model_id = model_dropdown.value
print(f"The selected model is: {model_id}")
if "HUGGING_FACE_HUB_TOKEN" in _MODEL_CONFIG_[model_id]['env']:
    # getpass keeps the token from being echoed into the notebook output
    token = getpass("This model requires a token from the Hugging Face Hub. Please enter it:")
    _MODEL_CONFIG_[model_id]['env']['HUGGING_FACE_HUB_TOKEN'] = token

In [None]:
_MODEL_CONFIG_[model_id]

In [None]:
model_name = f"{model_id}-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
model = HuggingFaceModel(name=model_name, 
                         env=_MODEL_CONFIG_[model_id]['env'], 
                         role=role, 
                         image_uri=image_uri)

## Creating a SageMaker Endpoint

Next we deploy the model by invoking the `deploy()` function. Here we use an instance type appropriate for the selected LLM, which comes with one or more NVIDIA A10 GPUs. The `SM_NUM_GPUS` environment variable indicates how many GPU devices the model will be sharded across.

In [None]:
predictor = model.deploy(
    initial_instance_count=1, 
    instance_type=_MODEL_CONFIG_[model_id]['instance type'], 
    endpoint_name=model_name,
    container_startup_health_check_timeout=500,
)

## Running Inference

Once the endpoint is up and running, we can evaluate the model using the `predict()` function.

In [None]:
print(f"Sample input: {_MODEL_CONFIG_[model_id]['sample_input']}")

In [None]:
input_data = {
    "inputs": _MODEL_CONFIG_[model_id]['sample_input'],
    "parameters": {"do_sample": True, "max_new_tokens": 100, "temperature": 0.7, "watermark": True},
}

output = predictor.predict(input_data)

In [None]:
print(output[0]['generated_text'])
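
The request and response shapes used above can be wrapped in two small helpers. This is a sketch under the assumption that the endpoint returns a list of dictionaries with a `generated_text` key, as the cell above relies on; `build_payload` and `extract_text` are hypothetical names.

```python
def build_payload(prompt, max_new_tokens=100, temperature=0.7, do_sample=True):
    """Build a request body in the same shape as `input_data` above."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": do_sample,
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def extract_text(output):
    """Return the generated text from the first element of the response."""
    if not isinstance(output, list) or not output:
        raise ValueError("expected a non-empty list of generations")
    return output[0]["generated_text"]


# Stand-in response for illustration; a real one comes from predictor.predict().
fake_output = [{"generated_text": "Hello from the model."}]
print(build_payload("Say hello.")["parameters"]["max_new_tokens"])
print(extract_text(fake_output))
```

With a live endpoint, `extract_text(predictor.predict(build_payload(prompt)))` would combine the two steps.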

### Use Inference Recommender to help decide the instance type and understand the model performance

Uncomment the cell below if you would like to provide your own input for the load test. Otherwise, skip it and run the following cell to use the prepared `payload.json` file as the input.

In [None]:
# # Serialize the request to JSON
# json_object = json.dumps(input_data, indent=4)

# # Write it to payload.json
# with open("payload.json", "w") as outfile:
#     outfile.write(json_object)

In [None]:
!tar -czvf payload.tar.gz payload.json

In [None]:
s3_location = f"s3://{bucket}/sagemaker/InferenceRecommender/{model_id}"
payload_tar_url = sagemaker.s3.S3Uploader.upload("payload.tar.gz", s3_location)
print(payload_tar_url)

Before running the Inference Recommender job, make sure that your account has enough service quota for the instance types being tested. You can specify the instance types by setting `SupportedInstanceTypes` in the `ContainerConfig` of the inference job configuration. If you don't set this parameter, SageMaker Inference Recommender will run the job against all the GPU instances that have 1 GPU, such as ml.g4dn.2xlarge, ml.g5.xlarge, ml.g5.2xlarge, ml.g4dn.4xlarge, ml.g4dn.8xlarge, ml.p2.xlarge, ml.g4dn.16xlarge, ml.g4dn.xlarge.
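
One way to restrict the candidate instances is to derive `SupportedInstanceTypes` from the model's `SM_NUM_GPUS` value. The sketch below is illustrative only: the mapping `CANDIDATES_BY_GPU_COUNT` and the helper `with_supported_instance_types` are hypothetical, and the instance lists are a small subset that you should adapt to your quotas and region.

```python
import copy

# Candidate instance types keyed by tensor-parallel degree
# (illustrative subset; an assumption of this sketch).
CANDIDATES_BY_GPU_COUNT = {
    1: ["ml.g5.xlarge", "ml.g5.2xlarge", "ml.g4dn.xlarge"],
    4: ["ml.g5.12xlarge", "ml.g4dn.12xlarge", "ml.g5.24xlarge"],
    8: ["ml.g5.48xlarge"],
}


def with_supported_instance_types(job_config, env):
    """Return a copy of job_config with SupportedInstanceTypes set when known.

    The input config is left unchanged. If SM_NUM_GPUS is absent or has no
    entry in the table, the copy is returned without the parameter so that
    Inference Recommender falls back to its default instance selection.
    """
    updated = copy.deepcopy(job_config)
    num_gpus = env.get("SM_NUM_GPUS")
    if num_gpus is None:
        return updated
    candidates = CANDIDATES_BY_GPU_COUNT.get(int(num_gpus))
    if candidates:
        updated["ContainerConfig"]["SupportedInstanceTypes"] = list(candidates)
    return updated


base_config = {"ContainerConfig": {"Task": "TEXT_GENERATION"}, "ModelName": "demo"}
print(with_supported_instance_types(base_config, {"SM_NUM_GPUS": "4"}))
```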

In [None]:
job_name = f"{model_id}-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sm_client = boto3.client('sagemaker')

inference_job_config = {
    'ContainerConfig': {
        'Domain': 'NATURAL_LANGUAGE_PROCESSING',
        'Task': 'TEXT_GENERATION',
        'PayloadConfig': {
            'SamplePayloadUrl': payload_tar_url,
            'SupportedContentTypes': ["application/json"],
        },
        'SupportedEndpointType': 'RealTime',
    },
    'ModelName': model_name,
}

if 'SM_NUM_GPUS' in _MODEL_CONFIG_[model_id]['env'].keys() and _MODEL_CONFIG_[model_id]['env']['SM_NUM_GPUS'] == '4':
    inference_job_config['ContainerConfig']['SupportedInstanceTypes'] = ['ml.g5.12xlarge', 'ml.g4dn.12xlarge',  'ml.g5.24xlarge',]

response = sm_client.create_inference_recommendations_job(
    JobName=job_name,
    JobType='Default',
    RoleArn=role,
    InputConfig=inference_job_config
)

Note:
If the above code fails, please install the latest boto3 and restart your kernel.

In [None]:
inference_job_config

In [None]:
describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)

while describe_IR_job_response["Status"] in ["IN_PROGRESS", "PENDING"]:
    describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)
    print(describe_IR_job_response["Status"])
    time.sleep(15)
    
print(f'Inference Recommender job {job_name} has finished with status {describe_IR_job_response["Status"]}.')
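
The polling loop above runs until the job leaves the `PENDING` and `IN_PROGRESS` states. A variant with an upper time limit might look like the sketch below; `wait_for_job` and its parameters are hypothetical, and the `describe` argument stands in for `sm_client.describe_inference_recommendations_job`. A stub is used here so the example runs without AWS access.

```python
import time


def wait_for_job(describe, job_name, poll_seconds=15, timeout_seconds=3600,
                 sleep=time.sleep):
    """Poll describe(JobName=...) until the job finishes or the timeout elapses.

    Returns the final describe response. Raises TimeoutError if the job is
    still PENDING or IN_PROGRESS once timeout_seconds have passed.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe(JobName=job_name)
        if response["Status"] not in ("PENDING", "IN_PROGRESS"):
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_name} did not finish in time")
        sleep(poll_seconds)


# Stub that walks through a fixed sequence of statuses.
_statuses = iter(["PENDING", "IN_PROGRESS", "COMPLETED"])


def fake_describe(JobName):
    return {"Status": next(_statuses)}


final = wait_for_job(fake_describe, "demo-job", sleep=lambda seconds: None)
print(final["Status"])
```

Against the real service, the call would be `wait_for_job(sm_client.describe_inference_recommendations_job, job_name)`.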

Now, let's use the inference recommender job results to calculate the approximate invocation cost for the LLM endpoint.

In [None]:
describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)
# Treat the job as failed whenever no recommendations were returned.
failed = "InferenceRecommendations" not in describe_IR_job_response
if failed:
    reason = describe_IR_job_response.get("FailureReason", "unknown")
    print(f"Inference recommender job failed with reason: {reason}")
else:
    print(describe_IR_job_response['InferenceRecommendations'])

The Inference Recommender job reports the following metrics, among others:
- 'ModelLatency'
- 'CostPerInference'
- 'CostPerHour'
- 'MaxInvocations' per minute

Note that the sample JSON input file consists of 6,200 characters, which is around 1,550 tokens per invocation (1 token is approximately 4 characters). To estimate the cost per 1K tokens, run inference many times with an average-sized payload and take the best tokens-per-second throughput you observe (different instance types can result in different throughput, model latency, and cost). Then divide the per-second instance cost by that throughput to get the cost per token, and multiply by 1,000. Alternatively, divide the per-invocation cost by the number of tokens per invocation and multiply by 1,000. The two estimates should be similar. SageMaker also supports auto-scaling, which scales your endpoint out or in based on the invocation traffic pattern to save cost.
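
The arithmetic described above can be written out as a short worked example. The function names are hypothetical, and the hourly cost and invocation rate are made-up numbers chosen only to make the calculation easy to follow; they are not measured results.

```python
import math


def tokens_per_invocation(payload_chars, chars_per_token=4):
    """Estimate tokens from characters (about 4 characters per token)."""
    return payload_chars / chars_per_token


def cost_per_1k_from_throughput(max_invocations_per_minute, cost_per_hour, tokens_per_call):
    """Cost per 1K tokens from peak throughput and the hourly instance cost."""
    tokens_per_second = max_invocations_per_minute * tokens_per_call / 60
    cost_per_second = cost_per_hour / 3600
    return cost_per_second / tokens_per_second * 1000


def cost_per_1k_from_invocation(cost_per_inference, tokens_per_call):
    """Cost per 1K tokens from the per-invocation cost."""
    return cost_per_inference / tokens_per_call * 1000


tokens = tokens_per_invocation(6200)  # 1550.0 tokens for the sample payload

# Hypothetical inputs: $6.00 per hour and 60 invocations per minute.
by_throughput = cost_per_1k_from_throughput(60, 6.0, tokens)

# Assumption: at full utilization, cost per inference is the hourly cost
# divided by the number of invocations served in an hour.
cost_per_inference = 6.0 / (60 * 60)
by_invocation = cost_per_1k_from_invocation(cost_per_inference, tokens)

print(tokens)
print(math.isclose(by_throughput, by_invocation))
```

Under that full-utilization assumption the two routes give the same figure; with real measurements they will typically differ slightly.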


In [None]:
if not failed:
    for recommendation in describe_IR_job_response['InferenceRecommendations']:
        metrics = recommendation['Metrics']
        instance_type = recommendation['EndpointConfiguration']['InstanceType']
        # 1550 is the approximate token count of the sample payload
        token_per_sec = metrics['MaxInvocations'] * 1550 / 60
        cost_per_sec = metrics['CostPerHour'] / 3600
        # Round only when printing so intermediate precision is not lost
        cost_per_1k_token = cost_per_sec / token_per_sec * 1000
        print(f"According to the Inference Recommender job, the corresponding metrics for hosting the model on instance type {instance_type} are as below:")
        print(f"Max tokens per second is about {token_per_sec:.2f}")
        print(f"Cost per second is about ${cost_per_sec:.5f}")
        print(f"Cost per 1k tokens is about ${cost_per_1k_token:.5f}")

## Cleaning Up

After you've finished using the endpoint, it's important to delete it to avoid incurring unnecessary costs.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()

## Conclusion

In this tutorial, we used a TGI container to deploy large language models on an appropriate SageMaker instance. With Hugging Face's Text Generation Inference and SageMaker Hosting, you can easily host large language models like GPT-NeoX, flan-t5-xxl, and LLaMa.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|generativeai|huggingfacetgi|tgi-hosting|sagemaker-huggingface-tgi-hosting-examples.ipynb)
