# Introduction
This cookbook shows you how to properly benchmark TGI. For background and a fuller explanation, read this [popular blog post](https://huggingface.co/blog/tgi-benchmarking) first.

## Setup
Make sure you have an environment with TGI installed; Docker is a great choice. The commands here can be copied and pasted into a terminal, which might be even easier, so don't feel compelled to use Jupyter. If you just want to test this out, you can duplicate and use [derek-thomas/tgi-benchmark-space](https://huggingface.co/spaces/derek-thomas/tgi-benchmark-space).

# TGI Launcher

In [1]:
!text-generation-launcher --version

text-generation-launcher 2.2.1-dev0


Below we can see the different settings for TGI. Be sure to read through them and decide which settings are most 
important for your use-case.

Here are some of the most important ones:
- `--model-id`
- `--quantize` Quantization saves memory, but does not always improve speed.
- `--max-input-tokens` This allows TGI to optimize the prefilling operation.
- `--max-total-tokens` In combination with the above, TGI now knows the maximum number of input and output tokens.
- `--max-batch-size` This lets TGI know how many requests it can process at once.

The last three together give TGI the constraints it needs to optimize for your use-case. Setting them as tightly as your workload allows can yield significant performance improvements.
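
To make the relationship between these three flags concrete, here is a minimal Python sketch that derives the per-request output budget and assembles the launcher arguments. Only the flag names come from the list above; the numeric limits are hypothetical placeholders rather than recommendations, and the model is the one launched later in this cookbook.

```python
import shlex


def launcher_args(model_id, max_input_tokens, max_total_tokens, max_batch_size):
    """Build a text-generation-launcher argument list from the three limits."""
    # The total budget covers input plus output, so it must exceed the input limit.
    if max_total_tokens <= max_input_tokens:
        raise ValueError("max_total_tokens must be greater than max_input_tokens")
    return [
        "text-generation-launcher",
        "--model-id", model_id,
        "--max-input-tokens", str(max_input_tokens),
        "--max-total-tokens", str(max_total_tokens),
        "--max-batch-size", str(max_batch_size),
    ]


# Hypothetical limits, for illustration only.
MAX_INPUT, MAX_TOTAL, MAX_BATCH = 1024, 1536, 32

# Whatever is left after the input limit is the room available for generation.
max_output_tokens = MAX_TOTAL - MAX_INPUT
args = launcher_args(
    "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", MAX_INPUT, MAX_TOTAL, MAX_BATCH
)

print(f"Output budget per request: {max_output_tokens} tokens")
print(shlex.join(args))  # a command line you can paste into a terminal
```

Thinking in terms of the output budget makes it easier to check that the limits actually match the prompts and completions you expect in production.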

In [2]:
!text-generation-launcher -h

Text Generation Launcher

Usage: text-generation-launcher [OPTIONS]

Options:
      --model-id <MODEL_ID>
          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `gpt2` or `OpenAssistant/oasst-sft-1-pythia-12b`. Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of transformers [env: MODEL_ID=] [default: bigscience/bloom-560m]
      --revision <REVISION>
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2` [env: REVISION=]
      --validation-workers <VALIDATION_WORKERS>
          The number of tokenizer workers used for payload validation and truncation inside the router [env: VALIDATION_WORKERS=] [default: 2]
      --sharded <SHARDED>
          Whether to shard the model across multiple GPUs By default text-generation-inference will

We can launch directly from the cookbook since we don't need the command to be interactive.

We will just be using defaults in this cookbook as the intent is to understand the benchmark tool.

The following parameters are set explicitly so that, if you're running on a Space, TGI doesn't conflict with the Spaces server:
- `--hostname`
- `--port`

Feel free to change or remove them based on your requirements.

In [3]:
!RUST_BACKTRACE=1 \
text-generation-launcher \
--model-id astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \
--quantize gptq \
--hostname 0.0.0.0 \
--port 1337

2024-08-16T12:07:56.411768Z  INFO text_generation_launcher: Args {
    model_id: "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        Gptq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 1337,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0
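
Loading the model takes a while, so it helps to confirm the server is ready before benchmarking. The following is a minimal sketch, assuming the server answers on TGI's `/health` route at the hostname and port used in the launch command above. The function names, attempt count, and delay are illustrative choices, not part of TGI.

```python
import time
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts are all OSError subclasses.
        return False


def wait_until_ready(url, attempts=60, delay_s=5.0, probe=http_probe, sleep=time.sleep):
    """Poll `url` until the probe succeeds; return False if the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # wait before the next try, but not after the last one
    return False


# Port 1337 matches the launch command above; adjust it if you changed --port.
HEALTH_URL = "http://localhost:1337/health"
```

Because the launcher cell blocks while the server runs, call `wait_until_ready(HEALTH_URL)` from a second notebook or a terminal session. The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.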

# TGI Benchmark
Now let's learn how to launch the benchmark tool!

The help output further down lists all of the TGI Benchmark settings.

Here are some of the more important ones:

- `--tokenizer-name` This is required so the tool knows which tokenizer to use.
- `--batch-size` This is important for load testing. We should use enough values to see what happens to throughput and latency. Note that, in the context of the benchmarking tool, batch size is the number of virtual users.
- `--sequence-length` AKA input tokens; match this to the needs of your use-case.
- `--decode-length` AKA output tokens; match this to the needs of your use-case.
- `--runs` 10 is the default

<blockquote style="border-left: 5px solid #80CBC4; background: #263238; color: #CFD8DC; padding: 0.5em 1em; margin: 1em 0;">
  <strong>💡 Tip:</strong> Use a low number for <code style="background: #37474F; color: #FFFFFF; padding: 2px 4px; border-radius: 4px;">--runs</code> when you are exploring but a higher number as you finalize to get more precise statistics.
</blockquote>
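
As a rough illustration of why more runs give more precise statistics: for independent measurements, the standard error of the mean shrinks with the square root of the number of runs. The sketch below assumes a hypothetical run-to-run latency spread of 10 ms. That figure is a placeholder, not a measurement from TGI.

```python
import math


def standard_error(std_dev, runs):
    """Standard error of the mean for `runs` independent measurements."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return std_dev / math.sqrt(runs)


# Hypothetical run-to-run latency spread in milliseconds (not a measured value).
ASSUMED_STD_MS = 10.0

for runs in (4, 10, 25, 100):
    se = standard_error(ASSUMED_STD_MS, runs)
    print(f"--runs {runs:>3}: standard error of the mean ~ {se:.2f} ms")
```

Quadrupling the number of runs only halves the uncertainty, which is why a small value is fine for exploration and a larger one is worth the wait for final numbers.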


In [4]:
!text-generation-benchmark -h

Text Generation Benchmarking tool

Usage: text-generation-benchmark [OPTIONS] --tokenizer-name <TOKENIZER_NAME>

Options:
  -t, --tokenizer-name <TOKENIZER_NAME>
          The name of the tokenizer (as in model_id on the huggingface hub, or local path) [env: TOKENIZER_NAME=]
      --revision <REVISION>
          The revision to use for the tokenizer if on the hub [env: REVISION=] [default: main]
  -b, --batch-size <BATCH_SIZE>
          The various batch sizes to benchmark for, the idea is to get enough batching to start seeing increased latency, this usually means you're moving from memory bound (usual as BS=1) to compute bound, and this is a sweet spot for the maximum batch size for the model under test
  -s, --sequence-length <SEQUENCE_LENGTH>
          This is the initial prompt sent to the text-generation-server length in token. Longer prompt will slow down the benchmark. Usually the la

Here is an example command. Notice that I repeat the `--batch-size` flag once for each batch size of interest so that the benchmark tool uses all of them. I also choose which batch sizes matter based on estimated user activity.
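
One rough way to turn estimated user activity into batch sizes is Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies that rule. The traffic figures are hypothetical placeholders, and treating in-flight requests as the benchmark's virtual users is a simplifying assumption.

```python
def requests_in_flight(requests_per_second, avg_latency_s):
    """Little's law: mean concurrency = arrival rate * mean time in system."""
    return requests_per_second * avg_latency_s


def bracketing_batch_sizes(concurrency):
    """Return the smallest power of two at or above the estimate and the one below it (minimum 1)."""
    upper = 1
    while upper < concurrency:
        upper *= 2
    lower = max(1, upper // 2)
    return lower, upper


# Hypothetical traffic: 4 requests per second, each taking about 3 s end to end.
concurrency = requests_in_flight(4, 3)
lower, upper = bracketing_batch_sizes(concurrency)
print(f"~{concurrency} requests in flight; look closely at batch sizes {lower} and {upper}")
```

An estimate like this only tells you where to look hardest. It is still worth benchmarking sizes on both sides to see where latency starts to climb.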

<blockquote style="border-left: 5px solid #FFAB91; background: #37474F; color: #FFCCBC; padding: 0.5em 1em; margin: 1em 0;">
  <strong>⚠️ Warning:</strong> Please note that the TGI Benchmark tool is designed to work in a terminal, not a Jupyter notebook. This means you will need to copy/paste the command into a Jupyter terminal tab. I am putting it here for convenience.
</blockquote>


In [None]:
!text-generation-benchmark \
--tokenizer-name astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \
--sequence-length 70 \
--decode-length 50 \
--batch-size 1 \
--batch-size 2 \
--batch-size 4 \
--batch-size 8 \
--batch-size 16 \
--batch-size 32 \
--batch-size 64 \
--batch-size 128 
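
If you sweep many batch sizes, it can be less error-prone to generate the command than to type each repeated flag. This is a minimal sketch that rebuilds the command above using only flags shown in this cookbook. The helper name `benchmark_command` is hypothetical.

```python
import shlex


def benchmark_command(tokenizer_name, sequence_length, decode_length, batch_sizes, runs=None):
    """Build a text-generation-benchmark command with one --batch-size flag per value."""
    cmd = [
        "text-generation-benchmark",
        "--tokenizer-name", tokenizer_name,
        "--sequence-length", str(sequence_length),
        "--decode-length", str(decode_length),
    ]
    if runs is not None:
        cmd += ["--runs", str(runs)]
    for batch_size in batch_sizes:
        # The flag is repeated once for each batch size to test.
        cmd += ["--batch-size", str(batch_size)]
    return cmd


# Powers of two from 1 to 128, matching the command above.
batch_sizes = [2 ** i for i in range(8)]
cmd = benchmark_command("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", 70, 50, batch_sizes)

# Print a shell-safe command line to paste into a terminal tab.
print(shlex.join(cmd))
```

The script only prints the command. As noted in the warning above, the benchmark itself should still be run in a terminal.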