# Run Llama-3.1-405B-Instruct-FP8

**Note that running the FP8 model requires GPUs with compute capability 9.0 or higher. A potential working setup would be 8x H100.**
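
To see why a multi-GPU node is suggested, here is a rough back-of-the-envelope estimate of the weight memory. It is only a sketch: it assumes one byte per parameter for FP8, 80 GB of memory per H100, and it ignores the KV cache, activations, and any layers kept in higher precision.

```python
# Rough weight-memory estimate; every figure here is an approximation.
PARAMS = 405e9        # parameter count implied by the model name
GPU_MEMORY_GB = 80    # assumption: the 80 GB H100 variant
NUM_GPUS = 8


def weight_memory_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


fp8_gb = weight_memory_gb(PARAMS, 1)   # FP8: one byte per weight
bf16_gb = weight_memory_gb(PARAMS, 2)  # bfloat16: two bytes per weight
total_gb = GPU_MEMORY_GB * NUM_GPUS

print(f"FP8 weights:    ~{fp8_gb:.0f} GB")
print(f"BF16 weights:   ~{bf16_gb:.0f} GB")
print(f"{NUM_GPUS} GPUs provide: {total_gb} GB")
```

Under these assumptions the FP8 weights fit on a single 8-GPU node with room left for inference, while the bfloat16 weights would not.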

Let's first install the required libraries:

In [None]:
! pip install transformers accelerate

Note that for the torch and [fbgemm-gpu](https://huggingface.co/docs/transformers/main/en/quantization/fbgemm_fp8) libraries, you might need to install the nightly version. Follow the instructions here:
https://pytorch.org/FBGEMM/fbgemm_gpu-development/InstallationInstructions.html

Replace `cu121` with the tag for your CUDA version. For nightly builds, add `--pre` and use the `whl/nightly/cu121/` index instead. The commands below use the CUDA 12.1 index:
```
pip install torch --index-url https://download.pytorch.org/whl/cu121/
pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121/
```

In [None]:
! pip install torch fbgemm-gpu

We import the required libraries:

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
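
Before loading the model, it may be worth confirming that the visible GPUs meet the compute capability requirement. The sketch below assumes the threshold is 9.0 (the capability of an H100); `supports_fbgemm_fp8` is a hypothetical helper name, not part of any library.

```python
def supports_fbgemm_fp8(capability):
    """Return True if a (major, minor) compute capability is at least 9.0."""
    return tuple(capability) >= (9, 0)


try:
    import torch
except ImportError:  # lets the helper be used even without torch installed
    torch = None

if torch is not None and torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)  # e.g. (9, 0) on an H100
        name = torch.cuda.get_device_name(i)
        status = "OK" if supports_fbgemm_fp8(cap) else "below 9.0"
        print(f"GPU {i}: {name}, compute capability {cap[0]}.{cap[1]} ({status})")
else:
    print("No CUDA device visible.")
```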

Let's load the model. The model has already been quantized with fbgemm_fp8 as specified in the model's [config.json](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct-FP8/blob/main/config.json), so we don't need to specify a `quantization_config` and can load the quantized model as follows:

In [None]:
model_name = "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16
)
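
To illustrate how the quantization method is recorded in the config, here is a minimal sketch that reads it from a config-like dictionary. The JSON below is a hand-written stand-in, not the real `config.json`, which contains many more fields; `get_quant_method` is a hypothetical helper.

```python
import json

# Hand-written stand-in for the relevant part of config.json.
# The real file has many more fields and its values may differ.
sample_config_text = """
{
  "model_type": "llama",
  "quantization_config": {
    "quant_method": "fbgemm_fp8"
  }
}
"""


def get_quant_method(config):
    """Return the quantization method named in a config dict, or None."""
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


config = json.loads(sample_config_text)
print(get_quant_method(config))
```

When this entry is present in the checkpoint's config, no separate `quantization_config` argument is needed at load time, as described above.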

Then, we need to prepare the inputs: 

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

Finally, we can generate the output!

In [None]:
output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
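
The decoded text above includes the prompt, because `generate` returns the prompt tokens followed by the new ones for decoder-only models. If only the newly generated text is wanted, the prompt ids can be sliced off first. The sketch below uses toy ids in plain lists, and `new_token_ids` is a hypothetical helper.

```python
def new_token_ids(output_ids, prompt_length):
    """Drop the first prompt_length ids, keeping only the generated ones."""
    return output_ids[prompt_length:]


# Toy ids standing in for one row of generate() output.
prompt = [101, 7, 8, 9]
full_output = prompt + [42, 43, 44]
print(new_token_ids(full_output, len(prompt)))  # prints [42, 43, 44]
```

In this notebook, the prompt length would be `input_ids["input_ids"].shape[1]`, and the sliced result of `output[0]` can be passed to `tokenizer.decode` as before.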