# HuggingFace meets `bitsandbytes` for lighter models on GPU for inference

 <center>
 <img src="https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png">
 </center>


You can load any HuggingFace 🤗 model in 8-bit with just a few lines of code. Install the dependencies below first!


In [None]:
!pip install --quiet bitsandbytes
!pip install --quiet git+https://github.com/huggingface/transformers.git # Install latest version of transformers
!pip install --quiet accelerate


## Hardware requirements 🔨

To use this feature, you need a GPU that supports 8-bit operations. Currently, Turing and Ampere GPUs (RTX 20 series, RTX 30 series, A40-A100, T4+) are supported, which means that on Colab we need a T4 GPU. Use the following snippet to check that you are running on a supported GPU:

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Aug  8 09:10:10 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

Here we are using a `Tesla T4` GPU that should support 8-bit tensor cores! We are good to go 🚀
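
As a complement to reading the `nvidia-smi` table, the same check can be sketched programmatically. The sketch below assumes that Turing corresponds to CUDA compute capability 7.5 and Ampere to 8.x, so it treats any capability of at least (7, 5) as supported. The helper names `supports_int8` and `current_gpu_supports_int8` are hypothetical and not part of any library.

```python
# Assumption: "Turing or newer" is approximated as compute capability >= (7, 5).
MIN_INT8_CAPABILITY = (7, 5)


def supports_int8(capability):
    """Return True if a (major, minor) compute capability meets the assumed minimum."""
    return tuple(capability) >= MIN_INT8_CAPABILITY


def current_gpu_supports_int8():
    """Check the first visible GPU through PyTorch; False when no GPU is visible."""
    import torch  # imported lazily so the pure helper above works without PyTorch

    if not torch.cuda.is_available():
        return False
    return supports_int8(torch.cuda.get_device_capability(0))


# Pure checks that do not need a GPU:
print(supports_int8((7, 5)))  # capability reported for the Tesla T4 in this notebook
print(supports_int8((7, 0)))  # an older capability, below the assumed minimum
```

On a Colab runtime, calling `current_gpu_supports_int8()` should agree with what the `nvidia-smi` output suggests.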

## Utility variables & functions 🧰

In [None]:
name = "bigscience/bloom-3b"
text = "Hello my name is"
max_new_tokens = 20

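def generate_from_model(model, tokenizer):
  # Tokenize the shared prompt into PyTorch tensors.
  encoded_input = tokenizer(text, return_tensors='pt')
  # Move the input ids to the GPU; no length argument is passed, so generate() uses its default limit.
  output_sequences = model.generate(input_ids=encoded_input['input_ids'].cuda())
  # Decode the first (and only) generated sequence back to text.
  return tokenizer.decode(output_sequences[0], skip_special_tokens=True)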

## Use 8bit models and `pipeline` 🤗

You can use 8bit quantized models together with `pipeline` as follows:

In [None]:
from transformers import pipeline

pipe = pipeline(model=name, model_kwargs={"device_map": "auto", "load_in_8bit": True}, max_new_tokens=max_new_tokens)


Welcome to bitsandbytes. For bug reports, please use this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
CUDA SETUP: CUDA path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA_SETUP: Detected CUDA version 111
CUDA_SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda111.so...



Let's check the output!

In [None]:
pipe(text)

[{'generated_text': 'Hello my name is John and I am a student at the University of the West of England. I am currently studying for'}]

## Use 8bit models and `.generate` 📖

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_8bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(name)


In [None]:
generate_from_model(model_8bit, tokenizer)



'Hello my name is John and I am a student at the University of the West of England. I'

Let's compare the output of our quantized model with that of the original model:

In [None]:
model_native = AutoModelForCausalLM.from_pretrained(name, device_map="auto", torch_dtype="auto")
generate_from_model(model_native, tokenizer)



'Hello my name is John and I am a student at the University of the West Indies. I am'

## Memory footprint comparison 🪶

In [None]:
mem_fp16 = model_native.get_memory_footprint()
mem_int8 = model_8bit.get_memory_footprint()
print("Memory footprint int8 model: {} | Memory footprint fp16 model: {} | Relative difference: {}".format(mem_int8, mem_fp16, mem_fp16/mem_int8))

Memory footprint int8 model: 3645818880 | Memory footprint fp16 model: 6005114880 | Relative difference: 1.6471237539918604


The int8 model uses about 1.65x less memory than the fp16 model for a 3-billion-parameter model! Note that internally we replace all the linear layers with the ones implemented in `bitsandbytes`. As the model scales up, the number of linear layers grows, so the memory saved on those layers becomes very significant for very large models. For example, quantizing BLOOM-176B (a 176-billion-parameter model) reduces the memory footprint by 1.96x, which can save a lot of compute resources in practice.
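
To see why the gain approaches 2x as the model grows, here is a rough back-of-the-envelope sketch. It assumes that every parameter in a replaced linear layer takes 1 byte, that every other parameter stays at 2 bytes (fp16), and that any extra quantization statistics are negligible. These are simplifying assumptions, and the function names are hypothetical.

```python
def footprint_ratio(linear_fraction):
    """fp16 footprint divided by int8 footprint, under the stated assumptions.

    linear_fraction is the share of parameters that live in quantized linear layers.
    fp16 cost per parameter: 2 bytes.
    Mixed cost per parameter: 1 * linear_fraction + 2 * (1 - linear_fraction).
    """
    return 2.0 / (2.0 - linear_fraction)


def implied_linear_fraction(ratio):
    """Invert footprint_ratio: which linear-layer share would produce this ratio?"""
    return 2.0 - 2.0 / ratio


# Ratio measured above for bloom-3b, using the byte counts printed by the notebook.
measured = 6005114880 / 3645818880
print(f"bloom-3b implied linear share: {implied_linear_fraction(measured):.3f}")

# Ratio reported in the text for the 176B model.
print(f"176B implied linear share: {implied_linear_fraction(1.96):.3f}")
```

Under these assumptions, a ratio close to 2x simply means that almost all parameters sit in the quantized linear layers.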

## Hyper-parameter tuning 📠


**Warning:** you may want to run these cells separately from previous cells to avoid Out Of Memory (OOM) issues.
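
If restarting the runtime is inconvenient, one option is to drop the references to the earlier models and ask PyTorch to release its cached GPU memory. The helper below is a minimal sketch. `release_cuda_cache` is a hypothetical name, and whether enough memory is actually freed depends on what else still references the models.

```python
import gc


def release_cuda_cache():
    """Run garbage collection, then release PyTorch's cached CUDA memory if possible.

    Returns True when the CUDA cache was emptied, False when PyTorch or a GPU
    is not available.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


# In the notebook, delete the earlier models first, for example:
#   del model_native, model_8bit
# and then call:
released = release_cuda_cache()
print("CUDA cache released:", released)
```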

You can experiment with the `int8_threshold` parameter and see how it affects your model's outputs. You can pass it directly when loading the model through the `.from_pretrained` method. By default it is set to `6.0`, as described in the paper. Later Transformers releases expose the same setting as `llm_int8_threshold` on `BitsAndBytesConfig`.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_8bit_thresh_4 = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True, int8_threshold=4.0)
model_8bit_thresh_2 = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True, int8_threshold=2.0)
tokenizer = AutoTokenizer.from_pretrained(name)


In [None]:
generate_from_model(model_8bit_thresh_4, tokenizer)

'Hello my name is John and I am a student at the University of the West Indies. I am'

In [None]:
generate_from_model(model_8bit_thresh_2, tokenizer)

'Hello my name is John and I am a newbie to the forum. I have a question about'

As you can see, the generations can vary slightly with different thresholds, because small perturbations have a larger effect once parameters are stored in 8 bits. The threshold sets the magnitude above which hidden-state values are treated as outliers and processed in fp16, so changing it changes how much of the computation runs in int8. Setting the threshold to `0` leads to a full model in `int8`.
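
To make the role of the threshold concrete, here is a toy, pure-Python sketch of the outlier split. It is a simplified illustration of the idea, not the `bitsandbytes` implementation: a feature column counts as an outlier when any value in it has a magnitude above the threshold. The function name and the toy values are made up for the example, and the special case of a threshold of `0` is not modelled.

```python
def split_outlier_columns(hidden_states, threshold):
    """Split column indices into (regular, outlier) lists.

    A column is an outlier if any of its values has magnitude above the threshold.
    In this simplified picture, outlier columns would be handled in fp16 and
    regular columns in int8.
    """
    n_cols = len(hidden_states[0])
    outliers = [
        j for j in range(n_cols)
        if any(abs(row[j]) > threshold for row in hidden_states)
    ]
    regular = [j for j in range(n_cols) if j not in outliers]
    return regular, outliers


# Toy hidden states: 3 tokens x 4 features (made-up values).
toy_hidden_states = [
    [0.5, -1.2, 7.3, 0.1],
    [-0.8, 2.5, -6.4, 0.3],
    [2.2, -4.5, 8.0, -0.2],
]

for threshold in (6.0, 4.0, 2.0):
    regular, outliers = split_outlier_columns(toy_hidden_states, threshold)
    print(f"threshold={threshold}: int8 columns={regular}, fp16 columns={outliers}")
```

In this toy example, each lower threshold moves more columns to the fp16 side, which is one way to see why the outputs shift when the threshold changes.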