Quantization is a technique to reduce the computational and memory costs of evaluating deep learning models by representing their weights and activations with low-precision data types, such as 8-bit integer (int8), instead of the usual 32-bit floating point (float32). Reducing the number of bits means the resulting model requires less memory storage, which is crucial for deploying Large Language Models.
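To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain NumPy. This is an illustration of the general technique, not Quanto's actual implementation: the function names and the choice of a single per-tensor scale are assumptions for the example.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map float32 values to int8."""
    scale = np.abs(x).max() / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 values."""
    return q.astype(np.float32) * scale

x = np.random.randn(16).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# q stores the same information in 1 byte per value instead of 4,
# at the cost of a rounding error of at most scale / 2 per element.
```

Storing `q` and a single float `scale` takes roughly a quarter of the memory of `x`, which is exactly the saving that matters when serving large models.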
![Quanto: a pytorch quantization toolkit](https://huggingface.co/blog/assets/169_quanto_intro/thumbnail.png)