
This article was automatically translated from the original Turkish version.


Quantization in Large Language Models (LLMs)

Quantization is an optimization technique that represents the parameters of large language models (LLMs, Large Language Models) with lower bit widths, enabling these models to use less memory and run more efficiently.


Large language models are deep neural networks containing billions of parameters. These models achieve high performance in natural language processing tasks such as text generation, summarization, and translation, but they require substantial computational power, memory, and energy to run. For example, a Mistral 7B model trained at FP16 precision occupies approximately 13–14 GB of GPU memory. This requirement grows significantly for larger models.
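The memory figure above follows directly from bytes per parameter. A minimal sketch, assuming roughly 7.0e9 parameters for a "7B" model (the constant and function names here are illustrative, not from any library):

```python
# Approximate GPU memory needed just to store model weights, by precision.
# Assumes ~7.0e9 parameters for a "7B" model; runtime overhead is extra.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Weight storage in gigabytes (1 GB = 2**30 bytes here)."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

for dtype in ("FP32", "FP16", "INT8", "INT4"):
    print(f"{dtype}: {weight_memory_gb(7.0e9, dtype):5.1f} GB")
```

FP16 comes out to roughly 13 GB, consistent with the figure quoted above; INT4 cuts the same weights to about 3.3 GB.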


To reduce such costs, quantization converts the model’s weights and activations into lower-bit values such as 8-bit integers (INT8) or 4-bit integers (INT4), thereby shrinking the model and enabling it to run with fewer resources.

Core Principle of Quantization

During training, language models typically use 32-bit floating-point (FP32) or 16-bit floating-point (FP16) data types. These formats provide high precision but consume considerable memory and computational power. Quantization converts these data types into lower bit widths such as 8-bit integers (INT8) or 4-bit integers (INT4).

This process compresses weights and activations into a narrower numerical range while largely preserving the model’s behavior. As a result:

  • The overall model size is reduced
  • Memory consumption decreases
  • Inference latency is shortened
  • Deployment on low-power devices becomes feasible
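The mapping behind this compression can be sketched with asymmetric (affine) INT8 quantization, which maps a tensor's observed range onto the integer range with a scale and a zero-point. This is a minimal illustration in NumPy, not any particular library's implementation:

```python
import numpy as np

np.random.seed(0)

def quantize_int8(x: np.ndarray):
    """Asymmetric (affine) INT8 quantization: map [x.min(), x.max()] onto [-128, 127]."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximation of the original floats from the integers."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
print(np.abs(w - w_hat).max())  # at most about one quantization step
```

Each float is stored as a single signed byte plus the shared scale and zero-point, which is where the 4x size reduction relative to FP32 comes from.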

Types of Quantization

Quantization is broadly divided into two main types, depending on whether it is applied after training or incorporated into training itself:

PTQ (Post-Training Quantization)

PTQ applies quantization after the model has been fully trained: low-bit weight transformations are applied directly to the trained model. This method is notable for its speed and ease of implementation. It requires no retraining, which saves time and resources. However, significant accuracy losses can occur, especially at very low bit widths such as 4-bit and below.
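PTQ pipelines commonly pair the trained weights with a short calibration pass that records activation ranges on a small sample of data; those ranges then fix the integer scales. A hypothetical min/max observer sketch (the class and batches are illustrative):

```python
import numpy as np

np.random.seed(0)

class MinMaxObserver:
    """Collects activation ranges from calibration data, a common PTQ step.

    After the trained model runs on a small calibration set, the observed
    min/max per tensor determines the INT8 scale -- no retraining needed.
    """
    def __init__(self):
        self.min, self.max = np.inf, -np.inf

    def observe(self, x: np.ndarray) -> None:
        self.min = min(self.min, float(x.min()))
        self.max = max(self.max, float(x.max()))

    def scale_int8(self) -> float:
        # Symmetric scale covering the widest observed magnitude.
        return max(abs(self.min), abs(self.max)) / 127.0

obs = MinMaxObserver()
for _ in range(8):  # stand-in for activations from calibration batches
    obs.observe(np.random.randn(32, 16).astype(np.float32))
print(f"INT8 scale = {obs.scale_int8():.4f}")
```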

QAT (Quantization-Aware Training)

In the QAT approach, the model accounts for the effects of quantization while it learns: during training, weights and activations are simulated at low bit widths. This makes the model more robust to the precision losses quantization introduces, so accuracy is better preserved, especially in low-bit scenarios such as 4-bit and below. However, QAT requires a longer training process and greater computational resources.
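The simulation mentioned above is usually done with "fake quantization": values are rounded to the low-bit grid in the forward pass while the backward pass treats the rounding as identity (the straight-through estimator), so gradients still flow. A minimal sketch of the forward side:

```python
import numpy as np

np.random.seed(0)

def fake_quantize(x: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Simulate low-bit rounding inside a full-precision forward pass.

    QAT inserts operations like this during training; the backward pass
    treats the rounding as identity (straight-through estimator), so the
    weights learn to tolerate the quantization noise.
    """
    qmax = 2 ** (n_bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

w = np.random.randn(8, 8).astype(np.float32)
w_q = fake_quantize(w, n_bits=4)
print("distinct values after 4-bit fake quant:", len(np.unique(w_q)))
```

The output stays a float tensor, but it can only take at most 16 distinct values, which is exactly the constraint the trained weights adapt to.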

Quantization Techniques

Several common and effective quantization techniques are summarized below:

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ is a post-training quantization (PTQ) method that analyzes the Hessian matrix to determine the sensitivity of each weight during the reduction to low bit widths. This analysis identifies which weights are most critical to accuracy, allowing quantization to proceed with minimal information loss. GPTQ often delivers strong results even at very low bit formats such as INT4, so it is frequently preferred when the goal is to shrink large models while preserving their precision.
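A key ingredient of GPTQ is error compensation: weights are quantized one at a time, and each rounding error is redistributed to the not-yet-quantized weights so errors cancel rather than accumulate. A deliberately simplified toy version with an identity Hessian (GPTQ proper weights this update by the layer's inverse Hessian):

```python
import numpy as np

np.random.seed(0)

def greedy_quantize_row(w: np.ndarray, scale: float) -> np.ndarray:
    """Toy sketch of GPTQ-style error compensation (identity Hessian).

    Each weight is rounded to the grid, and the rounding error is pushed
    onto the next unquantized weight, so errors cancel instead of adding.
    """
    w = w.copy()
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = np.round(w[i] / scale) * scale
        if i + 1 < len(w):
            w[i + 1] += w[i] - q[i]   # compensate the residual error
    return q

row = np.random.randn(64).astype(np.float32)
q = greedy_quantize_row(row, scale=0.5)
# Compensation keeps the sum of the row close to the original sum:
print(abs(row.sum() - q.sum()))
```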

AWQ (Activation-aware Weight Quantization)

AWQ is also a PTQ method, but unlike classical techniques it considers not only the weights but also the distribution of activations. By analyzing where activations are most concentrated, it scales the weight quantization accordingly. This helps keep model accuracy stable, especially at low bit depths such as INT4. Like GPTQ, it is applied after training, but it stands out by delivering more consistent results in low-precision scenarios.
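The activation-aware idea can be sketched as deriving per-input-channel scales from calibration activation magnitudes: channels that see large activations get scaled up before rounding (with the inverse scale folded into the preceding operation), so their "salient" weights suffer less relative rounding error. A hypothetical illustration, not the AWQ reference implementation:

```python
import numpy as np

np.random.seed(0)

def activation_aware_scales(activations: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel scales from activation magnitudes (AWQ-style idea).

    alpha in [0, 1] controls how strongly the scales follow the activations;
    the inverse scale would be folded into the activations at inference time.
    """
    mag = np.abs(activations).mean(axis=0)   # mean |activation| per channel
    return (mag / mag.mean()) ** alpha

# Hypothetical calibration activations: (tokens, channels), one loud channel.
acts = np.random.randn(256, 8).astype(np.float32)
acts[:, 3] *= 10.0
s = activation_aware_scales(acts)
print(s.round(2))  # channel 3 receives the largest scale
```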

LLM.int8()

LLM.int8() adopts a mixed-precision approach distinct from classical quantization. It applies different precision levels based on the importance of activation channels: outlier channels are kept in FP16 while the rest are converted to INT8. This maintains high accuracy in demanding applications while significantly reducing memory usage. It was designed specifically for large language models such as LLaMA and GPT-3 and improves efficiency in such models while minimizing accuracy loss.
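The decomposition at the heart of LLM.int8() can be illustrated on a single matrix multiplication: input channels whose activations exceed an outlier threshold are computed in full precision, while the remaining channels are quantized to INT8, multiplied as integers, and dequantized. A simplified NumPy sketch (the threshold value and per-tensor scales are simplifications; the paper uses per-row/per-column scales):

```python
import numpy as np

np.random.seed(0)

def mixed_precision_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0):
    """Sketch of LLM.int8()-style mixed decomposition for y = x @ w."""
    outlier = np.abs(x).max(axis=0) > threshold     # per input channel
    # Outlier channels: keep full precision.
    y_fp = x[:, outlier] @ w[outlier, :]
    # Regular channels: symmetric INT8 quantization of both operands.
    xs = np.abs(x[:, ~outlier]).max() / 127 + 1e-12
    ws = np.abs(w[~outlier, :]).max() / 127 + 1e-12
    xq = np.round(x[:, ~outlier] / xs).astype(np.int32)
    wq = np.round(w[~outlier, :] / ws).astype(np.int32)
    y_int = (xq @ wq) * (xs * ws)                   # integer matmul, then dequantize
    return y_fp + y_int

x = np.random.randn(4, 16).astype(np.float32)
x[:, 0] *= 20.0                                     # create an outlier channel
w = np.random.randn(16, 8).astype(np.float32)
y = mixed_precision_matmul(x, w)
print(np.abs(y - x @ w).max())                      # small residual error
```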

Applications and Limitations of Quantization

Quantization enables large language models to run efficiently on a broader range of hardware. It plays a crucial role in deploying LLMs on edge devices such as mobile phones and IoT systems, and it reduces server costs by lowering memory and computational requirements during deployment. In applications that demand real-time responses, such as chatbots or voice assistants, it improves the user experience by shortening inference latency.

Despite these advantages, there are notable limitations and challenges. At very low bit widths, model accuracy can drop significantly. Not all hardware supports operations on quantized weights; some devices lack low-bit compute support entirely. Methods like QAT require more complex training procedures and greater technical expertise to deploy. Finally, quantized models may need additional adaptation to run reliably on certain inference engines such as ONNX or TensorRT.

Author Information

Berke Bünyamin Süle, December 3, 2025 at 11:44 AM


Contents

  • Core Principle of Quantization

  • Types of Quantization

    • PTQ (Post-Training Quantization)

    • QAT (Quantization-Aware Training)

  • Quantization Techniques

    • GPTQ (Generative Pre-trained Transformer Quantization)

    • AWQ (Activation-aware Weight Quantization)

    • LLM.int8()

  • Applications and Limitations of Quantization
