
This article was automatically translated from the original Turkish version.


Quantization in Large Language Models (LLMs)

Quantization is an optimization technique that represents the parameters of large language models (LLMs, Large Language Models) with lower bit widths, enabling these models to use less memory and run more efficiently.


Large language models are deep neural networks containing billions of parameters. These models achieve high performance in natural language processing tasks such as text generation, summarization, and translation, but they require substantial computational power, memory, and energy to run. For example, a Mistral 7B model trained at FP16 precision occupies approximately 13–14 GB of GPU memory. This requirement grows significantly for larger models.
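The memory figure above follows directly from bytes per parameter. A minimal sketch, assuming roughly 7.0e9 parameters for a "7B" model (the constant and function names here are illustrative, not from any library):

```python
# Approximate GPU memory needed just to store model weights, by precision.
# Assumes ~7.0e9 parameters for a "7B" model; runtime overhead is extra.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Weight storage in gigabytes (1 GB = 2**30 bytes here)."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

for dtype in ("FP32", "FP16", "INT8", "INT4"):
    print(f"{dtype}: {weight_memory_gb(7.0e9, dtype):5.1f} GB")
```

FP16 comes out to roughly 13 GB, consistent with the figure quoted above; INT4 cuts the same weights to about 3.3 GB.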


To reduce such costs, quantization converts the model’s weights and activations into lower-bit values such as 8-bit integers (INT8) or 4-bit integers (INT4), thereby shrinking the model and enabling it to run with fewer resources.

Core Principle of Quantization

During training, language models typically use 32-bit floating-point (FP32) or 16-bit floating-point (FP16) data types. These formats provide high precision but consume considerable memory and computational power. Quantization converts these data types into lower bit widths such as 8-bit integers (INT8) or 4-bit integers (INT4).

This process compresses weights and activations into a narrower numerical range while largely preserving the model’s behavior. As a result:

  • The overall model size is reduced
  • Memory consumption decreases
  • Inference latency is shortened
  • Deployment on low-power devices becomes feasible
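The mapping behind this compression can be sketched with asymmetric (affine) INT8 quantization, which maps a tensor's observed range onto the integer range with a scale and a zero-point. This is a minimal illustration in NumPy, not any particular library's implementation:

```python
import numpy as np

np.random.seed(0)

def quantize_int8(x: np.ndarray):
    """Asymmetric (affine) INT8 quantization: map [x.min(), x.max()] onto [-128, 127]."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximation of the original floats from the integers."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
print(np.abs(w - w_hat).max())  # at most about one quantization step
```

Each float is stored as a single signed byte plus the shared scale and zero-point, which is where the 4x size reduction relative to FP32 comes from.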

Types of Quantization

Quantization is broadly divided into two main types, depending on whether it is applied after training or incorporated into training itself:

PTQ (Post-Training Quantization)

PTQ applies quantization after the model has been fully trained: low-bit weight transformations are applied directly to the trained model. This method is notable for its speed and ease of implementation. It requires no retraining, which saves time and resources. However, significant accuracy losses can occur, especially at very low bit widths such as 4-bit and below.
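PTQ pipelines commonly pair the trained weights with a short calibration pass that records activation ranges on a small sample of data; those ranges then fix the integer scales. A hypothetical min/max observer sketch (the class and batches are illustrative):

```python
import numpy as np

np.random.seed(0)

class MinMaxObserver:
    """Collects activation ranges from calibration data, a common PTQ step.

    After the trained model runs on a small calibration set, the observed
    min/max per tensor determines the INT8 scale -- no retraining needed.
    """
    def __init__(self):
        self.min, self.max = np.inf, -np.inf

    def observe(self, x: np.ndarray) -> None:
        self.min = min(self.min, float(x.min()))
        self.max = max(self.max, float(x.max()))

    def scale_int8(self) -> float:
        # Symmetric scale covering the widest observed magnitude.
        return max(abs(self.min), abs(self.max)) / 127.0

obs = MinMaxObserver()
for _ in range(8):  # stand-in for activations from calibration batches
    obs.observe(np.random.randn(32, 16).astype(np.float32))
print(f"INT8 scale = {obs.scale_int8():.4f}")
```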

QAT (Quantization-Aware Training)

In the QAT approach, the model accounts for the effects of quantization while it learns: during training, weights and activations are simulated at low bit widths. This makes the model more robust to the precision losses quantization introduces, so accuracy is better preserved, especially in low-bit scenarios such as 4-bit and below. However, QAT requires a longer training process and greater computational resources.
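The simulation mentioned above is usually done with "fake quantization": values are rounded to the low-bit grid in the forward pass while the backward pass treats the rounding as identity (the straight-through estimator), so gradients still flow. A minimal sketch of the forward side:

```python
import numpy as np

np.random.seed(0)

def fake_quantize(x: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Simulate low-bit rounding inside a full-precision forward pass.

    QAT inserts operations like this during training; the backward pass
    treats the rounding as identity (straight-through estimator), so the
    weights learn to tolerate the quantization noise.
    """
    qmax = 2 ** (n_bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

w = np.random.randn(8, 8).astype(np.float32)
w_q = fake_quantize(w, n_bits=4)
print("distinct values after 4-bit fake quant:", len(np.unique(w_q)))
```

The output stays a float tensor, but it can only take at most 16 distinct values, which is exactly the constraint the trained weights adapt to.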

Quantization Techniques

Several common and effective quantization techniques are summarized below:

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ is a post-training quantization (PTQ) method that analyzes the Hessian matrix to determine the sensitivity of each weight during the reduction to low bit widths. This analysis identifies which weights are most critical to accuracy, allowing quantization to proceed with minimal information loss. GPTQ often delivers strong results even at very low bit formats such as INT4, so it is frequently preferred when the goal is to shrink large models while preserving their precision.
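A key ingredient of GPTQ is error compensation: weights are quantized one at a time, and each rounding error is redistributed to the not-yet-quantized weights so errors cancel rather than accumulate. A deliberately simplified toy version with an identity Hessian (GPTQ proper weights this update by the layer's inverse Hessian):

```python
import numpy as np

np.random.seed(0)

def greedy_quantize_row(w: np.ndarray, scale: float) -> np.ndarray:
    """Toy sketch of GPTQ-style error compensation (identity Hessian).

    Each weight is rounded to the grid, and the rounding error is pushed
    onto the next unquantized weight, so errors cancel instead of adding.
    """
    w = w.copy()
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = np.round(w[i] / scale) * scale
        if i + 1 < len(w):
            w[i + 1] += w[i] - q[i]   # compensate the residual error
    return q

row = np.random.randn(64).astype(np.float32)
q = greedy_quantize_row(row, scale=0.5)
# Compensation keeps the sum of the row close to the original sum:
print(abs(row.sum() - q.sum()))
```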

AWQ (Activation-aware Weight Quantization)

AWQ is also a PTQ method, but unlike classical techniques it considers not only the weights but also the distribution of activations. By analyzing where activations are most concentrated, it scales the weight quantization accordingly. This helps keep model accuracy stable, especially at low bit depths such as INT4. Like GPTQ, it is applied after training, but it stands out by delivering more consistent results in low-precision scenarios.
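The activation-aware idea can be sketched as deriving per-input-channel scales from calibration activation magnitudes: channels that see large activations get scaled up before rounding (with the inverse scale folded into the preceding operation), so their "salient" weights suffer less relative rounding error. A hypothetical illustration, not the AWQ reference implementation:

```python
import numpy as np

np.random.seed(0)

def activation_aware_scales(activations: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel scales from activation magnitudes (AWQ-style idea).

    alpha in [0, 1] controls how strongly the scales follow the activations;
    the inverse scale would be folded into the activations at inference time.
    """
    mag = np.abs(activations).mean(axis=0)   # mean |activation| per channel
    return (mag / mag.mean()) ** alpha

# Hypothetical calibration activations: (tokens, channels), one loud channel.
acts = np.random.randn(256, 8).astype(np.float32)
acts[:, 3] *= 10.0
s = activation_aware_scales(acts)
print(s.round(2))  # channel 3 receives the largest scale
```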

LLM.int8()

LLM.int8() adopts a mixed-precision approach distinct from classical quantization. It applies different precision levels based on the importance of activation channels: outlier channels are kept in FP16 while the rest are converted to INT8. This maintains high accuracy in demanding applications while significantly reducing memory usage. It was designed specifically for large language models such as LLaMA and GPT-3 and improves efficiency in such models while minimizing accuracy loss.
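The decomposition at the heart of LLM.int8() can be illustrated on a single matrix multiplication: input channels whose activations exceed an outlier threshold are computed in full precision, while the remaining channels are quantized to INT8, multiplied as integers, and dequantized. A simplified NumPy sketch (the threshold value and per-tensor scales are simplifications; the paper uses per-row/per-column scales):

```python
import numpy as np

np.random.seed(0)

def mixed_precision_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0):
    """Sketch of LLM.int8()-style mixed decomposition for y = x @ w."""
    outlier = np.abs(x).max(axis=0) > threshold     # per input channel
    # Outlier channels: keep full precision.
    y_fp = x[:, outlier] @ w[outlier, :]
    # Regular channels: symmetric INT8 quantization of both operands.
    xs = np.abs(x[:, ~outlier]).max() / 127 + 1e-12
    ws = np.abs(w[~outlier, :]).max() / 127 + 1e-12
    xq = np.round(x[:, ~outlier] / xs).astype(np.int32)
    wq = np.round(w[~outlier, :] / ws).astype(np.int32)
    y_int = (xq @ wq) * (xs * ws)                   # integer matmul, then dequantize
    return y_fp + y_int

x = np.random.randn(4, 16).astype(np.float32)
x[:, 0] *= 20.0                                     # create an outlier channel
w = np.random.randn(16, 8).astype(np.float32)
y = mixed_precision_matmul(x, w)
print(np.abs(y - x @ w).max())                      # small residual error
```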

Applications and Limitations of Quantization

Quantization enables large language models to run efficiently on a broader range of hardware. It plays a crucial role in deploying LLMs on edge devices such as mobile phones and IoT systems, and it reduces server costs by lowering memory and computational requirements during deployment. In applications that demand real-time responses, such as chatbots or voice assistants, it improves the user experience by shortening inference latency.

Despite these advantages, there are notable limitations and challenges. At very low bit widths, model accuracy can drop significantly. Not all hardware supports operations on quantized weights; some devices lack low-bit compute support entirely. Methods like QAT require more complex training procedures and greater technical expertise to deploy. Finally, quantized models may need additional adaptation to run reliably on certain inference engines such as ONNX or TensorRT.

Author Information

Berke Bünyamin Süle, December 3, 2025 at 11:44 AM


Contents

  • Core Principle of Quantization

  • Types of Quantization

    • PTQ (Post-Training Quantization)

    • QAT (Quantization-Aware Training)

  • Quantization Techniques

    • GPTQ (Generative Pre-trained Transformer Quantization)

    • AWQ (Activation-aware Weight Quantization)

    • LLM.int8()

  • Applications and Limitations of Quantization
