This article was automatically translated from the original Turkish version.

Adafactor is an efficient, low-memory optimization algorithm developed by Google, specifically designed for memory-intensive models such as large-scale language models. It was first introduced in 2018 in the paper titled "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost". Like the Adam algorithm, Adafactor performs moment-based updates but computes second-moment estimates using significantly less memory, thereby enabling the training of large models.
The most important feature of Adafactor is that, instead of storing the second moment as a full matrix, it maintains only its row and column averages. For high-dimensional tensors this reduces memory consumption by roughly a square-root factor, from the product of the dimensions to their sum. For example, for a parameter matrix of size $d \times d$, Adam stores $d^2$ second-moment entries, while Adafactor stores only $2d$ (one row vector and one column vector).
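A quick back-of-the-envelope comparison illustrates the savings (the dimension `d = 4096` is an assumed example size, not from the original):

```python
d = 4096  # hypothetical hidden dimension of one weight matrix

# Adam: one second-moment entry per parameter of the d x d matrix
adam_entries = d * d

# Adafactor: one row-average vector plus one column-average vector
adafactor_entries = d + d

print(adam_entries / adafactor_entries)  # -> 2048.0
```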
Like the Adam algorithm, Adafactor can track the first moment (mean) and second moment (mean of squares) of the gradients, although first-moment accumulation is disabled by default. The second moment, however, is computed in factored form as follows:
For a parameter matrix $W \in \mathbb{R}^{r \times c}$:
$R_{t,i} = \beta_2 \cdot R_{t-1,i} + (1 - \beta_2) \cdot \frac{1}{c} \sum_{j=1}^{c} g_{t,ij}^2 $
$C_{t,j} = \beta_2 \cdot C_{t-1,j} + (1 - \beta_2) \cdot \frac{1}{r} \sum_{i=1}^{r} g_{t,ij}^2 $
Using these values, an approximate square norm matrix is obtained:
$\hat{v}_{t,ij} = \frac{R_{t,i} \cdot C_{t,j}}{\frac{1}{rc} \sum_{i=1}^{r} \sum_{j=1}^{c} R_{t,i} \cdot C_{t,j}} $
The parameters are updated with the learning rate and normalization:
$\theta_t = \theta_{t-1} - \eta_t \cdot \frac{g_t}{\sqrt{\hat{v}_t} + \epsilon} $
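The equations above can be sketched as a single NumPy update step. This is a minimal illustration, not the reference implementation: the function name `adafactor_step`, the base rate `lr0`, and the omission of first-moment accumulation are all assumptions for brevity.

```python
import numpy as np

def adafactor_step(theta, g, R, C, t, beta2=0.999, eps=1e-30, lr0=0.05):
    """One factored second-moment update step (sketch of the equations above)."""
    # Exponential moving averages of the row and column means of g^2
    R = beta2 * R + (1 - beta2) * (g ** 2).mean(axis=1)  # shape (r,)
    C = beta2 * C + (1 - beta2) * (g ** 2).mean(axis=0)  # shape (c,)
    # Rank-1 reconstruction of the second moment, normalized by its mean
    RC = np.outer(R, C)
    v_hat = RC / RC.mean()
    # Relative learning rate decaying as 1/sqrt(t)
    eta = lr0 / np.sqrt(t)
    theta = theta - eta * g / (np.sqrt(v_hat) + eps)
    return theta, R, C
```

Note that only the vectors `R` and `C` persist between steps; the full matrix `v_hat` is reconstructed on the fly and discarded.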
Adafactor, by default, uses a relative learning rate $(\eta_t \propto \frac{1}{\sqrt{t}})$ instead of an absolute learning rate. This enables automatic learning rate control for large models without requiring manual tuning of fixed values.
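In the original paper the relative step size is additionally capped, $\rho_t = \min(10^{-2}, 1/\sqrt{t})$, and then scaled by the RMS of the parameters. A sketch of the decay schedule alone (the parameter-RMS scaling is omitted here):

```python
import math

def relative_step(t, cap=1e-2):
    # rho_t = min(1e-2, 1/sqrt(t)); parameter-RMS scaling omitted for brevity
    return min(cap, 1.0 / math.sqrt(t))

print(relative_step(1), relative_step(1_000_000))  # -> 0.01 0.001
```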

Figure: the step-by-step Adafactor optimization process, starting from the point (4, 4), is visualized.
Adafactor reduces memory usage by employing separate moment estimates for the row and column dimensions of the parameter matrix.
