
This article was automatically translated from the original Turkish version.

Year: 2018
Advantages: Adaptive Learning Rate, Low Memory Usage

Adafactor is an efficient, low-memory optimization algorithm developed by Google, specifically designed for memory-intensive models such as large-scale language models. It was first introduced in 2018 in the paper titled "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost". Like the Adam algorithm, Adafactor performs moment-based updates but computes second-moment estimates using significantly less memory, thereby enabling the training of large models.

Adafactor Optimization Algorithm

Memory Efficiency

The most important feature of Adafactor is that instead of storing the second-moment estimate as a full matrix, it separately maintains the row and column averages. This reduces memory consumption by roughly a square-root factor for high-dimensional tensors: for a parameter matrix of size $d \times d$, the second-moment state shrinks from $d^2$ entries to $2d$.
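To make the square-root saving concrete, here is a quick count of the second-moment state for a hypothetical $d \times d$ weight matrix (the value of $d$ is purely illustrative):

```python
d = 1024                      # illustrative layer width
full_second_moment = d * d    # Adam-style: one float per parameter
factored = d + d              # Adafactor: one row vector + one column vector
print(full_second_moment, factored, full_second_moment / factored)
# 1048576 floats vs 2048 floats -> 512x less second-moment state
```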

Update Mechanism

Like Adam, Adafactor is built on moment estimates of the gradients: a first moment (mean) and a second moment (mean of squares). In its default configuration the first moment is disabled ($\beta_1 = 0$), and the second moment is computed in factored form as follows:

For a parameter matrix $W \in \mathbb{R}^{r \times c}$:


$R_{t,i} = \beta_2 \cdot R_{t-1,i} + (1 - \beta_2) \cdot \frac{1}{c} \sum_{j=1}^{c} g_{t,ij}^2$

$C_{t,j} = \beta_2 \cdot C_{t-1,j} + (1 - \beta_2) \cdot \frac{1}{r} \sum_{i=1}^{r} g_{t,ij}^2$

Using these row and column statistics, an approximate second-moment matrix is obtained as a rank-1 (outer-product) reconstruction:

$\hat{v}_{t,ij} = \dfrac{R_{t,i} \cdot C_{t,j}}{\frac{1}{r} \sum_{i'=1}^{r} R_{t,i'}}$

where the denominator is the running global mean of the squared gradients, which keeps the reconstruction on the same scale as the true second moment.

The parameters are updated with the learning rate and normalization:

$\theta_t = \theta_{t-1} - \eta_t \cdot \frac{g_t}{\sqrt{\hat{v}_t} + \epsilon}$
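The factored update above can be sketched in a few lines of plain Python. This is a minimal illustration of the technique, not a full implementation: momentum is omitted (the $\beta_1 = 0$ default), bias correction and the paper's update clipping are skipped, and the hyperparameter values are placeholders.

```python
import math

def adafactor_step(theta, grad, R, C, lr=0.01, beta2=0.999, eps=1e-30):
    """One simplified Adafactor update for an r x c parameter matrix.

    R holds the running row statistics (length r), C the column
    statistics (length c). Returns the updated theta, R, C.
    """
    r, c = len(grad), len(grad[0])
    sq = [[g * g for g in row] for row in grad]

    # Exponential moving averages of the row and column means of g^2
    for i in range(r):
        R[i] = beta2 * R[i] + (1 - beta2) * sum(sq[i]) / c
    for j in range(c):
        C[j] = beta2 * C[j] + (1 - beta2) * sum(sq[i][j] for i in range(r)) / r

    mean_R = sum(R) / r  # running global mean of the squared gradients
    for i in range(r):
        for j in range(c):
            v_hat = R[i] * C[j] / mean_R          # rank-1 reconstruction
            theta[i][j] -= lr * grad[i][j] / (math.sqrt(v_hat) + eps)
    return theta, R, C

# Tiny 2 x 3 example: only 2 + 3 = 5 statistics are stored, not 6
theta = [[0.5, -0.2, 0.1], [0.3, 0.0, -0.4]]
grad  = [[0.1,  0.2, -0.1], [-0.2, 0.05, 0.3]]
R, C = [0.0, 0.0], [0.0, 0.0, 0.0]
theta, R, C = adafactor_step(theta, grad, R, C)
```

Note that the per-entry denominator $\sqrt{\hat v_{t,ij}}$ is never materialized as a stored matrix; it is recomputed from $R$ and $C$ on the fly, which is exactly where the memory saving comes from.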

Properties

Adaptive Learning Rate

Adafactor, by default, uses a relative learning rate ($\eta_t \propto \frac{1}{\sqrt{t}}$, scaled by the root-mean-square of the parameters) instead of an absolute one. This enables automatic learning-rate control for large models without manual tuning of fixed values.

Memory Usage

• Adam: Requires $O(n)$ additional memory (two moment estimates per parameter).
• Adafactor: Achieves approximately the same performance as Adam while storing only $O(\sqrt{n})$ second-moment state for an $n$-parameter matrix.

Advantages

• Memory efficient: Particularly preferred for massive Transformer-based models.
• Adaptive learning: The learning rate is adjusted automatically.
• Adam-like performance: Delivers accuracy comparable to Adam in most cases.

Disadvantages

• Code complexity: Has a more involved update mechanism than Adam.
• Factorization applies only to matrix-shaped parameters: Vectors and scalars fall back to unfactored second-moment storage, losing the memory advantage.
• May require fine-tuning of its default hyperparameters.

Applications

• Transformers: Has been used in training models such as T5, mT5, and BERT.
• Language modeling: Effective for long training runs on large datasets.
• Memory-constrained environments: Advantageous on systems with limited GPU memory.

[Figure: step-by-step visualization of the Adafactor optimization process starting from the point (4, 4).]

Adafactor reduces memory usage by employing separate moment estimates for the row and column dimensions of the parameter matrix.

Author Information

Kaan Gümele, December 9, 2025, 6:23 AM


Contents

• Adafactor Optimization Algorithm
  • Memory Efficiency
  • Update Mechanism
• Properties
  • Adaptive Learning Rate
  • Memory Usage
• Advantages
• Disadvantages
• Applications
