This article was automatically translated from the original Turkish version.

Adafactor is an efficient, low-memory optimization algorithm developed by Google, specifically designed for memory-intensive models such as large-scale language models. It was first introduced in 2018 in the paper titled "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost". Like the Adam algorithm, Adafactor performs moment-based updates but computes second-moment estimates using significantly less memory, thereby enabling the training of large models.
The most important feature of Adafactor is that, instead of storing the second moment as a full matrix, it maintains only its row and column averages. For high-dimensional tensors this reduces memory consumption by roughly a square-root factor, from the product of the dimensions to their sum. For example, for a parameter matrix of size $d \times d$, Adam stores $d^2$ second-moment entries, while Adafactor stores only $2d$ (one row vector and one column vector).
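A quick back-of-the-envelope comparison illustrates the savings (the dimension `d = 4096` is an assumed example size, not from the original):

```python
d = 4096  # hypothetical hidden dimension of one weight matrix

# Adam: one second-moment entry per parameter of the d x d matrix
adam_entries = d * d

# Adafactor: one row-average vector plus one column-average vector
adafactor_entries = d + d

print(adam_entries / adafactor_entries)  # -> 2048.0
```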
Like the Adam algorithm, Adafactor can track the first moment (mean) and second moment (mean of squares) of the gradients, although first-moment accumulation is disabled by default. The second moment, however, is computed in factored form as follows:
For a parameter matrix $W \in \mathbb{R}^{r \times c}$:
$R_{t,i} = \beta_2 \cdot R_{t-1,i} + (1 - \beta_2) \cdot \frac{1}{c} \sum_{j=1}^{c} g_{t,ij}^2 $
$C_{t,j} = \beta_2 \cdot C_{t-1,j} + (1 - \beta_2) \cdot \frac{1}{r} \sum_{i=1}^{r} g_{t,ij}^2 $
Using these values, an approximate square norm matrix is obtained:
$\hat{v}_{t,ij} = \frac{R_{t,i} \cdot C_{t,j}}{\frac{1}{rc} \sum_{i=1}^{r} \sum_{j=1}^{c} R_{t,i} \cdot C_{t,j}} $
The parameters are updated with the learning rate and normalization:
$\theta_t = \theta_{t-1} - \eta_t \cdot \frac{g_t}{\sqrt{\hat{v}_t} + \epsilon} $
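The equations above can be sketched as a single NumPy update step. This is a minimal illustration, not the reference implementation: the function name `adafactor_step`, the base rate `lr0`, and the omission of first-moment accumulation are all assumptions for brevity.

```python
import numpy as np

def adafactor_step(theta, g, R, C, t, beta2=0.999, eps=1e-30, lr0=0.05):
    """One factored second-moment update step (sketch of the equations above)."""
    # Exponential moving averages of the row and column means of g^2
    R = beta2 * R + (1 - beta2) * (g ** 2).mean(axis=1)  # shape (r,)
    C = beta2 * C + (1 - beta2) * (g ** 2).mean(axis=0)  # shape (c,)
    # Rank-1 reconstruction of the second moment, normalized by its mean
    RC = np.outer(R, C)
    v_hat = RC / RC.mean()
    # Relative learning rate decaying as 1/sqrt(t)
    eta = lr0 / np.sqrt(t)
    theta = theta - eta * g / (np.sqrt(v_hat) + eps)
    return theta, R, C
```

Note that only the vectors `R` and `C` persist between steps; the full matrix `v_hat` is reconstructed on the fly and discarded.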
Adafactor, by default, uses a relative learning rate $(\eta_t \propto \frac{1}{\sqrt{t}})$ instead of an absolute learning rate. This enables automatic learning rate control for large models without requiring manual tuning of fixed values.
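In the original paper the relative step size is additionally capped, $\rho_t = \min(10^{-2}, 1/\sqrt{t})$, and then scaled by the RMS of the parameters. A sketch of the decay schedule alone (the parameter-RMS scaling is omitted here):

```python
import math

def relative_step(t, cap=1e-2):
    # rho_t = min(1e-2, 1/sqrt(t)); parameter-RMS scaling omitted for brevity
    return min(cap, 1.0 / math.sqrt(t))

print(relative_step(1), relative_step(1_000_000))  # -> 0.01 0.001
```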

Figure: the step-by-step Adafactor optimization process, starting from the point (4, 4), is visualized.
Adafactor reduces memory usage by employing separate moment estimates for the row and column dimensions of the parameter matrix.
