This article was automatically translated from the original Turkish version.

AdamW (Adam with decoupled Weight Decay) is a variant of the Adam optimization algorithm that improves how regularization is applied. It aims to enhance Adam’s generalization ability by applying weight decay directly to the weights rather than folding an L2 penalty into the gradient. In traditional Adam, an L2 penalty added to the loss flows through the gradient and is therefore rescaled by the adaptive moment estimates; AdamW instead applies the decay term independently of the gradient-based update step, enabling more effective regularization.
AdamW retains the core structure of the Adam algorithm but changes where weight decay enters the update. Constraining the magnitude of the model’s weights helps prevent overfitting; however, when Adam absorbs this penalty into the gradient updates, the decay is distorted by the per-parameter adaptive learning rates. AdamW avoids this by applying the decay as a separate step.
AdamW thus follows the same overall procedure as Adam but separates the weight decay term from the adaptive update. The update steps of the AdamW algorithm are defined as follows:

Key Concepts
Mathematical Formulation of AdamW
Calculation of Moments: