Adam (Adaptive Moment Estimation) is an adaptive learning rate optimization algorithm introduced by Kingma and Ba (ICLR 2015). It combines the ideas behind RMSProp (per-parameter scaling by an average of squared gradients) and SGD with momentum (a moving average of the gradient).


Moving averages

$m_t$ is the moving average of the gradient; $v_t$ is the moving average of the squared gradient (i.e. the uncentered variance). $\beta_1$ and $\beta_2$ are almost always left at their default values of 0.9 and 0.999.

$$ \begin{align} m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \end{align} $$
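The two recursions can be sketched in a few lines of plain Python (`adam_moments` is a name chosen here for illustration, not a library API):

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Run Adam's two exponential moving averages over a gradient sequence."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g       # first moment: average of g
        v = beta2 * v + (1 - beta2) * g ** 2  # second moment: average of g^2
    return m, v
```

For a long stream of constant gradients $g$, $m \to g$ and $v \to g^2$; in the first few steps both are pulled toward their zero initialization, which is exactly what the bias correction fixes.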

Update rule

Because both averages are initialized at zero, the raw estimates are biased toward zero early in training. We define the bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$ as follows: $$ \begin{align} \hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\ \hat{v}_t &= \frac{v_t}{1-\beta_2^t} \end{align} $$
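Where the $1-\beta_1^t$ factor comes from: unrolling the recursion and assuming the gradient distribution is stationary ($E[g_i] = E[g]$ for all $i$), the zero initialization shrinks the expectation by exactly that factor:

$$ \begin{align} m_t &= (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i \\ E[m_t] &= (1-\beta_1)\,\frac{1-\beta_1^t}{1-\beta_1}\,E[g] = (1-\beta_1^t)\,E[g] \end{align} $$

Dividing by $1-\beta_1^t$ removes this bias; the same argument with $\beta_2$ and $g_t^2$ gives the correction for $v_t$.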

such that:

$$ \begin{align} E[\hat{m}_t] &= E[g_t] \\ E[\hat{v}_t] &= E[g_t^2] \end{align} $$

So the gradient update rule becomes: $$ w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$ where $\epsilon$ is a small constant (typically $10^{-8}$) for numerical stability.
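Putting the pieces together, a minimal sketch of one full Adam update (NumPy; `adam_step` is a name chosen here for illustration, not a library API):

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new weights and moment estimates (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t)
# w has moved from 0 toward the minimizer 3
```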


Two properties of this update:

  1. The magnitude of each update step is bounded on the order of $\eta$: by the Cauchy–Schwarz inequality, $|\hat{m}_t| / \sqrt{\hat{v}_t}$ cannot grow arbitrarily large, and the paper bounds the effective step by $\eta\,(1-\beta_1)/\sqrt{1-\beta_2}$ in the worst case.
  2. Smaller gradient variance (all values near the mean) means a larger update step, since the denominator $\sqrt{\hat{v}_t}$ shrinks toward $|\hat{m}_t|$.
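A quick numeric check of the step-size bound (the gradient distribution and seed below are arbitrary choices for illustration):

```python
import numpy as np

beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8
rng = np.random.default_rng(0)

m = v = 0.0
max_step = 0.0
for t in range(1, 10001):
    g = rng.normal(0.5, 2.0)  # noisy gradients at an arbitrary scale
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    step = eta * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    max_step = max(max_step, abs(step))

# max_step stays under the worst-case bound eta * (1 - beta1) / sqrt(1 - beta2),
# regardless of the raw gradient scale
```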

Design Choices

Why is the uncentered variance used?

  • We do not have access to the global mean gradient (across all batches / timesteps), but we can safely assume it to be zero.
  • Under that zero-mean assumption, the uncentered variance equals the centered variance, so $v_t$ has the same expectation as the true variance of the gradient.
  • Intuitively, treating the gradients as i.i.d.: $v_t$ estimates the expected variance at the current step (or any step, by the i.i.d. assumption), and $m_t$ estimates the expected gradient mean.
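A small Monte-Carlo check of this claim, with an arbitrary gradient distribution (mean 0.3, std 1.5, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2 = 0.9, 0.999
mu, sigma = 0.3, 1.5  # true gradient mean and std, so E[g^2] = mu**2 + sigma**2

m = v = 0.0
m_hats, v_hats = [], []
for t in range(1, 50001):
    g = rng.normal(mu, sigma)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hats.append(m / (1 - beta1 ** t))
    v_hats.append(v / (1 - beta2 ** t))

# Averaged over the run, the bias-corrected estimates track the true moments:
# np.mean(m_hats) ~ mu, and np.mean(v_hats) ~ mu**2 + sigma**2
```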

Dividing them, $\hat{m}_t / \sqrt{\hat{v}_t}$, is like a Sharpe ratio: it measures how much to trust the current mean estimate relative to its noise.