Adam

Introduction

Adam (Adaptive Moment Estimation) is an adaptive learning rate optimization algorithm introduced by Kingma and Ba at ICLR 2015. It combines the ideas behind RMSProp and SGD with momentum.

Maths

Moving averages

$m_t$ is the moving average of the gradient, and $v_t$ is the moving average of the squared gradient (i.e. the uncentered variance). $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999 by default, and in practice they are almost never changed.

$$
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2
\end{align}
$$
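Unrolling the recursion (with the standard initialization $m_0 = v_0 = 0$) shows that each quantity is an exponentially weighted sum of past gradients:

$$
m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i} g_i, \qquad
v_t = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{\,t-i} g_i^2
$$

Taking expectations under a (roughly) stationary gradient distribution gives $E[m_t] \approx (1-\beta_1^t)\,E[g_t]$, which is exactly the bias that the correction in the next section removes.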

Update rule

We define the bias-corrected quantities $\hat{m}_t$ and $\hat{v}_t$ as follows:

$$
\begin{align}
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1-\beta_2^t}
\end{align}
$$

such that:

$$
\begin{align}
E[\hat{m}_t] &= E[g_t] \\
E[\hat{v}_t] &= E[g_t^2]
\end{align}
$$
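Since $m_0 = v_0 = 0$, the raw averages are biased toward zero during the first few steps, and dividing by $1-\beta_1^t$ (resp. $1-\beta_2^t$) removes exactly that bias. For example, at the very first step:

$$
m_1 = (1-\beta_1)\,g_1 \quad\Rightarrow\quad \hat{m}_1 = \frac{(1-\beta_1)\,g_1}{1-\beta_1^{\,1}} = g_1
$$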

So the gradient update rule becomes:

$$
w_t = w_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$
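Putting the moving averages, bias correction, and update rule together, a single Adam step can be sketched in a few lines of NumPy. This is a minimal illustrative sketch (the function `adam_step` and its argument names are my own), not a drop-in replacement for a library optimizer:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative Adam update for parameters w given gradient g at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g         # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g**2      # moving average of the squared gradient
    m_hat = m / (1 - beta1**t)              # bias-corrected first moment
    v_hat = v / (1 - beta2**t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: m, v and the step counter t are optimizer state carried between calls.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):
    g = 2 * w                               # gradient of the toy loss ||w||^2
    w, m, v = adam_step(w, g, m, v, t)
```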

Consequences

  1. The magnitude of each update step is bounded between 0 and roughly $\eta$ (by the Cauchy–Schwarz inequality; see the inequality after this list)
  2. A smaller gradient variance (e.g. all values close to the mean) leads to a larger update step
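To see why the first point holds, combine the expectations above with the Cauchy–Schwarz (equivalently, Jensen's) inequality:

$$
\left|E[\hat{m}_t]\right| = \left|E[g_t]\right| \le \sqrt{E[g_t^2]} = \sqrt{E[\hat{v}_t]}
$$

so in the typical case $|\hat{m}_t| / \sqrt{\hat{v}_t} \lesssim 1$, and the magnitude of each parameter update is at most roughly $\eta$.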

Design Choices

Why is the uncentered variance used?

  • We do not have access to the global mean gradient (across all batches / timesteps), but we can safely assume it to be zero.
  • The variance term in Adam has the same expectation as the uncentered variance of the global gradient (which, under the zero-mean assumption, equals the centered variance).
  • $v_t$: given the historical gradients, and assuming a global mean of zero, what is the expected variance (for the current step, or for any step, since the gradients are treated as i.i.d.)?
  • $m_t$: what is the expected mean gradient (for the current step, or for any step, by the same i.i.d. assumption)?

  • Dividing them is like a Sharpe ratio: it measures how much to trust the current estimate of the mean gradient.
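In symbols, under the zero-mean assumption above (so that $\sqrt{E[g_t^2]}$ plays the role of a standard deviation):

$$
\frac{\hat{m}_t}{\sqrt{\hat{v}_t}} \approx \frac{E[g_t]}{\sqrt{E[g_t^2]}}
$$

which has the same shape as the Sharpe ratio (mean divided by standard deviation). When recent gradients agree on a direction the ratio is close to $\pm 1$ and the step is near its maximum $\eta$; when they mostly cancel, the ratio shrinks and so does the step.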