Adam
Introduction
Adam (Adaptive Moment Estimation) is an adaptive learning-rate optimization algorithm published at ICLR 2015 (Kingma & Ba). It combines the ideas behind RMSProp and SGD with momentum.
Maths
Moving averages
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ is the moving average of the gradient; $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ is the moving average of the squared gradient (i.e. the uncentered variance). The decay rates $\beta_1$ and $\beta_2$ default to 0.9 and 0.999, and these defaults are almost never changed in practice.
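As a minimal sketch of these two recursions (plain NumPy; the variable names `m`, `v`, `beta1`, `beta2` are illustrative and not tied to any particular library):

```python
import numpy as np

def update_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """One step of Adam's exponential moving averages.

    m: running estimate of the mean gradient (first moment)
    v: running estimate of the squared gradient (second, uncentered moment)
    """
    m = beta1 * m + (1 - beta1) * grad       # EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad**2    # EMA of the squared gradient
    return m, v

# Both averages are initialised at zero, which biases them toward zero early on;
# the bias correction in the update rule below compensates for this.
grad = np.array([0.1, -0.2, 0.05])
m, v = np.zeros_like(grad), np.zeros_like(grad)
m, v = update_moments(m, v, grad)
```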
Update rule
We set the bias-corrected quantities $\hat{m}_t$ and $\hat{v}_t$ as follows:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

such that (since both averages start at zero and would otherwise be biased toward zero early in training):

$$\mathbb{E}[\hat{m}_t] \approx \mathbb{E}[g_t], \qquad \mathbb{E}[\hat{v}_t] \approx \mathbb{E}[g_t^2]$$

So the gradient update rule becomes:

$$\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
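Putting the pieces together, here is a hedged NumPy sketch of one full Adam step (default hyperparameters $\alpha = 10^{-3}$, $\epsilon = 10^{-8}$ as in the paper; the function name `adam_step` is just illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam parameter update, following the equations above (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment EMA
    m_hat = m / (1 - beta1**t)                # bias-corrected mean gradient
    v_hat = v / (1 - beta2**t)                # bias-corrected uncentered variance
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # moves toward the minimum at 0
```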
Consequences
- The effective step size $\alpha \, \hat{m}_t / \sqrt{\hat{v}_t}$ is bounded in magnitude by (approximately) $\alpha$, since $|\mathbb{E}[g]| \le \sqrt{\mathbb{E}[g^2]}$ by the Cauchy–Schwarz inequality; see the short derivation after this list.
- Smaller variance (e.g. all gradient values near their mean, so the gradient direction is consistent) means a larger update step, since the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is closer to 1.
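A short informal derivation of the step-size bound (ignoring $\epsilon$ and treating the bias-corrected averages as the true expectations; this is a sketch, not the paper's exact bound):

```latex
% Jensen / Cauchy--Schwarz gives (\mathbb{E}[g])^2 \le \mathbb{E}[g^2],
% i.e. |\mathbb{E}[g]| \le \sqrt{\mathbb{E}[g^2]}, hence
\left|\Delta\theta_t\right|
  = \alpha \, \frac{\lvert \hat{m}_t \rvert}{\sqrt{\hat{v}_t}}
  \approx \alpha \, \frac{\lvert \mathbb{E}[g] \rvert}{\sqrt{\mathbb{E}[g^2]}}
  \;\le\; \alpha
```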
Design Choices
Why is the uncentered variance used?
- We do not have access to the mean global gradient (across all batches / timesteps) but we can safely assume it to be zero.
- The variance term in Adam has the same expectation as the uncentered variance of the global gradient (which equals the centered variance under the zero-mean assumption above).
- $v_t$: given the historical gradients, and assuming the global mean is zero, this is the expected variance of the gradient for the current step (or any step, since the gradients are treated as i.i.d.).
- $m_t$: the expected mean gradient for the current step (or any step, by the same i.i.d. assumption).
- Dividing them, $\hat{m}_t / \sqrt{\hat{v}_t}$, acts like a Sharpe ratio: it measures how much to trust the current estimate of the mean gradient, and the update step is scaled accordingly; a numerical sketch follows below.
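A small numerical sketch of this signal-to-noise view (synthetic scalar gradients only, nothing from a real model): a consistent gradient stream yields a ratio near 1, so Adam takes close to the full step $\alpha$, while a noisy zero-mean stream yields a ratio near 0 and correspondingly tiny steps.

```python
import numpy as np

def snr(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected m_hat / sqrt(v_hat) after a stream of scalar gradients."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
consistent = 1.0 + 0.01 * rng.standard_normal(1000)  # mean >> noise: trustworthy direction
noisy = rng.standard_normal(1000)                    # mean ~ 0: untrustworthy direction
print(snr(consistent))  # close to 1 -> step close to alpha
print(snr(noisy))       # small in magnitude -> step much smaller than alpha
```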