Args:
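learning_rate: A Tensor or a floating point value. The learning rate. It controls how strongly the weights are updated at each step (e.g. 0.001). Larger values (e.g. 0.3) give faster initial learning before the rate is adapted, while smaller values (e.g. 1.0E-5) slow learning down but can let training converge to better performance.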
beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. It is a very small number whose purpose is to prevent division by zero in the implementation.
use_locking: If True use locks for update operations.
name: Optional name for the operations created when applying gradients.
Initialization:
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
The update rule for `variable` with gradient `g` uses an optimization described at the end of Section 2 of the paper:
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
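The initialization and update rule above can be sketched in plain Python for a single scalar variable. This is a minimal illustration, not the TensorFlow implementation: the function names `adam_step` and `minimize_quadratic`, the default hyperparameter values, and the toy objective f(x) = x^2 are assumptions made for the example.

```python
import math


def adam_step(variable, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to a scalar `variable` given gradient `g`.

    Returns the new (variable, m, v, t). Defaults are assumed values.
    """
    t += 1
    # Both bias corrections are folded into the step size lr_t.
    lr_t = learning_rate * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    m = beta1 * m + (1.0 - beta1) * g        # 1st moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # 2nd raw moment estimate
    variable = variable - lr_t * m / (math.sqrt(v) + epsilon)
    return variable, m, v, t


def minimize_quadratic(steps=2000, start=1.0):
    """Minimize the toy objective f(x) = x^2 (gradient 2x) with Adam."""
    x, m, v, t = start, 0.0, 0.0, 0          # m_0 = 0, v_0 = 0, t = 0
    for _ in range(steps):
        x, m, v, t = adam_step(x, 2.0 * x, m, v, t, learning_rate=0.01)
    return x
```

On the very first step the update has magnitude close to `learning_rate` regardless of the gradient's scale, because the two bias-correction factors cancel the (1 - beta) factors in m_1 and v_1.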
------------------------------------------------------------------------
ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION
2. Algorithm:
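Let f(θ) be a noisy objective function: a stochastic scalar function that is differentiable with respect to the parameters θ. The goal is to minimize the expected value of this function, E[f(θ)], with respect to θ. Here f1(θ), ..., fT(θ) denote the realisations of the stochastic function at subsequent timesteps 1, ..., T.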
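A noisy objective of this kind can be illustrated with a small sketch. The particular function f_t(θ) = (θ - x_t)^2 with Gaussian samples x_t, the name `noisy_objective`, and all constants are assumptions chosen for the example; they do not come from the paper.

```python
import random


def noisy_objective(theta, rng, target=3.0, noise=0.5):
    """Return one realisation f_t(theta) and its gradient.

    f_t(theta) = (theta - x_t)^2 with x_t drawn from N(target, noise^2).
    """
    x_t = rng.gauss(target, noise)
    value = (theta - x_t) ** 2
    grad = 2.0 * (theta - x_t)   # derivative of f_t with respect to theta
    return value, grad


rng = random.Random(0)
samples = [noisy_objective(1.0, rng) for _ in range(20000)]
# Sample means approximate E[f(theta)] and the expected gradient at theta = 1.
mean_value = sum(value for value, _ in samples) / len(samples)
mean_grad = sum(grad for _, grad in samples) / len(samples)
```

Each call returns a different value, but the averages settle near E[f(θ)] = (θ - target)^2 + noise^2 and E[g] = 2(θ - target). An optimizer only ever sees the individual noisy gradients.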
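The algorithm updates exponential moving averages of the gradient (m_t) and of the squared gradient (v_t), where the parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient.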
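The two moving averages can be sketched as follows. Because both start at zero, they underestimate the true moments during early steps. With a constant gradient g, the averages equal (1 - β^t) times the true moment, so dividing by that factor recovers it. The name `moving_averages` and the constant-gradient input are assumptions made for illustration.

```python
def moving_averages(grads, beta1=0.9, beta2=0.999):
    """Return the final (m_t, v_t) after feeding `grads` in order."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g        # moving average of the gradient
        v = beta2 * v + (1.0 - beta2) * g * g    # moving average of the squared gradient
    return m, v


t = 10
m, v = moving_averages([2.0] * t)   # constant gradient g = 2
# Dividing by (1 - beta^t) removes the bias caused by the zero initialization.
m_hat = m / (1.0 - 0.9 ** t)
v_hat = v / (1.0 - 0.999 ** t)
```

Here `m` and `v` are well below 2 and 4 after ten steps, while `m_hat` and `v_hat` match them up to rounding.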
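The efficiency of the algorithm can be improved by changing the order of computation, e.g. by replacing the last three lines in the loop of the pseudocode with the following two:
α_t = α · sqrt(1 − β2^t) / (1 − β1^t)
θ_t <- θ_{t−1} − α_t · m_t / (sqrt(v_t) + ε̂)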
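The reordering can be checked numerically. The sketch below compares the original ordering (bias-correct m_t and v_t, then update θ) with the reordered form that folds both corrections into the step size. Algebraically the two agree exactly when ε̂ = ε · sqrt(1 − β2^t). The function names and the state values are hypothetical, chosen only for the check.

```python
import math


def step_three_lines(theta, m, v, t, alpha=0.001,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Original ordering: bias-correct m and v, then update theta."""
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return theta - alpha * m_hat / (math.sqrt(v_hat) + eps)


def step_two_lines(theta, m, v, t, alpha=0.001,
                   beta1=0.9, beta2=0.999, eps_hat=1e-8):
    """Reordered form: fold both corrections into the step size alpha_t."""
    alpha_t = alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    return theta - alpha_t * m / (math.sqrt(v) + eps_hat)


# Arbitrary illustrative state (not taken from the paper).
theta, m, v, t = 0.5, 0.03, 0.002, 7
eps = 1e-8
eps_hat = eps * math.sqrt(1.0 - 0.999 ** t)   # rescaled so both forms match
a = step_three_lines(theta, m, v, t, eps=eps)
b = step_two_lines(theta, m, v, t, eps_hat=eps_hat)
```

The reordered form saves the two divisions that produce the bias-corrected vectors, which matters when m_t and v_t are large tensors; only a scalar step size is recomputed per step.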
2.1 ADAM’S UPDATE RULE