Adding learning rate decay when training with TensorFlow

Whether to apply learning rate decay, and how large a learning rate to pick, depends on the optimizer, the batch size, and the task itself.
Generally speaking, tf.train.MomentumOptimizer should be paired with lr decay,
while tf.train.AdamOptimizer does not need lr decay.
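
A rough sketch of the two setups (TF 1.x API; the decay hyperparameters below are illustrative, not taken from this post):

    import tensorflow as tf

    global_step = tf.Variable(0, trainable=False, name="global_step")

    # Momentum is usually paired with a decaying learning rate.
    momentum_lr = tf.train.exponential_decay(0.1, global_step,
                                             decay_steps=1000, decay_rate=0.96,
                                             staircase=True)
    momentum_opt = tf.train.MomentumOptimizer(learning_rate=momentum_lr, momentum=0.9)

    # Adam is often used with a fixed learning rate, since it adapts per-parameter step sizes.
    adam_opt = tf.train.AdamOptimizer(learning_rate=0.001)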

However, there are many different opinions on whether Adam needs learning rate decay, for example the question:
Should we do learning rate decay for adam optimizer?

An excerpt from a good answer:
It depends. ADAM updates any parameter with an individual learning rate. This means that every parameter in the network has a specific learning rate associated with it. But each per-parameter learning rate is computed using lambda (the initial learning rate) as an upper limit. This means that every single learning rate can vary from 0 (no update) to lambda (maximum update). The learning rates adapt themselves during training, it's true, but if you want to be sure that every update step does not exceed lambda, you can then lower lambda using exponential decay or whatever. It can help to reduce loss during the last steps of training, when the loss computed with the previously associated lambda has stopped decreasing.
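
To make the "upper limit" argument concrete, here is a simplified NumPy sketch of one Adam update (bias correction omitted for brevity); the magnitude of each per-parameter step is roughly bounded by lambda, so lowering lambda lowers that bound:

    import numpy as np

    def adam_step(theta, g, m, v, lam=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Simplified Adam update, for illustration only (no bias correction).
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        step = lam * m / (np.sqrt(v) + eps)    # per-parameter step; |step| is on the order of lam
        return theta - step, m, v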

    def _optimize(self, loss, global_):
        # optimizer = tf.train.MomentumOptimizer(
        #     learning_rate=0.001, momentum=0.9)
        # Exponentially decay the learning rate as global_ (the current step) grows.
        learning_rate = tf.train.exponential_decay(self.config.init_lr, global_step=global_,
                                                   decay_steps=self.config.decay_step,
                                                   decay_rate=self.config.decay_rate,
                                                   staircase=True)
        tf.summary.scalar("learning_rate", learning_rate)

        optimizer = tf.train.AdamOptimizer(learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8,
                                           use_locking=False, name="Adam")
        trainable_var = tf.trainable_variables()
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        grads_and_vars = optimizer.compute_gradients(loss, trainable_var)
        # Clip each gradient's norm; variables without a gradient (g is None) are left untouched.
        grads_and_vars = [(tf.clip_by_norm(g, self.config.GRADIENT_CLIP_NORM), v)
                          if g is not None else (g, v)
                          for g, v in grads_and_vars]

        # Run UPDATE_OPS (e.g. batch-norm moving-average updates) before applying gradients.
        with tf.control_dependencies(update_ops):
            # apply_gradient_op = optimizer.minimize(loss)
            apply_gradient_op = optimizer.apply_gradients(grads_and_vars, name='train_op')
        # Dummy scalar kept so the return signature stays unchanged.
        placeholder_float32 = tf.constant(0, dtype=tf.float32)
        # tf.summary.scalar("accuracy", rate)
        return placeholder_float32, apply_gradient_op, apply_gradient_op, learning_rate

Here global_ should be a tensor (a tf.placeholder in this setup): keep track of the iteration count during training to obtain the current step, then feed step to global_ via feed_dict, which yields the lr decay effect, as shown in the sketch below.
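
A minimal training-loop sketch of that feed_dict approach; the names model, loss, inputs, labels, next_batch and num_steps are hypothetical stand-ins for your own graph:

    # global_ is assumed to be a scalar placeholder that _optimize received above.
    global_ = tf.placeholder(tf.int64, shape=[], name="step")
    _, train_op, _, lr = model._optimize(loss, global_)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(num_steps):
            x, y = next_batch()
            _, cur_lr = sess.run([train_op, lr],
                                 feed_dict={inputs: x, labels: y, global_: step})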

(Figure: lr_decay.png — the learning-rate curve over training steps)

With staircase set to True here, the learning rate decreases in a staircase (step-like) pattern, as the figure shows.
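
For reference, this is how tf.train.exponential_decay computes the decayed rate (per its documentation):

    # staircase=True: the exponent is truncated to an integer, giving the step-like curve.
    decayed_lr = init_lr * decay_rate ** (global_step // decay_steps)
    # staircase=False would use a smooth exponent instead:
    # decayed_lr = init_lr * decay_rate ** (global_step / decay_steps)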
