[PyTorch] The learning rate schedule in Transformer-XL

  • Defining the schedulers (a standalone sketch of the inv_sqrt branch follows the code below)
#### scheduler
if args.scheduler == 'cosine':
    # here we do not set eta_min to lr_min to be backward compatible
    # because in previous versions eta_min is default to 0
    # rather than the default value of lr_min 1e-6
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer,
        args.max_step, eta_min=args.eta_min) # should use eta_min arg
    if args.sample_softmax > 0:
        scheduler_sparse = optim.lr_scheduler.CosineAnnealingLR(optimizer_sparse,
            args.max_step, eta_min=args.eta_min) # should use eta_min arg
elif args.scheduler == 'inv_sqrt':
    # originally used for Transformer (in Attention is all you need)
    def lr_lambda(step):
        # return a multiplier instead of a learning rate
        if step == 0 and args.warmup_step == 0:
            return 1.
        else:
            return 1. / (step ** 0.5) if step > args.warmup_step \
                   else step / (args.warmup_step ** 1.5)
    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
elif args.scheduler == 'dev_perf':
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer,
        factor=args.decay_rate, patience=args.patience, min_lr=args.lr_min)
    if args.sample_softmax > 0:
        scheduler_sparse = optim.lr_scheduler.ReduceLROnPlateau(optimizer_sparse,
            factor=args.decay_rate, patience=args.patience, min_lr=args.lr_min)
elif args.scheduler == 'constant':
    pass
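The inv_sqrt branch is the least obvious one, so here is a minimal, standalone sketch of its multiplier (the lr and warmup_step values below are illustrative assumptions, not the repo's defaults). It shows the linear ramp during warmup and that the two branches meet exactly at step == warmup_step:

lr, warmup_step = 0.00025, 4000   # illustrative values standing in for args.lr / args.warmup_step

def lr_lambda(step):
    # LambdaLR multiplies the base lr by this factor at every step
    if step == 0 and warmup_step == 0:
        return 1.
    return 1. / (step ** 0.5) if step > warmup_step \
           else step / (warmup_step ** 1.5)

for step in (1, warmup_step // 2, warmup_step, 2 * warmup_step):
    print(step, lr * lr_lambda(step))
# the two branches join smoothly at step == warmup_step, since
# warmup_step / warmup_step**1.5 == 1 / warmup_step**0.5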
  • Step-wise learning rate annealing: during the warmup stage the learning rate rises gradually, and once past warmup the chosen schedule function takes over (see the trace sketch after the code below)
# step-wise learning rate annealing
train_step += 1
if args.scheduler in ['cosine', 'constant', 'dev_perf']:
    # linear warmup stage
    if train_step < args.warmup_step:
        curr_lr = args.lr * train_step / args.warmup_step
        optimizer.param_groups[0]['lr'] = curr_lr
        if args.sample_softmax > 0:
            optimizer_sparse.param_groups[0]['lr'] = curr_lr * 2
    else:
        if args.scheduler == 'cosine':
            scheduler.step(train_step)
            if args.sample_softmax > 0:
                scheduler_sparse.step(train_step)
elif args.scheduler == 'inv_sqrt':
    scheduler.step(train_step)
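Traced end to end with a dummy parameter (the hyperparameters below are illustrative assumptions, not the repo's defaults), the logic above gives a linear ramp for the first warmup_step steps and then follows the cosine curve. The epoch-style scheduler.step(train_step) call mirrors the original script; newer PyTorch versions may warn about it and prefer a plain scheduler.step():

import torch
import torch.optim as optim

lr, warmup_step, max_step, eta_min = 0.00025, 5, 20, 0.0   # illustrative values

param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.SGD([param], lr=lr)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, max_step, eta_min=eta_min)

train_step = 0
for _ in range(max_step):
    train_step += 1
    if train_step < warmup_step:
        # linear warmup: overwrite the lr directly, bypassing the scheduler
        curr_lr = lr * train_step / warmup_step
        optimizer.param_groups[0]['lr'] = curr_lr
    else:
        # hand the global step to the cosine scheduler, as in the snippet above
        scheduler.step(train_step)
    print(train_step, optimizer.param_groups[0]['lr'])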
  • If annealing is driven by dev-set performance, pass the validation loss to the scheduler at evaluation time (see the sketch after the code below)
# dev-performance based learning rate annealing
if args.scheduler == 'dev_perf':
    scheduler.step(val_loss)
    if args.sample_softmax > 0:
        scheduler_sparse.step(val_loss)
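A small self-contained sketch of this branch (the decay settings and fake validation losses below are illustrative assumptions): ReduceLROnPlateau takes the metric rather than the step, and cuts the lr by factor once the loss has failed to improve for patience evaluations:

import torch
import torch.optim as optim

lr, decay_rate, patience, lr_min = 0.00025, 0.5, 2, 1e-6   # illustrative values

param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.SGD([param], lr=lr)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer,
    factor=decay_rate, patience=patience, min_lr=lr_min)

# fake validation losses: improvement stalls after the third evaluation
for val_loss in [3.0, 2.5, 2.4, 2.4, 2.4, 2.4]:
    scheduler.step(val_loss)
    print(val_loss, optimizer.param_groups[0]['lr'])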