Project Adam: Building an Efficient and Scalable Deep Learning Training System

1. Introduction

Main contributions:

  • Optimizing and balancing both computation and communication through whole system co-design
  • Achieving high performance and scalability by exploiting the ability of machine learning training to tolerate inconsistencies well
  • Demonstrating that system efficiency, scaling, and asynchrony all contribute to improvements in trained model accuracy

2. System architecture

  • The shared model is updated asynchronously through a parameter server
  • Adam is a general-purpose system, because SGD can train any DNN model that is based on back-propagation

2.1. Fast data server

  • A subset of machines is dedicated as data serving machines; they serve the training data and take that load off the model training machines

2.2. Model training

  • We partition our models vertically across the model worker machines; vertical partitioning minimizes cross-machine communication for the convolutional layers
  • Each machine runs multiple threads that share a single model replica and update this local shared model in a lock-free fashion (see the first sketch after this list)
  • Other optimizations: passing a pointer rather than copying data, and improving cache locality
  • Mitigating stragglers: to keep fast machines from waiting on data from slow ones, threads may process multiple images in parallel, and an epoch is considered complete once a specified number of images has been processed
  • Parameter-server communication uses two strategies. For convolutional layers, updates are accumulated locally and sent to the PS periodically, and the PS simply adds them to the global parameters; this is effective because weight sharing keeps these layers small. For fully connected layers, whose parameters are much larger, the worker instead sends the activation and error-gradient vectors and the matrix multiplication is performed on the PS, which greatly reduces communication (see the second sketch after this list)
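
A minimal sketch of the lock-free local update scheme described above. The paper implements this in native code inside its own training engine; here Python threads and NumPy stand in, and every name (`worker`, `shared_weights`, the layer size, the fake gradients) is illustrative rather than taken from the paper:

```python
import threading
import numpy as np

def worker(weights, lr, num_steps, seed):
    # Every thread updates the single shared weight array in place without
    # taking a lock; the occasional lost or overwritten update is tolerated
    # by training (see "inconsistency" in Section 2.3).
    rng = np.random.default_rng(seed)
    for _ in range(num_steps):
        grad = rng.standard_normal(weights.shape)  # stand-in for a real back-propagated gradient
        weights -= lr * grad                       # in-place, lock-free update of the shared model

shared_weights = np.zeros(1024)                    # one model replica shared by all threads on the machine
threads = [threading.Thread(target=worker, args=(shared_weights, 0.01, 100, s))
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```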
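And a sketch of the two parameter-server communication strategies, assuming NumPy arrays on both sides; the layer sizes and function names are made up for illustration, not the paper's API:

```python
import numpy as np

# Convolutional layers: accumulate weight deltas locally and periodically send
# the whole delta; weight sharing keeps it small, and the server just adds it.
def push_conv_update(server_weights, accumulated_delta):
    server_weights += accumulated_delta

# Fully connected layers: instead of the M x N update matrix, send the
# length-M error-gradient vector and the length-N activation vector and let
# the parameter server form the outer product, i.e. do the matrix multiply.
def push_fc_update(server_weights, error_grad, activations, lr=0.01):
    server_weights -= lr * np.outer(error_grad, activations)

# Illustrative 2048 x 4096 fully connected layer: shipping the update matrix
# would cost 2048 * 4096 ≈ 8.4M values; the two vectors cost 2048 + 4096 = 6144.
W_fc = np.zeros((2048, 4096))
error_grad = np.random.standard_normal(2048)     # error gradient at the layer output
activations = np.random.standard_normal(4096)    # activations feeding the layer
push_fc_update(W_fc, error_grad, activations)

W_conv = np.zeros((96, 3, 11, 11))               # small conv kernel bank (weight sharing)
push_conv_update(W_conv, np.random.standard_normal(W_conv.shape) * 1e-3)
```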

2.3. Global parameter server

  • Hashed storage: parameter shards are hashed into storage buckets that are distributed equally among the parameter server machines (see the sketch after this list)
  • Batch updates: all updates in a batch are applied to one block of parameters before moving on to the next block in the shard
  • Lock free: lock-free data structures are used for the queues and hash tables, and memory allocation is lock free as well
  • Inconsistency: DNN models are capable of learning even in the presence of small amounts of lost updates, and they also tolerate a small number of delayed updates
  • Fault tolerance: each parameter shard is kept in three copies; the primary uses a two-phase commit protocol when sending updates to the secondary machines
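
A rough sketch of the shard-to-bucket hashing and the block-wise batched update application, again with made-up machine names, block size, and hash choice (the paper does not specify these details):

```python
import hashlib
import numpy as np

NUM_BUCKETS = 8
PS_MACHINES = ["ps-0", "ps-1", "ps-2"]             # hypothetical parameter server machines

def bucket_of(shard_id: str) -> int:
    # Shards are hashed into a fixed set of storage buckets...
    return int(hashlib.md5(shard_id.encode()).hexdigest(), 16) % NUM_BUCKETS

def server_of(shard_id: str) -> str:
    # ...and the buckets are distributed evenly across the PS machines.
    return PS_MACHINES[bucket_of(shard_id) % len(PS_MACHINES)]

def apply_batched_updates(shard: np.ndarray, updates, block_size: int = 1024):
    # Apply all queued updates to one block of the shard before moving to the
    # next block, so each block stays cache-resident while it is updated.
    for start in range(0, shard.size, block_size):
        end = min(start + block_size, shard.size)
        for delta in updates:
            shard[start:end] += delta[start:end]

shard = np.zeros(4096)
updates = [np.random.standard_normal(4096) * 1e-3 for _ in range(16)]
print(server_of("fc7/weights"))                    # which PS machine owns this (hypothetical) shard
apply_batched_updates(shard, updates)
```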