Project Adam: Building an Efficient and Scalable Deep Learning Training System

1. Introduction

Main contributions:

  • Optimizing and balancing both computation and communication through whole system co-design
  • Achieving high performance and scalability by exploiting the ability of machine learning training to tolerate inconsistencies well
  • Demonstrating that system efficiency, scaling, and asynchrony all contribute to improvements in trained model accuracy

2. System architecture

  • The shared model is updated asynchronously through a parameter server
  • Adam is a general-purpose system, because SGD can train any DNN model that is based on back-propagation

2.1. Fast data server

  • A subset of machines is dedicated as data serving machines; they serve the training data and take that load off the model training machines

2.2. Model training

  • We partition our models vertically across the model worker machines; vertical partitioning minimizes cross-machine communication for the convolutional layers
  • Each machine runs multiple threads that share a single model replica and update this local shared model in a lock-free fashion (see the first sketch after this list)
  • Other optimizations: passing a pointer rather than copying data, and improving cache locality
  • Mitigating stragglers: to keep fast machines from waiting on data from slow ones, threads may process multiple images in parallel, and an epoch is considered complete once a specified number of images has been processed
  • Parameter-server communication uses two strategies. For convolutional layers, updates are accumulated locally and sent to the PS periodically, and the PS simply adds them to the global parameters; this is effective because weight sharing keeps these layers small. For fully connected layers, whose parameters are much larger, the worker instead sends the activation and error-gradient vectors and the matrix multiplication is performed on the PS, which greatly reduces communication (see the second sketch after this list)
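
A minimal sketch of the lock-free local update scheme described above. The paper implements this in native code inside its own training engine; here Python threads and NumPy stand in, and every name (`worker`, `shared_weights`, the layer size, the fake gradients) is illustrative rather than taken from the paper:

```python
import threading
import numpy as np

def worker(weights, lr, num_steps, seed):
    # Every thread updates the single shared weight array in place without
    # taking a lock; the occasional lost or overwritten update is tolerated
    # by training (see "inconsistency" in Section 2.3).
    rng = np.random.default_rng(seed)
    for _ in range(num_steps):
        grad = rng.standard_normal(weights.shape)  # stand-in for a real back-propagated gradient
        weights -= lr * grad                       # in-place, lock-free update of the shared model

shared_weights = np.zeros(1024)                    # one model replica shared by all threads on the machine
threads = [threading.Thread(target=worker, args=(shared_weights, 0.01, 100, s))
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```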
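And a sketch of the two parameter-server communication strategies, assuming NumPy arrays on both sides; the layer sizes and function names are made up for illustration, not the paper's API:

```python
import numpy as np

# Convolutional layers: accumulate weight deltas locally and periodically send
# the whole delta; weight sharing keeps it small, and the server just adds it.
def push_conv_update(server_weights, accumulated_delta):
    server_weights += accumulated_delta

# Fully connected layers: instead of the M x N update matrix, send the
# length-M error-gradient vector and the length-N activation vector and let
# the parameter server form the outer product, i.e. do the matrix multiply.
def push_fc_update(server_weights, error_grad, activations, lr=0.01):
    server_weights -= lr * np.outer(error_grad, activations)

# Illustrative 2048 x 4096 fully connected layer: shipping the update matrix
# would cost 2048 * 4096 ≈ 8.4M values; the two vectors cost 2048 + 4096 = 6144.
W_fc = np.zeros((2048, 4096))
error_grad = np.random.standard_normal(2048)     # error gradient at the layer output
activations = np.random.standard_normal(4096)    # activations feeding the layer
push_fc_update(W_fc, error_grad, activations)

W_conv = np.zeros((96, 3, 11, 11))               # small conv kernel bank (weight sharing)
push_conv_update(W_conv, np.random.standard_normal(W_conv.shape) * 1e-3)
```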

2.3. Global parameter server

  • Hashed storage: parameter shards are hashed into storage buckets that are distributed equally among the parameter server machines (see the sketch after this list)
  • Batch updates: all updates in a batch are applied to one block of parameters before moving on to the next block in the shard
  • Lock free: lock-free data structures are used for the queues and hash tables, and memory allocation is lock free as well
  • Inconsistency: DNN models are capable of learning even in the presence of small amounts of lost updates, and they also tolerate a small number of delayed updates
  • Fault tolerance: each parameter shard is kept in three copies; the primary uses a two-phase commit protocol when sending updates to the secondary machines
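
A rough sketch of the shard-to-bucket hashing and the block-wise batched update application, again with made-up machine names, block size, and hash choice (the paper does not specify these details):

```python
import hashlib
import numpy as np

NUM_BUCKETS = 8
PS_MACHINES = ["ps-0", "ps-1", "ps-2"]             # hypothetical parameter server machines

def bucket_of(shard_id: str) -> int:
    # Shards are hashed into a fixed set of storage buckets...
    return int(hashlib.md5(shard_id.encode()).hexdigest(), 16) % NUM_BUCKETS

def server_of(shard_id: str) -> str:
    # ...and the buckets are distributed evenly across the PS machines.
    return PS_MACHINES[bucket_of(shard_id) % len(PS_MACHINES)]

def apply_batched_updates(shard: np.ndarray, updates, block_size: int = 1024):
    # Apply all queued updates to one block of the shard before moving to the
    # next block, so each block stays cache-resident while it is updated.
    for start in range(0, shard.size, block_size):
        end = min(start + block_size, shard.size)
        for delta in updates:
            shard[start:end] += delta[start:end]

shard = np.zeros(4096)
updates = [np.random.standard_normal(4096) * 1e-3 for _ in range(16)]
print(server_of("fc7/weights"))                    # which PS machine owns this (hypothetical) shard
apply_batched_updates(shard, updates)
```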