Sequence Parallelism
Sequence parallelism is a memory-efficient parallelism method for training Transformer models with long sequences on GPUs. It removes the single-device limit on input sequence length and allows efficient training with much longer sequences.
Here's how it works:
- The input sequence is split into multiple chunks.
- Each chunk is fed into its corresponding device (e.g., a GPU).
- Ring-style communication is integrated into the self-attention computation, so key/value chunks circulate among the devices (see the sketch after this list).
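To make the ring idea concrete, below is a minimal single-process NumPy sketch that simulates ring self-attention across sequence chunks. The chunk layout, the `ring_self_attention` helper, and the `world_size` loop are illustrative assumptions rather than the API of any particular library; on real hardware each chunk would live on its own GPU and K/V chunks would be exchanged with point-to-point communication.

```python
# A minimal, single-process NumPy sketch of ring self-attention, the core of
# sequence parallelism. The "ring" is simulated with index arithmetic; in a
# real distributed setup each chunk lives on its own device and the K/V chunks
# are exchanged via send/recv. All names here are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ring_self_attention(q, k, v, world_size):
    """q, k, v: (seq_len, d). Returns the same output as full attention."""
    seq_len, d = q.shape
    assert seq_len % world_size == 0
    chunk = seq_len // world_size
    # Step 1: split the sequence into one chunk per (simulated) device.
    q_chunks = np.split(q, world_size)   # each device keeps its own Q chunk
    k_chunks = np.split(k, world_size)   # K/V chunks travel around the ring
    v_chunks = np.split(v, world_size)

    outputs = []
    for rank in range(world_size):                 # work done "on" device `rank`
        # Step 2: ring pass of K chunks -> this device's rows of the score matrix.
        scores = np.empty((chunk, seq_len))
        for step in range(world_size):
            src = (rank + step) % world_size       # chunk arriving at this ring step
            cols = slice(src * chunk, (src + 1) * chunk)
            scores[:, cols] = q_chunks[rank] @ k_chunks[src].T / np.sqrt(d)
        probs = softmax(scores, axis=-1)
        # Step 3: ring pass of V chunks -> this device's block of the output.
        out = np.zeros((chunk, d))
        for step in range(world_size):
            src = (rank + step) % world_size
            cols = slice(src * chunk, (src + 1) * chunk)
            out += probs[:, cols] @ v_chunks[src]
        outputs.append(out)
    return np.concatenate(outputs)                 # gathered only for verification

# Sanity check against ordinary full attention on one "device".
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
full = softmax(q @ k.T / np.sqrt(4)) @ v
assert np.allclose(ring_self_attention(q, k, v, world_size=4), full)
```

The final assertion shows the design point: the ring decomposition is just a blockwise rewrite of full attention, so the result is identical while no single device ever needs to hold more than its own sequence chunk plus one in-flight K/V chunk.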
With sequence parallelism, no single device needs to hold the entire sequence. Moreover, when combined with efficient attention mechanisms of linear complexity, sequence parallelism can in principle support arbitrarily long (effectively unbounded) sequences.
Experiments have shown that sequence parallelism scales well with both batch size and sequence length. It is also compatible with most existing parallelisms (e.g., data parallelism, pipeline parallelism, and tensor parallelism), making 4D parallelism possible.
4D Parallelism
The four dimensions are data, pipeline, tensor, and sequence parallelism.
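As a rough illustration, the sketch below shows how the four dimensions multiply into a single device grid. The group sizes and the axis ordering are made-up assumptions for this example; real frameworks build equivalent process groups from a flat list of ranks with their own layout conventions.

```python
# A minimal sketch of how the four parallel dimensions compose into one
# device grid. The group sizes below are illustrative assumptions only.
import numpy as np

data_parallel_size     = 2   # replicas of the model
pipeline_parallel_size = 2   # pipeline stages
tensor_parallel_size   = 2   # shards of each weight matrix
sequence_parallel_size = 2   # chunks of the input sequence

world_size = (data_parallel_size * pipeline_parallel_size
              * tensor_parallel_size * sequence_parallel_size)

# Arrange the flat ranks 0..world_size-1 on a 4D grid: each axis is one
# parallel dimension, and fixing the other three axes selects one process group.
grid = np.arange(world_size).reshape(
    data_parallel_size, pipeline_parallel_size,
    tensor_parallel_size, sequence_parallel_size)

print("world size:", world_size)                        # 16 GPUs in this toy setup
print("sequence-parallel group of rank 0:", grid[0, 0, 0, :])
```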