Knowledge Vault 6/90 - ICML 2023
Using Megatron to Train Large Language Models
Deepak Narayanan
< Resume Image >

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
    classDef training fill:#f9d4d4, font-weight:bold, font-size:14px
    classDef parallelism fill:#d4f9d4, font-weight:bold, font-size:14px
    classDef optimization fill:#d4d4f9, font-weight:bold, font-size:14px
    classDef resilience fill:#f9f9d4, font-weight:bold, font-size:14px
    A["Using Megatron to Train Large Language Models"] --> B["Training Challenges"]
    A --> C["Parallelism"]
    A --> D["Optimization"]
    A --> E["Resilience"]
    B --> B1["Memory limits challenge large model training. 1"]
    B --> B2["Huge models don't fit in GPU memory. 2"]
    B --> B3["Training expensive, years on one GPU. 3"]
    B --> B4["Scaling across accelerators necessary. 4"]
    B --> B5["Large batches need more tokens. 5"]
    C --> C1["Combining parallelism for efficiency. 6"]
    C --> C2["Data parallelism: model copies, shard input. 7"]
    C --> C3["Model parallelism: split model across devices. 8"]
    C --> C4["Tensor parallelism: partition layers, operators. 9"]
    C --> C5["Pipeline parallelism: shard model layers. 10"]
    C --> C6["Combine parallelism dimensions carefully. 15"]
    D --> D1["Data parallelism needs costly all-reduce. 11"]
    D --> D2["Tensor parallelism: distributed multiplications. 12"]
    D --> D3["Pipeline parallelism risks idle periods. 13"]
    D --> D4["Micro-batching reduces idle time. 14"]
    D --> D5["Lower-level optimizations boost performance. 20"]
    D --> D6["Custom kernels, JIT keep compute-bound. 21"]
    E --> E1["Node failures, periodic checkpointing needed. 28"]
    E --> E2["Checkpoint interval depends on failure rates. 29"]
    E --> E3["Efficient failure recovery future work. 30"]
    E --> E4["MoE models need different parallelism. 23"]
    E --> E5["MoE models shift memory to weights. 26"]
    E --> E6["Expert parallelism introduces new patterns. 27"]
    class A,B,B1,B2,B3,B4,B5 training
    class C,C1,C2,C3,C4,C5,C6 parallelism
    class D,D1,D2,D3,D4,D5,D6 optimization
    class E,E1,E2,E3,E4,E5,E6 resilience

Resume:

1.- Training large language models is challenging due to memory limitations of current accelerators.

2.- Models with hundreds of billions of parameters don't fit in GPU memory.

3.- Training is computationally expensive, taking years on a single GPU for models like GPT-3.
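
As a rough sanity check on that claim, training compute can be estimated with the common "6 × parameters × tokens" rule of thumb. The numbers below (GPT-3-scale parameter and token counts, A100 peak throughput, 50% utilization) are illustrative assumptions, not figures from the talk:

```python
# Back-of-the-envelope estimate of single-GPU training time.
params = 175e9        # GPT-3-scale parameter count (assumed)
tokens = 300e9        # training tokens (assumed)
peak_flops = 312e12   # A100 bf16 peak in FLOP/s (assumed hardware)
utilization = 0.5     # sustained fraction of peak (assumed)

total_flops = 6 * params * tokens                    # ~6 FLOPs per parameter per token
seconds = total_flops / (peak_flops * utilization)
print(f"~{seconds / 3.154e7:.0f} years on one GPU")  # roughly 64 years
```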

4.- Scaling across multiple accelerators is necessary to make training feasible.

5.- Large batch sizes can require more tokens to achieve the same loss, limiting data parallelism scalability.

6.- Combining different parallelism dimensions is crucial for efficient large-scale training.

7.- Data parallelism involves replicating the model across devices and sharding the input data.
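
A minimal PyTorch sketch of this idea, using DistributedDataParallel rather than Megatron's own data-parallel wrapper; the model and hyperparameters are placeholders:

```python
# Each rank holds a full replica and trains on its own shard of the data;
# DDP all-reduces gradients so the replicas stay in sync.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                     # launched via torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda())     # stand-in for a transformer
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")         # this rank's shard of the batch
    loss = model(x).pow(2).mean()
    loss.backward()                                  # gradient all-reduce happens here
    opt.step()
    opt.zero_grad()
```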

8.- Model parallelism splits a single model copy across multiple devices.

9.- Tensor model parallelism partitions individual layers or operators across devices.
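
The sketch below shows the idea for a transformer MLP block as described in the Megatron work: the first weight matrix is split by columns, the second by rows, and a single all-reduce restores the full output on every tensor-parallel rank. It is a simplified illustration; shard shapes and the process group are assumptions:

```python
import torch
import torch.distributed as dist

def tensor_parallel_mlp(x, w1_shard, w2_shard, tp_group=None):
    # w1_shard: [hidden, ffn // tp_size] -> each rank computes its slice of the
    # intermediate activations (column-parallel GEMM, no communication needed).
    h = torch.nn.functional.gelu(x @ w1_shard)
    # w2_shard: [ffn // tp_size, hidden] -> each rank produces a partial sum of
    # the output (row-parallel GEMM); combining the partials needs one all-reduce.
    y = h @ w2_shard
    dist.all_reduce(y, group=tp_group)
    return y
```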

10.- Pipeline model parallelism shards layers of the model across different devices.
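
A naive illustration of this inter-layer sharding, with consecutive blocks of layers placed on different GPUs; real pipelines additionally interleave micro-batches as described in points 13-14, and the device count and layer sizes here are placeholders:

```python
import torch

# Four pipeline stages, one per GPU; activations flow from stage to stage.
stages = [torch.nn.Linear(1024, 1024).to(f"cuda:{i}") for i in range(4)]

def pipeline_forward(x):
    for i, stage in enumerate(stages):
        x = stage(x.to(f"cuda:{i}"))   # send activations to the next stage's device
    return x
```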

11.- Data parallelism requires expensive all-reduce operations for weight gradients after each iteration.
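
To see why this all-reduce is costly, note that a ring all-reduce makes each GPU send and receive roughly 2(n-1)/n times the gradient payload every iteration. The model size and bandwidth below are assumptions for illustration:

```python
params = 1.0e9            # parameters per replica (assumed)
bytes_per_grad = 2        # fp16/bf16 gradients
dp_size = 8               # data-parallel ranks (assumed)
bandwidth = 300e9         # effective bytes/s per GPU (assumed)

payload = params * bytes_per_grad
ring_bytes = 2 * (dp_size - 1) / dp_size * payload   # per-GPU traffic for a ring all-reduce
print(f"{ring_bytes / 1e9:.1f} GB per GPU, ~{ring_bytes / bandwidth * 1e3:.0f} ms per iteration")
```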

12.- Tensor model parallelism involves distributed matrix multiplications, requiring communication between devices.

13.- Pipeline parallelism can lead to idle periods (pipeline bubble) if not carefully managed.

14.- Micro-batching in pipeline parallelism reduces idle time by splitting each batch into smaller micro-batches so that different stages can work on different micro-batches concurrently.
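
The Megatron analysis quantifies this: with p pipeline stages and m micro-batches per batch, the bubble time relative to useful compute is (p - 1)/m, so more micro-batches shrink the bubble:

```python
def bubble_fraction(p_stages: int, m_microbatches: int) -> float:
    """Pipeline bubble time divided by ideal (useful) compute time: (p - 1) / m."""
    return (p_stages - 1) / m_microbatches

print(bubble_fraction(8, 1))    # 7.0  -> without micro-batching, the bubble dwarfs compute
print(bubble_fraction(8, 64))   # 0.109375 -> many micro-batches make it small
```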

15.- Combining parallelism dimensions (PTD parallelism) requires careful consideration of interactions between modes.

16.- Increasing tensor model parallel size decreases pipeline bubble size but may increase cross-server communication.

17.- Optimal parallelism strategy depends on hardware configuration, like GPUs per server and inter-server communication speed.

18.- Different pipelining schedules can trade off pipeline bubble size for more communication.

19.- Factors like global batch size and micro-batch size affect communication, pipeline bubble, and memory footprint.

20.- Lower-level optimizations are necessary for good out-of-the-box performance.

21.- Custom kernels and PyTorch JIT help keep operators compute-bound rather than memory-bound.
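
A simplified example of the kind of fusion this refers to: scripting a bias-add plus GeLU so the elementwise chain compiles into one kernel instead of several memory-bound ones. This is a sketch in the spirit of Megatron-LM's fused kernels, not its exact code:

```python
import torch

@torch.jit.script
def bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Fused bias add + tanh-approximate GeLU; the JIT fuses the elementwise ops
    # so intermediate tensors never round-trip through GPU memory.
    x = y + bias
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1.0 + 0.044715 * x * x)))
```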

22.- Efficient scaling is achieved for large models and GPU counts, reaching 52% of theoretical peak throughput for a trillion-parameter model.

23.- Mixture of Experts (MoE) models have different weight-activation ratios, requiring different optimal parallelism strategies.

24.- Automating discovery of optimal parallelization strategies for arbitrary models and hardware is an open question.

25.- Open-source implementation (Megatron) available on GitHub with features like Flash Attention.

26.- MoE models have more weight parameters, shifting memory footprint from activations to weights.

27.- Expert parallelism in MoE models introduces new communication patterns.
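
A sketch of the dominant new pattern: an all-to-all sends each token to the rank hosting its selected expert, and a second all-to-all brings the results back. Routing weights, capacity limits, and load balancing are omitted, and the helper names are hypothetical:

```python
import torch
import torch.distributed as dist

def dispatch_to_experts(tokens_by_dest: torch.Tensor, expert_fn, ep_group=None):
    # tokens_by_dest: [ep_size * capacity, hidden], already grouped by target rank.
    recv = torch.empty_like(tokens_by_dest)
    dist.all_to_all_single(recv, tokens_by_dest, group=ep_group)  # scatter tokens to expert ranks
    out = expert_fn(recv)                                         # run this rank's local expert(s)
    back = torch.empty_like(out)
    dist.all_to_all_single(back, out, group=ep_group)             # gather results back
    return back
```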

28.- Node failures are an issue at large scale, currently addressed by periodic checkpointing.

29.- Optimal checkpointing interval based on cluster failure rates is important.
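
One standard way to pick that interval is the Young/Daly approximation, interval ≈ sqrt(2 × checkpoint_cost × MTBF); this is a textbook rule of thumb rather than a policy stated in the talk, and the numbers below are assumptions:

```python
import math

def checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly estimate of the interval that minimizes expected lost work."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# e.g. a 5-minute checkpoint and one failure per day across the cluster (assumed):
print(checkpoint_interval(300, 24 * 3600) / 3600)   # ~2.0 hours between checkpoints
```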

30.- More efficient failure recovery strategies are an area for future work.

Knowledge Vault built by David Vivancos 2024