Concept Graph & Summary using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:
Summary:
1.- Training large language models is challenging due to memory limitations of current accelerators.
2.- Models with hundreds of billions of parameters don't fit in the memory of a single GPU.
3.- Training is computationally expensive, taking years on a single GPU for models like GPT-3.
4.- Scaling across multiple accelerators is necessary to make training feasible.
5.- Large batch sizes can require more tokens to achieve the same loss, limiting data parallelism scalability.
6.- Combining different parallelism dimensions is crucial for efficient large-scale training.
7.- Data parallelism involves replicating the model across devices and sharding the input data.
8.- Model parallelism splits a single model copy across multiple devices.
9.- Tensor model parallelism partitions individual layers or operators across devices.
10.- Pipeline model parallelism shards layers of the model across different devices.
11.- Data parallelism requires an expensive all-reduce of weight gradients across replicas after each iteration (see the data-parallel sketch after this list).
12.- Tensor model parallelism involves distributed matrix multiplications, requiring communication between devices (see the tensor-parallel sketch after this list).
13.- Pipeline parallelism can lead to idle periods (pipeline bubble) if not carefully managed.
14.- Micro-batching in pipeline parallelism reduces idle time by splitting each batch into smaller micro-batches that keep all stages busy (see the bubble calculation after this list).
15.- Combining parallelism dimensions (PTD parallelism) requires careful consideration of interactions between the modes (see the configuration sketch after this list).
16.- Increasing tensor model parallel size decreases pipeline bubble size but may increase cross-server communication.
17.- Optimal parallelism strategy depends on hardware configuration, like GPUs per server and inter-server communication speed.
18.- Different pipelining schedules can trade off pipeline bubble size for more communication.
19.- Factors like global batch size and micro-batch size affect communication, pipeline bubble, and memory footprint.
20.- Lower-level optimizations are necessary for good out-of-the-box performance.
21.- Custom kernels and PyTorch JIT help keep operators compute-bound rather than memory-bound.
22.- Efficient scaling was achieved for large models and GPU counts, reaching 52% of theoretical peak performance for a trillion-parameter model.
23.- Mixture of Experts (MoE) models have different weight-activation ratios, requiring different optimal parallelism strategies.
24.- Automating discovery of optimal parallelization strategies for arbitrary models and hardware is an open question.
25.- Open-source implementation (Megatron) available on GitHub with features like Flash Attention.
26.- MoE models have more weight parameters, shifting memory footprint from activations to weights.
27.- Expert parallelism in MoE models introduces new communication patterns (see the routing sketch after this list).
28.- Node failures are an issue at large scale, currently addressed by periodic checkpointing.
29.- Choosing the optimal checkpointing interval based on cluster failure rates is important (see the checkpointing sketch after this list).
30.- More efficient failure recovery strategies are an area for future work.
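The sketches below expand on several points above. They are illustrative Python/PyTorch fragments written under stated assumptions, not the speaker's code. First, the data-parallel sketch for points 7 and 11: each rank holds a full model replica and a shard of the batch, and weight gradients are averaged with an all-reduce after the backward pass (assuming torch.distributed is already initialized; model, loss_fn, and optimizer are placeholder names).

    import torch.distributed as dist

    def data_parallel_step(model, loss_fn, inputs, targets, optimizer):
        # Each rank holds a full replica of the model and a shard of the global batch.
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # Average weight gradients across all replicas after the backward pass;
        # this is the expensive all-reduce from point 11.
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size
        optimizer.step()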
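The tensor-parallel sketch for points 9 and 12: a row-parallel linear layer, loosely patterned after the approach used in Megatron but simplified here with hypothetical names. The weight is split along its input dimension so each rank computes a partial matrix product, and an all-reduce sums the partial results.

    import torch
    import torch.nn as nn
    import torch.distributed as dist

    class RowParallelLinear(nn.Module):
        # Each tensor-parallel rank owns a slice of the weight along the input dimension.
        def __init__(self, in_features, out_features):
            super().__init__()
            world_size = dist.get_world_size()
            assert in_features % world_size == 0
            self.weight = nn.Parameter(
                torch.empty(in_features // world_size, out_features))
            nn.init.normal_(self.weight, std=0.02)

        def forward(self, x_shard):
            # x_shard: this rank's slice of the input features.
            partial = x_shard @ self.weight          # local partial matmul
            # Sum partial results across tensor-parallel ranks
            # (the communication mentioned in point 12).
            dist.all_reduce(partial, op=dist.ReduceOp.SUM)
            return partial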
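The bubble calculation for points 13, 14, and 18: in the Megatron analysis, with p pipeline stages and m micro-batches per batch, idle time is roughly (p - 1)/m of the ideal compute time, so more micro-batches shrink the bubble (the numbers below are illustrative).

    def pipeline_bubble_fraction(pipeline_stages: int, num_microbatches: int) -> float:
        # Idle (bubble) time as a fraction of ideal compute time: (p - 1) / m.
        return (pipeline_stages - 1) / num_microbatches

    # 8 pipeline stages, varying numbers of micro-batches per batch.
    for m in (8, 32, 128):
        print(f"m={m:4d}  bubble fraction = {pipeline_bubble_fraction(8, m):.3f}")
    # Prints 0.875, 0.219, 0.055: larger m reduces the bubble (point 14),
    # though micro-batch size also affects memory footprint and communication (point 19).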
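The configuration sketch for points 15-19: the data-, tensor-, and pipeline-parallel degrees must multiply to the total GPU count, tensor parallelism is typically kept within one server to avoid slow cross-server links, and the micro-batch count follows from the global and micro-batch sizes. All numbers here are hypothetical.

    def ptd_configuration(total_gpus, gpus_per_server,
                          tensor_parallel, pipeline_parallel,
                          global_batch_size, micro_batch_size):
        # The three parallelism degrees must tile the whole cluster.
        assert total_gpus % (tensor_parallel * pipeline_parallel) == 0
        data_parallel = total_gpus // (tensor_parallel * pipeline_parallel)
        # Keep tensor parallelism within a server: intra-server links are much
        # faster than inter-server communication (points 16 and 17).
        assert tensor_parallel <= gpus_per_server
        # Micro-batches each pipeline processes per iteration (point 19).
        assert global_batch_size % (data_parallel * micro_batch_size) == 0
        num_microbatches = global_batch_size // (data_parallel * micro_batch_size)
        return data_parallel, num_microbatches

    # Hypothetical cluster: 3072 GPUs, 8 per server, tensor-parallel 8,
    # pipeline-parallel 12, global batch 1536, micro-batch 1.
    d, m = ptd_configuration(3072, 8, 8, 12, 1536, 1)
    print(d, m)   # 32 data-parallel replicas, 48 micro-batches each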
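The routing sketch for points 23, 26, and 27: a minimal top-1 gate that assigns each token to one expert. It runs on a single device and uses hypothetical shapes; with expert parallelism, tokens routed to the same expert would then be exchanged across devices (typically with an all-to-all), which is the new communication pattern point 27 refers to.

    import torch
    import torch.nn as nn

    class Top1Router(nn.Module):
        # Scores every token against each expert and keeps the best one.
        def __init__(self, hidden_size: int, num_experts: int):
            super().__init__()
            self.gate = nn.Linear(hidden_size, num_experts)

        def forward(self, tokens):                       # tokens: (n_tokens, hidden)
            scores = torch.softmax(self.gate(tokens), dim=-1)
            gate_prob, expert_id = scores.max(dim=-1)     # top-1 expert per token
            return expert_id, gate_prob

    router = Top1Router(hidden_size=1024, num_experts=8)
    expert_id, gate_prob = router(torch.randn(16, 1024))
    # Tokens sharing an expert_id would be gathered and sent to the device
    # holding that expert, processed there, and returned.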
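The checkpointing sketch for points 28-30: the talk only notes that the optimal interval depends on the cluster failure rate; the widely used Young/Daly approximation (an addition here, not stated in the talk) sets the interval from the checkpoint write time and the mean time between failures.

    import math

    def optimal_checkpoint_interval(checkpoint_seconds: float, mtbf_seconds: float) -> float:
        # Young/Daly first-order approximation: tau = sqrt(2 * C * MTBF).
        return math.sqrt(2.0 * checkpoint_seconds * mtbf_seconds)

    # Illustrative numbers: 5-minute checkpoint writes, one failure every 8 hours.
    tau = optimal_checkpoint_interval(300.0, 8 * 3600.0)
    print(f"checkpoint roughly every {tau / 60:.0f} minutes")   # about 69 minutes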
Knowledge Vault built by David Vivancos 2024