Knowledge Vault 6/70 - ICML 2021
Optimal Complexity in Decentralized Training
Yucheng Lu · Christopher De Sa
< Resume Image >

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
  classDef decentralized fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef optimization fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef algorithms fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef experiments fill:#f9f9d4, font-weight:bold, font-size:14px
  A[Optimal Complexity in Decentralized Training] --> B[Decentralized ML]
  A --> C[Optimization Challenges]
  A --> D[Algorithms and Methods]
  A --> E[Experiments and Performance]
  B --> B1[Training models across distributed systems. 1]
  B --> B2[Categorizing decentralized ML systems. 2]
  B --> B3[Privacy-preserving training on edge devices. 7]
  B --> B4[Decentralized communication between workers. 8]
  B --> B5[Various network structures for workers. 9]
  B --> B6[Weighted averaging in gossip protocols. 15]
  C --> C1[Optimal convergence rates for algorithms. 3]
  C --> C2[Challenges in optimizing complex models. 11]
  C --> C3[Random sampling for efficient training. 12]
  C --> C4[Different data distributions across workers. 13]
  C --> C5[Complexity related to gradient computations. 18]
  C --> C6[Complexity in information exchange. 19]
  D --> D1[Best possible convergence rates. 4]
  D --> D2[Synchronized updates across workers. 6]
  D --> D3[Update model coordinates via gradients. 14]
  D --> D4[Separating computation and communication phases. 20]
  D --> D5[Optimal algorithm using graph factorization. 21]
  D --> D6[Uses accelerated gossip and gradient tracking. 22]
  E --> E1[Improves convergence rate. 23]
  E --> E2[Estimates global gradients. 24]
  E --> E3[Iterations for algorithm convergence. 26]
  E --> E4[Performance evaluation on image classification. 27]
  E --> E5[Stability and convergence tests. 28]
  E --> E6[Efficiency in decentralized training. 30]
  class A,B,B1,B2,B3,B4,B5,B6 decentralized
  class C,C1,C2,C3,C4,C5,C6 optimization
  class D,D1,D2,D3,D4,D5,D6 algorithms
  class E,E1,E2,E3,E4,E5,E6 experiments

Resume:

1.- Decentralized machine learning: Training models across distributed systems without centralized control.

2.- Layered taxonomy: Categorizing decentralized ML systems into application, protocol, and topology layers.

3.- Theoretical limits: Exploring the optimal convergence rates for decentralized algorithms.

4.- Optimal decentralized algorithms: Introducing methods that achieve the best possible convergence rates.

5.- Data parallelism: Distributing data across multiple workers for parallel processing.

6.- Centralized training: Using methods like all-reduce for synchronized updates across all workers.

7.- Federated learning: Keeping data on edge devices to preserve privacy while training models.

8.- Gossip protocol: A decentralized communication method for information exchange between workers.

9.- Arbitrary graph topology: Connecting workers in various network structures beyond fully connected graphs.

10.- Empirical risk minimization: Formulating model training as minimizing the average loss over the workers' data (the standard decentralized formulation is written out after this list).

11.- Non-convex optimization: Addressing challenges in optimizing complex models like deep neural networks.

12.- Stochastic algorithms: Using random sampling methods like SGD for efficient training (a decentralized SGD sketch appears after this list).

13.- Data heterogeneity: Dealing with differences in data distribution across workers.

14.- Zero-respecting algorithms: Methods that only update model coordinates through gradients or communication.

15.- Gossip matrix: A fixed, doubly stochastic matrix whose entries weight the averaging between neighboring workers in gossip protocols (see the gossip sketch after this list).

16.- Spectral gap: The gap between the gossip matrix's largest eigenvalue (1 for a doubly stochastic matrix) and its second-largest eigenvalue magnitude; it governs how quickly gossip averaging mixes.

17.- Lower bounds: Theoretical minimum complexity for decentralized training algorithms.

18.- Sampling complexity: The component of algorithmic complexity related to gradient computations.

19.- Communication complexity: The component of algorithmic complexity related to information exchange between workers.

20.- Biphase communication: A paradigm separating computation and communication phases for improved consistency.

21.- DeFacto algorithm: An optimal decentralized algorithm using graph factorization techniques.

22.- DeTAG algorithm: An optimal decentralized algorithm using accelerated gossip and gradient tracking.

23.- Accelerated gossip: A multi-round averaging technique that improves the convergence rate of gossip protocols (sketched after this list).

24.- Gradient tracking: A method for each worker to estimate the global average gradient in decentralized settings (sketched after this list).

25.- Mixing time: The number of steps a Markov chain needs to approach its stationary distribution; for gossip averaging, it scales with the inverse of the spectral gap.

26.- Iteration complexity: The number of iterations required for an algorithm to converge.

27.- CIFAR-10 experiment: Evaluating algorithm performance on image classification with different data shuffling strategies.

28.- ResNet on CIFAR-100: Testing algorithm stability and convergence under various spectral gap conditions.

29.- Optimal phase length: The ideal duration of communication phases in biphase communication.

30.- Throughput preservation: Maintaining computational efficiency while improving consistency in decentralized training.
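
Code sketches (illustrative, not from the paper):

Item 10 poses training as empirical risk minimization. In the notation commonly used for this setting (n workers, local data distributions D_i; these symbols are generic, not necessarily the paper's exact ones), the workers jointly solve

    \min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad f_i(x) = \mathbb{E}_{\xi \sim \mathcal{D}_i}\big[F_i(x;\xi)\big],

where worker i can only sample from its own distribution D_i; data heterogeneity (item 13) means the D_i may differ across workers.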
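
Item 12 refers to stochastic methods like SGD. Below is a minimal numpy sketch of plain decentralized SGD (a D-PSGD-style baseline, not the paper's DeFacto or DeTAG): each worker takes a stochastic gradient step on its own data shard, then gossip-averages its model copy through the gossip matrix W. All names here (decentralized_sgd, lsq_grad, shards) are illustrative.

import numpy as np

def decentralized_sgd(W, shards, grad_fn, d, lr=0.1, steps=200):
    # W: n-by-n doubly stochastic gossip matrix; shards: one data shard per worker.
    n = W.shape[0]
    X = np.zeros((n, d))                       # one model copy per worker (rows)
    for _ in range(steps):
        G = np.stack([grad_fn(X[i], shards[i]) for i in range(n)])
        X = W @ (X - lr * G)                   # local gradient step, then gossip averaging
    return X

def lsq_grad(x, shard, batch=8):
    # Toy stochastic gradient: least squares on a random minibatch of this worker's (A, b).
    A, b = shard
    idx = np.random.randint(0, len(b), size=batch)
    return A[idx].T @ (A[idx] @ x - b[idx]) / batch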
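
Items 8, 15, and 16 describe gossip averaging over a fixed gossip matrix and its spectral gap. The sketch below (a toy example, not the paper's code) builds a doubly stochastic gossip matrix for a ring of workers, computes the spectral gap, and runs gossip rounds until every worker's value is close to the global average; the number of rounds this takes is essentially the mixing time of item 25.

import numpy as np

n = 8
W = np.zeros((n, n))                           # doubly stochastic gossip matrix for a ring
for i in range(n):
    W[i, i] = 0.5                              # keep half of your own value...
    W[i, (i - 1) % n] = 0.25                   # ...and average the rest with the
    W[i, (i + 1) % n] = 0.25                   # two ring neighbors

eig = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
gap = 1.0 - eig[1]                             # spectral gap: 1 minus second-largest |eigenvalue|

x = np.random.randn(n)                         # one local scalar per worker
avg = x.mean()
for _ in range(100):
    x = W @ x                                  # one gossip round: weighted average with neighbors
print(f"spectral gap {gap:.3f}, max distance to average {np.abs(x - avg).max():.2e}")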
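
Item 23's accelerated gossip can be illustrated with a Chebyshev/heavy-ball style two-term recursion that reuses the previous iterate. This is a generic acceleration sketch with a fixed, hand-picked momentum weight eta, not the paper's exact scheme or tuning.

import numpy as np

def accelerated_gossip(W, x0, rounds, eta=0.9):
    # x_{t+1} = (1 + eta) * W x_t - eta * x_{t-1}; the global average is preserved,
    # and consensus is reached in fewer rounds than plain x <- W x when the
    # spectral gap of W is small. eta = 0.9 is illustrative, not tuned.
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(rounds):
        x, x_prev = (1 + eta) * (W @ x) - eta * x_prev, x
    return x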
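
Items 22 and 24 involve gradient tracking, which DeTAG combines with accelerated gossip. The hypothetical sketch below shows the core tracking recursion Y <- W Y + G_new - G_old: with a doubly stochastic W, the trackers always sum to the sum of the latest local gradients, so each row of Y estimates the global average gradient.

import numpy as np

def gradient_tracking_step(W, X, Y, G_old, grad_fn, shards, lr):
    # X: per-worker model copies (one row each); Y: per-worker gradient trackers.
    n = len(shards)
    X = W @ X - lr * Y                         # step along the tracked gradient, then mix
    G_new = np.stack([grad_fn(X[i], shards[i]) for i in range(n)])
    Y = W @ Y + G_new - G_old                  # keeps sum(Y) equal to sum(G_new)
    return X, Y, G_new

# Typical initialization (sketch): X = zeros, G_old = local gradients at X, Y = G_old.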

Knowledge Vault built by David Vivancos 2024