Knowledge Vault 6/66 - ICML 2021
Bayesian Learning via Stochastic Gradient Langevin Dynamics
Yee Whye Teh & Max Welling
< Resume Image >

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
  classDef sgld fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef theory fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef applications fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef challenges fill:#f9f9d4, font-weight:bold, font-size:14px
  A[Bayesian Learning via Stochastic Gradient Langevin Dynamics] --> B[SGLD Algorithm]
  A --> C[Theoretical Developments]
  A --> D[Applications and Extensions]
  A --> E[Challenges and Considerations]
  B --> B1[Stochastic gradient descent + Langevin dynamics. 1]
  B --> B2[Combines SGD with Gaussian noise. 6]
  B --> B3[Requires decreasing step size. 7]
  B --> B4[Switches from optimization to sampling. 9]
  B --> B5[Uses Fisher scoring or Riemannian geometry. 11]
  B --> B6[SGLD adapted to Hamiltonian Monte Carlo. 12]
  C --> C1[Convergence, bias-variance trade-offs. 10]
  C --> C2[Studies on SGLD convergence. 14]
  C --> C3[SGLD converges slower than MCMC. 15]
  C --> C4[Non-asymptotic bounds for non-convex settings. 18]
  C --> C5[Bounds based on mutual information. 19]
  C --> C6[Better performance with lower-temperature posteriors. 20]
  D --> D1[General recipe for stochastic gradient MCMC. 13]
  D --> D2[Stein Variational Gradient Descent. 27]
  D --> D3[Handling changing data distributions. 28]
  D --> D4[Transfer learning for multiple datasets. 29]
  D --> D5[Probabilistic ML importance. 30]
  D --> D6[Posterior characteristics, efficient sampling. 21]
  E --> E1[Developed during Bayesian ML peak. 2]
  E --> E2[Inefficient MCMC for large datasets. 3]
  E --> E3[Initial MCMC phase as optimization. 4]
  E --> E4[Noise improves neural network generalization. 5]
  E --> E5[Bias-variance trade-offs. 16]
  E --> E6[Challenge in incorporating domain knowledge. 22]
  class A,B,B1,B2,B3,B4,B5,B6 sgld
  class C,C1,C2,C3,C4,C5,C6 theory
  class D,D1,D2,D3,D4,D5,D6 applications
  class E,E1,E2,E3,E4,E5,E6 challenges

Resume:

1.- SGLD (Stochastic Gradient Langevin Dynamics): Algorithm combining stochastic gradient descent with Langevin dynamics for scalable Bayesian inference.

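For reference, the SGLD update as written in the original paper (up to notation), where N is the dataset size, n the minibatch size, and ε_t the step size at iteration t:

```latex
\Delta\theta_t = \frac{\varepsilon_t}{2}\Big(\nabla \log p(\theta_t)
  + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_{t_i} \mid \theta_t)\Big) + \eta_t,
\qquad \eta_t \sim \mathcal{N}(0, \varepsilon_t I)
```
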
2.- Historical context: SGLD developed during Bayesian machine learning's peak, as deep learning was emerging.

3.- Big data challenge: Traditional MCMC methods inefficient for large datasets, while SGD scales well.

4.- Burn-in as optimization: The initial (burn-in) phase of MCMC is essentially optimization, so spending exact, full-data accept-reject updates there wastes computation.

5.- Noise for generalization: Adding noise (e.g., dropout) improves generalization in neural networks.

6.- SGLD algorithm: Combines SGD with injected Gaussian noise, transitioning from optimization to sampling.

7.- Annealing step size: SGLD requires a step size that decreases over time in order to converge to the correct posterior distribution.

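A minimal NumPy sketch of the update in items 6-7, using the polynomial decay schedule ε_t = a(b + t)^(-γ) proposed in the paper; grad_log_prior, grad_log_lik, and the default parameter values are hypothetical placeholders (data is assumed to be a NumPy array of observations), not the authors' code:

```python
import numpy as np

def sgld_sample(grad_log_prior, grad_log_lik, data, theta0,
                a=1e-2, b=10.0, gamma=0.55, n_iters=10_000, batch_size=64,
                rng=np.random.default_rng(0)):
    """Sketch of SGLD: SGD on the log posterior plus injected Gaussian noise."""
    N = len(data)
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for t in range(n_iters):
        eps_t = a * (b + t) ** (-gamma)            # decreasing step size (item 7)
        batch = data[rng.choice(N, batch_size)]    # random minibatch
        # Stochastic estimate of the gradient of the log posterior.
        grad = grad_log_prior(theta) + (N / batch_size) * sum(
            grad_log_lik(theta, x) for x in batch)
        noise = rng.normal(0.0, np.sqrt(eps_t), size=theta.shape)  # injected noise (item 6)
        theta = theta + 0.5 * eps_t * grad + noise
        samples.append(theta.copy())
    return np.array(samples)
```
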
8.- Metropolis-Hastings step: As the step size decreases, the acceptance probability approaches 1, so the accept-reject step can be skipped.

9.- Automatic transition: SGLD naturally switches from optimization to sampling once the injected noise (standard deviation of order √ε_t) dominates the stochastic-gradient noise, whose contribution enters at order ε_t and therefore shrinks faster.

10.- Theoretical developments: Subsequent work analyzed SGLD's convergence, bias-variance trade-offs, and relationships to continuous-time processes.

11.- Preconditioned SGLD: Extensions using Fisher scoring or Riemannian geometry to improve sampling efficiency.

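One concrete instance of this idea is pSGLD (Li et al., 2016), which uses an RMSProp-style diagonal preconditioner; the sketch below is only an illustration of preconditioning, not the full Riemannian construction, and it omits the small curvature correction term from the paper:

```python
import numpy as np

def psgld_step(theta, stoch_grad, state, eps=1e-3, alpha=0.99, lam=1e-5,
               rng=np.random.default_rng()):
    """One pSGLD step with an RMSProp-style diagonal preconditioner (illustrative)."""
    g = stoch_grad(theta)                          # stochastic gradient of log posterior
    state = alpha * state + (1 - alpha) * g ** 2   # running average of squared gradients
    G = 1.0 / (lam + np.sqrt(state))               # diagonal preconditioner
    noise = rng.normal(size=theta.shape) * np.sqrt(eps * G)
    theta = theta + 0.5 * eps * G * g + noise      # correction term Gamma(theta) omitted
    return theta, state
```
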
12.- Stochastic Gradient Hamiltonian Monte Carlo: Adaptation of Hamiltonian Monte Carlo to stochastic gradients, adding momentum and friction for better exploration.

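A rough sketch of the SGHMC update (Chen, Fox & Guestrin, 2014) with momentum v and friction coefficient alpha; the estimate of the gradient-noise variance from the paper is omitted for brevity:

```python
import numpy as np

def sghmc_step(theta, v, stoch_grad, eps=1e-3, alpha=0.1,
               rng=np.random.default_rng()):
    """One SGHMC step: momentum plus friction, with matched injected noise (illustrative)."""
    g = stoch_grad(theta)                                    # stochastic gradient of log posterior
    noise = rng.normal(size=theta.shape) * np.sqrt(2 * alpha * eps)
    v = (1 - alpha) * v + eps * g + noise                    # friction balances the injected noise
    theta = theta + v
    return theta, v
```
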
13.- Unified framework: Ma et al. provided a general recipe for stochastic gradient MCMC algorithms.

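The recipe in Ma, Chen & Fox (2015) writes every such sampler as a diffusion built from a positive semi-definite matrix D(z) and a skew-symmetric matrix Q(z), with H(z) the negative log posterior (plus kinetic terms for any auxiliary variables); stated here up to notation:

```latex
\mathrm{d}z = \big[-(D(z) + Q(z))\,\nabla H(z) + \Gamma(z)\big]\,\mathrm{d}t
  + \sqrt{2 D(z)}\,\mathrm{d}W_t,
\qquad \Gamma_i(z) = \sum_j \frac{\partial}{\partial z_j}\big(D_{ij}(z) + Q_{ij}(z)\big)
```
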
14.- Convergence analysis: Studies on weak vs. strong convergence, consistency, and central limit theorems for SGLD.

15.- Convergence rate: SGLD's estimates converge at an m^(-1/3) rate in the number of iterations m, slower than standard MCMC's m^(-1/2) rate.

16.- Fixed step size analysis: Investigating trade-offs between bias and variance with constant step size.

17.- Wasserstein distance: Used to bound SGLD's deviation from its continuous-time diffusion limit and to derive excess-risk bounds.

18.- Excess risk bounds: Non-asymptotic bounds derived for SGLD in non-convex settings.

19.- Generalization error: Bounds based on mutual information between dataset and SGLD iterates.

20.- Cold posteriors: Theoretical and practical evidence suggesting better performance with lower-temperature posteriors.

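Concretely, the cold-posterior effect refers to sampling from a tempered posterior with temperature T < 1, which sharpens the distribution around its modes (in SGLD this amounts to scaling the log-posterior gradient by 1/T):

```latex
p_T(\theta \mid \mathcal{D}) \;\propto\; p(\theta \mid \mathcal{D})^{1/T}, \qquad T < 1
```
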
21.- Bayesian deep learning: Growing research area with open questions about posterior characteristics and efficient sampling methods.

22.- Prior specification: Challenge of incorporating meaningful domain knowledge as priors in Bayesian deep learning.

23.- Simplicity and generalizability: SGLD's success attributed to its simple implementation and room for extensions.

24.- Theoretical analysis by others: Mathematical community provided rigorous analysis post-publication.

25.- Data-dependent bounds: Recent work produces bounds that depend on specific dataset characteristics.

26.- Tall data algorithms: Developments in MCMC for datasets with many samples but low dimensionality.

27.- Stein Variational Gradient Descent: Alternative method using deterministic particle interactions for posterior approximation.

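A minimal sketch of one SVGD update (Liu & Wang, 2016) with an RBF kernel; grad_log_p is a hypothetical user-supplied score function, and the fixed bandwidth stands in for the median heuristic used in practice:

```python
import numpy as np

def svgd_step(particles, grad_log_p, step=1e-2, bandwidth=1.0):
    """One SVGD update: particles are driven toward high density and repel each other."""
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]        # pairwise differences x_i - x_j
    sq_dists = np.sum(diffs ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))                 # RBF kernel matrix
    grads = np.stack([grad_log_p(x) for x in particles])         # score at each particle
    # Driving term: kernel-weighted scores; repulsive term: gradient of the kernel.
    phi = (K @ grads + np.sum(K[:, :, None] * diffs / bandwidth ** 2, axis=1)) / n
    return particles + step * phi
```
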
28.- Online and adaptive SGLD: Potential extensions for handling changing data distributions over time.

29.- Hierarchical Bayesian modeling: Approach for transfer learning and relating multiple datasets or tasks.

30.- Inference algorithms' importance: Crucial for probabilistic machine learning, especially with latent variable models.

Knowledge Vault built by David Vivancos 2024