Bayesian Learning via Stochastic Gradient Langevin Dynamics
Yee Teh & Max Welling
1.- SGLD (Stochastic Gradient Langevin Dynamics): Algorithm combining stochastic gradient descent with Langevin dynamics for scalable Bayesian inference.

2.- Historical context: SGLD developed during Bayesian machine learning's peak, as deep learning was emerging.

3.- Big data challenge: Traditional MCMC methods are inefficient for large datasets because each iteration needs a pass over all the data, while SGD scales well by using mini-batches.

4.- Burn-in as optimization: The initial phase of MCMC is essentially optimization, so computing exact, full-data updates during it wastes resources.

5.- Noise for generalization: Adding noise (e.g., dropout) improves generalization in neural networks.

6.- SGLD algorithm: Combines SGD with injected Gaussian noise, transitioning from optimization to sampling.

7.- Annealing step size: SGLD requires a step size that decreases over time (step sizes sum to infinity, their squares sum to a finite value) to converge to the correct distribution.

8.- Metropolis-Hastings step: As the step size decreases, the acceptance probability approaches 1, so the accept-reject step can be skipped.

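9.- Automatic transition: SGLD naturally switches from optimization to sampling once the injected noise (variance proportional to the step size) dominates the stochastic gradient noise (variance proportional to the squared step size).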

10.- Theoretical developments: Subsequent work analyzed SGLD's convergence, bias-variance trade-offs, and relationships to continuous-time processes.

11.- Preconditioned SGLD: Extensions using Fisher scoring or Riemannian geometry to improve sampling efficiency.

12.- Stochastic Gradient Hamiltonian Monte Carlo: Adaptation of SGLD to Hamiltonian Monte Carlo for better exploration.

13.- Unified framework: Ma et al. provided a general recipe for stochastic gradient MCMC algorithms.

14.- Convergence analysis: Studies on weak vs. strong convergence, consistency, and central limit theorems for SGLD.

15.- Convergence rate: With decreasing step sizes, SGLD converges at an m^(-1/3) rate in the number of iterations m, slower than standard MCMC's m^(-1/2) rate.

16.- Fixed step size analysis: Investigating trade-offs between bias and variance with constant step size.

17.- Wasserstein distance: Used to bound convergence of SGLD to diffusion limit and excess risk.

18.- Excess risk bounds: Non-asymptotic bounds derived for SGLD in non-convex settings.

19.- Generalization error: Bounds based on mutual information between dataset and SGLD iterates.

20.- Cold posteriors: Theoretical and practical evidence suggesting better performance with lower-temperature posteriors.

21.- Bayesian deep learning: Growing research area with open questions about posterior characteristics and efficient sampling methods.

22.- Prior specification: Challenge of incorporating meaningful domain knowledge as priors in Bayesian deep learning.

23.- Simplicity and generalizability: SGLD's success attributed to its simple implementation and room for extensions.

24.- Theoretical analysis by others: Mathematical community provided rigorous analysis post-publication.

25.- Data-dependent bounds: Recent work produces bounds that depend on specific dataset characteristics.

26.- Tall data algorithms: Developments in MCMC for datasets with many data points but low dimensionality.

27.- Stein Variational Gradient Descent: Alternative method using deterministic particle interactions for posterior approximation.

28.- Online and adaptive SGLD: Potential extensions for handling changing data distributions over time.

29.- Hierarchical Bayesian modeling: Approach for transfer learning and relating multiple datasets or tasks.

30.- Inference algorithms' importance: Crucial for probabilistic machine learning, especially with latent variable models.
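
The update described in points 6 to 9 can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a one-dimensional model with unit-variance Gaussian observations and a Gaussian prior on the mean, so the exact posterior is available for comparison. The function names (`sgld_gaussian_mean`, `exact_posterior`), the step-size schedule constants `a`, `b`, `gamma`, and the batch size are all hypothetical choices made for this example.

```python
import numpy as np


def sgld_gaussian_mean(x, prior_var=100.0, batch_size=50, n_steps=5000,
                       a=3.5e-3, b=10.0, gamma=0.55, seed=0):
    """SGLD sketch for the mean of unit-variance Gaussian data.

    Assumed toy model (not from the talk):
    theta ~ N(0, prior_var), x_i ~ N(theta, 1).
    """
    rng = np.random.default_rng(seed)
    n_data = x.shape[0]
    theta = 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        # Polynomially decaying step size eps_t = a * (b + t)^(-gamma),
        # with gamma in (0.5, 1] so that sum(eps) diverges and sum(eps^2) converges.
        eps = a * (b + t) ** (-gamma)
        # Mini-batch drawn without replacement from the full dataset.
        batch = x[rng.choice(n_data, size=batch_size, replace=False)]
        grad_log_prior = -theta / prior_var
        # Rescale the mini-batch gradient so it is an unbiased estimate
        # of the full-data log-likelihood gradient.
        grad_log_lik = (n_data / batch_size) * np.sum(batch - theta)
        # Half-step along the stochastic gradient plus injected N(0, eps) noise.
        # No accept-reject step is applied.
        noise = rng.normal(0.0, np.sqrt(eps))
        theta = theta + 0.5 * eps * (grad_log_prior + grad_log_lik) + noise
        samples[t] = theta
    return samples


def exact_posterior(x, prior_var=100.0):
    """Closed-form posterior mean and variance for the same toy model."""
    precision = 1.0 / prior_var + x.shape[0]
    return np.sum(x) / precision, 1.0 / precision


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(2.0, 1.0, size=1000)
    draws = sgld_gaussian_mean(data)[1000:]  # discard the optimization-like phase
    post_mean, post_var = exact_posterior(data)
    print("SGLD mean:", draws.mean(), "exact posterior mean:", post_mean)
    print("SGLD std: ", draws.std(), "exact posterior std: ", np.sqrt(post_var))
```

Early iterations, where the step size is largest, behave like noisy SGD and move quickly toward the mode. Later iterations, where the injected noise dominates, fluctuate around it at roughly the posterior scale. Because the step size in this sketch never reaches zero within the run, the spread of the draws is expected to be somewhat wider than the exact posterior.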

