Knowledge Vault 2/34 - ICLR 2014-2023
Riccardo Zecchina ICLR 2017 - Invited Talk - Learning by Local Entropy Maximization

Concept Graph & Summary using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:

```mermaid
graph LR
    classDef physics fill:#f9d4d4, font-weight:bold, font-size:14px;
    classDef optimization fill:#d4f9d4, font-weight:bold, font-size:14px;
    classDef solutions fill:#d4d4f9, font-weight:bold, font-size:14px;
    classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px;
    classDef entropy fill:#f9d4f9, font-weight:bold, font-size:14px;
    A[Riccardo Zecchina<br>ICLR 2017] --> B[Statistical physics and machine learning<br>share fundamental problems. 1]
    A --> C[Constraint satisfaction and optimization<br>problems' hardness studied. 2]
    C --> D["Boolean problems: solvable to<br>unsolvable phase transition. 3"]
    D --> E[Solution space breaks symmetry<br>as constraints increase. 4]
    E --> F[New physics algorithms solve<br>problems in hard region. 5]
    A --> G[Neural networks' weight space<br>has exponential domains. 6]
    G --> H[Learning thought hard due<br>to isolated solutions. 7]
    H --> I[Special tools needed to<br>reveal rare clusters. 8]
    A --> J[Local entropy method introduced<br>to amplify rare regions. 9]
    J --> K['Flat minima' clusters exist<br>near critical capacity. 10]
    K --> L[Dense cluster solutions<br>generalize very well. 11]
    K --> M[Results generalize to<br>multiple layer networks. 12]
    A --> N[Successful algorithms avoid simply<br>minimizing loss. 13]
    N --> O[Local entropy-guided annealing<br>finds dense regions. 14]
    O --> P[Local entropy approximated by<br>coupling system replicas. 15]
    P --> Q[Replicated Markov chains, SGD,<br>belief propagation derived. 16]
    K --> R[Dense regions confirmed numerically<br>and by replica computation. 17]
    Q --> S[Replica method interprets why<br>elastic SGD works. 18]
    A --> T[Wide flat minima idea<br>not new, numerically confirmed. 19]
    A --> U[Out-of-equilibrium physics key<br>for non-convex learning. 20]
    U --> V[Opportunities to accelerate learning<br>exploiting dense geometry. 21]
    U --> W[Tools may enable unsupervised<br>Bayesian inference learning. 22]
    A --> X["Ongoing work: stochastic neural<br>networks analysis. 23"]
    A --> Y[Gibbs measures concentrate on<br>typical narrow minima. 24]
    K --> Z[Wide flat minima robust<br>for non-trivial data. 25]
    class A,B,C,D,E,F,U,X,Y physics;
    class G,H,N,T optimization;
    class I,K,L,M,R,Z solutions;
    class J,O,P,Q,S,V,W entropy;
```


1.-Statistical physics and machine learning share fundamental problems. The talk discusses the geometrical structure of minima in non-convex optimization and learning problems.

2.-The hardness of constraint satisfaction and optimization problems drawn from natural random distributions has been studied at the interface of computer science and statistical physics.

3.-Boolean satisfiability problems undergo a phase transition from solvable to unsolvable as the ratio of constraints to variables increases. Known algorithms take exponential running time near the transition boundary.
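
A minimal numerical sketch of this transition (toy parameters assumed; brute-force satisfiability checking is only feasible at very small n, and the asymptotic 3-SAT threshold sits near α ≈ 4.27):

```python
import itertools
import random

def random_3sat(n, m, rng):
    """Random 3-SAT instance: m clauses over n variables; a literal is (var, sign)."""
    return [[(v, rng.choice([1, -1])) for v in rng.sample(range(n), 3)]
            for _ in range(m)]

def satisfiable(clauses, n):
    """Brute-force satisfiability check: only viable for very small n."""
    return any(
        all(any((bits[v] == 1) == (s == 1) for v, s in c) for c in clauses)
        for bits in itertools.product([0, 1], repeat=n)
    )

rng = random.Random(0)
n, trials = 12, 20
results = {}
for alpha in (2.0, 4.27, 7.0):  # clause-to-variable ratios below, near, above the threshold
    m = int(alpha * n)
    results[alpha] = sum(satisfiable(random_3sat(n, m, rng), n)
                         for _ in range(trials)) / trials
    print(f"alpha={alpha}: P(sat) ~ {results[alpha]:.2f}")
```

At these sizes the transition is smeared by finite-size effects, but the drop in satisfiability as α crosses the threshold is already visible.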

4.-As constraints increase, the solution space breaks symmetry, going from a giant connected cluster to exponentially many small clusters and local minima.

5.-In the hard but still solvable region, Markov chain sampling gets trapped, but new algorithms derived from statistical physics, such as survey propagation, can solve the problems.

6.-In neural networks, the weight space is divided into exponentially many domains of different sizes. Some dominate the probability distribution (Gibbs measure).
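
The dominating measure here is the Gibbs (Boltzmann) distribution over weights, with the training error E(w) playing the role of an energy (standard notation, assumed):

```latex
P_\beta(w) \;=\; \frac{e^{-\beta E(w)}}{Z_\beta},
\qquad
Z_\beta \;=\; \sum_{w} e^{-\beta E(w)}
```

As \beta \to \infty this measure concentrates on the minimizers of E, and the domains that dominate Z_\beta are the typical ones, which in this setting turn out to be narrow.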

7.-Learning in neural networks was thought to be hard due to isolated solutions, but algorithms were able to learn, contradicting analytical results.

8.-Subdominant solution clusters in neural networks are so rare that special analytical tools are needed to reveal them.

9.-The local entropy, a large deviation method, was introduced to amplify the weight of rare dense solution regions.
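
Schematically, with E(w) the training energy, d(·,·) a distance on weight space (Hamming distance for binary weights), and \gamma setting the neighborhood scale (notation assumed), the local entropy of a reference configuration \tilde{w} is:

```latex
S_\gamma(\tilde{w}) \;=\; \log \sum_{w} e^{-\beta E(w) \,-\, \gamma\, d(w, \tilde{w})}
```

Reweighting the measure as P(\tilde{w}) \propto e^{\,y\,S_\gamma(\tilde{w})} with a large parameter y amplifies configurations surrounded by many low-energy neighbors, i.e. exactly the rare dense regions.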

10.-Analytical calculations show very dense "flat minima" solution clusters exist in simple networks up to near critical capacity before disappearing.

11.-Solutions in the dense clusters generalize very well, almost as well as optimal Bayesian integration over all solutions.

12.-Results generalize to multiple layer networks. Dense solution regions are a structural property, not dependent on data.

13.-Successful learning algorithms avoid simply minimizing loss because the stationary distribution should focus on rare but dense solution regions, not typical solutions.

14.-A local entropy-guided simulated annealing algorithm was designed that can learn when standard simulated annealing fails by finding dense solution regions.
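
A toy sketch of the idea: replicated Metropolis annealing with an attractive inter-replica coupling, in the spirit of the talk. The binary-perceptron energy, problem sizes, and schedules below are illustrative assumptions, not the talk's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, y = 21, 10, 3                      # weights, patterns, replicas (toy sizes)
xi = rng.choice([-1, 1], size=(P, N))    # random binary input patterns
labels = rng.choice([-1, 1], size=P)

def energy(w):
    """Number of misclassified patterns (binary perceptron energy)."""
    return int(np.sum(np.sign(xi @ w) != labels))

W = rng.choice([-1, 1], size=(y, N))     # one row of binary weights per replica
best = min(energy(W[a]) for a in range(y))
E0 = best
beta, gamma = 0.5, 0.1
for sweep in range(400):
    beta *= 1.03                          # annealing: lower the temperature
    gamma *= 1.03                         # "scoping": stiffen the replica coupling
    for _ in range(y * N):
        a, i = rng.integers(y), rng.integers(N)
        others = W[:, i].sum() - W[a, i]  # same spin in the other replicas
        e_old = energy(W[a])
        W[a, i] *= -1                     # propose a single spin flip
        # misclassification change plus change of the coupling term -gamma*w_ai*others
        dE = (energy(W[a]) - e_old) + 2.0 * gamma * (-W[a, i]) * others
        if dE > 0 and rng.random() >= np.exp(-beta * dE):
            W[a, i] *= -1                 # reject: undo the flip
        else:
            best = min(best, energy(W[a]))
print(f"initial best energy {E0}, final best energy {best}")
```

The coupling biases accepted moves toward configurations that the other replicas also find low-energy, steering the chain into dense regions rather than isolated minima.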

15.-The local entropy can be approximated by coupling replicas of the system to concentrate the measure on dense regions without explicitly computing entropy.

16.-Replicated Markov chains, stochastic gradient descent, and message-passing belief propagation algorithms can be derived that automatically focus on dense solution regions.
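
A continuous 1-D caricature of replicated gradient descent (the double-well loss and all parameters below are hand-built assumptions): a sharp deep minimum competes with a wide, slightly shallower one, and the coupled replicas collectively settle in the wide basin.

```python
import numpy as np

def loss(x):
    # toy landscape: sharp deep well at x = +2, wide shallower well at x = -2
    return (-1.0 * np.exp(-(x - 2.0) ** 2 / (2 * 0.05 ** 2))
            - 0.8 * np.exp(-(x + 2.0) ** 2 / (2 * 1.0 ** 2)))

def grad(x, eps=1e-5):
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)  # numerical gradient

x = np.linspace(-4.0, 4.0, 21)                # 21 replicas spread over the line
eta = 0.05
for step in range(4000):
    gamma = min(0.01 * 1.002 ** step, 5.0)    # slowly increase the coupling ("scoping")
    center = x.mean()
    x = x - eta * (grad(x) + gamma * (x - center))
print(f"replica center: {x.mean():.2f}")      # expected to settle in the wide basin (x < 0)
```

The sharp minimum is both dynamically unstable at this step size and penalized by the coupling, so the replicas' center drifts into the wide basin even though that minimum is shallower.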

17.-The existence of dense solution regions was confirmed numerically for two-layer networks and analytically by a complicated replica computation.

18.-The replica method provides an interpretation of why elastic averaging SGD with momentum works: it samples from the robust ensemble distribution.
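
The updates being interpreted are (in the notation of Zhang, Choromanska & LeCun's EASGD, with symbols assumed: x^i the worker replicas, \tilde{x} the center variable, \rho the elastic coupling):

```latex
x^{i}_{t+1} = x^{i}_{t} - \eta\left(\nabla f_i(x^{i}_{t}) + \rho\,(x^{i}_{t} - \tilde{x}_{t})\right),
\qquad
\tilde{x}_{t+1} = \tilde{x}_{t} + \eta\,\rho \sum_{i=1}^{y} (x^{i}_{t} - \tilde{x}_{t})
```

In the replica reading, the center \tilde{x} effectively samples the robust-ensemble measure, which concentrates on wide, dense minima.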

19.-The idea of wide flat minima is not new. Numerical experiments confirm the coexistence of wide and sharp minima in loss functions of deep networks.

20.-Out-of-equilibrium statistical physics of rare states and large deviation methods are key frameworks for understanding learning in non-convex problems.

21.-There are opportunities for accelerating learning by exploiting the geometry of dense states and using very low precision weights.

22.-The tools may enable unsupervised learning by utilizing dense regions for Bayesian inference.

23.-Ongoing work includes analysis of stochastic neural networks. Non-polarized weight distributions automatically end up in dense states.

24.-Gibbs measures concentrate on typical narrow minima in non-convex problems. Flat minima are described by the subdominant tail of the distribution.

25.-Calculating the volume of solution space requires considering particular data distributions, but wide flat minima seem to be a robust property for non-trivial datasets.

Knowledge Vault built by David Vivancos 2024