Riccardo Zecchina ICLR 2017 - Invited Talk - Learning by Local Entropy Maximization

**Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:**

graph LR
classDef physics fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef optimization fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef solutions fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef learning fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef entropy fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Riccardo Zecchina

ICLR 2017] --> B[Statistical physics and machine learning

share fundamental problems. 1] A --> C[Constraint satisfaction and optimization

problems' hardness studied. 2] C --> D[Boolean problems: solvable to

unsolvable phase transition. 3] D --> E[Solution space breaks symmetry

as constraints increase. 4] E --> F[New physics algorithms solve

problems in hard region. 5] A --> G[Neural networks' weight space

has exponential domains. 6] G --> H[Learning thought hard due

to isolated solutions. 7] H --> I[Special tools needed to

reveal rare clusters. 8] A --> J[Local entropy method introduced

to amplify rare regions. 9] J --> K['Flat minima' clusters exist

near critical capacity. 10] K --> L[Dense cluster solutions

generalize very well. 11] K --> M[Results generalize to

multiple layer networks. 12] A --> N[Successful algorithms avoid simply

minimizing loss. 13] N --> O[Local entropy-guided annealing

finds dense regions. 14] O --> P[Local entropy approximated by

coupling system replicas. 15] P --> Q[Replicated Markov chains, SGD,

belief propagation derived. 16] K --> R[Dense regions confirmed numerically

and by replica computation. 17] Q --> S[Replica method interprets why

elastic SGD works. 18] A --> T[Wide flat minima idea

not new, numerically confirmed. 19] A --> U[Out-of-equilibrium physics key

for non-convex learning. 20] U --> V[Opportunities to accelerate learning

exploiting dense geometry. 21] U --> W[Tools may enable unsupervised

Bayesian inference learning. 22] A --> X[Ongoing work: stochastic neural

networks analysis. 23] A --> Y[Gibbs measures concentrate on

typical narrow minima. 24] K --> Z[Wide flat minima robust

for non-trivial data. 25] class A,B,C,D,E,F,U,X,Y physics; class G,H,N,T optimization; class I,K,L,M,R,Z solutions; class J,O,P,Q,S,V,W entropy;

ICLR 2017] --> B[Statistical physics and machine learning

share fundamental problems. 1] A --> C[Constraint satisfaction and optimization

problems' hardness studied. 2] C --> D[Boolean problems: solvable to

unsolvable phase transition. 3] D --> E[Solution space breaks symmetry

as constraints increase. 4] E --> F[New physics algorithms solve

problems in hard region. 5] A --> G[Neural networks' weight space

has exponential domains. 6] G --> H[Learning thought hard due

to isolated solutions. 7] H --> I[Special tools needed to

reveal rare clusters. 8] A --> J[Local entropy method introduced

to amplify rare regions. 9] J --> K['Flat minima' clusters exist

near critical capacity. 10] K --> L[Dense cluster solutions

generalize very well. 11] K --> M[Results generalize to

multiple layer networks. 12] A --> N[Successful algorithms avoid simply

minimizing loss. 13] N --> O[Local entropy-guided annealing

finds dense regions. 14] O --> P[Local entropy approximated by

coupling system replicas. 15] P --> Q[Replicated Markov chains, SGD,

belief propagation derived. 16] K --> R[Dense regions confirmed numerically

and by replica computation. 17] Q --> S[Replica method interprets why

elastic SGD works. 18] A --> T[Wide flat minima idea

not new, numerically confirmed. 19] A --> U[Out-of-equilibrium physics key

for non-convex learning. 20] U --> V[Opportunities to accelerate learning

exploiting dense geometry. 21] U --> W[Tools may enable unsupervised

Bayesian inference learning. 22] A --> X[Ongoing work: stochastic neural

networks analysis. 23] A --> Y[Gibbs measures concentrate on

typical narrow minima. 24] K --> Z[Wide flat minima robust

for non-trivial data. 25] class A,B,C,D,E,F,U,X,Y physics; class G,H,N,T optimization; class I,K,L,M,R,Z solutions; class J,O,P,Q,S,V,W entropy;

**Resume: **

**1.-**Statistical physics and machine learning share fundamental problems. The talk discusses the geometrical structure of minima in non-convex optimization and learning problems.

**2.-**Whether constraint satisfaction and optimization problems from natural distributions are hard to solve was studied at the interface of computer science and physics.

**3.-**Boolean satisfiability problems undergo a phase transition from solvable to unsolvable as constraints increase relative to variables. Algorithms have exponential running time near the boundary.

**4.-**As constraints increase, the solution space breaks symmetry, going from a giant connected cluster to exponentially many small clusters and local minima.

**5.-**In the hard but solvable region, Markov chains get trapped, but new algorithms from statistical physics can solve the problems.

**6.-**In neural networks, the weight space is divided into exponentially many domains of different sizes. Some dominate the probability distribution (Gibbs measure).

**7.-**Learning in neural networks was thought to be hard due to isolated solutions, but algorithms were able to learn, contradicting analytical results.

**8.-**Subdominant solution clusters in neural networks are so rare that special analytical tools are needed to reveal them.

**9.-**The local entropy, a large deviation method, was introduced to amplify the weight of rare dense solution regions.

**10.-**Analytical calculations show very dense "flat minima" solution clusters exist in simple networks up to near critical capacity before disappearing.

**11.-**Solutions in the dense clusters generalize very well, almost as well as optimal Bayesian integration over all solutions.

**12.-**Results generalize to multiple layer networks. Dense solution regions are a structural property, not dependent on data.

**13.-**Successful learning algorithms avoid simply minimizing loss because the stationary distribution should focus on rare but dense solution regions, not typical solutions.

**14.-**A local entropy-guided simulated annealing algorithm was designed that can learn when standard simulated annealing fails by finding dense solution regions.

**15.-**The local entropy can be approximated by coupling replicas of the system to concentrate the measure on dense regions without explicitly computing entropy.

**16.-**Replicated Markov chains, stochastic gradient descent, and message-passing belief propagation algorithms can be derived that automatically focus on dense solution regions.

**17.-**The existence of dense solution regions was confirmed numerically for two-layer networks and analytically by a complicated replica computation.

**18.-**The replica method provides an interpretation of why elastic averaging SGD with momentum works - it samples from the robust ensemble distribution.

**19.-**The idea of wide flat minima is not new. Numerical experiments confirm the coexistence of wide and sharp minima in loss functions of deep networks.

**20.-**Out-of-equilibrium statistical physics of rare states and large deviation methods are key frameworks for understanding learning in non-convex problems.

**21.-**There are opportunities for accelerating learning by exploiting the geometry of dense states and using very low precision weights.

**22.-**The tools may enable unsupervised learning by utilizing dense regions for Bayesian inference.

**23.-**Ongoing work includes analysis of stochastic neural networks. Non-polarized weight distributions automatically end up in dense states.

**24.-**Gibbs measures concentrate on typical narrow minima in non-convex problems. Flat minima are described by the subdominant tail of the distribution.

**25.-**Calculating the volume of solution space requires considering particular data distributions, but wide flat minima seem to be a robust property for non-trivial datasets.

Knowledge Vault built byDavid Vivancos 2024