Pierre Baldi ICLR 2015 - Keynote - The Ebb and Flow of Deep Learning: a Theory of Local Learning

**Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Adv | Llama 3:**

```mermaid
graph LR
classDef rules fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef framework fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef learning fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef capacity fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef theory fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Pierre Baldi<br>ICLR 2015] --> B[Local rules adjust weights<br>using synaptic variables. 1]
A --> C[Framework defines variables,<br>functional form. 2]
C --> D[Polynomial rules analyzed in<br>linear, non-linear networks. 3]
C --> E[Framework discovers rules,<br>reveals group symmetries. 4]
A --> F[Deep local learning stacks<br>rules, learns representations. 5]
F --> G[Complex functions need<br>propagated target information. 6]
G --> H[Target propagation partitions<br>learning algorithms. 7]
A --> I[Feedback channel capacity:<br>bits/weight over operations/weight. 8]
I --> J[Backpropagation outperforms,<br>achieves maximum capacity. 9]
A --> K[Theory clarifies Hebbian learning,<br>sparsity of rules. 10]
K --> L[Replace Hebbian with local<br>variables, functional form. 11]
A --> M[Linear networks: weight changes<br>depend on data moments. 12]
M --> N[Linear recurrence solved<br>exactly in linear networks. 13]
M --> O[Non-linear networks estimated<br>by dropout, Taylor expansions. 14]
M --> P[Local rules often diverge<br>in linear networks. 15]
A --> Q[Single linear threshold limited<br>to linearly separable functions. 16]
A --> R[Deep local learning can't<br>find error function minima. 17]
R --> S[Complex deep learning needs<br>target feedback to weights. 18]
S --> T[Optimal deep weights depend<br>on inputs and targets. 19]
S --> U[Optimal deep learning requires<br>feedback channel physically. 20]
U --> V[Feedback uses forward<br>or separate backward connections. 21]
U --> W[Backpropagation optimal,<br>highest capacity feedback. 22]
A --> X[Has evolution discovered<br>stochastic gradient descent? 23]
A --> Y[Hebb only isometry-invariant<br>rule for Hopfield nets. 24]
A --> Z[Gradient descent same for<br>logistic, tanh binary units. 25]
A --> AA[New convergent rules: decay<br>terms, bounded weights. 26]
A --> AB[Sampling deep targets trains<br>non-differentiable networks. 27]
AB --> AC[Sample activations, optimize layer,<br>fix rest of network. 28]
AB --> AD[Multiple perturbations provide<br>more gradients, higher cost. 29]
A --> AE[Backpropagation optimal: bits<br>transmitted, error improvement. 30]
class A,B,K,L,Y,Z,AA rules;
class C,D,E framework;
class F,G,H,Q,R,S,T,U,V,W,X,AB,AC,AD,AE learning;
class I,J capacity;
class M,N,O,P theory;
```

**Resume:**

**1.-**Learning rules adjust synaptic weights based on local variables available to each synapse in a physical neural system.

**2.-**A systematic framework for studying local learning rules defines the local variables and the functional form combining them.

**3.-**Polynomial local learning rules are analyzed in linear and non-linear networks to understand their behavior and capabilities.
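
The framework in items 2-3 can be illustrated with a minimal sketch (the parameterization with coefficients `alpha`, `beta`, `gamma`, `delta` is an assumed example of a low-degree polynomial rule, not the talk's exact notation): every term in the update uses only variables locally available at the synapse.

```python
import numpy as np

rng = np.random.default_rng(0)

def polynomial_local_rule(pre, post, w, eta=0.01,
                          alpha=1.0, beta=0.0, gamma=0.0, delta=0.0):
    """One step of a low-degree polynomial local rule:
    dw = eta * (alpha*pre*post + beta*pre + gamma*post + delta).
    Each term depends only on quantities local to the synapse."""
    return w + eta * (alpha * pre * post + beta * pre + gamma * post + delta)

# Single linear unit driven by random inputs; alpha=1, beta=gamma=delta=0
# recovers the plain Hebb rule dw = eta * pre * post.
w = 0.1 * rng.normal(size=3)
for _ in range(200):
    x = rng.normal(size=3)   # pre-synaptic activities
    y = w @ x                # post-synaptic activity of a linear unit
    w = polynomial_local_rule(x, y, w)
```

Choosing other coefficient settings in the same functional form yields the other rules the framework enumerates, which is what makes a systematic analysis possible.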

**4.-**The framework enables discovery of new learning rules and reveals connections between learning rules and group symmetries.

**5.-**Deep local learning by stacking local rules in feedforward networks can learn representations but not complex input-output functions.

**6.-**Learning complex input-output functions requires local deep learning where target information is propagated to deep layers.

**7.-**How target information is propagated to deep layers partitions the space of possible learning algorithms.

**8.-**The capacity of a learning algorithm's feedback channel is defined as bits about the gradient per weight divided by operations per weight.
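
In symbols (notation assumed here; the talk's exact symbols may differ), the definition in item 8 is the ratio

```latex
\mathcal{C} \;=\; \frac{\text{bits communicated about the gradient, per weight}}{\text{operations performed, per weight}}
```

so a feedback algorithm is efficient when it delivers many bits of gradient information for few operations; item 9 states that backpropagation maximizes this ratio.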

**9.-**Calculations show backpropagation outperforms alternatives, achieving the maximum possible feedback channel capacity.

**10.-**The theory clarifies the concept of Hebbian learning, what it can learn, and the sparsity of learning rules discovered so far.

**11.-**The vague term "Hebbian learning" should be replaced with an explicit definition of the local variables and the functional form combining them.

**12.-**In linear networks, the expectation of weight changes depends only on first and second moments of the data.

**13.-**When the learning recurrence is linear in the weights, it can be solved exactly in linear networks.
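
Items 12-13 can be checked numerically in the simplest case (a sketch under assumed settings, not the talk's general derivation): for plain Hebb on a single linear unit, the update is linear in the weights, so the expected trajectory obeys `E[w_{t+1}] = (I + eta * Sigma) E[w_t]` with `Sigma` the second moment of the data, and this recurrence has the closed-form solution `(I + eta * Sigma)^t w_0`.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # data covariance (zero-mean data)
eta, T = 0.01, 50
w0 = np.array([1.0, -1.0])

# Closed-form solution of the expected recurrence w_{t+1} = (I + eta*Sigma) w_t
M = np.eye(2) + eta * Sigma
w_closed = np.linalg.matrix_power(M, T) @ w0

# Monte-Carlo check: average many Hebbian trajectories on random data streams
n_runs = 2000
W = np.tile(w0, (n_runs, 1))
L = np.linalg.cholesky(Sigma)
for _ in range(T):
    X = rng.normal(size=(n_runs, 2)) @ L.T   # x ~ N(0, Sigma)
    Y = np.sum(W * X, axis=1)                # linear post-synaptic activity
    W = W + eta * Y[:, None] * X             # Hebb: dw = eta * y * x
w_mc = W.mean(axis=0)                        # approaches w_closed
```

The agreement holds exactly in expectation because the stochastic update is linear in `w` and each input is independent of the current weights.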

**14.-**In non-linear networks, expectations of activity-dependent terms can be estimated using a dropout approximation and Taylor expansions.

**15.-**Many local rules lead to divergent weights in linear networks, with some exceptions like gradient descent on a convex objective.

**16.-**Local learning in a single linear threshold unit is limited to learning linearly separable functions.
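
Item 16 is the classic perceptron limitation: a single linear threshold unit can realize AND (linearly separable) but not XOR. A minimal, self-contained check (the training loop details are illustrative assumptions):

```python
import numpy as np
from itertools import product

def fits(targets, epochs=100):
    """Try to fit 2-input boolean targets with a single threshold unit."""
    X = np.array(list(product([0, 1], repeat=2)), dtype=float)
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        for x, t in zip(X, targets):
            y = 1.0 if w @ x + b > 0 else 0.0
            w += (t - y) * x          # perceptron update, local to the unit
            b += (t - y)
    preds = [(1.0 if w @ x + b > 0 else 0.0) for x in X]
    return preds == list(map(float, targets))

and_learnable = fits([0, 0, 0, 1])   # linearly separable: learnable
xor_learnable = fits([0, 1, 1, 0])   # not linearly separable: never fits
```

No amount of local training helps in the XOR case, because no hyperplane separates the two classes; this is the representational wall that motivates deep architectures in the following items.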

**17.-**In deep feedforward networks, deep local learning cannot produce weights that are critical points of the error function.

**18.-**For deep networks to learn complex functions, target information must be fed back to influence the deep weights.

**19.-**In an optimal system, deep weights must depend on both the inputs and targets/outputs of the system.

**20.-**Physical implementations of optimal deep learning require a feedback channel to send target information to deep weights.

**21.-**Feedback to deep weights can either use forward connections in reverse or a separate set of backward connections.

**22.-**Feedback channel capacity calculations show backpropagation is optimal, achieving the highest possible capacity.

**23.-**An open question is whether biological neural systems have discovered some form of stochastic gradient descent during evolution.

**24.-**The simple Hebb rule is the only isometry-invariant learning rule for Hopfield networks.
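
The setting of item 24 can be made concrete with a tiny Hopfield network: the Hebb outer-product rule stores a ±1 pattern, and threshold dynamics recover it from a corrupted cue. Pattern size, corruption level, and the synchronous update scheme below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64
pattern = rng.choice([-1.0, 1.0], size=N)

# Hebb outer-product storage; no self-connections
W = np.outer(pattern, pattern)
np.fill_diagonal(W, 0.0)

# Corrupt 10 components, then recall with synchronous threshold updates
state = pattern.copy()
flip = rng.choice(N, size=10, replace=False)
state[flip] *= -1
for _ in range(5):
    state = np.sign(W @ state)
recovered = np.array_equal(state, pattern)
```

With a single stored pattern the corrupted state is pulled back in one update; the isometry-invariance claim is that, among local rules, only this Hebb form respects the symmetries of the Hopfield energy.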

**25.-**The gradient descent learning rule is the same for binary units with logistic or tanh activation functions.

**26.-**Many new convergent learning rules can be derived by adding decay terms to Hebb rules or bounding weights.
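
One well-known instance of item 26 is Oja's rule, which adds a decay term to Hebb: `dw = eta * y * (x - y*w)`. The decay keeps the weight norm bounded near 1, and the rule converges toward the data's first principal direction. The data distribution and learning rate below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Anisotropic zero-mean data: variance 9 along axis 0, variance 1 along axis 1
X = rng.normal(size=(20000, 2)) * np.array([3.0, 1.0])

w = rng.normal(size=2)
eta = 0.005
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)   # Oja: Hebb term plus decay that bounds ||w||

norm = np.linalg.norm(w)         # settles near 1
alignment = abs(w[0]) / norm     # settles near 1: aligned with high-variance axis
```

Plain Hebb on the same data would grow without bound, which is the divergence problem of item 15 that the decay term repairs.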

**27.-**Sampling-based deep targets algorithms can train non-differentiable networks reasonably well.

**28.-**These algorithms sample activations to generate targets that optimize a layer while holding the rest of the network fixed.

**29.-**Sampling multiple perturbations provides more gradient information at additional computational cost.
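
Items 27-29 describe a sampling-based deep-targets scheme: because the network need not be differentiable, candidate activations for a hidden layer are sampled, each is scored by the output error it would produce with the rest of the network fixed, and the best one becomes that layer's training target. The sketch below uses a threshold (non-differentiable) hidden layer; all sizes, the sample count, and the perceptron-style update are assumptions for illustration, not the talk's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)

def deep_target(t, W2, n_samples=64):
    """Sample binary hidden activations; return the one whose output
    through the fixed upper network W2 has the lowest squared error."""
    H = rng.choice([0.0, 1.0], size=(n_samples, W2.shape[0]))
    errs = np.sum((H @ W2 - t) ** 2, axis=1)
    return H[np.argmin(errs)]

# Toy setup: a non-differentiable threshold hidden layer W1, fixed readout W2
n_in, n_hid, n_out = 4, 8, 2
W1 = 0.1 * rng.normal(size=(n_in, n_hid))
W2 = rng.normal(size=(n_hid, n_out))
X = rng.normal(size=(32, n_in))
T = X @ rng.normal(size=(n_in, n_out))     # arbitrary real-valued targets

for _ in range(100):
    for x, t in zip(X, T):
        h_star = deep_target(t, W2)            # sampled layer-wise target (item 28)
        h = (x @ W1 > 0).astype(float)         # current non-differentiable layer
        W1 += 0.05 * np.outer(x, h_star - h)   # local step toward the target
```

Raising `n_samples` is the trade-off of item 29: more candidates give better targets (more gradient-like information) at proportionally higher computational cost.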

**30.-**Backpropagation is optimal in terms of bits transmitted and improvement in the error function per operation.

Knowledge Vault built by David Vivancos 2024