Knowledge Vault 2/16 - ICLR 2014-2023
Pierre Baldi ICLR 2015 - Keynote - The Ebb and Flow of Deep Learning: a Theory of Local Learning

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:

graph LR
  classDef rules fill:#f9d4d4, font-weight:bold, font-size:14px;
  classDef framework fill:#d4f9d4, font-weight:bold, font-size:14px;
  classDef learning fill:#d4d4f9, font-weight:bold, font-size:14px;
  classDef capacity fill:#f9f9d4, font-weight:bold, font-size:14px;
  classDef theory fill:#f9d4f9, font-weight:bold, font-size:14px;
  A[Pierre Baldi ICLR 2015] --> B[Local rules adjust weights using synaptic variables. 1]
  A --> C[Framework defines variables, functional form. 2]
  C --> D[Polynomial rules analyzed in linear, non-linear networks. 3]
  C --> E[Framework discovers rules, reveals group symmetries. 4]
  A --> F[Deep local learning stacks rules, learns representations. 5]
  F --> G[Complex functions need propagated target information. 6]
  G --> H[Target propagation partitions learning algorithms. 7]
  A --> I[Feedback channel capacity: bits/weight over operations/weight. 8]
  I --> J[Backpropagation outperforms, achieves maximum capacity. 9]
  A --> K[Theory clarifies Hebbian learning, sparsity of rules. 10]
  K --> L[Replace Hebbian with local variables, functional form. 11]
  A --> M[Linear networks: weight changes depend on data moments. 12]
  M --> N[Linear recurrence solved exactly in linear networks. 13]
  M --> O[Non-linear networks estimated by dropout, Taylor expansions. 14]
  M --> P[Local rules often diverge in linear networks. 15]
  A --> Q[Single linear threshold limited to linearly separable functions. 16]
  A --> R[Deep local learning can't find error function minima. 17]
  R --> S[Complex deep learning needs target feedback to weights. 18]
  S --> T[Optimal deep weights depend on inputs and targets. 19]
  S --> U[Optimal deep learning requires feedback channel physically. 20]
  U --> V[Feedback uses forward or separate backward connections. 21]
  U --> W[Backpropagation optimal, highest capacity feedback. 22]
  A --> X[Has evolution discovered stochastic gradient descent? 23]
  A --> Y[Hebb only isometry-invariant rule for Hopfield nets. 24]
  A --> Z[Gradient descent same for logistic, tanh binary units. 25]
  A --> AA[New convergent rules: decay terms, bounded weights. 26]
  A --> AB[Sampling deep targets trains non-differentiable networks. 27]
  AB --> AC[Sample activations, optimize layer, fix rest of network. 28]
  AB --> AD[Multiple perturbations provide more gradients, higher cost. 29]
  A --> AE[Backpropagation optimal: bits transmitted, error improvement. 30]
  class A,B,K,L,Y,Z,AA rules;
  class C,D,E framework;
  class F,G,H,Q,R,S,T,U,V,W,X,AB,AC,AD,AE learning;
  class I,J capacity;
  class M,N,O,P theory;

Resume:

1.-Learning rules adjust synaptic weights based on local variables available to each synapse in a physical neural system.

2.-A systematic framework for studying local learning rules defines the local variables and the functional form combining them.

3.-Polynomial local learning rules are analyzed in linear and non-linear networks to understand their behavior and capabilities (a small code sketch follows this list).

4.-The framework enables discovery of new learning rules and reveals connections between learning rules and group symmetries.

5.-Deep local learning by stacking local rules in feedforward networks can learn representations but not complex input-output functions.

6.-Learning complex input-output functions requires local deep learning where target information is propagated to deep layers.

7.-How target information is propagated to deep layers partitions the space of possible learning algorithms.

8.-The capacity of a learning algorithm's feedback channel is defined as bits about the gradient per weight divided by operations per weight (restated as a formula after this list).

9.-Calculations show backpropagation outperforms alternatives, achieving the maximum possible feedback channel capacity.

10.-The theory clarifies the concept of Hebbian learning, what it can learn, and the sparsity of learning rules discovered so far.

11.-The vague notion of "Hebbian learning" should be replaced with an explicit specification of the local variables and of the functional form that combines them.

12.-In linear networks, the expectation of weight changes depends only on first and second moments of the data.

13.-When the learning recurrence is linear in the weights, it can be solved exactly in linear networks (worked out after this list).

14.-In non-linear networks, expectations of activity-dependent terms can be estimated using a dropout approximation and Taylor expansions.

15.-Many local rules lead to divergent weights in linear networks, with some exceptions like gradient descent on a convex objective.

16.-Local learning in a single linear threshold unit is limited to learning linearly separable functions.

17.-In deep feedforward networks, deep local learning cannot produce weights that are critical points of the error function.

18.-For deep networks to learn complex functions, target information must be fed back to influence the deep weights.

19.-In an optimal system, deep weights must depend on both the inputs and targets/outputs of the system.

20.-Physical implementations of optimal deep learning require a feedback channel to send target information to deep weights.

21.-Feedback to deep weights can either use forward connections in reverse or a separate set of backward connections.

22.-Feedback channel capacity calculations show backpropagation is optimal, achieving the highest possible capacity.

23.-An open question is whether biological neural systems have discovered some form of stochastic gradient descent during evolution.

24.-The simple Hebb rule is the only isometry-invariant learning rule for Hopfield networks.

25.-The gradient descent learning rule is the same for binary units with logistic or tanh activation functions (a short derivation follows this list).

26.-Many new convergent learning rules can be derived by adding decay terms to Hebb rules or bounding weights (a toy comparison follows this list).

27.-Sampling-based deep targets algorithms can train non-differentiable networks reasonably well.

28.-These algorithms sample activations to generate targets that optimize a layer while holding the rest of the network fixed (sketched in code after this list).

29.-Sampling multiple perturbations provides more gradient information at additional computational cost.

30.-Backpropagation is optimal in terms of bits transmitted and improvement in the error function per operation.
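A minimal sketch of the polynomial-rule framework in points 2-3, assuming a toy single linear unit; the rule dictionary, learning rate, and data below are illustrative choices, not taken from the talk.

```python
# Sketch (assumed toy setup): a local learning rule written as a low-degree
# polynomial F(pre, post, w) in the variables available at one synapse.
import numpy as np

def polynomial_rule(pre, post, w, coeffs, lr=0.01):
    """Delta-w for one synapse, as a polynomial in its local variables.

    coeffs maps exponent triples (a, b, c) for (pre, post, w) to coefficients;
    e.g. {(1, 1, 0): 1.0} is the simple Hebb rule  dw = lr * pre * post.
    """
    dw = 0.0
    for (a, b, c), k in coeffs.items():
        dw += k * (pre ** a) * (post ** b) * (w ** c)
    return lr * dw

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # toy data stream
w = 0.1 * rng.normal(size=3)             # weights of one linear unit
hebb = {(1, 1, 0): 1.0}                  # the simplest polynomial rule
for x in X:
    post = w @ x                         # output of the linear unit
    w = w + np.array([polynomial_rule(x[i], post, w[i], hebb) for i in range(3)])
print("weights after one pass of simple Hebb:", w)
```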
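The capacity definition in point 8, restated with notation chosen here for illustration (the keynote states it informally):

```latex
% Notation assumed: B = bits of information about the error gradient delivered
% to each weight by the feedback (learning) channel, N_op = operations spent
% per weight to deliver that information.
\[
  \mathcal{C} \;=\; \frac{B}{N_{\mathrm{op}}}
  \qquad \text{bits of gradient information per weight, per operation.}
\]
% Points 9 and 22: backpropagation delivers gradient information to every
% weight with a number of operations proportional to the number of weights,
% so this ratio stays at its maximal value.
```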
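A worked instance of points 12-13 (and of the divergence noted in point 15), using the simple Hebb rule on a single linear unit as an assumed example:

```latex
% Simple Hebb rule on a linear unit O = w^T x with learning rate \eta (example
% chosen here; the talk's analysis covers general polynomial rules).
\[
  \Delta w = \eta\, O\, x = \eta\, x\,(x^\top w)
  \;\Longrightarrow\;
  \mathbb{E}[\Delta w] = \eta\, \Sigma\, w,
  \qquad \Sigma = \mathbb{E}[x x^\top],
\]
% so the averaged update depends on the data only through its second moment
% (plus the first moment once biases are included).  The recurrence is linear
% in w and can be solved exactly:
\[
  w(t+1) = (I + \eta\,\Sigma)\, w(t)
  \;\Longrightarrow\;
  w(t) = (I + \eta\,\Sigma)^{t}\, w(0),
\]
% which grows without bound along the top eigenvector of \Sigma, illustrating
% the divergence of point 15.
```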
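A short derivation behind point 25, assuming the standard relative-entropy (cross-entropy) loss appropriate to each transfer function:

```latex
% Logistic unit: O = \sigma(S), S = \sum_i w_i I_i, target t \in \{0,1\},
% E = -[\, t \log O + (1-t)\log(1-O) \,].  Then \partial E/\partial S = O - t, so
\[
  \Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = \eta\,(t - O)\, I_i .
\]
% tanh unit: O = \tanh(S), target t \in \{-1,+1\},
% E = -\tfrac12\big[(1+t)\log\tfrac{1+O}{2} + (1-t)\log\tfrac{1-O}{2}\big].
% Using \partial O/\partial S = 1 - O^2, the (1 - O^2) factors cancel and again
\[
  \Delta w_i = \eta\,(t - O)\, I_i ,
\]
% i.e. the same rule, expressed entirely in variables local to the synapse.
```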
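A toy comparison for points 15 and 26, with a single linear unit and Gaussian data assumed here for illustration: the plain Hebb rule diverges, while adding an Oja-style decay term keeps the weights bounded.

```python
# Toy demo (assumed setup): plain Hebb vs. Hebb with a decay term.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4)) * np.array([2.0, 1.0, 0.5, 0.5])  # anisotropic data
lr = 0.01

w_hebb = 0.1 * rng.normal(size=4)
w_decay = w_hebb.copy()
for x in X:
    y1 = w_hebb @ x
    w_hebb = w_hebb + lr * y1 * x                     # plain Hebb: diverges
    y2 = w_decay @ x
    w_decay = w_decay + lr * y2 * (x - y2 * w_decay)  # Oja-style decay: bounded

print("plain Hebb  |w| =", np.linalg.norm(w_hebb))
print("with decay  |w| =", np.linalg.norm(w_decay))
```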
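A minimal sketch of the sampling-based deep-targets step in points 27-29, with the toy network, the sign-flip sampler, and the layer update all assumed here for illustration: candidate activations are sampled for one layer, the candidate that most reduces the output error with the rest of the network held fixed becomes that layer's target, and the layer is trained toward it.

```python
# Sketch (assumed toy setup) of one sampling-based deep-targets step.
import numpy as np

rng = np.random.default_rng(2)

def forward_top(h, W2):
    """Upper part of the network, held fixed while the lower layer is trained."""
    return np.tanh(h @ W2)

def deep_target_for_layer(h, y, W2, n_samples=64, flip_p=0.1):
    """Sample perturbed (sign-flipped) activations; keep the best candidate."""
    best_h = h
    best_err = np.mean((forward_top(h, W2) - y) ** 2)
    for _ in range(n_samples):                      # point 29: more samples give
        flips = np.where(rng.random(h.shape) < flip_p, -1.0, 1.0)  # more signal
        cand = h * flips                            # but cost more forward passes
        err = np.mean((forward_top(cand, W2) - y) ** 2)
        if err < best_err:
            best_h, best_err = cand, err
    return best_h

# Toy data and a two-layer network whose hidden units are non-differentiable.
X = rng.normal(size=(32, 5))
Y = rng.normal(size=(32, 2))
W1 = 0.5 * rng.normal(size=(5, 8))
W2 = 0.5 * rng.normal(size=(8, 2))

H = np.sign(X @ W1)                            # hidden activations in {-1, +1}
H_target = deep_target_for_layer(H, Y, W2)     # sampled deep target for layer 1
W1 += 0.01 * X.T @ (H_target - H)              # train layer 1 toward its target
```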

Knowledge Vault built by David Vivancos 2024