Andrew M. Saxe; James L. McClelland; Surya Ganguli ICLR 2014 - Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

**Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:**

graph LR
classDef deeplearning fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef deeplinear fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef initialization fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef nonlinear fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Andrew M. Saxe et al.] --> B[Deep learning successes, difficult theory. 1]
A --> C[Theoretical perspective on training, rates, initializations. 2]
A --> D[Deep linear networks: analysis without nonlinearities. 3]
D --> E[Long plateaus, drops in error, faster pretrained convergence. 4]
D --> F[3-layer network reveals coupled differential equations. 5]
F --> G[Weights converge to input-output SVD. 6]
F --> H[Special solutions describe entire learning trajectory. 7]
D --> I[Learning time inversely proportional to singular value. 8]
D --> J[Solutions decouple, good approximations generally. 9]
D --> K[Approach extends to deeper linear networks. 10]
K --> L[Combined gradient on order of layers. 11]
K --> M[Optimal rate scales as 1/layers. 12]
M --> N[Finite slowdown with special initial conditions. 13]
M --> O[Learning time approximately depth-independent. 14]
D --> P[Experiments confirm finite slowdown prediction. 15]
D --> Q[Mode learning time depends on singular value size. 16]
Q --> R[1/depth rate, fast learning with decoupled conditions. 17]
R --> S[Pretraining finds good decoupled conditions. 18]
Q --> T[Random orthogonal initializations perform similarly. 19]
T --> U[Orthogonal outperforms scaled Gaussian initializations. 20]
T --> V[Orthogonal matrices preserve norms exactly. 21]
A --> W[Near-isometry initialization for gradient propagation. 22]
W --> X[Scaled orthogonal counteracts contractive nonlinearities. 23]
X --> Y[Faster training, better error on MNIST. 24]
X --> Z[Large gains enable few-iteration deep learning. 25]
Z --> AA[Accuracy drop suggests small weights regularize. 26]
Z --> AB[Quick training with large weights suggests saddle points. 27]
D --> AC[Theory extends to non-square weight matrices. 28]
A --> AD[Vanishing gradients manifest in deep linear networks. 29]
A --> AE[LSTMs help vanishing gradients, don't fully achieve isometry. 30]
class A,B,C deeplearning;
class D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,AC,AD deeplinear;
class S,T,U,V,W,X initialization;
class Y,Z,AA,AB,AE nonlinear;

**Resume:**

**1.-**Deep learning has had many recent successes, but its theory is difficult due to the composition of nonlinearities.

**2.-**This talk takes a theoretical perspective to gain intuition on how training time, learning rates, and initializations impact learning.

**3.-**Deep linear networks remove nonlinearities to make analysis possible while still exhibiting some interesting learning phenomena.

**4.-**Deep linear networks can show long plateaus followed by drops in training error, and faster convergence from pre-trained initializations.

**5.-**Focusing on a simple 3-layer linear network trained on input-output pairs reveals coupled nonlinear differential equations governing the learning dynamics.
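
A hedged sketch of these dynamics in the paper's continuous-time notation, with $W^{21}$ and $W^{32}$ the two weight matrices, $\Sigma^{11}=\langle xx^T\rangle$ the input correlation matrix, and $\Sigma^{31}=\langle yx^T\rangle$ the input-output correlation matrix (symbols assumed to follow Saxe et al. 2014):

$$\tau\,\frac{d}{dt}W^{21}=(W^{32})^T\!\left(\Sigma^{31}-W^{32}W^{21}\Sigma^{11}\right),\qquad \tau\,\frac{d}{dt}W^{32}=\left(\Sigma^{31}-W^{32}W^{21}\Sigma^{11}\right)(W^{21})^T.$$

Although the network computes a purely linear map, these equations are coupled and nonlinear in the weights.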

**6.-**At convergence, the weights align with the singular value decomposition (SVD) of the input-output correlation matrix.

**7.-**Exact solutions can be found for special initial conditions; these solutions describe the entire learning trajectory as the singular values being learned over time.
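
A sketch of the resulting closed-form trajectory for a single mode, under the balanced, decoupled initial conditions (here $s$ is the mode's target singular value, $a_0$ its small initial strength, and $\tau$ the inverse learning rate; constants may differ slightly from the paper's exact expression):

$$a(t)=\frac{s\,e^{2st/\tau}}{e^{2st/\tau}-1+s/a_0},$$

a sigmoidal curve that sits near $a_0$ during a long plateau and then rises sharply to $s$.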

**8.-**Learning time for each mode is inversely proportional to the size of the corresponding singular value: stronger input-output correlations are learned faster.
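
Inverting the same single-mode trajectory gives the time to drive a mode from $a_0$ to a value $a_f$ near $s$ (again a sketch under the decoupled assumption):

$$t=\frac{\tau}{2s}\,\ln\frac{a_f\,(s-a_0)}{a_0\,(s-a_f)},$$

so, up to the logarithmic factor, the learning time scales as $1/s$.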

**9.-**Solutions rapidly decouple even from random initial conditions, so the analytic solutions are good approximations in general.

**10.-**The same approach extends to deeper linear networks, with each effective singular value evolving according to a more complex differential equation.
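
A sketch of that more complex equation under the same balanced, decoupled assumptions, for a network with $N_w$ weight matrices ($N_w=2$ recovers the three-layer case $\tau\,\dot a = 2a(s-a)$):

$$\tau\,\frac{da}{dt}=N_w\,a^{\,2-2/N_w}\,(s-a).$$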

**11.-**The norm of the combined gradient, summed across all layers, is on the order of the number of layers.

**12.-**The optimal learning rate scales as 1/m, where m is the number of layers, based on bounding the maximum eigenvalue of the Hessian.

**13.-**Despite the 1/m learning rate, the learning-time difference between deep and shallow networks remains finite when using the special initial conditions.

**14.-**This is because the gradient norm is of order m while the learning rate is 1/m, so the learning time is approximately independent of depth.
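
In the single-mode picture the two effects cancel: the drive on each mode grows like $N_w\,a^{2-2/N_w}$, while the admissible learning rate shrinks like $1/N_w$ (equivalently $\tau\propto N_w$), so $da/dt\approx a^{2-2/N_w}(s-a)$ and the traversal time saturates to a finite limit as $N_w\to\infty$ instead of growing with depth (a sketch of the argument, not the paper's exact bound).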

**15.-**Experiments on deep linear networks up to 100 layers show saturation in learning time as depth increases, confirming the finite slowdown prediction.

**16.-**In summary, deep linear networks have nontrivial learning dynamics and each mode's learning time depends on its singular value size.

**17.-**The optimal learning rate scales as 1/depth, but networks can still learn quickly if initialized with decoupled conditions.

**18.-**Pretraining is one way to find good decoupled initial conditions, analogous to helping optimization in the nonlinear case.

**19.-**Pretraining in a deep linear network simply sets each weight matrix to an orthogonal matrix, suggesting that random orthogonal initializations could work too.

**20.-**Random orthogonal initializations perform similarly to pretraining and enable fast depth-independent learning times, outperforming carefully scaled random Gaussian initializations.

**21.-**Carefully scaled random matrices preserve vector norms only on average, amplifying some directions while attenuating others, whereas orthogonal matrices preserve norms exactly.
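
A minimal NumPy sketch of this contrast (an illustrative script, not from the talk): propagate a unit vector through 30 layers of scaled Gaussian versus orthogonal matrices and compare the final norms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 100, 30
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                     # unit-norm input

g = x.copy()                               # scaled Gaussian stack
o = x.copy()                               # orthogonal stack
for _ in range(depth):
    W = rng.standard_normal((n, n)) / np.sqrt(n)      # entries ~ N(0, 1/n): preserves norm only on average
    g = W @ g
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
    o = Q @ o

print(f"Gaussian stack norm:   {np.linalg.norm(g):.3f}")   # drifts away from 1 with depth
print(f"Orthogonal stack norm: {np.linalg.norm(o):.3f}")   # stays exactly 1 (up to round-off)
```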

**22.-**For nonlinear networks, a good initialization may be a near-isometry on as large a subspace as possible to allow gradient propagation.

**23.-**Scaling random orthogonal weight matrices by a gain slightly greater than 1 helps counteract contractive nonlinearities to achieve many singular values near 1.
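
A hedged NumPy sketch of such an initializer (the function name, the QR-based construction, and the example gain are my own illustration, not necessarily how the paper's experiments build the matrices):

```python
import numpy as np

def scaled_orthogonal(rows, cols, gain=1.1, rng=None):
    """Random (semi-)orthogonal matrix scaled by `gain` (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)           # reduced QR: columns of q are orthonormal
    q *= np.sign(np.diag(r))         # fix column signs so the draw is uniform over orthogonal matrices
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]

# Example: square weight matrices for a 30-layer tanh network, gain just above 1.
layers = [scaled_orthogonal(500, 500, gain=1.1) for _ in range(30)]
```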

**24.-**30-layer nonlinear networks trained on MNIST showed faster training and slightly better test error using orthogonal initializations scaled just above 1.

**25.-**Even larger gains (e.g. 2-10x) allow very deep networks to learn in just a few iterations, but with an accuracy tradeoff.

**26.-**The accuracy drop with high gains suggests small initial weights are important for regularization and learning smooth functions.

**27.-**The ability to train quickly with large initial weights suggests training difficulties may arise more from saddle points than local minima.

**28.-**The theory extends to non-square weight matrices by using SVDs with ones and zeros.

**29.-**The vanishing gradient problem does manifest in deep linear networks as it does in nonlinear ones.

**30.-**LSTMs help with vanishing gradients by preserving norm through their self-loops, but they do not fully achieve a near-isometry.

Knowledge Vault built by David Vivancos 2024