Knowledge Vault 2/7 - ICLR 2014-2023
Andrew M. Saxe; James L. McClelland; Surya Ganguli ICLR 2014 - Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
  classDef deeplearning fill:#f9d4d4, font-weight:bold, font-size:14px;
  classDef deeplinear fill:#d4f9d4, font-weight:bold, font-size:14px;
  classDef initialization fill:#f9f9d4, font-weight:bold, font-size:14px;
  classDef nonlinear fill:#f9d4f9, font-weight:bold, font-size:14px;
  A[Andrew M. Saxe et al.] --> B[Deep learning successes, difficult theory. 1]
  A --> C[Theoretical perspective on training, rates, initializations. 2]
  A --> D[Deep linear networks: analysis without nonlinearities. 3]
  D --> E[Long plateaus, drops in error, faster pretrained convergence. 4]
  D --> F[3-layer network reveals coupled differential equations. 5]
  F --> G[Weights converge to input-output SVD. 6]
  F --> H[Special solutions describe entire learning trajectory. 7]
  D --> I[Learning time inversely proportional to singular value. 8]
  D --> J[Solutions decouple, good approximations generally. 9]
  D --> K[Approach extends to deeper linear networks. 10]
  K --> L[Combined gradient on order of layers. 11]
  K --> M[Optimal rate scales as 1/layers. 12]
  M --> N[Finite slowdown with special initial conditions. 13]
  M --> O[Learning time approximately depth-independent. 14]
  D --> P[Experiments confirm finite slowdown prediction. 15]
  D --> Q[Mode learning time depends on singular value size. 16]
  Q --> R[1/depth rate, fast learning with decoupled conditions. 17]
  R --> S[Pretraining finds good decoupled conditions. 18]
  Q --> T[Random orthogonal initializations perform similarly. 19]
  T --> U[Orthogonal outperforms scaled Gaussian initializations. 20]
  T --> V[Orthogonal matrices preserve norms exactly. 21]
  A --> W[Near-isometry initialization for gradient propagation. 22]
  W --> X[Scaled orthogonal counteracts contractive nonlinearities. 23]
  X --> Y[Faster training, better error on MNIST. 24]
  X --> Z[Large gains enable few-iteration deep learning. 25]
  Z --> AA[Accuracy drop suggests small weights regularize. 26]
  Z --> AB[Quick training with large weights suggests saddle points. 27]
  D --> AC[Theory extends to non-square weight matrices. 28]
  A --> AD[Vanishing gradients manifest in deep linear networks. 29]
  A --> AE[LSTMs help vanishing gradients, don't fully achieve isometry. 30]
  class A,B,C deeplearning;
  class D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,AC,AD deeplinear;
  class S,T,U,V,W,X initialization;
  class Y,Z,AA,AB,AE nonlinear;

Resume:

1.-Deep learning has had many recent successes but the theory is difficult due to the composition of nonlinearities.

2.-This talk takes a theoretical perspective to gain intuition on how training time, learning rates, and initializations impact learning.

3.-Deep linear networks remove nonlinearities to make analysis possible while still exhibiting some interesting learning phenomena.

4.-Deep linear networks can show long plateaus followed by drops in training error, and faster convergence from pre-trained initializations.
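As a concrete illustration (a toy sketch of mine, not the paper's experiment; the dimensions, learning rate, and noise level are arbitrary choices), plain gradient descent on a two-weight-matrix linear network with small random initial weights typically shows the training error sitting on plateaus and then dropping as successive input-output modes are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 20, 20, 10, 500   # hypothetical sizes

# Toy dataset: whitened inputs, targets from a rank-3 linear teacher plus noise.
X = rng.standard_normal((n, d_in))
teacher = rng.standard_normal((d_out, 3)) @ rng.standard_normal((3, d_in))
Y = X @ teacher.T + 0.1 * rng.standard_normal((n, d_out))

# Small random initialization, chosen to make the initial plateau visible.
W1 = 1e-3 * rng.standard_normal((d_hidden, d_in))
W2 = 1e-3 * rng.standard_normal((d_out, d_hidden))
lr = 1e-3 / n

for step in range(20001):
    E = Y - X @ W1.T @ W2.T                 # residuals of the linear map W2 @ W1
    # Gradients of 0.5 * ||Y - X W1^T W2^T||_F^2
    gW2 = -E.T @ X @ W1.T
    gW1 = -W2.T @ E.T @ X
    W2 -= lr * gW2
    W1 -= lr * gW1
    if step % 1000 == 0:
        print(step, 0.5 * np.mean(E ** 2))
```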

5.-Focusing on a simple 3-layer linear network trained on input-output pairs reveals coupled nonlinear differential equations governing the learning dynamics.
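For reference, a hedged reconstruction of those coupled equations, in the paper's notation as I recall it (W^{21}, W^{32} are the two weight matrices, Σ^{11} the input correlation matrix, Σ^{31} the input-output correlation matrix, and τ an inverse learning rate):

$$
\tau\,\frac{dW^{21}}{dt} = {W^{32}}^{\top}\!\left(\Sigma^{31} - W^{32}W^{21}\Sigma^{11}\right),
\qquad
\tau\,\frac{dW^{32}}{dt} = \left(\Sigma^{31} - W^{32}W^{21}\Sigma^{11}\right){W^{21}}^{\top}.
$$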

6.-At convergence, the weights converge to the SVD of the input-output correlation matrix.

7.-Exact solutions can be found for special initial conditions that describe the entire learning trajectory as learning the singular values over time.

8.-Learning time for each mode is inversely proportional to the size of the corresponding singular value - stronger correlations are learned faster.
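Sketching the result being summarized (assuming whitened inputs, Σ^{11} = I, and the decoupled initial conditions of item 7): each mode strength a(t), associated with a singular value s of Σ^{31}, evolves independently, and the time to drive it from a small a_0 to a_f near s comes out as

$$
\tau\,\frac{da}{dt} = 2a\,(s - a),
\qquad
t(a_0 \to a_f) = \frac{\tau}{2s}\,\ln\!\frac{a_f\,(s - a_0)}{a_0\,(s - a_f)},
$$

so learning time scales as 1/s and the strongest correlations are learned first.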

9.-Solutions rapidly decouple even from random initial conditions, so the analytic solutions are good approximations in general.

10.-The same approach extends to deeper linear networks, with each effective singular value evolving according to a more complex differential equation.
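A hedged sketch of that deeper-network generalization (again assuming whitened inputs and balanced, decoupled initial conditions; N_l is the number of layers, so there are N_l − 1 weight matrices): the effective strength u of a mode, the product of that mode's component across all layers, obeys approximately

$$
\tau\,\frac{du}{dt} = (N_l - 1)\,u^{\,2 - 2/(N_l - 1)}\,(s - u),
$$

which reduces to the three-layer equation above for N_l = 3; the prefactor grows with depth while the exponent approaches 2.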

11.-The norm of the combined gradient, summed across all layers, grows on the order of the number of layers.

12.-The optimal learning rate scales as 1/m where m is the number of layers, based on bounding the maximum eigenvalue.

13.-Despite the 1/m learning rate, the time difference between deep and shallow networks remains finite if using the special initial conditions.

14.-This is because the gradient norm is order m while learning rate is 1/m, so learning time is approximately independent of depth.

15.-Experiments on deep linear networks up to 100 layers show saturation in learning time as depth increases, confirming the finite slowdown prediction.
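A toy numerical probe of this prediction (not the paper's experiment; the target value, step size, and tolerance are made up): run gradient descent on a chain of scalar weights fitting a single "singular value", scale the learning rate as 1/depth, and count iterations to convergence as depth grows.

```python
import numpy as np

def iterations_to_fit(depth, s=3.0, w0=1.0, lr0=0.05, tol=1e-2, max_iter=500_000):
    """Gradient descent on a product of `depth` scalar weights fitting a target s,
    with loss L = 0.5 * (s - prod(w))**2 and learning rate scaled as 1/depth.
    Returns the number of iterations until |s - prod(w)| < tol."""
    w = np.full(depth, w0, dtype=float)   # balanced ('decoupled') initial condition
    lr = lr0 / depth
    for it in range(max_iter):
        p = np.prod(w)
        err = s - p
        if abs(err) < tol:
            return it
        grad = -err * p / w               # dL/dw_i = -err * prod_{j != i} w_j
        w -= lr * grad
    return max_iter

for d in (2, 5, 10, 50, 100):
    print(d, iterations_to_fit(d))
```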

16.-In summary, deep linear networks have nontrivial learning dynamics and each mode's learning time depends on its singular value size.

17.-Optimal learning rate scales as 1/depth but networks can still learn quickly if initialized with decoupled conditions.

18.-Pretraining is one way to find good decoupled initial conditions, analogous to helping optimization in the nonlinear case.

19.-Pretraining in a deep linear network effectively sets each weight matrix to an orthogonal matrix, suggesting random orthogonal initializations could work just as well.

20.-Random orthogonal initializations perform similarly to pretraining and enable fast depth-independent learning times, outperforming carefully scaled random Gaussian initializations.

21.-Carefully scaled random matrices preserve vector norms only on average, amplifying some directions while attenuating others, whereas orthogonal matrices preserve norms exactly.
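A small NumPy check of this contrast (a sketch, not the paper's code): compare the singular values of a random orthogonal matrix, obtained here via QR decomposition, with those of a Gaussian matrix scaled by 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Random orthogonal matrix via QR of a square Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Gaussian matrix scaled so that vector norms are preserved only on average.
G = rng.standard_normal((n, n)) / np.sqrt(n)

sq = np.linalg.svd(Q, compute_uv=False)
sg = np.linalg.svd(G, compute_uv=False)
print("orthogonal:      max/min singular value =", sq[0], sq[-1])   # both exactly 1
print("scaled Gaussian: max/min singular value =", sg[0], sg[-1])   # roughly 2 and near 0

# Consequence: Q maps every vector to one of identical norm, while G amplifies
# inputs aligned with its top singular direction and nearly annihilates those
# aligned with its bottom one.
```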

22.-For nonlinear networks, a good initialization may be a near-isometry on as large a subspace as possible to allow gradient propagation.

23.-Scaling random orthogonal weight matrices by a gain slightly greater than 1 helps counteract contractive nonlinearities to achieve many singular values near 1.
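A toy forward-pass sketch of this point (the width, depth, and gains are arbitrary choices of mine, not the paper's settings): propagate a random input through a deep tanh network whose weights are gain-scaled random orthogonal matrices and watch how much of the activation norm survives with depth.

```python
import numpy as np

def final_norm_ratio(depth=30, width=500, gain=1.1, seed=0):
    """Ratio ||h_depth|| / ||h_0|| after `depth` layers of h -> tanh(gain * Q h),
    where each Q is an independent random orthogonal matrix."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)   # O(1) activations, so tanh actually contracts
    h0_norm = np.linalg.norm(h)
    for _ in range(depth):
        Q, _ = np.linalg.qr(rng.standard_normal((width, width)))
        h = np.tanh(gain * Q @ h)
    return np.linalg.norm(h) / h0_norm

for g in (0.9, 1.0, 1.1, 1.3):
    print(g, round(final_norm_ratio(gain=g), 3))
```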

24.-30-layer nonlinear networks trained on MNIST showed faster training and slightly better test error using orthogonal initializations scaled by a gain just above 1.

25.-Even larger gains (e.g. 2-10x) allow very deep networks to learn in just a few iterations, but with an accuracy tradeoff.

26.-The accuracy drop with high gains suggests small initial weights are important for regularization and learning smooth functions.

27.-The ability to train quickly with large initial weights suggests training difficulties may arise more from saddle points than local minima.

28.-The theory extends to non-square weight matrices by using SVDs with ones and zeros.
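One way to read this in code (my own illustrative construction, not necessarily the paper's): for a rectangular weight matrix, take the singular vectors of a Gaussian matrix and set all nonzero singular values to one, giving a scaled semi-orthogonal matrix.

```python
import numpy as np

def semi_orthogonal(rows, cols, gain=1.0, seed=0):
    """Rectangular analogue of a random orthogonal initialization: SVD a Gaussian
    matrix and replace its singular values by ones (times an optional gain)."""
    rng = np.random.default_rng(seed)
    u, _, vt = np.linalg.svd(rng.standard_normal((rows, cols)), full_matrices=False)
    return gain * (u @ vt)

W = semi_orthogonal(256, 784, gain=1.1)
# Rows form a scaled orthonormal set: W @ W.T == gain**2 * I (up to float error).
print(np.allclose(W @ W.T, 1.1 ** 2 * np.eye(256)))
```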

29.-The vanishing gradient problem does manifest in deep linear networks as it does in nonlinear ones.

30.-LSTMs help with vanishing gradients by preserving norm for self-loops but don't fully achieve a near-isometry.

Knowledge Vault built by David Vivancos 2024