Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Deep learning has had many recent successes, but its theory is difficult because networks compose many nonlinearities.
2.-This talk takes a theoretical perspective to gain intuition on how training time, learning rates, and initializations impact learning.
3.-Deep linear networks remove nonlinearities to make analysis possible while still exhibiting some interesting learning phenomena.
4.-Deep linear networks can show long plateaus followed by drops in training error, and faster convergence from pre-trained initializations (a simulation sketch after this list illustrates this).
5.-Focusing on a simple 3-layer linear network trained on input-output pairs reveals coupled nonlinear differential equations governing the learning dynamics.
6.-At convergence, the weights align with the SVD of the input-output correlation matrix.
7.-For special initial conditions, exact solutions describe the entire learning trajectory as the singular values being learned over time.
8.-Learning time for each mode is inversely proportional to the size of the corresponding singular value: stronger correlations are learned faster (see the worked equations after this list).
9.-The dynamics rapidly decouple even from random initial conditions, so the analytic solutions are good approximations in general.
10.-The same approach extends to deeper linear networks, with each effective singular value evolving according to a more complex differential equation.
11.-The combined gradient across all layers grows in proportion to the number of layers.
12.-The optimal learning rate scales as 1/m, where m is the number of layers, based on bounding the maximum eigenvalue of the Hessian.
13.-Despite the 1/m learning rate, the slowdown of deep networks relative to shallow ones remains finite when the special initial conditions are used.
14.-This is because the gradient is of order m while the learning rate is of order 1/m, so learning time is approximately independent of depth (see the depth-scaling equations after this list).
15.-Experiments on deep linear networks up to 100 layers show saturation in learning time as depth increases, confirming the finite slowdown prediction.
16.-In summary, deep linear networks have nontrivial learning dynamics and each mode's learning time depends on its singular value size.
17.-The optimal learning rate scales as 1/depth, but networks can still learn quickly if initialized with decoupled conditions.
18.-Pretraining is one way to find good decoupled initial conditions, analogous to helping optimization in the nonlinear case.
19.-Pretraining a deep linear network simply makes each weight matrix orthogonal, suggesting random orthogonal initializations could work too.
20.-Random orthogonal initializations perform similarly to pretraining and enable fast, depth-independent learning times, outperforming carefully scaled random Gaussian initializations (an initialization sketch follows this list).
21.-Carefully scaled random matrices preserve vector norms only on average, amplifying some directions while attenuating others, whereas orthogonal matrices preserve norms exactly.
22.-For nonlinear networks, a good initialization may be a near-isometry on as large a subspace as possible to allow gradient propagation.
23.-Scaling random orthogonal weight matrices by a gain slightly greater than 1 helps counteract contractive nonlinearities to achieve many singular values near 1.
24.-30-layer nonlinear networks trained on MNIST showed faster training and slightly better test error using orthogonal initializations scaled just above 1.
25.-Even larger gains (e.g. 2-10x) allow very deep networks to learn in just a few iterations, but with an accuracy tradeoff.
26.-The accuracy drop with high gains suggests small initial weights are important for regularization and learning smooth functions.
27.-The ability to train quickly with large initial weights suggests training difficulties may arise more from saddle points than local minima.
28.-The theory extends to non-square weight matrices by using semi-orthogonal matrices, whose SVDs have diagonal entries of ones padded with zeros.
29.-The vanishing gradient problem does manifest in deep linear networks as it does in nonlinear ones.
30.-LSTMs help with vanishing gradients by preserving norm through their self-loops, but they don't fully achieve a near-isometry.
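The worked equations behind items 5-10, written in the notation commonly used for this analysis (the symbols τ, s, a, and a_0 are assumptions of this sketch, not quoted from the talk): for whitened inputs and initial conditions aligned with the SVD of the input-output correlation matrix, each mode of a 3-layer linear network has an effective strength a that evolves independently as
\[ \tau \frac{da}{dt} = 2a(s - a), \]
where s is the mode's singular value and τ is the inverse learning rate. The exact solution is the sigmoidal trajectory
\[ a(t) = \frac{s\, e^{2st/\tau}}{e^{2st/\tau} - 1 + s/a_0}, \]
and the time to grow from a small initial value a_0 to a value a_f near s is
\[ t = \frac{\tau}{2s} \ln\!\left( \frac{a_f\,(s - a_0)}{a_0\,(s - a_f)} \right), \]
which scales as τ/s. This is the sense in which learning time is inversely proportional to singular value (item 8), and a small a_0 is what produces the long plateau before a rapid drop (item 4).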
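A sketch of the depth-scaling argument in items 11-14, under the assumption of m weight matrices with balanced, decoupled modes (m, c, a, s, and τ are symbols of this sketch): writing each mode's composite strength as a = c^m, gradient descent on the individual layers gives
\[ \tau \frac{da}{dt} = m\, a^{\,2 - 2/m}\,(s - a), \]
so the contributions summed over layers grow linearly with m. Keeping the discrete updates stable therefore requires a learning rate of order 1/m, i.e. τ of order m. The factor of m in the dynamics and the 1/m in the learning rate roughly cancel, so the learning time approaches a finite, depth-independent limit as m grows, consistent with the saturation seen in the 100-layer experiments of item 15.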
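A minimal simulation sketch of items 4-9 (illustrative Python, not the speaker's code; the task, layer sizes, learning rate, and step count are assumptions chosen to make the effect visible):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_samples = 8, 8, 8, 500

# Teacher map with a few well-separated singular values, so the staged,
# plateau-then-drop learning of each mode is easy to see.
U, _ = np.linalg.qr(rng.standard_normal((n_out, n_out)))
V, _ = np.linalg.qr(rng.standard_normal((n_in, n_in)))
s_target = np.array([4.0, 2.0, 1.0, 0.5] + [0.0] * (n_in - 4))
target_map = U @ np.diag(s_target) @ V.T

X = rng.standard_normal((n_samples, n_in))   # roughly whitened inputs
Y = X @ target_map.T                         # noiseless teacher outputs

# Small random initial weights: the regime where modes decouple (item 9).
scale = 1e-3
W1 = scale * rng.standard_normal((n_hidden, n_in))
W2 = scale * rng.standard_normal((n_out, n_hidden))

lr = 1e-3
for step in range(20001):
    pred = X @ W1.T @ W2.T
    err = pred - Y
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    # Full-batch gradients of the mean squared error for each weight matrix.
    grad_W2 = (err.T @ (X @ W1.T)) / n_samples
    grad_W1 = (W2.T @ err.T @ X) / n_samples
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    if step % 2000 == 0:
        # Strength of each teacher mode in the composite map W2 @ W1:
        # modes with larger singular values are learned first (item 8).
        modes = np.diag(U.T @ W2 @ W1 @ V)
        print(f"step {step:6d}  loss {loss:9.4f}  top modes {np.round(modes[:4], 3)}")
```

The printed mode strengths rise one after another toward their target singular values (4.0, 2.0, 1.0, 0.5), and the loss shows the corresponding plateaus and drops.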
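A minimal sketch of the initialization scheme in items 19-25 and 28 (illustrative Python, not the speaker's code; the function names, sizes, and gain value are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(n_out, n_in, gain=1.1):
    """Random (semi-)orthogonal matrix: every nonzero singular value equals gain."""
    a = rng.standard_normal((max(n_out, n_in), min(n_out, n_in)))
    q, r = np.linalg.qr(a)            # q has orthonormal columns
    q = q * np.sign(np.diag(r))       # sign fix so q is uniformly distributed
    if n_out < n_in:                  # wide (non-square) case, as in item 28
        q = q.T
    return gain * q

def gaussian_init(n_out, n_in):
    """Variance-scaled Gaussian init: preserves norms only on average (item 21)."""
    return rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)

n = 256
sv_orth = np.linalg.svd(orthogonal_init(n, n, gain=1.1), compute_uv=False)
sv_gauss = np.linalg.svd(gaussian_init(n, n), compute_uv=False)
print("orthogonal  min/max singular values:", np.round([sv_orth[-1], sv_orth[0]], 3))
print("gaussian    min/max singular values:", np.round([sv_gauss[-1], sv_gauss[0]], 3))
```

All singular values of the scaled orthogonal matrix equal the gain exactly, while the scaled Gaussian matrix amplifies some directions and nearly annihilates others; a gain slightly above 1 (item 23) is meant to offset contractive nonlinearities when the same idea is carried over to nonlinear networks.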
Knowledge Vault built by David Vivancos 2024