Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Deep learning has made remarkable breakthroughs in the last 5 years, enabling capabilities far beyond what was envisioned in 2011.
2.-To push deep learning into products that affect people's lives, we need a theoretical understanding that ensures models are trustworthy, predictable, and safe.
3.-There's no consensus on what makes optimizing deep models hard. Hypotheses include difficulty finding global minima as network size increases.
4.-Another hypothesis is that at good local minima the Hessian has mostly near-zero eigenvalues, with only a few clearly positive or negative ones.
5.-Conventional wisdom holds that saddle points make convergence difficult, but with random initialization gradient methods almost never converge to them.
6.-Optimizing deep models isn't hard in practice: zero training error is easily achieved on MNIST and CIFAR-10 even without regularization.
7.-Test error diverging while training error stays near zero indicates overfitting, but in practice test classification error often plateaus at a reasonable level rather than diverging.
8.-A fundamental decomposition in ML: population risk = training error + generalization gap (written out after this list). Zero training error does not by itself imply overfitting.
9.-Bias-variance tradeoff: Increased model complexity reduces bias but increases variance. Deep learning operates in the "high variance" regime.
10.-Deep models with vastly more parameters than data points can fit arbitrary label patterns, even completely random labels; their capacity is very high (a toy demonstration follows this list).
11.-Regularization, early stopping, data augmentation, etc. help generalization but are not necessary for it; large unregularized nets can outperform regularized shallower nets.
12.-Overfitting is a concern even for simple models like linear regression when the number of parameters exceeds the number of data points.
13.-With more parameters than data points, linear regression has infinitely many global minima, all with the same Hessian, which has many zero eigenvalues.
14.-For linear regression, SGD initialized at zero converges to the minimum-norm solution among all of these global minima (see the sketch after this list).
15.-Minimum norm is a reasonable criterion for generalization because it picks the solution that exploits useful structure/regularity in the parameters.
16.-The kernel trick allows the exact minimum-norm solution of kernel regression to be computed on datasets like MNIST in a few minutes (see the sketch after this list).
17.-Kernel regression with no regularization or preprocessing gets 1.2% test error on MNIST, 0.6% error with a wavelet transform.
18.-On CIFAR-10, kernel regression on top of random convolutional features gets 16% test error, or 14% with some regularization.
19.-Shallow models like kernel machines can do surprisingly well just by interpolating the training data, questioning the need for depth.
20.-For linear models, the margin is the inverse of the parameter norm; maximum margin means the decision boundary is as far as possible from the nearest data points (see the derivation after this list).
21.-Minimum-norm (max-margin) solutions change very little under small perturbations, corresponding to "flat" optima; at sharp optima, small perturbations change the solution a lot.
22.-Some theory bounds test error in terms of the margin; the challenge is obtaining margin bounds that are practical for deep nets.
23.-Regularization makes the connection between optimization and generalization unclear. Analyzing optimization without regularization can provide clarity.
24.-Saddle points may not actually be a major problem for optimization, despite being an active research focus recently.
25.-Interpolating the training data doesn't necessarily lead to overfitting and poor generalization.
26.-Large margin classification is a promising framework for thinking about generalization in deep learning.
27.-Algorithmic stability likely leads to model stability and generalization: training algorithms whose output changes little when a single training example is replaced tend to produce models that generalize (a toy stability check follows this list).
28.-Well-established ideas in statistical learning theory can provide insight into deep learning and help demystify recent empirical observations.
29.-The theoretical community hopes that a better formal understanding of deep learning will enable it to be deployed safely and reliably.
30.-As deep learning increasingly impacts all of society, it is critical that the models be predictable, robust and trustworthy.
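The decomposition in point 8, written out explicitly. This uses standard learning-theory notation introduced here for illustration; the symbols are not from the original summary.

```latex
% Risk decomposition behind point 8 (standard notation, introduced for illustration).
% R(h)        : population risk of hypothesis h (expected error on new data)
% \hat{R}_n(h): empirical risk (training error) of h on n samples
\[
  R(h) \;=\; \underbrace{\hat{R}_n(h)}_{\text{training error}}
        \;+\; \underbrace{\bigl(R(h) - \hat{R}_n(h)\bigr)}_{\text{generalization gap}}
\]
% Zero training error is compatible with low population risk as long as the
% generalization gap stays small.
```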
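A minimal version of the capacity claim in point 10: an over-parameterized network can fit completely random labels. This is a sketch on random data, not the talk's MNIST/CIFAR-10 experiments; the data sizes, network width, and sklearn solver settings are illustrative assumptions.

```python
# Toy randomization test for point 10: fit pure-noise labels with an
# over-parameterized network (sizes and solver settings are illustrative).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))          # random inputs
y = rng.integers(0, 2, size=n)       # labels are pure noise

net = MLPClassifier(hidden_layer_sizes=(512,),  # ~11k parameters >> 200 examples
                    solver="lbfgs", alpha=0.0,  # no weight decay
                    max_iter=2000, random_state=0)
net.fit(X, y)
print("training accuracy on random labels:", net.score(X, y))  # typically 1.0
```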
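A sketch of points 13-14: over-parameterized least squares has infinitely many interpolating solutions, and gradient descent started from zero lands on the minimum-norm one. Plain full-batch gradient descent stands in for SGD here; the dimensions and step size are illustrative assumptions.

```python
# Points 13-14: gradient descent from zero on over-parameterized least squares
# interpolates the data and matches the closed-form minimum-norm solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                       # more parameters than data points
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                      # starting at zero keeps iterates in the row space of X
lr = 1e-3
for _ in range(50_000):              # gradient descent on 0.5 * ||Xw - y||^2
    w -= lr * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y   # closed-form minimum-norm interpolant
print("training residual:        ", np.linalg.norm(X @ w - y))        # ~0
print("distance to min-norm sol.:", np.linalg.norm(w - w_min_norm))   # ~0
```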
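A toy-scale sketch of point 16: "ridgeless" kernel regression computes the minimum-RKHS-norm interpolant by solving K a = y with no regularization term. The Laplacian kernel, the 1-D sine data, and the bandwidth are illustrative assumptions; per points 17-18, the same kind of solve at much larger scale is what produces the MNIST and CIFAR-10 numbers.

```python
# Point 16: kernel regression with no regularization exactly interpolates the
# training data (kernel, data, and bandwidth are illustrative assumptions).
import numpy as np

def laplacian_kernel(A, B, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * |A_i - B_j|) for 1-D inputs."""
    return np.exp(-gamma * np.abs(A[:, None] - B[None, :]))

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3, 3, size=80))
y_train = np.sin(2 * x_train)
x_test = rng.uniform(-3, 3, size=200)

K = laplacian_kernel(x_train, x_train)
alpha = np.linalg.solve(K, y_train)                 # exact interpolation, no ridge term
y_pred = laplacian_kernel(x_test, x_train) @ alpha  # predictions at unseen points

print("train residual:", np.linalg.norm(K @ alpha - y_train))                  # ~0
print("test RMSE:     ", np.sqrt(np.mean((y_pred - np.sin(2 * x_test))**2)))   # small
```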
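The margin/norm relationship behind point 20, in standard notation; the scaling convention and the bias-free setup below are assumptions made for the derivation, not stated in the summary.

```latex
% Margin/norm relationship for a linear classifier w (no bias term, for simplicity),
% with labels y_i in {-1, +1} and weights scaled so that min_i y_i w^T x_i = 1.
\[
  \gamma \;=\; \min_i \frac{y_i\, w^{\top} x_i}{\lVert w \rVert}
         \;=\; \frac{1}{\lVert w \rVert},
\]
% so among all separators under this scaling, minimizing ||w|| maximizes the
% margin: the minimum-norm solution places the decision boundary as far as
% possible from the nearest training points.
```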
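A toy illustration of the stability idea in point 27: retrain after replacing a single training example and check how little the predictions move. Ridge regression, the synthetic data, and the penalty are illustrative choices; the summary states the general principle, not this experiment.

```python
# Point 27: a replace-one-example stability check for a simple, stable learner.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: argmin ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_full = ridge_fit(X, y)

# Replace one training point with a fresh draw and refit.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)
y2[0] = X2[0] @ w_true + 0.1 * rng.normal()
w_swap = ridge_fit(X2, y2)

X_new = rng.normal(size=(1000, d))
change = np.max(np.abs(X_new @ (w_full - w_swap)))
print("max prediction change after swapping one example:", change)  # small => stable
```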
Knowledge Vault built by David Vivancos 2024