Knowledge Vault 2/32 - ICLR 2014-2023
Benjamin Recht ICLR 2017 - Invited Talk - What can deep learning learn from linear regression?

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
  classDef deeplearning fill:#f9d4d4, font-weight:bold, font-size:14px;
  classDef theory fill:#d4f9d4, font-weight:bold, font-size:14px;
  classDef optimization fill:#d4d4f9, font-weight:bold, font-size:14px;
  classDef generalization fill:#f9f9d4, font-weight:bold, font-size:14px;
  classDef kernels fill:#f9d4f9, font-weight:bold, font-size:14px;
  A[Benjamin Recht ICLR 2017] --> B[Deep learning: remarkable breakthroughs, capabilities. 1]
  A --> C[Trustable, predictable, safe models needed. 2]
  A --> D[No consensus on optimization difficulty causes. 3]
  A --> I[Population risk = training error + generalization gap. 8]
  A --> K[Bias-variance tradeoff: complexity reduces bias, increases variance. 9]
  A --> S[Kernel trick allows exact minimum norm computation. 16]
  D --> E[Good local minima have mostly near-zero eigenvalues. 4]
  D --> F[Saddle points difficult to converge to randomly. 5]
  D --> G[Zero training error easily achieved without regularization. 6]
  G --> H[Test error divergence indicates overfitting. 7]
  I --> J[Zero training error doesn't necessarily mean overfitting. 8]
  K --> L[Large unregularized nets can fit arbitrary labels. 10]
  L --> M[Capacity is very high. 10]
  K --> N[Regularization, early stopping help generalization, not necessary. 11]
  K --> O[Overfitting issue even for simple overparameterized models. 12]
  O --> P[Infinitely many global minima with zero eigenvalues. 13]
  S --> T[Kernel regression performs well on MNIST, CIFAR-10. 17, 18]
  S --> Q[SGD converges to minimum norm solution. 14]
  Q --> R[Minimum norm leverages useful parameter structure/regularity. 15]
  K --> U[Shallow models 19]
  K --> V[Margin is inverse of parameter norm. 20]
  V --> W[Minimum norm solutions change little with perturbations. 21]
  K --> X[Theory bounds test error in terms of margin. 22]
  K --> Y[Analyzing optimization without regularization provides clarity. 23]
  K --> Z[Saddle points may not be major problem. 24]
  K --> AA[Interpolation doesn't necessarily lead to poor generalization. 25]
  C --> AB[Large margin classification promising for deep learning generalization. 26]
  C --> AC[Algorithmic stability leads to model stability, generalization. 27]
  C --> AD[Statistical learning theory provides insight, demystifies observations. 28]
  C --> AE[Theoretical understanding enables safe, reliable deployment. 29]
  C --> AF[Predictable, robust, trustworthy models critical as impact grows. 30]
  class A,B deeplearning;
  class C,I,J,K,L,M,N,O,P,U,V,W,X,Y,Z,AA,AB,AC,AD,AE,AF theory;
  class D,E,F,G,H,Q,R optimization;
  class S,T kernels;

Resume:

1.-Deep learning has made remarkable breakthroughs in the last 5 years, enabling capabilities far beyond what was envisioned in 2011.

2.-To push deep learning into products that affect people's lives, we need a theoretical understanding that ensures models are trustworthy, predictable, and safe.

3.-There's no consensus on what makes optimizing deep models hard. Hypotheses include difficulty finding global minima as network size increases.

4.-Another hypothesis is that good local minima have Hessians whose eigenvalues are mostly near zero, with only a few that are substantially positive or negative.

5.-Conventional wisdom holds that saddle points make convergence difficult, but randomly initialized gradient methods almost never converge to them.

6.-Optimizing deep models isn't actually hard: zero training error can easily be achieved on MNIST and CIFAR-10 without any regularization.

7.-Test error diverging from near-zero train error indicates overfitting. But classification error can plateau at a reasonable level without diverging.

8.-Fundamental theory of ML: Population risk = Training error + Generalization gap. Zero training error doesn't necessarily mean overfitting.
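
In symbols (a standard identity, with $R(f)$ the population risk and $\hat{R}_n(f)$ the empirical risk on $n$ training points):

$$ R(f) \;=\; \underbrace{\hat{R}_n(f)}_{\text{training error}} \;+\; \underbrace{\big(R(f) - \hat{R}_n(f)\big)}_{\text{generalization gap}} $$

Interpolation (zero training error) is only harmful when the second term is large.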

9.-Bias-variance tradeoff: increased model complexity reduces bias but increases variance. By this classical picture, deep learning operates in the "high variance" regime.
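
For squared error this is the classical decomposition (with irreducible noise variance $\sigma^2$):

$$ \mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \mathrm{Bias}\big[\hat{f}(x)\big]^2 \;+\; \mathrm{Var}\big[\hat{f}(x)\big] \;+\; \sigma^2 $$

The puzzle the talk highlights is that heavily over-parameterized models should sit at the high-variance end of this curve, yet in practice they generalize.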

10.-Deep models with vastly more parameters than data points can fit arbitrary label patterns, even random noise. Capacity is very high.
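
A minimal sketch of this kind of experiment (not from the talk; assumes PyTorch, and the network size, learning rate, and data dimensions are illustrative): an over-parameterized network drives training error to zero on purely random labels.

```python
# Sketch: an over-parameterized MLP memorizing random labels on random inputs.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, classes = 200, 32, 10           # far fewer samples than parameters
X = torch.randn(n, d)                  # random inputs
y = torch.randint(0, classes, (n,))    # random labels: no signal to learn

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, classes))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):               # plain full-batch SGD, no regularization
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    train_err = (model(X).argmax(dim=1) != y).float().mean().item()
print(f"training error on random labels: {train_err:.3f}")  # typically reaches 0.0
```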

11.-Regularization, early stopping, data augmentation, etc. help generalization but aren't necessary conditions; large unregularized nets can still outperform regularized shallower nets.

12.-Overfitting is an issue even for simple models like linear regression when number of parameters exceeds number of data points.

13.-With more parameters than data points, linear regression has infinitely many global minima; they all share the same Hessian, which has many zero eigenvalues.
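
A small numpy illustration of points 12-13 (not from the talk; dimensions are illustrative): with d > n the least-squares Hessian X.T @ X has at least d - n zero eigenvalues, and adding any null-space direction to a solution gives another global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # over-parameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

H = X.T @ X                          # Hessian of 0.5 * ||X w - y||^2 (same at every w)
eigs = np.linalg.eigvalsh(H)
print("zero eigenvalues:", int(np.sum(eigs < 1e-8)), "of", d)   # at least d - n

# Two different global minima: the min-norm solution plus a null-space direction.
w_min = np.linalg.pinv(X) @ y
null_dir = np.linalg.svd(X)[2][-1]   # right-singular vector with zero singular value
w_other = w_min + 5.0 * null_dir
print(np.allclose(X @ w_min, y), np.allclose(X @ w_other, y))   # both interpolate
```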

14.-For linear regression, SGD initialized at zero converges to the minimum norm solution out of all the global minima.
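
A small numpy illustration of this claim (not from the talk; step size and dimensions are illustrative): SGD started at the origin stays in the row space of X, so once it interpolates the data it has found the minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                            # under-determined linear regression
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                           # initialize at the origin
lr = 0.01
for epoch in range(2000):
    for i in rng.permutation(n):          # one sample at a time
        grad = (X[i] @ w - y[i]) * X[i]   # every update is a multiple of a row of X
        w -= lr * grad

w_min_norm = np.linalg.pinv(X) @ y        # closed-form minimum-norm solution
print(np.linalg.norm(w - w_min_norm))     # ~0: SGD found the min-norm solution
```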

15.-Minimum norm is reasonable for generalization because it picks a solution that leverages useful structure/regularity in parameters.

16.-Kernel trick allows exact minimum norm solution to be computed for kernel regression on datasets like MNIST in a few minutes.
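
A small numpy sketch of the idea (not from the talk; the Gaussian kernel and bandwidth are illustrative choices): the kernel trick reduces minimum-norm interpolation to solving an n-by-n linear system.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix between the rows of A and B.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

rng = np.random.default_rng(2)
n, d = 300, 5
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                      # toy labels

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K, y)             # exact interpolation: K @ alpha = y
print(np.max(np.abs(K @ alpha - y)))      # ~0: zero training error

# Predictions at new points: f(x) = sum_i alpha_i k(x_i, x)
X_test = rng.standard_normal((10, d))
preds = gaussian_kernel(X_test, X) @ alpha
```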

17.-Kernel regression with no regularization or preprocessing gets 1.2% test error on MNIST, 0.6% error with a wavelet transform.

18.-On CIFAR-10, kernel regression on top of random convolutional features gets 16% test error, or 14% with some regularization.

19.-Shallow models like kernel machines can do surprisingly well just by interpolating the training data, questioning the need for depth.

20.-For linear models, the margin is the inverse of the parameter norm. Maximum margin means the decision boundary is as far as possible from the closest training points.
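
In symbols, for a linear classifier scaled so that $y_i \langle w, x_i \rangle \ge 1$ on the training set, the geometric margin is

$$ \gamma \;=\; \min_i \frac{y_i \langle w, x_i \rangle}{\lVert w \rVert} \;\ge\; \frac{1}{\lVert w \rVert}, $$

so minimizing $\lVert w \rVert$ subject to fitting the labels is the same as maximizing the margin.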

21.-Minimum norm (max margin) solutions change very little with small perturbations, corresponding to "flat" optima. Sharp optima change a lot.
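
One way to make this precise for linear models (a sketch, not from the talk): if the parameters are perturbed by $\delta$, then by Cauchy-Schwarz

$$ y_i \langle w + \delta, x_i \rangle \;\ge\; y_i \langle w, x_i \rangle \;-\; \lVert \delta \rVert\, \lVert x_i \rVert, $$

so a larger-margin solution tolerates a larger ball of parameter perturbations before any training point is misclassified, which is a linear-model reading of a "flat" optimum.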

22.-Some theory bounds test error in terms of margin. Challenge is getting practical margin bounds for deep nets.
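
A typical bound of this kind (stated loosely here; constants and log factors omitted) has the shape

$$ \Pr\big[\, y \langle w, x \rangle \le 0 \,\big] \;\lesssim\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big[\, y_i \langle w, x_i \rangle \le \gamma \,\big] \;+\; \frac{R \,\lVert w \rVert}{\gamma \sqrt{n}}, $$

where $R$ bounds $\lVert x_i \rVert$: test error is controlled by the fraction of points classified with margin below $\gamma$, plus a complexity term that shrinks as the margin grows.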

23.-Regularization makes the connection between optimization and generalization unclear. Analyzing optimization without regularization can provide clarity.

24.-Saddle points may not actually be a major problem for optimization, despite being an active research focus recently.

25.-Interpolating the training data doesn't necessarily lead to overfitting and poor generalization.

26.-Large margin classification is a promising framework for thinking about generalization in deep learning.

27.-Algorithmic stability likely leads to model stability and generalization: training algorithms whose output changes little when a single training point changes tend to produce models that generalize.

28.-Well-established ideas in statistical learning theory can provide insight into deep learning and help demystify recent empirical observations.

29.-The theoretical community hopes that a better formal understanding of deep learning will enable it to be deployed safely and reliably.

30.-As deep learning increasingly impacts all of society, it is critical that the models be predictable, robust and trustworthy.

Knowledge Vault built by David Vivancos 2024