Knowledge Vault 2/38 - ICLR 2014-2023
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals ICLR 2017 - UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
    classDef random fill:#f9d4d4, font-weight:bold, font-size:14px;
    classDef regularization fill:#d4f9d4, font-weight:bold, font-size:14px;
    classDef generalization fill:#d4d4f9, font-weight:bold, font-size:14px;
    classDef optimization fill:#f9f9d4, font-weight:bold, font-size:14px;
    classDef complexity fill:#f9d4f9, font-weight:bold, font-size:14px;
    A[Chiyuan Zhang et al ICLR 2017] --> B[DNNs fit random labels with zero error. 1]
    B --> C[Fitting noisy labels: easy optimization, small time increase. 2]
    A --> D[Traditional complexity measures fail to explain DNN generalization. 3]
    B --> E[DNNs perfectly fit random noise images. 4]
    B --> F[Label randomness: steady generalization error, easy optimization. 5]
    A --> G[Regularizers help but aren't necessary for generalization control. 6]
    B --> H[Various architectures fit randomized CIFAR10 with 100% accuracy. 7]
    B --> I[InceptionV3 >95% accuracy on randomized ImageNet without tuning. 8]
    B --> J[Label randomization: slower convergence, perfect fitting. 9]
    D --> K[Traditional theory can't distinguish good vs. bad generalization. 10]
    D --> L[Rademacher complexity ~1, trivial generalization bound. 11]
    D --> M[VC and fat-shattering dimension fail for DNNs. 12]
    D --> N[Uniform stability ignores data, can't explain DNN generalization. 13]
    G --> O[No regularization: DNNs still generalize well. 14]
    G --> P[Data augmentation > other regularizers, but not required. 15]
    G --> Q[Early stopping, BatchNorm help modestly. 16]
    D --> R[Expressivity results focus on entire domain. 17]
    D --> S[2-layer ReLU fits any sample labels. 18]
    D --> T[Linear models fit any labels if overparameterized. 19]
    T --> U[SGD solution in training data span. 20]
    T --> V['Kernel trick' fits any labels via Gram matrix. 21]
    T --> W[Minimum-norm linear models generalize without regularization. 22]
    W --> X[Kernel model regularization doesn't improve performance. 23]
    T --> Y[Minimum-norm intuition incomplete for complex models. 24]
    A --> Z[DNNs' effective capacity shatters training data. 25]
    Z --> AA[Traditional complexity measures inadequate for large DNNs. 26]
    C --> AB[Easy optimization continues even without generalization. 27]
    D --> AC[No 'simplicity' complexity measure yet for large DNNs. 28]
    F --> AD[Label randomness: steady generalization decline, easy optimization. 29]
    Z --> AE[Open questions on DNNs' effective capacity. 30]
    class B,E,F,H,I,J random;
    class G,O,P,Q,X regularization;
    class D,K,L,M,N,R,S,T,Y,AA,AC,AD,AE complexity;
    class C,AB,U,V,W optimization;

Resume:

1.-Deep neural networks can completely fit a random labeling of the training data, achieving zero training error (a minimal code sketch of this randomization test appears after point 30).

2.-Despite fitting noisy labels, optimization of neural networks remains easy - training time only increases by a small constant factor.

3.-Experiments show that traditional complexity measures like VC-dimension, Rademacher complexity, and uniform stability fail to explain neural network generalization.

4.-Replacing true images with random noise still allows neural networks to perfectly fit the training data.

5.-As the level of randomness in labels is increased, generalization error grows steadily, but optimization remains easy.

6.-Explicit regularizers like weight decay, dropout, and data augmentation help but are neither necessary nor sufficient for controlling generalization error; a sketch of these on/off ablations appears after point 30.

7.-Inception, AlexNet and MLPs can all fit a random labeling of CIFAR10 training data with 100% accuracy.

8.-On ImageNet with random labels, InceptionV3 still achieves over 95% top-1 training accuracy without hyperparameter tuning.

9.-With some label randomization, networks take longer to converge but still fit the corrupted training set perfectly.

10.-Traditional statistical learning theory is unable to distinguish between neural networks that generalize well and those that don't; points 11-13 give the specifics, and a compact version of the argument appears after point 30.

11.-Rademacher complexity of neural networks is close to 1, providing a trivial bound insufficient to explain generalization.

12.-VC-dimension and fat-shattering dimension bounds for neural networks are very large and also fail to explain generalization in practice.

13.-Uniform stability of the training algorithm does not take the data or label distribution into account and so cannot explain neural network generalization.

14.-With regularization turned off, neural networks still generalize well, suggesting regularizers are not fundamental to controlling generalization error.

15.-Data augmentation improves generalization more than other regularization techniques, but models perform well even without any regularization.

16.-Early stopping can improve generalization but is not always helpful. Batch normalization stabilizes training and improves generalization modestly.

17.-Expressivity results for neural networks focus on functions over the entire input domain rather than the finite samples used in practice.

18.-A simple 2-layer ReLU network with 2n+d weights can fit any labeling of any sample of size n in d dimensions; the construction is sketched after point 30.

19.-Linear models can fit any labels exactly if the number of parameters exceeds the number of data points, even without regularization.

20.-Stochastic gradient descent converges to a solution that lies in the span of the training data points (a short derivation appears after point 30).

21.-The "kernel trick" allows linear models to fit any labels by using a Gram matrix of dot products between data points; a numerical sketch covering points 21-23 appears after point 30.

22.-Fitting training labels exactly with minimum-norm linear models yields good test performance on MNIST and CIFAR10 without regularization.

23.-Adding regularization to kernel models does not improve performance, showing good generalization is possible without explicit regularization.

24.-The minimum-norm intuition from linear models provides some insight but does not fully predict generalization in more complex models.

25.-The effective capacity of successful neural networks is large enough to shatter the training data and fit random labels.

26.-Traditional measures of model complexity are inadequate to explain the generalization ability of large neural networks.

27.-Optimization continues to be easy empirically even when the model is not generalizing, showing that ease of optimization is not the cause of generalization.

28.-The authors argue we have not yet discovered a formal complexity measure under which large neural networks are effectively "simple."

29.-Increasing randomness in labels causes a steady increase in generalization error while optimization remains easy.

30.-The experiments show there are still open questions about what precisely constitutes the effective capacity of neural networks.
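
A minimal sketch of the randomization test behind points 1, 2, 4, 5, and 9, assuming PyTorch is available; the Gaussian "images", the random linear teacher, the small MLP, and the corrupt_fraction knob are illustrative stand-ins, not the paper's CIFAR10/ImageNet setup.

```python
# Label-randomization test (points 1, 2, 4, 5, 9); assumes PyTorch.
# Gaussian inputs and a random linear "teacher" stand in for real images and true labels.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, num_classes = 1024, 3 * 32 * 32, 10
corrupt_fraction = 1.0                      # 0.0 = clean labels ... 1.0 = fully random labels

X = torch.randn(n, d)                       # random-noise "images" (point 4)
teacher = torch.randn(d, num_classes)
y = (X @ teacher).argmax(dim=1)             # stand-in for the true labels
mask = torch.rand(n) < corrupt_fraction
y[mask] = torch.randint(num_classes, (int(mask.sum()),))   # corrupt a fraction of labels (point 5)

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, num_classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    logits = model(X)
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()
    if step % 200 == 0:
        train_err = (logits.argmax(dim=1) != y).float().mean().item()
        print(f"step {step:4d}  loss {loss.item():.4f}  train error {train_err:.3f}")

# With enough parameters and steps the training error reaches zero even at
# corrupt_fraction = 1.0, while accuracy on fresh data is at chance level: the
# architecture, optimizer, and hyperparameters are unchanged, only the labels are.
```

Sweeping corrupt_fraction from 0 to 1 (with held-out, teacher-labeled data for evaluation) is the analogue of the interpolation experiment in points 5 and 29, and timing the loop at each level is the analogue of the small slowdown noted in points 2 and 9.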
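
A sketch of the explicit-regularizer ablation behind points 6, 14, and 15, with assumed PyTorch/torchvision knobs; the architecture, transforms, and hyperparameter values are placeholders rather than the paper's training scripts.

```python
# Regularizer on/off switches for the ablation in points 6, 14, 15 (assumed PyTorch knobs).
import torch
import torch.nn as nn
import torchvision.transforms as T

def make_setup(weight_decay: float = 0.0, dropout_p: float = 0.0, augment: bool = False):
    """Return (input transform, model, optimizer) with the chosen explicit regularizers."""
    transform = T.Compose(
        ([T.RandomCrop(32, padding=4), T.RandomHorizontalFlip()] if augment else [])
        + [T.ToTensor()]
    )
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(3 * 32 * 32, 512), nn.ReLU(), nn.Dropout(dropout_p),
        nn.Linear(512, 10),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                          weight_decay=weight_decay)
    return transform, model, opt

no_reg  = make_setup()                                                # point 14: everything off
all_reg = make_setup(weight_decay=5e-4, dropout_p=0.5, augment=True)  # point 15: augmentation tends to help most
```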
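
A compact version of the argument behind points 10-13, written in LaTeX (assuming amsmath/amssymb) for a binary ±1 setting, with constants omitted and the bounds quoted only in generic form.

```latex
% Empirical Rademacher complexity (point 11), with i.i.d. random signs \sigma_i \in \{-1,+1\}:
\[
\hat{\mathfrak{R}}_n(\mathcal{H})
   \;=\; \mathbb{E}_{\sigma}\Big[\,\sup_{h\in\mathcal{H}}\ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\, h(x_i)\Big].
\]
% If the class fits every sign pattern on x_1,\dots,x_n exactly (which is what the
% randomization tests show), then \hat{\mathfrak{R}}_n(\mathcal{H}) = 1, and the generic bound
\[
\text{test error} \;\le\; \text{training error} \;+\; 2\,\hat{\mathfrak{R}}_n(\mathcal{H})
   \;+\; O\big(\sqrt{\log(1/\delta)/n}\,\big)
\]
% is vacuous.
%
% VC and fat-shattering dimension (point 12): fitting arbitrary labels means the sample is
% shattered, so \mathrm{VCdim}(\mathcal{H}) \ge n, and any bound of the form
\[
\text{generalization gap} \;\le\; O\big(\sqrt{\mathrm{VCdim}(\mathcal{H})/n}\,\big)
\]
% is \Omega(1), i.e. uninformative at this sample size.
%
% Uniform stability (point 13): an algorithm A is \beta-uniformly stable if replacing any
% single training example changes the loss at every point z by at most \beta,
\[
\sup_{z}\ \big|\,\ell(A_S, z) - \ell(A_{S^{i}}, z)\,\big| \;\le\; \beta .
\]
% \beta is a property of the algorithm alone and ignores the labels, so it cannot separate
% training on true labels (generalizes) from training on random labels (does not) with the
% same algorithm.
```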
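
The finite-sample expressivity construction behind point 18, sketched in LaTeX; the interleaving of the biases is the key step.

```latex
% Two-layer ReLU network with 2n + d weights (point 18):
% a \in \mathbb{R}^d (d weights), b \in \mathbb{R}^n (n biases), w \in \mathbb{R}^n (n output weights).
\[
c(x) \;=\; \sum_{j=1}^{n} w_j \,\max\big(\langle a, x\rangle - b_j,\ 0\big).
\]
% Choose a so that the projections z_i = \langle a, x_i\rangle are all distinct (possible for
% distinct x_i), relabel the points so the z_i are increasing, and interleave the biases:
\[
b_1 < z_1 < b_2 < z_2 < \cdots < b_n < z_n .
\]
% Then the matrix A_{ij} = \max(z_i - b_j, 0) is lower triangular with a positive diagonal,
% hence invertible, so w = A^{-1} y gives c(x_i) = y_i for any choice of labels y_1,\dots,y_n.
```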
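
A short derivation of point 20 for a linear model trained by SGD from w_0 = 0, which is the setting in which points 21 and 22 follow.

```latex
% SGD on a linear model f(x) = \langle w, x\rangle with loss \sum_i \ell(\langle w, x_i\rangle, y_i),
% starting from w_0 = 0 (point 20):
\[
w_{t+1} \;=\; w_t \;-\; \eta_t\,\ell'\big(\langle w_t, x_{i_t}\rangle,\, y_{i_t}\big)\, x_{i_t}.
\]
% Each update adds a multiple of one training point, so by induction the iterates stay in
% the span of the data:
\[
w_T \;=\; \sum_{i=1}^{n} \alpha_i x_i \;=\; X^{\top}\alpha \qquad \text{for some } \alpha \in \mathbb{R}^n.
\]
% If the final model also interpolates, X w_T = y, then X X^{\top}\alpha = K\alpha = y with K the
% Gram matrix; when K is invertible this is exactly the minimum-\ell_2-norm interpolant of
% points 21 and 22.
```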
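
A minimal numerical sketch of points 21-23 on synthetic data, assuming only numpy; the sizes and random labels are illustrative, while the paper applies the same idea to MNIST and CIFAR10.

```python
# Minimum-norm interpolation via the Gram matrix (points 21-23); synthetic data, numpy only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 500                                  # overparameterized: d > n (point 19)
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n) * 2.0 - 1.0       # arbitrary +/-1 labels

K = X @ X.T                                      # Gram matrix of pairwise dot products
alpha = np.linalg.solve(K, y)                    # "kernel trick": fit the labels exactly, K alpha = y
w = X.T @ alpha                                  # minimum-norm w with X w = y, no regularization

print("max |Xw - y| =", np.abs(X @ w - y).max()) # ~0: any labeling is fit exactly

def predict(X_new: np.ndarray) -> np.ndarray:
    """Kernel form of the predictor: only dot products with training points are needed."""
    return X_new @ X.T @ alpha

# Ridge variant used for the comparison in point 23: alpha = np.linalg.solve(K + lam * np.eye(n), y).
# The paper's observation is that adding this explicit penalty does not improve the test
# performance of the minimum-norm fit on real image data.
```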

Knowledge Vault built by David Vivancos 2024