Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals ICLR 2017 - UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION

**Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:**

```mermaid
graph LR
classDef random fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef regularization fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef generalization fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef optimization fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef complexity fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Chiyuan Zhang et al<br>ICLR 2017] --> B[DNNs fit random labels<br>with zero error. 1]
B --> C[Fitting noisy labels: easy<br>optimization, small time increase. 2]
A --> D[Traditional complexity measures fail<br>to explain DNN generalization. 3]
B --> E[DNNs perfectly fit<br>random noise images. 4]
B --> F[Label randomness: steady<br>generalization error, easy optimization. 5]
A --> G[Regularizers help but aren't<br>necessary for generalization control. 6]
B --> H[Various architectures fit randomized<br>CIFAR10 with 100% accuracy. 7]
B --> I[InceptionV3 >95% accuracy on<br>randomized ImageNet without tuning. 8]
B --> J[Label randomization: slower<br>convergence, perfect fitting. 9]
D --> K[Traditional theory can't distinguish<br>good vs. bad generalization. 10]
D --> L[Rademacher complexity 1,<br>trivial generalization bound. 11]
D --> M[VC and fat-shattering<br>dimension fail for DNNs. 12]
D --> N[Uniform stability ignores data,<br>can't explain DNN generalization. 13]
G --> O[No regularization: DNNs<br>still generalize well. 14]
G --> P[Data augmentation > other<br>regularizers, but not required. 15]
G --> Q[Early stopping, BatchNorm<br>help modestly. 16]
D --> R[Expressivity results focus<br>on entire domain. 17]
D --> S[2-layer ReLU fits<br>any sample labels. 18]
D --> T[Linear models fit<br>any labels if overparameterized. 19]
T --> U[SGD solution in<br>training data span. 20]
T --> V['Kernel trick' fits<br>any labels via Gram matrix. 21]
T --> W[Minimum-norm linear models<br>generalize without regularization. 22]
W --> X[Kernel model regularization<br>doesn't improve performance. 23]
T --> Y[Minimum-norm intuition incomplete<br>for complex models. 24]
A --> Z[DNNs' effective capacity<br>shatters training data. 25]
Z --> AA[Traditional complexity measures<br>inadequate for large DNNs. 26]
C --> AB[Easy optimization continues<br>even without generalization. 27]
D --> AC[No 'simplicity' complexity<br>measure yet for large DNNs. 28]
F --> AD[Label randomness: steady<br>generalization decline, easy optimization. 29]
Z --> AE[Open questions on<br>DNNs' effective capacity. 30]
class B,E,F,H,I,J random;
class G,O,P,Q,X regularization;
class D,K,L,M,N,R,S,T,Y,AA,AC,AD,AE complexity;
class C,AB,U,V,W optimization;
```

**Resume:**

**1.-**Deep neural networks can completely fit a random labeling of the training data, achieving zero training error.

**2.-**Despite fitting noisy labels, optimization of neural networks remains easy: training time increases only by a small constant factor.

**3.-**Experiments show that traditional complexity measures like VC-dimension, Rademacher complexity, and uniform stability fail to explain neural network generalization.

**4.-**Replacing true images with random noise still allows neural networks to perfectly fit the training data.

**5.-**As the level of randomness in labels is increased, generalization error grows steadily, but optimization remains easy.
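The corruption knob behind items 1, 5, and 9 can be sketched in a few lines of NumPy: each label is independently replaced, with probability p, by a uniformly random class (so a fraction of "corrupted" labels coincide with the original by chance). Function and variable names here are illustrative, not from the paper's code.

```python
import numpy as np

def corrupt_labels(y, p, num_classes, rng):
    """With probability p, replace each label with a uniformly random class."""
    y = y.copy()
    mask = rng.random(len(y)) < p
    y[mask] = rng.integers(num_classes, size=mask.sum())
    return y

rng = np.random.default_rng(0)
y = rng.integers(10, size=1000)                        # CIFAR10-style labels
y_noisy = corrupt_labels(y, p=0.5, num_classes=10, rng=rng)
frac_changed = np.mean(y != y_noisy)                   # roughly p * (1 - 1/num_classes)
```

Sweeping p from 0 (original labels) to 1 (fully random labels) interpolates between the generalizing and memorizing regimes described above.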

**6.-**Explicit regularizers like weight decay, dropout, and data augmentation help but are not necessary or sufficient for controlling generalization error.

**7.-**Inception, AlexNet and MLPs can all fit a random labeling of CIFAR10 training data with 100% accuracy.

**8.-**On ImageNet with random labels, InceptionV3 still achieves over 95% top-1 training accuracy without hyperparameter tuning.

**9.-**With some label randomization, networks take longer to converge but still fit the corrupted training set perfectly.

**10.-**Traditional statistical learning theory is unable to distinguish between neural networks that generalize well and those that don't.

**11.-**Rademacher complexity of neural networks is close to 1, providing a trivial bound insufficient to explain generalization.

**12.-**VC-dimension and fat-shattering dimension bounds for neural networks are very large and also fail to explain generalization in practice.

**13.-**Uniform stability of the training algorithm does not take the data or label distribution into account and so cannot explain neural network generalization.

**14.-**With regularization turned off, neural networks still generalize well, suggesting regularizers are not fundamental to controlling generalization error.

**15.-**Data augmentation improves generalization more than other regularization techniques, but models perform well even without any regularization.

**16.-**Early stopping can improve generalization but is not always helpful. Batch normalization stabilizes training and improves generalization modestly.

**17.-**Expressivity results for neural networks focus on functions over the entire domain rather than finite samples used in practice.

**18.-**A simple 2-layer ReLU network with 2n+d weights can fit any labeling of any sample of size n in d dimensions.
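The paper's construction behind this claim can be sketched numerically: project the n points onto a random direction a (d weights), place n breakpoints b between the sorted projections (n weights), and solve the resulting lower-triangular ReLU system for the output weights w (n weights), for 2n+d in total. Variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                      # arbitrary target labels

a = rng.normal(size=d)
z = X @ a                                   # distinct projections with probability 1
order = np.argsort(z)
z, y = z[order], y[order]
b = np.concatenate([[z[0] - 1.0], (z[:-1] + z[1:]) / 2])  # breakpoints interleave the z_i

H = np.maximum(z[:, None] - b[None, :], 0.0)  # n x n, lower-triangular, positive diagonal
w = np.linalg.solve(H, y)                     # invertible, so an exact fit exists
```

Because H is triangular with a positive diagonal, it is invertible for any labeling, which is exactly why finite-sample expressivity is so cheap compared to whole-domain expressivity (item 17).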

**19.-**Linear models can fit any labels exactly if the number of parameters exceeds the number of data points, even without regularization.
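A minimal demonstration: with more parameters than data points, an underdetermined least-squares problem interpolates arbitrary labels exactly, and NumPy's `lstsq` returns the minimum-norm such solution (anticipating item 22).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 100                              # more parameters than data points
X = rng.normal(size=(n, p))
y = rng.choice([-1.0, 1.0], size=n)         # arbitrary (here random) labels

# Minimum-norm interpolating solution of the underdetermined system X w = y
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Training error is exactly zero regardless of how the labels were chosen, so capacity alone cannot distinguish good from bad generalization here either.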

**20.-**Stochastic gradient descent converges to a solution that lies in the span of the training data points.
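This span property is easy to verify numerically: starting from the origin, every SGD update on a squared loss adds a multiple of one data point, so the iterate never leaves the span of the training data. A sketch (learning rate and step count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

w = np.zeros(p)                             # start at the origin
lr = 0.01
for _ in range(2000):
    i = rng.integers(n)                     # sample one example (SGD)
    w -= lr * (X[i] @ w - y[i]) * X[i]      # gradient step adds a multiple of X[i]

# Projector onto the span of the training points; w should be its own projection
P = X.T @ np.linalg.pinv(X @ X.T) @ X
```

This is what connects SGD on linear models to the kernel solution in the next two items: the solution can be written as w = X^T alpha.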

**21.-**The "kernel trick" allows linear models to fit any labels by using a Gram matrix of dot products between data points.

**22.-**Fitting training labels exactly with minimum-norm linear models yields good test performance on MNIST and CIFAR10 without regularization.

**23.-**Adding regularization to kernel models does not improve performance, showing good generalization is possible without explicit regularization.

**24.-**The minimum-norm intuition from linear models provides some insight but does not fully predict generalization in more complex models.

**25.-**The effective capacity of successful neural networks is large enough to shatter the training data and fit random labels.

**26.-**Traditional measures of model complexity are inadequate to explain the generalization ability of large neural networks.

**27.-**Optimization continues to be easy empirically even if the model is not generalizing, showing ease of optimization is not the cause of generalization.

**28.-**The authors argue we have not yet discovered a formal complexity measure under which large neural networks are effectively "simple."

**29.-**Increasing randomness in labels causes a steady increase in generalization error while optimization remains easy.

**30.-**The experiments show there are still open questions about what precisely constitutes the effective capacity of neural networks.

Knowledge Vault built by David Vivancos 2024