Jonathan Frankle · Michael Carbin ICLR 2019 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

**Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Adv | Llama 3:**

graph LR
classDef lottery fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef pruning fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef tickets fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef initialization fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef networks fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef results fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef future fill:#f9d4d4, font-weight:bold, font-size:14px;
A[J. Frankle & M. Carbin ICLR 2019] --> B[Lottery ticket hypothesis: sparse trainable subnetworks 1]
A --> C[Iterative pruning: train, prune, reset, sparsify 2]
C --> D[One-shot pruning finds tickets, not as small 5]
A --> E[Winning tickets: sparse, faster, more accurate subnetworks 3]
E --> F[Random reinitialization degrades winning ticket performance 4]
C --> G[Convolutional nets: 10-20% size, faster, more accurate 6]
G --> H[Pruning + dropout: greater accuracy improvements 7]
C --> I[Deeper nets need lower learning rates, warmup 8]
E --> J[Winning tickets: higher accuracy, faster learning, better generalization 9]
E --> K[Winning tickets match accuracy at 10-20% size 10]
F --> L[Original initialization critical, not just architecture 11]
E --> M[Winning tickets have smaller generalization gap 12]
C --> N[Resetting weights each round outperforms continuous training 13]
C --> O[Early stopping iteration proxies learning speed 14]
C --> P[Adam, SGD, momentum work at various learning rates 15]
C --> Q[Slower pruning 20% vs 60% leads to smaller tickets 16]
C --> R[Convolutional nets: prune FC layers faster than conv 17]
C --> S[Winning tickets found across Gaussian initialization scales 18]
C --> T[Larger LeNets yield higher accuracy winning tickets 19]
C --> U[Winning tickets found with and without dropout 20]
C --> V[Pruning conv and FC layers together most effective 21]
E --> W[Winning tickets have shifted, bimodal weight initialization 22]
E --> X[Winning ticket units have similar input, varying output 23]
E --> Y[Winning tickets robust to noise in initializations 24]
E --> Z[Winning ticket weights move further from initializations 25]
C --> AA[Global pruning beats layer-wise in deep nets 26]
I --> AB[Warmup enables winning tickets at higher learning rates 27]
AB --> AC[5-20k warmup iterations work best for VGG, ResNet 28]
A --> AD[Insights into overparameterization, optimization of neural nets 29]
A --> AE[Future work: improve training, networks, theory with tickets 30]
class A,AD,AE future;
class B lottery;
class C,D,N,O,P,Q,R,S,T,U,V,AA,I,AB,AC pruning;
class E,F,J,K,L,M,W,X,Y,Z tickets;
class G,H networks;
class L,S initialization;


**Resume:**

**1.-**The lottery ticket hypothesis proposes that dense neural networks contain sparse subnetworks that can be trained in isolation to full accuracy.

**2.-**Iterative pruning involves training, pruning, and resetting a network over several rounds, gradually sparsifying it while attempting to maintain accuracy.
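
As a rough illustration of that loop, here is a minimal sketch assuming a PyTorch model; `train_fn`, the per-layer magnitude criterion, and the 20% per-round rate are illustrative placeholders rather than the paper's exact configuration.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    """Sketch of the train -> prune -> reset loop.

    train_fn(model) stands in for a full training run; masks are 0/1 tensors,
    one per weight matrix, and pruning is by magnitude within each layer.
    A full implementation would also apply the masks during training.
    """
    init_state = copy.deepcopy(model.state_dict())           # theta_0, saved before any training
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters()
             if p.dim() > 1}                                  # prune weight matrices only

    for _ in range(rounds):
        train_fn(model)                                       # train to completion
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name not in masks:
                    continue
                alive = param[masks[name].bool()].abs()
                threshold = alive.quantile(prune_frac)        # cut the lowest-magnitude prune_frac
                masks[name] = (param.abs() > threshold).float() * masks[name]
            # reset surviving weights to their original initialization
            model.load_state_dict(init_state)
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])                   # zero out pruned connections
    return model, masks
```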

**3.-**Winning tickets are sparse subnetworks that train faster and reach higher accuracy than the original network when reset and trained in isolation.

**4.-**Randomly reinitializing winning tickets degrades their performance, showing the importance of a fortuitous initialization for enabling effective training of sparse networks.
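
The control experiment amounts to keeping the mask but redrawing the weights. A rough sketch, assuming PyTorch and a `masks` dict like the one in the pruning sketch above; the initializer here is a stand-in for whichever scheme the original network used.

```python
import torch

def random_reinit(model, masks, init_fn=torch.nn.init.xavier_normal_):
    """Control sketch: keep the winning-ticket sparsity pattern (masks)
    but replace the surviving weights with a fresh random draw."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                init_fn(param)              # re-draw this layer from scratch
                param.mul_(masks[name])     # re-apply the sparsity pattern
```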

**5.-**One-shot pruning, where a network is pruned just once after training, can find winning tickets, but they are not as small as those found by iterative pruning.

**6.-**In convolutional networks, iterative pruning finds winning tickets 10-20% the size of the original network, showing dramatically improved accuracy and training speed.

**7.-**Training the pruned networks with dropout leads to even greater accuracy improvements, suggesting pruning and dropout have complementary regularizing effects.

**8.-**On deeper networks like VGG-19 and ResNet-18, iterative pruning requires lower learning rates or learning rate warmup to find winning tickets.

**9.-**Winning tickets reach higher test accuracy at smaller sizes and learn faster than the original networks across fully-connected and convolutional architectures.

**10.-**Winning tickets found through iterative pruning match the accuracy of the original network at 10-20% of the size on the architectures tested.

**11.-**Winning tickets that are randomly reinitialized perform significantly worse, indicating the importance of the original initialization rather than just the architecture.

**12.-**The gap between training and test accuracy is smaller for winning tickets, suggesting they generalize better than the original overparameterized networks.

**13.-**Different iterative pruning strategies were evaluated, with resetting network weights each round performing better than continuing training without resetting weights.

**14.-**The iteration at which early stopping occurs on the validation set is used as a proxy metric for the speed of learning.
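
A small sketch of one way to read that proxy off a recorded validation-loss curve; the patience rule and list-based interface are assumptions for illustration, not the paper's exact procedure.

```python
def early_stop_iteration(val_losses, patience=5):
    """Return the index of the best validation loss, stopping once it has not
    improved for `patience` evaluations; earlier stopping = faster learning."""
    best_iter, best_loss, since_best = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iter, best_loss, since_best = i, loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_iter

# Example: the loss bottoms out at index 3 and then stagnates.
print(early_stop_iteration([0.9, 0.6, 0.5, 0.45, 0.46, 0.47, 0.47, 0.48, 0.5]))  # -> 3
```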

**15.-**Adam, SGD, and SGD with momentum optimizers were tested at various learning rates, all yielding winning tickets with iterative pruning.

**16.-**Slower pruning rates (e.g., removing 20% of weights per iteration rather than 60%) lead to smaller winning tickets that maintain performance.
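
For intuition on why the per-round rate matters, the fraction of weights remaining after k rounds at rate p is (1 - p)^k; a tiny sketch:

```python
def remaining_fraction(p, k):
    """Fraction of weights left after k pruning rounds removing p per round."""
    return (1.0 - p) ** k

# 20% per round takes ~10 rounds to reach ~10% remaining, giving the network
# many chances to recover; 60% per round gets below that after only 3 rounds.
print(remaining_fraction(0.2, 10))   # ~0.107
print(remaining_fraction(0.6, 3))    # 0.064
```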

**17.-**Different layer-wise pruning rates were compared for convolutional networks, with fully-connected layers pruned faster than convolutional layers for best results.

**18.-**Gaussian initializations with different standard deviations were tested; winning tickets were found in all cases with iterative pruning.

**19.-**Larger LeNet networks yielded winning tickets that reached higher accuracy, but relative performance was similar across differently sized LeNets.

**20.-**Winning tickets were found when training with and without dropout, though the presence of dropout affected learning speed in the unpruned networks.

**21.-**Pruning just convolutional or fully-connected layers alone was less effective than pruning both for reaching small winning ticket sizes.

**22.-**Winning ticket initializations form bimodal distributions shifted away from zero as networks are pruned, unlike the original Gaussian initializations.

**23.-**Units in winning tickets have similar levels of incoming connectivity after pruning, while some units retain far more outgoing connectivity.

**24.-**Adding Gaussian noise to winning ticket initializations only gradually degrades accuracy, showing robustness to perturbations in their initial weight values.

**25.-**Winning ticket weights consistently move further from their initializations compared to weights pruned early, suggesting pruning finds fortuitous initialization trajectories.
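
A sketch of how that comparison might be computed, assuming same-shaped tensors for a layer's initial weights, final weights, and its 0/1 winning-ticket mask:

```python
import torch

def movement_by_group(w_init, w_final, mask):
    """Mean |w_final - w_init| for weights kept by the mask vs. weights pruned away."""
    delta = (w_final - w_init).abs()
    kept = delta[mask.bool()].mean().item()
    pruned = delta[~mask.bool()].mean().item()
    return kept, pruned
```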

**26.-**Globally pruning across all layers performs better than layer-wise pruning for finding small winning tickets in very deep networks (VGG-19, ResNet-18).
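
A minimal sketch of the global variant, assuming a dict mapping layer names to weight tensors; a single threshold is computed across all layers, so layers whose weights are mostly small lose proportionally more than under per-layer pruning.

```python
import torch

def global_magnitude_masks(weights, prune_frac=0.2):
    """Global pruning sketch: one magnitude threshold shared by every layer."""
    all_w = torch.cat([w.detach().abs().flatten() for w in weights.values()])
    k = max(1, int(prune_frac * all_w.numel()))
    threshold = all_w.kthvalue(k).values           # magnitude of the k-th smallest weight
    return {name: (w.detach().abs() > threshold).float()
            for name, w in weights.items()}
```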

**27.-**Learning rate warmup enables finding winning tickets at larger learning rates in deep networks when standard iterative pruning struggles.

**28.-**Across different warmup durations, 5k-20k iterations of warmup improved results, with 20k working best for ResNet-18 and 10k for VGG-19.
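
A sketch of one simple warmup form (a linear ramp); the default duration here is only a hypothetical value within the reported range:

```python
def lr_with_warmup(step, base_lr, warmup_steps=10_000):
    """Linear warmup: ramp the learning rate from ~0 to base_lr over
    warmup_steps iterations, then hold it constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```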

**29.-**The lottery ticket hypothesis may provide insight into the role of overparameterization and the optimization of neural networks.

**30.-**Future work aims to leverage winning tickets to improve training performance, design better networks, and advance theoretical understanding of neural networks.

Knowledge Vault built by David Vivancos 2024