Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:
Resume:
1.- ResNet won all 5 main tracks of the 2015 ImageNet & COCO competitions, often by a large margin
2.- ImageNet benchmark shows increasing network depth over time, from non-deep methods to 150+ layer networks
3.- Increasing depth has greatly improved results on tasks like PASCAL VOC object detection
4.- AlexNet (8 layers) was state-of-the-art in 2012, VGGNet/GoogLeNet (19/22 layers) in 2014, ResNet (150+ layers) in 2015
5.- Simply stacking more layers does not guarantee better performance
6.- Experiments show deeper plain networks can have higher training and test error than shallower networks
7.- Intuition: A deeper model's solution space contains the shallower model's (the extra layers can simply be identity), so it shouldn't have higher training error
8.- Hypothesis: Current solvers (SGD, backprop) have optimization difficulties for very deep networks
9.- ResNet solution: Have layers learn residual functions F(x) with reference to the layer input x, added back through identity skip connections (see the residual-block sketch after this list)
10.- Hypothesis: If identity is optimal, it's easier to drive the residual weights to 0 than to fit identity with stacked nonlinear layers, and easier to learn small fluctuations around identity (output y = F(x) + x)
11.- ResNet design: Similar to VGG - stacks of 3x3 conv layers, doubling the filters when halving the spatial size; the plain net becomes a ResNet by adding identity skip connections
12.- Results on CIFAR-10: Plain nets' error increases with depth, ResNets' error decreases even past 100 layers
13.- ImageNet: 34-layer ResNet outperforms 18-layer one, error decreases up to 152 layers while keeping lower complexity than VGG
14.- Hypothesis: Expressiveness of deeper models means fewer filters needed, allowing deeper ResNets with low complexity
15.- ResNets are useful as feature extractors for other vision tasks beyond classification (see the feature-extraction sketch after this list)
16.- ResNet-101 features gave a 28% relative gain over VGG-16 features for COCO object detection
17.- COCO object detection: 80-category detector trained on ResNet-101 features detects many object classes in images/video
18.- ResNets lead on many benchmarks - PASCAL VOC, the VQA challenge, human pose estimation, depth estimation, segment proposals
19.- ResNets also used beyond vision - image generation, NLP, speech recognition, computational advertising
20.- Central idea is going deeper by making it easier to train very deep nets
21.- Conclusions: ResNets are easy to train, gain accuracy from depth, and provide good transferable features
22.- Follow-up work: 200-layer ImageNet, 1000-layer CIFAR-10 ResNets
23.- Released pretrained ImageNet models in Caffe; Facebook released Torch training code. Many 3rd-party implementations are available
24.- Author doesn't expect million-layer networks by the next CVPR
25.- Depth is one dimension of network design space to explore, along with width etc.
26.- Going deeper is not always the most economical choice for a given computational budget
27.- ResNets make very deep nets trainable, but depth still needs to be balanced against other design factors
28.- Deeper models are more expressive, so they can potentially use fewer filters
29.- Simply replacing VGG-16 with ResNet-101 gave large object detection gains, showing feature transferability
30.- ResNets are state-of-the-art across many vision benchmarks and have applications beyond vision as well
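Residual-block sketch (points 9-11): a minimal, illustrative PyTorch version of a basic (non-bottleneck) residual block, assuming batch-normalized 3x3 convolutions; class and variable names are hypothetical and this is not the authors' released Caffe/Torch code.

```python
# Minimal sketch of a basic residual block (points 9-11), assuming PyTorch.
# Illustrative only; not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers learn a residual F(x); the output is F(x) + x."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # When halving spatial size and doubling filters (point 11), the skip
        # path needs a 1x1 projection so shapes match; otherwise it is identity.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + self.shortcut(x))  # y = F(x) + x


# Usage: a block that halves spatial size (stride 2) while doubling filters.
block = BasicResidualBlock(in_channels=64, out_channels=128, stride=2)
y = block(torch.randn(1, 64, 32, 32))  # -> shape (1, 128, 16, 16)
```

If the convolutional branch learns weights near zero, the block reduces to roughly an identity mapping, which is the talk's argument for why adding such blocks shouldn't hurt training error.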
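Feature-extraction sketch (points 15 and 29): a minimal example of reusing a pretrained ResNet-101 as a frozen backbone. It assumes a recent torchvision (the weights-enum API) and simply strips the classifier head; it is not the detection pipeline from the talk.

```python
# Minimal sketch (points 15, 29): pretrained ResNet-101 as a frozen feature
# extractor. Assumes a recent torchvision; illustrative only.
import torch
import torchvision

weights = torchvision.models.ResNet101_Weights.IMAGENET1K_V1
backbone = torchvision.models.resnet101(weights=weights)
backbone.fc = torch.nn.Identity()    # drop the 1000-way classifier head
backbone.eval()

for p in backbone.parameters():      # freeze: features only, no fine-tuning
    p.requires_grad_(False)

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # dummy batch of images
    features = backbone(images)             # (4, 2048) pooled features
print(features.shape)
```

A downstream task (detection head, pose estimator, etc.) would then consume these features, which is the transfer pattern the talk credits for the detection gains over VGG-16.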
Knowledge Vault built by David Vivancos 2024