The End Of Knowledge - Vault 2 - ICLR (2014-2023) - Karen Simonyan, Andrew Zisserman ICLR 2015

graph LR
classDef arch fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef train fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef eval fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef results fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef impact fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Karen Simonyan et al ICLR 2015] --> B[Deeper ConvNets improve ImageNet classification. 1]
B --> C[Depth evaluation on same architecture. 2]
C --> D[Deeper models than prior art. 3]
A --> E[Uses small 3x3 kernels throughout. 7]
E --> F[Stacked 3x3 without pooling: larger receptive field. 9]
E --> G[Stacked 3x3: more nonlinearity, fewer parameters. 10]
E --> H[3x3 kernels simplify architecture design. 11]
A --> I[Architectures: 11 to 19 layers. 12]
A --> J[Fixed 224x224 input, rescale, crop. 13]
J --> K[Multi-scale training: rescale 256-512, crop. 14]
A --> L[Standard augmentation used, no distortions. 15]
A --> M[Mini-batch gradient descent optimization. 16]
A --> N[11-layer initializes deeper nets. 17]
A --> O[Two testing: crop sampling, fully convolutional. 19]
O --> P[Multi-scale evaluation on multiple resolutions helps. 20]
A --> Q[Implementation: Caffe, multiple GPUs. 21]
A --> R[Results: depth important, multi-scale helps. 22]
R --> S[Dense and crop evaluation complementary. 23]
R --> T[Won 2014 ImageNet localization, 2nd in classification. 24]
A --> U[Deep features outperform shallower on datasets. 27]
A --> V[Depth crucial for ImageNet, 3x3 works well. 29]
A --> W[Models publicly available for download. 5]
W --> X[Released models enabled detection, segmentation advances. 28]
X --> Y[16, 19-layer models released for use. 30]
class B,C,D,E,F,G,H,I arch;
class J,K,L,M,N train;
class O,P,Q eval;
class R,S,T,U,V results;
class W,X,Y impact;

Summary:

1.-Convolutional networks have been getting deeper over time to improve performance on ImageNet classification.

2.-The work evaluates convolutional networks of different depths on ImageNet that share the same architecture design except for depth.

3.-The models are much deeper than prior state-of-the-art networks such as AlexNet.

4.-Features from the deeper models are also evaluated on other datasets.

5.-Models were made publicly available for the community to download and use.

6.-A single family of networks is explored where only the depth differs, fixing other key design choices.

7.-Very small 3x3 convolutional kernels are used in all layers with stride 1, differing from prior work.

8.-The remaining details are conventional: max pooling, dropout, and fully connected layers, with the last layer performing classification.

9.-Stacked 3x3 conv layers without pooling in between have a larger receptive field than a single 3x3 layer: two stacked layers cover 5x5, three cover 7x7.

10.-Stacked 3x3 layers add more non-linearities, making the decision function more discriminative, and use fewer parameters than a single large-kernel layer with the same receptive field.

11.-Committing to 3x3 kernels throughout makes architecture design easier.

12.-Architectures are constructed by starting with 11 layers and injecting more 3x3 conv layers to get 13, 16, 19 layers.

13.-The input is a fixed 224x224 image. The conventional approach is to rescale the image while preserving its aspect ratio and then take a random crop.

14.-Multi-scale training is used, rescaling each image to a randomly sampled size between 256-512 before taking a fixed crop.

15.-Standard augmentation such as horizontal flips and RGB offsets is used, but no advanced automatic distortions.

16.-The networks are optimized with mini-batch gradient descent with momentum. Training converges in about 74 epochs, which is fast for networks this deep and is attributed in part to the small kernels.

17.-The 11-layer model is initialized from a Gaussian distribution and then used to initialize the deeper nets. The pre-initialized layers are not frozen and keep training.

18.-Fully random initialization of every layer is also possible, provided the weights are scaled to preserve signal magnitudes from layer to layer.

19.-Two testing approaches are used: sampling multiple crops and combining their predictions, and fully convolutional (dense) evaluation that produces class score maps.

20.-Both testing approaches are tried, as is combining their predictions. Multi-scale evaluation, which applies the network at several image resolutions, also helps.

21.-The implementation used a modified Caffe toolbox that supports multiple GPUs with synchronous data parallelism, giving a 3.7x speedup with 4 GPUs.

22.-Results show that depth is important, with the 16- and 19-layer nets beating the 11-layer net substantially. Multi-scale training and testing both help.

23.-Dense evaluation and multiple crop evaluation yield comparable results and are complementary when combined.

24.-The approach won the 2014 ImageNet localization challenge and placed 2nd in classification after GoogLeNet. A single model reached about 7% top-5 error.

25.-Both VGG and GoogLeNet used very deep networks with multi-scale training. VGG used simple 3x3 kernels, while GoogLeNet used the more complex Inception modules.

26.-Even better results were reported afterwards by work that builds on the deep 3x3 VGG nets but makes them wider and uses more aggressive augmentation.

27.-The deep representations work well as feature extractors on other datasets. Deeper features beat shallower ones even with simple classifiers.

28.-The publicly released 16- and 19-layer models enabled later advances in object detection, segmentation, and captioning.

29.-Convolutional depth is very important for ImageNet classification. Networks built with stacked 3x3 conv layers work well.

30.-The 16 and 19-layer models were released and can be used in any package with a Caffe or Torch backend.
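
Points 9 and 10 can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel width `C = 256` is a hypothetical value chosen only for illustration.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Effective receptive field of `num_layers` stacked stride-1 convs."""
    # Each stride-1 layer widens the field by (kernel - 1) pixels.
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel: int, channels: int, num_layers: int = 1) -> int:
    """Weight count (biases ignored) with `channels` inputs and outputs."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # hypothetical channel width, for illustration only

rf_two = stacked_receptive_field(2)    # matches a single 5x5 layer
rf_three = stacked_receptive_field(3)  # matches a single 7x7 layer

stacked = conv_weights(3, C, num_layers=3)  # 3 * (3*3) * C^2 = 27 C^2
single = conv_weights(7, C)                 # 7*7 * C^2 = 49 C^2

print("receptive fields:", rf_two, rf_three)
print("weights, three 3x3 layers:", stacked)
print("weights, one 7x7 layer:", single)
```

The stacked version needs 27 C^2 weights against 49 C^2 for the single 7x7 layer, and it applies three non-linearities instead of one.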
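
Point 12 describes growing an 11-layer base into 13-, 16- and 19-layer variants by adding 3x3 conv layers. The sketch below writes those variants as lists of output channel counts, with `"M"` marking a max-pooling step. The channel widths, the configuration letters, and the fully connected sizes (4096, 4096, 1000) are not stated in this summary; they are taken from the published paper as recalled here and should be treated as assumptions.

```python
# Numbers are conv output channels; "M" is a 2x2 max pool (assumed layout).
CONFIGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M",
          512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}


def weight_layers(config, num_fc=3):
    """Count conv layers plus the fully connected layers."""
    return sum(1 for v in config if v != "M") + num_fc


def parameter_count(config, in_channels=3, input_size=224,
                    fc_sizes=(4096, 4096, 1000)):
    """Weights and biases for 3x3 convs followed by the FC layers."""
    total, channels, size = 0, in_channels, input_size
    for v in config:
        if v == "M":
            size //= 2  # each pool halves the spatial resolution
        else:
            total += 3 * 3 * channels * v + v
            channels = v
    features = channels * size * size  # flattened input to the first FC layer
    for out in fc_sizes:
        total += features * out + out
        features = out
    return total


depths = {name: weight_layers(cfg) for name, cfg in CONFIGS.items()}
params = {name: parameter_count(cfg) for name, cfg in CONFIGS.items()}

for name in CONFIGS:
    print(name, depths[name], "layers,", round(params[name] / 1e6), "M parameters")
```

Under these assumptions most of the parameters sit in the first fully connected layer, which is why adding conv layers changes the total only modestly.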
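
The multi-scale training procedure in points 13 and 14 comes down to simple geometry: pick a scale, resize while keeping the aspect ratio, then pick a 224x224 window. The sketch below computes only the resized dimensions and the crop offset, with no image library involved. The function name `sample_training_crop` is hypothetical, and applying the sampled size to the shorter image side is an assumption based on the published paper.

```python
import random


def sample_training_crop(width, height, s_min=256, s_max=512, crop=224,
                         rng=random):
    """Return (new_width, new_height, left, top) for one training sample."""
    # Sample the target length of the shorter side (assumed convention).
    s = rng.randint(s_min, s_max)
    scale = s / min(width, height)

    # Resize isotropically so the aspect ratio is preserved.
    new_w = max(crop, round(width * scale))
    new_h = max(crop, round(height * scale))

    # Choose a random top-left corner so the crop stays inside the image.
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    return new_w, new_h, left, top


rng = random.Random(0)  # fixed seed so the example is repeatable
for _ in range(3):
    print(sample_training_crop(500, 375, rng=rng))
```

Because the crop size is fixed while the image scale varies, the network sees objects at different apparent sizes during training.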
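
Points 19, 20 and 23 describe combining dense and multi-crop predictions. One simple way to do that, sketched below, is to average the class score map over spatial positions, turn scores into probabilities with a softmax, and then average the two sets of probabilities. The toy scores are made up for illustration, and plain averaging is an assumption about the combination rule rather than a detail given in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average(vectors):
    """Element-wise mean of equally sized score or probability vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical toy data with 3 classes.
dense_map = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]]    # scores at 2 spatial positions
crop_scores = [[1.8, 0.9, 0.0], [0.4, 1.1, 0.3]]  # scores from 2 crops

# Dense evaluation: average the score map spatially, then normalize.
dense_post = softmax(average(dense_map))

# Multi-crop evaluation: normalize each crop, then average.
crop_post = average([softmax(s) for s in crop_scores])

# Combine both evaluation modes by averaging their probabilities.
combined = average([dense_post, crop_post])
predicted = max(range(len(combined)), key=combined.__getitem__)

print("combined probabilities:", [round(p, 3) for p in combined])
print("predicted class index:", predicted)
```

The same averaging extends to multi-scale evaluation by adding the probabilities from each test resolution to the list being averaged.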

Knowledge Vault built by David Vivancos 2024