Karen Simonyan, Andrew Zisserman ICLR 2015 - Very Deep Convolutional Networks for Large-Scale Image Recognition

**Concept Graph & Summary using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:**

```mermaid
graph LR
classDef arch fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef train fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef eval fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef results fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef impact fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Karen Simonyan et al<br>ICLR 2015] --> B[Deeper ConvNets improve<br>ImageNet classification. 1]
B --> C[Depth evaluation on<br>same architecture. 2]
C --> D[Deeper models than<br>prior art. 3]
A --> E[Uses small 3x3<br>kernels throughout. 7]
E --> F[Stacked 3x3 without pooling:<br>larger receptive field. 9]
E --> G[Stacked 3x3: more nonlinearity,<br>fewer parameters. 10]
E --> H[3x3 kernels simplify<br>architecture design. 11]
A --> I[Architectures: 11 to<br>19 layers. 12]
A --> J[Fixed 224x224 input,<br>rescale, crop. 13]
J --> K[Multi-scale training:<br>rescale 256-512, crop. 14]
A --> L[Standard augmentation used,<br>no distortions. 15]
A --> M[Mini-batch gradient<br>descent optimization. 16]
A --> N[11-layer initializes<br>deeper nets. 17]
A --> O[Two testing: crop sampling,<br>fully convolutional. 19]
O --> P[Multi-scale evaluation on<br>multiple resolutions helps. 20]
A --> Q[Implementation: Caffe,<br>multiple GPUs. 21]
A --> R[Results: depth important,<br>multi-scale helps. 22]
R --> S[Dense and crop<br>evaluation complementary. 23]
R --> T[Won 2014 ImageNet<br>localization and classification. 24]
A --> U[Deep features outperform<br>shallower on datasets. 27]
A --> V[Depth crucial for ImageNet,<br>3x3 works well. 29]
A --> W[Models publicly available<br>for download. 5]
W --> X[Released models enabled detection,<br>segmentation advances. 28]
X --> Y[16, 19-layer models<br>released for use. 30]
class B,C,D,E,F,G,H,I arch;
class J,K,L,M,N train;
class O,P,Q eval;
class R,S,T,U,V results;
class W,X,Y impact;
```

**Summary:**

**1.-**Convolutional networks have been getting deeper over time to improve performance on ImageNet classification.

**2.-**The work evaluates convolutional networks of different depths on ImageNet that share the same architecture design except for depth.

**3.-**Models are much deeper compared to prior state-of-the-art like AlexNet.

**4.-**Deeper features are evaluated on other datasets.

**5.-**Models were made publicly available for the community to download and use.

**6.-**A single family of networks is explored where only the depth differs, fixing other key design choices.

**7.-**Very small 3x3 convolutional kernels are used in all layers with stride 1, differing from prior work.

**8.-**Other conventional details are used like max pooling, dropout, fully connected layers, with the last layer performing classification.

**9.-**Stacked 3x3 conv layers without pooling in between have the same effective receptive field as a single larger-kernel layer (e.g., three 3x3 layers cover 7x7).

**10.-**Stacked 3x3 layers include more non-linearities, making the decision function more discriminative, while using fewer parameters than a single large-kernel layer.
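The receptive-field and parameter arguments in points 9-10 can be checked with a little arithmetic; the channel width `C` below is an illustrative assumption, not a value from the paper.

```python
def receptive_field(num_layers, k=3):
    """Effective receptive field of num_layers stacked k x k, stride-1 convs."""
    return num_layers * (k - 1) + 1

def conv_params(k, c_in, c_out):
    """Weight count of one k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

C = 256                                 # illustrative channel width
stacked = 3 * conv_params(3, C, C)      # three 3x3 layers: 27 * C^2 weights
single = conv_params(7, C, C)           # one 7x7 layer:    49 * C^2 weights
print(receptive_field(3), stacked, single)  # same 7x7 coverage, fewer weights
```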

**11.-**Committing to 3x3 kernels throughout makes architecture design easier.

**12.-**Architectures are constructed by starting with 11 layers and injecting more 3x3 conv layers to get 13, 16, 19 layers.
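The layer-injection scheme in point 12 corresponds to the paper's configurations A, B, D, and E, which can be written as lists of 3x3 conv channel widths (with 'M' marking 2x2 max pooling); the total weight-layer depth is the conv count plus the three fully connected layers:

```python
# VGG configurations from the paper: numbers are 3x3 conv output channels,
# 'M' is a 2x2 max-pooling layer.
CONFIGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

def weight_layers(cfg):
    """Conv layers in the config plus the three fully connected layers."""
    return sum(1 for v in cfg if v != "M") + 3

print({name: weight_layers(cfg) for name, cfg in CONFIGS.items()})
# → {'A': 11, 'B': 13, 'D': 16, 'E': 19}
```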

**13.-**Input is a fixed 224x224 image. Conventional approach is rescaling to preserve aspect ratio then taking a random crop.

**14.-**Multi-scale training is used, rescaling each image so its smaller side is a randomly sampled length between 256 and 512 before taking a fixed 224x224 crop.
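A minimal sketch of the scale jittering in point 14 (the function names are mine; the paper calls the sampled smaller-side length the training scale S):

```python
import random

def sample_train_scale(s_min=256, s_max=512):
    """Sample the training scale S uniformly, as in multi-scale training."""
    return random.randint(s_min, s_max)

def rescaled_size(w, h, s):
    """Resize so the smaller side equals s, preserving aspect ratio;
    a random 224x224 crop is then taken from the result."""
    if w <= h:
        return s, round(h * s / w)
    return round(w * s / h), s

s = sample_train_scale()
print(rescaled_size(640, 480, s))  # smaller side becomes s
```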

**15.-**Standard augmentation like horizontal flips and RGB offsets are used, but no advanced automatic distortions.

**16.-**Networks are optimized with mini-batch gradient descent with momentum. Training converges in ~74 epochs, helped by the implicit regularisation of the small kernels and by pre-initialisation.
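Point 16's optimiser can be sketched as a single momentum SGD update on a scalar weight; the learning rate, momentum, and weight decay below are the values reported in the paper, while the function itself is an illustrative stand-in:

```python
def sgd_momentum_step(w, g, v, lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One mini-batch SGD update with momentum and L2 weight decay,
    for a single scalar weight (hyperparameters as reported in the paper)."""
    v = momentum * v - lr * (g + weight_decay * w)
    return w + v, v

w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, g=0.5, v=v)
print(w)  # the weight moves against the gradient
```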

**17.-**The 11-layer model is initialized from a Gaussian distribution, and its trained layers are used to initialize the deeper nets without freezing those layers.

**18.-**Entirely random initialization per layer is also possible if scaled to preserve magnitudes.
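The magnitude-preserving random initialisation of point 18 (later formalised as Glorot/He initialisation) can be sketched as variance scaling by fan-in; the ReLU factor of 2 is an assumption from He et al., not a value from this paper:

```python
import numpy as np

def init_conv(k, c_in, c_out, rng):
    """Variance-scaled Gaussian init for a k x k conv layer: the std is chosen
    so activation magnitudes are roughly preserved through the layer."""
    fan_in = k * k * c_in
    std = (2.0 / fan_in) ** 0.5  # factor 2 accounts for ReLU (He et al.)
    return rng.normal(0.0, std, size=(c_out, c_in, k, k))

w = init_conv(3, 64, 128, np.random.default_rng(0))
print(w.shape, round(float(w.std()), 3))
```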

**19.-**Two testing approaches: random multi-crop sampling with prediction averaging, and fully convolutional (dense) evaluation producing class score maps.

**20.-**Both testing approaches are tried along with combining their predictions. Multi-scale evaluation by applying to multiple resolutions helps.
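A numpy sketch of the dense evaluation in points 19-20: the fully connected layers are reinterpreted as convolutions so the net produces a spatial class score map over the uncropped image, which is then spatially averaged into one score vector; the score map here is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
score_map = rng.random((1000, 7, 7))   # (classes, H, W): scores per location
scores = score_map.mean(axis=(1, 2))   # spatially average (sum-pool) the map
pred = int(scores.argmax())            # final class prediction for the image
print(scores.shape, pred)
```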

**21.-**Implementation used modified Caffe toolbox supporting multiple GPUs with synchronous data parallelism. 3.7x speedup with 4 GPUs.

**22.-**Results show depth is important, with 16 and 19-layer nets beating 11-layer nets substantially. Multi-scale training and testing help.

**23.-**Dense evaluation and multiple crop evaluation yield comparable results and are complementary when combined.

**24.-**The approach won the 2014 ImageNet localization challenge and placed 2nd in classification, after GoogLeNet. A single model achieved ~7% top-5 error.

**25.-**Both VGG and GoogLeNet used very deep networks with multi-scale training. VGG used simple stacks of 3x3 kernels; GoogLeNet used the more complex Inception modules.

**26.-**Even better results were reported afterwards, building on the deep 3x3 VGG nets but with wider layers and more aggressive augmentation.

**27.-**Deep representations work well as feature extractors on other datasets. Deeper features beat shallower ones even with simple classifiers.
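Point 27's transfer setting can be sketched with a simple classifier over fixed deep features; here random vectors stand in for the 4096-d penultimate activations, and a nearest-class-mean rule stands in for the linear SVM typically used:

```python
import numpy as np

rng = np.random.default_rng(1)
feats = rng.normal(size=(20, 4096))      # stand-ins for deep VGG features
labels = np.array([0] * 10 + [1] * 10)   # two-class toy dataset
# Fit: one centroid per class in feature space.
centroids = np.stack([feats[labels == c].mean(axis=0) for c in (0, 1)])
# Predict: assign each image to the nearest centroid.
dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
pred = dists.argmin(axis=1)
print((pred == labels).mean())  # training accuracy of the toy classifier
```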

**28.-**The publicly released 16 and 19-layer models enabled advances in object detection, segmentation, captioning after release.

**29.-**Convolutional depth is very important for ImageNet classification. Networks built with stacked 3x3 conv layers work well.

**30.-**The 16 and 19-layer models were released and can be used in any package with a Caffe or Torch backend.

Knowledge Vault built by David Vivancos, 2024