Knowledge Vault 2/4 - ICLR 2014-2023
Vincent Vanhoucke ICLR 2014 - Invited Talk - Learning Visual Representations at Scale
<Resume Image>

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

Vincent Vanhoucke
ICLR 2014
Ex-speech scientist now
image recognition at Google 1
Convnets powerful hammer
across vision tasks 2
Goal build faster
convnets, efficient scaling 4
Parallelize training model
split vs data copy 5
Faster, efficient convnets
reduce redundant filters 15
Video classification challenges
fusion, 3D conv 21
Fine-grained dog classification
shows transfer power 3
Model parallelism has
communication penalty 6
Data parallelism has
parameter sync penalty 6
Implemented distributed model
and data parallelism 7
Design parallelization across
compute topologies, densities 9
Works best low
compute density 8
Breaks down high-density
cores like GPUs 8
Separable convolutions independent
channels, 1x1 projection 16
Fewer params 5-10x,
accurate, faster convergence 17
Scale object detection
more classes, data 18
Generic salient detector,
no sliding windows 19
Competitive VOC ImageNet
results, scalable 20
Bounding box inference
details in slides 30
Promising Youtube feature
transfer beats UCF-101 22
Physical modeling, graphics
help video compression 29
Big convnets +
finetuning beats ML 23
Robustness suggests deep
learning closer to right 24
Computation bottleneck, bigger
nets with dropout 25
Model size GPU
memory limited 26
Data parallelism convolutions,
model parallelism fully connected 10
Challenge fully connected
needs all conv data 11
Broadcast conv output,
parallel next layer chunks 12
Pipeline broadcast next
batch during current 13
3.74x 4 GPU,
6.32x 8 GPU speedup 14
Large minibatches work,
overlap communication computation 28

Resume:

1.-The speaker is a recovering speech recognition scientist who now works on image recognition in the deep learning infrastructure group at Google.

2.-Convolutional nets are a powerful "hammer" that works well across many computer vision tasks such as search, labeling, segmentation, and detection.

3.-Example of fine-grained dog breed classification: fine-tuning a pre-trained ImageNet model on the new data reached 2nd place, showing the power of transfer learning.
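
A minimal sketch of this kind of transfer learning, written with present-day PyTorch/torchvision rather than the 2014 tooling from the talk; the 120-class dog-breed setting and the training step are illustrative assumptions.

```python
# Hedged sketch: fine-tune an ImageNet-pretrained convnet on a new
# fine-grained task (e.g. dog breeds). Not the speaker's original code.
import torch
import torch.nn as nn
from torchvision import models

NUM_BREEDS = 120  # assumed number of dog-breed classes

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pretrained feature extractor; only the new head is trained.
for p in model.parameters():
    p.requires_grad = False

# Replace the 1000-way ImageNet classifier with a breed classifier.
model.fc = nn.Linear(model.fc.in_features, NUM_BREEDS)

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch from the new dataset."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```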

4.-The goal is to build better convolutional nets faster and to make training more efficient as data and models scale up.

5.-There are two approaches to parallelizing neural net training: model parallelism (split the network across machines) and data parallelism (copy the model to every worker and split the data across them).
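
A toy NumPy sketch to make the distinction concrete for a single linear layer; the sizes and the two-way split are arbitrary, and real systems split the work across machines rather than list comprehensions.

```python
# Toy illustration of the two parallelization schemes on one linear layer.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))      # a minibatch of 8 examples
W = rng.standard_normal((256, 512))    # the layer's weight matrix

# Model parallelism: split the *weights* across workers; each worker
# computes part of the output for the whole minibatch.
W_parts = np.split(W, 2, axis=1)             # worker 0 and worker 1
Y_model = np.concatenate([X @ Wp for Wp in W_parts], axis=1)

# Data parallelism: every worker holds a full copy of W and processes
# its own shard of the minibatch; outputs (or gradients) are gathered.
X_shards = np.split(X, 2, axis=0)
Y_data = np.concatenate([Xs @ W for Xs in X_shards], axis=0)

assert np.allclose(Y_model, X @ W) and np.allclose(Y_data, X @ W)
```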

6.-Model parallelism pays a communication penalty for exchanging intermediate data between machines; data parallelism pays one for synchronizing parameters between workers.

7.-They implemented a distributed system that combines model and data parallelism and uses asynchronous SGD, but efficiency is still poor as machines are added.
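
A minimal single-process sketch of asynchronous SGD in this spirit, assuming a toy least-squares problem and Python threads standing in for distributed workers; it is not the talk's actual system.

```python
# Toy asynchronous SGD: each worker reads possibly-stale parameters,
# computes a gradient on its own data shard, and pushes an update
# without any locking (the race is the point of "asynchronous").
import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.standard_normal(10)
X = rng.standard_normal((4000, 10))
y = X @ true_w

w = np.zeros(10)   # shared parameters ("parameter server" state)
lr = 0.01

def worker(shard_X, shard_y, seed, steps=200, batch=32):
    global w
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = local_rng.integers(0, len(shard_X), size=batch)
        xb, yb = shard_X[idx], shard_y[idx]
        stale = w.copy()                        # read without coordination
        grad = 2 * xb.T @ (xb @ stale - yb) / batch
        w = w - lr * grad                       # asynchronous write

threads = [threading.Thread(target=worker, args=(xs, ys, i))
           for i, (xs, ys) in enumerate(zip(np.split(X, 4), np.split(y, 4)))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance to true weights:", np.linalg.norm(w - true_w))
```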

8.-This works best with low compute density (fast network, slow cores); it breaks down with high-density cores like GPUs because communication becomes the bottleneck.

9.-The goal is to design a parallelization approach that works across different compute topologies and densities, to keep up with fast-changing hardware.

10.-A new idea from Alex Krizhevsky: use data parallelism for the convolutional layers and model parallelism for the fully connected layers.
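
A back-of-the-envelope sketch of why that split makes sense, using approximate AlexNet-like layer shapes as an assumed example: the convolutional layers hold only a few percent of the parameters (cheap to replicate for data parallelism), while the fully-connected layers hold the bulk of them (so splitting them across workers avoids synchronizing most of the weights).

```python
# Rough parameter counts for an AlexNet-like net (shapes are approximate,
# used only to illustrate the conv-vs-fully-connected imbalance).
conv_params = (
    96 * (11 * 11 * 3) +       # conv1
    256 * (5 * 5 * 48) +       # conv2 (grouped)
    384 * (3 * 3 * 256) +      # conv3
    384 * (3 * 3 * 192) +      # conv4 (grouped)
    256 * (3 * 3 * 192)        # conv5 (grouped)
)
fc_params = (
    4096 * (6 * 6 * 256) +     # fc6
    4096 * 4096 +              # fc7
    1000 * 4096                # fc8
)
total = conv_params + fc_params
print(f"conv: {conv_params/1e6:.1f}M params ({100*conv_params/total:.0f}%)")
print(f"fc:   {fc_params/1e6:.1f}M params ({100*fc_params/total:.0f}%)")
# Conv layers: few weights but most of the computation -> data parallelism.
# Fully-connected layers: most of the weights, little computation -> model
# parallelism, so the big weight matrices never need to be synchronized.
```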

11.-The challenge is that the fully-connected layers need the convolutional output for every example, which creates a communication bottleneck where the parallelization scheme switches.

12.-Solution: broadcast each worker's convolutional output to all workers and have each one compute its chunk of the next layer in parallel.

13.-Clever pipelining: start broadcasting the next batch to the fully-connected workers while the current batch is still being computed, overlapping communication with computation.
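
A schematic double-buffering sketch of that idea, assuming Python threads and placeholder broadcast/compute functions; the real implementation overlaps device-to-device transfers with kernel execution rather than host threads.

```python
# Schematic double-buffering: broadcast the next batch's activations while
# the fully-connected workers are still computing on the current batch.
# `broadcast_to_workers` and `fully_connected_step` are placeholders.
import threading

def pipeline(batches, broadcast_to_workers, fully_connected_step):
    prefetch = threading.Thread(target=broadcast_to_workers,
                                args=(batches[0],))
    prefetch.start()
    for i in range(len(batches)):
        prefetch.join()                 # batch i has now reached all workers
        if i + 1 < len(batches):
            prefetch = threading.Thread(target=broadcast_to_workers,
                                        args=(batches[i + 1],))
            prefetch.start()            # overlap the next transfer ...
        fully_connected_step(batches[i])  # ... with the current computation
```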

14.-This gets a 3.74x speedup on 4 GPUs (near the optimal 4x) and 6.32x on 8 GPUs, faster than other approaches in the literature.

15.-Next topic: making convnets faster and more efficient. Learned filters are often redundant; for example, the RGB filters in the first layer are very similar to each other.

16.-Separable convolutions: first convolve each input channel independently to produce many thinner feature maps, then apply a 1x1 convolution to project them.
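
A minimal sketch of this structure in PyTorch (a later framework, not what the talk used): a per-channel "depthwise" convolution, with an optional depth multiplier as mentioned in point 27, followed by a 1x1 "pointwise" projection. Channel sizes are assumptions.

```python
# Separable convolution sketch: convolve each input channel independently
# (optionally into several maps per channel), then project with 1x1 convs.
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3,
                 depth_multiplier=1, padding=1):
        super().__init__()
        # groups=in_ch gives each input channel its own spatial filters.
        self.depthwise = nn.Conv2d(in_ch, in_ch * depth_multiplier,
                                   kernel_size, padding=padding,
                                   groups=in_ch, bias=False)
        # 1x1 convolution mixes the per-channel maps into out_ch outputs.
        self.pointwise = nn.Conv2d(in_ch * depth_multiplier, out_ch,
                                   kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(2, 64, 32, 32)          # (batch, channels, H, W)
layer = SeparableConv2d(64, 128, depth_multiplier=1)
print(layer(x).shape)                   # torch.Size([2, 128, 32, 32])
```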

17.-This uses far fewer parameters (a 5-10x reduction is typical), is just as accurate, converges faster, and is simple to implement. It works best for large-scale tasks.
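
A quick worked count under assumed layer sizes (256 input channels, 256 output channels, 3x3 kernels, depth multiplier 1) showing where a reduction of that order can come from.

```python
# Parameter count: standard 3x3 convolution vs. separable convolution,
# for an assumed 256-in / 256-out layer (depth multiplier 1 here).
in_ch, out_ch, k = 256, 256, 3

standard = out_ch * in_ch * k * k                # 589,824
separable = in_ch * k * k + out_ch * in_ch       # 2,304 + 65,536 = 67,840

print(standard, separable, round(standard / separable, 1))
# 589824 67840 8.7  -> roughly the 5-10x reduction quoted in the talk
```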

18.-The next frontier is scaling object detection with convnets to more classes and more data; current approaches are slow and class-specific.

19.-New approach: build a generic "salient object" detector that uses convolutional features to propose object regions directly, with no sliding windows.
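
A rough, hypothetical sketch of what such a class-agnostic proposal head could look like, with assumed feature shapes and a fixed number of proposals per image; the actual model from the talk is not reproduced here (see point 30).

```python
# Sketch of a class-agnostic proposal head: convolutional features feed a
# small network that directly emits K candidate boxes plus a "salient
# object" confidence for each, with no sliding-window scan.
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, feat_dim=256, num_proposals=100):
        super().__init__()
        self.k = num_proposals
        self.pool = nn.AdaptiveAvgPool2d(1)           # collapse spatial dims
        self.boxes = nn.Linear(feat_dim, 4 * self.k)  # (x1, y1, x2, y2) each
        self.scores = nn.Linear(feat_dim, self.k)     # objectness per box

    def forward(self, feats):                         # feats: (B, C, H, W)
        z = self.pool(feats).flatten(1)               # (B, C)
        boxes = torch.sigmoid(self.boxes(z)).view(-1, self.k, 4)
        scores = self.scores(z)                       # raw confidences
        return boxes, scores

feats = torch.randn(2, 256, 14, 14)   # assumed conv feature map
boxes, scores = ProposalHead()(feats)
print(boxes.shape, scores.shape)      # (2, 100, 4), (2, 100)
```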

20.-It gets competitive results on VOC and ImageNet detection in a much more scalable way by restricting the complexity of the model and of the region proposals.

21.-Video classification is the next challenge. There is no clear best approach yet: late fusion, early fusion, 3D convolution, or a hybrid? Progress is computationally limited.
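
A shape-level sketch of the alternatives being weighed, assuming a toy clip of T frames and PyTorch layers; it only illustrates how the tensors are arranged in each variant.

```python
# Shape-level comparison of video-classification variants for a clip of
# T frames: late fusion, early fusion, and 3D convolution. Sizes are toy.
import torch
import torch.nn as nn

B, T, C, H, W = 2, 16, 3, 64, 64
clip = torch.randn(B, T, C, H, W)

# Late fusion: run a 2D convnet per frame, then pool/average over time.
frame_net = nn.Conv2d(C, 32, 3, padding=1)
per_frame = frame_net(clip.flatten(0, 1))              # (B*T, 32, H, W)
late = per_frame.view(B, T, 32, H, W).mean(dim=1)      # fuse at the end

# Early fusion: stack frames in the channel dimension before the first conv.
early_net = nn.Conv2d(T * C, 32, 3, padding=1)
early = early_net(clip.view(B, T * C, H, W))           # fuse at the input

# 3D convolution: convolve over time and space jointly.
conv3d = nn.Conv3d(C, 32, kernel_size=(3, 3, 3), padding=1)
spatiotemporal = conv3d(clip.permute(0, 2, 1, 3, 4))   # (B, C, T, H, W) in

print(late.shape, early.shape, spatiotemporal.shape)
# (2, 32, 64, 64)  (2, 32, 64, 64)  (2, 32, 16, 64, 64)
```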

22.-But transfer learning looks promising: convnet features learned on YouTube videos beat the state of the art on the UCF-101 benchmark, without finetuning the convnet.

23.-Over and over, big convnet models plus smaller-scale task-specific finetuning yield state-of-the-art results, and are much more robust than past machine learning approaches.

24.-This robustness to new data suggests that deep learning models are closer to the "right" approach, not just overfitting.

25.-The biggest bottleneck is computation: with unlimited compute, one could train much bigger nets with more dropout regularization and almost surely improve results.

26.-Model size is currently limited by GPU memory, and some models train for months to eke out small gains; more computation would help a lot.

27.-In the separable convolutions, a "depth multiplier" of 8 is typically used: each input channel is turned into 8 feature maps before the 1x1 projection.
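
Expressed with the same PyTorch building blocks as the sketch after point 16, and with assumed channel sizes, a depth multiplier of 8 looks like this.

```python
# Depth multiplier of 8: each input channel is expanded into 8 feature maps
# before the 1x1 projection. Channel sizes here are assumptions.
import torch.nn as nn

in_ch, mult = 64, 8
depthwise = nn.Conv2d(in_ch, in_ch * mult, kernel_size=3, padding=1,
                      groups=in_ch, bias=False)      # 64 -> 512 maps
pointwise = nn.Conv2d(in_ch * mult, 128, kernel_size=1, bias=False)

n_params = sum(p.numel() for p in
               list(depthwise.parameters()) + list(pointwise.parameters()))
print(n_params)   # 4,608 depthwise + 65,536 pointwise weights
```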

28.-Very large minibatches work surprisingly well and don't usually hurt convergence; they allow better overlap of communication and computation during distributed training.
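
A back-of-the-envelope cost model (all numbers made up) illustrating why larger minibatches make it easier to hide a fixed per-step synchronization cost behind computation.

```python
# Toy cost model: per-step time = max(compute, communication) when the two
# overlap. Compute scales with batch size; parameter sync does not.
sync_ms = 50.0                  # assumed time to synchronize parameters
compute_ms_per_example = 0.5    # assumed forward/backward time per example

for batch in (32, 256, 2048):
    compute = batch * compute_ms_per_example
    step = max(compute, sync_ms)            # overlapped step time
    useful = compute / step                 # fraction of the step doing math
    print(batch, f"{useful:.0%} of the step spent computing")
```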

29.-Physical modeling and inverting the graphics pipeline could help with video, acting as a strong prior for data compression, but it is unclear how much this helps in general.

30.-Details of the bounding-box inference approach are in the slides but were not discussed in depth; they were left for the audience to discuss with the authors afterwards.

Knowledge Vault built by David Vivancos 2024