Knowledge Vault 2/4 - ICLR 2014-2023
Vincent Vanhoucke ICLR 2014 - Invited Talk - Learning Visual Representations at Scale
[Resume Image]

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef google fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef convnets fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef parallelization fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef efficiency fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef objectdetection fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef video fill:#d4f9f9, font-weight:bold, font-size:14px;
A[Vincent Vanhoucke ICLR 2014] --> B[Ex-speech scientist now image recognition at Google 1]
A --> C[Convnets powerful hammer across vision tasks 2]
A --> E[Goal build faster convnets, efficient scaling 4]
A --> F[Parallelize training model split vs data copy 5]
A --> R[Faster, efficient convnets reduce redundant filters 15]
A --> X[Video classification challenges fusion, 3D conv 21]
C --> D[Fine-grained dog classification shows transfer power 3]
F --> G[Model parallelism has communication penalty 6]
F --> H[Data parallelism has parameter sync penalty 6]
F --> I[Implemented distributed model and data parallelism 7]
F --> L[Design parallelization across compute topologies, densities 9]
I --> J[Works best low compute density 8]
I --> K[Breaks down high-density cores like GPUs 8]
R --> S[Separable convolutions independent channels, 1x1 projection 16]
S --> T[Fewer params 5-10x, accurate, faster convergence 17]
S --> AD[Separable depth multiplier 8 input to features 27]
E --> U[Scale object detection more classes, data 18]
U --> V[Generic salient detector, no sliding windows 19]
V --> W[Competitive VOC ImageNet results, scalable 20]
V --> AG[Bounding box inference details in slides 30]
X --> Y[Promising YouTube feature transfer beats UCF-101 22]
X --> AF[Physical modeling, graphics help video compression 29]
E --> Z[Big convnets + finetuning beats ML 23]
Z --> AA[Robustness suggests deep learning closer to right 24]
E --> AB[Computation bottleneck, bigger nets with dropout 25]
AB --> AC[Model size GPU memory limited 26]
L --> M[Data parallelism convolutions, model fully connected 10]
M --> N[Challenge fully connected needs all conv data 11]
M --> O[Broadcast conv output, parallel next layer chunks 12]
O --> P[Pipeline broadcast next batch during current 13]
O --> Q[3.74x 4 GPU, 6.32x 8 GPU speedup 14]
E --> AE[Large minibatches work, overlap communication computation 28]
class A,B google;
class C,D,R,S,T convnets;
class E,F,G,H,I,J,K,L,M,N,O,P,Q parallelization;
class U,V,W,AG objectdetection;
class X,Y,AF video;
class Z,AA,AB,AC,AD,AE efficiency;

Resume:

1.-Speaker is a recovering speech recognition scientist now working on image recognition in the deep learning infrastructure group at Google.

2.-Convolutional nets are a powerful "hammer" approach that works well across many computer vision tasks like search, labeling, segmentation, and detection.

3.-Example of fine-grained dog breed classification - using a pre-trained ImageNet model with new data gets 2nd place, showing the power of transfer learning.

4.-Goal is to build better convolutional nets faster and to make training more efficient as you scale to larger data and models.

5.-Two approaches to parallelize neural net training - model parallelism (split network across machines) and data parallelism (copy model, split data).
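
A minimal NumPy sketch of the two schemes on a single linear layer (the worker count, shapes, and toy squared-error loss are invented for illustration, not taken from the talk):

```python
import numpy as np

# Toy setup: one linear layer y = x @ W with a squared-error loss, 4 workers.
rng = np.random.default_rng(0)
n_workers, batch, d_in, d_out = 4, 32, 256, 128
W = rng.normal(size=(d_in, d_out)) * 0.01
x = rng.normal(size=(n_workers * batch, d_in))
y = rng.normal(size=(n_workers * batch, d_out))

def grad(W, x, y):
    """Gradient of 0.5 * ||x @ W - y||^2 with respect to W."""
    return x.T @ (x @ W - y)

# Data parallelism: every worker holds a full copy of W and a slice of the
# batch; the cost is synchronizing (here, averaging) the per-worker gradients.
grads = [grad(W, x[i * batch:(i + 1) * batch], y[i * batch:(i + 1) * batch])
         for i in range(n_workers)]
W_new = W - 1e-4 * np.mean(grads, axis=0)       # parameter-sync step

# Model parallelism: every worker holds a column slice of W and sees the full
# batch; the cost is exchanging activations between workers.
W_slices = np.split(W, n_workers, axis=1)
out_slices = [x @ Ws for Ws in W_slices]        # each worker computes its slice
y_pred = np.concatenate(out_slices, axis=1)     # activation-exchange step
```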

6.-Model parallelism has communication penalty exchanging data between machines. Data parallelism has communication penalty synchronizing parameters between workers.

7.-Implemented distributed system doing both model and data parallelism. Uses asynchronous SGD. But efficiency still poor as you add machines.
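
The actual system is Google's distributed training infrastructure; as a rough illustration of the asynchronous-SGD idea only, here is a toy thread-based sketch in which workers apply updates to shared parameters without waiting for one another (all names, sizes, and the linear-regression objective are made up):

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
d = 64
w = rng.normal(size=d) * 0.01                    # shared parameter vector
X = rng.normal(size=(4096, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=4096)

def worker(shard, seed, lr=1e-2, steps=500):
    """Compute gradients on a private data shard and apply them to the shared
    parameters immediately, without waiting for the other workers."""
    local = np.random.default_rng(seed)
    Xs, ys = X[shard], y[shard]
    for _ in range(steps):
        i = local.integers(0, len(ys), size=32)        # sample a mini-batch
        g = Xs[i].T @ (Xs[i] @ w - ys[i]) / len(i)     # least-squares gradient
        w[:] = w - lr * g                              # lock-free, asynchronous update

shards = np.array_split(np.arange(len(y)), 4)
threads = [threading.Thread(target=worker, args=(s, k)) for k, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final MSE:", np.mean((X @ w - y) ** 2))
```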

8.-Works best with low compute density (fast network, slow cores). Breaks down with high-density cores like GPUs due to communication bottleneck.

9.-Goal is to design parallelization approach that works across different compute topologies and densities to keep up with fast-changing hardware.

10.-New idea from Alex Krizhevsky - use data parallelism for the convolutional layers and model parallelism for the fully connected layers.

11.-Challenge is that the fully-connected layers need the convolutional-layer outputs from all workers, causing a communication bottleneck when switching between the two parallelization schemes.

12.-Solution - broadcast convolutional layer output to all workers, have them work on chunks of next layer in parallel.
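
A rough NumPy sketch of points 10-12 under assumed, invented layer sizes: data-parallel workers produce conv activations for their own mini-batches, the activations are gathered into one combined batch, and each worker then computes its own column chunk of the fully-connected layer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, batch, conv_feat, fc_out = 4, 64, 2304, 1024

# Data-parallel conv stage: each worker produces conv features for its own
# mini-batch (faked here with random activations).
conv_out = [rng.normal(size=(batch, conv_feat)) for _ in range(n_workers)]

# Broadcast: every worker receives the conv output of all workers, i.e. the
# full combined batch of activations.
full_batch = np.concatenate(conv_out, axis=0)       # (n_workers*batch, conv_feat)

# Model-parallel fully-connected stage: the FC weight matrix is split
# column-wise, and each worker multiplies the full batch by its own chunk.
W_fc = rng.normal(size=(conv_feat, fc_out)) * 0.01
W_chunks = np.split(W_fc, n_workers, axis=1)
fc_chunks = [full_batch @ Wc for Wc in W_chunks]    # done in parallel in practice
fc_activations = np.concatenate(fc_chunks, axis=1)  # (n_workers*batch, fc_out)
```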

13.-Clever pipelining - start broadcasting the next batch's activations to the fully-connected layers while the current batch is still being computed, overlapping communication and computation.

14.-Can get a 3.74x speedup on 4 GPUs (near the optimal 4x) and 6.32x on 8 GPUs. Faster than other approaches in the literature.
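
For reference, those reported speedups correspond to parallel efficiencies of roughly:

```latex
\text{efficiency} = \frac{\text{speedup}}{\text{\#GPUs}}: \qquad
\frac{3.74}{4} \approx 93.5\%, \qquad \frac{6.32}{8} \approx 79\%.
```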

15.-Next topic - making convnets faster and more efficient. Filters are often redundant, e.g. RGB filters in first layer very similar.

16.-Separable convolutions - first convolve each input channel independently to make many thinner feature maps, then 1x1 convolution to project.

17.-Uses many fewer parameters, 5-10x reduction typical. Just as accurate, converges faster. Simple to implement. Works best for large-scale tasks.
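
A back-of-the-envelope parameter count, using the depth multiplier of 8 mentioned in point 27; the layer shape below is invented, and the savings are largest when the spatial kernel is large relative to the cost of the 1x1 projection:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out, depth_mult=8):
    """Per-channel k x k convolution producing depth_mult maps per input
    channel, followed by a 1x1 convolution projecting to c_out."""
    depthwise = k * k * c_in * depth_mult
    pointwise = (c_in * depth_mult) * c_out
    return depthwise + pointwise

# Hypothetical large-kernel layer:
std = conv_params(9, 96, 256)          # 1,990,656 weights
sep = separable_params(9, 96, 256)     #   258,816 weights
print(std, sep, round(std / sep, 1))   # ~7.7x fewer parameters
```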

18.-Next frontier is scaling object detection with convnets to more classes and data. Current approaches slow and class-specific.

19.-New approach - build generic "salient object" detector using conv features to directly propose object regions, no sliding windows.

20.-Gets competitive results on VOC and ImageNet detection in a much more scalable way by restricting model and region proposal complexity.

21.-Video classification next challenge. No clear best approaches yet - late fusion, early fusion, 3D convolution, hybrid? Computationally limited.
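
A shape-level sketch of how the three candidate architectures fold time in (all sizes and the per-frame feature dimension are invented, and the convnets themselves are stubbed out):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 16, 64, 64, 3               # 16 RGB frames of a video clip
clip = rng.normal(size=(T, H, W, C))

# Early fusion: stack frames along the channel axis and feed one 2D convnet a
# single (H, W, T*C) input, so the first layer sees motion directly.
early = clip.transpose(1, 2, 0, 3).reshape(H, W, T * C)

# Late fusion: run a 2D convnet on each frame independently, then merge the
# per-frame feature vectors (here, a fake 256-d feature per frame) at the end.
frame_feats = rng.normal(size=(T, 256))  # stand-in for per-frame convnet output
late = frame_feats.mean(axis=0)          # pooled clip-level descriptor

# 3D convolution: keep time as an explicit axis and convolve over (T, H, W)
# with a k_t x k x k kernel, e.g. weights of shape (3, 3, 3, C, filters).
w3d = rng.normal(size=(3, 3, 3, C, 64))
print(early.shape, late.shape, w3d.shape)
```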

22.-But seeing promising transfer learning - convnet features learned on YouTube videos beat the state of the art on the UCF-101 benchmark, without convnet finetuning.

23.-Over and over, big convnet models plus smaller-scale task-specific finetuning yields state-of-the-art results, much more robust than past ML approaches.

24.-This robustness to new data suggests deep learning models are closer to "right" approach, not just overfitting.

25.-Biggest bottleneck is computation - with unlimited compute, could train much bigger nets with more dropout regularization and almost surely improve.

26.-Currently model size limited by GPU memory. Some models train for months to eke out small gains. More computation would help a lot.

27.-In the separable convolutions, a "depth multiplier" of 8 is typically used - each input channel is turned into 8 feature maps before the 1x1 projection.

28.-Very large minibatches work surprisingly well, don't usually hurt convergence. Allows better overlapping of communication and computation during distributed training.

29.-Physical modeling and inverting the graphics pipeline could help video, as a strong prior for data compression. But it is unclear how much this helps in general.

30.-Details of bounding box inference approach in slides not discussed in depth - left for audience to discuss with authors afterwards.

Knowledge Vault built by David Vivancos 2024