Knowledge Vault 2/18 - ICLR 2014-2023
Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, Yann LeCun ICLR 2015 - Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef performance fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef convolutions fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef fourier fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef implementations fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef optimizations fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Nicolas Vasilache et al. ICLR 2015] --> B[Facebook improves convolutional neural networks. 1]
A --> C[Convolutions computationally expensive, justify GPUs. 2]
A --> D[Perform convolutions efficiently using FFT. 3]
D --> E[Convolutions become pointwise multiplications in Fourier. 4]
A --> F[Contributions: FFTs, autotuning, optimized implementations. 5]
A --> G[Problem now bandwidth-bound, not compute-bound. 6]
G --> H[Performance ceiling raised to memory bandwidth. 7]
A --> I[Fourier convolutions: FFT, operations, transform back. 8]
I --> J[Intermediate buffers store temporary values. 9]
A --> K[Initial implementation: NVIDIA libraries, autotuning. 10]
K --> L[Autotuner considered efficiency vs memory tradeoffs. 11]
A --> M[Vendor FFT tuned for large datasets. 12]
M --> N[New fbfft for small, efficient FFTs. 13]
N --> O[fbfft treated GPU as vector machine. 14]
O --> P[Vector machine less effective, limited shuffles. 15]
A --> Q[Memory tradeoff: parallelism, collisions, efficiency, reuse. 16]
Q --> R[9x memory increase largest layer, tiling helps. 17]
A --> S[ConvNets need small FFTs, same cost. 18]
A --> T[Optimize algorithm first, then implementation. 19]
A --> U[cuFFT+cuBLAS vs cuDNN heatmaps. 20]
U --> V[Small kernel counts latency limited. 21]
A --> W[Torch numbers from Dec 2014. 22]
A --> X[VGG input size forces interpolation, tiling helps. 23]
A --> Y[Latest work: tiled FFT, padding, reuse. 24]
Y --> Z[Faster FFT in progress, numbers improving. 25]
A --> AA[Float16 could provide 2x improvement. 26]
A --> AB[Tiling decomposes large convolutions, reuses buffers. 27]
AB --> AC[Out-of-memory error reverts to low memory mode. 28]
A --> AD[Bottleneck shifted to memory bandwidth, room for optimization. 29]
A --> AE[Future developments: low-precision, memory bandwidth constraints. 30]
class B,C performance;
class D,E,I,J fourier;
class F,K,L,M,N,O,P,T,W,X,Y,Z,AA,AB,AC,AD,AE implementations;
class G,H,Q,R,S optimizations;
class U,V convolutions;

Resume:

1.-Facebook has been working on improving the performance of convolutional neural networks, focusing on making the convolutions more efficient.

2.-Convolutions are computationally expensive, taking up over 80% of the time in convolutional nets. They are the main justification for using GPUs.

3.-The approach is to perform convolutions more efficiently in the Fourier basis using the Fast Fourier Transform (FFT).

4.-In the Fourier domain, convolutions become pointwise multiplications, reducing the asymptotic cost from O(N^2) to O(N log N) when using the FFT.
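
A minimal NumPy sketch of that equivalence (illustrative only; the paper's implementations are CUDA kernels): a full 2-D convolution gives the same result whether computed directly or via pointwise multiplication in the Fourier domain, and the FFT route costs O(N^2 log N) instead of O(N^2 K^2) for an NxN input and a KxK kernel.

```python
import numpy as np

# Convolution theorem sketch: a "full" 2-D convolution computed directly and
# via pointwise multiplication in the Fourier domain agree.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))   # input feature map
w = rng.standard_normal((5, 5))     # kernel

# Direct (spatial) full convolution: O(N^2 * K^2) multiply-adds.
direct = np.zeros((16 + 5 - 1, 16 + 5 - 1))
for i in range(16):
    for j in range(16):
        direct[i:i + 5, j:j + 5] += x[i, j] * w

# FFT route: zero-pad both operands to the output size, transform, multiply
# pointwise, transform back -- O(N^2 log N) overall.
shape = (16 + 5 - 1, 16 + 5 - 1)
via_fft = np.fft.irfft2(np.fft.rfft2(x, shape) * np.fft.rfft2(w, shape), shape)

assert np.allclose(direct, via_fft)
```

(In a ConvNet layer the result is then cropped to the valid region, and the layer actually computes a cross-correlation, but the cost argument is the same.)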

5.-Main contributions: decomposing convolutions into FFTs, transposes, and matrix multiplications; adding auto-tuning; and developing higher-performance FFT and matrix multiplication implementations.
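
The decomposition can be pictured in NumPy (a sketch under simplified assumptions, not the actual CUDA pipeline): after transforming inputs and kernels, every frequency bin reduces to a small complex matrix multiplication over the channel dimensions, which is what the transpose and batched matrix-multiplication steps implement efficiently on the GPU.

```python
import numpy as np

# Sketch: FFT the inputs and kernels, do one complex matrix multiply over the
# channel dimensions per frequency bin (here a single einsum), inverse FFT.
rng = np.random.default_rng(1)
batch, in_ch, out_ch, n, k = 2, 3, 4, 16, 5
x = rng.standard_normal((batch, in_ch, n, n))
w = rng.standard_normal((out_ch, in_ch, k, k))

X = np.fft.rfft2(x, (n, n))          # (batch, in_ch, f1, f2)
W = np.fft.rfft2(w, (n, n))          # (out_ch, in_ch, f1, f2)

# conj(W) turns the circular convolution into the cross-correlation that
# ConvNet layers actually compute.
Y = np.einsum('bifg,oifg->bofg', X, np.conj(W))
y = np.fft.irfft2(Y, (n, n))[:, :, :n - k + 1, :n - k + 1]   # crop to valid

# Reference: direct cross-correlation, accumulated kernel tap by kernel tap.
ref = np.zeros_like(y)
for da in range(k):
    for db in range(k):
        ref += np.einsum('bixy,oi->boxy',
                         x[:, :, da:da + n - k + 1, db:db + n - k + 1],
                         w[:, :, da, db])
assert np.allclose(y, ref)
```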

6.-The problem is now bandwidth-bound rather than compute-bound. Bandwidth requirements can be shifted from memory to cache using HPC techniques.

7.-The achievable performance ceiling has been raised from being limited by GPU FLOPs to now being limited by memory bandwidth.
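
A roofline-style back-of-the-envelope makes the new ceiling concrete. The peak-FLOP and bandwidth figures below are hypothetical, roughly Kepler-class round numbers, not measurements from the paper:

```python
# Roofline sketch: once the FLOP count drops (the FFT path), attainable speed
# is capped by min(peak compute, DRAM bandwidth x arithmetic intensity).
peak_flops = 4.3e12   # hypothetical fp32 peak, FLOP/s
bandwidth = 288e9     # hypothetical DRAM bandwidth, bytes/s

def attainable(flops_per_byte):
    return min(peak_flops, bandwidth * flops_per_byte)

for intensity in (2, 8, 32, 128):   # FLOPs per byte moved to/from DRAM
    print(f"{intensity:3d} FLOP/byte -> {attainable(intensity) / 1e12:.2f} TFLOP/s")
```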

8.-Convolutions are represented in the Fourier domain by applying the FFT to the input tensors, performing the pointwise operations there, and transforming the result back.

9.-Intermediate buffers are used to store temporary values. Some are always necessary while others depend on whether explicit transposition is done.

10.-The initial implementation used the NVIDIA libraries (cuFFT, cuBLAS) and added autotuning to explore the design space of batched vs. iterated calls, FFT interpolation sizes, etc.
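
The autotuning idea can be sketched generically (hypothetical Python, not Facebook's tuner): time each candidate configuration once per layer shape and cache the winner. The candidate names and fields below are illustrative.

```python
import time

# Generic autotuner pattern (a sketch of the idea, not Facebook's tuner):
# time every candidate strategy once for a given layer shape, remember the
# winner, and reuse it for later calls with the same shape.
_best = {}  # layer shape -> fastest candidate configuration

def autotune(shape, candidates, run, trials=3):
    if shape not in _best:
        timings = {}
        for name, cfg in candidates.items():
            start = time.perf_counter()
            for _ in range(trials):
                run(cfg)                      # execute the layer with this config
            timings[name] = (time.perf_counter() - start) / trials
        _best[shape] = candidates[min(timings, key=timings.get)]
    return _best[shape]

# Example design space mirroring the talk: batched vs. iterated FFT calls and
# two interpolation (FFT) sizes.
candidates = {
    "batched_fft16":  {"batched": True,  "fft_size": 16},
    "batched_fft32":  {"batched": True,  "fft_size": 32},
    "iterated_fft32": {"batched": False, "fft_size": 32},
}
```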

11.-The autotuner also considered tradeoffs between efficiency and additional memory consumption for transposition and matrix multiplication (FBMM).

12.-The vendor FFT libraries are tuned for large HPC datasets, but ConvNets can work well with many small FFTs.

13.-A new fbfft implementation was developed to avoid the inefficiencies of the NVIDIA FFT library on many small transforms and to allow on-the-fly zero padding.

14.-The paper's fbfft implementation treated the GPU as a wide vector machine, but was still limited by the number of shuffle operations.

15.-Viewing the GPU as a wide vector machine is less effective due to the limited number of shuffles compared to multiplications.

16.-Memory consumption is a tradeoff between parallelism, avoiding collisions, efficiency, and reuse. It needs to be balanced against available memory.

17.-The paper's version used a 9x memory increase for the largest layer, and 3x when using only fbfft and FBMM. Tiling helps for large inputs.
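
A rough, purely illustrative estimate of where a small-integer blow-up comes from (the layer dimensions below are made up, not the paper's layers): the frequency-domain copies of the inputs, kernels, and outputs are complex-valued and padded to the FFT size, and explicit transposed copies for the matrix multiplication add further buffers on top.

```python
# Back-of-the-envelope buffer sizing (illustrative dimensions, float32 spatial
# tensors, complex64 frequency buffers, rfft over a 32x32 interpolation).
batch, in_ch, out_ch, n, k, fft = 128, 96, 256, 27, 5, 32

def mib(elems, bytes_per=4):
    return elems * bytes_per / 2**20

spatial = (mib(batch * in_ch * n * n)
           + mib(out_ch * in_ch * k * k)
           + mib(batch * out_ch * (n - k + 1) ** 2))

# A 2-D rfft keeps fft * (fft // 2 + 1) complex (8-byte) values per plane.
freq_plane = fft * (fft // 2 + 1)
frequency = mib((batch * in_ch + out_ch * in_ch + batch * out_ch) * freq_plane, 8)

print(f"spatial tensors  : {spatial:7.1f} MiB")
print(f"frequency buffers: {frequency:7.1f} MiB ({frequency / spatial:.1f}x the spatial tensors)")
```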

18.-The key insight is that ConvNets only need 16x16 or 32x32 FFTs; since kernels are zero-padded to the transform size, the cost is the same for kernel sizes up to 15x15.
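
A small sketch of why kernel size barely matters on this path (the 16x16 output tile and the power-of-two rule below are illustrative assumptions): the kernel is zero-padded to the interpolation size before the transform, so different kernel sizes share the same FFT size and hence the same transform cost.

```python
import numpy as np

def fft_size(tile, kernel):
    """Smallest power of two holding tile + kernel - 1 (avoids circular wrap)."""
    need = tile + kernel - 1
    return 1 << (need - 1).bit_length()

for kernel in (3, 5, 7, 9, 13):
    n = fft_size(16, kernel)            # hypothetical 16x16 output tile
    padded = np.zeros((n, n))
    padded[:kernel, :kernel] = 1.0      # zero-padded kernel
    spectrum = np.fft.rfft2(padded)     # same transform size for every kernel
    print(f"kernel {kernel:2d}x{kernel:<2d} -> FFT {n}x{n}, spectrum {spectrum.shape}")
```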

19.-Focus on the algorithm first, then optimize. The problem is now main memory bandwidth limited. Numbers presented are from a Dec 2014 implementation.

20.-Heatmaps compare the FFT-based convolution (cuFFT + cuBLAS) against cuDNN. For 3x3 kernels there are already regimes where the FFT approach wins, and more so for 5x5, 7x7, and 9x9.

21.-Very small kernel counts (<100) are latency limited rather than bandwidth limited and unlikely to benefit from the FFT approach.

22.-Latest numbers on Torch are based on the Dec 2014 implementation, not the most current work. Performance is decent but can be improved.

23.-On VGG, input size forces interpolation in a large Fourier basis, but tiling helps significantly in this case.

24.-Latest work includes tiled FFT, implicit padding, buffer reuse, memory management to reuse computed FFTs, and asynchronous implementation.

25.-Faster FFT is in progress and numbers will continue to improve. The main message is that while the problem is memory bandwidth limited, there is still room to grow.

26.-Using float16 could provide up to 2x performance improvement since the problem is memory bandwidth limited.
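
The 2x figure follows directly from the bandwidth bound, as a toy calculation shows (bandwidth and tensor sizes below are made-up illustrative numbers):

```python
# In a bandwidth-bound regime, runtime ~ bytes moved / DRAM bandwidth, so
# halving the element size halves the ideal runtime.
bandwidth = 288e9                       # hypothetical DRAM bandwidth, bytes/s
elements = 64 * 256 * 128 * 128 * 3     # hypothetical elements read + written per layer

for name, bytes_per_elem in (("float32", 4), ("float16", 2)):
    seconds = elements * bytes_per_elem / bandwidth
    print(f"{name}: {seconds * 1e3:.2f} ms (ideal, bandwidth-bound)")
```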

27.-Tiling allows decomposing a large convolution into smaller ones with lower memory footprint. Buffers can be reused or expanded to fit available memory.
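
A pure-NumPy sketch of the tiling decomposition (the per-tile convolution here is direct for brevity; in the actual implementation each tile would go through the small-FFT path, and the small per-tile buffers are what get reused):

```python
import numpy as np

# Tiling sketch: a large "valid" convolution computed tile by tile. Each
# output tile only needs an input patch of tile + kernel - 1 pixels, so the
# per-tile working buffers stay small and can be reused across tiles.
def conv_valid(x, w):
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for a in range(k):
        for b in range(k):
            out += w[a, b] * x[a:a + out.shape[0], b:b + out.shape[1]]
    return out

def conv_valid_tiled(x, w, tile=16):
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(0, out.shape[0], tile):
        for j in range(0, out.shape[1], tile):
            ti = min(tile, out.shape[0] - i)
            tj = min(tile, out.shape[1] - j)
            patch = x[i:i + ti + k - 1, j:j + tj + k - 1]   # tile plus halo
            out[i:i + ti, j:j + tj] = conv_valid(patch, w)  # small sub-problem
    return out

rng = np.random.default_rng(2)
x, w = rng.standard_normal((128, 128)), rng.standard_normal((5, 5))
assert np.allclose(conv_valid(x, w), conv_valid_tiled(x, w))
```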

28.-If an out-of-memory error is detected, the implementation can revert to a low memory consumption mode.
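
The fallback is a try-the-fast-path-first pattern; a hypothetical sketch (these function names are placeholders, not the library's API):

```python
def convolve_with_fallback(x, w, fast_path, low_memory_path):
    """Try the high-memory fast path; on an allocation failure, rerun the same
    convolution in a lower-memory (e.g. tiled / iterated) mode."""
    try:
        return fast_path(x, w)        # large transposed buffers, best throughput
    except MemoryError:
        return low_memory_path(x, w)  # smaller footprint, somewhat slower
```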

29.-The work has shifted the performance bottleneck from computation to memory bandwidth, but there are still opportunities for further optimization.

30.-Exciting future developments are anticipated, leveraging techniques like low-precision (float16) computation to further improve performance within the memory bandwidth constraints.

Knowledge Vault built by David Vivancos 2024