Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, Yann LeCun ICLR 2015 - Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

**Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:**

```mermaid
graph LR
classDef performance fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef convolutions fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef fourier fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef implementations fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef optimizations fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Nicolas Vasilache et al. ICLR 2015] --> B[Facebook improves convolutional neural networks. 1]
A --> C[Convolutions computationally expensive, justify GPUs. 2]
A --> D[Perform convolutions efficiently using FFT. 3]
D --> E[Convolutions become pointwise multiplications in Fourier. 4]
A --> F[Contributions: FFTs, autotuning, optimized implementations. 5]
A --> G[Problem now bandwidth-bound, not compute-bound. 6]
G --> H[Performance ceiling raised to memory bandwidth. 7]
A --> I[Fourier convolutions: FFT, operations, transform back. 8]
I --> J[Intermediate buffers store temporary values. 9]
A --> K[Initial implementation: NVIDIA libraries, autotuning. 10]
K --> L[Autotuner considered efficiency vs memory tradeoffs. 11]
A --> M[Vendor FFT tuned for large datasets. 12]
M --> N[New fbfft for small, efficient FFTs. 13]
N --> O[fbfft treated GPU as vector machine. 14]
O --> P[Vector machine less effective, limited shuffles. 15]
A --> Q[Memory tradeoff: parallelism, collisions, efficiency, reuse. 16]
Q --> R[9x memory increase largest layer, tiling helps. 17]
A --> S[ConvNets need small FFTs, same cost. 18]
A --> T[Optimize algorithm first, then implementation. 19]
A --> U[cuFFT+cuBLAS vs cuDNN heatmaps. 20]
U --> V[Small kernel counts latency limited. 21]
A --> W[Torch numbers from Dec 2014. 22]
A --> X[VGG input size forces interpolation, tiling helps. 23]
A --> Y[Latest work: tiled FFT, padding, reuse. 24]
Y --> Z[Faster FFT in progress, numbers improving. 25]
A --> AA[Float16 could provide 2x improvement. 26]
A --> AB[Tiling decomposes large convolutions, reuses buffers. 27]
AB --> AC[Out-of-memory error reverts to low memory mode. 28]
A --> AD[Bottleneck shifted to memory bandwidth, room for optimization. 29]
A --> AE[Future developments: low-precision, memory bandwidth constraints. 30]
class B,C performance;
class D,E,I,J fourier;
class F,K,L,M,N,O,P,T,W,X,Y,Z,AA,AB,AC,AD,AE implementations;
class G,H,Q,R,S optimizations;
class U,V convolutions;
```


**Resume:**

**1.-**Facebook has been working on improving the performance of convolutional neural networks, focusing on making the convolutions more efficient.

**2.-**Convolutions are computationally expensive, taking up over 80% of the time in convolutional nets. They are the main justification for using GPUs.

**3.-**The approach is to perform convolutions more efficiently in the Fourier basis using the Fast Fourier Transform (FFT).

**4.-**In the Fourier domain, convolutions become pointwise multiplications, reducing the computational cost from O(n^2) for direct convolution to O(n log n) with the FFT.
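The pointwise-multiplication property can be checked directly. A minimal 1-D NumPy sketch (my illustration, not the paper's CUDA code):

```python
# Sketch: convolution through the Fourier domain becomes a pointwise
# product, costing O(n log n) instead of O(n^2) for direct convolution.
import numpy as np

def fft_conv(x, k):
    """Linear 1-D convolution computed via forward FFT, pointwise
    multiply, and inverse FFT."""
    n = len(x) + len(k) - 1          # full output length, avoids wrap-around
    X = np.fft.rfft(x, n)            # forward transforms (implicit zero-pad)
    K = np.fft.rfft(k, n)
    return np.fft.irfft(X * K, n)    # pointwise multiply, transform back

x, k = np.random.rand(32), np.random.rand(5)
assert np.allclose(fft_conv(x, k), np.convolve(x, k))  # matches direct conv
```

The same identity underlies the 2-D case used for convolution layers, with 2-D transforms in place of the 1-D ones.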

**5.-**Main contributions: decomposing convolutions into FFTs, transposes, and matrix multiplications; adding autotuning; and developing higher-performance FFT and matrix-multiplication implementations.
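The decomposition can be sketched end to end in NumPy (a reconstruction under my own naming, not the paper's kernels): 2-D FFTs of inputs and weights, a transpose to a frequency-major layout, one complex matrix multiplication per frequency bin, and an inverse transform. This sketch computes a circular convolution summed over input channels.

```python
# Hedged sketch of the FFT + transpose + matmul decomposition.
import numpy as np

def fourier_conv(inp, w):
    """inp: (batch, in_ch, h, h); w: (out_ch, in_ch, kh, kw).
    Circular convolution via per-frequency complex matmuls."""
    b, ci, h, _ = inp.shape
    I = np.fft.rfft2(inp, s=(h, h))     # (b, ci, h, h//2+1)
    W = np.fft.rfft2(w, s=(h, h))       # (co, ci, h, h//2+1)
    # Transpose so each frequency bin is a contiguous matmul operand.
    If = I.transpose(2, 3, 0, 1)        # (h, f, b, ci)
    Wf = W.transpose(2, 3, 1, 0)        # (h, f, ci, co)
    Of = If @ Wf                        # batched (b,ci)x(ci,co) matmuls
    O = Of.transpose(2, 3, 0, 1)        # back to (b, co, h, f)
    return np.fft.irfft2(O, s=(h, h))   # (b, co, h, h)

out = fourier_conv(np.random.rand(2, 3, 8, 8), np.random.rand(4, 3, 3, 3))
```

Each frequency bin yields an independent (batch x in_ch) by (in_ch x out_ch) multiplication, which is why the problem maps well onto batched GEMM.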

**6.-**The problem is now bandwidth-bound rather than compute-bound. Bandwidth requirements can be shifted from memory to cache using HPC techniques.

**7.-**The achievable performance ceiling has been raised from being limited by GPU FLOPs to now being limited by memory bandwidth.

**8.-**Convolutions are represented in the Fourier domain by applying FFT to input tensors, performing operations, and transforming back.

**9.-**Intermediate buffers are used to store temporary values. Some are always necessary while others depend on whether explicit transposition is done.

**10.-**The initial implementation used NVIDIA libraries and added autotuning to explore the design space of batched vs iterated calls, FFT interpolation, etc.
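As a toy illustration of the idea (my sketch, not the actual tuner), an autotuner times each candidate strategy on the real problem size and keeps the fastest:

```python
# Toy autotuning sketch: benchmark each candidate convolution strategy
# on the actual problem size and return the fastest one.
import time
import numpy as np

def fft_conv(x, k):
    n = len(x) + len(k) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

def autotune(candidates, x, k, reps=3):
    timings = {}
    for name, fn in candidates.items():
        t0 = time.perf_counter()
        for _ in range(reps):
            fn(x, k)
        timings[name] = time.perf_counter() - t0
    return min(timings, key=timings.get)   # cache this choice per layer shape

candidates = {"direct": np.convolve, "fft": fft_conv}
best = autotune(candidates, np.random.rand(4096), np.random.rand(31))
```

In the real system the candidates would be batched vs. iterated library calls, interpolation sizes, and so on, with the winning configuration cached per layer shape.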

**11.-**The autotuner also considered tradeoffs between efficiency and additional memory consumption for transposition and matrix multiplication (FBMM).

**12.-**The vendor FFT libraries are tuned for large HPC datasets, but ConvNets can work well with many small FFTs.

**13.-**A new fbfft implementation was developed to reduce inefficiencies in the NVIDIA FFT library and allow on-the-fly zero padding.

**14.-**The paper's fbfft implementation treated the GPU as a wide vector machine, but was still limited by the number of shuffle operations.

**15.-**Viewing the GPU as a wide vector machine is less effective due to the limited number of shuffles compared to multiplications.

**16.-**Memory consumption is a tradeoff between parallelism, avoiding collisions, efficiency, and reuse. It needs to be balanced against available memory.

**17.-**The paper's version used a 9x memory increase for the largest layer, 3x when using only fbfft and FBMM. Tiling helps for large inputs.

**18.-**The key insight is that ConvNets only need 16x16 or 32x32 FFTs, with the cost being the same for kernel sizes up to 15x15.
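The "same cost" claim follows from zero-padding: every kernel is padded to the fixed transform size, so FFT work depends only on that size, not on the kernel. A small NumPy check (my sketch; the 16x16/32x32 sizes are those mentioned in the talk):

```python
# Every kernel is zero-padded to the same transform size, so the FFT
# cost is identical for 3x3 through 15x15 kernels.
import numpy as np

fft_size = 32
for k in (3, 5, 9, 15):
    kern = np.random.rand(k, k)
    K = np.fft.rfft2(kern, s=(fft_size, fft_size))   # implicit zero-padding
    # Transform shape, and hence FFT cost, does not depend on k.
    assert K.shape == (fft_size, fft_size // 2 + 1)
```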

**19.-**Focus on the algorithm first, then optimize. The problem is now main memory bandwidth limited. Numbers presented are from a Dec 2014 implementation.

**20.-**Heatmaps compare cuFFT+cuBLAS vs cuDNN. For 3x3 kernels there are already regimes where the FFT approach wins, and more so for 5x5, 7x7, and 9x9.

**21.-**Very small kernel counts (<100) are latency limited rather than bandwidth limited and unlikely to benefit from the FFT approach.

**22.-**Latest numbers on Torch are based on the Dec 2014 implementation, not the most current work. Performance is decent but can be improved.

**23.-**On VGG, input size forces interpolation in a large Fourier basis, but tiling helps significantly in this case.

**24.-**Latest work includes tiled FFT, implicit padding, buffer reuse, memory management to reuse computed FFTs, and asynchronous implementation.

**25.-**Faster FFT is in progress and numbers will continue to improve. The main message is that while the problem is memory bandwidth limited, there is still room to grow.

**26.-**Using float16 could provide up to 2x performance improvement since the problem is memory bandwidth limited.
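The 2x estimate is simple byte counting: float16 halves the bytes per element, and a bandwidth-bound kernel's runtime scales with bytes moved. A quick check (illustrative tensor shape, not from the paper):

```python
# Float16 halves the memory traffic of a bandwidth-bound kernel.
import numpy as np

# A typical activation tensor: batch x channels x height x width.
acts32 = np.zeros((128, 64, 32, 32), dtype=np.float32)
acts16 = acts32.astype(np.float16)
assert acts16.nbytes * 2 == acts32.nbytes   # half the bytes per tensor
```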

**27.-**Tiling allows decomposing a large convolution into smaller ones with lower memory footprint. Buffers can be reused or expanded to fit available memory.
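One standard way to realize this decomposition is overlap-add; a 1-D NumPy sketch (my illustration of the technique, not the paper's 2-D tiled implementation):

```python
# Overlap-add: decompose one large convolution into small FFT
# convolutions over tiles, each with a small memory footprint.
import numpy as np

def tiled_fft_conv(x, k, tile=16):
    n_out = len(x) + len(k) - 1
    out = np.zeros(n_out)
    fft_len = tile + len(k) - 1           # per-tile transform size
    for start in range(0, len(x), tile):
        seg = x[start:start + tile]
        seg_out = np.fft.irfft(
            np.fft.rfft(seg, fft_len) * np.fft.rfft(k, fft_len), fft_len)
        # Accumulate the overlapping tails into the output.
        out[start:start + fft_len] += seg_out[:min(fft_len, n_out - start)]
    return out

x, k = np.random.rand(100), np.random.rand(7)
assert np.allclose(tiled_fft_conv(x, k), np.convolve(x, k))
```

The per-tile buffers are small and can be reused across tiles, which is where the memory savings come from.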

**28.-**If an out-of-memory error is detected, the implementation can revert to a low memory consumption mode.
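A hedged sketch of that fallback pattern (the names are mine): attempt the buffer-hungry fast path, and revert to a minimal-memory path if allocation fails.

```python
# Fallback pattern: fast path with large temporary buffers, low-memory
# path when allocation fails.
import numpy as np

def conv_with_fallback(x, k):
    try:
        n = len(x) + len(k) - 1
        # Fast path: FFT convolution, needs extra frequency-domain buffers.
        return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
    except MemoryError:
        # Low-memory mode: direct convolution, minimal extra allocation.
        return np.convolve(x, k)

x, k = np.random.rand(64), np.random.rand(9)
assert np.allclose(conv_with_fallback(x, k), np.convolve(x, k))
```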

**29.-**The work has shifted the performance bottleneck from computation to memory bandwidth, but there are still opportunities for further optimization.

**30.-**Exciting future developments are anticipated, leveraging techniques like low-precision (float16) computation to further improve performance within the memory bandwidth constraints.

Knowledge Vault built by David Vivancos, 2024