Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-Facebook has been working on improving the performance of convolutional neural networks, focusing on making the convolutions more efficient.
2.-Convolutions are computationally expensive, taking up over 80% of the time in convolutional nets. They are the main justification for using GPUs.
3.-The approach is to perform convolutions more efficiently in the Fourier basis using the Fast Fourier Transform (FFT).
4.-In the Fourier domain, convolutions become pointwise multiplications, reducing the computational cost from O(N^2) to O(N log N) when using the FFT (see the FFT-convolution sketch after this list).
5.-Main contributions: decomposing convolutions into FFTs, transposes, and matrix multiplications (see the decomposition sketch after this list); adding auto-tuning; and developing higher-performance FFT and matrix multiplication implementations.
6.-The problem is now bandwidth-bound rather than compute-bound. Bandwidth requirements can be shifted from memory to cache using HPC techniques.
7.-The achievable performance ceiling has been raised from being limited by GPU FLOPs to now being limited by memory bandwidth.
8.-Convolutions are computed in the Fourier domain by applying the FFT to the input tensors, performing pointwise multiplications, and transforming back.
9.-Intermediate buffers are used to store temporary values. Some are always necessary while others depend on whether explicit transposition is done.
10.-The initial implementation used the NVIDIA libraries and added autotuning to explore the design space of batched vs. iterated calls, FFT interpolation size, etc. (see the autotuning sketch after this list).
11.-The autotuner also considered tradeoffs between efficiency and additional memory consumption for transposition and matrix multiplication (FBMM).
12.-The vendor FFT libraries are tuned for large HPC datasets, but ConvNets can work well with many small FFTs.
13.-A new FBFFT implementation was developed to reduce inefficiencies in the NVIDIA FFT library and allow on-the-fly zero padding.
14.-The paper's FBFFT implementation treated the GPU as a wide vector machine, but was still limited by the number of shuffle operations.
15.-Viewing the GPU as a wide vector machine is less effective due to the limited number of shuffles compared to multiplications.
16.-Memory consumption is a tradeoff between parallelism, avoiding collisions, efficiency, and reuse. It needs to be balanced against available memory.
17.-The paper's version incurred a 9x memory increase for the largest layer, and 3x when using only FBFFT and FBMM. Tiling helps for large inputs.
18.-The key insight is that ConvNets only need 16x16 or 32x32 FFTs, with the cost being the same for kernel sizes up to 15x15.
19.-Focus on the algorithm first, then optimize. The problem is now main memory bandwidth limited. Numbers presented are from a Dec 2014 implementation.
20.-Heatmaps compare cuFFT+cuBLAS vs. cuDNN. For 3x3 kernels there are already regimes where cuFFT+cuBLAS wins, more so for 5x5, 7x7, and 9x9.
21.-Very small kernel counts (<100) are latency limited rather than bandwidth limited and unlikely to benefit from the FFT approach.
22.-Latest numbers on Torch are based on the Dec 2014 implementation, not the most current work. Performance is decent but can be improved.
23.-On VGG, input size forces interpolation in a large Fourier basis, but tiling helps significantly in this case.
24.-Latest work includes tiled FFT, implicit padding, buffer reuse, memory management to reuse computed FFTs, and asynchronous implementation.
25.-Faster FFT is in progress and numbers will continue to improve. The main message is that while the problem is memory bandwidth limited, there is still room to grow.
26.-Using float16 could provide up to 2x performance improvement since the problem is memory bandwidth limited.
27.-Tiling allows decomposing a large convolution into smaller ones with a lower memory footprint (see the tiling sketch after this list). Buffers can be reused or expanded to fit available memory.
28.-If an out-of-memory error is detected, the implementation can revert to a low-memory-consumption mode (see the fallback sketch after this list).
29.-The work has shifted the performance bottleneck from computation to memory bandwidth, but there are still opportunities for further optimization.
30.-Exciting future developments are anticipated, leveraging techniques like low-precision (float16) computation to further improve performance within the memory bandwidth constraints.
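The sketch below illustrates items 3-6: the same 2D convolution computed directly and via a pointwise multiplication in the Fourier domain. It is a minimal NumPy illustration of the identity, assuming the cross-correlation convention used by ConvNets; it is not the Facebook CUDA code, and the sizes are illustrative.

import numpy as np
from numpy.fft import rfft2, irfft2

def conv2d_direct(x, w):
    # Valid-mode 2D cross-correlation, the ConvNet convention.
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)
    return out

def conv2d_fft(x, w):
    # Same result via the FFT: two transforms plus a pointwise product.
    H, W = x.shape
    kH, kW = w.shape
    Xf = rfft2(x, s=(H, W))
    # Cross-correlation is convolution with a flipped kernel, so take the
    # conjugate of the (zero-padded) kernel spectrum instead of flipping.
    Wf = rfft2(w, s=(H, W))
    y = irfft2(Xf * np.conj(Wf), s=(H, W))
    return y[:H - kH + 1, :W - kW + 1]

x = np.random.randn(32, 32)
w = np.random.randn(5, 5)
print(np.allclose(conv2d_direct(x, w), conv2d_fft(x, w)))  # True

The transform cost does not depend on the kernel size, which is why item 18 reports the same cost for every kernel up to 15x15 once a 16x16 or 32x32 Fourier basis is chosen.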
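Items 5, 8, and 10 describe the decomposition itself: batched FFTs of inputs and zero-padded weights, a transpose so every frequency bin exposes one small matrix per operand, one complex matrix multiplication per bin (the contraction over input channels), and inverse FFTs back to the spatial domain. The NumPy sketch below only shows the shape bookkeeping under assumed tensor layouts; it is not the FBFFT/FBMM API.

import numpy as np
from numpy.fft import rfft2, irfft2

def fft_conv_layer(x, w):
    # x: (batch, in_ch, H, W), w: (out_ch, in_ch, kH, kW) -> valid cross-correlation.
    B, C, H, W = x.shape
    F, _, kH, kW = w.shape
    Xf = rfft2(x, s=(H, W))                       # batched FFTs: (B, C, H, W//2+1)
    Wf = rfft2(w, s=(H, W))                       # zero-padded weights: (F, C, H, W//2+1)
    Xf = Xf.transpose(2, 3, 0, 1)                 # frequency bin outermost: (H, W//2+1, B, C)
    Wf = Wf.transpose(2, 3, 0, 1)                 # (H, W//2+1, F, C)
    # One small complex matrix multiplication per frequency bin; the input
    # channels are the contraction dimension.
    Yf = Xf @ np.conj(Wf).transpose(0, 1, 3, 2)   # (H, W//2+1, B, F)
    y = irfft2(Yf.transpose(2, 3, 0, 1), s=(H, W))  # back to (B, F, H, W)
    return y[:, :, :H - kH + 1, :W - kW + 1]

x = np.random.randn(2, 3, 16, 16)
w = np.random.randn(4, 3, 3, 3)
print(fft_conv_layer(x, w).shape)                 # (2, 4, 14, 14)

The intermediate buffers of item 9 correspond to Xf, Wf, and Yf here; whether the transposes are materialized explicitly is exactly the tradeoff items 9 and 11 refer to.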
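Items 10 and 11 rely on autotuning: time each candidate configuration (batched vs. iterated calls, interpolation size, extra transposition buffers) on the actual problem size, cache the winner, and reuse it. A minimal sketch of that loop, assuming a hypothetical run(candidate, shape) callable that executes one configuration:

import time
import numpy as np

_best_strategy = {}  # problem-size signature -> fastest candidate seen so far

def autotune(shape, candidates, run, trials=3):
    key = tuple(shape)
    if key in _best_strategy:
        return _best_strategy[key]
    timings = {}
    for cand in candidates:
        best = float("inf")
        for _ in range(trials):
            t0 = time.perf_counter()
            run(cand, shape)                      # execute one candidate configuration
            best = min(best, time.perf_counter() - t0)
        timings[cand] = best                      # keep the best of a few trials
    _best_strategy[key] = min(timings, key=timings.get)
    return _best_strategy[key]

# Illustrative use: pick an FFT interpolation size for a 27x27 input.
candidates = [32, 64]
run = lambda n, shape: np.fft.rfft2(np.zeros(shape), s=(n, n))
print(autotune((27, 27), candidates, run))

A real autotuner would also weigh the extra memory each candidate consumes (item 11), not just its speed.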
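Items 23 and 27 depend on tiling: a large input is split into overlapping tiles so each piece fits a small Fourier basis (16x16 or 32x32 per item 18) and the memory footprint stays low. A NumPy sketch of the idea, with an illustrative tile size; neighbouring tiles overlap by the kernel size minus one so their valid outputs abut exactly:

import numpy as np

def conv2d_fft(x, w):
    # Valid cross-correlation via one FFT pass (as in the first sketch).
    H, W = x.shape
    Xf = np.fft.rfft2(x, s=(H, W))
    Wf = np.fft.rfft2(w, s=(H, W))
    y = np.fft.irfft2(Xf * np.conj(Wf), s=(H, W))
    return y[:H - w.shape[0] + 1, :W - w.shape[1] + 1]

def conv2d_tiled(x, w, tile=32):
    # Same result, computed tile by tile so only small FFTs are ever needed.
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    step_h, step_w = tile - kH + 1, tile - kW + 1  # valid outputs per tile
    for i in range(0, out.shape[0], step_h):
        for j in range(0, out.shape[1], step_w):
            yi = conv2d_fft(x[i:i + tile, j:j + tile], w)
            out[i:i + yi.shape[0], j:j + yi.shape[1]] = yi
    return out

x = np.random.randn(224, 224)                     # VGG-sized plane (item 23)
w = np.random.randn(3, 3)
print(np.allclose(conv2d_tiled(x, w), conv2d_fft(x, w)))  # True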
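Item 28's fallback can be expressed as a simple guard: attempt the fast, memory-hungry path first and revert to a low-memory (e.g. tiled, fewer-buffer) path when allocation fails. Both path functions below are hypothetical placeholders, and a GPU runtime would surface its own allocation error rather than Python's MemoryError.

def convolve_with_fallback(x, w, fast_path, low_memory_path):
    # Try the configuration chosen by the autotuner; fall back on OOM.
    try:
        return fast_path(x, w)         # full buffers, maximum reuse
    except MemoryError:
        return low_memory_path(x, w)   # tiled / reduced-buffer mode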
Knowledge Vault built by David Vivancos 2024