Knowledge Vault 2/86 - ICLR 2014-2023
Kunle Olukotun ICLR 2022 - Invited Talk - Accelerating AI Systems: Let the Data Flow!
<Resume Image >

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef computation fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef machinelearning fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef performance fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef models fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef architecture fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef compiler fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef applications fill:#f9d4d4, font-weight:bold, font-size:14px;
A[Kunle Olukotun ICLR 2022] --> B[Moore's Law slowdown limits computation 1]
A --> C[ML ubiquitous, insatiable computing demands 2]
A --> G[Trend: complex models, higher accuracy 6]
A --> K[Convergence of training and inference 10]
A --> M[ML models use high-level dataflow DSLs 12]
A --> P[Future ML needs: efficient compute, large models 15]
B --> D[100-1000x performance needed despite slowdown 3]
D --> E[High performance, efficiency, programmability required 4]
E --> F[Dataflow computing proposed as solution 5]
G --> H[Transformer models doubling every 2.5 months 7]
H --> I[Sparse models researched for efficiency 8]
I --> J[Pixelated butterfly: sparsity + block compute 9]
K --> L[Continuous retraining, adapting to drift 11]
M --> N[Domain-specific operators decompose to parallel patterns 13]
N --> O[Parallel patterns optimize computation, data access 14]
P --> Q[Plasticine reconfigurable dataflow architecture developed 16]
Q --> R[SambaNova Systems implements RDU in SN10 17]
R --> S[SN10: compute, memory units, networks 18]
S --> T[Unroll PyTorch/TensorFlow for parallelism 19]
T --> U[Pattern units exploit parallelism, data supply 20]
T --> V[Spatial dataflow: graph layout, fusion, pipelining 21]
P --> W[SambaFlow compiler: PyTorch/TensorFlow to RDU 22]
W --> X[Spatial dataflow 2-6x better than TPUs 23]
X --> Y[Pixelated butterfly + spatial dataflow 2x boost 24]
P --> Z[RDU exploits parallelism, 20x sparse speedup 25]
Z --> AA[RDU: high capacity without high-bandwidth memory 26]
AA --> AB[RDU trains large models with fewer chips 27]
P --> AC[RDU enables faster drug discovery learning 28]
AC --> AD[RDU: higher resolution, accuracy in vision 29]
AD --> AE[RDU: 20x better inference throughput, latency 30]
class A,B,D,E,F computation;
class C,G,H,I,J,K,L machinelearning;
class M,N,O,P,Q,R,S,T,U,V,W,X,Y architecture;
class Z,AA,AB,AC,AD,AE applications;


1.-Moore's Law is slowing down, computation is increasingly limited by power, and conventional CPU-based systems are no longer sufficient.

2.-Machine learning is being used in various aspects of society, leading to insatiable computing demands for training and serving models.

3.-The challenge is to achieve a hundredfold to thousandfold performance improvement on ML applications despite the Moore's Law slowdown.

4.-The solution should achieve high performance, efficiency, and programmability, with processor-like flexibility and ASIC-like efficiency.

5.-Dataflow computing is proposed as the answer to meet these requirements.

6.-The overwhelming trend in ML is building more complex models with higher accuracies, exemplified by large language models.

7.-The size of transformer-based models is doubling every 2.5 months, with models reaching a trillion parameters, but training them is inefficient.
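
As a quick sanity check on that rate, a minimal Python sketch of how the quoted doubling compounds over a year:

```python
# Back-of-the-envelope check of the growth rate quoted above:
# doubling every 2.5 months compounds to roughly 28x per year.
months = 12
doublings = months / 2.5        # 4.8 doublings in a year
growth = 2 ** doublings         # about 27.9x
print(f"~{growth:.0f}x model-size growth over {months} months")
```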

8.-Sparse models are being researched to achieve smaller memory and compute requirements while maintaining accuracy.

9.-The pixelated butterfly technique combines butterfly sparsity patterns with block computation for efficient hardware utilization.
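
A minimal sketch of the block-butterfly idea (a simplified support pattern, not the actual Pixelated Butterfly implementation; the function names here are assumptions for illustration):

```python
import numpy as np

def butterfly_support(n):
    """Simplified butterfly connectivity for an n x n matrix (n a power of 2):
    entry (i, j) is kept iff i == j or i and j differ in exactly one bit."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, i] = True
        for bit in range(n.bit_length() - 1):
            mask[i, i ^ (1 << bit)] = True
    return mask

def pixelated_mask(n, block=4):
    """'Pixelate' the butterfly support: expand each kept entry into a
    block x block tile so the sparsity aligns with block compute units."""
    base = butterfly_support(n).astype(np.int8)
    tile = np.ones((block, block), dtype=np.int8)
    return np.kron(base, tile).astype(bool)

mask = pixelated_mask(8, block=4)             # 32 x 32 block-sparse mask
print(mask.shape, f"{mask.mean():.2f} density")
```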

10.-There is a convergence of training and inference, making it possible to serve the same model that was trained, without requalification.

11.-Continuous retraining becomes possible with a converged platform, adapting to distribution drift in the inference data.
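
A minimal sketch of what such a converged serve-and-retrain loop could look like, assuming a toy linear model and an invented mean-shift drift test (none of these helpers come from the talk):

```python
import numpy as np

def feature_drift(reference, recent, threshold=0.5):
    """Toy drift signal: compare feature means of recent traffic against the
    training-time reference distribution (illustrative only)."""
    return np.abs(recent.mean(axis=0) - reference.mean(axis=0)).max() > threshold

def serve_and_adapt(weights, reference, stream, lr=0.05):
    """Converged inference + training loop: predict on each incoming batch,
    and fine-tune on recent data when the input distribution drifts."""
    buffer = []
    for x, y in stream:
        _ = x @ weights                          # serve predictions
        buffer.append((x, y))
        recent = np.concatenate([bx for bx, _ in buffer])
        if feature_drift(reference, recent):
            for bx, by in buffer:                # brief retraining pass
                grad = bx.T @ (bx @ weights - by) / len(bx)
                weights = weights - lr * grad
            reference, buffer = recent, []
    return weights

# Example: a stream whose inputs slowly shift away from the reference data.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, (256, 8))
w = rng.normal(0.0, 0.1, (8, 1))
stream = [(rng.normal(0.2 * t, 1.0, (32, 8)), rng.normal(0.0, 1.0, (32, 1)))
          for t in range(10)]
w = serve_and_adapt(w, ref, stream)
```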

12.-ML models are developed using high-level domain-specific language frameworks like PyTorch and TensorFlow, representing dataflow computation graphs.
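
To see what such a dataflow computation graph looks like, here is a small sketch using torch.fx's symbolic tracer as a stand-in for a graph-capturing front end (torch.fx is only an illustration of the representation, not the SambaFlow toolchain):

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# symbolic_trace records the forward pass as an explicit dataflow graph of
# placeholder, call_module, call_function, and output nodes; this is the kind
# of graph-level representation a dataflow compiler consumes.
traced = symbolic_trace(TinyModel())
print(traced.graph)
```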

13.-Domain-specific operators can be decomposed into hierarchical parallel patterns that can be optimized for different hardware architectures.

14.-Parallel patterns such as map, reduce, and groupBy describe both the parallel computation and the data access, enabling performance optimization.
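
A minimal sketch of this decomposition, expressing softmax purely as map and reduce patterns (illustrative Python, not a dataflow compiler IR):

```python
import math
from functools import reduce

# A domain-specific operator (softmax) decomposed into hierarchical parallel
# patterns: map steps are independent per-element work, reduce steps combine
# values with an associative operator, so both expose parallelism.
def softmax(xs):
    m = reduce(max, xs)                                  # reduce: max
    exps = list(map(lambda x: math.exp(x - m), xs))      # map: shift + exp
    total = reduce(lambda a, b: a + b, exps)             # reduce: sum
    return list(map(lambda e: e / total, exps))          # map: normalize

print(softmax([1.0, 2.0, 3.0]))
```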

15.-Future ML models require massive energy-efficient compute, terabyte-sized models, efficient sparsity execution, and convergence of training and inference.

16.-The Plasticine reconfigurable dataflow architecture was developed to efficiently execute parallel patterns using dedicated compute and memory units.

17.-SambaNova Systems was founded to implement the reconfigurable dataflow architecture, resulting in the SN10 chip with substantial compute and memory capabilities.

18.-The SN10 chip has a checkerboard of compute and memory units, wide data paths, and static/dynamic networks to efficiently move data.
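
A toy rendering of the checkerboard idea (the grid size here is invented for illustration and is not the actual SN10 layout):

```python
# Alternating pattern compute units (PCU) and pattern memory units (PMU)
# on a grid, mimicking the checkerboard arrangement described above.
rows, cols = 4, 8
fabric = [["PCU" if (r + c) % 2 == 0 else "PMU" for c in range(cols)]
          for r in range(rows)]
for row in fabric:
    print(" ".join(row))
```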

19.-The goal is to take PyTorch/TensorFlow models and unroll them in space to exploit vector, pipeline, and spatial stream parallelism.

20.-The pattern compute unit exploits vector and pipeline parallelism, while pattern memory units provide high-bandwidth data supply and transformations.
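
A rough software analogy for these two forms of parallelism, assuming streamed tiles and generator-chained stages (illustrative Python, not the hardware behavior):

```python
import numpy as np

# Vector parallelism lives inside each stage (whole-tile numpy ops), while
# chaining generator stages mimics a pipeline in which the downstream stage
# starts working as soon as a tile is produced upstream.
def matmul_stage(tiles, w):
    for t in tiles:
        yield t @ w                      # vectorized compute within the stage

def relu_stage(tiles):
    for t in tiles:
        yield np.maximum(t, 0.0)         # consumes tiles as they stream in

tiles = (np.random.rand(64, 128) for _ in range(16))
w = np.random.rand(128, 128)
outputs = list(relu_stage(matmul_stage(tiles, w)))
print(len(outputs), outputs[0].shape)    # 16 tiles of shape (64, 128)
```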

21.-Spatial dataflow improves execution by laying out the computation graph in space, enabling on-chip kernel fusion and meta-pipelining.
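
A minimal software analogy for kernel fusion, using torch.compile (an assumption for illustration only) as a stand-in for the idea that adjacent kernels can be fused so intermediates are not materialized; the RDU achieves this spatially on chip:

```python
import torch

w1 = torch.randn(512, 512)
w2 = torch.randn(512, 512)

def mlp_block(x):
    a = x @ w1               # kernel-by-kernel: each op materializes its output
    b = torch.relu(a)
    return b @ w2

# torch.compile fuses adjacent operations where it can, so the relu need not
# round-trip through memory between the two matrix multiplies.
fused_block = torch.compile(mlp_block)

x = torch.randn(256, 512)
print((mlp_block(x) - fused_block(x)).abs().max())   # small numerical difference
```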

22.-The SambaFlow compiler takes PyTorch/TensorFlow and generates a performant mapping to the RDU, optimizing kernels in time and space.

23.-Spatial dataflow provides 2-6x improvement over TPUs on various ML algorithms due to fine-grained datapath, scheduling, and fusion.

24.-The pixelated butterfly approach with spatial dataflow can provide 2x improvement on image classification and language modeling.

25.-RDU can exploit more parallelism than GPUs for sparse computation, with up to 20x performance improvement at smaller batch sizes.

26.-RDU's efficient dataflow compute needs minimal off-chip memory bandwidth, enabling high memory capacity (1.5 TB per chip) without high-bandwidth memory.

27.-RDU systems can train large language models with fewer chips, eliminating complex system engineering for efficient multi-chip usage.

28.-RDU's batch size flexibility enables faster learning for applications like drug discovery models.

29.-RDU's large memory and tiling capabilities enable higher resolution and accuracy for computer vision tasks, such as image analysis in neutrino physics.
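
A minimal sketch of the tiling idea, assuming a hypothetical tile_image helper and an invented 4096x4096 input (not the actual RDU mapping):

```python
import numpy as np

def tile_image(image, tile=512):
    """Split a high-resolution input into tiles that each fit on-chip; a
    simplified illustration of tiling, not the RDU's actual strategy."""
    h, w = image.shape[:2]
    return [((r, c), image[r:r + tile, c:c + tile])
            for r in range(0, h, tile)
            for c in range(0, w, tile)]

hi_res = np.zeros((4096, 4096), dtype=np.float32)   # e.g. one detector readout
tiles = tile_image(hi_res)
print(len(tiles), tiles[0][1].shape)                 # 64 tiles of 512 x 512
```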

30.-RDU provides 20x better inference throughput and latency compared to GPUs for applications like deep learning recommendation models.

Knowledge Vault built by David Vivancos 2024