Kunle Olukotun ICLR 2022 - Invited Talk - Accelerating AI Systems: Let the Data Flow!

**Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Advanced | Llama 3:**

```mermaid
graph LR
classDef computation fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef machinelearning fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef performance fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef models fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef architecture fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef compiler fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef applications fill:#f9d4d4, font-weight:bold, font-size:14px;
A[Kunle Olukotun<br>ICLR 2022] --> B[Moore's Law slowdown<br>limits computation 1]
A --> C[ML ubiquitous, insatiable<br>computing demands 2]
A --> G[Trend: complex models,<br>higher accuracy 6]
A --> K[Convergence of training<br>and inference 10]
A --> M[ML models use high-level<br>dataflow DSLs 12]
A --> P[Future ML needs: efficient<br>compute, large models 15]
B --> D[100-1000x performance needed<br>despite slowdown 3]
D --> E[High performance, efficiency,<br>programmability required 4]
E --> F[Dataflow computing proposed<br>as solution 5]
G --> H[Transformer models doubling<br>every 2.5 months 7]
H --> I[Sparse models researched<br>for efficiency 8]
I --> J[Pixelated butterfly:<br>sparsity + block compute 9]
K --> L[Continuous retraining,<br>adapting to drift 11]
M --> N[Domain-specific operators decompose<br>to parallel patterns 13]
N --> O[Parallel patterns optimize<br>computation, data access 14]
P --> Q[Plasticine reconfigurable dataflow<br>architecture developed 16]
Q --> R[SambaNova Systems implements<br>RDU in SN10 17]
R --> S[SN10: compute, memory<br>units, networks 18]
S --> T[Unroll PyTorch/TensorFlow<br>for parallelism 19]
T --> U[Pattern units exploit<br>parallelism, data supply 20]
T --> V[Spatial dataflow: graph layout,<br>fusion, pipelining 21]
P --> W[SambaFlow compiler: PyTorch/TensorFlow<br>to RDU 22]
W --> X[Spatial dataflow 2-6x<br>better than TPUs 23]
X --> Y[Pixelated butterfly + spatial<br>dataflow 2x boost 24]
P --> Z[RDU exploits parallelism,<br>20x sparse speedup 25]
Z --> AA[RDU: high capacity without<br>high-bandwidth memory 26]
AA --> AB[RDU trains large models<br>with fewer chips 27]
P --> AC[RDU enables faster<br>drug discovery learning 28]
AC --> AD[RDU: higher resolution,<br>accuracy in vision 29]
AD --> AE[RDU: 20x better inference<br>throughput, latency 30]
class A,B,D,E,F computation;
class C,G,H,I,J,K,L machinelearning;
class M,N,O,P,Q,R,S,T,U,V,W,X,Y architecture;
class Z,AA,AB,AC,AD,AE applications;
```


**Resume:**

**1.-**Moore's Law is slowing down, leaving computation power-limited, and conventional CPU-based systems are no longer sufficient.

**2.-**Machine learning is being used in various aspects of society, leading to insatiable computing demands for training and serving models.

**3.-**The challenge is to achieve a hundredfold to thousandfold performance improvement on ML applications despite the Moore's Law slowdown.

**4.-**The solution should achieve high performance, efficiency, and programmability, with processor-like flexibility and ASIC-like efficiency.

**5.-**Data flow computing is proposed as the answer to meet these requirements.

**6.-**The overwhelming trend in ML is building more complex models with higher accuracies, exemplified by large language models.

**7.-**The size of transformer-based models is doubling every 2.5 months, with models reaching a trillion parameters, but training them is inefficient.
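As a back-of-the-envelope check on that growth rate (an illustration of the compounding, not a figure from the talk), doubling every 2.5 months works out to roughly 28x per year:

```python
# Model size doubling every 2.5 months, compounded over one year.
doublings_per_year = 12 / 2.5                 # 4.8 doublings
annual_growth = 2 ** doublings_per_year
print(f"~{annual_growth:.0f}x growth per year")  # ~28x
```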

**8.-**Sparse models are being researched to achieve smaller memory and compute requirements while maintaining accuracy.

**9.-**The pixelated butterfly technique combines butterfly sparsity patterns with block computation for efficient hardware utilization.
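A minimal sketch of the butterfly sparsity idea behind this point, assuming a simplified flattened butterfly mask (the actual Pixelated Butterfly method combines block-aligned butterfly factors with a low-rank term, which this toy omits):

```python
import numpy as np

def flat_butterfly_mask(n):
    """Union of butterfly-factor patterns for an n x n matrix (n a power
    of 2): entry (i, j) is kept when j == i or j == i XOR 2^k for some
    stage k, mirroring the connectivity of an FFT-style butterfly."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, i] = True
        stride = 1
        while stride < n:
            mask[i, i ^ stride] = True
            stride *= 2
    return mask

m = flat_butterfly_mask(8)
# Each row keeps only 1 + log2(n) entries, so density falls as n grows.
print(m.astype(int))
print("density:", m.mean())  # 0.5 at n=8, ~0.11 at n=64
```

Aligning these nonzeros into contiguous blocks is what makes the pattern friendly to block-oriented hardware compute.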

**10.-**There is a convergence of training and inference, allowing the same model that was trained to be served without requalification.

**11.-**Continuous retraining becomes possible with a converged platform, adapting to distribution drift in the inference data.

**12.-**ML models are developed using high-level domain-specific language frameworks like PyTorch and TensorFlow, representing dataflow computation graphs.
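A toy illustration of the dataflow-graph representation (hypothetical names; real frameworks such as PyTorch build far richer graphs with tensors and autograd metadata):

```python
import operator

class Node:
    """One operator in a dataflow graph; edges are the input references."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        # Recursively evaluate predecessor nodes, then apply this op.
        # A dataflow compiler would instead map this graph onto hardware.
        args = [i.eval() if isinstance(i, Node) else i for i in self.inputs]
        return self.op(*args)

# (2 + 3) * 4 expressed as a graph rather than eager arithmetic
g = Node(operator.mul, Node(operator.add, 2, 3), 4)
print(g.eval())  # 20
```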

**13.-**Domain-specific operators can be decomposed into hierarchical parallel patterns that can be optimized for different hardware architectures.

**14.-**Parallel patterns like map, reduce, and groupBy describe both parallel computation and data access for performance optimization.
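These patterns can be sketched in plain Python on toy (key, value) data (purely illustrative; on a dataflow machine each pattern maps to spatial hardware rather than a sequential loop):

```python
from collections import defaultdict
from functools import reduce

records = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]  # hypothetical data

# map: independent per-element transform (trivially parallel)
squared = [(k, v ** 2) for k, v in records]

# groupBy: bucket elements by key (parallel across buckets)
groups = defaultdict(list)
for k, v in squared:
    groups[k].append(v)

# reduce: combine each bucket with an associative operator (tree-parallel)
sums = {k: reduce(lambda a, b: a + b, vs) for k, vs in groups.items()}
print(sums)  # {'a': 10, 'b': 20}
```

Because each pattern names both the computation and its data-access shape, a compiler can choose a hardware mapping per pattern.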

**15.-**Future ML models require massive energy-efficient compute, terabyte-sized models, efficient sparsity execution, and convergence of training and inference.

**16.-**The Plasticine reconfigurable dataflow architecture was developed to efficiently execute parallel patterns using dedicated compute and memory units.

**17.-**SambaNova Systems was founded to implement the reconfigurable dataflow architecture, resulting in the SN10 chip with substantial compute and memory capabilities.

**18.-**The SN10 chip has a checkerboard of compute and memory units, wide data paths, and static/dynamic networks to efficiently move data.

**19.-**The goal is to take PyTorch/TensorFlow models and unroll them in space to exploit vector, pipeline, and spatial stream parallelism.

**20.-**The pattern compute unit exploits vector and pipeline parallelism, while pattern memory units provide high-bandwidth data supply and transformations.

**21.-**Spatial dataflow improves execution by laying out the computation graph in space, enabling on-chip kernel fusion and meta-pipelining.
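A rough software analogy for the kernel-fusion part of this point (illustrative only; on-chip the fused pipeline runs spatially, not as a Python loop):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 8)

def unfused(x):
    # Three separate kernels, each materializing an intermediate buffer
    t1 = x * 2.0
    t2 = t1 + 1.0
    return np.maximum(t2, 0.0)

def fused(x):
    # Fused pipeline: each element streams through all three ops,
    # so no intermediate array is ever written back to memory
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = max(x[i] * 2.0 + 1.0, 0.0)
    return out

assert np.allclose(unfused(x), fused(x))
```

Meta-pipelining extends the same idea across whole kernels: downstream stages start consuming results while upstream stages are still producing them.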

**22.-**The SambaFlow compiler takes PyTorch/TensorFlow and generates a performant mapping to the RDU, optimizing kernels in time and space.

**23.-**Spatial dataflow provides 2-6x improvement over TPUs on various ML algorithms due to fine-grained datapath, scheduling, and fusion.

**24.-**The pixelated butterfly approach with spatial dataflow can provide 2x improvement on image classification and language modeling.

**25.-**RDU can exploit more parallelism than GPUs for sparse computation, with up to 20x performance improvement at smaller batch sizes.

**26.-**RDU's efficient dataflow compute and minimal off-chip memory-bandwidth requirements enable high capacity (1.5 TB per chip) without high-bandwidth memory.

**27.-**RDU systems can train large language models with fewer chips, eliminating complex system engineering for efficient multi-chip usage.

**28.-**RDU's batch size flexibility enables faster learning for applications like drug discovery models.

**29.-**RDU's large memory and tiling capabilities enable higher resolution and accuracy for computer vision tasks like neutrino physics.

**30.-**RDU provides 20x better inference throughput and latency compared to GPUs for applications like deep learning recommendation models.

Knowledge Vault built by David Vivancos 2024