Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Gemini Adv | Llama 3:
Resume:
1.-Moore's Law is slowing down and computation is increasingly limited by power, so conventional CPU-based systems are no longer sufficient.
2.-Machine learning is being used in various aspects of society, leading to insatiable computing demands for training and serving models.
3.-The challenge is to achieve a hundredfold to thousandfold performance improvement on ML applications despite the Moore's Law slowdown.
4.-The solution should achieve high performance, efficiency, and programmability, with processor-like flexibility and ASIC-like efficiency.
5.-Dataflow computing is proposed as the answer to meet these requirements.
6.-The overwhelming trend in ML is building more complex models with higher accuracies, exemplified by large language models.
7.-The size of transformer-based models is doubling every 2.5 months, with models reaching a trillion parameters, but training them is inefficient (a quick growth calculation follows the list).
8.-Sparse models are being researched to achieve smaller memory and compute requirements while maintaining accuracy.
9.-The pixelated butterfly technique combines butterfly sparsity patterns with block computation for efficient hardware utilization (see the block-mask sketch after this list).
10.-Training and inference are converging, so the same model that was trained can be served without requalification.
11.-Continuous retraining becomes possible with a converged platform, adapting to distribution drift in the inference data.
12.-ML models are developed in high-level, domain-specific frameworks like PyTorch and TensorFlow, which represent them as dataflow computation graphs (a tracing sketch follows the list).
13.-Domain-specific operators can be decomposed into hierarchical parallel patterns that can be optimized for different hardware architectures.
14.-Parallel patterns such as map, reduce, and group-by describe both the parallel computation and the data access needed for performance optimization (a decomposition sketch follows the list).
15.-Future ML models require massive energy-efficient compute, terabyte-sized models, efficient sparsity execution, and convergence of training and inference.
16.-The Plasticine reconfigurable dataflow architecture was developed to efficiently execute parallel patterns using dedicated compute and memory units.
17.-SambaNova Systems was founded to implement the reconfigurable dataflow architecture, resulting in the SN10 chip with substantial compute and memory capabilities.
18.-The SN10 chip has a checkerboard of compute and memory units, wide data paths, and static/dynamic networks to efficiently move data.
19.-The goal is to take PyTorch/TensorFlow models and unroll them in space to exploit vector, pipeline, and spatial stream parallelism.
20.-The pattern compute unit exploits vector and pipeline parallelism, while pattern memory units provide high-bandwidth data supply and transformations.
21.-Spatial dataflow improves execution by laying out the computation graph in space, enabling on-chip kernel fusion and meta-pipelining (a streaming-fusion sketch follows the list).
22.-The SambaFlow compiler takes PyTorch/TensorFlow models and generates a performant mapping to the RDU, optimizing kernels in both time and space.
23.-Spatial dataflow provides a 2-6x improvement over TPUs on various ML algorithms due to its fine-grained datapath, scheduling, and fusion.
24.-The pixelated butterfly approach with spatial dataflow can provide a 2x improvement on image classification and language modeling.
25.-RDU can exploit more parallelism than GPUs for sparse computation, with up to 20x performance improvement at smaller batch sizes.
26.-RDU's efficient dataflow compute needs minimal off-chip memory bandwidth, enabling high memory capacity (1.5 TB per chip) without high-bandwidth memory.
27.-RDU systems can train large language models with fewer chips, eliminating complex system engineering for efficient multi-chip usage.
28.-RDU's batch size flexibility enables faster learning for applications like drug discovery models.
29.-RDU's large memory and tiling capabilities enable higher resolution and accuracy for computer vision tasks in domains like neutrino physics.
30.-RDU provides 20x better inference throughput and lower latency compared to GPUs for applications like deep learning recommendation models.
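
Sketch for point 7: a back-of-the-envelope calculation of what doubling every 2.5 months implies over one year; the 1-billion-parameter starting size is an illustrative assumption, not a figure from the talk.

# Growth arithmetic for point 7 (illustrative assumption: start at 1B parameters).
doubling_period_months = 2.5
doublings_per_year = 12 / doubling_period_months      # 4.8 doublings per year
growth_factor = 2 ** doublings_per_year               # roughly 28x per year
start_params = 1e9
print(f"yearly growth: {growth_factor:.1f}x")
print(f"1B parameters becomes ~{start_params * growth_factor / 1e9:.0f}B in 12 months")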
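
Sketch for points 8 and 9: a minimal NumPy construction of a block-wise "flat butterfly" sparsity mask, in the spirit of keeping sparsity at block granularity so hardware can still run dense compute inside each kept tile; the matrix and block sizes are illustrative assumptions, and this is a sketch rather than the authors' implementation.

import numpy as np

def flat_butterfly_block_mask(n_blocks, block):
    # Block (i, j) is kept when i == j or i and j differ in exactly one bit,
    # i.e. the union of the supports of the log2(n_blocks) butterfly factors.
    assert n_blocks & (n_blocks - 1) == 0, "n_blocks must be a power of two"
    coarse = np.zeros((n_blocks, n_blocks), dtype=np.uint8)
    for i in range(n_blocks):
        coarse[i, i] = 1
        for bit in range(n_blocks.bit_length() - 1):
            coarse[i, i ^ (1 << bit)] = 1
    # Expand every kept block-level entry to a dense block x block tile.
    return np.kron(coarse, np.ones((block, block), dtype=np.uint8)).astype(bool)

mask = flat_butterfly_block_mask(n_blocks=8, block=32)   # 256 x 256 mask
print(mask.shape, f"density = {mask.mean():.2f}")        # density = (1 + 3) / 8 = 0.50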
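
Sketch for point 12: a minimal PyTorch example that makes the "models are dataflow graphs" point concrete by tracing a tiny module with torch.fx and printing the recovered operator graph; the module itself is an illustrative assumption.

import torch
import torch.nn as nn
import torch.fx as fx

class TinyMLP(nn.Module):
    # Illustrative two-layer model; any framework model is, underneath,
    # a dataflow graph of operators like the one printed below.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

traced = fx.symbolic_trace(TinyMLP())  # recover the dataflow computation graph
print(traced.graph)                    # placeholder -> fc1 -> relu -> fc2 -> output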
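
Sketch for points 13 and 14: a tiny pure-Python rendering of how a domain-specific operator decomposes into nested map and reduce patterns; the row-wise softmax operator is an illustrative choice, and the point is that each pattern names both the computation and the data-access shape a dataflow compiler can map to hardware.

import math

def pmap(f, xs):               # map: independent per-element work
    return [f(x) for x in xs]

def preduce(f, init, xs):      # reduce: combination of elements into one value
    acc = init
    for x in xs:
        acc = f(acc, x)
    return acc

def softmax_rows(matrix):      # a domain-specific operator built from patterns
    def one_row(row):
        m = preduce(max, float("-inf"), row)          # reduce: row max
        exps = pmap(lambda v: math.exp(v - m), row)   # map: exponentiate
        s = preduce(lambda a, b: a + b, 0.0, exps)    # reduce: row sum
        return pmap(lambda e: e / s, exps)            # map: normalize
    return pmap(one_row, matrix)                      # outer map over rows

print(softmax_rows([[1.0, 2.0, 3.0]]))  # ~[[0.09, 0.24, 0.67]]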
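
Sketch for point 21: a pure-Python analogy (not the RDU mechanism itself) contrasting kernel-at-a-time execution, which materializes every intermediate buffer, with a fused, meta-pipelined arrangement in which tiles stream through all stages; the tile size and stage functions are illustrative assumptions.

def load(n_tiles):                  # stage 1: produce input tiles
    for i in range(n_tiles):
        yield [float(i)] * 4        # a "tile" is just a small list here

def scale(tiles, a):                # stage 2: elementwise map
    for t in tiles:
        yield [a * v for v in t]

def accumulate(tiles):              # stage 3: running reduction
    total = 0.0
    for t in tiles:
        total += sum(t)
    return total

# Kernel-at-a-time: each stage finishes, and its output is buffered, before the next starts.
tiles = list(load(8))
scaled = list(scale(tiles, 2.0))
print("unfused:", accumulate(scaled))                 # 224.0

# Fused / meta-pipelined: tiles stream through all stages with no full intermediate buffers.
print("fused:  ", accumulate(scale(load(8), 2.0)))    # 224.0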
Knowledge Vault built by David Vivancos 2024