Knowledge Vault 2/30 - ICLR 2014-2023
Song Han, Huizi Mao, Bill Dally, ICLR 2016 - Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
  classDef compression fill:#f9d4d4, font-weight:bold, font-size:14px;
  classDef applications fill:#d4f9d4, font-weight:bold, font-size:14px;
  classDef pipeline fill:#d4d4f9, font-weight:bold, font-size:14px;
  classDef results fill:#f9f9d4, font-weight:bold, font-size:14px;
  classDef benefits fill:#f9d4f9, font-weight:bold, font-size:14px;
  A[Song Han et al., ICLR 2016] --> B[Deep compression: smaller models, same accuracy. 1]
  A --> C[Deep learning: wide applications, limited embedded power. 2]
  C --> D[Cloud deep learning: intelligent, inefficient. 3]
  C --> E[Mobile deep learning: large model sizes. 4]
  A --> F[Embedded systems: most energy from DRAM access. 5]
  A --> G[Deep compression: 10-50x smaller models, same accuracy. 6]
  G --> H[AlexNet 35x, VGGNet 49x, GoogleNet 10x, SqueezeNet 10x. 7]
  A --> I[Deep compression pipeline: prune, share weights, Huffman code. 8]
  I --> J[Pruning: removes redundant connections. 9]
  J --> K[Brain prunes connections from birth to adulthood. 10]
  J --> L[Convolutional 66% pruned, fully connected 90% pruned. 11]
  J --> M[Iterative pruning, retraining: 90% AlexNet pruned, no loss. 12]
  J --> N[Pruning works for RNNs, LSTMs, NeuralTalk. 13]
  I --> O[After pruning: weight clusters, quantization motivated. 14]
  I --> P[Weight sharing: nonlinear quantization, higher compression. 15]
  P --> Q[Weight sharing: k-means, codebook, quantization, retraining. 16]
  P --> R[Feedforward: weights as cluster indices, fewer bits. 17]
  P --> S[Weight sharing training: SGD, fine-tune centroids. 18]
  P --> T[Fully connected 2 bits, convolutional 4 bits tolerated. 19]
  I --> U[Pruning, quantization: work well together, sometimes better. 20]
  U --> V[AlexNet: 8/5 bits no loss, 4/2 bits 2% loss. 21]
  U --> W[Pruning + quantization: 3% size, no accuracy loss. 22]
  I --> X[Huffman coding: fewer bits for frequent weights. 23]
  I --> Y[10-49x compression without Huffman. 24]
  A --> Z[Compressed models in SRAM: speedup, energy efficiency. 25]
  Z --> AA[Fully connected: 3x speedup on CPU, GPU after compression. 26]
  Z --> AB[EIE accelerator: 189x speedup, 24,000x efficiency over CPU. 27]
  A --> AC[10-50x compression enables mobile deep learning under 10MB. 28]
  A --> AD[10-50x less memory bandwidth, benefits fully connected layers. 29]
  A --> AE[On-chip SRAM: 100x energy savings vs off-chip DRAM. 30]
  class A,B,G,I,U,Y,AC compression;
  class C,D,E applications;
  class J,K,L,M,N,O,P,Q,R,S,T,V,W,X pipeline;
  class H,Z,AA,AB,AD,AE results;
  class F benefits;

Resume:

1.-The presentation is about deep compression, which compresses neural networks to make models smaller while maintaining accuracy.

2.-Deep learning has a wide range of applications but embedded systems have limited computation power.

3.-Doing deep learning on the cloud is intelligent but inefficient due to network delay, power budget, and compromised user privacy.

4.-Running deep learning locally on mobile devices faces the problem of large model sizes.

5.-Most energy in embedded systems is consumed by accessing DRAM, which costs orders of magnitude more energy than multiplication and addition operations.
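
To make that gap concrete, the short Python sketch below compares rough per-operation energy costs. The picojoule figures are approximate 45nm estimates commonly cited in this line of work (often attributed to Horowitz); they are order-of-magnitude assumptions for illustration, not numbers taken from this talk.

# Rough per-operation energy (picojoules) in ~45nm CMOS; treat these
# as order-of-magnitude estimates only.
ENERGY_PJ = {
    "32b float add":  0.9,
    "32b float mult": 3.7,
    "32b SRAM read":  5.0,
}
DRAM_READ_PJ = 640.0   # off-chip 32b DRAM access

for op, pj in ENERGY_PJ.items():
    ratio = DRAM_READ_PJ / pj
    print(f"{op}: {pj:.1f} pJ -> a 32b DRAM read (~{DRAM_READ_PJ:.0f} pJ) is ~{ratio:.0f}x more expensive")

The roughly 128x gap between the DRAM and SRAM rows is also what makes the ~100x on-chip SRAM saving in item 29 plausible.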

6.-Deep compression can make deep neural networks 10-50 times smaller with the same accuracy on ImageNet.

7.-AlexNet was compressed by 35x, VGGNet by 49x, GoogleNet by 10x, and SqueezeNet by 10x, all with no loss in accuracy.

8.-The deep compression pipeline has three stages: network pruning, trained quantization through weight sharing, and Huffman coding.

9.-Network pruning removes redundant connections, similar to how the human brain prunes neural connections from birth to adulthood.

10.-Convolutional layers can be pruned by around 66%, while fully connected layers can have 90% of parameters pruned.

11.-Iterative pruning and retraining with L2 regularization can remove 90% of AlexNet's parameters without hurting accuracy.
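
As a concrete illustration of the pruning step in items 9-11, here is a minimal NumPy sketch of one round of magnitude pruning; the function name, the toy weight matrix, and the mask-based retraining comment are illustrative assumptions, not the authors' code.

import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a 0/1 mask that keeps only the largest-magnitude weights.

    `sparsity` is the fraction of weights to remove, e.g. 0.9 for a
    fully connected layer or ~0.66 for a convolutional layer.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Toy example: one prune -> retrain iteration on a stand-in layer.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
mask = prune_by_magnitude(W, sparsity=0.9)
W *= mask   # zero out pruned connections
# During retraining, gradient updates would also be multiplied by `mask`
# so pruned connections stay at zero while surviving weights recover accuracy.
print(f"kept {mask.mean():.1%} of the weights")

In the iterative scheme of item 11, this prune-retrain cycle is repeated with gradually increasing sparsity rather than pruning to 90% in one shot.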

12.-Pruning also works for RNNs and LSTMs, verified on the NeuralTalk image-captioning model.

13.-After pruning, the weight distribution becomes bimodal, separating into positive and negative clusters, which motivates the next step of quantization and weight sharing.

14.-Weight sharing is a nonlinear quantization method that can achieve higher compression rates than linear quantization.
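
The gain from weight sharing in item 14 can be written down explicitly. With n connections, b bits per original weight, and k shared clusters (each weight becomes a log2(k)-bit index, plus one shared k-entry codebook), the compression rate is essentially the one given in the paper:

r = \frac{n\,b}{n \log_2(k) + k\,b}

For example, 16 weights of 32 bits quantized to k = 4 clusters give 16*32 / (16*2 + 4*32) = 3.2x.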

15.-The weight sharing process involves k-means clustering of the weights, generating a codebook of centroids, quantizing each weight to its nearest centroid, and iteratively retraining the codebook.

16.-During feedforward, each weight is looked up from the codebook by its cluster index, which needs only a few bits instead of a 32-bit floating-point number.
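
Items 15 and 16 can be summarized in a short NumPy sketch of k-means weight sharing; the function name and toy layer are illustrative assumptions, and the linear codebook initialization mirrors the scheme reported to work best (it preserves the rare large-magnitude weights) rather than being a verbatim reimplementation.

import numpy as np

def share_weights(weights: np.ndarray, bits: int, iters: int = 20):
    """Cluster weights into 2**bits shared values (a codebook) via k-means.

    Returns (codebook, indices); the layer then stores the small codebook
    plus a low-bit index per weight instead of 32-bit floats.
    """
    k = 2 ** bits
    flat = weights.ravel()
    codebook = np.linspace(flat.min(), flat.max(), k)   # linear initialization
    for _ in range(iters):
        # Assignment step: nearest centroid for every weight.
        indices = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = flat[indices == j]
            if members.size:
                codebook[j] = members.mean()
    return codebook, indices.reshape(weights.shape)

# Toy usage: quantize a layer to 4-bit indices and reconstruct it.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
codebook, idx = share_weights(W, bits=4)
W_quantized = codebook[idx]   # feedforward reads codebook[index]
print("codebook size:", codebook.size, "max error:", float(np.abs(W - W_quantized).max()))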

17.-Training with weight sharing remains differentiable: the gradients of all weights assigned to a cluster are summed to fine-tune that cluster's centroid with stochastic gradient descent, while the cluster assignments stay fixed.
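
In the paper's notation, with loss L, shared centroids C_k, weights W_ij, and cluster-index matrix I_ij, this amounts to summing the gradients of all weights that share a centroid:

\frac{\partial \mathcal{L}}{\partial C_k}
  = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial W_{ij}}\,
    \mathbb{1}\!\left(I_{ij} = k\right)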

18.-Fully connected layers can tolerate quantization down to 2 bits, while convolutional layers can go down to 4 bits before accuracy drops significantly.

19.-Pruning and quantization work well together, sometimes even better than when applied individually.

20.-AlexNet can be quantized to 8 bits (convolutional) and 5 bits (fully connected) with no loss in accuracy, and pushed to 4/2 bits with only about 2% accuracy loss.

21.-Combining pruning and quantization, models can be compressed to only 3% of their original size without hurting accuracy.

22.-Huffman coding further compresses the weights by assigning fewer bits to more frequently occurring weights.
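
A minimal sketch of the idea in item 22: build a Huffman code over the stream of quantized weight indices so that frequent symbols get short codes. The helper below only computes code lengths, which is enough to estimate the saving over a fixed-length encoding; it is an illustrative implementation, not the authors' code.

import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol stream
        return {next(iter(freq)): 1}
    # Heap entries: (subtree weight, unique tiebreaker, {symbol: depth so far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Toy usage: a skewed distribution of 5-bit cluster indices.
indices = [0] * 500 + [1] * 200 + [2] * 100 + list(range(3, 32)) * 5
lengths = huffman_code_lengths(indices)
freq = Counter(indices)
huffman_bits = sum(freq[s] * lengths[s] for s in freq)
print(f"fixed 5-bit encoding: {5 * len(indices)} bits, Huffman: {huffman_bits} bits")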

23.-The total compression rate ranges from 10x for Inception-style models to 49x for other networks, without Huffman coding.

24.-Compressed models that fit entirely in SRAM cache result in significant speedups and energy efficiency improvements.

25.-Fully connected layers experience roughly 3x speedup on CPU and GPU after pruning and quantization.

26.-A custom hardware accelerator called EIE achieves 189x speedup and 24,000x energy efficiency over CPU for compressed models.

27.-10-50x model compression brings models under 10 MB, enabling deep learning inside mobile applications.
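
A back-of-the-envelope check of that 10 MB claim, assuming AlexNet's usual ~61M parameters stored as 32-bit floats and the 35x rate from item 7 (both assumptions for illustration; exact on-disk sizes vary by framework):

# Does 35x compression bring AlexNet under a ~10 MB mobile budget?
params = 61_000_000                # typical AlexNet parameter count (assumed)
original_mb = params * 4 / 1e6     # 32-bit floats -> ~244 MB
compressed_mb = original_mb / 35   # Deep Compression's reported AlexNet rate
print(f"{original_mb:.0f} MB -> {compressed_mb:.1f} MB")   # ~244 MB -> ~7 MB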

28.-Memory bandwidth requirements are also reduced by 10-50x, which especially benefits fully connected layers, where weights have little reuse.

29.-Fitting the entire working set in on-chip SRAM saves around 100x energy compared to accessing off-chip DRAM.

30.-The authors thank their collaborators and advisors for their guidance and helpful discussions in this work.

Knowledge Vault built by David Vivancos 2024