Song Han, Huizi Mao, Bill Dally, ICLR 2016 - Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

**Concept Graph & Summary using Claude 3 Opus | ChatGPT-4 | Gemini Adv | Llama 3:**

```mermaid
graph LR
classDef compression fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef applications fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef pipeline fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef results fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef benefits fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Song Han et al ICLR 2016] --> B[Deep compression: smaller models, same accuracy. 1]
A --> C[Deep learning: wide applications, limited embedded power. 2]
C --> D[Cloud deep learning: intelligent, inefficient. 3]
C --> E[Mobile deep learning: large model sizes. 4]
A --> F[Embedded systems: most energy from DRAM access. 5]
A --> G[Deep compression: 10-50x smaller models, same accuracy. 6]
G --> H[AlexNet 35x, VGGNet 49x, GoogLeNet 10x, SqueezeNet 10x. 7]
A --> I[Deep compression pipeline: prune, share weights, Huffman code. 8]
I --> J[Pruning: removes redundant connections. 9]
J --> K[Brain prunes connections from birth to adulthood. 10]
J --> L[Convolutional 66% pruned, fully connected 90% pruned. 11]
J --> M[Iterative pruning, retraining: 90% AlexNet pruned, no loss. 12]
J --> N[Pruning works for RNNs, LSTMs, NeuralTalk. 13]
I --> O[After pruning: weight clusters, quantization motivated. 14]
I --> P[Weight sharing: nonlinear quantization, higher compression. 15]
P --> Q[Weight sharing: k-means, codebook, quantization, retraining. 16]
P --> R[Feedforward: weights as cluster indices, fewer bits. 17]
P --> S[Weight sharing training: SGD, fine-tune centroids. 18]
P --> T[Fully connected 2 bits, convolutional 4 bits tolerated. 19]
I --> U[Pruning, quantization: work well together, sometimes better. 20]
U --> V[AlexNet: 8/5 bits no loss, 4/2 bits 2% loss. 21]
U --> W[Pruning + quantization: 3% size, no accuracy loss. 22]
I --> X[Huffman coding: fewer bits for frequent weights. 23]
I --> Y[10-49x compression without Huffman. 24]
A --> Z[Compressed models in SRAM: speedup, energy efficiency. 25]
Z --> AA[Fully connected: 3x speedup on CPU, GPU after compression. 26]
Z --> AB[EIE accelerator: 189x speedup, 24,000x efficiency over CPU. 27]
A --> AC[10-50x compression enables mobile deep learning under 10MB. 28]
A --> AD[10-50x less memory bandwidth, benefits fully connected layers. 29]
A --> AE[On-chip SRAM: 100x energy savings vs off-chip DRAM. 30]
class A,B,G,I,U,Y,AC compression;
class C,D,E applications;
class J,K,L,M,N,O,P,Q,R,S,T,V,W,X pipeline;
class H,Z,AA,AB,AD,AE results;
class F benefits;
```

**Summary:**

**1.-**The presentation is about Deep Compression, a technique that shrinks neural network models while maintaining accuracy.

**2.-**Deep learning has a wide range of applications but embedded systems have limited computation power.

**3.-**Doing deep learning on the cloud is intelligent but inefficient due to network delay, power budget, and compromised user privacy.

**4.-**Running deep learning locally on mobile devices faces the problem of large model sizes.

**5.-**Most energy in embedded systems is consumed by accessing DRAM, which is much more expensive than multiplication and addition operations.
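
As a rough sanity check on this claim, the per-operation energy figures quoted in the talk (45 nm CMOS, originally from Horowitz's measurements) can be compared directly; the numbers below are approximate and illustrative:

```python
# Rough per-operation energy figures for 45 nm CMOS, as quoted in the
# Deep Compression talk (originally from Horowitz); all in picojoules.
ENERGY_PJ = {
    "32-bit float add": 0.9,
    "32-bit float mult": 3.7,
    "32-bit SRAM read": 5.0,
    "32-bit DRAM read": 640.0,
}

# Fetching one weight from DRAM costs far more than using it in a multiply-add.
dram = ENERGY_PJ["32-bit DRAM read"]
mac = ENERGY_PJ["32-bit float mult"] + ENERGY_PJ["32-bit float add"]
print(f"DRAM fetch / MAC energy ratio: {dram / mac:.0f}x")                       # ~139x
print(f"DRAM / SRAM energy ratio: {dram / ENERGY_PJ['32-bit SRAM read']:.0f}x")  # 128x
```

This ratio is why fitting the model in on-chip SRAM, rather than speeding up arithmetic, is the energy lever the paper targets.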

**6.-**Deep compression can make deep neural networks 10-50 times smaller with the same accuracy on ImageNet.

**7.-**AlexNet was compressed by 35x, VGGNet by 49x, GoogLeNet by 10x, and SqueezeNet by 10x, all with no loss in accuracy.

**8.-**The deep compression pipeline has three stages: network pruning, weight sharing, and Huffman coding.

**9.-**Network pruning removes redundant connections, similar to how the human brain prunes neural connections from birth to adulthood.

**10.-**Convolutional layers can be pruned by around 66%, while fully connected layers can have 90% of parameters pruned.

**11.-**Iterative pruning and retraining with L2 regularization can remove 90% of AlexNet's parameters without hurting accuracy.
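
A minimal NumPy sketch of magnitude-based iterative pruning under these assumptions; `train_step` is a hypothetical placeholder for the retraining pass performed between pruning rounds:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights; returns pruned weights and mask."""
    k = int(weights.size * sparsity)
    threshold = np.sort(np.abs(weights), axis=None)[k]  # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Hypothetical prune-retrain loop on a toy weight matrix: sparsity is
# increased gradually, with retraining between rounds to recover accuracy.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for sparsity in (0.5, 0.75, 0.9):
    w, mask = magnitude_prune(w, sparsity)
    # w = train_step(w, mask)  # retrain; gradients of pruned weights stay masked
print(f"remaining weights: {mask.mean():.0%}")  # ~10% after 90% pruning
```

The key point from the talk is the loop, not the threshold rule: pruning 90% in one shot hurts accuracy, while pruning and retraining iteratively does not.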

**12.-**Pruning also works for RNNs and LSTMs, verified on the NeuralTalk model.

**13.-**After pruning, the weight distribution separates into positive and negative clusters, motivating the next step of quantization and weight sharing.

**14.-**Weight sharing is a nonlinear quantization method that can achieve higher compression rates than linear quantization.

**15.-**The weight sharing process involves k-means clustering, codebook generation, weight quantization with the codebook, and iterative codebook retraining.
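
This clustering step can be sketched as a small 1-D Lloyd's iteration in NumPy, with the linear codebook initialization the paper reports works best; an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def kmeans_quantize(weights, n_clusters=4, n_iter=20):
    """Cluster weights into a small codebook (1-D k-means) and return
    (codebook, indices) so each weight is replaced by its centroid."""
    flat = weights.ravel()
    # Linear initialization over [min, max] of the weight range.
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):                 # skip empty clusters
                codebook[k] = flat[idx == k].mean()
    return codebook, idx.reshape(weights.shape)

rng = np.random.default_rng(1)
w = rng.normal(size=(16, 16))
codebook, idx = kmeans_quantize(w, n_clusters=4)  # 4 clusters -> 2-bit indices
w_quant = codebook[idx]                           # shared-weight reconstruction
```

After this step the layer stores only the small codebook plus one low-bit index per weight.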

**16.-**During feedforward, weights are represented by their cluster index, requiring fewer bits than 32-bit floating-point numbers.
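
The storage saving follows from the paper's compression-rate formula, n·b / (n·log2(k) + k·b), for n weights of b bits quantized to k clusters; a quick check in Python:

```python
from math import log2

def compression_rate(n_weights, n_clusters, bits=32):
    """Storage ratio of weight sharing: full-precision weights vs.
    per-weight cluster indices plus a small shared codebook."""
    original = n_weights * bits
    compressed = n_weights * log2(n_clusters) + n_clusters * bits
    return original / compressed

# The paper's illustrative case: a 4x4 layer quantized to 4 clusters.
print(compression_rate(16, 4))  # 3.2x
```

The codebook cost is fixed, so the rate approaches b / log2(k) for large layers.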

**17.-**Training with weight sharing remains differentiable: stochastic gradient descent fine-tunes the centroids while cluster assignments stay fixed.
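
A sketch of that centroid update in NumPy: per-weight gradients are scatter-added into their cluster's slot, then one SGD step moves the centroids (toy sizes and learning rate are assumptions):

```python
import numpy as np

# Fixed cluster assignment per weight, and per-weight gradients from backprop.
rng = np.random.default_rng(2)
idx = rng.integers(0, 4, size=256)     # which centroid each weight uses
grad_w = rng.normal(size=256)          # dL/dw for each (shared) weight

# Gradients of all weights sharing a centroid are summed onto that centroid.
grad_centroid = np.zeros(4)
np.add.at(grad_centroid, idx, grad_w)  # scatter-add: sum grads per cluster

codebook = np.array([-0.3, -0.1, 0.1, 0.3])
codebook -= 0.01 * grad_centroid       # one SGD step on the centroids
```

Only k centroid values are updated, however many weights the layer has.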

**18.-**Fully connected layers can tolerate quantization down to 2 bits, while convolutional layers can go down to 4 bits before accuracy drops significantly.

**19.-**Pruning and quantization work well together, sometimes even better than when applied individually.

**20.-**AlexNet can be quantized to 8 or 5 bits with no loss in accuracy, and 4 or 2 bits with only 2% accuracy loss.

**21.-**Combining pruning and quantization, models can be compressed to only 3% of their original size without hurting accuracy.

**22.-**Huffman coding further compresses the weights by assigning fewer bits to more frequently occurring weights.
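
Huffman coding over the quantized indices can be sketched with Python's `heapq`; the skewed toy distribution below mimics how pruning and quantization leave a few very frequent values:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from a symbol stream."""
    freq = Counter(symbols)
    # Heap of (weight, tiebreak, subtree); subtrees map symbol -> code-so-far.
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)        # merge the two rarest subtrees
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

# Hypothetical index stream: index 0 (near-zero weights) dominates.
indices = [0] * 90 + [1] * 6 + [2] * 3 + [3]
code = huffman_code(indices)
bits = sum(len(code[s]) for s in indices)
print(bits, "bits vs", 2 * len(indices), "fixed 2-bit")  # 114 vs 200
```

Frequent indices get 1-bit codes and rare ones longer codes, which is where the final compression stage's savings come from.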

**23.-**The total compression rate ranges from 10x for Inception-style models up to 49x for other networks, even before Huffman coding is applied.

**24.-**Compressed models that fit entirely in SRAM cache result in significant speedups and energy efficiency improvements.

**25.-**Fully connected layers experience roughly 3x speedup on CPU and GPU after pruning and quantization.

**26.-**A custom hardware accelerator called EIE achieves 189x speedup and 24,000x energy efficiency over CPU for compressed models.

**27.-**10-50x model compression enables deep learning in mobile applications under 10 MB.

**28.-**Memory bandwidth is also reduced by 10-50x, which is especially beneficial for fully connected layers, where weights see little reuse.

**29.-**Fitting the entire working set in on-chip SRAM saves around 100x energy compared to accessing off-chip DRAM.

**30.-**The authors thank their collaborators and advisors for their guidance and helpful discussions in this work.

Knowledge Vault built by David Vivancos, 2024