Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-The presentation is about deep compression, a technique for making neural network models much smaller while maintaining accuracy.
2.-Deep learning has a wide range of applications but embedded systems have limited computation power.
3.-Offloading deep learning to the cloud sidesteps device limits but is unattractive due to network latency, the power budget, and compromised user privacy.
4.-Running deep learning locally on mobile devices faces the problem of large model sizes.
5.-Most energy in embedded systems is consumed by accessing DRAM, which is much more expensive than multiplication and addition operations.
6.-Deep compression can make deep neural networks 10-50 times smaller with the same accuracy on ImageNet.
7.-AlexNet was compressed by 35x, VGGNet by 49x, GoogLeNet by 10x, and SqueezeNet by 10x, all with no loss in accuracy.
8.-The deep compression pipeline has three stages: network pruning, weight sharing, and Huffman coding.
9.-Network pruning removes redundant connections, similar to how the human brain prunes neural connections from birth to adulthood.
10.-Convolutional layers can be pruned by around 66%, while fully connected layers can have 90% of parameters pruned.
11.-Iterative pruning and retraining with L2 regularization can remove 90% of AlexNet's parameters without hurting accuracy (see the pruning sketch after this list).
12.-Pruning also works for RNNs and LSTMs, verified on the NeuralTalk model.
13.-After pruning, the weight distribution separates into positive and negative clusters, motivating the next step of quantization and weight sharing.
14.-Weight sharing is a nonlinear quantization method that can achieve higher compression rates than linear quantization.
15.-The weight sharing process involves k-means clustering of the weights, codebook generation, quantizing weights to codebook indices, and iterative codebook retraining (see the weight-sharing sketch after this list).
16.-During feedforward, weights are represented by their cluster index, requiring fewer bits than 32-bit floating-point numbers.
17.-Training with weight sharing remains differentiable: stochastic gradient descent fine-tunes the centroids by summing the gradients of the weights in each cluster, while cluster assignments stay fixed.
18.-Fully connected layers can tolerate quantization down to 2 bits, while convolutional layers can go down to 4 bits before accuracy drops significantly.
19.-Pruning and quantization work well together, sometimes even better than when applied individually.
20.-AlexNet can be quantized to 8 or 5 bits with no loss in accuracy, and 4 or 2 bits with only 2% accuracy loss.
21.-Combining pruning and quantization, models can be compressed to only 3% of their original size without hurting accuracy.
22.-Huffman coding further compresses the model by assigning shorter codes to more frequently occurring weight values (see the Huffman sketch after this list).
23.-The total compression rate ranges from 10x for Inception-style models to 49x for other networks, without Huffman coding.
24.-Compressed models that fit entirely in SRAM cache result in significant speedups and energy efficiency improvements.
25.-Fully connected layers experience roughly 3x speedup on CPU and GPU after pruning and quantization.
26.-A custom hardware accelerator called EIE achieves 189x speedup and 24,000x energy efficiency over CPU for compressed models.
27.-10-50x model compression brings models under 10 MB, enabling deep learning inside mobile applications (see the back-of-envelope sketch after this list).
28.-Memory bandwidth requirements also drop by 10-50x, which especially benefits fully connected layers, where weights have little reuse.
29.-Fitting the entire working set in on-chip SRAM saves around 100x energy compared to accessing off-chip DRAM.
30.-The authors thank their collaborators and advisors for their guidance and helpful discussions in this work.
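The pruning step in points 9-11 can be illustrated with a minimal NumPy sketch of magnitude-based pruning: weights below a threshold are zeroed with a binary mask, and the same mask zeroes their gradients during retraining so pruned connections stay removed. This is an illustrative sketch, not the authors' code; the layer shape, target sparsity, and learning rate are arbitrary assumptions.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights so `sparsity` of them are removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)   # a toy fully connected layer

# Prune 90% of the connections, as in the AlexNet figure above.
W, mask = prune_by_magnitude(W, sparsity=0.9)
print("fraction of weights kept:", mask.mean())       # ~0.10

# Iterative retraining: masking the gradient keeps pruned weights at zero.
grad = rng.normal(size=W.shape).astype(np.float32)    # stand-in for a real gradient
W -= 0.01 * grad * mask
```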
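Points 14-17 describe trained quantization via weight sharing. Below is a minimal NumPy sketch of the idea: k-means clusters a layer's weights into 2**bits shared values, weights are stored as small codebook indices, and fine-tuning updates each centroid with the summed gradient of its cluster while assignments stay fixed. The linear centroid initialization, layer size, and learning rate are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weight_share(weights, bits=2, iters=20):
    """Cluster weights into 2**bits shared values (simple k-means, linear init)."""
    k = 2 ** bits
    w = weights.ravel()
    centroids = np.linspace(w.min(), w.max(), k)      # evenly spaced initial centroids
    for _ in range(iters):
        # Assign each weight to its nearest centroid (this is the codebook index).
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = w[idx == j].mean()
    return centroids, idx.reshape(weights.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
codebook, indices = weight_share(W, bits=2)           # 4 shared values, 2-bit indices
W_quantized = codebook[indices]                       # reconstructed weights for inference

# Fine-tuning step: gradients of all weights sharing an index are summed
# and applied to that centroid; the assignments themselves do not change.
grad = rng.normal(size=W.shape)
for j in range(len(codebook)):
    codebook[j] -= 0.01 * grad[indices == j].sum()
```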
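Point 22's Huffman stage can be sketched with the standard library alone: a Huffman code is built over the quantized weight indices so that frequent values get shorter bit strings. The skewed toy index stream below is an assumption chosen only to make the savings visible.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code: frequent symbols receive shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:                                  # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreak, {symbol: code_so_far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, i2, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i2, merged))
    return heap[0][2]

# Example: a skewed stream of 2-bit cluster indices, like those after weight sharing.
indices = [0] * 70 + [1] * 20 + [2] * 8 + [3] * 2
code = huffman_code(indices)
bits = sum(len(code[s]) for s in indices)
print(code, f"{bits} bits vs {2 * len(indices)} bits fixed-width")
```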
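Finally, the size and energy claims in points 27-29 reduce to simple arithmetic; the sketch below just restates them, with the AlexNet parameter count (~61M) taken as an approximate assumption for illustration.

```python
# Rough arithmetic behind the "under 10 MB" and "~100x energy" points above.
params = 61_000_000                 # approximate AlexNet parameter count (assumption)
baseline_mb = params * 4 / 1e6      # 32-bit floats: roughly 244 MB
compressed_mb = baseline_mb / 35    # 35x deep compression: roughly 7 MB, under 10 MB

# Relative memory energy when the whole working set fits in on-chip SRAM
# instead of off-chip DRAM (the ~100x figure from point 29).
sram_vs_dram = 1 / 100

print(f"{baseline_mb:.0f} MB -> {compressed_mb:.1f} MB, "
      f"memory energy scaled to {sram_vs_dram:.0%} of DRAM")
```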
Knowledge Vault built by David Vivancos 2024