My name is Daniele Affinita, and this report documents the work I completed during the Google Summer of Code 2024 for OpenCV, under the mentorship of Yuantao Feng.
Integrated blockwise quantization into OpenCV's DNN module, focusing on compressing model size for deployment on memory-constrained devices. Developed a tool that quantizes model weights blockwise, achieving a 2-4x reduction in model size. Evaluated performance across the OpenCV Zoo, showing that block-quantized models stay much closer to the original fp32 accuracy than models produced by standard int8 quantization.
As the popularity of Deep Learning grows, fitting neural networks onto edge devices becomes increasingly crucial. Model quantization is a key technique that reduces model size while enhancing inference speed. This approach has already been applied to the models in the OpenCV Zoo, but further evaluations revealed a significant drop in accuracy for the quantized models.
A promising alternative is blockwise quantization, in which each weight tensor is split into fixed-size blocks and every block receives its own scale (and optionally zero point), so the quantization error introduced by an outlier stays confined to a single block instead of degrading the whole tensor. This offers a balance between model size compression and accuracy. Although the technique has been studied extensively in the context of Large Language Models, its applicability to Computer Vision models had yet to be explored.
The objective of this project was to integrate blockwise quantization into OpenCV, enabling inference of block-quantized models within the OpenCV DNN module and providing a tool for blockwise model quantization. The project also included a comprehensive evaluation of block-quantized models against both the original models and those quantized using the standard int8 method.
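To make the idea concrete, here is a minimal NumPy sketch of weight-only blockwise quantization. The block size, the symmetric int8 mapping (no zero points), and all function names are illustrative assumptions of mine, not the code merged into OpenCV; the ONNX operators additionally support zero points and arbitrary block axes.

```python
import numpy as np

def quantize_blockwise(w, block_size=64):
    """Quantize a flattened weight tensor per block (illustrative sketch).

    Each block of `block_size` values gets its own scale, so an outlier
    only degrades precision inside its own block.
    """
    w = w.reshape(-1, block_size)                            # assumes length divisible by block_size
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0    # symmetric int8 range
    scales = np.where(scales == 0, 1.0, scales)              # guard against all-zero blocks
    q = np.clip(np.round(w / scales), -128, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_blockwise(q, scales, original_shape):
    """Inverse mapping: broadcast each block's scale back over its values."""
    return (q.astype(np.float32) * scales).reshape(original_shape)

w = np.random.randn(4, 256).astype(np.float32)
q, s = quantize_blockwise(w.ravel())
w_hat = dequantize_blockwise(q, s, w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Because every block carries its own scale, shrinking the block size trades extra scale storage for lower reconstruction error.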
- Inference Engine for Blockwise Quantization: Extended the QuantizeLinear and DequantizeLinear layers to support blockwise quantization, caching the dequantized parameters to avoid repeated computation. Implemented unit tests to ensure compliance with the ONNXRuntime test cases for DequantizeLinear and QuantizeLinear.
- Blockwise Quantization Tool: Developed a weight-only blockwise quantization tool that converts a valid ONNX model into its block-quantized version. For each quantized weight, the tool rewrites the computational graph so the layer reads its weight through a DequantizeLinear -> Reshape subgraph (a minimal sketch of this rewrite follows the tables below). Initially the tool supported only Convolution layers, demonstrating a 2-4x model size reduction depending on the weight distribution.
- Extended Blockwise Quantization Tool: Added support for MatMul and Gemm layers, further broadening the tool's applicability. This extension was particularly effective for models like Vision Transformers: it compressed the ViT tracker model from 698 KB to 260 KB, significantly improving storage efficiency.
- Blockwise Quantization Evaluation: Conducted a thorough evaluation of the blockwise quantization pipeline across all OpenCV Zoo models, focusing on compression ratios and accuracy. Results indicate that blockwise quantized models retain accuracy close to the original fp32 versions, outperforming standard int8 quantized models, which show a more significant drop in performance.

| Model | Accuracy | mIoU | Size |
|---|---|---|---|
| PPHumanSeg fp32 | 0.9656 | 0.9164 | 6.2 MB |
| PPHumanSeg block quantized | 0.9655 | 0.9162 | 1.7 MB |
| PPHumanSeg int8 quantized | 0.7285 | 0.3642 | 1.6 MB |

| Model | Top-1 Accuracy (%) | Top-5 Accuracy (%) | Size |
|---|---|---|---|
| MobileNet V1 fp32 | 67.64 | 87.97 | 16.9 MB |
| MobileNet V1 block quantized | 67.21 | 87.62 | 4.6 MB |
| MobileNet V1 int8 quantized | 55.53 | 78.74 | 4.3 MB |
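
The graph rewrite performed by the quantization tool can be sketched with the onnx helper API. Everything below, including the node and tensor names, is an illustrative placeholder rather than the actual opencv_zoo tool, which also handles initializer extraction, block shapes, and axis bookkeeping that this fragment glosses over.

```python
from onnx import helper

def make_block_dequant_pattern(weight_name, conv_node):
    """Sketch: replace a Conv's float weight with a
    DequantizeLinear -> Reshape subgraph (all names are hypothetical)."""
    dequant = helper.make_node(
        "DequantizeLinear",
        inputs=[weight_name + "_q", weight_name + "_scales"],  # blocked int8 weights + per-block scales
        outputs=[weight_name + "_deq"],
        name=weight_name + "_DequantizeLinear",
    )
    reshape = helper.make_node(
        "Reshape",
        inputs=[weight_name + "_deq", weight_name + "_shape"],  # restore the original weight shape
        outputs=[weight_name + "_f32"],
        name=weight_name + "_Reshape",
    )
    conv_node.input[1] = weight_name + "_f32"  # rewire the consumer to the reconstructed weight
    return [dequant, reshape]
```

This is also where the parameter caching from the first deliverable pays off: the DequantizeLinear output can be computed once and reused across inferences, so the extra nodes cost a one-time reconstruction rather than per-inference compute.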
The initial goals set for GSoC have been successfully achieved. In discussions with my mentor, we explored several ideas for continuing this project beyond GSoC. While the focus so far has been on compressing model size to fit devices with limited memory, quantization also offers significant benefits for inference speed.
To further optimize DNN inference for quantization, we plan to replace common operations with their integer counterparts, allowing computations to be performed without the need to dequantize the weights beforehand. This will require extending the quantization tool to not only quantize weights but also layer activation maps. Additionally, we aim to develop a graph simplification algorithm that can identify patterns in the computational graph and replace them with integer-optimized operations.
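As a rough illustration of that direction, the sketch below multiplies int8 operands directly, accumulating in int32 and requantizing the result. The function name, the scale handling, and the floating-point requantization step are simplifying assumptions of mine, not a committed design; production integer kernels typically requantize in fixed point.

```python
import numpy as np

def int8_matmul_requant(a_q, b_q, a_scale, b_scale, out_scale):
    """Integer GEMM sketch: int8 x int8 -> int32 accumulate -> int8 output.

    The float weights are never materialized; only the final rescale
    touches floating point.
    """
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)  # exact int32 accumulation
    rescale = (a_scale * b_scale) / out_scale          # fold the three scales into one factor
    return np.clip(np.round(acc * rescale), -128, 127).astype(np.int8)

a_q = np.random.randint(-128, 128, (2, 4), dtype=np.int8)
b_q = np.random.randint(-128, 128, (4, 3), dtype=np.int8)
print(int8_matmul_requant(a_q, b_q, 0.02, 0.01, 0.05))
```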
Before GSoC24:
- opencv/opencv_zoo#243 - C++ Demo - Human Segmentation
- opencv/opencv_zoo#233 - C++ Demo - Facial Expression Recognition
During GSoC24:
- opencv/opencv#25644 - [GSoC] dnn: Blockwise quantization support
- opencv/opencv_extra#1181 - Support Blockwise Quantization
- opencv/opencv_zoo#265 - [GSoC] Blockwise Quantization Tool
- opencv/opencv_zoo#268 - [GSoC] Gemm and MatMul block quantization support
- opencv/opencv_zoo#270 - [GSoC] Add block quantized models