My name is Daniele Affinita, and this report documents the work I completed during the Google Summer of Code 2024 for OpenCV, under the mentorship of Yuantao Feng.
Integrated blockwise quantization into OpenCV's DNN module, focusing on compressing model size for deployment on memory-constrained devices. Developed a tool that quantizes model weights blockwise, achieving a 2-4x reduction in model size. Evaluated performance across the OpenCV Zoo, showing that block-quantized models stay much closer to the original fp32 accuracy than models produced by standard int8 quantization.
As the popularity of Deep Learning grows, fitting neural networks onto edge devices becomes increasingly crucial. Model quantization is a key technique that reduces model size while enhancing inference speed. This approach has already been applied to the models in the OpenCV Zoo, but further evaluations revealed a significant drop in accuracy for the quantized models.
A promising alternative is blockwise quantization, in which each weight tensor is split into fixed-size blocks and every block receives its own scale (and optionally zero point), so the quantization error introduced by an outlier stays confined to a single block instead of degrading the whole tensor. This offers a balance between model size compression and accuracy. Although the technique has been studied extensively in the context of Large Language Models, its applicability to Computer Vision models had yet to be explored.
The objective of this project was to integrate blockwise quantization into OpenCV, enabling inference of block-quantized models within the OpenCV DNN module and providing a tool for blockwise model quantization. The project also included a comprehensive evaluation of block-quantized models against both the original models and those quantized using the standard int8 method.
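To make the idea concrete, here is a minimal NumPy sketch of weight-only blockwise quantization. The block size, the symmetric int8 mapping (no zero points), and all function names are illustrative assumptions of mine, not the code merged into OpenCV; the ONNX operators additionally support zero points and arbitrary block axes.

```python
import numpy as np

def quantize_blockwise(w, block_size=64):
    """Quantize a flattened weight tensor per block (illustrative sketch).

    Each block of `block_size` values gets its own scale, so an outlier
    only degrades precision inside its own block.
    """
    w = w.reshape(-1, block_size)                            # assumes length divisible by block_size
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0    # symmetric int8 range
    scales = np.where(scales == 0, 1.0, scales)              # guard against all-zero blocks
    q = np.clip(np.round(w / scales), -128, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_blockwise(q, scales, original_shape):
    """Inverse mapping: broadcast each block's scale back over its values."""
    return (q.astype(np.float32) * scales).reshape(original_shape)

w = np.random.randn(4, 256).astype(np.float32)
q, s = quantize_blockwise(w.ravel())
w_hat = dequantize_blockwise(q, s, w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Because every block carries its own scale, shrinking the block size trades extra scale storage for lower reconstruction error.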
- Inference Engine for Blockwise Quantization: Extended the QuantizeLinear and DequantizeLinear layers to support blockwise quantization, caching the dequantized parameters to avoid repeated computation. Implemented unit tests to ensure compliance with the ONNXRuntime test cases for DequantizeLinear and QuantizeLinear.
- Blockwise Quantization Tool: Developed a weight-only blockwise quantization tool that converts a valid ONNX model into its block-quantized version. For each quantized weight, the tool rewrites the computational graph so the layer reads its weight through a DequantizeLinear -> Reshape subgraph (a minimal sketch of this rewrite follows the tables below). Initially the tool supported only Convolution layers, demonstrating a 2-4x model size reduction depending on the weight distribution.
- Extended Blockwise Quantization Tool: Added support for MatMul and Gemm layers, further broadening the tool's applicability. This extension was particularly effective for models like Vision Transformers: it compressed the ViT tracker model from 698 KB to 260 KB, significantly improving storage efficiency.
- Blockwise Quantization Evaluation: Conducted a thorough evaluation of the blockwise quantization pipeline across all OpenCV Zoo models, focusing on compression ratios and accuracy. Results indicate that blockwise quantized models retain accuracy close to the original fp32 versions, outperforming standard int8 quantized models, which show a more significant drop in performance.

| Model | Accuracy | mIoU | Size |
|---|---|---|---|
| PPHumanSeg fp32 | 0.9656 | 0.9164 | 6.2 MB |
| PPHumanSeg block quantized | 0.9655 | 0.9162 | 1.7 MB |
| PPHumanSeg int8 quantized | 0.7285 | 0.3642 | 1.6 MB |

| Model | Top-1 Accuracy (%) | Top-5 Accuracy (%) | Size |
|---|---|---|---|
| MobileNet V1 fp32 | 67.64 | 87.97 | 16.9 MB |
| MobileNet V1 block quantized | 67.21 | 87.62 | 4.6 MB |
| MobileNet V1 int8 quantized | 55.53 | 78.74 | 4.3 MB |
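
The graph rewrite performed by the quantization tool can be sketched with the onnx helper API. Everything below, including the node and tensor names, is an illustrative placeholder rather than the actual opencv_zoo tool, which also handles initializer extraction, block shapes, and axis bookkeeping that this fragment glosses over.

```python
from onnx import helper

def make_block_dequant_pattern(weight_name, conv_node):
    """Sketch: replace a Conv's float weight with a
    DequantizeLinear -> Reshape subgraph (all names are hypothetical)."""
    dequant = helper.make_node(
        "DequantizeLinear",
        inputs=[weight_name + "_q", weight_name + "_scales"],  # blocked int8 weights + per-block scales
        outputs=[weight_name + "_deq"],
        name=weight_name + "_DequantizeLinear",
    )
    reshape = helper.make_node(
        "Reshape",
        inputs=[weight_name + "_deq", weight_name + "_shape"],  # restore the original weight shape
        outputs=[weight_name + "_f32"],
        name=weight_name + "_Reshape",
    )
    conv_node.input[1] = weight_name + "_f32"  # rewire the consumer to the reconstructed weight
    return [dequant, reshape]
```

This is also where the parameter caching from the first deliverable pays off: the DequantizeLinear output can be computed once and reused across inferences, so the extra nodes cost a one-time reconstruction rather than per-inference compute.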
The initial goals set for GSoC have been successfully achieved. In discussions with my mentor, we explored several ideas for continuing this project beyond GSoC. While the focus so far has been on compressing model size to fit devices with limited memory, quantization also offers significant benefits for inference speed.
To further optimize DNN inference for quantization, we plan to replace common operations with their integer counterparts, allowing computations to be performed without the need to dequantize the weights beforehand. This will require extending the quantization tool to not only quantize weights but also layer activation maps. Additionally, we aim to develop a graph simplification algorithm that can identify patterns in the computational graph and replace them with integer-optimized operations.
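As a rough illustration of that direction, the sketch below multiplies int8 operands directly, accumulating in int32 and requantizing the result. The function name, the scale handling, and the floating-point requantization step are simplifying assumptions of mine, not a committed design; production integer kernels typically requantize in fixed point.

```python
import numpy as np

def int8_matmul_requant(a_q, b_q, a_scale, b_scale, out_scale):
    """Integer GEMM sketch: int8 x int8 -> int32 accumulate -> int8 output.

    The float weights are never materialized; only the final rescale
    touches floating point.
    """
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)  # exact int32 accumulation
    rescale = (a_scale * b_scale) / out_scale          # fold the three scales into one factor
    return np.clip(np.round(acc * rescale), -128, 127).astype(np.int8)

a_q = np.random.randint(-128, 128, (2, 4), dtype=np.int8)
b_q = np.random.randint(-128, 128, (4, 3), dtype=np.int8)
print(int8_matmul_requant(a_q, b_q, 0.02, 0.01, 0.05))
```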
Before GSoC24:
- opencv/opencv_zoo#243 - C++ Demo - Human Segmentation
- opencv/opencv_zoo#233 - C++ Demo - Facial Expression Recognition
During GSoC24:
- opencv/opencv#25644 - [GSoC] dnn: Blockwise quantization support
- opencv/opencv_extra#1181 - Support Blockwise Quantization
- opencv/opencv_zoo#265 - [GSoC] Blockwise Quantization Tool
- opencv/opencv_zoo#268 - [GSoC] Gemm and MatMul block quantization support
- opencv/opencv_zoo#270 - [GSoC] Add block quantized models