GSoC 24 Final Report

OpenCV Blockwise Quantization - Google Summer of Code 2024

My name is Daniele Affinita, and this report documents the work I completed during the Google Summer of Code 2024 for OpenCV, under the mentorship of Yuantao Feng.

TL;DR

Integrated Blockwise Quantization into OpenCV's DNN module, focusing on compressing model size for deployment on memory-constrained devices. Developed a tool to quantize models blockwise, achieving a 2-4x reduction in model size. Evaluated the performance, showing that blockwise quantized models retain accuracy closer to the original compared to standard int8 quantization.

Project Summary

As the popularity of Deep Learning grows, fitting neural networks onto edge devices becomes increasingly crucial. Model quantization is a key technique that reduces model size while enhancing inference speed. This approach has already been applied to the models in the OpenCV Zoo, but further evaluations revealed a significant drop in accuracy for the quantized models.

A potential alternative to standard quantization is blockwise quantization, which offers a balance between model size compression and performance. Although this technique has been extensively studied in the context of Large Language Models, its applicability to Computer Vision models had yet to be explored.

The objective of this project was to integrate Blockwise Quantization into OpenCV, enabling inference within the OpenCV DNN module and developing a tool for blockwise model quantization. Additionally, the project includes a comprehensive evaluation of the performance of block-quantized models in comparison to the original models and those quantized using the standard int8 method.

(Figure: blockwise_quant)
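For readers unfamiliar with the technique, the NumPy sketch below illustrates the core idea of weight-only blockwise quantization: the weight tensor is split into fixed-size blocks, and each block receives its own scale and zero point. The function names and the default block size of 64 are illustrative and do not reflect the exact OpenCV implementation.

```python
import numpy as np

def block_quantize(w, block_size=64):
    # Flatten the weights and pad so the length is a multiple of block_size.
    flat = w.reshape(-1).astype(np.float32)
    pad = (-flat.size) % block_size
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)

    # One (scale, zero_point) pair per block: asymmetric int8 quantization.
    w_min = blocks.min(axis=1, keepdims=True)
    w_max = blocks.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 255.0, 1e-12)
    zero_point = np.clip(np.round(-128.0 - w_min / scale), -128, 127)

    q = np.clip(np.round(blocks / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale.astype(np.float32), zero_point.astype(np.int8)

def block_dequantize(q, scale, zero_point, shape):
    # Reverse the mapping and restore the original tensor shape (dropping the padding).
    deq = (q.astype(np.float32) - zero_point.astype(np.float32)) * scale
    return deq.reshape(-1)[:int(np.prod(shape))].reshape(shape)
```

Because every block stores only one extra scale and zero point, smaller blocks track the local weight distribution more closely at the cost of a slightly larger quantized model.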

Contributions

  1. Inference Engine for Blockwise Quantization: Extended the QuantizeLinear and DequantizeLinear layers to support blockwise quantization, caching the quantization parameters to avoid repeated computation. Implemented unit tests to ensure compliance with the ONNXRuntime test cases for both layers.

  2. Blockwise Quantization Tool: Developed a weight-only blockwise quantization tool that converts valid ONNX models into their block-quantized versions. The tool modifies the computational graph by replacing each quantized weight with a DequantizeLinear -> Reshape -> Layer pattern (a minimal sketch of this rewrite is given after the evaluation tables below). Initially, the tool supported only Convolutional layers, achieving a 2-4x model size reduction depending on the weight distribution.

  3. Extended Blockwise Quantization Tool: Added support for MatMul and Gemm layers, further enhancing the tool's applicability. This extension was particularly effective for models like Vision Transformers, where it compressed the ViT tracker model from 698 KB to 260 KB, significantly improving storage efficiency.

(Figure: tracker block quantized)

  4. Blockwise Quantization Evaluation: Conducted a thorough evaluation of the blockwise quantization pipeline across all OpenCV Zoo models, focusing on compression ratios and accuracy. Results indicate that blockwise quantized models retain accuracy close to the original fp32 versions, outperforming standard int8 quantized models, which show a more significant drop in performance. Representative results:
| Model | Accuracy | mIoU | Size |
|---|---|---|---|
| PPHumanSeg fp32 | 0.9656 | 0.9164 | 6.2 MB |
| PPHumanSeg block quantized | 0.9655 | 0.9162 | 1.7 MB |
| PPHumanSeg quantized | 0.7285 | 0.3642 | 1.6 MB |

| Model | Top-1 Accuracy | Top-5 Accuracy | Size |
|---|---|---|---|
| MobileNet V1 | 67.64 | 87.97 | 16.9 MB |
| MobileNet V1 block quantized | 67.21 | 87.62 | 4.6 MB |
| MobileNet V1 quantized | 55.53 | 78.74 | 4.3 MB |
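To make the graph rewrite from contribution 2 concrete, here is a minimal ONNX-level sketch of the DequantizeLinear -> Reshape pattern. It assumes the quantized weight is stored as a 2-D (num_blocks, block_size) tensor with one scale and zero point per block, and that the block size divides the weight size evenly so no padding is needed. The helper name make_dequant_pattern is hypothetical; the actual tool also removes the original float initializer and handles Conv, MatMul and Gemm specifics.

```python
import numpy as np
from onnx import helper, numpy_helper

def make_dequant_pattern(weight_name, q_blocks, scale, zero_point, weight_shape):
    # The quantized weight, per-block scale/zero point and target shape become initializers.
    inits = [
        numpy_helper.from_array(q_blocks, weight_name + "_q"),                 # int8, (num_blocks, block_size)
        numpy_helper.from_array(scale.reshape(-1), weight_name + "_scale"),    # float32, (num_blocks,)
        numpy_helper.from_array(zero_point.reshape(-1), weight_name + "_zp"),  # int8, (num_blocks,)
        numpy_helper.from_array(np.array(weight_shape, dtype=np.int64), weight_name + "_shape"),
    ]
    # Per-axis dequantization: each row (block) uses its own scale and zero point.
    dequant = helper.make_node(
        "DequantizeLinear",
        inputs=[weight_name + "_q", weight_name + "_scale", weight_name + "_zp"],
        outputs=[weight_name + "_deq"],
        axis=0,
    )
    # Reshape back to the original weight shape; the consumer keeps reading `weight_name`.
    reshape = helper.make_node(
        "Reshape",
        inputs=[weight_name + "_deq", weight_name + "_shape"],
        outputs=[weight_name],
    )
    return inits, [dequant, reshape]
```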

Future directions

The initial goals set for GSoC have been successfully achieved. In discussions with my mentor, we explored several ideas for continuing this project beyond GSoC. While the focus so far has been on compressing model size to fit devices with limited memory, quantization also offers significant benefits for inference speed.

To further optimize quantized DNN inference, we plan to replace common operations with their integer counterparts, allowing computations to be performed without dequantizing the weights beforehand. This will require extending the quantization tool to quantize not only weights but also layer activation maps. Additionally, we aim to develop a graph simplification algorithm that identifies patterns in the computational graph and replaces them with integer-optimized operations.
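As a rough illustration of what such an integer-optimized operation could look like (a sketch of the idea, not a committed design), the snippet below performs a quantized matrix multiplication directly on int8 inputs, accumulating in int32 and requantizing the result; all names are hypothetical.

```python
import numpy as np

def int8_matmul(q_x, x_scale, x_zp, q_w, w_scale, w_zp, y_scale, y_zp):
    # Accumulate in int32 so products of int8 values cannot overflow.
    acc = (q_x.astype(np.int32) - x_zp) @ (q_w.astype(np.int32) - w_zp)
    # Requantize: fold the input and weight scales into the output scale.
    y = acc * (x_scale * w_scale / y_scale) + y_zp
    return np.clip(np.round(y), -128, 127).astype(np.int8)
```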

Pull Request links

Before GSoC24:

During GSoC24:
