This guide provides detailed instructions for optimizing the microsoft/trocr-base-printed model (https://huggingface.co/microsoft/trocr-base-printed) with OpenVINO.
To prepare your environment for model optimization and inference:
sudo apt update
sudo apt install git-lfs -y
python3 -m venv openvino-env
source openvino-env/bin/activate
pip install --upgrade pip
python -m pip install "optimum-intel[openvino]@git+https://github.com/huggingface/optimum-intel.git"
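To verify the installation, a quick import check can be run. This is a minimal sketch; it only confirms that the packages installed above resolve and lists the devices OpenVINO detects:
# Sanity check: confirm the packages import and list available devices
import openvino as ov
from optimum.intel import OVModelForVision2Seq

print("OpenVINO:", ov.get_version())
print("Devices:", ov.Core().available_devices)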
Optimize your Hugging Face models for inference with the OpenVINO runtime by replacing the standard transformers model classes with the corresponding OpenVINO classes. For example, AutoModelForXxx becomes OVModelForXxx. For TrOCR, use OVModelForVision2Seq as shown below:
from transformers import TrOCRProcessor
from optimum.intel.openvino import OVModelForVision2Seq
from PIL import Image
import requests

# Load a sample handwritten image from the IAM database
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-small-handwritten')
# export=True converts the model to OpenVINO IR on the fly
model = OVModelForVision2Seq.from_pretrained('microsoft/trocr-small-handwritten', export=True)

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
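Re-exporting on every load adds startup time. Since OpenVINO model classes support the standard save_pretrained API, the converted model can be saved once and reloaded directly. A minimal sketch, with './ov_trocr' as an example path:
# Save the converted OpenVINO model so future loads can skip the export step
model.save_pretrained('./ov_trocr')
processor.save_pretrained('./ov_trocr')

# Reload later without re-exporting
model = OVModelForVision2Seq.from_pretrained('./ov_trocr')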
For 8-bit quantization during model loading, set load_in_8bit=True when calling from_pretrained():
model = OVModelForVision2Seq.from_pretrained('microsoft/trocr-small-handwritten', load_in_8bit=True, export=True)
NOTE: The load_in_8bit option is enabled by default for models with more than 1 billion parameters and can be disabled with load_in_8bit=False.
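Beyond the load_in_8bit shortcut, recent optimum-intel releases also accept an explicit quantization configuration object, which leaves room for additional compression parameters. A sketch under that assumption:
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForVision2Seq

# Explicit 8-bit weight compression, equivalent to load_in_8bit=True
q_config = OVWeightQuantizationConfig(bits=8)
model = OVModelForVision2Seq.from_pretrained(
    'microsoft/trocr-small-handwritten',
    export=True,
    quantization_config=q_config,
)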
Use the Optimum Intel CLI to export models from Hugging Face to OpenVINO IR with various levels of weight compression:
optimum-cli export openvino --model MODEL_ID --weight-format WEIGHT_FORMAT EXPORT_PATH
Replace the placeholders appropriately:
- MODEL_ID: ID of the Hugging Face model.
- WEIGHT_FORMAT: desired weight format; one of {fp32,fp16,int8,int4,int4_sym_g128,int4_asym_g128,int4_sym_g64,int4_asym_g64}. Refer to the Optimum Intel documentation for more details.
- EXPORT_PATH: directory path for storing the exported OpenVINO model.
- --ratio RATIO: (default: 0.8) compression ratio between primary and backup precision. In the case of INT4, NNCF evaluates layer sensitivity and keeps the most impactful layers in INT8 precision (20% in INT8 by default), which helps preserve accuracy after weight compression.
To see complete usage, execute:
optimum-cli export openvino -h
Example commands to export microsoft/trocr-base-printed with different precision formats (FP16, INT8, and INT4):
optimum-cli export openvino --model microsoft/trocr-base-printed --weight-format fp16 ov_model_fp16
optimum-cli export openvino --model microsoft/trocr-base-printed --weight-format int8 ov_model_int8
optimum-cli export openvino --model microsoft/trocr-base-printed --weight-format int4 ov_model_int4
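A quick way to compare the disk footprint of the exported variants (a simple shell check; the directory names match the export commands above):
du -sh ov_model_fp16 ov_model_int8 ov_model_int4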
After conversion, pass the converted model path as model_id when calling from_pretrained(). You can also select your target device (CPU, GPU, or MULTI:CPU,GPU) through the device argument of that method. See the documentation for other supported device options: AUTO, HETERO, BATCH.
from transformers import TrOCRProcessor
from optimum.intel.openvino import OVModelForVision2Seq
from PIL import Image
import requests

url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

device = "CPU"
# Cache compiled models to speed up subsequent loads
ov_config = {"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": "./ov_cache"}

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
# export=False because the model was already converted by the CLI above
model = OVModelForVision2Seq.from_pretrained(model_id='./ov_model_int8', device=device, ov_config=ov_config, export=False)

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
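To gauge the effect of the different weight formats, a rough latency check can be wrapped around generate(). A minimal sketch that reuses the model and pixel_values from the example above; it is not a rigorous benchmark:
import time

# Warm-up run so one-time compilation does not skew the numbers
model.generate(pixel_values)

runs = 10
start = time.perf_counter()
for _ in range(runs):
    model.generate(pixel_values)
print(f"Average generate() latency: {(time.perf_counter() - start) / runs:.3f} s")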