This benchmark gives an idea of the relative throughput of the models. I could not tell from the names alone which would be fastest.
This is only a speed test. Larger models generally score better on evaluation benchmarks at the cost of speed. Find the models that meet your throughput requirements, then benchmark those for quality on the task you are doing.
Tested on an NVIDIA RTX 3090. The CPU is an AMD Ryzen 9 7950X, though that should not affect the benchmark much.
Images are pre-loaded into memory. The benchmark measures preprocessing plus encoding, one image at a time (batch size 1):
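```python
import time
import torch

# images, clip_model, clip_preprocess, device, arch, and pretrained are set up beforehand.
t0 = time.time()
# Preprocess each image into a batch of one and move it to the device.
preprocd = []
for image in images:
    preprocd.append(clip_preprocess(image).unsqueeze(0).to(device))
with torch.no_grad():
    embeds = []
    for image in preprocd:
        # Copying the result to the CPU blocks until the GPU work is done,
        # so the elapsed time includes the encode itself.
        embeds.append(clip_model.encode_image(image).to("cpu"))
t1 = time.time()
print(arch, pretrained, len(images) / (t1 - t0))
```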
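The timing loop above can be wrapped in a small reusable helper. The following is a minimal sketch, not the script that produced these numbers: `measure_throughput`, its `warmup` parameter, and the stand-in callables are hypothetical names introduced for illustration. The optional warm-up pass is an assumption about good practice, since the first calls on a GPU often include one-time initialization cost.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def measure_throughput(
    items: Sequence[T],
    preprocess: Callable[[T], U],
    encode: Callable[[U], object],
    warmup: int = 0,
) -> float:
    """Return items per second for preprocess + encode, one item at a time."""
    if not items:
        raise ValueError("items must not be empty")
    # Optional untimed warm-up pass (hypothetical addition, not in the original loop).
    for item in items[:warmup]:
        encode(preprocess(item))
    t0 = time.perf_counter()
    preprocessed = [preprocess(item) for item in items]
    outputs = [encode(x) for x in preprocessed]
    t1 = time.perf_counter()
    assert len(outputs) == len(items)
    return len(items) / (t1 - t0)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without a GPU or model weights.
    fake_images = list(range(1000))
    rate = measure_throughput(fake_images, lambda x: x * 2, lambda x: x + 1, warmup=10)
    print(f"{rate:.1f} items per second")
```

With the setup from the benchmark, the call would presumably pass `lambda im: clip_preprocess(im).unsqueeze(0).to(device)` as `preprocess` and `lambda x: clip_model.encode_image(x).to("cpu")` as `encode`, run inside `torch.no_grad()`.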

Mean images per second for each architecture, averaged over its pretrained weights:

Architecture | Images Per Second |
---|---|
ViT-B-32 | 170.317 |
nllb-clip-base | 166.847 |
xlm-roberta-base-ViT-B-32 | 163.626 |
ViT-B-32-256 | 161.603 |
ViT-B-32-quickgelu | 161.004 |
roberta-ViT-B-32 | 159.02 |
coca_ViT-B-32 | 156.964 |
ViT-B-16-SigLIP | 152.068 |
ViT-B-16 | 150.31 |
RN50-quickgelu | 147.837 |
ViT-B-16-quickgelu | 145.091 |
ViT-B-16-SigLIP-256 | 141.253 |
ViT-B-16-SigLIP-i18n-256 | 140.578 |
RN50 | 129.033 |
ViT-B-16-plus-240 | 126.04 |
RN50x4 | 110.681 |
EVA02-B-16 | 100.629 |
RN101-quickgelu | 93.8347 |
convnext_base | 93.7734 |
RN101 | 92.2537 |
convnext_base_w | 90.138 |
convnext_base_w_320 | 89.1247 |
convnext_large_d | 83.5527 |
ViT-B-16-SigLIP-384 | 79.4187 |
RN50x16 | 65.2571 |
convnext_large_d_320 | 60.0695 |
ViT-L-16-SigLIP-256 | 59.8122 |
ViT-L-14-CLIPA | 57.4581 |
ViT-L-14 | 54.3296 |
coca_ViT-L-14 | 53.7369 |
ViT-L-14-quickgelu | 52.6421 |
ViT-SO400M-14-SigLIP | 50.0526 |
EVA02-L-14 | 43.3144 |
ViT-B-16-SigLIP-512 | 42.4944 |
ViT-H-14-CLIPA | 31.9688 |
ViT-L-14-CLIPA-336 | 31.7083 |
convnext_xxlarge | 31.5165 |
ViT-L-16-SigLIP-384 | 30.9679 |
xlm-roberta-large-ViT-H-14 | 30.9487 |
nllb-clip-large | 30.9176 |
ViT-H-14 | 30.8775 |
ViT-H-14-quickgelu | 30.0928 |
ViT-L-14-336 | 29.4929 |
EVA02-L-14-336 | 26.5657 |
RN50x64 | 26.1571 |
EVA01-g-14 | 20.4507 |
EVA01-g-14-plus | 20.4077 |
ViT-g-14 | 20.0696 |
ViT-SO400M-14-SigLIP-384 | 17.6165 |
ViT-H-14-CLIPA-336 | 16.354 |
ViT-bigG-14-CLIPA | 12.1527 |
ViT-bigG-14 | 12.0347 |
ViT-bigG-14-CLIPA-336 | 6.52044 |
EVA02-E-14-plus | 5.52984 |
EVA02-E-14 | 5.5032 |

Images per second for each combination of architecture and pretrained weights:

Architecture | Pretrained | Images Per Second |
---|---|---|
RN50 | openai | 107.945 |
RN50 | yfcc15m | 139.889 |
RN50 | cc12m | 139.266 |
RN50-quickgelu | openai | 164.015 |
RN50-quickgelu | yfcc15m | 139.801 |
RN50-quickgelu | cc12m | 139.694 |
RN101 | openai | 100.867 |
RN101 | yfcc15m | 83.64 |
RN101-quickgelu | openai | 102.801 |
RN101-quickgelu | yfcc15m | 84.868 |
RN50x4 | openai | 110.681 |
RN50x16 | openai | 65.2571 |
RN50x64 | openai | 26.1571 |
ViT-B-32 | openai | 163.519 |
ViT-B-32 | laion400m_e31 | 171.204 |
ViT-B-32 | laion400m_e32 | 170.896 |
ViT-B-32 | laion2b_e16 | 169.895 |
ViT-B-32 | laion2b_s34b_b79k | 173.671 |
ViT-B-32 | datacomp_xl_s13b_b90k | 171.88 |
ViT-B-32 | datacomp_m_s128m_b4k | 172.488 |
ViT-B-32 | commonpool_m_clip_s128m_b4k | 169.047 |
ViT-B-32 | commonpool_m_laion_s128m_b4k | 170.736 |
ViT-B-32 | commonpool_m_image_s128m_b4k | 172.236 |
ViT-B-32 | commonpool_m_text_s128m_b4k | 173.43 |
ViT-B-32 | commonpool_m_basic_s128m_b4k | 168.535 |
ViT-B-32 | commonpool_m_s128m_b4k | 170.343 |
ViT-B-32 | datacomp_s_s13m_b4k | 170.372 |
ViT-B-32 | commonpool_s_clip_s13m_b4k | 170.721 |
ViT-B-32 | commonpool_s_laion_s13m_b4k | 170.097 |
ViT-B-32 | commonpool_s_image_s13m_b4k | 170.926 |
ViT-B-32 | commonpool_s_text_s13m_b4k | 169.276 |
ViT-B-32 | commonpool_s_basic_s13m_b4k | 169.176 |
ViT-B-32 | commonpool_s_s13m_b4k | 167.898 |
ViT-B-32-256 | datacomp_s34b_b86k | 161.603 |
ViT-B-32-quickgelu | openai | 163.239 |
ViT-B-32-quickgelu | laion400m_e31 | 159.452 |
ViT-B-32-quickgelu | laion400m_e32 | 161.956 |
ViT-B-32-quickgelu | metaclip_400m | 159.77 |
ViT-B-32-quickgelu | metaclip_fullcc | 160.604 |
ViT-B-16 | openai | 148.401 |
ViT-B-16 | laion400m_e31 | 151.745 |
ViT-B-16 | laion400m_e32 | 151.355 |
ViT-B-16 | laion2b_s34b_b88k | 147.765 |
ViT-B-16 | datacomp_xl_s13b_b90k | 148.236 |
ViT-B-16 | datacomp_l_s1b_b8k | 152.87 |
ViT-B-16 | commonpool_l_clip_s1b_b8k | 150.399 |
ViT-B-16 | commonpool_l_laion_s1b_b8k | 149.891 |
ViT-B-16 | commonpool_l_image_s1b_b8k | 150.263 |
ViT-B-16 | commonpool_l_text_s1b_b8k | 151.263 |
ViT-B-16 | commonpool_l_basic_s1b_b8k | 151.263 |
ViT-B-16 | commonpool_l_s1b_b8k | 150.274 |
ViT-B-16-quickgelu | metaclip_400m | 145.243 |
ViT-B-16-quickgelu | metaclip_fullcc | 144.938 |
ViT-B-16-plus-240 | laion400m_e31 | 125.802 |
ViT-B-16-plus-240 | laion400m_e32 | 126.279 |
ViT-L-14 | openai | 53.6898 |
ViT-L-14 | laion400m_e31 | 54.5316 |
ViT-L-14 | laion400m_e32 | 54.1111 |
ViT-L-14 | laion2b_s32b_b82k | 54.339 |
ViT-L-14 | datacomp_xl_s13b_b90k | 54.6971 |
ViT-L-14 | commonpool_xl_clip_s13b_b90k | 54.8216 |
ViT-L-14 | commonpool_xl_laion_s13b_b90k | 54.2918 |
ViT-L-14 | commonpool_xl_s13b_b90k | 54.155 |
ViT-L-14-quickgelu | metaclip_400m | 52.2084 |
ViT-L-14-quickgelu | metaclip_fullcc | 53.0757 |
ViT-L-14-336 | openai | 29.4929 |
ViT-H-14 | laion2b_s32b_b79k | 30.8775 |
ViT-H-14-quickgelu | metaclip_fullcc | 30.0928 |
ViT-g-14 | laion2b_s12b_b42k | 20.0461 |
ViT-g-14 | laion2b_s34b_b88k | 20.093 |
ViT-bigG-14 | laion2b_s39b_b160k | 12.0347 |
roberta-ViT-B-32 | laion2b_s12b_b32k | 159.02 |
xlm-roberta-base-ViT-B-32 | laion5b_s13b_b90k | 163.626 |
xlm-roberta-large-ViT-H-14 | frozen_laion5b_s13b_b90k | 30.9487 |
convnext_base | laion400m_s13b_b51k | 93.7734 |
convnext_base_w | laion2b_s13b_b82k | 88.6564 |
convnext_base_w | laion2b_s13b_b82k_augreg | 90.0901 |
convnext_base_w | laion_aesthetic_s13b_b82k | 91.6674 |
convnext_base_w_320 | laion_aesthetic_s13b_b82k | 89.222 |
convnext_base_w_320 | laion_aesthetic_s13b_b82k_augreg | 89.0274 |
convnext_large_d | laion2b_s26b_b102k_augreg | 83.5527 |
convnext_large_d_320 | laion2b_s29b_b131k_ft | 60.3519 |
convnext_large_d_320 | laion2b_s29b_b131k_ft_soup | 59.7872 |
convnext_xxlarge | laion2b_s34b_b82k_augreg | 31.4495 |
convnext_xxlarge | laion2b_s34b_b82k_augreg_rewind | 31.504 |
convnext_xxlarge | laion2b_s34b_b82k_augreg_soup | 31.5961 |
coca_ViT-B-32 | laion2b_s13b_b90k | 156.336 |
coca_ViT-B-32 | mscoco_finetuned_laion2b_s13b_b90k | 157.592 |
coca_ViT-L-14 | laion2b_s13b_b90k | 53.6466 |
coca_ViT-L-14 | mscoco_finetuned_laion2b_s13b_b90k | 53.8271 |
EVA01-g-14 | laion400m_s11b_b41k | 20.4507 |
EVA01-g-14-plus | merged2b_s11b_b114k | 20.4077 |
EVA02-B-16 | merged2b_s8b_b131k | 100.629 |
EVA02-L-14 | merged2b_s4b_b131k | 43.3144 |
EVA02-L-14-336 | merged2b_s6b_b61k | 26.5657 |
EVA02-E-14 | laion2b_s4b_b115k | 5.5032 |
EVA02-E-14-plus | laion2b_s9b_b144k | 5.52984 |
ViT-B-16-SigLIP | webli | 152.068 |
ViT-B-16-SigLIP-256 | webli | 141.253 |
ViT-B-16-SigLIP-i18n-256 | webli | 140.578 |
ViT-B-16-SigLIP-384 | webli | 79.4187 |
ViT-B-16-SigLIP-512 | webli | 42.4944 |
ViT-L-16-SigLIP-256 | webli | 59.8122 |
ViT-L-16-SigLIP-384 | webli | 30.9679 |
ViT-SO400M-14-SigLIP | webli | 50.0526 |
ViT-SO400M-14-SigLIP-384 | webli | 17.6165 |
ViT-L-14-CLIPA | datacomp1b | 57.4581 |
ViT-L-14-CLIPA-336 | datacomp1b | 31.7083 |
ViT-H-14-CLIPA | datacomp1b | 31.9688 |
ViT-H-14-CLIPA-336 | laion2b | 16.3764 |
ViT-H-14-CLIPA-336 | datacomp1b | 16.3315 |
ViT-bigG-14-CLIPA | datacomp1b | 12.1527 |
ViT-bigG-14-CLIPA-336 | datacomp1b | 6.52044 |
nllb-clip-base | v1 | 166.847 |
nllb-clip-large | v1 | 30.9176 |
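
To apply the advice from the top of this page (shortlist by throughput first, then evaluate quality), the per-weights rows can be reduced to a per-architecture mean and filtered against a target rate. This is a minimal sketch, assuming a few rows copied by hand from the table; `mean_by_architecture`, `meets_requirement`, and the threshold of 100 images per second are hypothetical and only for illustration.

```python
from collections import defaultdict
from statistics import mean

# A few rows copied from the table above: (architecture, pretrained, images per second).
ROWS = [
    ("RN50", "openai", 107.945),
    ("RN50", "yfcc15m", 139.889),
    ("RN50", "cc12m", 139.266),
    ("ViT-B-16", "openai", 148.401),
    ("ViT-L-14", "openai", 53.6898),
    ("ViT-L-14", "laion2b_s32b_b82k", 54.339),
]


def mean_by_architecture(rows):
    """Average images per second over all listed weights of each architecture."""
    grouped = defaultdict(list)
    for arch, _pretrained, ips in rows:
        grouped[arch].append(ips)
    return {arch: mean(values) for arch, values in grouped.items()}


def meets_requirement(rows, min_images_per_second):
    """Return (architecture, mean rate) pairs that reach the target, fastest first."""
    means = mean_by_architecture(rows)
    fast_enough = [(arch, ips) for arch, ips in means.items() if ips >= min_images_per_second]
    return sorted(fast_enough, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # 100 images per second is an arbitrary example threshold.
    for arch, ips in meets_requirement(ROWS, 100.0):
        print(f"{arch}: {ips:.3f}")
```

Only the RN50 rows are complete in this excerpt, so only its mean lines up with the per-architecture table; the other architectures would need all of their rows copied in.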