This report ranks ten open‑weight, self‑hostable multimodal models that are especially strong for (a) systems engineering tasks (design reviews, debugging, log/trace reasoning, incident triage, reading diagrams/docs/UI screenshots, tool-using agents) and (b) Rust programming (ownership/lifetimes, unsafe correctness, concurrency, performance engineering, refactors with tests). The ranking emphasizes coding + agentic SWE benchmarks (notably SWE‑bench Verified, LiveCodeBench v6, and HumanEval) where officially reported, and treats Rust‑specific scores as mostly unspecified because few vendors publish them directly. In the absence of official Rust scores, this report recommends a local Rust evaluation harness based on MultiPL‑E (Rust translations of HumanEval/MBPP) combined with cargo test, Clippy, Miri, Loom, and criterion.
Top picks by deployment tier:
- Best overall “frontier open‑weight agentic coder” (huge, cluster‑class): Kimi K2.5 from Moonshot AI — standout engineering/agentic scores (e.g., SWE‑bench Verified 76.8, LiveCodeBench v6 85.0) plus native multimodal + tool workflows; video chat is limited locally (officially “experimental” and API-only).
- Best cost/performance “SWE agent” you can realistically self‑host on a single high‑VRAM GPU: Qwen3.5‑27B (Apache‑2.0) from Alibaba Cloud — strong across SWE‑bench Verified (72.4), LiveCodeBench v6 (80.7), and agent benchmarks, while remaining far cheaper to run than 200B–1T class MoE models.
- Best “multimodal + coding + audio on device” (edge‑friendly small variants): Gemma 4 from Google — a family spanning small on-device “Any‑to‑Any” models with native audio and bigger vision+text models, with LiveCodeBench v6 up to 80.0 and native function calling.
- Best “permissive license + solid code + strong doc/diagram VLM” (mid‑scale): Pixtral‑12B‑2409 from Mistral AI — Apache‑2.0, long context (128k), and published HumanEval pass@1 72.0 in its model card.
Key caveat about Rust: Vendors almost never publish “Rust‑only” leaderboards. As a result, Rust capability must be validated locally using execution‑based tests (compile + unit tests) and concurrency/UB tooling (Loom/Miri). This report provides a concrete evaluation suite later.
A model is included in the ranked top ten only if it satisfies:
- Open weights available (downloadable) and self-hostable (not strictly API-only). Evidence comes primarily from official model cards and/or official repositories.
- Multimodal capability (at least text+image, optionally video/audio).
- Documented or inferable suitability for programming + systems engineering (benchmarks, tool calling, agent framing, or strong community usage in dev workflows).
Each model is scored on a 0–10 scale per dimension, then weighted:
- Programming & systems engineering capability (35%) Signals: SWE‑bench Verified, LiveCodeBench v6, TerminalBench, HumanEval, MBPP, codeforces-style metrics, plus evidence of agentic coding frameworks.
- Multimodal systems usefulness (20%) Signals: doc/OCR/layout/diagram competence, GUI/desktop agent claims, long-video understanding, structured extraction support.
- Self-hosting efficiency & scalability (25%) Signals: official deployment guidance (tensor parallel, recommended engines), context length practicality, quantization availability, and feasible hardware footprint. If latency is not specified by sources, it is marked unspecified.
- Tool-use readiness (10%) Signals: explicit function calling/tool calling support, structured outputs, OpenAI-compatible serving patterns.
- License & ecosystem (10%) Signals: permissive license, documented usage, integration with mainstream serving stacks (Transformers, vLLM, SGLang, LMDeploy), scope of community fine-tunes/quantizations.
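To make the rubric concrete, the sketch below shows how the weighted composite is computed (weights and the 0–10 scale come from the list above; the per-dimension scores in `main` are purely hypothetical):

```rust
/// Weights from the rubric above (must sum to 1.0).
const WEIGHTS: [(&str, f64); 5] = [
    ("programming_systems", 0.35),
    ("multimodal_usefulness", 0.20),
    ("self_hosting", 0.25),
    ("tool_use", 0.10),
    ("license_ecosystem", 0.10),
];

/// Combine 0-10 per-dimension scores into a weighted composite.
/// `scores` must be in the same order as `WEIGHTS`.
fn composite(scores: [f64; 5]) -> f64 {
    WEIGHTS
        .iter()
        .zip(scores.iter())
        .map(|((_, weight), score)| weight * score)
        .sum()
}

fn main() {
    // Hypothetical model: 9, 7, 6, 8, 8 -> 0.35*9 + 0.20*7 + 0.25*6 + 0.10*8 + 0.10*8 = 7.65
    let example = [9.0, 7.0, 6.0, 8.0, 8.0];
    println!("composite = {:.2}", composite(example));
}
```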
- SWE‑bench evaluates repository-level issue resolution (Python repos), but it is still one of the best proxies for “real SWE work”: multi-file edits, tests, tool use, and long-context reasoning.
- LiveCodeBench is designed to reduce contamination and covers broader coding abilities beyond code generation (self-repair and execution/test prediction).
- HumanEval / MBPP are unit-test driven code generation benchmarks (mostly Python), useful but older; Rust relevance is best obtained via translation frameworks.
- MultiPL‑E provides HumanEval/MBPP problems translated into many languages, and its docs explicitly show how to evaluate Rust (`--lang rs`) with containerized execution. This is the preferred Rust-oriented benchmark base in this report.
The ranking below targets potency for systems engineering + Rust under the rubric above. For every model, Rust-specific benchmark scores are marked “unspecified” unless an official source reports them (rare); instead, recommended Rust tests are provided later.
- Kimi‑K2.5 (MoE, native INT4, agentic coding leader)
- Qwen3.5‑27B (excellent SWE/agent scores at a realistic self-host scale)
- Gemma 4 (31B / 26B‑A4B / E4B/E2B) (strong coding + multimodal; audio on small variants)
- Llama 4 Maverick (strong multimodality + big MoE capacity; license constraints)
- Qwen3‑VL‑235B‑A22B (flagship VLM; very large)
- Qwen2.5‑VL‑72B‑Instruct (strong doc/video/agent VLM; code benchmarks unspecified in card)
- InternVL3.5‑30B‑A3B (strong multimodal/agentic suite; clear deployment guidance)
- Pixtral‑12B‑2409 (Apache‑2.0; strong HumanEval; long context; good “daily driver” VLM)
- DeepSeek‑VL2 (small/large variants) (MoE VLM; explicit VRAM guidance; shorter context)
- GLM‑4.6V‑Flash (local-friendly VLM with native multimodal function calling; text-only weaker per authors)
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/moonshotai/Kimi-K2.5
GitHub repo: https://github.com/MoonshotAI/Kimi-K2.5
vLLM recipe: https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html
- License: Modified MIT.
- Modalities: Text + image (native multimodality). “Chat with video content” is explicitly noted as experimental and officially API-only at the time of the model card; treat local video as unsupported/unspecified.
- Size & architecture: MoE, 1T total params, 32B activated, 384 experts, 8 selected experts/token, 256K context, vision encoder MoonViT (400M).
- Training data notes: Continual pretraining on ~15T mixed visual + text tokens (composition otherwise unspecified).
- Instruction-tuning / fine-tuning: The HF page shows a large ecosystem of fine-tunes and quantizations; instruction-tuned base is provided.
- Programming/SWE benchmarks (officially reported):
- SWE‑bench Verified 76.8
- TerminalBench 2.0 50.8
- LiveCodeBench v6 85.0 (Many other evals are listed in the official table; Rust-specific: unspecified.)
- Community / third‑party evaluations: Strong adoption signals via many Spaces and downstream derivatives; treat as “high community activity” rather than a quality guarantee.
- Self-hosting requirements:
- Official guidance emphasizes vLLM/SGLang/KTransformers.
- Tensor-parallel multi-GPU is effectively required for full model; a community deployment note recommends TP=8 and highlights native INT4.
- Latency: unspecified (depends heavily on GPU class and concurrency).
- Quantization & optimization: Native INT4 quantization is part of the official release narrative.
- Tool / plugin support: The model card includes “Interleaved Thinking and Multi‑Step Tool Call” and a “Coding Agent Framework.” For tool calling infrastructure, use vLLM’s tool-calling + structured outputs when serving.
- Security/privacy considerations: Fully local inference is possible (weights self-hosted), but tool execution must be sandboxed (see evaluation section).
- Recommended deployment config (cost/perf):
- Cluster: vLLM or SGLang with multi-GPU tensor parallel (TP8 recommended in community guidance).
- Due to model scale, prioritize this model if you truly need frontier-level agentic coding in an on-prem cluster; otherwise Qwen3.5/Gemma 4 may deliver much better cost/perf.
- Best-in-class reported open-weight agentic coding metrics (SWE/LCB/TerminalBench) among listed models.
- Native multimodality + explicit agent/tool workflows in official docs.
- Very large; multi-GPU serving is effectively mandatory.
- Video chat is not a reliable local feature per the model card (API-only experimental).
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/Qwen/Qwen3.5-27B
- License: Apache‑2.0.
- Modalities: Image‑Text‑to‑Text (vision encoder + causal LM). Audio/video: unspecified/not indicated in the model card.
- Size & architecture: 27B causal LM with vision encoder; long context 262,144 (extensible to ~1,010,000). The model card states a hybrid layout using Gated DeltaNet + Gated Attention blocks and provides head/layer dimensions.
- Training data notes: The card describes “multimodal learning” and large-scale RL but does not fully disclose dataset composition; treat as mixed/unspecified.
- Instruction-tuning / fine-tuning: Post-trained model is provided; compatibility noted for Transformers, vLLM, SGLang, KTransformers.
- Programming/SWE benchmarks (official):
- SWE‑bench Verified 72.4
- LiveCodeBench v6 80.7
- Terminal Bench 2 41.6
- CodeForces rating 1899
- Rust-specific: unspecified.
- Community evaluation: Strong HF activity and explicit agentic usage sections (“Qwen‑Agent”, “Qwen Code”) indicate ecosystem orientation.
- Self-hosting requirements: The model card does not provide a single “VRAM minimum.” Treat latency as unspecified; recommended engines include vLLM/SGLang.
- Quantization & optimization: Not explicitly enumerated in the excerpted lines, but the model is designed for high throughput; use standard HF/serving quantization stacks (AWQ/GPTQ/bitsandbytes) where available. Transformers documents quantization support broadly (AWQ/GPTQ/bnb).
- Tool/plugin support: The model’s benchmark table includes explicit “agent” and “search agent” evaluations (BFCL, TAU2-Bench, BrowseComp), aligning with tool use; serve with vLLM tool calling + structured outputs.
- Security/privacy: Well-suited to on-prem; still requires strict sandboxing for code execution.
Recommended deployment config (cost/perf):
- Single high‑VRAM GPU (best practical default): vLLM OpenAI-compatible server; enable structured outputs for tools.
- If you need bigger concurrency/long context, scale horizontally with multiple replicas rather than pushing maximal context for every request.
- Near-frontier open-weight coding/agent performance at a dramatically lower serving cost than 200B+ MoE VLMs.
- Very long context (native 256k).
- No official Rust bench numbers; must validate with MultiPL‑E + local harness.
Official sources (repos/docs):
Gemma 4 model card: https://ai.google.dev/gemma/docs/core/model_card_4
Hugging Face blog overview: https://huggingface.co/blog/gemma4
- License: The model card describes the family as openly released; specific license text is not reproduced in this excerpt, so treat license as unspecified here and follow the official model card terms. (Benchmarks and architecture are taken from the official model card.)
- Modalities:
- All models: text + image.
- E2B/E4B: audio (ASR + speech translation), and video supported as frame sequences.
- Size & architecture: The model card provides detailed sizes:
- Dense: E2B (2.3B effective; 5.1B w/ embeddings), E4B (4.5B effective; 8B w/ embeddings), 31B dense (30.7B)
- MoE: 26B A4B (25.2B total, 3.8B active) with 8 active / 128 total experts (+ shared).
- Context: 128k (E2B/E4B) and 256k (26B/31B).
- Vision encoder parameters: ~150M (small) or ~550M (large). Audio encoder parameters: ~300M (small models only).
- Training data notes: The model card states training includes web documents, code, images, audio with cutoff January 2025, plus filtering notes (CSAM, sensitive data filtering).
- Instruction-tuning / fine-tuning: Instruction-tuned variants are benchmarked; examples show Transformers integration and “thinking mode.”
- Programming/SWE benchmarks (official):
- LiveCodeBench v6: 80.0 (31B), 77.1 (26B A4B), 52.0 (E4B), 44.0 (E2B)
- Codeforces ELO: 2150 (31B), 1718 (26B A4B), etc. (HumanEval/MBPP/SWE-bench not listed in the excerpted Gemma 4 tables; mark unspecified.)
- Self-hosting requirements: The model card positions smaller models for on-device use and larger for consumer GPUs/workstations; exact VRAM/latency not specified.
- Quantization & optimization: The HF blog credits ecosystem integrations (llama.cpp, mistral.rs, etc.), but exact quantization formats are not specified in these excerpts; treat as unspecified and rely on downstream quantization projects where available.
- Tool/plugin support: The model card explicitly notes native function calling support. vLLM also supports tool calling and structured outputs at the serving layer.
- Security/privacy: Audio input raises additional privacy concerns (recordings); keep transcripts local, disable network during tool execution, and store only minimal logs.
Recommended deployment config (cost/perf):
- For “real work” coding + multimodal: prioritize Gemma 4 26B A4B (MoE active 3.8B) or 31B if you can afford the VRAM.
- For speech workflows: use E4B/E2B on-device; keep audio inputs under the stated maximum length (30s).
- Broad modality coverage (including audio on small variants) and strong coding metrics (LiveCodeBench v6 near 80).
- Clear architectural transparency in the official model card (active vs total params, encoder sizes).
- SWE‑bench/HumanEval/MBPP not listed in the cited official tables; Rust performance requires local measurement.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct
Meta release blog: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
- License: Llama 4 Community License (custom).
- Modalities: Text + image input; text + code output. Video/audio: unspecified in the model card.
- Size & architecture: MoE with 17B activated params, 400B total, 128 experts; context length 1M; ~22T training tokens; knowledge cutoff Aug 2024.
- Training data notes: Mix of public, licensed, and Meta product/service data (including publicly shared posts/interactions), as stated in the model card.
- Instruction-tuning / fine-tuning: Instruct-tuned model provided; future versions may be released.
- Programming/SWE benchmarks (official):
- LiveCodeBench (10/01/2024–02/01/2025) pass@1: 43.4 (Maverick instruct)
- MBPP pass@1 (pretrained): 77.6
- Rust-specific: unspecified.
- Self-hosting requirements:
- Quantization section states Maverick is released as BF16 and FP8; FP8 weights fit on a “single H100 DGX host” (multi-GPU system).
- Latency unspecified.
- Quantization & optimization: FP8 weights provided; on-the-fly int4 quantization code is referenced in the model card.
- Tool/plugin support: Use serving-layer tool calling (vLLM) even if the model isn’t explicitly “function calling tuned.”
- Security/privacy: License restrictions may matter for commercial deployment; review acceptable use and compliance before production.
Recommended deployment config (cost/perf):
- When you need a capable multimodal MoE and can operate an 8-GPU node (e.g., DGX-class).
- For most teams, Qwen3.5 or Gemma 4 is likely better cost/perf unless you specifically want Llama ecosystem tooling and 1M context.
- Very large effective capacity with long context (1M).
- Officially published core coding benchmarks (LiveCodeBench, MBPP).
- License is not permissive OSS; compliance overhead may be non-trivial.
- Rust scores not published; validate locally.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
Qwen3 technical report (family context): https://arxiv.org/abs/2505.09388
Qwen3-VL GitHub: https://github.com/QwenLM/Qwen3-VL
- License: Apache‑2.0.
- Modalities: Text + image; the model card describes advanced video reasoning and long context, but the HF page itself is “Image‑Text‑to‑Text.” Treat video input as supported in the broader Qwen3‑VL stack but confirm in chosen runtime; audio is unspecified.
- Size & architecture: MoE; HF metadata shows ~236B params; A22B implies ~22B active (active count is not explicitly stated in this model card excerpt, so treat “active params” as unspecified here). Context: 256K native, “expandable to 1M.”
- Training data notes: Not fully disclosed in the model card excerpt; treat as mixed/unspecified.
- Instruction-tuning / fine-tuning: Instruct checkpoint provided; Transformers usage requires recent versions.
- Benchmarks:
- The official HF “Model Performance” section is primarily graphical (tables embedded in images), so many numeric values are not extractable from text here; treat detailed numbers as unspecified unless you consult the images directly.
- For family-level coding evidence, Qwen3 technical report reports strong coding results for the flagship text model (e.g., LiveCodeBench v5). This is not the same as Qwen3‑VL; do not substitute those numbers for the VLM.
- Rust-specific: unspecified.
- Self-hosting requirements: Large MoE; expect multi‑GPU tensor parallel for high throughput; exact VRAM/latency unspecified.
- Quantization & optimization: FlashAttention recommended in the model card; quantized variants exist on the Hub (not enumerated here).
- Tool/plugin support: Model card emphasizes “agent interaction capabilities” and “visual agent”; serve with tool calling infrastructure.
Recommended deployment config (cost/perf):
- Choose this if you need frontier-scale multimodal (especially GUI/video/dynamics) and can operate multi‑GPU infrastructure.
- If you primarily care about coding agent performance per dollar, Qwen3.5‑27B is typically the better first choice.
- Flagship VLM in the Qwen series with explicit “visual coding” and agent framing.
- Apache‑2.0 license.
- Many official benchmark numbers are presented as images; text-extractable metrics are incomplete here.
- Extremely large; likely expensive to run.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct
Qwen2.5-VL technical report: https://arxiv.org/abs/2502.13923
Qwen2.5-VL blog: https://qwenlm.github.io/blog/qwen2.5-vl/
- License: “qwen” (custom).
- Modalities: Image‑Text‑to‑Text; includes extensive image, video, and agent benchmarks in the official card. Audio: unspecified.
- Size & architecture: Dense 72B; model card describes dynamic resolution training, window attention in ViT, and long video processing. Vision encoder details: partially described (ViT with window attention), but exact vision parameter count is unspecified here.
- Training data notes: Technical report abstract describes multimodal training and long-video design; detailed corpus composition is not provided here → treat as mixed/unspecified.
- Programming/SWE benchmarks: The model card emphasizes multimodal/agent benchmarks and does not provide SWE‑bench/HumanEval/LiveCodeBench numbers in text; mark these as unspecified for this VLM card.
- Self-hosting requirements:
- Requires very recent Transformers builds; flash_attention_2 recommended for speed/memory.
- Latency unspecified.
- Quantization & optimization: An official AWQ variant exists for this model series (example: 72B‑Instruct‑AWQ) and the model card recommends FlashAttention2; quantization behavior deltas are not fully specified here.
- Tool/plugin support: The model is positioned as a “visual agent” for computer/phone use; serve with tool calling and structured outputs for robust agent loops.
Recommended deployment config (cost/perf):
- If you need document+diagram+video understanding and UI agent behaviors, this is a strong open-weight option.
- For Rust coding quality, pair with a local compilation/testing loop; benchmarks in the model card are not Rust-oriented.
- Strong published doc/OCR/video/agent results in the official model card.
- Tooling support (`qwen-vl-utils`, FlashAttention recommendation) eases production use.
- Coding benchmarks relevant to Rust (HumanEval/LiveCodeBench/SWE) are not given in the model card text.
Official sources (repos/docs):
Hugging Face model card (family): https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B
Hugging Face model card (30B-A3B-HF format): https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B-HF
vLLM recipe (InternVL3.5): https://docs.vllm.ai/projects/recipes/en/latest/InternVL/InternVL3_5.html
Paper: https://arxiv.org/abs/2508.18265
- License: Apache‑2.0 (per model card).
- Modalities: Image‑Text‑to‑Text; official model card includes evaluation categories spanning OCR/doc, video understanding, GUI tasks, and grounded/spatial reasoning.
- Size & architecture: The family card lists per-model vision+language parameters; for 30B‑A3B, the total is ~30.8B, with active parameters implied by the A3B label (exact routing details for this variant are not shown in the excerpt).
- Training data notes: The paper emphasizes Cascade RL and multimodal training; HF model card does not fully disclose dataset provenance → mixed/unspecified.
- Instruction-tuning / fine-tuning: The model card references multiple fine-tuning toolchains and provides LMDeploy deployment examples and OpenAI-style API compatibility.
- Programming/SWE benchmarks: The model card is heavily multimodal/agent oriented; Rust/code metrics like SWE‑bench/HumanEval/LiveCodeBench are not present in the excerpted text → mark unspecified.
- Self-hosting requirements (official guidance):
- “Models up to 30B can be deployed on a single A100 GPU,” 38B needs 2×A100, and 235B needs 8×A100 (family guidance).
- LMDeploy provides explicit tensor-parallel notes (e.g., tp=8 for 241B-A28B).
- Latency unspecified.
- Quantization & optimization: Examples include 8-bit loading; LMDeploy is explicitly presented as a compression/deployment toolkit.
- Tool/plugin support: Strong agent framing; OpenAI-compatible REST serving in LMDeploy supports building tool-using agents.
Recommended deployment config (cost/perf):
- For a balanced “multimodal systems engineer” assistant at manageable infrastructure cost, the 30B‑A3B class is a practical sweet spot (single A100-class GPU per official guidance).
- For Rust coding, run it in an agent loop with compile/test tools.
- Exceptionally complete multimodal evaluation and deployment documentation (LMDeploy service, tp guidance).
- Clear family parameter breakdown (vision vs language).
- No official Rust or code benchmark scores in the model card excerpt; must evaluate locally.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/mistralai/Pixtral-12B-2409
- License: Apache‑2.0.
- Modalities: Natively multimodal (image + text); audio/video not indicated.
- Size & architecture: 12B text decoder + 400M vision encoder, sequence length 128k.
- Training data notes: Trained with interleaved image/text data; detailed dataset composition unspecified.
- Programming/SWE benchmarks (official): HumanEval pass@1 72.0 is reported in the model card’s “Text Benchmarks.” Rust-specific: unspecified.
- Self-hosting requirements: vLLM is recommended; official examples include `vllm serve` and advice to lower model limits on low‑VRAM GPUs. Exact VRAM/latency not specified.
- Quantization & optimization: Not explicitly enumerated in the cited lines; vLLM is recommended for production inference.
- Tool/plugin support: Use vLLM tool calling + structured outputs for agent workflows.
- Security/privacy: The model card explicitly notes no built-in moderation mechanisms; treat as requiring additional safeguards in production.
Recommended deployment config (cost/perf):
- One of the best “daily driver” options if you want Apache‑2.0 + multimodal + strong coding baseline, without huge GPU fleets.
- Apache‑2.0, long context, and published HumanEval 72.0.
- Straightforward vLLM serving guidance in the model card.
- No official SWE‑bench / Rust-specific metrics; needs local agent loop with compilation/tests for Rust.
Official sources (repos/docs):
GitHub repo + paper PDF: https://github.com/deepseek-ai/deepseek-vl2
- License: Code is MIT; model use is subject to DeepSeek Model License; stated as commercial-use supported.
- Modalities: Vision-language (images + text). Audio/video unspecified in the repo excerpt.
- Size & architecture: MoE VLM family with activated parameters: 1.0B (tiny), 2.8B (small), 4.5B (vl2). The repo also mentions total MoE sizes (e.g., vl2‑tiny “3.37B‑MoE total”; vl2‑small “16.1B‑MoE total”; vl2 “27.5B‑MoE total”). Context/sequence length is 4096.
- Training data notes: Not detailed in the excerpt; treat as mixed/unspecified.
- Programming/SWE benchmarks: Not reported in the excerpt; Rust-specific: unspecified.
- Self-hosting requirements (explicit guidance):
- Repo notes you may need 80GB GPU memory for deepseek‑vl2‑small and larger, with incremental prefilling enabling vl2‑small within ~40GB at slower speed.
- Production serving is recommended via optimized stacks like vLLM/SGLang/LMDeploy (explicitly named).
- Quantization & optimization: Incremental prefilling is highlighted as a memory saver; other quantization formats not specified in excerpt.
- Tool/plugin support: Not explicitly “function calling tuned” in excerpt; implement tool calling at serving layer.
Recommended deployment config (cost/perf):
- If you want a smaller MoE VLM with explicit memory-saving tactics and can accept shorter context (4k), DeepSeek‑VL2‑small can be used with incremental prefilling.
- Clear, practical deployment guidance (80GB recommendation, incremental prefilling strategy).
- Smaller activated-parameter MoE variants can be efficient per token.
- Short context (4096) limits “big repo” and log-heavy systems engineering tasks unless you add retrieval/chunking.
- No official code/Rust benchmarks in excerpt; requires local evaluation.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/zai-org/GLM-4.6V-Flash
GitHub repository: https://github.com/zai-org/GLM-V
Blog: https://z.ai/blog/glm-4.6v
Paper: https://huggingface.co/papers/2507.01006
- License: MIT.
- Modalities: Image‑Text‑to‑Text; model card emphasizes multimodal document understanding and also references video tasks (via SGLang for “video tasks”).
- Size & architecture: “Flash” is labeled 9B in the model card narrative; HF metadata shows ~10B params. Context length trained to 128K.
- Training data notes: Not detailed in the model card excerpt; treat as mixed/unspecified.
- Programming/SWE benchmarks: Benchmarks are primarily presented as an image in the model card; numeric extraction is limited in text here → treat most as unspecified.
- Self-hosting requirements: Model card provides installation guidance for vLLM/SGLang and notes remaining issues; exact VRAM/latency unspecified.
- Quantization & optimization: The ecosystem contains GGUF conversions and many quantizations on HF; however, details are not in the official model card excerpt.
- Tool/plugin support: Native multimodal function calling is a core stated feature, meant to “close the loop” from perception to execution.
- Security/privacy: The model card explicitly acknowledges limitations and encourages issue reporting; for production, you must enforce sandboxing and authorization gates for any tool execution.
Recommended deployment config (cost/perf):
- A good “local multimodal agent scaffold” if you want MIT license + explicit multimodal function calling, and can tolerate weaker pure-text QA per authors.
- Explicit native multimodal function calling + agent loop orientation.
- Local-friendly “Flash” variant intended for low latency.
- Authors note pure text QA still needs improvement; treat it as a VLM/agent component rather than a top pure coder.
Key: “—” means unspecified in cited official sources (do not assume).
| Rank | Model | License | Modalities (in/out) | Params & architecture | Context | Key coding/SWE evidence | Self-host notes (official) |
|---|---|---|---|---|---|---|---|
| 1 | Kimi‑K2.5 | Modified MIT | in: text+image; out: text | MoE, 1T total / 32B active; MoonViT 400M | 256K | SWE Verified 76.8; LiveCodeBench v6 85.0; TerminalBench2 50.8 | vLLM/SGLang/KTransformers recommended; native INT4; TP8 commonly recommended |
| 2 | Qwen3.5‑27B | Apache‑2.0 | in: image+text; out: text | 27B w/ vision encoder; Gated DeltaNet/Gated Attention | 262K | SWE Verified 72.4; LiveCodeBench v6 80.7; TerminalBench2 41.6 | Compatible w/ Transformers/vLLM/SGLang/KTransformers |
| 3 | Gemma 4 (31B/26B‑A4B/E4B/E2B) | — | in: text+image (+audio for E2B/E4B); out: text | Dense + MoE; 26B A4B is 25.2B total/3.8B active; native function calling | 128K–256K | LiveCodeBench v6 up to 80.0; Codeforces ELO up to 2150 | Official best practices; audio max length 30s; video as frames |
| 4 | Llama 4 Maverick | Llama 4 Community License | in: text+image; out: text+code | MoE 17B active / 400B total; 128 experts | 1M | LiveCodeBench pass@1 43.4 (instruct); MBPP 77.6 (pretrain) | FP8 weights fit on single H100 DGX host (per card) |
| 5 | Qwen3‑VL‑235B‑A22B | Apache‑2.0 | in: image+text; out: text | MoE ~236B; “native 256K, expandable 1M” | 256K | Detailed numbers mostly image‑embedded; avoid guessing | Requires very recent Transformers; large-scale serving implied |
| 6 | Qwen2.5‑VL‑72B | qwen | in: image+video+text; out: text | Dense 72B VLM | — | Code benchmarks not in card text; multimodal/agent evals are extensive | FlashAttention2 recommended; official AWQ variant exists |
| 7 | InternVL3.5‑30B‑A3B | Apache‑2.0 | in: image+text; out: text | Family has vision+LM split; 30B class supported | — | Code/Rust metrics not listed; broad multimodal eval suite | Up to 30B deployable on single A100 (official); LMDeploy OpenAI-style API example |
| 8 | Pixtral‑12B‑2409 | Apache‑2.0 | in: image+text; out: text | 12B + vision encoder 400M | 128K | HumanEval pass@1 72.0 | vLLM (recommended) in card; no moderation built-in |
| 9 | DeepSeek‑VL2 | MIT (code) + model license | in: image+text; out: text | MoE family; activated 1.0B/2.8B/4.5B; total up to 27.5B; seq len 4096 | 4K | Rust/code metrics not listed in repo excerpt | 80GB GPU suggested for small+; incremental prefilling enables ~40GB for small (slower) |
| 10 | GLM‑4.6V‑Flash | MIT | in: image+text; out: text | Flash 9B class; 128K context; tool calling | 128K | Benchmarks mostly image‑embedded; text QA noted weaker | vLLM or SGLang recommended; explicit multimodal function calling |
This section provides a repeatable, local way to measure (a) Rust correctness and (b) systems-engineering reasoning with multimodal inputs, independent of vendor marketing. It combines benchmark translations + real toolchains.
MultiPL‑E translates HumanEval and MBPP into many languages and explicitly documents how to run Rust (`--lang rs`) and execute inside a locked-down container environment.
Minimal high-signal plan:
- Use MultiPL‑E Rust for: algorithmic correctness, idiomatic Rust, lifetimes, and function-level reasoning under unit tests.
- Use LiveCodeBench (if you can run it locally) for “fresh” problems and self-repair behavior; treat it as a supplement.
- Use SWE‑bench‑style tasks conceptually for Rust by creating an internal “Rust‑SWE mini” suite: small Rust repos with issues + tests + expected diffs. SWE‑bench’s design rationale explains why repo-level tasks matter.
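To make the MultiPL‑E-style setup concrete, each Rust task reduces to a function stub plus execution-based unit tests, graded purely by whether `cargo test` passes. The task below is a hypothetical illustration in that format (not taken from MultiPL‑E itself):

```rust
/// Hypothetical task in the MultiPL-E style: the model receives the signature
/// and doc comment, and must produce a body that passes the tests below.
/// Return the indices of the two numbers that add up to `target`, if any.
pub fn two_sum(nums: &[i64], target: i64) -> Option<(usize, usize)> {
    use std::collections::HashMap;
    let mut seen: HashMap<i64, usize> = HashMap::new();
    for (i, &n) in nums.iter().enumerate() {
        if let Some(&j) = seen.get(&(target - n)) {
            return Some((j, i));
        }
        seen.insert(n, i);
    }
    None
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn finds_a_pair() {
        assert_eq!(two_sum(&[2, 7, 11, 15], 9), Some((0, 1)));
    }

    #[test]
    fn returns_none_when_no_pair_exists() {
        assert_eq!(two_sum(&[1, 2, 3], 100), None);
    }
}
```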
Use a standardized agent loop: “model proposes patch → apply → run → report → iterate”.
Include these gates:
- `cargo test` with `RUSTFLAGS="-D warnings"` (treat warnings as failures).
- `cargo clippy -- -D warnings` (lint quality).
- `cargo miri test` for UB detection (unsafe, aliasing, stack borrows).
- Loom tests for concurrency correctness (state-space exploration of atomics/locks).
- `cargo bench` using criterion for performance regression tracking.
Why this matters: Many models can draft plausible Rust, but fewer can converge under strict tool feedback; this tends to separate “chatty codegen” from real engineering ability.
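A minimal sketch of the gate-running side of such a harness, assuming the model's patch has already been applied to a checkout at `./candidate-patch` (the path, gate list, and exit-code policy are illustrative choices, not a fixed interface):

```rust
use std::path::Path;
use std::process::Command;

/// Run one toolchain gate in `repo_dir` and report whether it passed.
fn run_gate(
    repo_dir: &Path,
    name: &str,
    program: &str,
    args: &[&str],
    env: &[(&str, &str)],
) -> bool {
    let mut cmd = Command::new(program);
    cmd.current_dir(repo_dir).args(args);
    for &(key, value) in env {
        cmd.env(key, value);
    }
    let status = cmd.status().expect("failed to spawn gate command");
    println!("[{}] {}", if status.success() { "PASS" } else { "FAIL" }, name);
    status.success()
}

fn main() {
    let repo = Path::new("./candidate-patch");
    // Gates from the list above. Loom tests follow the common `--cfg loom` convention
    // from loom's docs; Clippy, Miri, and criterion run as regular cargo invocations.
    let gates: &[(&str, &str, &[&str], &[(&str, &str)])] = &[
        ("cargo test (warnings as errors)", "cargo", &["test"], &[("RUSTFLAGS", "-D warnings")]),
        ("clippy", "cargo", &["clippy", "--", "-D", "warnings"], &[]),
        ("miri", "cargo", &["miri", "test"], &[]),
        ("loom model tests", "cargo", &["test", "--release"], &[("RUSTFLAGS", "--cfg loom")]),
        ("criterion benchmarks", "cargo", &["bench"], &[]),
    ];
    let mut all_passed = true;
    for &(name, program, args, env) in gates {
        all_passed &= run_gate(repo, name, program, args, env);
    }
    std::process::exit(if all_passed { 0 } else { 1 });
}
```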
Use consistent temperature and seeds across models. For each prompt, require the artifact plus a runnable test.
- Ownership/lifetimes: implement a zero-copy parser returning slices; verify no allocations using a tracking allocator.
- Unsafe correctness: write a small `unsafe` ring buffer and prove safety invariants; validate via Miri.
- Concurrency: implement a bounded MPSC queue; validate with Loom stress schedules (see the Loom sketch after this list).
- Performance: optimize a hot loop (SIMD optional) and benchmark with criterion; require explanation of cache behavior.
- Systems debugging: given stack traces + logs, produce a minimal reproducer and patch, then run tests.
- FFI boundary: wrap a C library safely and write property tests to ensure safe invariants.
- Async runtime: fix a deadlock in Tokio-based code; require a deterministic test.
- Error handling: convert error enums into a `thiserror`-based structure; ensure backtraces are preserved.
- API design: propose a crate-level API and produce docs + examples with doc tests.
- Multimodal: feed a screenshot of a failing CI log (or flamegraph) and ask for diagnosis + patch plan.
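For the concurrency prompt, the sketch below shows the shape of a Loom gate. It is a toy interleaving check (two threads incrementing a counter) rather than a full MPSC queue; the module and test names are arbitrary:

```rust
// Loom explores thread interleavings of code written against its shims.
// It is typically compiled in only when `--cfg loom` is set, e.g.:
//   RUSTFLAGS="--cfg loom" cargo test --release
#[cfg(loom)]
mod loom_tests {
    use loom::sync::atomic::{AtomicUsize, Ordering};
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn concurrent_increments_are_not_lost() {
        loom::model(|| {
            let counter = Arc::new(AtomicUsize::new(0));

            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let counter = Arc::clone(&counter);
                    thread::spawn(move || {
                        counter.fetch_add(1, Ordering::SeqCst);
                    })
                })
                .collect();

            for handle in handles {
                handle.join().unwrap();
            }

            // Loom checks this assertion under every explored interleaving.
            assert_eq!(counter.load(Ordering::SeqCst), 2);
        });
    }
}
```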
If you run code suggested by any model:
- Run inside containers with no network by default (MultiPL‑E’s containerized approach is aligned with this).
- Use an allowlist of commands (`cargo`, `rustc`, `clippy`, `miri`, `loom`, `criterion`) and block filesystem writes outside the repo workspace; a minimal allowlist check is sketched after this list.
- Require signed/confirmed tool actions for destructive ops (file deletion, publishing, secrets).
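A minimal sketch of the allowlist check mentioned above. The set of permitted programs, the workspace path, and the rejection behavior are illustrative policy choices, not a complete sandbox; network isolation and filesystem confinement still belong to the container layer:

```rust
use std::path::Path;
use std::process::Command;

/// Only these programs may be spawned by the agent loop; everything else is rejected.
/// Clippy and Miri are invoked as cargo subcommands, so `cargo` covers them;
/// Loom and criterion are libraries exercised through cargo as well.
const ALLOWED_PROGRAMS: &[&str] = &["cargo", "rustc"];

/// Spawn `program` inside `workspace` only if it is on the allowlist.
fn run_allowed(
    workspace: &Path,
    program: &str,
    args: &[&str],
) -> Result<std::process::ExitStatus, String> {
    if !ALLOWED_PROGRAMS.contains(&program) {
        return Err(format!("command '{}' is not on the allowlist", program));
    }
    Command::new(program)
        .current_dir(workspace)
        .args(args)
        .status()
        .map_err(|e| format!("failed to run '{}': {}", program, e))
}

fn main() {
    let workspace = Path::new("./agent-workspace");
    match run_allowed(workspace, "cargo", &["clippy", "--", "-D", "warnings"]) {
        Ok(status) => println!("exit status: {}", status),
        Err(reason) => eprintln!("rejected: {}", reason),
    }
}
```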
A practical “systems engineering + Rust” self-host stack:
- vLLM as the default inference server, because it offers:
- OpenAI-compatible server mode for easy client integrations.
- Named function calling (tool calling) support.
- Structured outputs support (JSON schema / constraints) for more reliable tool calls (a client-side request sketch appears after the quantization notes below).
- Use model-native stacks when the official card recommends them (e.g., LMDeploy for InternVL; SGLang for GLM video tasks).
- The Transformers documentation notes support for AWQ and GPTQ and 8-bit/4-bit quantization via bitsandbytes.
- The original AWQ paper provides the method basis for activation-aware weight quantization (useful when selecting AWQ toolchains).
- Some models ship native quantization:
- Kimi‑K2.5 explicitly reports native INT4 quantization.
- Llama 4 provides FP8 weights and mentions int4 on-the-fly quantization.
- Some model families ship official quantized variants:
- Qwen2.5‑VL provides an AWQ model variant on HF.
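To illustrate the serving-layer pattern recommended above (OpenAI-compatible endpoint plus named function calling), here is a minimal client sketch. The endpoint URL, model name, and tool schema are placeholders, and the `reqwest` and `serde_json` dependencies (with `reqwest`'s `blocking` and `json` features) are assumptions of this example:

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder endpoint for a locally served model (e.g., started via `vllm serve ...`).
    let url = "http://localhost:8000/v1/chat/completions";

    // One named tool, described with a JSON schema, in the OpenAI-compatible format.
    let body = json!({
        "model": "local-model",
        "messages": [
            {"role": "user", "content": "Run the test suite for the payments crate and summarize failures."}
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_cargo_test",
                "description": "Run `cargo test` in a named workspace member and return the output.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "package": {"type": "string", "description": "Cargo package name"}
                    },
                    "required": ["package"]
                }
            }
        }],
        "tool_choice": "auto"
    });

    let response = reqwest::blocking::Client::new()
        .post(url)
        .json(&body)
        .send()?
        .text()?;

    // Any `tool_calls` in the response should be validated and sandboxed before execution.
    println!("{}", response);
    Ok(())
}
```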
To compare “GPU-only purchase cost” across recommended configs, this report uses representative 2026-era unit price ranges from publicly available pricing guides and industry snapshots. These prices vary widely by region, vendor, and availability; treat the chart as an order-of-magnitude planning tool, not a quote.
Unit price assumptions (approx midpoints):
- H100 80GB: $25k–$40k → assume $30k.
- A100 80GB: $7k–$15k → assume $11k.
- L40S 48GB: $7.5k–$10k → assume $8.75k.
- L20 48GB: pricing snapshots around ~$4k → assume $4.05k.
- RTX 4090 24GB: price trackers show ~$2.7k retail (varies).
- Kimi‑K2.5: 8×H100 (TP8 commonly recommended; cluster-class).
- Qwen3.5‑27B: 1×L40S (realistic single-GPU serving baseline; exact VRAM not published).
- Gemma 4 31B: 1×A100 80GB (conservative; VRAM not specified in card).
- Llama 4 Maverick: 8×H100 (FP8 fits on “single H100 DGX host”).
- Qwen3‑VL‑235B: 8×H100 (flagship VLM at this scale typically multi-GPU; exact not specified in card).
- Qwen2.5‑VL‑72B: 2×A100 80GB (conservative for 72B-class; exact not specified).
- InternVL3.5‑30B‑A3B: 1×A100 80GB (explicitly stated deployable up to 30B on single A100).
- Pixtral‑12B: 1×RTX 4090 (12B-class often fits with quantization; official VRAM not specified).
- DeepSeek‑VL2‑small: 1×A100 80GB (repo suggests 80GB for small+; incremental prefilling can reduce).
- GLM‑4.6V‑Flash: 1×L20 48GB (local-friendly 9B class; exact VRAM not stated).
timeline
title Model release timeline relevant to this ranking
2024-09 : Pixtral-12B-2409 (Pixtral 12B series identifier)
2024-12-13 : DeepSeek-VL2 family released (GitHub release timeline)
2025-01-26 : Qwen2.5-VL announced (blog)
2025-04-05 : Llama 4 Scout/Maverick released
2025-08-25 : InternVL3.5 paper published (series release window)
2025-09-22 : Qwen3-VL generation (series blog window)
2025-12-08 : GLM-4.6V-Flash model card published window
2026-02-02 : Kimi-K2.5 paper/model release window
2026-04 : Gemma 4 released window (model card)
Release-date citations: Pixtral model card identifier and details; DeepSeek release timeline; Qwen2.5‑VL blog date; Llama 4 release date; InternVL3.5 paper date; GLM‑4.6V‑Flash card date window; Kimi paper/model date window; Gemma 4 model card. (Qwen3‑VL's precise date is treated as a series window because official numeric dates in the cited text are limited.)
---
config: { xyChart: { width: 2100 } }
---
xychart-beta
title "Estimated GPU-only cost by recommended deployment config (USD, midpoint assumptions)"
x-axis ["Kimi-K2.5 (8xH100)","Qwen3.5-27B (1xL40S)","Gemma4-31B (1xA100)","Llama4-Maverick (8xH100)","Qwen3-VL-235B (8xH100)","Qwen2.5-VL-72B (2xA100)","InternVL3.5-30B (1xA100)","Pixtral-12B (1x4090)","DeepSeek-VL2-small (1xA100)","GLM-4.6V-Flash (1xL20)"]
y-axis "USD (est.)" 0 --> 260000
bar [240000,8750,11000,240000,240000,22000,11000,2755,11000,4050]
Price assumption citations: H100 range; A100 range; L40S range; L20 snapshot; RTX 4090 snapshot.
This report excludes closed-source or not-reliably-self-hostable models even if they may be strong for Rust/SWE, because the request is explicitly for self-hostable models.
- OpenAI GPT‑4o / GPT‑4o‑mini (and similar proprietary GPT‑4-class models): excluded because the reliably self-hostable open-weight offering from OpenAI in the cited sources is gpt‑oss, whose weights are downloadable; GPT‑4o-class weights are not provided as open-weight releases in these official open-model announcements.
- Anthropic Claude (including “Claude X”): excluded because Claude is treated as a hosted subscription/API product in the cited coverage; no official open-weight release is evidenced here. (Name “Claude X” is unverified in official sources used in this report.)
- Mistral Large API-only variants: excluded because the launch communication emphasizes being “generally available through an API” rather than offering open downloadable weights, so it does not meet the strict self-hosting requirement.
- Z.ai GLM‑5V‑Turbo: excluded (currently ambiguous/likely API-first): official developer docs describe GLM‑5V‑Turbo as an API offering; credible reporting states weights are not announced and it is API-only “for now.” Some secondary sources claim HF weights exist, but because sources conflict and no official HF model card was validated in this research set, it is treated as not reliably self-hostable for this ranking.
- “GLM5.1” and “Qwen 3.6”: excluded as unverified labels: this research set found strong evidence for Qwen3.5 (open weight) and multiple GLM‑4.x / GLM‑4.6V releases, but did not validate “Qwen 3.6” or “GLM5.1” as official self-hostable model releases; treat these names as unspecified/unverified.
- Language: English (en‑US).
- Current date reference: 2026‑04‑10, timezone Asia/Jakarta.
- Budget constraint: unspecified.
- Target latency SLO: unspecified.
- When a requested detail is not present in cited primary sources, it is marked unspecified rather than inferred (especially for exact VRAM, latency, and Rust-only benchmark scores).