This report ranks ten open‑weight, self‑hostable multimodal models that are especially strong for (a) systems engineering tasks (design reviews, debugging, log/trace reasoning, incident triage, reading diagrams/docs/UI screenshots, tool-using agents) and (b) Rust programming (ownership/lifetimes, unsafe correctness, concurrency, performance engineering, refactors with tests). The ranking emphasizes coding + agentic SWE benchmarks (notably SWE‑bench Verified, LiveCodeBench v6, and HumanEval) where officially reported, and treats Rust‑specific scores as mostly unspecified because few vendors publish them directly. In the absence of official Rust scores, this report recommends a local Rust evaluation harness based on MultiPL‑E (Rust translations of HumanEval/MBPP) combined with cargo test, Clippy, Miri, Loom, and criterion.
Top picks by deployment tier:
- Best overall “frontier open‑weight agentic coder” (huge, cluster‑class): Kimi K2.5 from Moonshot AI — standout engineering/agentic scores (e.g., SWE‑bench Verified 76.8, LiveCodeBench v6 85.0) plus native multimodal + tool workflows; video chat is limited locally (officially “experimental” and API-only).
- Best cost/performance “SWE agent” you can realistically self‑host on a single high‑VRAM GPU: Qwen3.5‑27B (Apache‑2.0) from Alibaba Cloud — strong across SWE‑bench Verified (72.4), LiveCodeBench v6 (80.7), and agent benchmarks, while remaining far cheaper to run than 200B–1T class MoE models.
- Best “multimodal + coding + audio on device” (edge‑friendly small variants): Gemma 4 from Google — a family spanning small on-device “Any‑to‑Any” models with native audio and bigger vision+text models, with LiveCodeBench v6 up to 80.0 and native function calling.
- Best “permissive license + solid code + strong doc/diagram VLM” (mid‑scale): Pixtral‑12B‑2409 from Mistral AI — Apache‑2.0, long context (128k), and published HumanEval pass@1 72.0 in its model card.
Key caveat about Rust: Vendors almost never publish “Rust‑only” leaderboards. As a result, Rust capability must be validated locally using execution‑based tests (compile + unit tests) and concurrency/UB tooling (Loom/Miri). This report provides a concrete evaluation suite later.
A model is included in the ranked top ten only if it satisfies:
- Open weights available (downloadable) and self-hostable (not strictly API-only). Evidence comes primarily from official model cards and/or official repositories.
- Multimodal capability (at least text+image, optionally video/audio).
- Documented or inferable suitability for programming + systems engineering (benchmarks, tool calling, agent framing, or strong community usage in dev workflows).
Each model is scored on a 0–10 scale per dimension, then weighted:
- Programming & systems engineering capability (35%) Signals: SWE‑bench Verified, LiveCodeBench v6, TerminalBench, HumanEval, MBPP, codeforces-style metrics, plus evidence of agentic coding frameworks.
- Multimodal systems usefulness (20%) Signals: doc/OCR/layout/diagram competence, GUI/desktop agent claims, long-video understanding, structured extraction support.
- Self-hosting efficiency & scalability (25%) Signals: official deployment guidance (tensor parallel, recommended engines), context length practicality, quantization availability, and feasible hardware footprint. If latency is not specified by sources, it is marked unspecified.
- Tool-use readiness (10%) Signals: explicit function calling/tool calling support, structured outputs, OpenAI-compatible serving patterns.
- License & ecosystem (10%) Signals: permissive license, documented usage, integration with mainstream serving stacks (Transformers, vLLM, SGLang, LMDeploy), scope of community fine-tunes/quantizations.
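To make the rubric concrete, the sketch below shows how the weighted composite is computed (weights and the 0–10 scale come from the list above; the per-dimension scores in `main` are purely hypothetical):

```rust
/// Weights from the rubric above (must sum to 1.0).
const WEIGHTS: [(&str, f64); 5] = [
    ("programming_systems", 0.35),
    ("multimodal_usefulness", 0.20),
    ("self_hosting", 0.25),
    ("tool_use", 0.10),
    ("license_ecosystem", 0.10),
];

/// Combine 0-10 per-dimension scores into a weighted composite.
/// `scores` must be in the same order as `WEIGHTS`.
fn composite(scores: [f64; 5]) -> f64 {
    WEIGHTS
        .iter()
        .zip(scores.iter())
        .map(|((_, weight), score)| weight * score)
        .sum()
}

fn main() {
    // Hypothetical model: 9, 7, 6, 8, 8 -> 0.35*9 + 0.20*7 + 0.25*6 + 0.10*8 + 0.10*8 = 7.65
    let example = [9.0, 7.0, 6.0, 8.0, 8.0];
    println!("composite = {:.2}", composite(example));
}
```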
- SWE‑bench evaluates repository-level issue resolution (Python repos), but it is still one of the best proxies for “real SWE work”: multi-file edits, tests, tool use, and long-context reasoning.
- LiveCodeBench is designed to reduce contamination and covers broader coding abilities beyond code generation (self-repair and execution/test prediction).
- HumanEval / MBPP are unit-test driven code generation benchmarks (mostly Python), useful but older; Rust relevance is best obtained via translation frameworks.
- MultiPL‑E provides HumanEval/MBPP problems translated into many languages, and its docs explicitly show how to evaluate Rust (`--lang rs`) with containerized execution. This is the preferred Rust-oriented benchmark base in this report.
The ranking below targets potency for systems engineering + Rust under the rubric above. For every model, Rust-specific benchmark scores are marked “unspecified” unless an official source reports them (rare); instead, recommended Rust tests are provided later.
- Kimi‑K2.5 (MoE, native INT4, agentic coding leader)
- Qwen3.5‑27B (excellent SWE/agent scores at a realistic self-host scale)
- Gemma 4 (31B / 26B‑A4B / E4B/E2B) (strong coding + multimodal; audio on small variants)
- Llama 4 Maverick (strong multimodality + big MoE capacity; license constraints)
- Qwen3‑VL‑235B‑A22B (flagship VLM; very large)
- Qwen2.5‑VL‑72B‑Instruct (strong doc/video/agent VLM; code benchmarks unspecified in card)
- InternVL3.5‑30B‑A3B (strong multimodal/agentic suite; clear deployment guidance)
- Pixtral‑12B‑2409 (Apache‑2.0; strong HumanEval; long context; good “daily driver” VLM)
- DeepSeek‑VL2 (small/large variants) (MoE VLM; explicit VRAM guidance; shorter context)
- GLM‑4.6V‑Flash (local-friendly VLM with native multimodal function calling; text-only weaker per authors)
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/moonshotai/Kimi-K2.5
GitHub repo: https://github.com/MoonshotAI/Kimi-K2.5
vLLM recipe: https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html
- License: Modified MIT.
- Modalities: Text + image (native multimodality). “Chat with video content” is explicitly noted as experimental and officially API-only at the time of the model card; treat local video as unsupported/unspecified.
- Size & architecture: MoE, 1T total params, 32B activated, 384 experts, 8 selected experts/token, 256K context, vision encoder MoonViT (400M).
- Training data notes: Continual pretraining on ~15T mixed visual + text tokens (composition otherwise unspecified).
- Instruction-tuning / fine-tuning: The HF page shows a large ecosystem of fine-tunes and quantizations; instruction-tuned base is provided.
- Programming/SWE benchmarks (officially reported):
- SWE‑bench Verified 76.8
- TerminalBench 2.0 50.8
- LiveCodeBench v6 85.0 (Many other evals are listed in the official table; Rust-specific: unspecified.)
- Community / third‑party evaluations: Strong adoption signals via many Spaces and downstream derivatives; treat as “high community activity” rather than a quality guarantee.
- Self-hosting requirements:
- Official guidance emphasizes vLLM/SGLang/KTransformers.
- Tensor-parallel multi-GPU is effectively required for full model; a community deployment note recommends TP=8 and highlights native INT4.
- Latency: unspecified (depends heavily on GPU class and concurrency).
- Quantization & optimization: Native INT4 quantization is part of the official release narrative.
- Tool / plugin support: The model card includes “Interleaved Thinking and Multi‑Step Tool Call” and a “Coding Agent Framework.” For tool calling infrastructure, use vLLM’s tool-calling + structured outputs when serving.
- Security/privacy considerations: Fully local inference is possible (weights self-hosted), but tool execution must be sandboxed (see evaluation section).
- Recommended deployment config (cost/perf):
- Cluster: vLLM or SGLang with multi-GPU tensor parallel (TP8 recommended in community guidance).
- Due to model scale, prioritize this model if you truly need frontier-level agentic coding in an on-prem cluster; otherwise Qwen3.5/Gemma 4 may deliver much better cost/perf.
- Best-in-class reported open-weight agentic coding metrics (SWE/LCB/TerminalBench) among listed models.
- Native multimodality + explicit agent/tool workflows in official docs.
- Very large; multi-GPU serving is effectively mandatory.
- Video chat is not a reliable local feature per the model card (API-only experimental).
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/Qwen/Qwen3.5-27B
- License: Apache‑2.0.
- Modalities: Image‑Text‑to‑Text (vision encoder + causal LM). Audio/video: unspecified/not indicated in the model card.
- Size & architecture: 27B causal LM with vision encoder; long context 262,144 (extensible to ~1,010,000). The model card states a hybrid layout using Gated DeltaNet + Gated Attention blocks and provides head/layer dimensions.
- Training data notes: The card describes “multimodal learning” and large-scale RL but does not fully disclose dataset composition; treat as mixed/unspecified.
- Instruction-tuning / fine-tuning: Post-trained model is provided; compatibility noted for Transformers, vLLM, SGLang, KTransformers.
- Programming/SWE benchmarks (official):
- SWE‑bench Verified 72.4
- LiveCodeBench v6 80.7
- Terminal Bench 2 41.6
- CodeForces rating 1899
- Rust-specific: unspecified.
- Community evaluation: Strong HF activity and explicit agentic usage sections (“Qwen‑Agent”, “Qwen Code”) indicate ecosystem orientation.
- Self-hosting requirements: The model card does not provide a single “VRAM minimum.” Treat latency as unspecified; recommended engines include vLLM/SGLang.
- Quantization & optimization: Not explicitly enumerated in the excerpted lines, but the model is designed for high throughput; use standard HF/serving quantization stacks (AWQ/GPTQ/bitsandbytes) where available. Transformers documents quantization support broadly (AWQ/GPTQ/bnb).
- Tool/plugin support: The model’s benchmark table includes explicit “agent” and “search agent” evaluations (BFCL, TAU2-Bench, BrowseComp), aligning with tool use; serve with vLLM tool calling + structured outputs.
- Security/privacy: Well-suited to on-prem; still requires strict sandboxing for code execution.
Recommended deployment config (cost/perf):
- Single high‑VRAM GPU (best practical default): vLLM OpenAI-compatible server; enable structured outputs for tools.
- If you need bigger concurrency/long context, scale horizontally with multiple replicas rather than pushing maximal context for every request.
- Near-frontier open-weight coding/agent performance at a dramatically lower serving cost than 200B+ MoE VLMs.
- Very long context (native 256k).
- No official Rust bench numbers; must validate with MultiPL‑E + local harness.
Official sources (repos/docs):
Gemma 4 model card: https://ai.google.dev/gemma/docs/core/model_card_4
Hugging Face blog overview: https://huggingface.co/blog/gemma4
- License: The model card describes the family as openly released; specific license text is not reproduced in this excerpt, so treat license as unspecified here and follow the official model card terms. (Benchmarks and architecture are taken from the official model card.)
- Modalities:
- All models: text + image.
- E2B/E4B: audio (ASR + speech translation), and video supported as frame sequences.
- Size & architecture: The model card provides detailed sizes:
- Dense: E2B (2.3B effective; 5.1B w/ embeddings), E4B (4.5B effective; 8B w/ embeddings), 31B dense (30.7B)
- MoE: 26B A4B (25.2B total, 3.8B active) with 8 active / 128 total experts (+ shared).
- Context: 128k (E2B/E4B) and 256k (26B/31B).
- Vision encoder parameters: ~150M (small) or ~550M (large). Audio encoder parameters: ~300M (small models only).
- Training data notes: The model card states training includes web documents, code, images, audio with cutoff January 2025, plus filtering notes (CSAM, sensitive data filtering).
- Instruction-tuning / fine-tuning: Instruction-tuned variants are benchmarked; examples show Transformers integration and “thinking mode.”
- Programming/SWE benchmarks (official):
- LiveCodeBench v6: 80.0 (31B), 77.1 (26B A4B), 52.0 (E4B), 44.0 (E2B)
- Codeforces ELO: 2150 (31B), 1718 (26B A4B), etc. (HumanEval/MBPP/SWE-bench not listed in the excerpted Gemma 4 tables; mark unspecified.)
- Self-hosting requirements: The model card positions smaller models for on-device use and larger for consumer GPUs/workstations; exact VRAM/latency not specified.
- Quantization & optimization: The HF blog credits ecosystem integrations (llama.cpp, mistral.rs, etc.), but exact quantization formats are not specified in these excerpts; treat as unspecified and rely on downstream quantization projects where available.
- Tool/plugin support: The model card explicitly notes native function calling support. vLLM also supports tool calling and structured outputs at the serving layer.
- Security/privacy: Audio input raises additional privacy concerns (recordings); keep transcripts local, disable network during tool execution, and store only minimal logs.
Recommended deployment config (cost/perf):
- For “real work” coding + multimodal: prioritize Gemma 4 26B A4B (MoE active 3.8B) or 31B if you can afford the VRAM.
- For speech workflows: use E4B/E2B on-device; keep audio inputs under the stated maximum length (30s).
- Broad modality coverage (including audio on small variants) and strong coding metrics (LiveCodeBench v6 near 80).
- Clear architectural transparency in the official model card (active vs total params, encoder sizes).
- SWE‑bench/HumanEval/MBPP not listed in the cited official tables; Rust performance requires local measurement.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct
Meta release blog: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
- License: Llama 4 Community License (custom).
- Modalities: Text + image input; text + code output. Video/audio: unspecified in the model card.
- Size & architecture: MoE with 17B activated params, 400B total, 128 experts; context length 1M; ~22T training tokens; knowledge cutoff Aug 2024.
- Training data notes: Mix of public, licensed, and Meta product/service data (including publicly shared posts/interactions), as stated in the model card.
- Instruction-tuning / fine-tuning: Instruct-tuned model provided; future versions may be released.
- Programming/SWE benchmarks (official):
- LiveCodeBench (10/01/2024–02/01/2025) pass@1: 43.4 (Maverick instruct)
- MBPP pass@1 (pretrained): 77.6
- Rust-specific: unspecified.
- Self-hosting requirements:
- Quantization section states Maverick is released as BF16 and FP8; FP8 weights fit on a “single H100 DGX host” (multi-GPU system).
- Latency unspecified.
- Quantization & optimization: FP8 weights provided; on-the-fly int4 quantization code is referenced in the model card.
- Tool/plugin support: Use serving-layer tool calling (vLLM) even if the model isn’t explicitly “function calling tuned.”
- Security/privacy: License restrictions may matter for commercial deployment; review acceptable use and compliance before production.
Recommended deployment config (cost/perf):
- When you need a capable multimodal MoE and can operate an 8-GPU node (e.g., DGX-class).
- For most teams, Qwen3.5 or Gemma 4 is likely better cost/perf unless you specifically want Llama ecosystem tooling and 1M context.
- Very large effective capacity with long context (1M).
- Officially published core coding benchmarks (LiveCodeBench, MBPP).
- License is not permissive OSS; compliance overhead may be non-trivial.
- Rust scores not published; validate locally.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
Qwen3 technical report (family context): https://arxiv.org/abs/2505.09388
Qwen3-VL GitHub: https://github.com/QwenLM/Qwen3-VL
- License: Apache‑2.0.
- Modalities: Text + image; the model card describes advanced video reasoning and long context, but the HF page itself is “Image‑Text‑to‑Text.” Treat video input as supported in the broader Qwen3‑VL stack but confirm in chosen runtime; audio is unspecified.
- Size & architecture: MoE; HF metadata shows ~236B params; A22B implies ~22B active (active count is not explicitly stated in this model card excerpt, so treat “active params” as unspecified here). Context: 256K native, “expandable to 1M.”
- Training data notes: Not fully disclosed in the model card excerpt; treat as mixed/unspecified.
- Instruction-tuning / fine-tuning: Instruct checkpoint provided; Transformers usage requires recent versions.
- Benchmarks:
- The official HF “Model Performance” section is primarily graphical (tables embedded in images), so many numeric values are not extractable from text here; treat detailed numbers as unspecified unless you consult the images directly.
- For family-level coding evidence, Qwen3 technical report reports strong coding results for the flagship text model (e.g., LiveCodeBench v5). This is not the same as Qwen3‑VL; do not substitute those numbers for the VLM.
- Rust-specific: unspecified.
- Self-hosting requirements: Large MoE; expect multi‑GPU tensor parallel for high throughput; exact VRAM/latency unspecified.
- Quantization & optimization: FlashAttention recommended in the model card; quantized variants exist on the Hub (not enumerated here).
- Tool/plugin support: Model card emphasizes “agent interaction capabilities” and “visual agent”; serve with tool calling infrastructure.
Recommended deployment config (cost/perf):
- Choose this if you need frontier-scale multimodal (especially GUI/video/dynamics) and can operate multi‑GPU infrastructure.
- If you primarily care about coding agent performance per dollar, Qwen3.5‑27B is typically the better first choice.
- Flagship VLM in the Qwen series with explicit “visual coding” and agent framing.
- Apache‑2.0 license.
- Many official benchmark numbers are presented as images; text-extractable metrics are incomplete here.
- Extremely large; likely expensive to run.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct
Qwen2.5-VL technical report: https://arxiv.org/abs/2502.13923
Qwen2.5-VL blog: https://qwenlm.github.io/blog/qwen2.5-vl/
- License: “qwen” (custom).
- Modalities: Image‑Text‑to‑Text; includes extensive image, video, and agent benchmarks in the official card. Audio: unspecified.
- Size & architecture: Dense 72B; model card describes dynamic resolution training, window attention in ViT, and long video processing. Vision encoder details: partially described (ViT with window attention), but exact vision parameter count is unspecified here.
- Training data notes: Technical report abstract describes multimodal training and long-video design; detailed corpus composition is not provided here → treat as mixed/unspecified.
- Programming/SWE benchmarks: The model card emphasizes multimodal/agent benchmarks and does not provide SWE‑bench/HumanEval/LiveCodeBench numbers in text; mark these as unspecified for this VLM card.
- Self-hosting requirements:
- Requires very recent Transformers builds; flash_attention_2 recommended for speed/memory.
- Latency unspecified.
- Quantization & optimization: An official AWQ variant exists for this model series (example: 72B‑Instruct‑AWQ) and the model card recommends FlashAttention2; quantization behavior deltas are not fully specified here.
- Tool/plugin support: The model is positioned as a “visual agent” for computer/phone use; serve with tool calling and structured outputs for robust agent loops.
Recommended deployment config (cost/perf):
- If you need document+diagram+video understanding and UI agent behaviors, this is a strong open-weight option.
- For Rust coding quality, pair with a local compilation/testing loop; benchmarks in the model card are not Rust-oriented.
- Strong published doc/OCR/video/agent results in the official model card.
- Tooling support (`qwen-vl-utils`, FlashAttention recommendation) eases production use.
- Coding benchmarks relevant to Rust (HumanEval/LiveCodeBench/SWE) are not given in the model card text.
Official sources (repos/docs):
Hugging Face model card (family): https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B
Hugging Face model card (30B-A3B-HF format): https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B-HF
vLLM recipe (InternVL3.5): https://docs.vllm.ai/projects/recipes/en/latest/InternVL/InternVL3_5.html
Paper: https://arxiv.org/abs/2508.18265
- License: Apache‑2.0 (per model card).
- Modalities: Image‑Text‑to‑Text; official model card includes evaluation categories spanning OCR/doc, video understanding, GUI tasks, and grounded/spatial reasoning.
- Size & architecture: The family card lists per-model vision+language parameters; for 30B‑A3B, the total is ~30.8B, with active parameters implied by the A3B label (exact routing details for this variant are not shown in the excerpt).
- Training data notes: The paper emphasizes Cascade RL and multimodal training; HF model card does not fully disclose dataset provenance → mixed/unspecified.
- Instruction-tuning / fine-tuning: The model card references multiple fine-tuning toolchains and provides LMDeploy deployment examples and OpenAI-style API compatibility.
- Programming/SWE benchmarks: The model card is heavily multimodal/agent oriented; Rust/code metrics like SWE‑bench/HumanEval/LiveCodeBench are not present in the excerpted text → mark unspecified.
- Self-hosting requirements (official guidance):
- “Models up to 30B can be deployed on a single A100 GPU,” 38B needs 2×A100, and 235B needs 8×A100 (family guidance).
- LMDeploy provides explicit tensor-parallel notes (e.g., tp=8 for 241B-A28B).
- Latency unspecified.
- Quantization & optimization: Examples include 8-bit loading; LMDeploy is explicitly presented as a compression/deployment toolkit.
- Tool/plugin support: Strong agent framing; OpenAI-compatible REST serving in LMDeploy supports building tool-using agents.
Recommended deployment config (cost/perf):
- For a balanced “multimodal systems engineer” assistant at manageable infrastructure cost, the 30B‑A3B class is a practical sweet spot (single A100-class GPU per official guidance).
- For Rust coding, run it in an agent loop with compile/test tools.
- Exceptionally complete multimodal evaluation and deployment documentation (LMDeploy service, tp guidance).
- Clear family parameter breakdown (vision vs language).
- No official Rust or code benchmark scores in the model card excerpt; must evaluate locally.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/mistralai/Pixtral-12B-2409
- License: Apache‑2.0.
- Modalities: Natively multimodal (image + text); audio/video not indicated.
- Size & architecture: 12B text decoder + 400M vision encoder, sequence length 128k.
- Training data notes: Trained with interleaved image/text data; detailed dataset composition unspecified.
- Programming/SWE benchmarks (official): HumanEval pass@1 72.0 is reported in the model card’s “Text Benchmarks.” Rust-specific: unspecified.
- Self-hosting requirements: vLLM is recommended; official examples include `vllm serve` and advice to lower model limits on low‑VRAM GPUs. Exact VRAM/latency not specified.
- Quantization & optimization: Not explicitly enumerated in the cited lines; vLLM is recommended for production inference.
- Tool/plugin support: Use vLLM tool calling + structured outputs for agent workflows.
- Security/privacy: The model card explicitly notes no built-in moderation mechanisms; treat as requiring additional safeguards in production.
Recommended deployment config (cost/perf):
- One of the best “daily driver” options if you want Apache‑2.0 + multimodal + strong coding baseline, without huge GPU fleets.
- Apache‑2.0, long context, and published HumanEval 72.0.
- Straightforward vLLM serving guidance in the model card.
- No official SWE‑bench / Rust-specific metrics; needs local agent loop with compilation/tests for Rust.
Official sources (repos/docs):
GitHub repo + paper PDF: https://github.com/deepseek-ai/deepseek-vl2
- License: Code is MIT; model use is subject to DeepSeek Model License; stated as commercial-use supported.
- Modalities: Vision-language (images + text). Audio/video unspecified in the repo excerpt.
- Size & architecture: MoE VLM family with activated parameters: 1.0B (tiny), 2.8B (small), 4.5B (vl2). The repo also mentions total MoE sizes (e.g., vl2‑tiny “3.37B‑MoE total”; vl2‑small “16.1B‑MoE total”; vl2 “27.5B‑MoE total”). Context/sequence length is 4096.
- Training data notes: Not detailed in the excerpt; treat as mixed/unspecified.
- Programming/SWE benchmarks: Not reported in the excerpt; Rust-specific: unspecified.
- Self-hosting requirements (explicit guidance):
- Repo notes you may need 80GB GPU memory for deepseek‑vl2‑small and larger, with incremental prefilling enabling vl2‑small within ~40GB at slower speed.
- Production serving is recommended via optimized stacks like vLLM/SGLang/LMDeploy (explicitly named).
- Quantization & optimization: Incremental prefilling is highlighted as a memory saver; other quantization formats not specified in excerpt.
- Tool/plugin support: Not explicitly “function calling tuned” in excerpt; implement tool calling at serving layer.
Recommended deployment config (cost/perf):
- If you want a smaller MoE VLM with explicit memory-saving tactics and can accept shorter context (4k), DeepSeek‑VL2‑small can be used with incremental prefilling.
- Clear, practical deployment guidance (80GB recommendation, incremental prefilling strategy).
- Smaller activated-parameter MoE variants can be efficient per token.
- Short context (4096) limits “big repo” and log-heavy systems engineering tasks unless you add retrieval/chunking.
- No official code/Rust benchmarks in excerpt; requires local evaluation.
Official sources (repos/docs):
Hugging Face model card: https://huggingface.co/zai-org/GLM-4.6V-Flash
GitHub repository: https://github.com/zai-org/GLM-V
Blog: https://z.ai/blog/glm-4.6v
Paper: https://huggingface.co/papers/2507.01006
- License: MIT.
- Modalities: Image‑Text‑to‑Text; model card emphasizes multimodal document understanding and also references video tasks (via SGLang for “video tasks”).
- Size & architecture: “Flash” is labeled 9B in the model card narrative; HF metadata shows ~10B params. Context length trained to 128K.
- Training data notes: Not detailed in the model card excerpt; treat as mixed/unspecified.
- Programming/SWE benchmarks: Benchmarks are primarily presented as an image in the model card; numeric extraction is limited in text here → treat most as unspecified.
- Self-hosting requirements: Model card provides installation guidance for vLLM/SGLang and notes remaining issues; exact VRAM/latency unspecified.
- Quantization & optimization: The ecosystem contains GGUF conversions and many quantizations on HF; however, details are not in the official model card excerpt.
- Tool/plugin support: Native multimodal function calling is a core stated feature, meant to “close the loop” from perception to execution.
- Security/privacy: The model card explicitly acknowledges limitations and encourages issue reporting; for production, you must enforce sandboxing and authorization gates for any tool execution.
Recommended deployment config (cost/perf):
- A good “local multimodal agent scaffold” if you want MIT license + explicit multimodal function calling, and can tolerate weaker pure-text QA per authors.
- Explicit native multimodal function calling + agent loop orientation.
- Local-friendly “Flash” variant intended for low latency.
- Authors note pure text QA still needs improvement; treat it as a VLM/agent component rather than a top pure coder.
Key: “—” means unspecified in cited official sources (do not assume).
| Rank | Model | License | Modalities (in/out) | Params & architecture | Context | Key coding/SWE evidence | Self-host notes (official) |
|---|---|---|---|---|---|---|---|
| 1 | Kimi‑K2.5 | Modified MIT | in: text+image; out: text | MoE, 1T total / 32B active; MoonViT 400M | 256K | SWE Verified 76.8; LiveCodeBench v6 85.0; TerminalBench2 50.8 | vLLM/SGLang/KTransformers recommended; native INT4; TP8 commonly recommended |
| 2 | Qwen3.5‑27B | Apache‑2.0 | in: image+text; out: text | 27B w/ vision encoder; Gated DeltaNet/Gated Attention | 262K | SWE Verified 72.4; LiveCodeBench v6 80.7; TerminalBench2 41.6 | Compatible w/ Transformers/vLLM/SGLang/KTransformers |
| 3 | Gemma 4 (31B/26B‑A4B/E4B/E2B) | — | in: text+image (+audio for E2B/E4B); out: text | Dense + MoE; 26B A4B is 25.2B total/3.8B active; native function calling | 128K–256K | LiveCodeBench v6 up to 80.0; Codeforces ELO up to 2150 | Official best practices; audio max length 30s; video as frames |
| 4 | Llama 4 Maverick | Llama 4 Community License | in: text+image; out: text+code | MoE 17B active / 400B total; 128 experts | 1M | LiveCodeBench pass@1 43.4 (instruct); MBPP 77.6 (pretrain) | FP8 weights fit on single H100 DGX host (per card) |
| 5 | Qwen3‑VL‑235B‑A22B | Apache‑2.0 | in: image+text; out: text | MoE ~236B; “native 256K, expandable 1M” | 256K | Detailed numbers mostly image‑embedded; avoid guessing | Requires very recent Transformers; large-scale serving implied |
| 6 | Qwen2.5‑VL‑72B | qwen | in: image+video+text; out: text | Dense 72B VLM | — | Code benchmarks not in card text; multimodal/agent evals are extensive | FlashAttention2 recommended; official AWQ variant exists |
| 7 | InternVL3.5‑30B‑A3B | Apache‑2.0 | in: image+text; out: text | Family has vision+LM split; 30B class supported | — | Code/Rust metrics not listed; broad multimodal eval suite | Up to 30B deployable on single A100 (official); LMDeploy OpenAI-style API example |
| 8 | Pixtral‑12B‑2409 | Apache‑2.0 | in: image+text; out: text | 12B + vision encoder 400M | 128K | HumanEval pass@1 72.0 | vLLM (recommended) in card; no moderation built-in |
| 9 | DeepSeek‑VL2 | MIT (code) + model license | in: image+text; out: text | MoE family; activated 1.0B/2.8B/4.5B; total up to 27.5B; seq len 4096 | 4K | Rust/code metrics not listed in repo excerpt | 80GB GPU suggested for small+; incremental prefilling enables ~40GB for small (slower) |
| 10 | GLM‑4.6V‑Flash | MIT | in: image+text; out: text | Flash 9B class; 128K context; tool calling | 128K | Benchmarks mostly image‑embedded; text QA noted weaker | vLLM or SGLang recommended; explicit multimodal function calling |
This section provides a repeatable, local way to measure (a) Rust correctness and (b) systems-engineering reasoning with multimodal inputs, independent of vendor marketing. It combines benchmark translations + real toolchains.
MultiPL‑E translates HumanEval and MBPP into many languages and explicitly documents how to run Rust (`--lang rs`) and execute inside a locked-down container environment.
Minimal high-signal plan:
- Use MultiPL‑E Rust for: algorithmic correctness, idiomatic Rust, lifetimes, and function-level reasoning under unit tests.
- Use LiveCodeBench (if you can run it locally) for “fresh” problems and self-repair behavior; treat it as a supplement.
- Use SWE‑bench‑style tasks conceptually for Rust by creating an internal “Rust‑SWE mini” suite: small Rust repos with issues + tests + expected diffs. SWE‑bench’s design rationale explains why repo-level tasks matter.
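To make the MultiPL‑E-style setup concrete, each Rust task reduces to a function stub plus execution-based unit tests, graded purely by whether `cargo test` passes. The task below is a hypothetical illustration in that format (not taken from MultiPL‑E itself):

```rust
/// Hypothetical task in the MultiPL-E style: the model receives the signature
/// and doc comment, and must produce a body that passes the tests below.
/// Return the indices of the two numbers that add up to `target`, if any.
pub fn two_sum(nums: &[i64], target: i64) -> Option<(usize, usize)> {
    use std::collections::HashMap;
    let mut seen: HashMap<i64, usize> = HashMap::new();
    for (i, &n) in nums.iter().enumerate() {
        if let Some(&j) = seen.get(&(target - n)) {
            return Some((j, i));
        }
        seen.insert(n, i);
    }
    None
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn finds_a_pair() {
        assert_eq!(two_sum(&[2, 7, 11, 15], 9), Some((0, 1)));
    }

    #[test]
    fn returns_none_when_no_pair_exists() {
        assert_eq!(two_sum(&[1, 2, 3], 100), None);
    }
}
```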
Use a standardized agent loop: “model proposes patch → apply → run → report → iterate”.
Include these gates:
- `cargo test` with `RUSTFLAGS="-D warnings"` (treat warnings as failures).
- `cargo clippy -- -D warnings` (lint quality).
- `cargo miri test` for UB detection (unsafe, aliasing, stack borrows).
- Loom tests for concurrency correctness (state-space exploration of atomics/locks).
- `cargo bench` using criterion for performance regression tracking.
Why this matters: Many models can draft plausible Rust, but fewer can converge under strict tool feedback; this tends to separate “chatty codegen” from real engineering ability.
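A minimal sketch of the gate-running side of such a harness, assuming the model's patch has already been applied to a checkout at `./candidate-patch` (the path, gate list, and exit-code policy are illustrative choices, not a fixed interface):

```rust
use std::path::Path;
use std::process::Command;

/// Run one toolchain gate in `repo_dir` and report whether it passed.
fn run_gate(
    repo_dir: &Path,
    name: &str,
    program: &str,
    args: &[&str],
    env: &[(&str, &str)],
) -> bool {
    let mut cmd = Command::new(program);
    cmd.current_dir(repo_dir).args(args);
    for &(key, value) in env {
        cmd.env(key, value);
    }
    let status = cmd.status().expect("failed to spawn gate command");
    println!("[{}] {}", if status.success() { "PASS" } else { "FAIL" }, name);
    status.success()
}

fn main() {
    let repo = Path::new("./candidate-patch");
    // Gates from the list above. Loom tests follow the common `--cfg loom` convention
    // from loom's docs; Clippy, Miri, and criterion run as regular cargo invocations.
    let gates: &[(&str, &str, &[&str], &[(&str, &str)])] = &[
        ("cargo test (warnings as errors)", "cargo", &["test"], &[("RUSTFLAGS", "-D warnings")]),
        ("clippy", "cargo", &["clippy", "--", "-D", "warnings"], &[]),
        ("miri", "cargo", &["miri", "test"], &[]),
        ("loom model tests", "cargo", &["test", "--release"], &[("RUSTFLAGS", "--cfg loom")]),
        ("criterion benchmarks", "cargo", &["bench"], &[]),
    ];
    let mut all_passed = true;
    for &(name, program, args, env) in gates {
        all_passed &= run_gate(repo, name, program, args, env);
    }
    std::process::exit(if all_passed { 0 } else { 1 });
}
```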
Use consistent temperature and seeds across models. For each prompt, require the artifact plus a runnable test.
- Ownership/lifetimes: implement a zero-copy parser returning slices; verify no allocations using a tracking allocator.
- Unsafe correctness: write a small `unsafe` ring buffer and prove safety invariants; validate via Miri.
- Concurrency: implement a bounded MPSC queue; validate with Loom stress schedules (see the Loom sketch after this list).
- Performance: optimize a hot loop (SIMD optional) and benchmark with criterion; require explanation of cache behavior.
- Systems debugging: given stack traces + logs, produce a minimal reproducer and patch, then run tests.
- FFI boundary: wrap a C library safely and write property tests to ensure safe invariants.
- Async runtime: fix a deadlock in Tokio-based code; require a deterministic test.
- Error handling: convert error enums into a `thiserror`-based structure; ensure backtraces are preserved.
- API design: propose a crate-level API and produce docs + examples with doc tests.
- Multimodal: feed a screenshot of a failing CI log (or flamegraph) and ask for diagnosis + patch plan.
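For the concurrency prompt, the sketch below shows the shape of a Loom gate. It is a toy interleaving check (two threads incrementing a counter) rather than a full MPSC queue; the module and test names are arbitrary:

```rust
// Loom explores thread interleavings of code written against its shims.
// It is typically compiled in only when `--cfg loom` is set, e.g.:
//   RUSTFLAGS="--cfg loom" cargo test --release
#[cfg(loom)]
mod loom_tests {
    use loom::sync::atomic::{AtomicUsize, Ordering};
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn concurrent_increments_are_not_lost() {
        loom::model(|| {
            let counter = Arc::new(AtomicUsize::new(0));

            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let counter = Arc::clone(&counter);
                    thread::spawn(move || {
                        counter.fetch_add(1, Ordering::SeqCst);
                    })
                })
                .collect();

            for handle in handles {
                handle.join().unwrap();
            }

            // Loom checks this assertion under every explored interleaving.
            assert_eq!(counter.load(Ordering::SeqCst), 2);
        });
    }
}
```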
If you run code suggested by any model:
- Run inside containers with no network by default (MultiPL‑E’s containerized approach is aligned with this).
- Use an allowlist of commands (`cargo`, `rustc`, `clippy`, `miri`, `loom`, `criterion`) and block filesystem writes outside the repo workspace; a minimal allowlist check is sketched after this list.
- Require signed/confirmed tool actions for destructive ops (file deletion, publishing, secrets).
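A minimal sketch of the allowlist check mentioned above. The set of permitted programs, the workspace path, and the rejection behavior are illustrative policy choices, not a complete sandbox; network isolation and filesystem confinement still belong to the container layer:

```rust
use std::path::Path;
use std::process::Command;

/// Only these programs may be spawned by the agent loop; everything else is rejected.
/// Clippy and Miri are invoked as cargo subcommands, so `cargo` covers them;
/// Loom and criterion are libraries exercised through cargo as well.
const ALLOWED_PROGRAMS: &[&str] = &["cargo", "rustc"];

/// Spawn `program` inside `workspace` only if it is on the allowlist.
fn run_allowed(
    workspace: &Path,
    program: &str,
    args: &[&str],
) -> Result<std::process::ExitStatus, String> {
    if !ALLOWED_PROGRAMS.contains(&program) {
        return Err(format!("command '{}' is not on the allowlist", program));
    }
    Command::new(program)
        .current_dir(workspace)
        .args(args)
        .status()
        .map_err(|e| format!("failed to run '{}': {}", program, e))
}

fn main() {
    let workspace = Path::new("./agent-workspace");
    match run_allowed(workspace, "cargo", &["clippy", "--", "-D", "warnings"]) {
        Ok(status) => println!("exit status: {}", status),
        Err(reason) => eprintln!("rejected: {}", reason),
    }
}
```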
A practical “systems engineering + Rust” self-host stack:
- vLLM as the default inference server, because it offers:
- OpenAI-compatible server mode for easy client integrations.
- Named function calling (tool calling) support.
- Structured outputs support (JSON schema / constraints) for more reliable tool calls (a client-side request sketch appears after the quantization notes below).
- Use model-native stacks when the official card recommends them (e.g., LMDeploy for InternVL; SGLang for GLM video tasks).
- The Transformers documentation notes support for AWQ and GPTQ and 8-bit/4-bit quantization via bitsandbytes.
- The original AWQ paper provides the method basis for activation-aware weight quantization (useful when selecting AWQ toolchains).
- Some models ship native quantization:
- Kimi‑K2.5 explicitly reports native INT4 quantization.
- Llama 4 provides FP8 weights and mentions int4 on-the-fly quantization.
- Some model families ship official quantized variants:
- Qwen2.5‑VL provides an AWQ model variant on HF.
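To illustrate the serving-layer pattern recommended above (OpenAI-compatible endpoint plus named function calling), here is a minimal client sketch. The endpoint URL, model name, and tool schema are placeholders, and the `reqwest` and `serde_json` dependencies (with `reqwest`'s `blocking` and `json` features) are assumptions of this example:

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder endpoint for a locally served model (e.g., started via `vllm serve ...`).
    let url = "http://localhost:8000/v1/chat/completions";

    // One named tool, described with a JSON schema, in the OpenAI-compatible format.
    let body = json!({
        "model": "local-model",
        "messages": [
            {"role": "user", "content": "Run the test suite for the payments crate and summarize failures."}
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_cargo_test",
                "description": "Run `cargo test` in a named workspace member and return the output.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "package": {"type": "string", "description": "Cargo package name"}
                    },
                    "required": ["package"]
                }
            }
        }],
        "tool_choice": "auto"
    });

    let response = reqwest::blocking::Client::new()
        .post(url)
        .json(&body)
        .send()?
        .text()?;

    // Any `tool_calls` in the response should be validated and sandboxed before execution.
    println!("{}", response);
    Ok(())
}
```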
To compare “GPU-only purchase cost” across recommended configs, this report uses representative 2026-era unit price ranges from publicly available pricing guides and industry snapshots. These prices vary widely by region, vendor, and availability; treat the chart as an order-of-magnitude planning tool, not a quote.
Unit price assumptions (approx midpoints):
- H100 80GB: $25k–$40k → assume $30k.
- A100 80GB: $7k–$15k → assume $11k.
- L40S 48GB: $7.5k–$10k → assume $8.75k.
- L20 48GB: pricing snapshots around ~$4k → assume $4.05k.
- RTX 4090 24GB: price trackers show ~$2.7k retail (varies).
- Kimi‑K2.5: 8×H100 (TP8 commonly recommended; cluster-class).
- Qwen3.5‑27B: 1×L40S (realistic single-GPU serving baseline; exact VRAM not published).
- Gemma 4 31B: 1×A100 80GB (conservative; VRAM not specified in card).
- Llama 4 Maverick: 8×H100 (FP8 fits on “single H100 DGX host”).
- Qwen3‑VL‑235B: 8×H100 (flagship VLM at this scale typically multi-GPU; exact not specified in card).
- Qwen2.5‑VL‑72B: 2×A100 80GB (conservative for 72B-class; exact not specified).
- InternVL3.5‑30B‑A3B: 1×A100 80GB (explicitly stated deployable up to 30B on single A100).
- Pixtral‑12B: 1×RTX 4090 (12B-class often fits with quantization; official VRAM not specified).
- DeepSeek‑VL2‑small: 1×A100 80GB (repo suggests 80GB for small+; incremental prefilling can reduce).
- GLM‑4.6V‑Flash: 1×L20 48GB (local-friendly 9B class; exact VRAM not stated).
timeline
title Model release timeline relevant to this ranking
2024-09 : Pixtral-12B-2409 (Pixtral 12B series identifier)
2024-12-13 : DeepSeek-VL2 family released (GitHub release timeline)
2025-01-26 : Qwen2.5-VL announced (blog)
2025-04-05 : Llama 4 Scout/Maverick released
2025-08-25 : InternVL3.5 paper published (series release window)
2025-09-22 : Qwen3-VL generation (series blog window)
2025-12-08 : GLM-4.6V-Flash model card published window
2026-02-02 : Kimi-K2.5 paper/model release window
2026-04 : Gemma 4 released window (model card)
Release-date citations: Pixtral model card identifier and details; DeepSeek release timeline; Qwen2.5‑VL blog date; Llama 4 release date; InternVL3.5 paper date; GLM‑4.6V‑Flash card date window; Kimi paper/model date window; Gemma 4 model card. (Qwen3‑VL's precise date is treated as a series window because official numeric dates in the cited text are limited.)
---
config: { xyChart: { width: 2100 } }
---
xychart-beta
title "Estimated GPU-only cost by recommended deployment config (USD, midpoint assumptions)"
x-axis ["Kimi-K2.5 (8xH100)","Qwen3.5-27B (1xL40S)","Gemma4-31B (1xA100)","Llama4-Maverick (8xH100)","Qwen3-VL-235B (8xH100)","Qwen2.5-VL-72B (2xA100)","InternVL3.5-30B (1xA100)","Pixtral-12B (1x4090)","DeepSeek-VL2-small (1xA100)","GLM-4.6V-Flash (1xL20)"]
y-axis "USD (est.)" 0 --> 260000
bar [240000,8750,11000,240000,240000,22000,11000,2755,11000,4050]
Price assumption citations: H100 range; A100 range; L40S range; L20 snapshot; RTX 4090 snapshot.
This report excludes closed-source or not-reliably-self-hostable models even if they may be strong for Rust/SWE, because the request is explicitly for self-hostable models.
- OpenAI GPT‑4o / GPT‑4o‑mini (and similar proprietary GPT‑4-class models): excluded because the reliably self-hostable open-weight offering from OpenAI in the cited sources is gpt‑oss, whose weights are downloadable; GPT‑4o-class weights are not provided as open-weight releases in these official open-model announcements.
- Anthropic Claude (including “Claude X”): excluded because Claude is treated as a hosted subscription/API product in the cited coverage; no official open-weight release is evidenced here. (Name “Claude X” is unverified in official sources used in this report.)
- Mistral Large API-only variants: excluded because the launch communication emphasizes being “generally available through an API” rather than offering open downloadable weights, so it does not meet the strict self-hosting requirement.
- Z.ai GLM‑5V‑Turbo: excluded (currently ambiguous/likely API-first): official developer docs describe GLM‑5V‑Turbo as an API offering; credible reporting states weights are not announced and it is API-only “for now.” Some secondary sources claim HF weights exist, but because sources conflict and no official HF model card was validated in this research set, it is treated as not reliably self-hostable for this ranking.
- “GLM5.1” and “Qwen 3.6”: excluded as unverified labels: this research set found strong evidence for Qwen3.5 (open weight) and multiple GLM‑4.x / GLM‑4.6V releases, but did not validate “Qwen 3.6” or “GLM5.1” as official self-hostable model releases; treat these names as unspecified/unverified.
- Language: English (en‑US).
- Current date reference: 2026‑04‑10, timezone Asia/Jakarta.
- Budget constraint: unspecified.
- Target latency SLO: unspecified.
- When a requested detail is not present in cited primary sources, it is marked unspecified rather than inferred (especially for exact VRAM, latency, and Rust-only benchmark scores).