TL;DR: ESPIRE is a diagnostic benchmark for assessing embodied spatial reasoning of vision-language models within a simulated physical environment. It contains diverse objects and scenes that support spatial reasoning across different aspects and at varying levels of granularity.
Check out our project website for more details!
- OS: Ubuntu 22.04
- GPU: NVIDIA RTX 4090
- RAM: 16GB+ Recommended
- Docker: Install Docker Engine on Ubuntu
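Before building, it can help to confirm that the host meets the memory recommendation above. The helper below is a minimal sketch, not part of the ESPIRE repository: the function names `total_ram_gib` and `has_enough_ram` are hypothetical, and it assumes a Linux host that exposes `/proc/meminfo`.

```shell
# Hypothetical preflight helpers; not shipped with ESPIRE.

# Print total memory in GiB, rounded to the nearest integer.
# The meminfo path is a parameter so the function can be tried on sample data.
total_ram_gib() {
    # MemTotal is reported in kB; 1 GiB = 1048576 kB. Adding 0.5 before the
    # integer conversion rounds to nearest, so a "16 GB" machine that reports
    # slightly less than 16 GiB still counts as 16.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 + 0.5 }' "${1:-/proc/meminfo}"
}

# Succeed when total memory is at least the given number of GiB (default: 16).
has_enough_ram() {
    [ "$(total_ram_gib "${2:-/proc/meminfo}")" -ge "${1:-16}" ]
}
```

For example, `has_enough_ram 16 || echo "less than 16 GB of RAM detected"` prints a warning on smaller machines.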
The flowchart below highlights what can be done with ESPIRE.
flowchart LR
A("`_ESPIRE_ SERVER`") <--> B("`APPLICATIONS<br>(_Preview/Evaluation/Generation_)`");
We rely on Docker for a reproducible server environment. You can build the Docker image by:
git clone https://github.com/spatigen/espire.git
cd espire
./scripts/compose.sh build
The build will take about half an hour. After it finishes, you can start the ESPIRE service by:
./scripts/compose.sh up
The above command starts a container named espire. You can interact with it from a new terminal by running:
docker exec -it espire /bin/bash
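If `docker exec` is run before the container is up, it fails. The sketch below polls until a container named espire appears in `docker ps`. The function names `espire_running` and `wait_for_espire` are hypothetical helpers, not part of the ESPIRE scripts; the only Docker call used is `docker ps --format '{{.Names}}'`.

```shell
# Hypothetical helpers; not shipped with ESPIRE.

# Succeed if a running container has exactly the given name (default: espire).
espire_running() {
    name="${1:-espire}"
    # List running container names, one per line, and require a whole-line match.
    docker ps --format '{{.Names}}' | grep -qx "$name"
}

# Poll once per second until the container is running.
# Fails if it has not appeared within the timeout in seconds (default: 60).
wait_for_espire() {
    timeout="${1:-60}"
    elapsed=0
    until espire_running; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

A possible use is `wait_for_espire 120 && docker exec -it espire /bin/bash`, which opens the shell only once the container is listed as running.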
⚠️ ESPIRE relies on 3D assets for rendering; please follow our instructions to have them ready before starting the docker container.
Local setup without Docker
- Setup:
bash scripts/setup.sh
source scripts/env.sh
export OMNI_KIT_ACCEPT_EULA=yes
- If Vulkan init fails, verify the NVIDIA driver and check that vulkaninfo works.
- If you run headless or over SSH, make sure the display setup is valid.
- If Python build tools are missing, you may need:
python3.10-dev python3.10-venv python3-pip build-essential git curl ninja-build pkg-config
- If X11 / OpenGL / EGL / GTK / Vulkan libraries are missing, you may need:
libx* libgl* libegl1 libglib2.0-0 libgtk-3-0 libnss3 libvulkan1
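The troubleshooting checks above can be gathered into one quick diagnostic. This is a rough sketch under stated assumptions: the function names are hypothetical, the tool list is a guess at the binaries provided by the driver and packages mentioned above (for example, `ninja` from ninja-build), and a non-empty `DISPLAY` is only a coarse proxy for a valid display setup.

```shell
# Hypothetical diagnostic helpers; not shipped with ESPIRE.

# Report whether a command is available on PATH.
report_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Report whether an X display is configured (coarse check only).
report_display() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "ok: display"
    else
        echo "missing: display (DISPLAY is unset or empty)"
        return 1
    fi
}

# Run all checks; return non-zero if any of them fails.
espire_doctor() {
    status=0
    for tool in nvidia-smi vulkaninfo python3.10 git curl ninja pkg-config; do
        report_cmd "$tool" || status=1
    done
    report_display || status=1
    return "$status"
}
```

Running `espire_doctor` prints one line per check, so a missing driver tool or an unset display shows up before the setup script is run.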
We provide a Jupyter notebook that demonstrates how to inspect the ESPIRE environment, including obtaining ego-centric and world-centric views. Check it out here. Instructions for evaluation and task generation are provided in the following sections.
We provide an implementation of the fully generative evaluation framework that adapts vision-language models to robotic tasks. Check out our evaluation codebase.
ESPIRE systematically generates diverse environments with varying clutter levels and instructions — 148 task types · 65 instruction families · 3 difficulty levels (easy -> hard) · pick & place · 2 task scenes — enabling comprehensive evaluation of embodied spatial reasoning. Please refer to our documentation for procedural generation of scenes and tasks.
@misc{zhao2026espire,
title={ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models},
author={Yanpeng Zhao and Wentao Ding and Hongtao Li and Baoxiong Jia and Zilong Zheng},
year={2026},
eprint={2603.13033},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.13033},
}