ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

📗 Overview

TL;DR: ESPIRE is a diagnostic benchmark for assessing embodied spatial reasoning of vision-language models within a simulated physical environment. It contains diverse objects and scenes that support spatial reasoning across different aspects and at varying levels of granularity.

Check out our project website for more details!

🛠️ Installation

Requirements

OS: Ubuntu 22.04
GPU: NVIDIA RTX 4090
RAM: 16 GB+ recommended
Docker: Install Docker Engine on Ubuntu
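
As a quick sanity check against the requirements above, a preflight script along these lines may help. This is a minimal sketch and not part of the ESPIRE repository: the function names `kb_to_gib`, `tool_status`, and `preflight` are hypothetical, and it assumes a Linux host that exposes `/etc/os-release` and `/proc/meminfo`.

```shell
#!/bin/sh
# Hypothetical preflight check; not shipped with ESPIRE.

# Convert a kB value (as reported in /proc/meminfo) to whole GiB, rounded down.
kb_to_gib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Print "ok" if the given command is on PATH, otherwise "missing".
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then echo "ok"; else echo "missing"; fi
}

preflight() {
    # Report the OS in a subshell so os-release variables do not leak.
    if [ -r /etc/os-release ]; then
        ( . /etc/os-release; echo "OS: ${PRETTY_NAME:-unknown} (README lists Ubuntu 22.04)" )
    fi
    # Report total RAM; a machine sold as 16 GB typically shows 15 here
    # because the value is rounded down.
    if [ -r /proc/meminfo ]; then
        mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
        if [ -n "$mem_kb" ]; then
            echo "RAM: $(kb_to_gib "$mem_kb") GiB (16 GB+ recommended)"
        fi
    fi
    echo "docker: $(tool_status docker)"
    echo "nvidia-smi: $(tool_status nvidia-smi)"
}

preflight
```

The script only reports what it finds; it does not verify the GPU model or the Docker version.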

Setup

Overview

The flowchart below highlights what can be done with ESPIRE.

flowchart LR
    A("`_ESPIRE_ SERVER`") <--> B("`APPLICATIONS<br>(_Preview/Evaluation/Generation_)`");

Start the ESPIRE server

We rely on Docker for a reproducible server environment. You can build the Docker image by:

git clone https://github.com/spatigen/espire.git
cd espire
./scripts/compose.sh build

The build will take about half an hour. After it finishes, you can start the ESPIRE service by:

./scripts/compose.sh up

The above command starts a container named espire. You can interact with it from a new terminal by running:

docker exec -it espire /bin/bash

⚠️ ESPIRE relies on 3D assets for rendering; please follow our instructions to have them ready before starting the Docker container.
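
Before attaching with `docker exec`, it can help to confirm that the espire container is actually running. The helpers below are a hedged sketch: `container_running` and `wait_for_container` are hypothetical names, not scripts from the ESPIRE repository, and the only Docker call used is the standard `docker ps --format`.

```shell
# Hypothetical helpers; not part of the ESPIRE repository.

# Succeed if a running container has exactly the given name.
container_running() {
    docker ps --format '{{.Names}}' | grep -qx "$1"
}

# Poll once per second until the container is running or the
# number of tries (default 30) is exhausted.
wait_for_container() {
    wfc_name=$1
    wfc_tries=${2:-30}
    wfc_i=0
    while [ "$wfc_i" -lt "$wfc_tries" ]; do
        if container_running "$wfc_name"; then
            return 0
        fi
        sleep 1
        wfc_i=$((wfc_i + 1))
    done
    return 1
}
```

For example, `wait_for_container espire 60 && docker exec -it espire /bin/bash` would attach only once the container shows up in `docker ps`.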

Local setup without Docker

Setup:

bash scripts/setup.sh
source scripts/env.sh
export OMNI_KIT_ACCEPT_EULA=yes

If Vulkan init fails, verify the NVIDIA driver and check that vulkaninfo works.
If you run headless or over SSH, make sure the display setup is valid.
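
The two checks above could be scripted roughly as follows. This is a sketch under assumptions: `display_status` and `check_graphics` are hypothetical helper names, and which display variables matter for a given headless setup is not specified by this README, so the script only reports `DISPLAY` and `WAYLAND_DISPLAY`.

```shell
# Hypothetical diagnostic; uses standard tools, not ESPIRE scripts.

# Report which display variable is set, if any.
display_status() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY=$DISPLAY"
    elif [ -n "${WAYLAND_DISPLAY:-}" ]; then
        echo "WAYLAND_DISPLAY=$WAYLAND_DISPLAY"
    else
        echo "unset"
    fi
}

check_graphics() {
    # NVIDIA driver: print GPU name and driver version if nvidia-smi exists.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
    else
        echo "nvidia-smi not found: the NVIDIA driver may be missing"
    fi
    # Vulkan: vulkaninfo exits non-zero when no usable Vulkan setup is found.
    if command -v vulkaninfo >/dev/null 2>&1; then
        if vulkaninfo >/dev/null 2>&1; then
            echo "vulkaninfo: ok"
        else
            echo "vulkaninfo: failed"
        fi
    else
        echo "vulkaninfo not found (on Ubuntu it ships in the vulkan-tools package)"
    fi
    echo "display: $(display_status)"
}

check_graphics
```

An "unset" display over SSH is not necessarily an error, but it is a useful first thing to rule out when rendering fails.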

If Python build tools are missing, you may need:
```
python3.10-dev python3.10-venv python3-pip build-essential git curl ninja-build pkg-config
```

If X11 / OpenGL / EGL / GTK / Vulkan libraries are missing, you may need:
```
libx* libgl* libegl1 libglib2.0-0 libgtk-3-0 libnss3 libvulkan1
```

Interact with the ESPIRE server

We provide a Jupyter notebook that demonstrates how to inspect the ESPIRE environment, including obtaining ego-centric and world-centric views. Check it out here. Instructions for evaluation and task generation are provided in the following sections.

Evaluation

We provide an implementation of the fully generative evaluation framework that adapts vision-language models to robotic tasks. Check out our evaluation codebase.

Scene and Task Generation

ESPIRE systematically generates diverse environments with varying clutter levels and instructions, enabling comprehensive evaluation of embodied spatial reasoning. The benchmark spans 148 task types, 65 instruction families, 3 difficulty levels (easy to hard), pick and place tasks, and 2 task scenes. Please refer to our documentation for procedural generation of scenes and tasks.

✒️ Citation

@misc{zhao2026espire,
      title={ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models}, 
      author={Yanpeng Zhao and Wentao Ding and Hongtao Li and Baoxiong Jia and Zilong Zheng},
      year={2026},
      eprint={2603.13033},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13033}, 
}
