@zhaoyanpeng
Last active April 30, 2026 10:12
ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

arXiv HuggingFace Benchmark GitHub Repo Website

📗 Overview

TL;DR: ESPIRE is a diagnostic benchmark for assessing embodied spatial reasoning of vision-language models within a simulated physical environment. It contains diverse objects and scenes that support spatial reasoning across different aspects and at varying levels of granularity.

Check out our project website for more details!

🛠️ Installation

Requirements

Setups

Overview

The flowchart below highlights what can be done with ESPIRE.

flowchart LR
    A("`_ESPIRE_ SERVER`") <--> B("`APPLICATIONS<br>(_Preview/Evaluation/Generation_)`");

Start the ESPIRE server

We rely on Docker for a reproducible server environment. Build the Docker image with:

git clone https://github.com/spatigen/espire.git
cd espire
./scripts/compose.sh build

The build will take about half an hour. After it finishes, you can start the ESPIRE service by:

./scripts/compose.sh up

The above command starts a container named espire. You can interact with it from a new terminal by running:

docker exec -it espire /bin/bash
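Before attaching, it can help to confirm that the container is actually up. The following is a minimal sketch, not part of the ESPIRE codebase: it assumes the Docker CLI is on your PATH and uses the container name `espire` given above. The helper names (`ps_command`, `is_listed`, `container_running`) are hypothetical.

```python
import shutil
import subprocess

CONTAINER_NAME = "espire"  # container name stated in the text above


def ps_command(name: str) -> list[str]:
    """Build a `docker ps` call listing running containers whose name matches."""
    return ["docker", "ps", "--filter", f"name={name}", "--format", "{{.Names}}"]


def is_listed(ps_output: str, name: str) -> bool:
    """Return True if `name` appears as an exact line in the `docker ps` output.

    The name filter matches substrings, so an exact comparison is still needed.
    """
    return name in (line.strip() for line in ps_output.splitlines())


def container_running(name: str = CONTAINER_NAME) -> bool:
    """Ask the local Docker daemon whether the container is up."""
    if shutil.which("docker") is None:
        return False  # Docker CLI not installed or not on PATH
    result = subprocess.run(
        ps_command(name), capture_output=True, text=True, check=False
    )
    return result.returncode == 0 and is_listed(result.stdout, name)


if __name__ == "__main__":
    if container_running():
        print(f"'{CONTAINER_NAME}' is running; docker exec should work.")
    else:
        print(f"'{CONTAINER_NAME}' is not running; run ./scripts/compose.sh up first.")
```

If the check reports that the container is not running, start the service again and confirm that the 3D assets are in place.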

⚠️ ESPIRE relies on 3D assets for rendering; please follow our instructions to have them ready before starting the Docker container.

Local setup without Docker
  • Setup:
    bash scripts/setup.sh
    source scripts/env.sh
    export OMNI_KIT_ACCEPT_EULA=yes
  • If Vulkan init fails, verify the NVIDIA driver and check that vulkaninfo works.
  • If you run headless or over SSH, make sure the display setup is valid.
  • If Python build tools are missing, you may need:
    python3.10-dev python3.10-venv python3-pip build-essential git curl ninja-build pkg-config
  • If X11 / OpenGL / EGL / GTK / Vulkan libraries are missing, you may need:
    libx* libgl* libegl1 libglib2.0-0 libgtk-3-0 libnss3 libvulkan1
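The troubleshooting hints above can be turned into a small preflight check. This is a rough sketch under stated assumptions, not a script shipped with ESPIRE: the function name `preflight` is hypothetical, and treating a set `DISPLAY` variable as evidence of a valid display is an assumption that only applies to X11 sessions.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def preflight(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return human-readable warnings for the local (non-Docker) setup."""
    warnings = []
    # The setup step above exports OMNI_KIT_ACCEPT_EULA=yes.
    if env.get("OMNI_KIT_ACCEPT_EULA") != "yes":
        warnings.append("OMNI_KIT_ACCEPT_EULA is not set to 'yes'")
    # vulkaninfo is the tool the text suggests for checking Vulkan.
    if which("vulkaninfo") is None:
        warnings.append("vulkaninfo not found on PATH")
    # Assumption: on X11, a usable display is signalled by DISPLAY being set.
    if not env.get("DISPLAY"):
        warnings.append("DISPLAY is not set; headless or SSH sessions may need display setup")
    return warnings


if __name__ == "__main__":
    problems = preflight()
    for message in problems:
        print("warning:", message)
    if not problems:
        print("basic environment checks passed")
```

The environment and the PATH lookup are passed in as parameters so the checks can be exercised without touching the real system. A clean result only means these basic prerequisites look plausible; it does not verify the NVIDIA driver itself.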

Interact with the ESPIRE server

We provide a Jupyter notebook that demonstrates how to inspect the ESPIRE environment, including obtaining ego-centric and world-centric views. Check it out here. Instructions for evaluation and task generation are provided in the following sections.

Evaluation

We provide an implementation of the fully generative evaluation framework that adapts vision-language models to robotic tasks. Check out our evaluation codebase.

Scene and Task Generation

ESPIRE systematically generates diverse environments with varying clutter levels and instructions, enabling comprehensive evaluation of embodied spatial reasoning. It covers 148 task types, 65 instruction families, 3 difficulty levels (easy to hard), pick & place, and 2 task scenes. Please refer to our documentation for procedural generation of scenes and tasks.

✒️ Citation

@misc{zhao2026espire,
      title={ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models}, 
      author={Yanpeng Zhao and Wentao Ding and Hongtao Li and Baoxiong Jia and Zilong Zheng},
      year={2026},
      eprint={2603.13033},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13033}, 
}