NuScenes-SpatialQA:
A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

Kexin Tian1, Jingrui Mao1, Yunlong Zhang1, Jiwan Jiang2, Yang Zhou1,‡, Zhengzhong Tu1,‡
1Texas A&M University, 2University of Wisconsin–Madison
‡Corresponding authors.

Comprehensive experiments on our NuScenes-SpatialQA benchmark evaluate VLMs' spatial understanding and reasoning abilities, covering qualitative spatial relationship tasks (left) and quantitative spatial measurement tasks (right).

Abstract

Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning—key capabilities for autonomous driving—still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluates VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs' performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatial-enhanced VLM outperforms general models on qualitative QA but is not competitive on quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.

QA Generation Pipeline

Our QA generation framework involves two key pipelines:

  1. 3D Scene Graph Generation: We construct a 3D scene graph for each camera view in the NuScenes dataset. An auto-captioning process generates instance-level descriptions for each annotated object, and the generated captions are combined with selected 3D attributes from NuScenes as node attributes. Spatial relationships between objects are encoded as edge attributes.
  2. Q&A Pairs Generation: Based on the structured scene graph and our predefined QA templates, we generate multi-aspect QA pairs that comprehensively cover both spatial understanding and spatial reasoning, providing a holistic evaluation of the spatial capabilities of VLMs.

An overview of each step is presented in the following figure.

QA Generation Framework
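To make the scene-graph step concrete, below is a minimal Python sketch of how a per-camera-view 3D scene graph could be represented, with captioned object nodes and spatial-relation edges. The class and field names (e.g., ObjectNode, distance_to_ego) and the simple left/right rule are illustrative assumptions, not the benchmark's actual implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Minimal sketch of a per-camera-view 3D scene graph. Field names are
# illustrative; the actual pipeline combines auto-generated captions with
# selected 3D attributes from the NuScenes annotations.

@dataclass
class ObjectNode:
    instance_id: str                      # NuScenes instance token
    caption: str                          # instance-level description from auto-captioning
    category: str                         # e.g., "vehicle.car"
    center: Tuple[float, float, float]    # 3D box center in the camera frame (m)
    size: Tuple[float, float, float]      # width, length, height (m)
    distance_to_ego: float                # Euclidean distance to the ego vehicle (m)

@dataclass
class SpatialEdge:
    source: str                           # instance_id of the reference object
    target: str                           # instance_id of the related object
    relation: str                         # e.g., "left of", "in front of"

@dataclass
class SceneGraph:
    sample_token: str                     # NuScenes keyframe token
    camera: str                           # e.g., "CAM_FRONT"
    nodes: Dict[str, ObjectNode] = field(default_factory=dict)
    edges: List[SpatialEdge] = field(default_factory=list)

def add_left_right_relations(graph: SceneGraph) -> None:
    """Derive simple left/right edges from box centers (illustrative rule only)."""
    items = list(graph.nodes.values())
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            relation = "left of" if a.center[0] < b.center[0] else "right of"
            graph.edges.append(SpatialEdge(a.instance_id, b.instance_id, relation))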

QA Example

The final QA pairs are designed to comprehensively evaluate the spatial capabilities of vision-language models and are categorized into two main types (an illustrative template sketch follows the list):

  1. Spatial Understanding: Assesses direct spatial relationships and metric measurements, including:
    • Qualitative QA: Tasks that evaluate relative spatial relations, such as spatial relationships between objects and dimension comparisons.
    • Quantitative QA: Tasks that involve direct numerical estimation, requiring models to extract specific values such as distances, dimensions, or angles.
  2. Spatial Reasoning: Involves higher-level inference beyond direct attribute retrieval, including:
    • Direct Reasoning QA: Deductive questions based on object relations.
    • Situational Reasoning QA: Real-world scenario-based reasoning involving safety or physical constraints.
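As referenced above, the sketch below illustrates how template-based QA pairs of each category could be instantiated from scene-graph node captions and ground-truth attributes. The templates and the generate_qa helper are hypothetical examples for illustration; the benchmark's actual predefined templates differ.

# Illustrative templates only; the benchmark's actual predefined templates differ.
QA_TEMPLATES = {
    "qualitative_understanding": "Is the {obj_a} to the left or to the right of the {obj_b}?",
    "quantitative_understanding": "What is the distance in meters from the ego vehicle to the {obj_a}?",
    "direct_reasoning": "Which is closer to the ego vehicle, the {obj_a} or the {obj_b}?",
    "situational_reasoning": "If the ego vehicle keeps its current lane, could the {obj_a} obstruct its path?",
}

def generate_qa(category: str, answer: str, obj_a: str, obj_b: str = "") -> dict:
    """Fill a template with scene-graph captions and attach the ground-truth answer."""
    question = QA_TEMPLATES[category].format(obj_a=obj_a, obj_b=obj_b)
    return {"category": category, "question": question, "answer": answer}

# Example: a qualitative QA pair built from two node captions in the scene graph.
qa = generate_qa(
    "qualitative_understanding",
    answer="left",
    obj_a="white sedan parked by the curb",
    obj_b="pedestrian crossing the street",
)
print(qa["question"])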

The figure below shows some examples of the QA pairs:

QA Example.

Benchmark Statistics

The final benchmark consists of approximately 3.5 million QA pairs, including around 2.5M qualitative and 0.6M quantitative questions under the spatial understanding category, as well as 0.2M reasoning-based questions covering both direct and situational reasoning.

These QA pairs span 6,000 keyframes from the NuScenes dataset, each with images from 6 camera views.

As shown in the following table, we compare NuScenes-SpatialQA with existing open-source benchmarks in autonomous driving and spatial reasoning to highlight its scale and coverage. It is the first large-scale, ground-truth-based QA benchmark specifically designed to evaluate both the spatial understanding and spatial reasoning capabilities of VLMs in autonomous driving.

Benchmark Comparison Table.

Evaluation Results

Benchmark results on spatial understanding tasks.
Performance on spatial understanding tasks in NuScenes-SpatialQA. The upper part of the table reports results on Qualitative Spatial QA, where values represent accuracy (↑). The lower part presents results on Quantitative Spatial QA, where values correspond to Tolerance-based Accuracy* (↑) / MAE (↓). The baseline marked in the table is a spatial-enhanced VLM.
* Tolerance-based Accuracy is defined to measure the proportion of model responses that fall within the range of [75%, 125%] of the ground-truth answer.
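For clarity, here is a minimal sketch of how the reported Tolerance-based Accuracy and MAE could be computed, assuming numerical predictions and ground truths have already been parsed into floats (the parsing step is omitted and the function names are our own):

def tolerance_based_accuracy(preds, gts, lower=0.75, upper=1.25):
    """Fraction of predictions falling within [75%, 125%] of the ground truth."""
    hits = sum(1 for p, g in zip(preds, gts) if lower * g <= p <= upper * g)
    return hits / len(gts)

def mean_absolute_error(preds, gts):
    """Average absolute deviation between predictions and ground truths."""
    return sum(abs(p - g) for p, g in zip(preds, gts)) / len(gts)

# Example with distance estimates in meters: 2 of 3 answers fall within tolerance.
preds = [10.2, 7.9, 25.0]
gts = [10.0, 12.0, 24.0]
print(tolerance_based_accuracy(preds, gts))   # 0.666...
print(mean_absolute_error(preds, gts))        # (0.2 + 4.1 + 1.0) / 3 ≈ 1.77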
Benchmark results on spatial reasoning tasks.
Performance on Spatial Reasoning tasks in NuScenes-SpatialQA. The table reports Tolerance-based Accuracy (↑) across different VLMs.
Ablation study on backbone and scaling.
Effect of backbone architecture and model scaling on VLM performance. This table reports Tolerance-based Accuracy (↑) across different model variants of LLaVA-v1.6. The first two rows compare the impact of different backbone architectures (Mistral-7B vs. Vicuna-7B). The last three rows examine the effect of model scaling.
Ablation study on CoT.
Effects of CoT reasoning on VLM performance in NuScenes-SpatialQA.

BibTeX

@article{tian2025nuscenes,
  title={NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving},
  author={Tian, Kexin and Mao, Jingrui and Zhang, Yunlong and Jiang, Jiwan and Zhou, Yang and Tu, Zhengzhong},
  journal={arXiv preprint arXiv:2504.03164},
  year={2025}
}