Recent advances in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning, key capabilities for autonomous driving, still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale, ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline, and it systematically evaluates VLMs' performance on both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that spatial-enhanced VLMs outperform general models in qualitative QA but are not competitive in quantitative QA. Overall, VLMs still face considerable challenges in spatial understanding and reasoning.
Our QA generation framework consists of two key pipelines:

① 3D Scene Graph Generation: We construct a 3D scene graph for each camera view in the NuScenes dataset. An auto-captioning process generates instance-level descriptions for each annotated object, and these captions are combined with selected 3D attributes from NuScenes to form node attributes. Spatial relationships between objects are encoded as edge attributes.

② QA Pair Generation: Based on the structured scene graph and our predefined QA templates, we generate multi-aspect QA pairs that comprehensively cover both spatial understanding and spatial reasoning, providing a holistic evaluation of the spatial capabilities of VLMs.

An overview of each step is presented in the following figure.
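In addition to the overview figure, the sketch below illustrates how the two pipelines could fit together in code. It is a minimal sketch only: the data classes, the placeholder captioning, and the simple distance/direction relations are assumptions made for illustration, not the released implementation.

```python
# Minimal sketch of the two pipelines (illustrative, not the released code).
import math
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    obj_id: str
    caption: str      # instance-level description (auto-captioning output)
    position: tuple   # selected 3D attributes taken from the NuScenes annotation
    size: tuple

@dataclass
class SceneGraph:
    view: str                                  # one of the 6 camera views
    nodes: dict = field(default_factory=dict)  # obj_id -> ObjectNode
    edges: dict = field(default_factory=dict)  # (id_a, id_b) -> spatial relation

def build_scene_graph(view, annotations):
    """Pipeline 1: build a per-view 3D scene graph from NuScenes annotations."""
    graph = SceneGraph(view=view)
    for ann in annotations:  # ann follows the nuScenes sample_annotation schema
        graph.nodes[ann["token"]] = ObjectNode(
            obj_id=ann["token"],
            caption=ann["category_name"].split(".")[-1],  # placeholder caption
            position=tuple(ann["translation"]),
            size=tuple(ann["size"]),
        )
    for a in graph.nodes.values():
        for b in graph.nodes.values():
            if a.obj_id != b.obj_id:
                graph.edges[(a.obj_id, b.obj_id)] = {
                    "distance_m": round(math.dist(a.position, b.position), 1),
                    # crude stand-in for a proper camera-frame spatial relation
                    "direction": "left" if a.position[0] < b.position[0] else "right",
                }
    return graph

def generate_qa_pairs(graph, templates):
    """Pipeline 2: fill predefined QA templates with scene-graph facts."""
    qa_pairs = []
    for (id_a, id_b), rel in graph.edges.items():
        a, b = graph.nodes[id_a], graph.nodes[id_b]
        for t in templates:
            qa_pairs.append({
                "type": t["type"],
                "question": t["question"].format(obj_a=a.caption, obj_b=b.caption),
                "answer": t["answer"].format(obj_a=a.caption, obj_b=b.caption, **rel),
            })
    return qa_pairs
```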
The final QA pairs are designed to comprehensively evaluate the spatial capabilities of vision-language models and are categorized into two main types: spatial understanding, which includes both qualitative and quantitative questions, and spatial reasoning, which covers direct and situational reasoning.
The figure below shows some examples of the QA pairs:
The final benchmark consists of approximately 3.5 million QA pairs, including around 2.5M qualitative and 0.6M quantitative questions under the spatial understanding category, as well as 0.2M reasoning-based questions covering both direct and situational reasoning.
These QA pairs span 6,000 keyframes from the NuScenes dataset, each with images from all 6 camera views.
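To make the category breakdown concrete, here is one hypothetical template per category; the wording is illustrative only and does not reproduce the benchmark's actual templates.

```python
# Hypothetical QA templates grouped by category (illustrative wording only;
# the benchmark's released templates may differ).
QA_TEMPLATES = {
    "spatial_understanding_qualitative": [
        {"type": "qualitative",
         "question": "Is the {obj_a} to the left or to the right of the {obj_b}?",
         "answer": "The {obj_a} is to the {direction} of the {obj_b}."},
    ],
    "spatial_understanding_quantitative": [
        {"type": "quantitative",
         "question": "How far is the {obj_a} from the {obj_b} in meters?",
         "answer": "The {obj_a} is about {distance_m} meters from the {obj_b}."},
    ],
    "spatial_reasoning": [
        {"type": "reasoning",  # situational: conditions on a hypothetical ego maneuver
         "question": "If the ego vehicle changes to the lane of the {obj_a}, "
                     "roughly how many meters ahead would the {obj_a} be?",
         "answer": "The {obj_a} would be roughly {distance_m} meters ahead."},
    ],
}
```

With the earlier sketch, a flat list such as QA_TEMPLATES["spatial_understanding_quantitative"] could be passed to generate_qa_pairs to produce one entry per object pair.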
As shown in the following figure, we compare NuScenes-SpatialQA with existing open-source benchmarks in autonomous driving and spatial reasoning to highlight its scale and coverage. It is the first large-scale, ground-truth-based QA benchmark specifically designed to evaluate both spatial understanding and spatial reasoning capabilities of VLMs in autonomous driving.
@article{tian2025nuscenes,
title={NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving},
author={Tian, Kexin and Mao, Jingrui and Zhang, Yunlong and Jiang, Jiwan and Zhou, Yang and Tu, Zhengzhong},
journal={arXiv preprint arXiv:2504.03164},
year={2025}
}