AutoTrust

A Comprehensive Benchmark of Trustworthiness in Vision-Language Models for Autonomous Driving

AutoTrust Team

Texas A&M University
University of Toronto
University of Michigan
University of Wisconsin-Madison
University of Maryland
University of Texas at Austin
University of North Carolina at Chapel Hill

What is AutoTrust?

⚠️ WARNING: Our data contains model outputs that may be considered offensive.

Teaser Figure

AutoTrust aims to provide a thorough assessment of the trustworthiness of VLMs in autonomous driving tasks.

The research endeavor is designed to assist researchers and practitioners in better understanding the trustworthiness issues associated with the deployment of state-of-the-art Vision-Language Models (VLMs) for autonomous driving.

This project is organized around the following five primary perspectives of trustworthiness:

  • Trustfulness: Assessing the models' ability to provide factual responses and recognize potential inaccuracies.
  • Safety: Evaluating two critical dimensions: resilience against unintentional perturbations that may occur in real-world scenarios, and the ability to maintain reliable performance under potential malicious attacks on model inputs.
  • Robustness: Evaluating the out-of-distribution (OOD) robustness of DriveVLMs, covering both visual robustness and language robustness.
  • Privacy: Investigating whether DriveVLMs inadvertently leak privacy-sensitive information about traffic participants during the perception process.
  • Fairness: Evaluating the models' ability to make unbiased decisions across dimensions including race, age, gender, and vehicle characteristics such as brand, type, and color.

Key Findings:


  • General: Generalist VLMs demonstrate superior performance on trustworthiness compared to specialist DriveVLMs for autonomous driving, where GPT-4o-mini and LLaVA-v1.6 are the top two performers.
  • Trustfulness: Despite potential factual inaccuracies, DriveVLMs maintain comparable trustfulness to general-purpose VLMs thanks to better uncertainty handling.
  • Safety: Vulnerability to adversarial attacks correlates strongly with model size—smaller models are more fragile.
  • Robustness: DriveVLMs exhibit significant robustness issues, performing notably worse than generalists.
  • Privacy: DriveVLMs are ineffective at protecting privacy information, with Dolphins and EM-VLM4AD being particularly susceptible to privacy-leakage prompts, while GPT-4o-mini excels remarkably.
  • Fairness: Both generalist and specialist VLMs struggle to ensure unbiased decision-making. DriveVLMs perform consistently across models but show a noticeable performance gap compared to generalist VLMs.

Citation

@misc{xing2024autotrustbenchmarkingtrustworthinesslarge,
    title={AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving},
    author={Shuo Xing and Hongyuan Hua and Xiangbo Gao and Shenzhe Zhu and Renjie Li and Kexin Tian and Xiaopeng Li and Heng Huang and Tianbao Yang and Zhangyang Wang and Yang Zhou and Huaxiu Yao and Zhengzhong Tu},
    year={2024},
    eprint={2412.15206},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2412.15206},
}

Have Questions?

Ask us questions at tzz@tamu.edu.

Acknowledgements

We thank the SQuAD team and the DecodingTrust team for sharing their website template.