The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advances have focused on applying direct preference optimization (DPO) to carefully curated datasets to mitigate these issues. Yet such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question answering (VQA) tasks. Moreover, we show that Re-Align remains robust and scalable across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications.
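To make the dual-preference dataset described above concrete, the following is a minimal sketch of how one such sample could be represented, pairing a textual preference (chosen vs. rejected response) with a visual preference (original vs. retrieved image). The class, the `retrieve_counterpart_image` helper, and all field names are hypothetical illustrations, not the data format released with Re-Align.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DualPreferenceSample:
    """A single alignment example carrying both textual and visual preference signals."""
    prompt: str             # user question about the image
    chosen_image: str       # path to the original (preferred) image
    rejected_image: str     # path to a retrieved, visually similar distractor image
    chosen_response: str    # preferred answer, grounded in the original image
    rejected_response: str  # dispreferred (e.g., hallucinated) answer

def make_sample(prompt: str, image: str, chosen: str, rejected: str,
                retrieve_counterpart_image: Callable[[str], str]) -> DualPreferenceSample:
    """Attaches a visual preference to a textual one by retrieving a distractor image."""
    return DualPreferenceSample(
        prompt=prompt,
        chosen_image=image,
        rejected_image=retrieve_counterpart_image(image),
        chosen_response=chosen,
        rejected_response=rejected,
    )
```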
We propose retrieval-augmented direct preference optimization (rDPO), an extension of DPO that integrates an additional visual preference optimization objective, which is formulated as follows:
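As a minimal sketch of one plausible instantiation, assume the visual term mirrors the textual DPO loss but contrasts the original (preferred) image $I_w$ with a retrieved (dispreferred) image $I_l$ for the same chosen response, with an assumed weighting coefficient $\lambda$:

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,I_w,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x, I_w)}{\pi_{\mathrm{ref}}(y_w\mid x, I_w)}
-\beta\log\frac{\pi_\theta(y_l\mid x, I_w)}{\pi_{\mathrm{ref}}(y_l\mid x, I_w)}
\right)\right],
\]
\[
\mathcal{L}_{\text{vis}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,I_w,\,I_l,\,y_w)\sim\mathcal{D}}
\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x, I_w)}{\pi_{\mathrm{ref}}(y_w\mid x, I_w)}
-\beta\log\frac{\pi_\theta(y_w\mid x, I_l)}{\pi_{\mathrm{ref}}(y_w\mid x, I_l)}
\right)\right],
\]
\[
\mathcal{L}_{\text{rDPO}} = \mathcal{L}_{\text{DPO}} + \lambda\,\mathcal{L}_{\text{vis}},
\]

where $x$ is the prompt, $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_\theta$ and $\pi_{\mathrm{ref}}$ are the policy and frozen reference models, $\sigma$ is the sigmoid, and $\beta$ is the usual DPO temperature. Under this reading, the first term is the standard textual preference objective, while the second encourages the policy to prefer grounding the chosen response in the true image rather than a retrieved distractor.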
Re-Align achieves the best results among the evaluated methods on both POPE and HallusionBench for LLaVA-v1.5-7B and LLaVA-v1.6-Mistral-7B, highlighting the effectiveness of our approach in mitigating hallucinations in VLMs. Furthermore, Re-Align delivers performance that is generally on par with or better than the vanilla models and baseline alignment methods on every evaluated general VQA task, ultimately achieving the best overall results.
Table 3 presents the performance of Re-Align with both standard image-to-text and unified VLM backbones, spanning model sizes from 1B to 13B, on the POPE benchmark. In experiments with the LLaVA-v1.5 series, none of the baseline approaches consistently improves performance for either the 7B or the 13B model, highlighting their limited scalability. In contrast, Re-Align achieves substantial gains, outperforming both the baseline methods and the vanilla models, most notably on the LLaVA-v1.5-13B variant. Experiments with the LLaVA-v1.6-Vicuna series reveal the same trend, further underscoring Re-Align's superior scalability. For unified vision-language models, especially Janus-Pro, integrating Re-Align yields a significant performance boost, with Janus-Pro-1B showing the largest improvement, underscoring the robustness of Re-Align across different model architectures.
Table 4 summarizes the performance of Re-Align when using either standard DPO or rDPO as the preference optimization objective, evaluated on general VQA and hallucination tasks with LLaVA-v1.5-7B and LLaVA-v1.6-Mistral-7B as backbones. The results indicate that employing rDPO as the fine-tuning objective consistently yields superior performance over standard DPO across both task categories, highlighting the benefit of incorporating visual preference signals during the alignment of VLMs.
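As an implementation-level counterpart to the objective sketched earlier, and under the same assumptions about the visual term and its weighting, a minimal PyTorch-style loss could be computed as follows; the batch field names and the coefficient `lam` are illustrative, not taken from the Re-Align release. Setting `lam` to zero recovers standard DPO, which corresponds to the comparison reported in Table 4.

```python
import torch.nn.functional as F

def dpo_term(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO logistic loss over one preference pair of sequence log-probabilities."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

def rdpo_loss(batch, beta=0.1, lam=1.0):
    """Hypothetical rDPO loss: a textual DPO term plus a visual preference term.

    `batch` is assumed to hold sequence log-probabilities of each response under the
    policy and the frozen reference model; the `*_img_w` / `*_img_l` entries condition
    the chosen response on the original vs. retrieved image.
    """
    text_term = dpo_term(batch["policy_chosen_logp"], batch["policy_rejected_logp"],
                         batch["ref_chosen_logp"], batch["ref_rejected_logp"], beta)
    visual_term = dpo_term(batch["policy_chosen_logp_img_w"], batch["policy_chosen_logp_img_l"],
                           batch["ref_chosen_logp_img_w"], batch["ref_chosen_logp_img_l"], beta)
    return text_term + lam * visual_term
```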
@article{Xing2025Feb,
author = {Xing, Shuo and Wang, Yuping and Li, Peiran and Bai, Ruizheng and Wang, Yueqi and Qian, Chengxuan and Yao, Huaxiu and Tu, Zhengzhong},
title = {{Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization}},
journal = {arXiv},
year = {2025},
month = feb,
eprint = {2502.13146},
doi = {10.48550/arXiv.2502.13146}
}