Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization

Texas A&M University, University of Michigan, UIUC, UNC Chapel Hill
arXiv 2025

*Corresponding author

Benchmark performance comparison (min-max normalized).

Abstract

The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications.

Preference Generation


Figure 1. Illustration of the preference generation process, which uses the original vision encoder of the initial VLM for image retrieval and a SentenceTransformer model as the text encoder.

  • Strategic masking: Given an input (image, instruction) pair and the corresponding chosen response generated by a pretrained VLM, a strategic masking process removes words or segments associated with objects, attributes, or logical relationships inferred from the image, producing the masked response.
  • Image retrieval: All images in the training set are embedded using the original vision encoder of the pretrained VLM, forming the knowledge base. The top-k images most similar to the input image are then retrieved from the knowledge base using a cosine similarity search.
  • Inducing hallucinations: The VLM is prompted to complete the masked response conditioned on the input instruction and one retrieved image at a time, taking the retrieved images in descending order of cosine similarity to the input image. Both the chosen response and the reconstructed response are embedded with a SentenceTransformer model. If the cosine similarity between the two embeddings falls below 0.95, the reconstructed response is designated as the rejected response. Otherwise, the process moves on to the next retrieved image until a suitable candidate is found or all retrieved images have been examined.
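
The retrieval and rejection steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `retrieve_top_k`, `pick_rejected`, `complete_fn`, and `embed_fn` are hypothetical names, the embeddings are plain NumPy vectors standing in for vision-encoder and SentenceTransformer outputs, and excluding the input image from its own retrieval results is an assumption the text does not state.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between vector a (d,) and each row of matrix b (n, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a


def retrieve_top_k(query_emb, kb_embs, k, exclude=None):
    """Indices of the k knowledge-base images most similar to the query, best first.

    `exclude` drops one index (assumed here to be the input image itself).
    """
    sims = cosine_sim(query_emb, kb_embs)
    order = [int(i) for i in np.argsort(-sims) if i != exclude]
    return order[:k]


def pick_rejected(chosen, masked, instruction, retrieved_images,
                  complete_fn, embed_fn, threshold=0.95):
    """Return the first completion whose similarity to `chosen` is below threshold.

    complete_fn(instruction, image, masked) -> str  (hypothetical VLM call)
    embed_fn(text) -> 1-D vector                    (hypothetical text encoder)
    Returns None if every retrieved image yields a too-similar completion.
    """
    chosen_emb = embed_fn(chosen)
    for image in retrieved_images:  # already ranked by similarity to the input image
        candidate = complete_fn(instruction, image, masked)
        cand_emb = np.asarray(embed_fn(candidate), dtype=float)[None, :]
        if float(cosine_sim(chosen_emb, cand_emb)[0]) < threshold:
            return candidate
    return None
```

In a real pipeline `embed_fn` would wrap the text encoder and `complete_fn` the VLM's generation call; returning `None` corresponds to the case where all retrieved images have been examined without finding a suitable candidate.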

Preference Optimization

We propose retrieval-augmented direct preference optimization (rDPO), an extension of DPO that integrates an additional visual preference optimization objective.
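
As a sketch of how such an objective could be written, the block below pairs the standard DPO term over responses with a second term over images. The notation is assumed rather than taken from the text: x is the instruction, v the input image, v' the retrieved image used to induce the rejected response, y_w and y_l the chosen and rejected responses, π_θ the policy, π_ref the frozen reference model, β the usual DPO temperature, and σ the logistic function. The equal weighting of the two terms is likewise an assumption.

```latex
\mathcal{L}_{\mathrm{rDPO}}
= -\,\mathbb{E}_{(x,\,v,\,v',\,y_w,\,y_l)}\Bigg[
  \underbrace{\log\sigma\!\left(
    \beta\log\frac{\pi_\theta(y_w \mid x, v)}{\pi_{\mathrm{ref}}(y_w \mid x, v)}
    - \beta\log\frac{\pi_\theta(y_l \mid x, v)}{\pi_{\mathrm{ref}}(y_l \mid x, v)}
  \right)}_{\text{textual preference (standard DPO)}}
  + \underbrace{\log\sigma\!\left(
    \beta\log\frac{\pi_\theta(y_w \mid x, v)}{\pi_{\mathrm{ref}}(y_w \mid x, v)}
    - \beta\log\frac{\pi_\theta(y_w \mid x, v')}{\pi_{\mathrm{ref}}(y_w \mid x, v')}
  \right)}_{\text{visual preference}}
\Bigg]
```

Under this reading, the first term prefers the chosen response over the rejected one given the same image, while the second prefers the input image over the retrieved image given the same chosen response, so the model is penalized when it assigns the chosen response similar likelihood regardless of which image it sees.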

Results

Re-Align achieves the best results among the evaluated methods on both POPE and HallusionBench for LLaVA-v1.5-7B and LLaVA-v1.6-Mistral-7B, highlighting its effectiveness in mitigating VLM hallucinations. Furthermore, Re-Align performs generally on par with or better than the vanilla models and the baseline alignment methods on each evaluated general VQA task, achieving the best overall results.


Table 1. Impact of Re-Align across hallucination benchmarks for VLMs, and comparisons with baselines.

Table 2. Impact of Re-Align across general benchmarks for VLMs, and comparisons with baselines.

Table 3 presents the performance of Re-Align on the POPE benchmark using both standard image-to-text and unified VLM backbones, across model sizes from 1B to 13B. In experiments with the LLaVA-v1.5 series, none of the baseline approaches consistently improves performance for either the 7B or the 13B model, highlighting the limited scalability of these methods. In contrast, Re-Align achieves substantial performance gains, outperforming both the baseline methods and the vanilla model, most notably on the LLaVA-v1.5-13B variant. Experiments with the LLaVA-v1.6-Vicuna series show the same trend, further underscoring Re-Align's superior scalability. For unified vision-language models, especially Janus-Pro, integrating Re-Align yields a significant performance boost. Notably, Janus-Pro-1B shows the greatest improvement, underscoring the robustness of Re-Align across different model architectures.


Table 3. Impact of Re-Align across various model scales on POPE.

Table 4 summarizes the performance of Re-Align when using either standard DPO or rDPO as the preference optimization objective, evaluated on general VQA and hallucination tasks with LLaVA-v1.5-7B and LLaVA-v1.6-Mistral-7B as backbones. The results indicate that using rDPO as the fine-tuning objective consistently yields better performance than standard DPO across both task categories, highlighting the benefit of incorporating visual preference signals when aligning VLMs.


Table 4. Impact of rDPO across general and hallucination benchmarks for VLMs, and comparisons with baselines.

BibTeX

@article{Xing2025Feb,
        author = {Xing, Shuo and Wang, Yuping and Li, Peiran and Bai, Ruizheng and Wang, Yueqi and Qian, Chengxuan and Yao, Huaxiu and Tu, Zhengzhong},
        title = {{Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization}},
        journal = {arXiv},
        year = {2025},
        month = feb,
        eprint = {2502.13146},
        doi = {10.48550/arXiv.2502.13146}
      }