DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning

1Texas A&M University, 2University of Southern California
*Corresponding author

Overview of the proposed DecAlign framework.

Abstract

Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges to achieving effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. To handle heterogeneity, we employ a prototype-guided optimal transport alignment strategy that leverages Gaussian mixture modeling and multi-marginal transport plans, mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by combining latent distribution matching with Maximum Mean Discrepancy (MMD) regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion and further reduce cross-modal inconsistencies. Extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight DecAlign's efficacy in achieving superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning.

Method Overview


  • Multimodal Representation Decoupling: DecAlign decouples multimodal features into modality-unique (heterogeneous) and modality-common (homogeneous) representations. Modality-unique encoders extract heterogeneous features to capture specific characteristics, while a modality-common encoder derives homogeneous features for common semantics. This decoupling effectively reduces cross-modal interference while simultaneously enhancing the ability to capture underlying semantic commonalities across modalities.
  • Heterogeneity Alignment: DecAlign introduces a prototype-guided optimal transport strategy to mitigate distributional discrepancies among modality-unique features, which often pose challenges to seamless cross-modal integration. By incorporating Gaussian Mixture Models (GMM), DecAlign effectively captures complex intra-modal structures, generating adaptive prototypes that act as alignment anchors. A multi-marginal optimal transport mechanism then dynamically aligns these prototypes across modalities, bridging distributional gaps while preserving modality-unique characteristics. This approach not only maintains the distinctive nuances of each modality but also facilitates a more cohesive integration into a shared multimodal representation space.
  • Homogeneity Alignment: To ensure semantic consistency and enable seamless multimodal fusion, DecAlign aligns common features through latent distribution matching and maximum mean discrepancy regularization. By systematically correcting global statistical shifts in means, covariances, and higher-order moments, DecAlign preserves the intrinsic semantic relationships of modality-common representations while mitigating distortions caused by distributional inconsistencies. This approach not only fosters effective feature integration but also enhances the model’s robustness in handling diverse multimodal scenarios.
  • Hierarchical Alignment Strategy: DecAlign bridges modality gaps by first aligning heterogeneous features through prototype-based optimal transport, then ensuring semantic coherence via latent space alignment and MMD regularization. This hierarchical approach enhances multimodal fusion accuracy and generalizability, as demonstrated in benchmark evaluations.
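The two alignment stages above can be sketched with toy components. This is a minimal illustration, not the released implementation: hypothetical k-means centers stand in for the paper's GMM prototypes, pairwise entropic (Sinkhorn) optimal transport stands in for the multi-marginal transport plan, and an RBF-kernel MMD gives the homogeneity term. All function names, dimensions, and values are illustrative assumptions.

```python
import numpy as np

def kmeans_prototypes(x, k, iters=20, seed=0):
    # Stand-in for GMM-based prototypes: cluster centers act as alignment anchors.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = x[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def sinkhorn(cost, eps=0.1, iters=200):
    # Entropic optimal transport with uniform marginals; returns the transport plan.
    K = np.exp(-cost / eps)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def mmd_rbf(x, y, gamma=0.1):
    # Maximum mean discrepancy with an RBF kernel (homogeneity alignment term).
    def kern(p, q):
        d = ((p[:, None, :] - q[None]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return kern(x, x).mean() + kern(y, y).mean() - 2 * kern(x, y).mean()

# Toy modality-unique and modality-common features for two modalities.
rng = np.random.default_rng(0)
lang = rng.normal(0.0, 1.0, (64, 8))
vis = rng.normal(0.5, 1.0, (64, 8))

# Heterogeneity alignment: transport between per-modality prototypes.
proto_l = kmeans_prototypes(lang, 4)
proto_v = kmeans_prototypes(vis, 4)
cost = ((proto_l[:, None, :] - proto_v[None]) ** 2).sum(-1)
cost = cost / cost.max()          # normalize so Sinkhorn stays numerically stable
plan = sinkhorn(cost)
hete_loss = (plan * cost).sum()   # transport cost under the learned plan

# Homogeneity alignment: MMD between modality-common features.
homo_loss = mmd_rbf(lang, vis)
print(round(float(hete_loss), 4), round(float(homo_loss), 4))
```

In the full framework these two terms would be added to the task loss during training; here they are simply computed once to show the shape of each component.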

Comparison Analysis

Comprehensive experiments conducted across four widely used multimodal benchmarks demonstrate DecAlign's superior performance compared to existing state-of-the-art methods. The results consistently show that DecAlign achieves substantial improvements in both fine-grained semantic distinction and overall alignment accuracy, highlighting the effectiveness and robustness of its hierarchical alignment strategy.


Table 1. Performance Comparison on CMU-MOSI and CMU-MOSEI datasets.

Table 2. Performance Comparison on CH-SIMS and IEMOCAP datasets.

Experimental results presented in Tables 1 and 2 indicate that DecAlign consistently outperforms existing state-of-the-art multimodal methods across various datasets and evaluation metrics. Specifically, DecAlign achieves the lowest Mean Absolute Error (MAE) and the highest correlation coefficients, binary accuracies (Acc-2), and F1 Scores, demonstrating significant improvements in both regression and classification tasks.


Figure 1. Visualization showcasing the superior performance of our proposed DecAlign, which surpasses state-of-the-art methods across multiple multimodal benchmarks.

The bubble chart visualization further emphasizes DecAlign's balanced and superior performance, maintaining high accuracy and F1 scores across diverse multimodal benchmarks and highlighting its robustness. The confusion matrices show that DecAlign substantially reduces misclassification across sentiment intensity categories: they exhibit stronger diagonal dominance and a sharper distinction between nuanced sentiment classes, underscoring the precision of its alignment and semantic understanding.

Ablation Analysis

Ablation studies (Tables 3 and 4) systematically evaluate the impact of DecAlign’s key modules and specific alignment strategies. Table 3 demonstrates that multimodal feature decoupling (MFD), heterogeneous alignment (Hete), and homogeneous alignment (Homo) each substantially contribute to performance, with the removal of either Hete or Homo individually resulting in minor performance drops, and the absence of both causing a notable decline. This confirms their essential roles and complementary interaction. Table 4 further analyzes individual alignment techniques such as Prototype-Based Optimal Transport (Proto-OT), Contrastive Training (CT), Semantic Consistency (Sem), and Maximum Mean Discrepancy (MMD). Results show that each alignment strategy significantly influences performance, emphasizing the critical importance of both fine-grained and global alignment mechanisms within DecAlign.


Table 3. Ablation study on different key modules for DecAlign on CMU-MOSI and CMU-MOSEI datasets.

Table 4. Ablation study on different alignment strategies for DecAlign on CMU-MOSI and CMU-MOSEI datasets.

Figure 2 visually demonstrates the impact of removing heterogeneous and homogeneous alignment modules, highlighting the consistent performance degradation across different sentiment categories when either module is omitted. This underscores the necessity of both alignment strategies in maintaining robust multimodal sentiment classification performance.


Figure 2. Visualization of ablation studies on accuracy comparison across different emotion categories.

Figure 3 offers a visual case study on the modality gap between vision and language features, illustrating that DecAlign effectively reduces modality discrepancies through hierarchical alignment. Models without heterogeneous or homogeneous alignment exhibit significantly larger modality gaps, validating DecAlign’s capability to bridge semantic and distributional differences across modalities.


Figure 3. Visualization of the modality gap between vision and language on CMU-MOSEI dataset.

Hyperparameter Sensitivity Analysis

Hyperparameter sensitivity analysis (Figure 4) examines the influence of the alignment trade-off parameters α and β on DecAlign's performance. Performance peaks at balanced values (α = 0.05, β = 0.05), while excessively large values degrade it markedly, suggesting that overly stringent alignment constraints impede effective multimodal fusion. Moderate settings strike the best balance, underscoring the importance of carefully tuning the alignment weights.
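A plausible way the trade-off parameters enter the objective is as weights on the two alignment terms added to the task loss. The composition below is a hedged sketch under that assumption; the function name and the exact form of each loss term are hypothetical, not taken from the paper's code.

```python
def total_loss(task, hete, homo, alpha=0.05, beta=0.05):
    """Hypothetical composite objective: task loss plus weighted alignment terms.

    alpha weights the heterogeneity (prototype-OT) alignment loss,
    beta weights the homogeneity (latent matching + MMD) alignment loss.
    The defaults mirror the balanced setting reported in the sensitivity study.
    """
    return task + alpha * hete + beta * homo

# With the balanced defaults, alignment terms contribute only a small fraction
# of the objective, matching the observation that large weights over-constrain fusion.
print(round(total_loss(1.0, 2.0, 3.0), 2))  # -> 1.25
```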


Figure 4. Hyperparameter sensitivity analysis on CMU-MOSI and CMU-MOSEI datasets in terms of Binary F1 Score.

BibTeX

BibTeX is coming soon.