Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation
Abstract
MambaRec enhances multimodal recommendation systems by integrating local feature alignment and global distribution regularization to improve cross-modal fusion and reduce representational bias.
Multimodal recommendation systems are increasingly becoming foundational technologies for e-commerce and content platforms, enabling personalized services by jointly modeling users' historical behaviors and the multimodal features of items (e.g., visual and textual). However, most existing methods rely on either static fusion strategies or graph-based local interaction modeling, facing two critical limitations: (1) insufficient ability to model fine-grained cross-modal associations, leading to suboptimal fusion quality; and (2) a lack of global distribution-level consistency, causing representational bias. To address these limitations, we propose MambaRec, a novel framework that integrates local feature alignment and global distribution regularization via attention-guided learning. At its core, we introduce the Dilated Refinement Attention Module (DREAM), which uses multi-scale dilated convolutions with channel-wise and spatial attention to align fine-grained semantic patterns between visual and textual modalities. This module captures hierarchical relationships and context-aware associations, improving cross-modal semantic modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive loss functions to constrain global modality alignment, enhancing semantic consistency. This dual regularization reduces modality-specific deviations and improves robustness. To improve scalability, MambaRec employs a dimensionality reduction strategy to lower the computational cost of high-dimensional multimodal features. Extensive experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency. Our code is publicly available at https://github.com/rkl71/MambaRec.
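The abstract describes DREAM as multi-scale dilated convolutions combined with channel-wise and spatial attention. The following is a minimal PyTorch sketch of such a block, reconstructed from that description alone; the class name `DreamSketch`, the dilation rates, the reduction ratio, and the residual connection are assumptions for illustration, and the authors' actual implementation is in the linked repository.

```python
# Minimal sketch of a DREAM-style block, written only from the description above
# (multi-scale dilated convolutions + channel-wise and spatial attention).
# Dilation rates, reduction ratio, and the residual connection are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn

class DreamSketch(nn.Module):
    def __init__(self, dim: int, dilations=(1, 2, 4)):
        super().__init__()
        # One dilated 1-D convolution branch per scale over the token sequence.
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv1d(dim * len(dilations), dim, kernel_size=1)
        # Channel-wise attention (squeeze-and-excitation style gating).
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, max(dim // 4, 1)), nn.ReLU(),
            nn.Linear(max(dim // 4, 1), dim), nn.Sigmoid(),
        )
        # Spatial attention: one gate per sequence position.
        self.spatial_gate = nn.Sequential(nn.Conv1d(dim, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):                                      # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)                                  # -> (batch, dim, seq_len)
        h = self.fuse(torch.cat([b(h) for b in self.branches], dim=1))
        c = self.channel_gate(h.mean(dim=-1)).unsqueeze(-1)    # (batch, dim, 1)
        s = self.spatial_gate(h)                               # (batch, 1, seq_len)
        return (h * c * s).transpose(1, 2) + x                 # residual, back to (batch, seq_len, dim)

# Hypothetical usage: a batch of 8 items, 32 tokens, 64-dim modality features.
aligned = DreamSketch(dim=64)(torch.randn(8, 32, 64))
```

How such a block is wired into the recommendation backbone is not specified in the abstract, so the sketch only covers the attention-guided refinement step itself.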
Community
We introduce MambaRec, a novel multimodal recommendation framework that integrates local feature alignment and global distribution regularization via attention-guided learning. At its core, MambaRec employs the Dilated Refinement Attention Module (DREAM) to capture fine-grained semantic patterns between visual and textual modalities, and applies MMD and contrastive losses to ensure global modality alignment. A dimensionality reduction strategy further improves scalability. Experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency.
GitHub: https://github.com/rkl71/MambaRec
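The summary above, like the abstract, mentions two global alignment objectives: an MMD term and a contrastive term. Below is a hedged sketch of how such losses are commonly implemented, using an RBF-kernel MMD and an InfoNCE-style contrastive loss between visual and textual item embeddings; the kernel bandwidths, temperature, and loss weighting are illustrative assumptions, not values from the paper.

```python
# Sketch of the two global alignment regularizers mentioned above:
# an RBF-kernel MMD term and an InfoNCE-style contrastive term between
# visual and textual item embeddings. Bandwidths, temperature, and the
# 0.5 weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def rbf_mmd(x, y, sigmas=(1.0, 2.0, 4.0)):
    """Biased (V-statistic) multi-kernel MMD^2 between batches of shape (n, d) and (m, d)."""
    xx, yy, xy = x @ x.t(), y @ y.t(), x @ y.t()
    rx, ry = xx.diag().unsqueeze(0), yy.diag().unsqueeze(0)
    dxx = rx.t() + rx - 2.0 * xx          # pairwise squared distances within x
    dyy = ry.t() + ry - 2.0 * yy          # within y
    dxy = rx.t() + ry - 2.0 * xy          # across x and y
    k_xx = k_yy = k_xy = 0.0
    for s in sigmas:                      # sum RBF kernels at several bandwidths
        k_xx = k_xx + torch.exp(-dxx / (2 * s ** 2))
        k_yy = k_yy + torch.exp(-dyy / (2 * s ** 2))
        k_xy = k_xy + torch.exp(-dxy / (2 * s ** 2))
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

def contrastive_loss(v, t, temperature=0.2):
    """InfoNCE over matched (visual, textual) pairs within a batch."""
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.t() / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Hypothetical combination on 128 items with 64-dim modality embeddings.
visual, textual = torch.randn(128, 64), torch.randn(128, 64)
alignment_loss = rbf_mmd(visual, textual) + 0.5 * contrastive_loss(visual, textual)
```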
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Multimodal Fusion And Sparse Attention-based Alignment Model for Long Sequential Recommendation (2025)
- Semantic Item Graph Enhancement for Multimodal Recommendation (2025)
- EGRA: Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation (2025)
- Knowledge graph-based personalized multimodal recommendation fusion framework (2025)
- Hypercomplex Prompt-aware Multimodal Recommendation (2025)
- Multi-modal Adaptive Mixture of Experts for Cold-start Recommendation (2025)
- Progressive Semantic Residual Quantization for Multimodal-Joint Interest Modeling in Music Recommendation (2025)