---
license: mit
library_name: transformers
base_model: Qwen/Qwen2.5-Omni-3B
language:
- en
tags:
- clamr
- multimodal
- video-retrieval
- late-interaction
pipeline_tag: feature-extraction
---

# CLaMR: Multimodal Late-Interaction Retrieval

by David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

CLaMR (Contextualized Late-Interaction for Multimodal Content Retrieval) is a novel retrieval system designed for tasks involving multiple modalities, such as video frames, text (ASR, OCR), and descriptions. It adapts the ColBERT late-interaction strategy to a powerful multimodal foundation model, enabling fine-grained relevance scoring between a textual query and a rich set of multimodal document evidence. It was introduced in the paper [CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval](https://arxiv.org/abs/2506.06144).

## Model Description

This model is built upon a **Qwen2.5-Omni-3B** backbone. CLaMR encodes a textual query and various multimodal document sources (such as ASR, OCR, and video frames) into multi-vector representations. The core innovation is the **contextualized late-interaction mechanism**, which computes relevance by efficiently matching each query token embedding against all token embeddings from the various document modalities.

Unlike traditional methods that aggregate multimodal information into a single fixed-size vector, CLaMR preserves modality-specific details. This allows for a much more granular and interpretable similarity assessment, significantly improving retrieval performance on complex, multimodal documents. The model is trained to distinguish between relevant and irrelevant documents using a contrastive loss function.
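To make the late-interaction mechanism concrete, here is a minimal sketch of ColBERT-style MaxSim scoring, which this card's description follows: each query token keeps its best-matching document token, and the per-token maxima are summed. This is an illustration, not the model's actual forward pass; the embedding dimension and token counts below are made up, and in CLaMR the document tokens would be the contextualized embeddings pooled across modalities (video frames, ASR, OCR, description).

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings.
    doc_emb:   (num_doc_tokens, dim) L2-normalized document token embeddings,
               here standing in for the concatenated tokens of all modalities.
    Returns a scalar relevance score.
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T                   # (num_query_tokens, num_doc_tokens)
    # Each query token keeps only its best-matching document token...
    max_per_query_token = sim.max(dim=-1).values  # (num_query_tokens,)
    # ...and the per-token maxima are summed into the final score.
    return max_per_query_token.sum()

# Toy example with random embeddings (dimensions are illustrative only).
q = F.normalize(torch.randn(12, 128), dim=-1)
d = F.normalize(torch.randn(300, 128), dim=-1)
print(late_interaction_score(q, d))
```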

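The contrastive objective mentioned above can be instantiated with in-batch negatives over these late-interaction scores. The following is a hedged sketch of one common formulation (cross-entropy against the diagonal of the batch score matrix); the exact loss used for CLaMR, including any padding masks or temperature, may differ.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """In-batch contrastive loss over MaxSim scores.

    query_embs: (B, Lq, dim) normalized query token embeddings.
    doc_embs:   (B, Ld, dim) normalized document token embeddings;
                doc_embs[i] is the positive document for query_embs[i].
    """
    # Pairwise token similarities for every (query, document) pair in the batch.
    sim = torch.einsum("qnd,cmd->qcnm", query_embs, doc_embs)  # (B, B, Lq, Ld)
    # MaxSim: max over document tokens, sum over query tokens.
    scores = sim.max(dim=-1).values.sum(dim=-1)                # (B, B)
    # The i-th document is the positive for the i-th query; all other
    # in-batch documents act as negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy batch of 8 query/document pairs (shapes are illustrative only).
q = F.normalize(torch.randn(8, 12, 128), dim=-1)
d = F.normalize(torch.randn(8, 300, 128), dim=-1)
print(in_batch_contrastive_loss(q, d))
```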
## Model Training

### Dataset

The model was trained on MSRVTT.

### Training Parameters

The model was trained using the following configuration (a sketch of how these hyperparameters map onto a `peft`/`bitsandbytes` setup is given after the citation below):

- **Framework:** PEFT with LoRA
- **LoRA `r`:** 128
- **LoRA `alpha`:** 128
- **LoRA target modules:** `down_proj`, `gate_proj`, `up_proj`, `k_proj`, `q_proj`, `v_proj`, `o_proj`, and a `custom_text_proj` layer
- **Optimizer:** `paged_adamw_8bit`
- **Learning rate:** 1e-5 with linear decay and a 0.1 warmup ratio
- **Precision:** 4-bit quantization with `bfloat16` compute
- **Hardware:** 8x NVIDIA A100 80GB GPUs
- **Batch size:** 4 per device for training, 2 for evaluation
- **Epochs:** 5

## Citation

If you use CLaMR in your research, please cite the following paper:

```bibtex
@misc{wan2025clamrcontextualizedlateinteractionmultimodal,
      title={CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval},
      author={David Wan and Han Wang and Elias Stengel-Eskin and Jaemin Cho and Mohit Bansal},
      year={2025},
      eprint={2506.06144},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.06144},
}
```
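For readers reproducing a comparable fine-tuning setup, the hyperparameters listed in the training section map roughly onto the configuration below. This is a hypothetical reconstruction, not the released training code; in particular, the `task_type` value and the placeholder `output_dir` are assumptions.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantization with bfloat16 compute, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter matching the listed r/alpha and target modules.
# `custom_text_proj` is the extra projection layer named in this card.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=[
        "down_proj", "gate_proj", "up_proj",
        "k_proj", "q_proj", "v_proj", "o_proj",
        "custom_text_proj",
    ],
    task_type="FEATURE_EXTRACTION",  # assumption: retrieval-style embedding task
)

# Optimizer and schedule settings as listed above (output_dir is a placeholder).
training_args = TrainingArguments(
    output_dir="clamr-qwen2.5-omni-3b-lora",
    optim="paged_adamw_8bit",
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    num_train_epochs=5,
    bf16=True,
)
```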