Abstract
Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space when generating unimodal outputs. Through an analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative capabilities offers an effective paradigm for enhancing representation quality. We provide a theoretical explanation of the GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
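The abstract refers to two concrete ingredients: measuring anisotropy and kernel similarity structure of MLLM representations, and applying contrastive learning as a lightweight refinement on top of pooled embeddings. The sketch below illustrates both ideas under stated assumptions (random tensors stand in for pooled MLLM embeddings, mean-style pooling and an in-batch InfoNCE objective are assumptions); it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): anisotropy of an embedding set
# and an in-batch InfoNCE loss used for contrastive refinement.
import torch
import torch.nn.functional as F


def anisotropy(embeddings: torch.Tensor) -> torch.Tensor:
    """Average off-diagonal cosine similarity; values near 1 indicate a
    highly anisotropic (narrow-cone) embedding space."""
    normed = F.normalize(embeddings, dim=-1)
    sims = normed @ normed.T                      # (N, N) cosine-similarity kernel
    n = sims.size(0)
    off_diag = sims.masked_select(~torch.eye(n, dtype=torch.bool))
    return off_diag.mean()


def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Standard in-batch InfoNCE: the i-th query is paired with the i-th
    document; other documents in the batch serve as negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = (q @ d.T) / temperature              # (B, B) similarity logits
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy tensors standing in for pooled MLLM embeddings of paired inputs.
    queries = torch.randn(16, 768)
    documents = torch.randn(16, 768)
    print("anisotropy:", anisotropy(queries).item())
    print("InfoNCE loss:", info_nce_loss(queries, documents).item())
```

The kernel matrix `sims` is also what a kernel-similarity analysis would inspect; the InfoNCE term is the generic form of the contrastive refinement stage described above.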
Community
- We introduce LCO-Embedding, a language-centric omnimodal representation learning method, and the LCO-Embedding model family, setting a new state of the art on MIEB (Massive Image Embedding Benchmark) while also supporting audio and video.
- We introduce the Generation-Representation Scaling Law (GRSL), connecting models' generative capabilities to their representation upper bound.
- We introduce SeaDoc, a challenging visual-document retrieval task in Southeast Asian languages, and show that continual generative pretraining before contrastive learning raises the representation upper bound (a minimal sketch of this two-stage recipe follows this list).
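As referenced in the last bullet, the two-stage recipe can be sketched as continual generative pretraining (next-token prediction on in-domain documents) followed by contrastive refinement of the same backbone. The snippet below is a hedged illustration only, not the released training code: the backbone name, the toy data, and the mean-pooling choice are placeholders/assumptions.

```python
# Hedged sketch of the two-stage recipe: (1) continual generative
# pretraining, then (2) contrastive refinement on pooled hidden states.
# Model name and data are hypothetical placeholders, not the paper's setup.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"   # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# --- Stage 1: continual generative pretraining (next-token prediction) ---
batch = tokenizer(["example in-domain document ..."], return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])   # causal LM loss
out.loss.backward()
optimizer.step()
optimizer.zero_grad()

# --- Stage 2: contrastive refinement on pooled hidden states ---
def embed(texts):
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = model(**enc, output_hidden_states=True).hidden_states[-1]
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)    # mean pooling (an assumption)

q = F.normalize(embed(["query about a visual document"]), dim=-1)
d = F.normalize(embed(["matching document passage"]), dim=-1)
logits = q @ d.T / 0.05                            # in-batch InfoNCE logits
loss = F.cross_entropy(logits, torch.arange(q.size(0)))
loss.backward()
optimizer.step()
```

The point of the ordering is that the generative stage is run to completion before any contrastive objective is introduced; the contrastive stage then acts only as a lightweight refinement of the already-aligned representations.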
Related papers recommended by the Semantic Scholar API:
- SPANER: Shared Prompt Aligner for Multimodal Semantic Representation (2025)
- Think Then Embed: Generative Context Improves Multimodal Embedding (2025)
- OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment (2025)
- BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion (2025)
- MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction (2025)
- Training LLMs to be Better Text Embedders through Bidirectional Reconstruction (2025)
- Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement (2025)

