Is Extending Modality The Right Path Towards Omni-Modality?
Abstract
This work investigates how modality extension and model merging affect the retention of language abilities and generalization in omni-modal language models.
Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities (such as text, images, video, and audio) while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
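As an illustration of question (2), the sketch below shows one common way to merge independently fine-tuned modality-specific models in parameter space: averaging each expert's delta from the shared base language model (task-arithmetic-style merging). The function name, the `alpha` scaling factor, and the state-dict handling are illustrative assumptions, not the paper's prescribed recipe.

```python
import torch

def merge_modality_experts(base_sd, expert_sds, alpha=0.5):
    """Merge modality-extended checkpoints by averaging their deltas from the base LM.

    base_sd    -- state dict of the original (text-only) language model
    expert_sds -- state dicts of models fine-tuned for individual modalities
    alpha      -- scale on the averaged delta; alpha=1.0 reduces to plain
                  parameter averaging of the experts on shared weights
    """
    merged = {}
    for name, base_param in base_sd.items():
        # Collect the change each modality extension made to this shared parameter.
        deltas = [sd[name] - base_param for sd in expert_sds if name in sd]
        if deltas:
            merged[name] = base_param + alpha * torch.stack(deltas).mean(dim=0)
        else:
            merged[name] = base_param.clone()
    # Newly introduced modality-specific modules (e.g., vision or audio encoders
    # and projectors) have no counterpart in the base model; a full merge would
    # copy them over from the expert that owns them. Omitted here for brevity.
    return merged

# Hypothetical usage with two experts fine-tuned from the same base model:
# merged_sd = merge_modality_experts(base_lm.state_dict(),
#                                    [vision_lm.state_dict(), audio_lm.state_dict()])
# base_lm.load_state_dict(merged_sd)
```

Whether such a merge preserves core language abilities while combining modality skills is exactly the trade-off the paper's experiments probe.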