Is Extending Modality The Right Path Towards Omni-Modality?
Abstract
This work investigates how modality extension and model merging affect the retention of language abilities and generalization in omni-modal language models.
Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities (such as text, images, video, and audio) while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
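As an illustration of question (2), the sketch below shows one common way to merge independently fine-tuned modality-specific models in parameter space: averaging each expert's delta from the shared base language model (task-arithmetic-style merging). The function name, the `alpha` scaling factor, and the state-dict handling are illustrative assumptions, not the paper's prescribed recipe.

```python
import torch

def merge_modality_experts(base_sd, expert_sds, alpha=0.5):
    """Merge modality-extended checkpoints by averaging their deltas from the base LM.

    base_sd    -- state dict of the original (text-only) language model
    expert_sds -- state dicts of models fine-tuned for individual modalities
    alpha      -- scale on the averaged delta; alpha=1.0 reduces to plain
                  parameter averaging of the experts on shared weights
    """
    merged = {}
    for name, base_param in base_sd.items():
        # Collect the change each modality extension made to this shared parameter.
        deltas = [sd[name] - base_param for sd in expert_sds if name in sd]
        if deltas:
            merged[name] = base_param + alpha * torch.stack(deltas).mean(dim=0)
        else:
            merged[name] = base_param.clone()
    # Newly introduced modality-specific modules (e.g., vision or audio encoders
    # and projectors) have no counterpart in the base model; a full merge would
    # copy them over from the expert that owns them. Omitted here for brevity.
    return merged

# Hypothetical usage with two experts fine-tuned from the same base model:
# merged_sd = merge_modality_experts(base_lm.state_dict(),
#                                    [vision_lm.state_dict(), audio_lm.state_dict()])
# base_lm.load_state_dict(merged_sd)
```

Whether such a merge preserves core language abilities while combining modality skills is exactly the trade-off the paper's experiments probe.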