Papers
arxiv:2507.10015

(Almost) Free Modality Stitching of Foundation Models

Published on Jul 14
· Submitted by Xa9aX on Jul 17
Authors:
,
,
,

Abstract

Hypernetwork Model Alignment (Hyma) optimizes uni-modal model selection and connector training for multi-modal models, reducing search costs while maintaining performance.

AI-generated summary

Foundation multi-modal models are often designed by stitching of multiple existing pretrained uni-modal models: for example, an image classifier with an text model. This stitching process is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal models selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for N times M combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best performing uni-modal model pair by 10times, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.

Community

Paper submitter

We propose Hyma, a hypernetwork based framework that allows stitching multiple pre-trained uni-modal models via connectors to form multi-modal models in a single run while being orders of magnitude cheaper in computational cost in comparison to Grid Search.

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.10015 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.10015 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.10015 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.