arxiv:2507.10015

(Almost) Free Modality Stitching of Foundation Models

Published on Jul 14

Abstract

Hypernetwork Model Alignment (Hyma) optimizes uni-modal model selection and connector training for multi-modal models, reducing search costs while maintaining performance.

AI-generated summary

Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with a text model. This stitching process is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large-scale web-based datasets, coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for N × M combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best performing uni-modal model pair by 10×, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
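To make the parameter-prediction idea concrete, here is a minimal forward-pass sketch of one shared hypernetwork emitting a linear connector for every one of the N × M model pairs. It is an illustration under stated assumptions, not the architecture used in the paper: the class name `TinyHypernetwork`, the per-model embeddings, the single linear hypernetwork layer, and the shared feature dimensions across encoders are all hypothetical simplifications, and no training loop is shown.

```python
import random
from itertools import product


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class TinyHypernetwork:
    """Hypothetical hypernetwork: maps an (image model, text model) pair
    to the weight matrix of a linear connector for that pair."""

    def __init__(self, n_image, n_text, emb_dim, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.in_dim, self.out_dim = in_dim, out_dim
        # Assumption: one embedding per candidate uni-modal model.
        # In a real system these would be learned; here they are random.
        self.image_emb = [[rng.gauss(0, 1) for _ in range(emb_dim)]
                          for _ in range(n_image)]
        self.text_emb = [[rng.gauss(0, 1) for _ in range(emb_dim)]
                         for _ in range(n_text)]
        # Shared hypernetwork weights, used for every pair. Shape:
        # (out_dim * in_dim) rows by (2 * emb_dim) columns.
        self.H = [[rng.gauss(0, 0.1) for _ in range(2 * emb_dim)]
                  for _ in range(in_dim * out_dim)]

    def connector_weights(self, i, j):
        """Predict the out_dim x in_dim connector matrix for pair (i, j)."""
        z = self.image_emb[i] + self.text_emb[j]  # concatenate embeddings
        flat = matvec(self.H, z)
        return [flat[r * self.in_dim:(r + 1) * self.in_dim]
                for r in range(self.out_dim)]

    def connect(self, i, j, image_feature):
        """Project an image feature with the connector predicted for (i, j)."""
        return matvec(self.connector_weights(i, j), image_feature)


# Usage: N image encoders and M text encoders give N * M connectors,
# all produced by the same shared parameters H.
N, M = 3, 2
hyper = TinyHypernetwork(n_image=N, n_text=M, emb_dim=4, in_dim=5, out_dim=3)
feature = [1.0, 0.5, -0.5, 2.0, 0.0]  # stand-in for an image encoder output
outputs = {(i, j): hyper.connect(i, j, feature)
           for i, j in product(range(N), range(M))}
print(len(outputs))
```

The point of the sketch is the contrast with grid search: grid search would train N × M independent connectors, whereas here a single set of shared parameters (`H` plus the embeddings) yields all of them, so one training run could in principle update every pair's connector jointly.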

Community

Xa9aX

Paper submitter about 15 hours ago

We propose Hyma, a hypernetwork-based framework that stitches multiple pre-trained uni-modal models via connectors to form multi-modal models in a single run, while being orders of magnitude cheaper in computational cost than grid search.
