arXiv:2504.09925

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Published on Apr 14 · Submitted by starriver030515 on Apr 15

Abstract

We introduce FUSION, a family of multimodal large language models (MLLMs) with a full vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, which incorporates textual information into vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding, which recursively aggregates visual features conditioned on the textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop a Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales, 3B and 8B, and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks, and continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that, under the same configuration and without dynamic resolution, FUSION outperforms LLaVA-NeXT on over half of the benchmarks, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION
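
The abstract names three mechanisms but not how they fit together. Below is a minimal, self-contained PyTorch sketch of how such components could compose; every module name, dimension, and piece of wiring here is a hypothetical simplification for illustration, not the authors' implementation (see the linked repository for the real code).

```python
# Hypothetical sketch of the three mechanisms named in the abstract.
# Not the authors' code; see https://github.com/starriver030515/FUSION.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # shared hidden size (arbitrary for this sketch)

class TextGuidedVisionEncoder(nn.Module):
    """Text-Guided Unified Vision Encoding: image patches attend to the
    question tokens inside the vision encoder, so text shapes the visual
    features from the start ("pixel-level" integration)."""
    def __init__(self, d=D, layers=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(d, num_heads=4, batch_first=True)
            for _ in range(layers))

    def forward(self, patches, text):  # (B, P, D), (B, T, D)
        for attn in self.blocks:
            ctx = torch.cat([patches, text], dim=1)  # patches see the text
            patches, _ = attn(patches, ctx, ctx)
        return patches

class RecursiveAlignmentDecoder(nn.Module):
    """Context-Aware Recursive Alignment Decoding: the textual hidden
    states repeatedly re-aggregate the visual features, refining the
    alignment at the question level rather than once up front."""
    def __init__(self, d=D, steps=3):
        super().__init__()
        self.steps = steps
        self.cross = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, hidden, vision):  # (B, T, D), (B, P, D)
        for _ in range(self.steps):
            agg, _ = self.cross(hidden, vision, vision)
            hidden = hidden + agg  # residual refinement each recursion
        return hidden

def dual_supervised_mapping_loss(vision, text, v2t, t2v):
    """Dual-Supervised Semantic Mapping Loss (sketch): supervise the
    projection in both directions, so vision mapped into text space
    matches a text summary and vice versa, shrinking the modality gap."""
    loss_v2t = F.mse_loss(v2t(vision).mean(dim=1), text.mean(dim=1))
    loss_t2v = F.mse_loss(t2v(text).mean(dim=1), vision.mean(dim=1))
    return loss_v2t + loss_t2v

# Smoke test with random tensors.
B, P, T = 2, 16, 8
patches, text = torch.randn(B, P, D), torch.randn(B, T, D)
vision = TextGuidedVisionEncoder()(patches, text)
hidden = RecursiveAlignmentDecoder()(text, vision)
loss = dual_supervised_mapping_loss(vision, text,
                                    nn.Linear(D, D), nn.Linear(D, D))
print(hidden.shape, loss.item())
```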

Community

Paper author Paper submitter
| Model | # Vis Tok. | MMB_EN | MMB_CN | VizWiz | POPE | MM-Vet | MME_P | MME_C | Seed-Image | HallB | LLaVA_W | MMStar | MME-RW | RWQA | CV-Bench | MMVP | AI2D | MathVista | MMMU | SQA | TextVQA | OCRBench | ChartQA | DocVQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **≤4B Model Comparison** | | | | | | | | | | | | | | | | | | | | | | | | |
| Qwen2.5VL 3B | - | 79.1 | 78.1 | - | 85.9 | 61.4 | 1592.4 | 607.5 | 74.0 | 46.6 | - | 56.3 | 53.1 | 65.4 | - | - | 81.4 | 61.2 | 51.2 | 79.3 | - | 82.8 | 84.0 | 93.93 |
| InternVL2 4B | - | 78.5 | 73.9 | - | 84.6 | 50.5 | 1532.8 | 531.8 | 73.2 | 42.4 | - | 53.9 | 52.1 | 60.5 | - | - | 79.0 | 58.5 | 48.3 | 96.0 | 74.7 | 78.4 | 81.5 | 89.2 |
| DeepSeek-VL2-Tiny | - | 74.6 | 72.1 | - | - | 52.5 | 1548.3 | 357.1 | 72.3 | 39.6 | - | 45.9 | - | 64.2 | - | - | 71.6 | 53.6 | 40.7 | - | 80.7 | 80.5 | 81.0 | 86.9 |
| MM1.5 3B | - | - | - | - | 88.1 | 41.0 | 1478.4 | 319.6 | 72.4 | - | 73.0 | - | - | 56.9 | - | - | 65.7 | 44.4 | 37.1 | 85.8 | 76.5 | 65.7 | 74.2 | 87.5 |
| Phi 3.5-Vision | - | 75.5 | 64.2 | 58.2 | 82.2 | 46.5 | 1473.4 | 412.1 | 69.9 | 53.3 | 68.8 | 49.0 | - | 53.5 | 69.3 | 67.7 | 77.4 | - | 43.3 | 89.0 | 61.1 | 59.8 | 72.0 | 75.9 |
| Florence-VL 3B | 576 | 71.6 | 60.8 | 59.1 | 88.3 | 51.0 | 1498.7 | 403.9 | 70.6 | 58.1 | 71.1 | 44.9 | - | 60.4 | 70.2 | 64.7 | 73.8 | 52.2 | 41.8 | 84.6 | 69.1 | 63.0 | 70.7 | - |
| FUSION 3B (ours) | 780 | 79.5 | 71.7 | 64.6 | 88.9 | 57.2 | 1595.9 | 416.5 | 74.6 | 51.4 | 84.7 | 52.4 | 41.5 | 65.1 | 76.4 | 76.0 | 78.9 | 54.3 | 44.7 | 87.1 | 71.8 | 60.0 | 75.7 | 70.9 |
| FUSION-X 3B (ours) | 620 | 80.3 | 74.8 | 66.1 | 88.7 | 60.3 | 1582.1 | 440.0 | 75.3 | 51.9 | 85.2 | 50.9 | 41.7 | 63.7 | 78.3 | 78.1 | 79.2 | 54.9 | 44.2 | 87.3 | 73.9 | 63.7 | 75.8 | 71.1 |
| FUSION-L 3B (ours) | 308 | 77.6 | 70.8 | 65.3 | 88.3 | 56.7 | 1573.7 | 406.8 | 74.1 | 48.7 | 77.6 | 44.7 | 39.5 | 61.8 | 76.2 | 77.0 | 77.3 | 48.6 | 43.4 | 85.6 | 71.4 | 56.9 | 67.7 | 63.5 |
| **≥7B Model Comparison** | | | | | | | | | | | | | | | | | | | | | | | | |
| Qwen2VL 7B | - | 83.0 | 80.5 | - | 88.4 | 62.0 | 1639.2 | 637.1 | 76.0 | 50.6 | - | 60.7 | 57.4 | 70.1 | - | - | 83.0 | 58.2 | 54.1 | 85.5 | 84.3 | 86.6 | 83.0 | 94.5 |
| InternVL2 8B | - | 81.7 | 81.2 | - | 86.9 | 54.2 | 1639.7 | 575.3 | 75.4 | 45.2 | - | 61.5 | 53.5 | 64.4 | - | - | 83.6 | 58.3 | 52.6 | 96.3 | 77.4 | 79.4 | 83.3 | 91.6 |
| LLaVA-OneVision 8B | - | 81.7 | 78.0 | - | 87.2 | 58.8 | 1626.0 | 483.0 | 74.8 | 47.5 | 86.9 | 60.9 | 57.5 | 65.5 | - | - | 81.6 | 56.1 | 47.7 | 96.6 | 78.5 | 69.7 | 78.8 | 87.5 |
| MM1.5 7B | - | - | - | - | 88.6 | 42.2 | 1514.9 | 346.4 | 73.4 | - | 74.2 | - | - | 62.5 | - | - | 72.2 | 47.6 | 41.8 | 89.6 | 76.5 | 63.5 | 88.1 | 78.2 |
| Cambrian 8B | 576 | 75.9 | 67.9 | - | 87.4 | 48.0 | 1547.1 | - | 74.7 | 48.7 | 71.0 | 50.0 | - | 64.2 | 72.2 | 51.3 | 73.0 | 49.0 | 42.7 | 80.4 | 71.7 | 62.4 | 73.3 | 77.8 |
| Florence-VL 8B | 576 | 76.2 | 69.5 | 59.1 | 89.9 | 56.3 | 1560.0 | 381.1 | 74.9 | 57.3 | 74.2 | 50.0 | - | 64.2 | 73.4 | 73.3 | 74.2 | 55.5 | 43.7 | 85.9 | 74.2 | 63.4 | 74.7 | - |
| Eagle 8B | 1024 | 75.9 | - | - | - | - | 1559.0 | - | 76.3 | - | - | - | - | 66.5 | - | 71.6 | 76.1 | 52.7 | 43.8 | 84.3 | 77.1 | 62.6 | 80.1 | 86.6 |
| FUSION 8B (ours) | 780 | 80.5 | 74.9 | 59.5 | 89.3 | 60.0 | 1592.3 | 396.1 | 77.2 | 52.6 | 86.9 | 52.4 | 46.0 | 65.2 | 78.7 | 78.7 | 80.4 | 56.6 | 43.1 | 89.2 | 77.3 | 63.8 | 80.3 | 78.6 |
| FUSION-X 8B (ours) | 620 | 82.0 | 76.2 | 62.9 | 88.8 | 60.0 | 1607.5 | 337.2 | 78.2 | 51.4 | 88.0 | 52.7 | 44.7 | 66.1 | 79.2 | 79.9 | 81.4 | 59.4 | 42.2 | 90.3 | 74.7 | 66.6 | 79.8 | 77.8 |
| FUSION-L 8B (ours) | 308 | 80.0 | 73.6 | 59.9 | 88.5 | 57.3 | 1601.7 | 338.9 | 75.9 | 46.7 | 82.1 | 49.3 | 42.3 | 65.1 | 78.2 | 76.7 | 79.2 | 55.2 | 41.8 | 88.3 | 72.8 | 59.5 | 73.0 | 66.0 |

With only 630 vision tokens, FUSION-X outperforms Cambrian-1 and Florence-VL, matches LLaVA-OneVision, and comes close to top models such as InternVL2 and Qwen2VL. Even at 300 vision tokens, FUSION-L retains about 95% of FUSION-X's performance and stays on par with Florence-VL.
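
The "about 95%" figure can be sanity-checked directly from the table: averaging the per-benchmark ratio of the FUSION-L 3B row to the FUSION-X 3B row over the 0-100-scale columns (the MME_P/MME_C totals are on a different scale and are skipped) lands at roughly 95%. A quick back-of-the-envelope script:

```python
# Rough retention check from the 3B rows of the table above
# (0-100-scale benchmarks only; MME_P and MME_C are skipped).
fusion_x = {"MMB_EN": 80.3, "MMB_CN": 74.8, "VizWiz": 66.1, "POPE": 88.7,
            "MM-Vet": 60.3, "Seed-Image": 75.3, "HallB": 51.9,
            "LLaVA_W": 85.2, "MMStar": 50.9, "MME-RW": 41.7, "RWQA": 63.7,
            "CV-Bench": 78.3, "MMVP": 78.1, "AI2D": 79.2, "MathVista": 54.9,
            "MMMU": 44.2, "SQA": 87.3, "TextVQA": 73.9, "OCRBench": 63.7,
            "ChartQA": 75.8, "DocVQA": 71.1}
fusion_l = {"MMB_EN": 77.6, "MMB_CN": 70.8, "VizWiz": 65.3, "POPE": 88.3,
            "MM-Vet": 56.7, "Seed-Image": 74.1, "HallB": 48.7,
            "LLaVA_W": 77.6, "MMStar": 44.7, "MME-RW": 39.5, "RWQA": 61.8,
            "CV-Bench": 76.2, "MMVP": 77.0, "AI2D": 77.3, "MathVista": 48.6,
            "MMMU": 43.4, "SQA": 85.6, "TextVQA": 71.4, "OCRBench": 56.9,
            "ChartQA": 67.7, "DocVQA": 63.5}
ratios = [fusion_l[k] / fusion_x[k] for k in fusion_x]
print(f"mean retention: {sum(ratios) / len(ratios):.1%}")  # ≈ 95%
```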

Notably, FUSION-X 3B achieves the highest MMBench (EN) score among models under 4B, surpassing even Qwen2.5VL 3B!
