COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Abstract
Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This shortfall may stem in part from the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset that explicitly controls the compositional complexity of its examples. COMPACT's data allows MLLMs to train on combinations of atomic capabilities and thereby learn complex capabilities more efficiently. Across all benchmarks, COMPACT matches the performance of the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several benchmarks, especially those involving complex multi-capability tasks. For example, on particularly complex questions that require four or more atomic capabilities, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet compared to the full-scale VIT. COMPACT offers a scalable, data-efficient recipe for visual compositional tuning that improves performance on complex vision-language tasks.
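The abstract describes generating training examples whose compositional complexity, i.e. the number of atomic visual capabilities a question requires, is explicitly controlled. The sketch below illustrates that idea only: the capability list, function names, and prompt wording are assumptions for illustration, not the paper's actual data-generation pipeline.

```python
import random

# Hypothetical atomic visual capabilities; the actual capability taxonomy
# used by COMPACT is defined in the paper, not here.
ATOMIC_CAPABILITIES = [
    "object recognition",
    "counting",
    "spatial relationship",
    "attribute recognition",
    "text reading (OCR)",
]

def sample_composition(k: int, rng: random.Random) -> tuple:
    """Sample a combination of k distinct atomic capabilities."""
    return tuple(rng.sample(ATOMIC_CAPABILITIES, k))

def build_generation_prompt(capabilities: tuple) -> str:
    """Build a prompt asking a generator model to write one question
    that requires all of the listed capabilities about a given image."""
    listed = ", ".join(capabilities)
    return (
        "Write a single question about the image that can only be answered "
        f"by jointly using these capabilities: {listed}. "
        "Then provide the answer."
    )

def make_training_schedule(max_k: int, samples_per_k: int, seed: int = 0) -> list:
    """Enumerate generation prompts with explicitly controlled
    compositional complexity k (k = 1 corresponds to an atomic example)."""
    rng = random.Random(seed)
    schedule = []
    for k in range(1, max_k + 1):
        for _ in range(samples_per_k):
            caps = sample_composition(k, rng)
            schedule.append(
                {"k": k, "capabilities": caps, "prompt": build_generation_prompt(caps)}
            )
    return schedule

if __name__ == "__main__":
    for item in make_training_schedule(max_k=3, samples_per_k=2):
        print(item["k"], item["capabilities"])
```

Under this reading, the key design choice is that complexity k is a sampling parameter rather than an emergent property of the data, which is what allows a small, deliberately composed dataset to cover multi-capability questions that large VIT corpora mostly leave implicit.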
Community
The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs (2025)
- Decoupled Global-Local Alignment for Improving Compositional Understanding (2025)
- MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems (2025)
- Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data (2025)
- HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding (2025)
- Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding (2025)
- The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer (2025)