CoMP: Continual Multimodal Pre-training for Vision Foundation Models Paper • 2503.18931 • Published Mar 24 • 30
Inst-IT Models Collection A series of LMMs fine-tuned with the Inst-IT Dataset, skilled in fine-grained image/video understanding at the instance level. • 2 items • Updated Mar 17
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning Paper • 2412.03565 • Published Dec 4, 2024 • 11
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding Paper • 2312.00081 • Published Nov 30, 2023 • 2
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation Paper • 2311.14671 • Published Nov 24, 2023
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs Paper • 2406.04334 • Published Jun 6, 2024
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning Paper • 2311.07574 • Published Nov 13, 2023 • 16