oguzhanercan's Collections: Image-Text Alignment
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
Paper
• 2502.05178
• Published
• 10
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Paper
• 2502.14846
• Published
• 14
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Paper
• 2502.14786
• Published
• 158
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Paper
• 2504.00557
• Published
• 15
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
Paper
• 2504.07934
• Published
• 21
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Paper
• 2505.02471
• Published
• 15
FG-CLIP: Fine-Grained Visual and Textual Alignment
Paper
• 2505.05071
• Published
• 18
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Paper
• 2506.15681
• Published
• 42
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
Paper
• 2507.04590
• Published
• 17
MetaCLIP 2: A Worldwide Scaling Recipe
Paper
• 2507.22062
• Published
• 37
Adapting Vision-Language Models Without Labels: A Comprehensive Survey
Paper
• 2508.05547
• Published
• 11
Concept-Aware Batch Sampling Improves Language-Image Pretraining
Paper
• 2511.20643
• Published
• 3