kaizuberbuehler's Collections
Vision Language Models
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 27
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 31
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 32
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • Published • 31
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper • 2404.05719 • Published • 83
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Paper • 2404.05726 • Published • 23
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Paper • 2404.03413 • Published • 29
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Paper • 2311.16502 • Published • 35
Kosmos-2: Grounding Multimodal Large Language Models to the World
Paper • 2306.14824 • Published • 34
CogVLM: Visual Expert for Pretrained Language Models
Paper • 2311.03079 • Published • 27
Pegasus-v1 Technical Report
Paper • 2404.14687 • Published • 33
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 60
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Paper • 2404.16375 • Published • 18
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 9
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • 2404.16994 • Published • 37
What matters when building vision-language models?
Paper • 2405.02246 • Published • 104
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Paper • 2405.21075 • Published • 24
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Paper • 2406.04325 • Published • 76
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 52
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 40
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Paper • 2406.09411 • Published • 20
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper • 2406.09403 • Published • 22
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper • 2406.08707 • Published • 16
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Paper • 2406.11833 • Published • 64
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper • 2406.11816 • Published • 25
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Paper • 2406.09961 • Published • 56
Needle In A Multimodal Haystack
Paper • 2406.07230 • Published • 55
Wolf: Captioning Everything with a World Summarization Framework
Paper • 2407.18908 • Published • 33
Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model
Paper • 2408.00754 • Published • 25
OmniParser for Pure Vision Based GUI Agent
Paper • 2408.00203 • Published • 26
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper • 2408.10188 • Published • 53
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 52
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Paper • 2409.01071 • Published • 28
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 78
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 75
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Paper • 2409.18125 • Published • 35
OmniBench: Towards The Future of Universal Omni-Language Models
Paper • 2409.15272 • Published • 31
Progressive Multimodal Reasoning via Active Retrieval
Paper • 2412.14835 • Published • 74
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper • 2412.08737 • Published • 54
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper • 2412.01169 • Published • 13
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 135
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper • 2411.17465 • Published • 87
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Paper • 2411.14794 • Published • 13
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 81
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 124
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Paper • 2411.07461 • Published • 23
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Paper • 2411.06176 • Published • 46
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 52
Analyzing The Language of Visual Tokens
Paper • 2411.05001 • Published • 25
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation
Paper • 2411.04999 • Published • 18
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Paper • 2501.01904 • Published • 34
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Paper • 2501.08326 • Published • 35
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper • 2501.06186 • Published • 66
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Paper • 2501.05510 • Published • 44
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Paper • 2501.04003 • Published • 27
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Paper • 2501.09012 • Published • 10
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Paper • 2501.09755 • Published • 37
Do generative video models learn physical principles from watching videos?
Paper • 2501.09038 • Published • 35
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper • 2501.09747 • Published • 24
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper • 2501.12380 • Published • 86
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Paper • 2501.12368 • Published • 46
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Paper • 2501.13106 • Published • 91
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Paper • 2501.13826 • Published • 26
Temporal Preference Optimization for Long-Form Video Understanding
Paper • 2501.13919 • Published • 22
PixelWorld: Towards Perceiving Everything as Pixels
Paper • 2501.19339 • Published • 17
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
Paper • 2502.04328 • Published • 30
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Paper • 2502.03738 • Published • 11
Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Paper • 2502.07617 • Published • 29
CoS: Chain-of-Shot Prompting for Long Video Understanding
Paper • 2502.06428 • Published • 10
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Paper • 2502.09560 • Published • 36
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Paper • 2502.09621 • Published • 28
Exploring the Potential of Encoder-free Architectures in 3D LMMs
Paper • 2502.09620 • Published • 26
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Paper • 2502.08468 • Published • 13
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Paper • 2502.09696 • Published • 44
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
Paper • 2502.14282 • Published • 20
Qwen2.5-VL Technical Report
Paper • 2502.13923 • Published • 182
Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper • 2502.12900 • Published • 85
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Paper • 2502.16033 • Published • 18
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Paper • 2502.18906 • Published • 12
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper • 2503.04130 • Published • 94
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Paper • 2503.01743 • Published • 85
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper • 2503.01785 • Published • 78
EgoLife: Towards Egocentric Life Assistant
Paper • 2503.03803 • Published • 42
Unified Video Action Model
Paper • 2503.00200 • Published • 14
Unified Reward Model for Multimodal Understanding and Generation
Paper • 2503.05236 • Published • 121
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Paper • 2503.07536 • Published • 85
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
Paper • 2503.07365 • Published • 60
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Paper • 2503.05132 • Published • 57
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
Paper • 2503.10480 • Published • 53
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper • 2503.10291 • Published • 36
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning
Paper • 2503.05379 • Published • 36
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Paper • 2503.06749 • Published • 29
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Paper • 2503.07608 • Published • 23
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Paper • 2503.10615 • Published • 17
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Paper • 2503.08525 • Published • 17
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
Paper • 2503.06492 • Published • 11
CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
Paper • 2503.10391 • Published • 11