view article Article Vision Language Models (Better, Faster, Stronger) By merve and 4 others • May 12 • 447
On Path to Multimodal Generalist: General-Level and General-Bench Paper • 2505.04620 • Published May 7 • 80
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models Paper • 2504.13122 • Published Apr 17 • 21
Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation Paper • 2503.24379 • Published Mar 31 • 77
A Survey on Benchmarks of Multimodal Large Language Models Paper • 2408.08632 • Published Aug 16, 2024 • 2
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis Paper • 2408.09481 • Published Aug 18, 2024 • 1