MoCha: Towards Movie-Grade Talking Character Synthesis Paper • 2503.23307 • Published 19 days ago • 121
LEGION: Learning to Ground and Explain for Synthetic Image Detection Paper • 2503.15264 • Published 30 days ago • 20
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning Paper • 2503.10291 • Published Mar 13 • 34
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning Paper • 2503.07572 • Published Mar 10 • 41
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published Mar 13 • 22
ABC: Achieving Better Control of Multimodal Embeddings using VLMs Paper • 2503.00329 • Published Mar 1 • 18
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment Paper • 2502.04328 • Published Feb 6 • 30
Running 548 548 Scaling test-time compute 📈 Enhance math problem solving by scaling test-time compute
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models Paper • 2412.03548 • Published Dec 4, 2024 • 17
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024 • 48
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models Paper • 2410.17637 • Published Oct 23, 2024 • 37
AutoTrain: No-code training for state-of-the-art models Paper • 2410.15735 • Published Oct 21, 2024 • 60
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution Paper • 2410.16256 • Published Oct 21, 2024 • 61
Harnessing Webpage UIs for Text-Rich Visual Understanding Paper • 2410.13824 • Published Oct 17, 2024 • 32