SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training Paper • 2501.17161 • Published 1 day ago
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation Paper • 2501.16764 • Published 2 days ago
Are Vision Language Models Texture or Shape Biased and Can We Steer Them? Paper • 2403.09193 • Published Mar 14, 2024
Qwen2.5-VL Collection • Vision-language model series based on Qwen2.5 • 3 items • Updated 3 days ago
IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models Paper • 2501.13920 • Published 7 days ago
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Paper • 2501.13826 • Published 7 days ago
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Paper • 2501.13926 • Published 7 days ago
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published 8 days ago
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper • 2501.12948 • Published 8 days ago
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces Paper • 2501.12909 • Published 8 days ago
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments Paper • 2501.10893 • Published 11 days ago
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model Paper • 2501.12368 • Published 9 days ago
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks Paper • 2501.11733 • Published 9 days ago
UI-TARS: Pioneering Automated GUI Interaction with Native Agents Paper • 2501.12326 • Published 9 days ago