DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models • Paper • 2505.24025 • Published 16 days ago • 27
Describe Anything • Collection • Multimodal Large Language Models for Detailed Localized Image and Video Captioning • 7 items • Updated 3 days ago • 51
GameFactory: Creating New Games with Generative Interactive Videos • Paper • 2501.08325 • Published Jan 14 • 66
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens • Paper • 2501.07730 • Published Jan 13 • 17