MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection Paper • 2404.04910 • Published Apr 7, 2024
DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models Paper • 2503.04240 • Published Mar 6
Science-T2I: Addressing Scientific Illusions in Image Synthesis Paper • 2504.13129 • Published Apr 17 • 3
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark Paper • 2504.14693 • Published Apr 20
EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments Paper • 2503.08604 • Published Mar 11
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model Paper • 2505.23606 • Published May 29 • 14
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? Paper • 2506.11928 • Published 17 days ago • 22
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? Paper • 2506.11928 • Published 17 days ago • 22
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis Paper • 2505.10046 • Published May 15 • 9
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published May 14 • 94
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action Paper • 2505.01583 • Published May 2 • 9
REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers Paper • 2504.10483 • Published Apr 14 • 21
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think Paper • 2502.20172 • Published Feb 27 • 28
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training Paper • 2501.17161 • Published Jan 28 • 123
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps Paper • 2501.09732 • Published Jan 16 • 72
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding Paper • 2501.05452 • Published Jan 9 • 15
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper • 2412.14171 • Published Dec 18, 2024 • 24