Submitted by guyuchao 68 Long-Context Autoregressive Video Modeling with Next-Frame Prediction · 3 authors 2
Submitted by phillipinseoul 30 Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing · 4 authors 4
Submitted by Row11n 29 CoMP: Continual Multimodal Pre-training for Vision Foundation Models · 5 authors 1
Submitted by HongchengGao 28 Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation · 9 authors 4
Submitted by akhaliq 24 Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking · 8 authors 5
Submitted by zichenwen 18 Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation · 10 authors 3
Submitted by richardxp888 16 MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding · 7 authors 2
Submitted by akhaliq 14 ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning · 12 authors 2
Submitted by 3587jjh 9 Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models · 4 authors 1
Submitted by akhaliq 8 WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation · 8 authors 2
Submitted by BestWishYsh 6 FullDiT: Multi-Task Video Generative Foundation Model with Full Attention · 9 authors 2
Submitted by gym890 6 DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis · 7 authors 2
Submitted by akhaliq 6 FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement · 7 authors 2
Submitted by qth 5 Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation · 9 authors 2
Submitted by Ningyu 5 LookAhead Tuning: Safer Language Models via Partial Answer Previews · 10 authors 3
Submitted by haoyuhsu 5 PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos · 6 authors 2
Submitted by pranamanam 4 Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation · 4 authors 2
Submitted by wish44165 4 Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID · 1 author 5
Submitted by zhehuderek 4 When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making · 3 authors 2
Submitted by DmitryRyumin 3 FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images · 13 authors 2
Submitted by mwmathis 3 LLaVAction: evaluating and training multi-modal large language models for action recognition · 4 authors 2
Submitted by LUC1O 3 OpenCity3D: What do Vision-Language Models know about Urban Environments? · 5 authors 2
Submitted by wangyi111 3 Towards a Unified Copernicus Foundation Model for Earth Vision · 11 authors 3
Submitted by rishitdagli 2 Can Vision-Language Models Answer Face to Face Questions in the Real-World? · 6 authors 2
Submitted by lx865712528 2 Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling · 4 authors 2
Submitted by CharlesChen2023 2 Frequency Dynamic Convolution for Dense Image Prediction · 5 authors 2
Submitted by stojnvla 1 LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation · 4 authors 2
Submitted by ikodoh 1 ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models · 7 authors 1
Submitted by yaraalaa0 - Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images · 2 authors 2