Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space • arXiv:2505.13308 • Published May 19, 2025
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts • arXiv:2503.22952 • Published Mar 29, 2025
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens • arXiv:2502.18890 • Published Feb 26, 2025
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions • arXiv:2305.18756 • Published May 30, 2023
Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation • arXiv:2210.12460 • Published Oct 22, 2022
LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding • arXiv:2402.16050 • Published Feb 25, 2024
Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training • arXiv:2305.18760 • Published May 30, 2023
LongViTU: Instruction Tuning for Long-Form Video Understanding • arXiv:2501.05037 • Published Jan 9, 2025
HawkEye: Training Video-Text LLMs for Grounding Text in Videos • arXiv:2403.10228 • Published Mar 15, 2024
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges • arXiv:2409.01071 • Published Sep 2, 2024
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning • arXiv:2408.02210 • Published Aug 5, 2024
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models • arXiv:2406.16338 • Published Jun 24, 2024
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering • arXiv:2401.03901 • Published Jan 8, 2024