Unmasked Teacher: Towards Training-Efficient Video Foundation Models Paper • 2303.16058 • Published Mar 28, 2023
Harvest Video Foundation Models via Efficient Post-Pretraining Paper • 2310.19554 • Published Oct 30, 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark Paper • 2311.17005 • Published Nov 28, 2023 • 2
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks Paper • 2401.14159 • Published Jan 25, 2024 • 3
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning Paper • 2201.04676 • Published Jan 12, 2022
UniFormer: Unifying Convolution and Self-attention for Visual Recognition Paper • 2201.09450 • Published Jan 24, 2022
You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction Paper • 2205.14871 • Published May 30, 2022
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer Paper • 2211.09552 • Published Nov 17, 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning Paper • 2212.03191 • Published Dec 6, 2022
MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration Paper • 2408.10605 • Published Aug 20, 2024 • 1
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning Paper • 2410.19702 • Published Oct 25, 2024
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling Paper • 2501.00574 • Published Dec 31, 2024 • 6
Make Your Training Flexible: Towards Deployment-Efficient Video Models Paper • 2503.14237 • Published Mar 18, 2025 • 5
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment Paper • 2412.19326 • Published Dec 26, 2024 • 18
Causal Diffusion Transformers for Generative Modeling Paper • 2412.12095 • Published Dec 16, 2024 • 23
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel Paper • 2412.08467 • Published Dec 11, 2024 • 6