Learning and Leveraging World Models in Visual Representation Learning • arXiv:2403.00504 • Published Mar 1, 2024 • 31 upvotes
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies • arXiv:2403.01422 • Published Mar 3, 2024 • 26 upvotes
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models • arXiv:2403.05438 • Published Mar 8, 2024 • 18 upvotes
VideoMamba: State Space Model for Efficient Video Understanding • arXiv:2403.06977 • Published Mar 11, 2024 • 27 upvotes
Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts • arXiv:2403.08268 • Published Mar 13, 2024 • 15 upvotes
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding • arXiv:2403.09626 • Published Mar 14, 2024 • 13 upvotes
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation • arXiv:2403.12962 • Published Mar 19, 2024 • 7 upvotes
GaussianFlow: Splatting Gaussian Dynamics for 4D Content Creation • arXiv:2403.12365 • Published Mar 19, 2024 • 10 upvotes
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework • arXiv:2403.13248 • Published Mar 20, 2024 • 77 upvotes
Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition • arXiv:2403.14148 • Published Mar 21, 2024 • 18 upvotes
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text • arXiv:2403.14773 • Published Mar 21, 2024 • 10 upvotes
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding • arXiv:2403.15377 • Published Mar 22, 2024 • 22 upvotes
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models • arXiv:2403.17005 • Published Mar 25, 2024 • 13 upvotes
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators • arXiv:2404.05014 • Published Apr 7, 2024 • 53 upvotes
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding • arXiv:2404.05726 • Published Apr 8, 2024 • 20 upvotes
PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations • arXiv:2404.04421 • Published Apr 5, 2024 • 16 upvotes
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model • arXiv:2404.09967 • Published Apr 15, 2024 • 20 upvotes
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning • arXiv:2404.16994 • Published Apr 25, 2024 • 35 upvotes
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model • arXiv:2404.19759 • Published Apr 30, 2024 • 24 upvotes
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models • arXiv:2403.06098 • Published Mar 10, 2024 • 15 upvotes
iVideoGPT: Interactive VideoGPTs are Scalable World Models • arXiv:2405.15223 • Published May 24, 2024 • 12 upvotes
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model • arXiv:2405.20222 • Published May 30, 2024 • 10 upvotes
I4VGen: Image as Stepping Stone for Text-to-Video Generation • arXiv:2406.02230 • Published Jun 4, 2024 • 16 upvotes
MotionClone: Training-Free Motion Cloning for Controllable Video Generation • arXiv:2406.05338 • Published Jun 8, 2024 • 39 upvotes
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing • arXiv:2406.06523 • Published Jun 10, 2024 • 50 upvotes
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs • arXiv:2406.07476 • Published Jun 11, 2024 • 32 upvotes
Hierarchical Patch Diffusion Models for High-Resolution Video Generation • arXiv:2406.07792 • Published Jun 12, 2024 • 13 upvotes
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation • arXiv:2406.07686 • Published Jun 11, 2024 • 14 upvotes
Vivid-ZOO: Multi-View Video Generation with Diffusion Model • arXiv:2406.08659 • Published Jun 12, 2024 • 8 upvotes
VoCo-LLaMA: Towards Vision Compression with Large Language Models • arXiv:2406.12275 • Published Jun 18, 2024 • 29 upvotes
DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models • arXiv:2407.01519 • Published Jul 1, 2024 • 22 upvotes
Consistency Flow Matching: Defining Straight Flows with Velocity Consistency • arXiv:2407.02398 • Published Jul 2, 2024 • 14 upvotes
VIMI: Grounding Video Generation through Multi-modal Instruction • arXiv:2407.06304 • Published Jul 8, 2024 • 9 upvotes
VEnhancer: Generative Space-Time Enhancement for Video Generation • arXiv:2407.07667 • Published Jul 10, 2024 • 13 upvotes
SEED-Story: Multimodal Long Story Generation with Large Language Model • arXiv:2407.08683 • Published Jul 11, 2024 • 22 upvotes
Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models • arXiv:2407.08701 • Published Jul 11, 2024 • 10 upvotes
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models • arXiv:2407.15841 • Published Jul 22, 2024 • 39 upvotes
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence • arXiv:2407.16655 • Published Jul 23, 2024 • 28 upvotes
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention • arXiv:2407.19918 • Published Jul 29, 2024 • 48 upvotes
Tora: Trajectory-oriented Diffusion Transformer for Video Generation • arXiv:2407.21705 • Published Jul 31, 2024 • 25 upvotes
UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model • arXiv:2408.00762 • Published Aug 1, 2024 • 9 upvotes