video - a zzfive Collection

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Paper • 2401.09985 • Published Jan 18, 2024 • 18

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Paper • 2401.09962 • Published Jan 18, 2024 • 9

Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Paper • 2401.10404 • Published Jan 18, 2024 • 11

ActAnywhere: Subject-Aware Video Background Generation

Paper • 2401.10822 • Published Jan 19, 2024 • 13

Lumiere: A Space-Time Diffusion Model for Video Generation

Paper • 2401.12945 • Published Jan 23, 2024 • 86

AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning

Paper • 2402.00769 • Published Feb 1, 2024 • 23

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Paper • 2402.17177 • Published Feb 27, 2024 • 89

Sora Generates Videos with Stunning Geometrical Consistency

Paper • 2402.17403 • Published Feb 27, 2024 • 18

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Paper • 2402.17723 • Published Feb 27, 2024 • 16

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Paper • 2402.19479 • Published Feb 29, 2024 • 36

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Paper • 2403.03100 • Published Mar 5, 2024 • 39

Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation

Paper • 2403.02827 • Published Mar 5, 2024 • 9

Video Editing via Factorized Diffusion Distillation

Paper • 2403.09334 • Published Mar 14, 2024 • 24

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Paper • 2403.09626 • Published Mar 14, 2024 • 16

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Paper • 2108.01073 • Published Aug 2, 2021 • 8

AnimateDiff-Lightning: Cross-Model Diffusion Distillation

Paper • 2403.12706 • Published Mar 19, 2024 • 18

Streaming Dense Video Captioning

Paper • 2404.01297 • Published Apr 1, 2024 • 13

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Paper • 2404.09956 • Published Apr 15, 2024 • 12

MotionMaster: Training-free Camera Motion Transfer For Video Generation

Paper • 2404.15789 • Published Apr 24, 2024 • 13

LLM-AD: Large Language Model based Audio Description System

Paper • 2405.00983 • Published May 2, 2024 • 23

FIFO-Diffusion: Generating Infinite Videos from Text without Training

Paper • 2405.11473 • Published May 19, 2024 • 58

ReVideo: Remake a Video with Motion and Content Control

Paper • 2405.13865 • Published May 22, 2024 • 26

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Paper • 2405.14598 • Published May 23, 2024 • 14

Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Paper • 2405.15216 • Published May 24, 2024 • 17

I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

Paper • 2405.16537 • Published May 26, 2024 • 18

Looking Backward: Streaming Video-to-Video Translation with Feature Banks

Paper • 2405.15757 • Published May 24, 2024 • 16

Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer

Paper • 2405.17405 • Published May 27, 2024 • 17

Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control

Paper • 2405.17414 • Published May 27, 2024 • 12

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Paper • 2405.18386 • Published May 28, 2024 • 23

T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

Paper • 2405.18750 • Published May 29, 2024 • 22

EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Paper • 2405.18991 • Published May 29, 2024 • 12

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Paper • 2405.20222 • Published May 30, 2024 • 11

DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Paper • 2405.19707 • Published May 30, 2024 • 8

Learning Temporally Consistent Video Depth from Video Diffusion Priors

Paper • 2406.01493 • Published Jun 3, 2024 • 23

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Paper • 2406.00908 • Published Jun 3, 2024 • 12

Searching Priors Makes Text-to-Video Synthesis Better

Paper • 2406.03215 • Published Jun 5, 2024 • 14

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Paper • 2406.04325 • Published Jun 6, 2024 • 76

SF-V: Single Forward Video Generation Model

Paper • 2406.04324 • Published Jun 6, 2024 • 26

VideoTetris: Towards Compositional Text-to-Video Generation

Paper • 2406.04277 • Published Jun 6, 2024 • 26

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Paper • 2406.05338 • Published Jun 8, 2024 • 42

NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing

Paper • 2406.06523 • Published Jun 10, 2024 • 54

Hierarchical Patch Diffusion Models for High-Resolution Video Generation

Paper • 2406.07792 • Published Jun 12, 2024 • 16

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

Paper • 2406.07686 • Published Jun 11, 2024 • 17

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

Paper • 2406.08656 • Published Jun 12, 2024 • 8

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability,Reproducibility, and Practicality

Paper • 2406.08845 • Published Jun 13, 2024 • 9

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Paper • 2406.14130 • Published Jun 20, 2024 • 10

MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Paper • 2406.15252 • Published Jun 21, 2024 • 18

Video-Infinity: Distributed Long Video Generation

Paper • 2406.16260 • Published Jun 24, 2024 • 30

DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

Paper • 2407.01519 • Published Jul 1, 2024 • 25

SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Paper • 2407.00367 • Published Jun 29, 2024 • 10

VIMI: Grounding Video Generation through Multi-modal Instruction

Paper • 2407.06304 • Published Jul 8, 2024 • 10

VEnhancer: Generative Space-Time Enhancement for Video Generation

Paper • 2407.07667 • Published Jul 10, 2024 • 16

Still-Moving: Customized Video Generation without Customized Video Data

Paper • 2407.08674 • Published Jul 11, 2024 • 13

CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation

Paper • 2407.06188 • Published Jul 8, 2024 • 2

TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models

Paper • 2407.09012 • Published Jul 12, 2024 • 10

Video Occupancy Models

Paper • 2407.09533 • Published Jun 25, 2024 • 8

Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Paper • 2407.10285 • Published Jul 14, 2024 • 5

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Paper • 2407.12781 • Published Jul 17, 2024 • 13

Towards Understanding Unsafe Video Generation

Paper • 2407.12581 • Published Jul 17, 2024

Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Paper • 2407.13759 • Published Jul 18, 2024 • 18

Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models

Paper • 2407.15642 • Published Jul 22, 2024 • 11

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence

Paper • 2407.16655 • Published Jul 23, 2024 • 31

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

Paper • 2407.14505 • Published Jul 19, 2024 • 27

FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

Paper • 2407.19918 • Published Jul 29, 2024 • 52

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Paper • 2407.21705 • Published Jul 31, 2024 • 28

Fine-gained Zero-shot Video Sampling

Paper • 2407.21475 • Published Jul 31, 2024 • 6

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Paper • 2408.00458 • Published Aug 1, 2024 • 13

UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Paper • 2408.00762 • Published Aug 1, 2024 • 11

VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Paper • 2408.02629 • Published Aug 5, 2024 • 15

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Paper • 2408.03284 • Published Aug 6, 2024 • 11

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Paper • 2408.04631 • Published Aug 8, 2024 • 10

Kalman-Inspired Feature Propagation for Video Face Super-Resolution

Paper • 2408.05205 • Published Aug 9, 2024 • 10

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Paper • 2408.06072 • Published Aug 12, 2024 • 40

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Paper • 2408.08189 • Published Aug 15, 2024 • 17

Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data

Paper • 2408.10119 • Published Aug 19, 2024 • 17

TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Paper • 2408.11318 • Published Aug 21, 2024 • 57

TrackGo: A Flexible and Efficient Method for Controllable Video Generation

Paper • 2408.11475 • Published Aug 21, 2024 • 18

Real-Time Video Generation with Pyramid Attention Broadcast

Paper • 2408.12588 • Published Aug 22, 2024 • 17

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Paper • 2408.13239 • Published Aug 23, 2024 • 12

Training-free Long Video Generation with Chain of Diffusion Model Experts

Paper • 2408.13423 • Published Aug 24, 2024 • 24

TVG: A Training-free Transition Video Generation Method with Diffusion Models

Paper • 2408.13413 • Published Aug 24, 2024 • 14

Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

Paper • 2408.15239 • Published Aug 27, 2024 • 31

OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

Paper • 2409.01199 • Published Sep 2, 2024 • 14

Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

Paper • 2409.01055 • Published Sep 2, 2024 • 6

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Paper • 2409.02634 • Published Sep 4, 2024 • 98

OSV: One Step is Enough for High-Quality Image to Video Generation

Paper • 2409.11367 • Published Sep 17, 2024 • 14

Towards Diverse and Efficient Audio Captioning via Diffusion Models

Paper • 2409.09401 • Published Sep 14, 2024 • 7

LVCD: Reference-based Lineart Video Colorization with Diffusion Models

Paper • 2409.12960 • Published Sep 19, 2024 • 25

Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation

Paper • 2409.12532 • Published Sep 19, 2024 • 5

MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Paper • 2409.16160 • Published Sep 24, 2024 • 34

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

Paper • 2409.18964 • Published Sep 27, 2024 • 27

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Paper • 2410.04364 • Published Oct 6, 2024 • 30

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Paper • 2410.03051 • Published Oct 4, 2024 • 6

Pyramidal Flow Matching for Efficient Video Generative Modeling

Paper • 2410.05954 • Published Oct 8, 2024 • 41

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

Paper • 2410.05677 • Published Oct 8, 2024 • 14

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Paper • 2410.02757 • Published Oct 3, 2024 • 37

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Paper • 2410.10306 • Published Oct 14, 2024 • 57

Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Paper • 2410.10774 • Published Oct 14, 2024 • 26

LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Paper • 2410.10816 • Published Oct 14, 2024 • 21

Movie Gen: A Cast of Media Foundation Models

Paper • 2410.13720 • Published Oct 17, 2024 • 99

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

Paper • 2410.13830 • Published Oct 17, 2024 • 25

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Paper • 2410.17434 • Published Oct 22, 2024 • 30

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

Paper • 2410.19355 • Published Oct 25, 2024 • 23

MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Paper • 2410.20280 • Published Oct 26, 2024 • 23

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

Paper • 2410.23277 • Published Oct 30, 2024 • 9

Fashion-VDM: Video Diffusion Model for Virtual Try-On

Paper • 2411.00225 • Published Oct 31, 2024 • 11

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Paper • 2411.02397 • Published Nov 4, 2024 • 24

Motion Control for Enhanced Complex Action Video Generation

Paper • 2411.08328 • Published Nov 13, 2024 • 5

AnimateAnything: Consistent and Controllable Animation for Video Generation

Paper • 2411.10836 • Published Nov 16, 2024 • 25

StableV2V: Stablizing Shape Consistency in Video-to-Video Editing

Paper • 2411.11045 • Published Nov 17, 2024 • 11

FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

Paper • 2411.10818 • Published Nov 16, 2024 • 27

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

Paper • 2411.13503 • Published Nov 20, 2024 • 35

MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Paper • 2411.13807 • Published Nov 21, 2024 • 11

Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction

Paper • 2411.14762 • Published Nov 22, 2024 • 11

VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

Paper • 2411.15115 • Published Nov 22, 2024 • 9

DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

Paper • 2411.16657 • Published Nov 25, 2024 • 20

AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation

Paper • 2411.17383 • Published Nov 26, 2024 • 7

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Paper • 2411.17440 • Published Nov 26, 2024 • 38

Free^2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

Paper • 2411.17041 • Published Nov 26, 2024 • 13

Video Depth without Video Models

Paper • 2411.19189 • Published Nov 28, 2024 • 40

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Paper • 2411.19108 • Published Nov 28, 2024 • 20

Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling

Paper • 2411.18664 • Published Nov 27, 2024 • 24

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Paper • 2411.18673 • Published Nov 27, 2024 • 8

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Paper • 2412.00927 • Published Dec 1, 2024 • 29

Open-Sora Plan: Open-Source Large Video Generation Model

Paper • 2412.00131 • Published Nov 28, 2024 • 34

TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

Paper • 2411.18671 • Published Nov 27, 2024 • 20

Efficient Track Anything

Paper • 2411.18933 • Published Nov 28, 2024 • 17

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

Paper • 2411.17459 • Published Nov 26, 2024 • 11

Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Paper • 2412.01316 • Published Dec 2, 2024 • 9

VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

Paper • 2412.02259 • Published Dec 3, 2024 • 61

NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

Paper • 2412.03517 • Published Dec 4, 2024 • 19

Mimir: Improving Video Diffusion Models for Precise Text Understanding

Paper • 2412.03085 • Published Dec 4, 2024 • 12

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

Paper • 2412.04814 • Published Dec 6, 2024 • 48

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Paper • 2412.04440 • Published Dec 5, 2024 • 22

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Paper • 2412.05263 • Published Dec 6, 2024 • 11

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Paper • 2412.04432 • Published Dec 5, 2024 • 16

MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Paper • 2412.05355 • Published Dec 6, 2024 • 9

STIV: Scalable Text and Image Conditioned Video Generation

Paper • 2412.07730 • Published Dec 10, 2024 • 75

Mobile Video Diffusion

Paper • 2412.07583 • Published Dec 10, 2024 • 20

MoViE: Mobile Diffusion for Video Editing

Paper • 2412.06578 • Published Dec 9, 2024 • 18

Video Motion Transfer with Diffusion Transformers

Paper • 2412.07776 • Published Dec 10, 2024 • 17

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

Paper • 2412.07760 • Published Dec 10, 2024 • 56

StyleMaster: Stylize Your Video with Artistic Generation and Translation

Paper • 2412.07744 • Published Dec 10, 2024 • 20

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

Paper • 2412.06016 • Published Dec 8, 2024 • 20

DisPose: Disentangling Pose Guidance for Controllable Human Image Animation

Paper • 2412.09349 • Published Dec 12, 2024 • 8

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Paper • 2412.09283 • Published Dec 12, 2024 • 19

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Paper • 2412.09856 • Published Dec 13, 2024 • 10

VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping

Paper • 2412.11279 • Published Dec 15, 2024 • 12

SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

Paper • 2412.10533 • Published Dec 13, 2024 • 5

MIVE: New Design and Benchmark for Multi-Instance Video Editing

Paper • 2412.12877 • Published Dec 17, 2024 • 4

AniDoc: Animation Creation Made Easier

Paper • 2412.14173 • Published Dec 18, 2024 • 57

Autoregressive Video Generation without Vector Quantization

Paper • 2412.14169 • Published Dec 18, 2024 • 14

VidTok: A Versatile and Open-Source Video Tokenizer

Paper • 2412.13061 • Published Dec 17, 2024 • 8

Parallelized Autoregressive Visual Generation

Paper • 2412.15119 • Published Dec 19, 2024 • 54

TRecViT: A Recurrent Video Transformer

Paper • 2412.14294 • Published Dec 18, 2024 • 13

Large Motion Video Autoencoding with Cross-modal Video VAE

Paper • 2412.17805 • Published Dec 23, 2024 • 24

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Paper • 2412.18597 • Published Dec 24, 2024 • 19

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Paper • 2412.16153 • Published Dec 20, 2024 • 6

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Paper • 2412.19645 • Published Dec 27, 2024 • 13

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Paper • 2501.01427 • Published Jan 2 • 55

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Paper • 2501.00599 • Published Dec 31, 2024 • 48

LTX-Video: Realtime Video Latent Diffusion

Paper • 2501.00103 • Published Dec 30, 2024 • 47

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Paper • 2501.01423 • Published Jan 2 • 44

Unifying Specialized Visual Encoders for Video Language Models

Paper • 2501.01426 • Published Jan 2 • 21

SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

Paper • 2501.01320 • Published Jan 2 • 11

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Paper • 2412.21059 • Published Dec 30, 2024 • 18

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Paper • 2501.02976 • Published Jan 6 • 56

GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

Paper • 2501.02690 • Published Jan 5 • 17

Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

Paper • 2501.03059 • Published Jan 6 • 22

TransPixar: Advancing Text-to-Video Generation with Transparency

Paper • 2501.03006 • Published Jan 6 • 27

Ingredients: Blending Custom Photos with Video Diffusion Transformers

Paper • 2501.01790 • Published Jan 3 • 8

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7 • 47

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

Paper • 2501.03847 • Published Jan 7 • 23

Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers

Paper • 2501.03931 • Published Jan 7 • 15

An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9 • 42

VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published Jan 10 • 73

ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning

Paper • 2501.04698 • Published Jan 8 • 15

Multi-subject Open-set Personalization in Video Generation

Paper • 2501.06187 • Published Jan 10 • 14

VideoAuteur: Towards Long Narrative Video Generation

Paper • 2501.06173 • Published Jan 10 • 34

Diffusion Adversarial Post-Training for One-Step Video Generation

Paper • 2501.08316 • Published Jan 14 • 36

RepVideo: Rethinking Cross-Layer Representation for Video Generation

Paper • 2501.08994 • Published Jan 15 • 15

Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Paper • 2501.09019 • Published Jan 15 • 12

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Paper • 2501.09755 • Published Jan 16 • 37

X-Dyna: Expressive Dynamic Human Image Animation

Paper • 2501.10021 • Published Jan 17 • 14

GameFactory: Creating New Games with Generative Interactive Videos

Paper • 2501.08325 • Published Jan 14 • 68

Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

Paper • 2501.12375 • Published Jan 21 • 22

Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

Paper • 2501.08331 • Published Jan 14 • 20

EMO2: End-Effector Guided Audio-Driven Avatar Video Generation

Paper • 2501.10687 • Published Jan 18 • 14

Taming Teacher Forcing for Masked Autoregressive Video Generation

Paper • 2501.12389 • Published Jan 21 • 10

Improving Video Generation with Human Feedback

Paper • 2501.13918 • Published Jan 23 • 52

DiffuEraser: A Diffusion Model for Video Inpainting

Paper • 2501.10018 • Published Jan 17 • 15

EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion

Paper • 2501.13452 • Published Jan 23 • 8

VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Paper • 2502.02492 • Published Feb 4 • 67

DynVFX: Augmenting Real Videos with Dynamic Content

Paper • 2502.03621 • Published Feb 5 • 30

MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation

Paper • 2502.04299 • Published Feb 6 • 18

Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach

Paper • 2502.03639 • Published Feb 5 • 9

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Paper • 2502.05173 • Published Feb 7 • 66

Fast Video Generation with Sliding Tile Attention

Paper • 2502.04507 • Published Feb 6 • 52

Goku: Flow Based Video Generative Foundation Models

Paper • 2502.04896 • Published Feb 7 • 106

FlashVideo:Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

Paper • 2502.05179 • Published Feb 7 • 24

On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices

Paper • 2502.04363 • Published Feb 5 • 12

History-Guided Video Diffusion

Paper • 2502.06764 • Published Feb 10 • 12

Magic 1-For-1: Generating One Minute Video Clips within One Minute

Paper • 2502.07701 • Published Feb 11 • 36

Enhance-A-Video: Better Generated Video for Free

Paper • 2502.07508 • Published Feb 11 • 21

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Paper • 2502.07531 • Published Feb 11 • 13

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Paper • 2502.08590 • Published Feb 12 • 44

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

Paper • 2502.08639 • Published Feb 12 • 43

Next Block Prediction: Video Generation via Semi-Autoregressive Modeling

Paper • 2502.07737 • Published Feb 11 • 9

Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance

Paper • 2502.06145 • Published Feb 10 • 18

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Paper • 2502.10248 • Published Feb 14 • 56

Phantom: Subject-consistent video generation via cross-modal alignment

Paper • 2502.11079 • Published Feb 16 • 60

VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing

Paper • 2502.17258 • Published Feb 24 • 80

RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

Paper • 2502.15894 • Published Feb 21 • 20

UniTok: A Unified Tokenizer for Visual Generation and Understanding

Paper • 2502.20321 • Published Feb 27 • 30

Mobius: Text to Seamless Looping Video Generation via Latent Shift

Paper • 2502.20307 • Published Feb 27 • 19

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Paper • 2503.04606 • Published Mar 6 • 9

VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

Paper • 2503.05639 • Published Mar 7 • 24

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Paper • 2503.05638 • Published Mar 7 • 19

Automated Movie Generation via Multi-Agent CoT Planning

Paper • 2503.07314 • Published Mar 10 • 45

MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

Paper • 2503.05978 • Published Mar 7 • 36

Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling

Paper • 2503.08605 • Published Mar 11 • 27

TPDiff: Temporal Pyramid Video Diffusion Model

Paper • 2503.09566 • Published Mar 12 • 46

Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Paper • 2503.09151 • Published Mar 12 • 33

Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

Paper • 2503.09642 • Published Mar 12 • 19

Long Context Tuning for Video Generation

Paper • 2503.10589 • Published Mar 13 • 14

CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance

Paper • 2503.10391 • Published Mar 13 • 11

Large-scale Pre-training for Grounded Video Caption Generation

Paper • 2503.10781 • Published Mar 13 • 16

Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

Paper • 2503.09279 • Published Mar 12 • 5

DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

Paper • 2503.06053 • Published Mar 8 • 138

VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Paper • 2503.13444 • Published Mar 17 • 17

MTV-Inpaint: Multi-Task Long Video Inpainting

Paper • 2503.11412 • Published Mar 14 • 10

Long-Video Audio Synthesis with Multi-Agent Collaboration

Paper • 2503.10719 • Published Mar 13 • 9

WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation

Paper • 2503.08153 • Published Mar 11 • 3

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

Paper • 2503.14492 • Published Mar 18 • 20

FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

Paper • 2503.13265 • Published Mar 17 • 15

Concat-ID: Towards Universal Identity-Preserving Video Synthesis

Paper • 2503.14151 • Published Mar 18 • 10

Temporal Regularization Makes Your Video Generator Stronger

Paper • 2503.15417 • Published Mar 19 • 22

MusicInfuser: Making Video Diffusion Listen and Dance

Paper • 2503.14505 • Published Mar 18 • 11

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Paper • 2503.16421 • Published Mar 20 • 11

MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

Paper • 2503.12689 • Published Mar 16 • 5

Enabling Versatile Controls for Video Diffusion Models

Paper • 2503.16983 • Published Mar 21 • 15

Video-T1: Test-Time Scaling for Video Generation

Paper • 2503.18942 • Published Mar 24 • 91

CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models

Paper • 2503.18886 • Published Mar 24 • 22

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Paper • 2503.19325 • Published Mar 25 • 73

FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

Paper • 2503.19907 • Published Mar 25 • 8

Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Paper • 2503.19881 • Published Mar 25 • 6

Wan: Open and Advanced Large-Scale Video Generative Models

Paper • 2503.20314 • Published Mar 26 • 56

AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

Paper • 2503.19462 • Published Mar 25 • 10

Synthetic Video Enhances Physical Fidelity in Video Synthesis

Paper • 2503.20822 • Published Mar 26 • 16

SketchVideo: Sketch-based Video Generation and Editing

Paper • 2503.23284 • Published Mar 30 • 23

Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation

Paper • 2503.24379 • Published Mar 31 • 77

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Paper • 2503.23377 • Published Mar 30 • 57

SkyReels-A2: Compose Anything in Video Diffusion Transformers

Paper • 2504.02436 • Published Apr 3 • 39

One-Minute Video Generation with Test-Time Training

Paper • 2504.05298 • Published Apr 7 • 110

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

Paper • 2504.07083 • Published Apr 9 • 23

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

Paper • 2504.05541 • Published Apr 7 • 16

DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Paper • 2504.04010 • Published Apr 5 • 10

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Paper • 2504.08641 • Published Apr 11 • 7

NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

Paper • 2504.11427 • Published Apr 15 • 19

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Paper • 2504.13122 • Published Apr 17 • 21

SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation

Paper • 2504.14396 • Published Apr 19 • 28

Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Paper • 2504.14899 • Published Apr 21 • 21

Vidi: Large Multimodal Models for Video Understanding and Editing

Paper • 2504.15681 • Published Apr 22 • 15

RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

Paper • 2504.14977 • Published Apr 21 • 11

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

Paper • 2504.17343 • Published Apr 24 • 12

3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models

Paper • 2504.17414 • Published Apr 24 • 18

Towards Understanding Camera Motions in Any Video

Paper • 2504.15376 • Published Apr 21 • 159

Subject-driven Video Generation via Disentangled Identity and Motion

Paper • 2504.17816 • Published Apr 23 • 12

ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction

Paper • 2504.21855 • Published Apr 30 • 13

HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Paper • 2505.04512 • Published May 7 • 36

Scaling Image and Video Generation via Test-Time Evolutionary Search

Paper • 2505.17618 • Published May 23 • 42

Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

Paper • 2505.17561 • Published May 23 • 31

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Paper • 2505.20292 • Published May 26 • 54

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Paper • 2505.18875 • Published May 24 • 42

MotionPro: A Precise Motion Controller for Image-to-Video Generation

Paper • 2505.20287 • Published May 26 • 20

HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

Paper • 2505.20156 • Published May 26 • 2

MAGREF: Masked Guidance for Any-Reference Video Generation

Paper • 2505.23742 • Published May 29 • 10

ATI: Any Trajectory Instruction for Controllable Video Generation

Paper • 2505.22944 • Published May 28 • 7

Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

Paper • 2506.00996 • Published Jun 1 • 38

FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation

Paper • 2506.01144 • Published Jun 1 • 14

Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation

Paper • 2506.04225 • Published Jun 4 • 27

IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

Paper • 2506.03150 • Published Jun 3 • 21

LayerFlow: A Unified Model for Layer-aware Video Generation

Paper • 2506.04228 • Published Jun 4 • 13

SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Paper • 2506.05301 • Published Jun 5 • 55

FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Paper • 2506.05046 • Published Jun 5 • 2

PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Paper • 2506.07848 • Published Jun 9 • 4

Dynamic View Synthesis as an Inverse Problem

Paper • 2506.08004 • Published Jun 9 • 5

Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

Paper • 2506.07177 • Published Jun 8 • 22

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Paper • 2506.09113 • Published Jun 10 • 102

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

Paper • 2506.09350 • Published Jun 11 • 48

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

Paper • 2506.09984 • Published Jun 11 • 15

Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

Paper • 2506.09229 • Published Jun 10 • 5

Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy

Paper • 2506.22432 • Published Jun 27 • 13

VMoBA: Mixture-of-Block Attention for Video Diffusion Models

Paper • 2506.23858 • Published Jun 30 • 31

Radial Attention: O(nlog n) Sparse Attention with Energy Decay for Long Video Generation

Paper • 2506.19852 • Published Jun 24 • 41

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Paper • 2506.23552 • Published Jun 30 • 10

STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing

Paper • 2506.22868 • Published Jun 28 • 5

StreamDiT: Real-Time Streaming Text-to-Video Generation

Paper • 2507.03745 • Published Jul 4 • 29

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Paper • 2507.07982 • Published Jul 10 • 33

A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality

Paper • 2507.07202 • Published Jul 9 • 22

Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Paper • 2507.08801 • Published Jul 11 • 30

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

Paper • 2507.11336 • Published Jul 15 • 4

TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Paper • 2507.04984 • Published Jul 7 • 5

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Paper • 2507.15852 • Published Jul 21 • 38

TokensGen: Harnessing Condensed Tokens for Long Video Generation

Paper • 2507.15728 • Published Jul 21 • 7

PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

Paper • 2507.16116 • Published Jul 22 • 10

nablaNABLA: Neighborhood Adaptive Block-Level Attention

Paper • 2507.13546 • Published Jul 17 • 120

Captain Cinema: Towards Short Movie Generation

Paper • 2507.18634 • Published Jul 24 • 40

LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation

Paper • 2508.03694 • Published Aug 5 • 50

DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

Paper • 2508.02807 • Published Aug 4 • 13

Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Paper • 2508.07981 • Published 28 days ago • 58

CharacterShot: Controllable and Consistent 4D Character Animation

Paper • 2508.07409 • Published 29 days ago • 38

Cut2Next: Generating Next Shot via In-Context Tuning

Paper • 2508.08244 • Published 28 days ago • 13

Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Paper • 2508.07901 • Published 28 days ago • 39

ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

Paper • 2508.10881 • Published 25 days ago • 51

Waver: Wave Your Way to Lifelike Video Generation

Paper • 2508.15761 • Published 18 days ago • 33

Wan-S2V: Audio-Driven Cinematic Video Generation

Paper • 2508.18621 • Published 13 days ago • 16

Mixture of Contexts for Long Video Generation

Paper • 2508.21058 • Published 11 days ago • 30