
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Sihui Ji1 Xi Chen1 Shuai Yang3 Xin Tao2 Pengfei Wan2
Hengshuang Zhao1✉

1The University of Hong Kong    2Kling Team, Kuaishou Technology
3Hong Kong University of Science and Technology (Guangzhou)    ✉Corresponding author

     

🔥 Updates

📷 Introduction

TL;DR: We propose MemFlow to address the core challenge of long-context consistency and narrative coherence in streaming video generation. Before generating each upcoming chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to that chunk's text prompt. During generation, each query in the attention layers activates only the most relevant tokens in the memory bank, which keeps inference efficient. As a result, MemFlow achieves strong long-context consistency with negligible computational overhead and remains compatible with any streaming video generation model that uses a KV cache.
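
To make the retrieval step concrete, here is a minimal PyTorch sketch (function and variable names are hypothetical, not the actual repo API): cached historical latent frames are scored against the text embedding of the upcoming chunk, and only the top bank_size frames are kept in the memory bank.

import torch

def update_memory_bank(history_feats, prompt_emb, bank_size):
    # history_feats: (N, D) pooled features of historical latent frames.
    # prompt_emb:    (D,)   text embedding of the upcoming chunk's prompt.
    # Score each cached frame against the new prompt (cosine similarity).
    sims = torch.nn.functional.cosine_similarity(
        history_feats, prompt_emb.unsqueeze(0), dim=-1
    )
    # Keep only the most relevant frames so the memory bank stays small.
    topk = torch.topk(sims, k=min(bank_size, history_feats.shape[0]))
    return topk.indices  # indices of the frames retained in the memory bank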


📌 Highlights

  1. Long Context Memory with Limited Capacity: MemFlow maintains long-range memory for visual consistency within a tightly constrained capacity, keeping computation and storage lightweight.

  2. Adaptive Retrieval for Narrative Coherence: MemFlow dynamically retrieves the historical frames most relevant to the text prompt of the upcoming chunk from the memory bank to ensure narrative coherence.

  3. Efficient and Real-time Inference: MemFlow supports real-time generation at 18.7 FPS on a single H100 GPU, only 7.9% slower than the memory-free baseline (a rough sketch of the sparse activation behind this efficiency follows below).
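
As a rough sketch of how the sparse activation in Highlights 1 and 3 could look (hypothetical shapes and names, not the repo's actual attention implementation): each query attends only to its top-k most similar memory keys, so the extra cost of the memory bank stays nearly constant regardless of video length.

import torch
import torch.nn.functional as F

def sparse_memory_attention(q, mem_k, mem_v, k_active=64):
    # q:     (Lq, D) query tokens of the current chunk.
    # mem_k: (Lm, D) keys of the memory bank.
    # mem_v: (Lm, D) values of the memory bank.
    scores = q @ mem_k.t() / q.shape[-1] ** 0.5            # (Lq, Lm)
    k_active = min(k_active, mem_k.shape[0])
    top_scores, top_idx = scores.topk(k_active, dim=-1)    # per-query top-k memory tokens
    weights = F.softmax(top_scores, dim=-1)                # softmax over active tokens only
    gathered_v = mem_v[top_idx]                            # (Lq, k_active, D)
    return (weights.unsqueeze(-1) * gathered_v).sum(dim=1)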

πŸ› οΈ Installation

Requirements

We tested this repo on the following setup:

  • NVIDIA GPU with 80 GB memory (A100 and A800 are tested).
  • Linux operating system.

Other hardware setups may also work but have not been tested.

Environment

Create a conda environment and install dependencies:

git clone https://github.com/KlingTeam/MemFlow
cd MemFlow
conda create -n memflow python=3.10 -y
conda activate memflow
conda install nvidia/label/cuda-12.4.1::cuda
conda install -c nvidia/label/cuda-12.4.1 cudatoolkit
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
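
A quick sanity check after installation (a minimal sketch, assuming the packages above installed cleanly):

import torch

print("torch:", torch.__version__)                 # expected 2.8.0
print("CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn failed to import:", err)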

🧱 Download Checkpoints

Download models using huggingface-cli:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download KlingTeam/MemFlow --local-dir checkpoints

or using git:

git lfs install
git clone https://huggingface.co/KlingTeam/MemFlow
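
The same checkpoints can also be fetched programmatically with the huggingface_hub Python API, using the same target folders as the CLI commands above:

from huggingface_hub import snapshot_download

# Base Wan2.1 T2V model and MemFlow checkpoints.
snapshot_download("Wan-AI/Wan2.1-T2V-1.3B", local_dir="wan_models/Wan2.1-T2V-1.3B")
snapshot_download("KlingTeam/MemFlow", local_dir="checkpoints")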

🔑 Inference

Single Prompt Video Generation

bash inference.sh

Interactive Long Video Generation

bash interactive_inference.sh

Hints for video prompts

  1. For each subject and background appearing in a video, maintaining consistent descriptions across the different prompts within the same video greatly improves global coherence during prompt switches. See the demo page for the exact prompt set we used to produce some of our videos, and the schematic example after this list.

  2. MemFlow supports diverse interactions: action changes, introducing or removing objects, background shifts, and more. Large-scale continuous camera motion can be achieved through appropriate cinematic language (see prompts/interactive_example.jsonl), but rapid shot-to-shot transitions or fast cutscene-style edits are not supported.
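
To illustrate Hint 1, here is a hypothetical prompt sequence (illustration only; see prompts/interactive_example.jsonl for the actual file format) in which the subject and background descriptions are repeated verbatim across chunks while only the action changes:

# Hypothetical interactive prompt sequence: "a woman in a red coat" and
# "a snowy street at dusk" are described identically in every chunk, so
# identity and scene stay coherent when the prompt switches.
prompts = [
    "A woman in a red coat walks along a snowy street at dusk.",
    "A woman in a red coat stops to look at a shop window on a snowy street at dusk.",
    "A woman in a red coat waves to a friend on a snowy street at dusk.",
]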

βš™οΈ Training

Download checkpoints

Please follow Self-Forcing to download the text prompts and the ODE-initialized checkpoint.

Download Wan2.1-T2V-14B as the teacher model.

huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir wan_models/Wan2.1-T2V-14B

Stage 1: Self-Forcing Initialization for Memory Mechanism

bash train_init.sh

Stage 2: Streaming Long Tuning

bash train_long.sh

Hints for two-stage training

bank_size is a tunable hyperparameter specified in configs/train_init.yaml and configs/train_long.yaml. It controls the number of latent frames stored in the memory bank. When bank_size matches the number of latent frames in LongLive's frame sink (as in our default setting), training can optionally start directly from Stage 2 (Streaming Long Tuning). Specifically, we initialize from the checkpoint longlive_base.pt obtained in Stage 1 of LongLive and fine-tune only the LoRA parameters, which significantly improves training efficiency.
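
As a rough illustration (assuming bank_size is a top-level key in the YAML, which may not match the actual config layout), the memory-bank capacity can be inspected or changed before launching Stage 2:

import yaml

# Read the Stage 2 config and adjust the memory-bank capacity (in latent frames).
with open("configs/train_long.yaml") as f:
    cfg = yaml.safe_load(f)

print("current bank_size:", cfg.get("bank_size"))
cfg["bank_size"] = 8  # hypothetical value; keep it equal to LongLive's frame-sink size to start from Stage 2

with open("configs/train_long.yaml", "w") as f:
    yaml.safe_dump(cfg, f)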

🤗 Acknowledgement

  • LongLive: the codebase we built upon. Thanks for their wonderful work.
  • Self-Forcing: the algorithm we built upon. Thanks for their wonderful work.
  • Wan: the base model we built upon. Thanks for their wonderful work.

🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.

@misc{ji2025memflow,
      title={MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives}, 
      author={Ji, Sihui and Chen, Xi and Yang, Shuai and Tao, Xin and Wan, Pengfei and Zhao, Hengshuang},
      year={2025},
      eprint={2512.14699},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.14699}, 
}