
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Sihui Ji1 Xi Chen1 Shuai Yang3 Xin Tao2 Pengfei Wan2
Hengshuang Zhao1✉

1The University of Hong Kong    2Kling Team, Kuaishou Technology
3Hong Kong University of Science and Technology (Guangzhou)    ✉Corresponding author

     

🔥 Updates

📷 Introduction

TL;DR: We propose MemFlow to address the core challenge of long-context consistency and narrative coherence in streaming video generation. Before generating each upcoming chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to that chunk's text prompt. During generation, each query in the attention layers activates only the most relevant tokens in the memory bank, which keeps inference efficient. As a result, MemFlow achieves strong long-context consistency with negligible computational overhead and remains compatible with any streaming video generation model that uses a KV cache.
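
To make the retrieval step concrete, here is a minimal PyTorch sketch (function and variable names are hypothetical, not the actual repo API): cached historical latent frames are scored against the text embedding of the upcoming chunk, and only the top bank_size frames are kept in the memory bank.

import torch

def update_memory_bank(history_feats, prompt_emb, bank_size):
    # history_feats: (N, D) pooled features of historical latent frames.
    # prompt_emb:    (D,)   text embedding of the upcoming chunk's prompt.
    # Score each cached frame against the new prompt (cosine similarity).
    sims = torch.nn.functional.cosine_similarity(
        history_feats, prompt_emb.unsqueeze(0), dim=-1
    )
    # Keep only the most relevant frames so the memory bank stays small.
    topk = torch.topk(sims, k=min(bank_size, history_feats.shape[0]))
    return topk.indices  # indices of the frames retained in the memory bank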


📌 Highlights

  1. Long Context Memory with Limited Capacity: MemFlow maintains long-range memory for visual consistency within a tightly constrained capacity, keeping computation and storage lightweight.

  2. Adaptive Retrieval for Narrative Coherence: MemFlow dynamically retrieves the historical frames most relevant to the text prompt of the upcoming chunk from the memory bank to ensure narrative coherence.

  3. Efficient and Real-time Inference: MemFlow supports real-time generation at 18.7 FPS on a single H100 GPU, only 7.9% slower than the memory-free baseline (a rough sketch of the sparse activation behind this efficiency follows below).
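
As a rough sketch of how the sparse activation in Highlights 1 and 3 could look (hypothetical shapes and names, not the repo's actual attention implementation): each query attends only to its top-k most similar memory keys, so the extra cost of the memory bank stays nearly constant regardless of video length.

import torch
import torch.nn.functional as F

def sparse_memory_attention(q, mem_k, mem_v, k_active=64):
    # q:     (Lq, D) query tokens of the current chunk.
    # mem_k: (Lm, D) keys of the memory bank.
    # mem_v: (Lm, D) values of the memory bank.
    scores = q @ mem_k.t() / q.shape[-1] ** 0.5            # (Lq, Lm)
    k_active = min(k_active, mem_k.shape[0])
    top_scores, top_idx = scores.topk(k_active, dim=-1)    # per-query top-k memory tokens
    weights = F.softmax(top_scores, dim=-1)                # softmax over active tokens only
    gathered_v = mem_v[top_idx]                            # (Lq, k_active, D)
    return (weights.unsqueeze(-1) * gathered_v).sum(dim=1)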

πŸ› οΈ Installation

Requirements

We tested this repo on the following setup:

  • NVIDIA GPU with 80 GB memory (A100 and A800 are tested).
  • Linux operating system.

Other hardware setups may also work but have not been tested.

Environment

Create a conda environment and install dependencies:

git clone https://github.com/KlingTeam/MemFlow
cd MemFlow
conda create -n memflow python=3.10 -y
conda activate memflow
conda install nvidia/label/cuda-12.4.1::cuda
conda install -c nvidia/label/cuda-12.4.1 cudatoolkit
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
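
A quick sanity check after installation (a minimal sketch, assuming the packages above installed cleanly):

import torch

print("torch:", torch.__version__)                 # expected 2.8.0
print("CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn failed to import:", err)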

🧱 Download Checkpoints

Download models using huggingface-cli:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download KlingTeam/MemFlow --local-dir checkpoints

or using git:

git lfs install
git clone https://huggingface.co/KlingTeam/MemFlow
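
The same checkpoints can also be fetched programmatically with the huggingface_hub Python API, using the same target folders as the CLI commands above:

from huggingface_hub import snapshot_download

# Base Wan2.1 T2V model and MemFlow checkpoints.
snapshot_download("Wan-AI/Wan2.1-T2V-1.3B", local_dir="wan_models/Wan2.1-T2V-1.3B")
snapshot_download("KlingTeam/MemFlow", local_dir="checkpoints")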

🔑 Inference

Single Prompt Video Generation

bash inference.sh

Interactive Long Video Generation

bash interactive_inference.sh

Hints for video prompts

  1. For each subject and background appearing in a video, maintaining consistent descriptions across the different prompts within the same video greatly improves global coherence during prompt switches. See the demo page for the exact prompt set we used to produce some of our videos, and the schematic example after this list.

  2. MemFlow supports diverse interactions: action changes, introducing or removing objects, background shifts, and more. Large-scale continuous camera motion can be achieved through appropriate cinematic language (see prompts/interactive_example.jsonl), but rapid shot-to-shot transitions or fast cutscene-style edits are not supported.
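
To illustrate Hint 1, here is a hypothetical prompt sequence (illustration only; see prompts/interactive_example.jsonl for the actual file format) in which the subject and background descriptions are repeated verbatim across chunks while only the action changes:

# Hypothetical interactive prompt sequence: "a woman in a red coat" and
# "a snowy street at dusk" are described identically in every chunk, so
# identity and scene stay coherent when the prompt switches.
prompts = [
    "A woman in a red coat walks along a snowy street at dusk.",
    "A woman in a red coat stops to look at a shop window on a snowy street at dusk.",
    "A woman in a red coat waves to a friend on a snowy street at dusk.",
]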

βš™οΈ Training

Download checkpoints

Please follow Self-Forcing to download the text prompts and the ODE-initialized checkpoint.

Download Wan2.1-T2V-14B as the teacher model.

huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir wan_models/Wan2.1-T2V-14B

Stage 1: Self-Forcing Initialization for Memory Mechanism

bash train_init.sh

Stage 2: Streaming Long Tuning

bash train_long.sh

Hints for two-stage training

bank_size is a tunable hyperparameter specified in configs/train_init.yaml and configs/train_long.yaml. It controls the number of latent frames stored in the memory bank. When bank_size matches the number of latent frames in LongLive's frame sink (as in our default setting), training can optionally start directly from Stage 2 (Streaming Long Tuning). Specifically, we initialize from the checkpoint longlive_base.pt obtained in Stage 1 of LongLive and fine-tune only the LoRA parameters, which significantly improves training efficiency.
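
As a rough illustration (assuming bank_size is a top-level key in the YAML, which may not match the actual config layout), the memory-bank capacity can be inspected or changed before launching Stage 2:

import yaml

# Read the Stage 2 config and adjust the memory-bank capacity (in latent frames).
with open("configs/train_long.yaml") as f:
    cfg = yaml.safe_load(f)

print("current bank_size:", cfg.get("bank_size"))
cfg["bank_size"] = 8  # hypothetical value; keep it equal to LongLive's frame-sink size to start from Stage 2

with open("configs/train_long.yaml", "w") as f:
    yaml.safe_dump(cfg, f)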

🤗 Acknowledgement

  • LongLive: the codebase we built upon. Thanks for their wonderful work.
  • Self-Forcing: the algorithm we built upon. Thanks for their wonderful work.
  • Wan: the base model we built upon. Thanks for their wonderful work.

🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.

@misc{ji2025memflow,
      title={MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives}, 
      author={Ji, Sihui and Chen, Xi and Yang, Shuai and Tao, Xin and Wan, Pengfei and Zhao, Hengshuang},
      year={2025},
      eprint={2512.14699},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.14699}, 
}