susunghong committed on
Commit 0efe410 · verified · 1 Parent(s): 2b70e2c

Update README.md

Files changed (1)
  1. README.md +92 -3
README.md CHANGED
@@ -1,3 +1,92 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+ # MusicInfuser
+ [![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://susunghong.github.io/MusicInfuser/)
+ [![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2503.14505)
+
+ MusicInfuser adapts a text-to-video diffusion model to align with music, generating dance videos that follow both the music and the text prompt.
+
+ ## Requirements
+
+ We have tested this repository on Python 3.10 with `torch>=2.4.1+cu118`, `torchaudio>=2.4.1+cu118`, and `torchvision>=0.19.1+cu118`. A single A100 GPU is required for training and inference.
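+
+ A quick sanity check of the environment (an illustrative sketch, not a script from this repo; assumes the `musicinfuser` conda environment from the Installation section below is active):
+ ```bash
+ # Print the installed torch version and whether a CUDA device is visible
+ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
+ ```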
+
+ ## Installation
+ ```bash
+ # Clone the repository
+ git clone https://github.com/SusungHong/MusicInfuser
+ cd MusicInfuser
+
+ # Create and activate conda environment
+ conda create -n musicinfuser python=3.10
+ conda activate musicinfuser
+
+ # Install dependencies
+ pip install -r requirements.txt
+ pip install -e ./mochi --no-build-isolation
+
+ # Download model weights
+ python ./music_infuser/download_weights.py weights/
+ ```
+
+ ## Inference
+ To generate videos from music inputs:
+ ```bash
+ python inference.py --input-file {MP3 or MP4 to extract audio from} \
+     --prompt {prompt} \
+     --num-frames {number of frames}
+ ```
+
+ The arguments are as follows (an example invocation is shown below):
+ - `--input-file`: Input file (MP3 or MP4) to extract the audio from.
+ - `--prompt`: Prompt for the dancer generation. More specific prompts generally give better results, but greater specificity reduces the influence of the audio. Default: `"a professional female dancer dancing K-pop in an advanced dance setting in a studio with a white background, captured from a front view"`
+ - `--num-frames`: Number of frames to generate. Although the model was trained with 73 frames, MusicInfuser can extrapolate to longer sequences. Default: `145`
+
+ Also consider:
+ - `--seed`: Random seed for generation. The resulting dance also depends on the seed, so feel free to change it. Default: `None`
+ - `--cfg-scale`: Classifier-Free Guidance (CFG) scale for the text prompt. Default: `6.0`
+
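+ For example, a concrete call might look like this (a minimal sketch: `input.mp3` and the seed are illustrative values, while the other settings simply restate the documented defaults):
+ ```bash
+ # Generate a 145-frame dance video from the audio track of input.mp3
+ python inference.py --input-file input.mp3 \
+     --prompt "a professional female dancer dancing K-pop in an advanced dance setting in a studio with a white background, captured from a front view" \
+     --num-frames 145 \
+     --seed 42 \
+     --cfg-scale 6.0
+ ```
+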
+ ## Dataset
+ For the AIST dataset, please review the terms of use and download it from [the AIST Dance Video Database](https://aistdancedb.ongaaccel.jp/).
+
+ ## Training
+ To train the model on your own dataset:
+
+ 1. Preprocess your data (an example invocation is sketched after these steps):
+ ```bash
+ bash music_infuser/preprocess.bash -v {dataset path} -o {processed video output dir} -w {path to pretrained mochi} --num_frames {number of frames}
+ ```
+
+ 2. Run training:
+ ```bash
+ bash music_infuser/run.bash -c music_infuser/configs/music_infuser.yaml -n 1
+ ```
+
+ **Note:** The current implementation supports only single-GPU training, which requires approximately 80 GB of VRAM for 73-frame sequences.
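+
+ For instance, the preprocessing step might be invoked as follows (a hypothetical sketch: the dataset and output paths are placeholders to replace with your own, and `weights/` assumes the download location used in the Installation section):
+ ```bash
+ # Hypothetical paths; 73 frames matches the training setup described above
+ bash music_infuser/preprocess.bash -v data/dance_videos -o data/processed -w weights/ --num_frames 73
+ ```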
+
+ ## VLM Evaluation
+ To evaluate the model using Vision-Language Models (VLMs):
+ 1. Follow the instructions in `vlm_eval/README.md` to set up the VideoLLaMA2 evaluation framework.
+ 2. We recommend using a separate environment from MusicInfuser for the evaluation.
+
+ ## Citation
+
+ ```bibtex
+ @article{hong2025musicinfuser,
+   title={MusicInfuser: Making Video Diffusion Listen and Dance},
+   author={Hong, Susung and Kemelmacher-Shlizerman, Ira and Curless, Brian and Seitz, Steven M},
+   journal={arXiv preprint arXiv:2503.14505},
+   year={2025}
+ }
+ ```
+
+ ## Acknowledgements
+
+ This code builds upon the following awesome repositories:
+ - [Mochi](https://github.com/genmoai/mochi)
+ - [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
+ - [VideoChat2](https://github.com/OpenGVLab/Ask-Anything)
+
+ We thank the authors for open-sourcing their code and models, which made this work possible.