---
license: apache-2.0
---
# MusicInfuser
[Project Page](https://susunghong.github.io/MusicInfuser/)
[arXiv](https://arxiv.org/abs/2503.14505)

MusicInfuser adapts a text-to-video diffusion model to align with music, generating dance videos that follow both the input music and the text prompt.

## Requirements

We have tested on Python 3.10 with `torch>=2.4.1+cu118`, `torchaudio>=2.4.1+cu118`, and `torchvision>=0.19.1+cu118`. This repository requires a single A100 GPU for training and inference.
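
If you want to confirm that an existing environment matches these versions, a quick sanity check (not part of the official setup) is:

```bash
# Quick sanity check: print the installed versions and confirm a GPU is visible
python -c "import torch, torchaudio, torchvision; print(torch.__version__, torchaudio.__version__, torchvision.__version__)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```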

## Installation
```bash
# Clone the repository
git clone https://github.com/SusungHong/MusicInfuser
cd MusicInfuser

# Create and activate conda environment
conda create -n musicinfuser python=3.10
conda activate musicinfuser

# Install dependencies
pip install -r requirements.txt
pip install -e ./mochi --no-build-isolation

# Download model weights
python ./music_infuser/download_weights.py weights/
```
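
After the download step, the `weights/` directory passed above should contain the downloaded checkpoints; the exact file names depend on the upstream Mochi release, but a simple way to confirm the download finished is:

```bash
# List the downloaded checkpoint files and their sizes
ls -lh weights/
```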

## Inference
To generate videos from music inputs:
```bash
python inference.py --input-file {MP3 or MP4 to extract audio from} \
                    --prompt {prompt} \
                    --num-frames {number of frames}
```

with the following arguments:
- `--input-file`: Input file (MP3 or MP4) to extract audio from.
- `--prompt`: Prompt for the dancer generation. More specific prompts generally produce better results, but greater specificity reduces the influence of the audio. Default: `"a professional female dancer dancing K-pop in an advanced dance setting in a studio with a white background, captured from a front view"`
- `--num-frames`: Number of frames to generate. Although the model was trained with 73 frames, MusicInfuser can extrapolate to longer sequences. Default: `145`

Also consider:
- `--seed`: Random seed for generation. The resulting dance also depends on the seed, so feel free to vary it. Default: `None`
- `--cfg-scale`: Classifier-Free Guidance (CFG) scale for the text prompt. Default: `6.0`
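
For example, a typical invocation might look like the following, where `my_song.mp3` is a placeholder for your own audio file rather than a file shipped with this repository:

```bash
# Generate a 145-frame dance video from a local MP3 (my_song.mp3 is a placeholder)
python inference.py --input-file my_song.mp3 \
                    --prompt "a professional female dancer dancing K-pop in a studio with a white background" \
                    --num-frames 145 \
                    --seed 42 \
                    --cfg-scale 6.0
```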

## Dataset
For the AIST dataset, please see the terms of use and download it from [the AIST Dance Video Database](https://aistdancedb.ongaaccel.jp/).

## Training
To train the model on your own dataset:

1. Preprocess your data:
```bash
bash music_infuser/preprocess.bash -v {dataset path} -o {processed video output dir} -w {path to pretrained mochi} --num_frames {number of frames}
```

2. Run training:
```bash
bash music_infuser/run.bash -c music_infuser/configs/music_infuser.yaml -n 1
```

**Note:** The current implementation supports only single-GPU training, which requires approximately 80GB of VRAM to train with 73-frame sequences.
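
Putting the two steps together, and assuming the raw dance videos live in `data/videos` with the pretrained Mochi weights in `weights/` (both paths are illustrative, not fixed by the repository), a run might look like:

```bash
# 1. Preprocess the raw videos into 73-frame training clips (paths are illustrative)
bash music_infuser/preprocess.bash -v data/videos -o data/processed -w weights/ --num_frames 73

# 2. Launch single-GPU training with the provided config
bash music_infuser/run.bash -c music_infuser/configs/music_infuser.yaml -n 1
```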

## VLM Evaluation
To evaluate the model using Visual Language Models (VLMs):

1. Follow the instructions in `vlm_eval/README.md` to set up the VideoLLaMA2 evaluation framework.
2. It is recommended to run the evaluation in a separate environment from MusicInfuser; a minimal setup sketch follows below.
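
For instance, the separate environment could be set up along the following lines; the environment name and Python version are illustrative, and `vlm_eval/README.md` remains the authoritative reference for the exact requirements:

```bash
# Illustrative only: a dedicated conda environment for the VLM evaluation
conda create -n videollama2 python=3.10
conda activate videollama2
# Then install the VideoLLaMA2 dependencies as described in vlm_eval/README.md
```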

## Citation

```bibtex
@article{hong2025musicinfuser,
  title={MusicInfuser: Making Video Diffusion Listen and Dance},
  author={Hong, Susung and Kemelmacher-Shlizerman, Ira and Curless, Brian and Seitz, Steven M},
  journal={arXiv preprint arXiv:2503.14505},
  year={2025}
}
```

## Acknowledgements

This code builds upon the following awesome repositories:
- [Mochi](https://github.com/genmoai/mochi)
- [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
- [VideoChat2](https://github.com/OpenGVLab/Ask-Anything)

We thank the authors for open-sourcing their code and models, which made this work possible.