Safetensors
qwen2
iris2c committed on
Commit
c642c64
·
verified ·
1 Parent(s): 2d8d2e1
Files changed (1)
  1. README.md +125 -183
README.md CHANGED
@@ -1,166 +1,85 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - en
5
- pipeline_tag: text-to-audio
6
- tags:
7
- - music_generation
8
- ---
9
- [//]: # (# InspireMusic)
10
- <p align="center">
11
- <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank">
12
- <img alt="logo" src="./asset/logo.png" width="100%"></a>
13
- </p>
14
-
15
- [//]: # (<p align="center">)
16
-
17
- [//]: # ( <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank">)
18
-
19
- [//]: # ( <img alt="InspireMusic" src="https://svg-banners.vercel.app/api?type=origin&text1=Inspire%20Music🎶&text2=🤗%20A%20Fundamental%20Music%20Song%20Audio%20Generation%20Toolkit&width=800&height=210"></a>)
20
-
21
- [//]: # (</p>)
22
 
23
  <p align="center">
24
- <a href="https://iris2c.github.io/InspireMusic" target="_blank">
25
- <img alt="Demo" src="https://img.shields.io/badge/Demo%20👈🏻-InspireMusic?labelColor=%20%23FDB062&label=InspireMusic&color=%20%23f79009"></a>
26
- <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank">
27
- <img alt="Code" src="https://img.shields.io/badge/Code%20⭐-InspireMusic?labelColor=%20%237372EB&label=InspireMusic&color=%20%235462eb"></a>
28
-
29
- <a href="https://modelscope.cn/models/iic/InspireMusic-1.5B-Long" target="_blank">
30
- <img alt="Model" src="https://img.shields.io/badge/InspireMusic-Model-green"></a>
31
-
32
- <a href="https://huggingface.co/spaces/FunAudioLLM/InspireMusic" target="_blank">
33
- <img alt="Space" src="https://img.shields.io/badge/Spaces-ModelScope-pink?labelColor=%20%237b8afb&label=Spaces&color=%20%230a5af8"></a>
34
-
35
- <a href="https://huggingface.co/spaces/FunAudioLLM/InspireMusic" target="_blank">
36
- <img alt="Space" src="https://img.shields.io/badge/HuggingFace-Spaces?labelColor=%20%239b8afb&label=Spaces&color=%20%237a5af8"></a>
37
-
38
- <a href="https://arxiv.org/abs/" target="_blank">
39
- <img alt="Paper" src="https://img.shields.io/badge/arXiv-Paper-lightgrey"></a>
40
- <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank">
41
-
42
- [//]: # (<a href="https://huggingface.co/FunAudioLLM/InspireMusic-Base" target="_blank">)
43
-
44
- [//]: # ( <img alt="Model" src="https://img.shields.io/badge/Model-InspireMusic?labelColor=%20%23FDA199&label=InspireMusic&color=orange"></a>)
45
-
46
- [//]: # (<a href="https://arxiv.org/abs/" target="_blank">)
47
-
48
- [//]: # ( <img alt="Paper" src="https://img.shields.io/badge/Paper-arXiv?labelColor=%20%23528bff&label=arXiv&color=%20%23155EEF"></a>)
49
-
50
- [//]: # (<a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank">)
51
-
52
- [//]: # ( <img alt="Githube Star" src="https://img.shields.io/github/stars/FunAudioLLM/InspireMusic"></a>)
53
-
54
- [//]: # (<a href="https://github.com/FunAudioLLM/InspireMusic/blob/main/asset/QR.jpg" target="_blank">)
55
-
56
- [//]: # ( <img src="https://img.shields.io/badge/group%20chat-group?&labelColor=%20%235462eb&color=%20%235462eb" alt="chat on WeChat"></a>)
57
- [//]: # (<a href="https://discord.gg/nSPpRU7fRr" target="_blank">)
58
-
59
- [//]: # ( <img src="https://img.shields.io/badge/discord-chat?&labelColor=%20%235462eb&color=%20%235462eb" alt="chat on Discord"></a>)
60
-
61
- [//]: # ( <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank">)
62
-
63
- [//]: # ( <img alt="Static Badge" src="https://img.shields.io/badge/v0.1-version?logo=free&color=%20%23155EEF&label=version&labelColor=%20%23528bff"></a>)
64
- [//]: # (<a href="https://github.com/FunAudioLLM/InspireMusic/graphs/commit-activity" target="_blank">)
65
-
66
- [//]: # (<img alt="Commits last month" src="https://img.shields.io/github/commit-activity/m/FunAudioLLM/InspireMusic?labelColor=%20%2332b583&color=%20%2312b76a"></a>)
67
-
68
- [//]: # ( <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank">)
69
-
70
- [//]: # ( <img alt="Issues closed" src="https://img.shields.io/github/issues-search?query=repo%3AFunAudioLLM%2FInspireMusic%20is%3Aclosed&label=issues%20closed&labelColor=%20%237d89b0&color=%20%235d6b98"></a>)
71
-
72
- [//]: # ( <a href="https://github.com/FunAudioLLM/InspireMusic/discussions/" target="_blank">)
73
-
74
- [//]: # ( <img alt="Discussion posts" src="https://img.shields.io/github/discussions/FunAudioLLM/InspireMusic?labelColor=%20%239b8afb&color=%20%237a5af8"></a>)
75
  </p>
76
 
77
- InspireMusic is a fundamental AIGC toolkit and models designed for music, song, and audio generation using PyTorch.
78
 
79
- ![GitHub Repo stars](https://img.shields.io/github/stars/FunAudioLLM/InspireMusic) Please support our community project 💖 by starring it on GitHub. Add a ⭐ to support us 🙏
80
 
81
  ---
82
- <a name="Highligts"></a>
83
  ## Highlights
84
- **InspireMusic** focuses on music generation, song generation and audio generation.
85
- - A unified framework for music/song/audio generation. Controllable with text prompts, music genres, music structures, etc.
86
- - Support music generation tasks with high audio quality, with available sampling rates of 24kHz, 48kHz.
87
- - Support long-form audio generation.
88
- - Convenient fine-tuning and inference. Support mixed precision training (FP16, FP32). Provide convenient fine-tuning and inference scripts and strategies, allowing users to easily fine-tune their music generation models.
89
-
90
- <a name="What's News"></a>
91
- ## What's New 🔥
92
-
93
- - 2025/02: InspireMusic demo is available on [ModelScope Space](https://modelscope.cn/studios/iic/InspireMusic/summary) and [HuggingFace Space](https://huggingface.co/spaces/FunAudioLLM/InspireMusic).
94
- - 2025/01: Open-source [InspireMusic-Base](https://modelscope.cn/models/iic/InspireMusic/summary), [InspireMusic-Base-24kHz](https://modelscope.cn/models/iic/InspireMusic-Base-24kHz/summary), [InspireMusic-1.5B](https://modelscope.cn/models/iic/InspireMusic-1.5B/summary), [InspireMusic-1.5B-24kHz](https://modelscope.cn/models/iic/InspireMusic-1.5B-24kHz/summary), [InspireMusic-1.5B-Long](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/summary) models for music generation. Models are available on both ModelScope and HuggingFace.
95
- - 2024/12: Support to generate 48kHz audio with super resolution flow matching.
96
- - 2024/11: Welcome to preview 👉🏻 [**InspireMusic Demos**](https://iris2c.github.io/InspireMusic) 👈🏻. We're excited to share this with you and are working hard to bring even more features and models soon. Your support and feedback mean a lot to us!
97
- - 2024/11: We are thrilled to announce the open-sourcing of the **InspireMusic** [code repository](https://github.com/FunAudioLLM/InspireMusic) and [demos](https://iris2c.github.io/InspireMusic). **InspireMusic** is a unified framework for music, song, and audio generation, featuring capabilities such as text-to-music conversion, music structure, genre control, and timestamp management. InspireMusic stands out for its exceptional music generation and instruction-following abilities.
98
 
 
99
  ## Introduction
100
  > [!Note]
101
  > This repo contains the algorithm infrastructure and some simple examples. Currently, only English text prompts are supported.
102
 
103
  > [!Tip]
104
- > To explore the performance, please refer to [InspireMusic Demo Page](https://iris2c.github.io/InspireMusic). We will open-source better & larger models soon.
105
 
106
- InspireMusic is a unified music, song and audio generation framework through the audio tokenization and detokenization process integrated with a large autoregressive transformer. The original motive of this toolkit is to empower the common users to innovate soundscapes and enhance euphony in research through music, song, and audio crafting. The toolkit provides both inference and training code for AI generative models that create high-quality music. Featuring a unified framework, InspireMusic incorporates autoregressive Transformer and conditional flow-matching modeling (CFM), allowing for the controllable generation of music, songs, and audio with both textual and structural music conditioning, as well as neural audio tokenizers. Currently, the toolkit supports text-to-music generation and plans to expand its capabilities to include text-to-song and text-to-audio generation in the future.
107
 
108
  ## InspireMusic
109
- <p align="center">
110
- <table>
111
- <tr>
112
- <td style="text-align:center;">
113
- <img alt="Light" src="asset/InspireMusic.png" width="100%" />
114
- </tr>
115
- <tr>
116
- <td style="text-align:center;">
117
- <b>Figure 1.</b> An overview of the InspireMusic framework.
118
-
119
- We introduce InspireMusic, a unified framework for music, song and audio generation, capable of producing 48kHz long-form audio. InspireMusic employs an autoregressive transformer to generate music tokens in response to textual input. Complementing this, an ODE-based diffusion model, specifically flow matching, is utilized to reconstruct latent features from these generated music tokens. A vocoder then generates audio waveforms from the reconstructed features. InspireMusic is capable of text-to-music, music continuation, music reconstruction, and music super resolution tasks. It employs WavTokenizer as an audio tokenizer to convert 24kHz audio into 75Hz discrete tokens, while HifiCodec serves as a music tokenizer, transforming 48kHz audio into 150Hz latent features compatible with the flow matching model.
120
- </td>
121
- </tr>
122
- </table>
123
- </p>
124
 
 
125
  ## Installation
126
-
127
  ### Clone
128
-
129
  - Clone the repo
130
  ``` sh
131
  git clone --recursive https://github.com/FunAudioLLM/InspireMusic.git
132
  # If you failed to clone submodule due to network failures, please run the following command until success
133
  cd InspireMusic
134
- git submodule update --init --recursive
 
 
135
  ```
136
 
137
- ### Install
138
- InspireMusic requires Python 3.8, PyTorch 2.0.1. To install InspireMusic, you can run one of the following:
139
 
140
  - Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
141
  - Create Conda env:
142
- ``` sh
143
  conda create -n inspiremusic python=3.8
144
  conda activate inspiremusic
145
  cd InspireMusic
146
  # pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platforms.
147
  conda install -y -c conda-forge pynini==2.1.5
148
  pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
149
- # install flash attention to speedup training, support version 2.6.3
150
  pip install flash-attn --no-build-isolation
151
  ```
152
- Currently support on CUDA Version 11.x.
153
 
154
  - Install within the package:
155
- ```sh
156
  cd InspireMusic
157
  # You can run to install the packages
158
  python setup.py install
159
  pip install flash-attn --no-build-isolation
160
  ```
161
-
162
  We also recommend having `sox` or `ffmpeg` installed, either through your system or Anaconda:
163
- ```sh
164
  # # Install sox
165
  # ubuntu
166
  sudo apt-get install sox libsox-dev
@@ -174,10 +93,30 @@ sudo apt-get install ffmpeg
174
  sudo yum install ffmpeg
175
  ```
176
 
177
- ### Quick Start
178

179
  Here is a quick example inference script for music generation.
180
- ``` sh
181
  cd InspireMusic
182
  mkdir -p pretrained_models
183
 
@@ -189,46 +128,43 @@ git clone https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long.git pretrain
189
 
190
  cd examples/music_generation
191
  # run a quick inference example
192
- bash infer_1.5b_long.sh
193
  ```
194
 
195
  Here is a quick-start script to run the music generation task, including the data preparation pipeline, model training, and inference.
196
- ``` sh
197
  cd InspireMusic/examples/music_generation/
198
- bash run.sh
199
  ```
200
 
201
  ### One-line Inference
202
  #### Text-to-music Task
203
-
204
  One-line Shell script for text-to-music task.
205
- ``` sh
206
  cd examples/music_generation
207
- # with flow matching
208
- # use one-line command to get a quick try
209
  python -m inspiremusic.cli.inference
210
 
211
  # customize the config with a one-line command like the following
212
  python -m inspiremusic.cli.inference --task text-to-music -m "InspireMusic-1.5B-Long" -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." -c intro -s 0.0 -e 30.0 -r "exp/inspiremusic" -o output -f wav
213
 
214
- # without flow matching
215
  python -m inspiremusic.cli.inference --task text-to-music -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." --fast True
216
  ```
217
 
218
  Alternatively, you can run the inference with just a few lines of Python code.
219
  ```python
220
- from inspiremusic.cli.inference import InspireMusicUnified
221
- from inspiremusic.cli.inference import set_env_variables
222
  if __name__ == "__main__":
223
- set_env_variables()
224
- model = InspireMusicUnified(model_name = "InspireMusic-1.5B-Long")
225
  model.inference("text-to-music", "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance.")
226
  ```
227
 
228
  #### Music Continuation Task
229
-
230
  One-line Shell script for music continuation task.
231
- ``` sh
232
  cd examples/music_generation
233
  # with flow matching
234
  python -m inspiremusic.cli.inference --task continuation -g 0 -a audio_prompt.wav
@@ -238,55 +174,50 @@ python -m inspiremusic.cli.inference --task continuation -g 0 -a audio_prompt.wa
238
 
239
  Alternatively, you can run the inference with just a few lines of Python code.
240
  ```python
241
- from inspiremusic.cli.inference import InspireMusicUnified
242
- from inspiremusic.cli.inference import set_env_variables
243
  if __name__ == "__main__":
244
- set_env_variables()
245
- model = InspireMusicUnified(model_name = "InspireMusic-1.5B-Long")
246
  # just use audio prompt
247
  model.inference("continuation", None, "audio_prompt.wav")
248
  # use both text prompt and audio prompt
249
  model.inference("continuation", "Continue to generate jazz music.", "audio_prompt.wav")
250
  ```
251
-
252
  ## Models
253
- ### Download Model
254
-
255
- We strongly recommend that you download our pretrained `InspireMusic model`.
256
-
257
- If you are an expert in this field, and you are only interested in training your own InspireMusic model from scratch, you can skip this step.
258
-
259
- ``` sh
260
- # git模型下载,请确保已安装git lfs
261
  mkdir -p pretrained_models
262
- git clone https://www.modelscope.cn/iic/InspireMusic-1.5B-Long.git pretrained_models/InspireMusic
263
  ```
264
 
265
  ### Available Models
266
  Currently, we open-source music generation models supporting 24kHz mono and 48kHz stereo audio.
267
- The table below presents the links to the ModelScope and Huggingface model hub. More models will be available soon.
268
-
269
- | Model name | Model Links | Remarks |
270
- |---------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
271
- | InspireMusic-Base-24kHz | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-Base-24kHz/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-Base-24kHz) | Pre-trained Music Generation Model, 24kHz mono, 30s |
272
- | InspireMusic-Base | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-Base) | Pre-trained Music Generation Model, 48kHz, 30s |
273
- | InspireMusic-1.5B-24kHz | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-24kHz/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-24kHz) | Pre-trained Music Generation 1.5B Model, 24kHz mono, 30s |
274
- | InspireMusic-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B) | Pre-trained Music Generation 1.5B Model, 48kHz, 30s |
275
- | InspireMusic-1.5B-Long| [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long) | Pre-trained Music Generation 1.5B Model, 48kHz, support long-form music generation more than 5mins |
276
- | InspireSong-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)]() [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)]() | Pre-trained Song Generation 1.5B Model, 48kHz stereo |
277
- | InspireAudio-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)]() [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)]() | Pre-trained Audio Generation 1.5B Model, 48kHz stereo |
278
- | Wavtokenizer[<sup>[1]</sup>](https://openreview.net/forum?id=yBlVlS2Fd9) (75Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/file/view/master?fileName=wavtokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long/tree/main/wavtokenizer) | An extreme low bitrate audio tokenizer for music with one codebook at 24kHz audio. |
279
- | Music_tokenizer (75Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-24kHz/file/view/master?fileName=music_tokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-24kHz/tree/main/music_tokenizer) | A music tokenizer based on HifiCodec<sup>[2]</sup> at 24kHz audio. |
280
- | Music_tokenizer (150Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/file/view/master?fileName=music_tokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long/tree/main/music_tokenizer) | A music tokenizer based on HifiCodec at 48kHz audio. |
281
-
 
282
  ## Basic Usage
283
-
284
- At the moment, InspireMusic contains the training code and inference code for [music generation](https://github.com/FunAudioLLM/InspireMusic/tree/main/examples/music_generation). More tasks such as song generation and audio generation will be supported in future.
285
 
286
  ### Training
287
-
288
- Here is an example to train LLM model, support FP16 training.
289
- ```sh
290
  torchrun --nnodes=1 --nproc_per_node=8 \
291
  --rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
292
  inspiremusic/bin/train.py \
@@ -307,7 +238,7 @@ torchrun --nnodes=1 --nproc_per_node=8 \
307
  ```
308
 
309
  Here is example code to train the flow-matching model; it does not support FP16 training.
310
- ```sh
311
  torchrun --nnodes=1 --nproc_per_node=8 \
312
  --rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
313
  inspiremusic/bin/train.py \
@@ -329,14 +260,13 @@ torchrun --nnodes=1 --nproc_per_node=8 \
329
  ### Inference
330
 
331
  Here is an example script to quickly run model inference.
332
- ``` sh
333
  cd InspireMusic/examples/music_generation/
334
- bash infer.sh
335
  ```
336
-
337
  Here is example code to run inference in normal mode, i.e., with the flow-matching model, for the text-to-music and music continuation tasks.
338
- ```sh
339
- pretrained_model_dir = "./pretrained_models/InspireMusic/"
340
  for task in 'text-to-music' 'continuation'; do
341
  python inspiremusic/bin/inference.py --task $task \
342
  --gpu 0 \
@@ -347,15 +277,12 @@ for task in 'text-to-music' 'continuation'; do
347
  --music_tokenizer $pretrained_model_dir/music_tokenizer \
348
  --wavtokenizer $pretrained_model_dir/wavtokenizer \
349
  --result_dir `pwd`/exp/inspiremusic/${task}_test \
350
- --chorus verse \
351
- --min_generate_audio_seconds 8 \
352
- --max_generate_audio_seconds 30
353
  done
354
  ```
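The `--min_generate_audio_seconds` and `--max_generate_audio_seconds` flags in the example above bound the length of the generated audio (8 s and 30 s here). Conceptually this is a clamp on the requested duration; the sketch below illustrates only that idea with a hypothetical helper, not the actual InspireMusic CLI logic:

```python
# Toy illustration of how --min_generate_audio_seconds / --max_generate_audio_seconds
# bound a requested output duration (8 s and 30 s in the example above).
# Hypothetical helper, NOT the InspireMusic implementation.

def clamp_duration(requested_s: float, min_s: float = 8.0, max_s: float = 30.0) -> float:
    """Keep the requested output length within the configured bounds."""
    return max(min_s, min(requested_s, max_s))

print(clamp_duration(5.0))    # 8.0  -> raised to the minimum
print(clamp_duration(20.0))   # 20.0 -> already in range
print(clamp_duration(120.0))  # 30.0 -> capped at the maximum
```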
355
-
356
  Here is example code to run inference in fast mode, i.e., without the flow-matching model, for the text-to-music and music continuation tasks.
357
- ```sh
358
- pretrained_model_dir = "./pretrained_models/InspireMusic/"
359
  for task in 'text-to-music' 'continuation'; do
360
  python inspiremusic/bin/inference.py --task $task \
361
  --gpu 0 \
@@ -367,11 +294,26 @@ for task in 'text-to-music' 'continuation'; do
367
  --wavtokenizer $pretrained_model_dir/wavtokenizer \
368
  --result_dir `pwd`/exp/inspiremusic/${task}_test \
369
  --chorus verse \
370
- --fast \
371
- --min_generate_audio_seconds 8 \
372
- --max_generate_audio_seconds 30
373
  done
374
  ```
375
 
376
  ## Disclaimer
377
- The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
 
1
+ <p align="center"> <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank"> <img alt="logo" src="./asset/logo.png" width="100%"></a></p>
2
 
3
  <p align="center">
4
+ <a href="https://funaudiollm.github.io/inspiremusic" target="_blank"><img alt="Demo" src="https://img.shields.io/badge/Demo-InspireMusic?labelColor=%20%23FDB062&label=InspireMusic&color=%20%23f79009"></a>
5
+ <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank"><img alt="Code" src="https://img.shields.io/badge/Code-InspireMusic?labelColor=%20%237372EB&label=InspireMusic&color=%20%235462eb"></a>
6
+ <a href="https://modelscope.cn/models/iic/InspireMusic" target="_blank"><img alt="Model" src="https://img.shields.io/badge/InspireMusic-Model-green"></a>
7
+ <a href="https://modelscope.cn/studios/iic/InspireMusic/summary" target="_blank"><img alt="Space" src="https://img.shields.io/badge/Spaces-ModelScope-pink?labelColor=%20%237b8afb&label=Spaces&color=%20%230a5af8"></a>
8
+ <a href="https://huggingface.co/spaces/FunAudioLLM/InspireMusic" target="_blank"><img alt="Space" src="https://img.shields.io/badge/HuggingFace-Spaces?labelColor=%20%239b8afb&label=Spaces&color=%20%237a5af8"></a>
9
+ <a href="http://arxiv.org/abs/2503.00084" target="_blank"><img alt="Paper" src="https://img.shields.io/badge/arXiv-Paper-green"></a>
10
  </p>
11
 
12
+ ![GitHub Repo stars](https://img.shields.io/github/stars/FunAudioLLM/InspireMusic) Please support our community by starring it on GitHub. Thank you all for your support!
13
 
14
+ [**Highlights**](#highlights)
15
+ | [**Introduction**](#introduction)
16
+ | [**Installation**](#installation)
17
+ | [**Quick Start**](#quick-start)
18
+ | [**Tutorial**](https://github.com/FunAudioLLM/InspireMusic#tutorial)
19
+ | [**Models**](#model-zoo)
20
+ | [**Contact**](#contact)
21
 
22
  ---
23
+ <a name="highlights"></a>
24
  ## Highlights
25
+ **InspireMusic** focuses on music generation, song generation, and audio generation.
26
+ - A unified toolkit designed for music, song, and audio generation.
27
+ - Music generation tasks with high audio quality.
28
+ - Long-form music generation.
29
 
30
+ <a name="introduction"></a>
31
  ## Introduction
32
  > [!Note]
33
  > This repo contains the algorithm infrastructure and some simple examples. Currently, only English text prompts are supported.
34
 
35
  > [!Tip]
36
+ > To preview the performance, please refer to [InspireMusic Demo Page](https://funaudiollm.github.io/inspiremusic).
37
 
38
+ InspireMusic is a unified framework for music, song, and audio generation that combines audio tokenization with an autoregressive transformer and a flow-matching based model. The original motivation for this toolkit is to empower everyday users to innovate soundscapes and enhance euphony in research through music, song, and audio crafting. The toolkit provides both training and inference code for AI-based generative models that create high-quality music. Featuring a unified framework, InspireMusic combines audio tokenizers with autoregressive transformer and super-resolution flow-matching modeling, allowing for the controllable generation of music, songs, and audio with both text and audio prompts. The toolkit currently supports music generation, and will support song generation and audio generation in the future.
39
 
40
  ## InspireMusic
41
+ <p align="center"><table><tr><td style="text-align:center;"><img alt="Light" src="asset/InspireMusic.png" width="100%" /></tr><tr><td style="text-align:center;">
42
+ Figure 1: An overview of the InspireMusic framework. We introduce InspireMusic, a unified framework for music, song, and audio generation capable of producing high-quality long-form audio. InspireMusic consists of three key components. <b>Audio Tokenizers</b> convert the raw audio waveform into discrete audio tokens that can be efficiently processed and trained on by the autoregressive transformer model; audio waveforms at a lower sampling rate are converted to discrete tokens via a high-bitrate compression audio tokenizer<a href="https://openreview.net/forum?id=yBlVlS2Fd9" target="_blank"><sup>[1]</sup></a>. The <b>Autoregressive Transformer</b> model, built on Qwen2.5<a href="https://arxiv.org/abs/2412.15115" target="_blank"><sup>[2]</sup></a> as the backbone, is trained with a next-token prediction objective on both text and audio tokens, enabling it to generate coherent and contextually relevant token sequences. The <b>Super-Resolution Flow-Matching Model</b>, based on flow matching, maps the generated tokens to latent features with high-resolution, fine-grained acoustic details<a href="https://arxiv.org/abs/2305.02765" target="_blank"><sup>[3]</sup></a> obtained from higher-sampling-rate audio, ensuring that the acoustic information flows through the models with high fidelity. A vocoder then generates the final audio waveform from these enhanced latent features. InspireMusic supports a range of tasks, including text-to-music, music continuation, music reconstruction, and music super-resolution.
43
+ </td></tr></table></p>
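To give a concrete sense of the token rates involved: the tokenizer table in this model card lists WavTokenizer at 75Hz for 24kHz audio and the HifiCodec-based music tokenizer at 150Hz for 48kHz audio, i.e. both compress 320 audio samples into one token. A small illustrative calculation (plain Python, no InspireMusic dependency; the helper names are made up for this sketch):

```python
# Token-rate arithmetic for the two tokenizers listed in the model table:
# 24 kHz audio -> 75 Hz tokens (WavTokenizer), 48 kHz audio -> 150 Hz tokens
# (HifiCodec-based music tokenizer). Both come out to 320 samples per token.

def samples_per_token(sample_rate_hz: int, token_rate_hz: int) -> int:
    """How many raw audio samples each discrete token covers."""
    return sample_rate_hz // token_rate_hz

def tokens_for_duration(seconds: float, token_rate_hz: int) -> int:
    """Sequence length the autoregressive transformer must generate."""
    return int(seconds * token_rate_hz)

print(samples_per_token(24_000, 75))    # 320
print(samples_per_token(48_000, 150))   # 320
print(tokens_for_duration(30.0, 75))    # a 30 s clip at 24 kHz -> 2250 tokens
print(tokens_for_duration(300.0, 75))   # a 5 min long-form clip -> 22500 tokens
```

This is why long-form generation (more than 5 minutes, per the model table) implies token sequences in the tens of thousands for the autoregressive model.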
44
 
45
+ <a name="installation"></a>
46
  ## Installation
 
47
  ### Clone
 
48
  - Clone the repo
49
  ``` sh
50
  git clone --recursive https://github.com/FunAudioLLM/InspireMusic.git
51
  # If you failed to clone submodule due to network failures, please run the following command until success
52
  cd InspireMusic
53
+ git submodule update --init --recursive
54
+ # or you can download the third_party repo Matcha-TTS manually
55
+ cd third_party && git clone https://github.com/shivammehta25/Matcha-TTS.git
56
  ```
57
 
58
+ ### Install from Source
59
+ InspireMusic requires Python>=3.8, PyTorch>=2.0.1, flash attention==2.6.2/2.6.3, CUDA>=11.2. You can install the dependencies with the following commands:
60
 
61
  - Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
62
  - Create Conda env:
63
+ ``` shell
64
  conda create -n inspiremusic python=3.8
65
  conda activate inspiremusic
66
  cd InspireMusic
67
  # pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platforms.
68
  conda install -y -c conda-forge pynini==2.1.5
69
  pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
70
+ # install flash attention to speedup training
71
  pip install flash-attn --no-build-isolation
72
  ```
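The stated minimums (Python>=3.8, PyTorch>=2.0.1, CUDA>=11.2) can be sanity-checked before installing. Here is a small sketch using only the standard library; the version numbers come from this README, but the helper itself is hypothetical, not part of the InspireMusic repo:

```python
# Sanity-check the stated minimums (Python>=3.8, PyTorch>=2.0.1, CUDA>=11.2)
# before installing. Pure-stdlib sketch; hypothetical helper functions.
import sys

def version_tuple(v: str) -> tuple:
    """'2.0.1' -> (2, 0, 1)"""
    return tuple(int(part) for part in v.split("."))

def meets_minimum(found: str, minimum: str) -> bool:
    """Numeric comparison of dotted version strings, zero-padded to equal length."""
    f, m = version_tuple(found), version_tuple(minimum)
    width = max(len(f), len(m))
    return f + (0,) * (width - len(f)) >= m + (0,) * (width - len(m))

python_version = ".".join(str(n) for n in sys.version_info[:3])
print(meets_minimum(python_version, "3.8"))  # True on a supported interpreter
print(meets_minimum("2.0.1", "2.0.1"))       # True: PyTorch minimum met exactly
print(meets_minimum("11.1", "11.2"))         # False: CUDA too old
```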
 
73
 
74
  - Install within the package:
75
+ ```shell
76
  cd InspireMusic
77
  # You can run to install the packages
78
  python setup.py install
79
  pip install flash-attn --no-build-isolation
80
  ```
 
81
  We also recommend having `sox` or `ffmpeg` installed, either through your system or Anaconda:
82
+ ```shell
83
  # # Install sox
84
  # ubuntu
85
  sudo apt-get install sox libsox-dev
 
93
  sudo yum install ffmpeg
94
  ```
95
 
96
+ ### Use Docker
97
+ Run the following command to build a docker image from the provided Dockerfile.
98
+ ```shell
99
+ docker build -t inspiremusic .
100
+ ```
101
+ Run the following command to start the docker container in interactive mode.
102
+ ```shell
103
+ docker run -ti --gpus all -v .:/workspace/InspireMusic inspiremusic
104
+ ```
105
+
106
+ ### Use Docker Compose
107
+ Run the following command to build the docker image and start a docker compose environment from the docker-compose.yml file.
108
+ ```shell
109
+ docker compose up -d --build
110
+ ```
111
+ Run the following command to attach to the docker container in interactive mode.
112
+ ```shell
113
+ docker exec -ti inspire-music bash
114
+ ```
115
 
116
+ <a name="quick-start"></a>
117
+ ### Quick Start
118
  Here is a quick example inference script for music generation.
119
+ ``` shell
120
  cd InspireMusic
121
  mkdir -p pretrained_models
122
 
 
128
 
129
  cd examples/music_generation
130
  # run a quick inference example
131
+ sh infer_1.5b_long.sh
132
  ```
133
 
134
  Here is a quick-start script to run the music generation task, including the data preparation pipeline, model training, and inference.
135
+ ``` shell
136
  cd InspireMusic/examples/music_generation/
137
+ sh run.sh
138
  ```
139
 
140
  ### One-line Inference
141
  #### Text-to-music Task
 
142
  One-line Shell script for text-to-music task.
143
+ ``` shell
144
  cd examples/music_generation
145
+ # with flow matching, use one-line command to get a quick try
 
146
  python -m inspiremusic.cli.inference
147
 
148
  # customize the config with a one-line command like the following
149
  python -m inspiremusic.cli.inference --task text-to-music -m "InspireMusic-1.5B-Long" -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." -c intro -s 0.0 -e 30.0 -r "exp/inspiremusic" -o output -f wav
150
 
151
+ # without flow matching, use one-line command to get a quick try
152
  python -m inspiremusic.cli.inference --task text-to-music -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." --fast True
153
  ```
154
 
155
  Alternatively, you can run the inference with just a few lines of Python code.
156
  ```python
157
+ from inspiremusic.cli.inference import InspireMusic
158
+ from inspiremusic.cli.inference import env_variables
159
  if __name__ == "__main__":
160
+ env_variables()
161
+ model = InspireMusic(model_name="InspireMusic-Base")
162
  model.inference("text-to-music", "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance.")
163
  ```

#### Music Continuation Task

A one-line shell script for the music continuation task.
```shell
cd examples/music_generation
# with flow matching
python -m inspiremusic.cli.inference --task continuation -g 0 -a audio_prompt.wav
```

Alternatively, you can run the inference with just a few lines of Python code.
```python
from inspiremusic.cli.inference import InspireMusic
from inspiremusic.cli.inference import env_variables

if __name__ == "__main__":
    env_variables()
    model = InspireMusic(model_name="InspireMusic-Base")
    # just use the audio prompt
    model.inference("continuation", None, "audio_prompt.wav")
    # use both the text prompt and the audio prompt
    model.inference("continuation", "Continue to generate jazz music.", "audio_prompt.wav")
```

<a name="model-zoo"></a>
## Models
### Download Models
You may download our pretrained InspireMusic models for music generation.
```shell
# use git to download the models; please make sure git lfs is installed
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/InspireMusic.git pretrained_models/InspireMusic
```
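
As an alternative to git, the models hosted on the Hugging Face hub can be fetched with the `huggingface_hub` library. This is a minimal sketch, assuming the package is installed; `model_repo_and_dir` and `download_model` are illustrative helper names, and the repo ids follow the `FunAudioLLM/<model-name>` pattern used in the table below.

```python
def model_repo_and_dir(model_name: str) -> tuple[str, str]:
    """Map a model name to its Hugging Face repo id and a local target directory."""
    return f"FunAudioLLM/{model_name}", f"pretrained_models/{model_name}"

def download_model(model_name: str) -> str:
    """Download a model snapshot from the Hugging Face hub; returns the local path."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    repo_id, local_dir = model_repo_and_dir(model_name)
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# e.g. download_model("InspireMusic-1.5B-Long")
```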

### Available Models
Currently, we open-source music generation models that support 24kHz mono and 48kHz stereo audio.
The table below lists the model links on the ModelScope and Hugging Face model hubs.

| Model name | Model Links | Remarks |
|---|---|---|
| InspireMusic-Base-24kHz | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-Base-24kHz/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-Base-24kHz) | Pre-trained Music Generation Model, 24kHz mono, 30s |
| InspireMusic-Base | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-Base) | Pre-trained Music Generation Model, 48kHz, 30s |
| InspireMusic-1.5B-24kHz | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-24kHz/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-24kHz) | Pre-trained Music Generation 1.5B Model, 24kHz mono, 30s |
| InspireMusic-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B) | Pre-trained Music Generation 1.5B Model, 48kHz, 30s |
| InspireMusic-1.5B-Long | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long) | Pre-trained Music Generation 1.5B Model, 48kHz, supports long-form music generation up to several minutes |
| InspireSong-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)]() [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)]() | Pre-trained Song Generation 1.5B Model, 48kHz stereo |
| InspireAudio-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)]() [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)]() | Pre-trained Audio Generation 1.5B Model, 48kHz stereo |
| Wavtokenizer[<sup>[1]</sup>](https://openreview.net/forum?id=yBlVlS2Fd9) (75Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/file/view/master?fileName=wavtokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long/tree/main/wavtokenizer) | An extremely low-bitrate audio tokenizer for music with one codebook, for 24kHz audio. |
| Music_tokenizer (75Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-24kHz/file/view/master?fileName=music_tokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-24kHz/tree/main/music_tokenizer) | A music tokenizer based on HifiCodec<sup>[3]</sup> for 24kHz audio. |
| Music_tokenizer (150Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-green.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/file/view/master?fileName=music_tokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-green.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long/tree/main/music_tokenizer) | A music tokenizer based on HifiCodec<sup>[3]</sup> for 48kHz audio. |


<a name="tutorial"></a>
## Basic Usage
At the moment, InspireMusic contains the training and inference code for [music generation](https://github.com/FunAudioLLM/InspireMusic/tree/main/examples/music_generation).

### Training
Here is an example of training the LLM model; BF16/FP16 training is supported.
```shell
torchrun --nnodes=1 --nproc_per_node=8 \
    --rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    inspiremusic/bin/train.py \

```

Here is an example of training the flow-matching model; FP16 training is not supported.
```shell
torchrun --nnodes=1 --nproc_per_node=8 \
    --rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    inspiremusic/bin/train.py \

```

### Inference

Here is an example script to quickly run model inference.
```shell
cd InspireMusic/examples/music_generation/
sh infer.sh
```
 

Here is an example of running inference in normal mode, i.e., with the flow-matching model, for the text-to-music and music continuation tasks.
```shell
pretrained_model_dir="pretrained_models/InspireMusic/"
for task in 'text-to-music' 'continuation'; do
  python inspiremusic/bin/inference.py --task $task \
      --gpu 0 \
      --music_tokenizer $pretrained_model_dir/music_tokenizer \
      --wavtokenizer $pretrained_model_dir/wavtokenizer \
      --result_dir `pwd`/exp/inspiremusic/${task}_test \
      --chorus verse
done
```
 

Here is an example of running inference in fast mode, i.e., without the flow-matching model, for the text-to-music and music continuation tasks.
```shell
pretrained_model_dir="pretrained_models/InspireMusic/"
for task in 'text-to-music' 'continuation'; do
  python inspiremusic/bin/inference.py --task $task \
      --gpu 0 \
      --wavtokenizer $pretrained_model_dir/wavtokenizer \
      --result_dir `pwd`/exp/inspiremusic/${task}_test \
      --chorus verse \
      --fast
done
```

### Hardware Requirements
In previous tests on an H800 GPU, InspireMusic generated 30 seconds of audio with a real-time factor (RTF) of around 1.6~1.8. For normal mode, we recommend hardware with at least 24GB of GPU memory for a better experience. For fast mode, 12GB of GPU memory is enough.
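
The RTF figure above translates into a rough wall-clock estimate. Here is a back-of-the-envelope sketch, assuming RTF is defined as generation time divided by generated audio duration; the helper is illustrative and not part of the toolkit.

```python
def estimated_generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of audio at a given RTF."""
    return audio_seconds * rtf

# 30 s of audio at RTF 1.6~1.8 takes roughly 48~54 s to generate
low = estimated_generation_seconds(30.0, 1.6)
high = estimated_generation_seconds(30.0, 1.8)
print(f"estimated generation time: {low:.0f}~{high:.0f} s")
```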

## Citation
```bibtex
@misc{InspireMusic2025,
  title={InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation},
  author={Chong Zhang and Yukun Ma and Qian Chen and Wen Wang and Shengkui Zhao and Zexu Pan and Hao Wang and Chongjia Ni and Trung Hieu Nguyen and Kun Zhou and Yidi Jiang and Chaohong Tan and Zhifu Gao and Zhihao Du and Bin Ma},
  year={2025},
  eprint={2503.00084},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2503.00084},
}
```

---
## Disclaimer
The content provided above is for research purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.