<p align="center"> <a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank"> <img alt="logo" src="./asset/logo.png" width="100%"></a></p>

<p align="center">
<a href="https://funaudiollm.github.io/inspiremusic" target="_blank"><img alt="Demo" src="https://img.shields.io/badge/Demo-InspireMusic?labelColor=%20%23FDB062&label=InspireMusic&color=%20%23f79009"></a>
<a href="https://github.com/FunAudioLLM/InspireMusic" target="_blank"><img alt="Code" src="https://img.shields.io/badge/Code-InspireMusic?labelColor=%20%237372EB&label=InspireMusic&color=%20%235462eb"></a>
<a href="https://modelscope.cn/models/iic/InspireMusic" target="_blank"><img alt="Model" src="https://img.shields.io/badge/InspireMusic-Model-green"></a>
<a href="https://modelscope.cn/studios/iic/InspireMusic/summary" target="_blank"><img alt="Space" src="https://img.shields.io/badge/Spaces-ModelScope-pink?labelColor=%20%237b8afb&label=Spaces&color=%20%230a5af8"></a>
<a href="https://huggingface.co/spaces/FunAudioLLM/InspireMusic" target="_blank"><img alt="Space" src="https://img.shields.io/badge/HuggingFace-Spaces?labelColor=%20%239b8afb&label=Spaces&color=%20%237a5af8"></a>
<a href="http://arxiv.org/abs/2503.00084" target="_blank"><img alt="Paper" src="https://img.shields.io/badge/arXiv-Paper-green"></a>
</p>

![GitHub Repo stars](https://img.shields.io/github/stars/FunAudioLLM/InspireMusic) Please support our community by starring it. Thank you all for your support!

[**Highlights**](#highlights)
| [**Introduction**](#introduction)
| [**Installation**](#installation)
| [**Quick Start**](#quick-start)
| [**Tutorial**](https://github.com/FunAudioLLM/InspireMusic#tutorial)
| [**Models**](#model-zoo)
| [**Contact**](#contact)

---
<a name="highlights"></a>
## Highlights
**InspireMusic** focuses on music generation, song generation, and audio generation.
- A unified toolkit designed for music, song, and audio generation.
- Music generation tasks with high audio quality.
- Long-form music generation.

<a name="introduction"></a>
## Introduction
> [!Note]
> This repo contains the algorithm infrastructure and some simple examples. Currently, only English text prompts are supported.

> [!Tip]
> To preview the performance, please refer to the [InspireMusic Demo Page](https://funaudiollm.github.io/inspiremusic).

InspireMusic is a unified framework for music, song, and audio generation that combines audio tokenization with an autoregressive transformer and a flow-matching based model. The toolkit is intended to empower users to craft soundscapes and explore music, song, and audio generation in research, and it provides both training and inference code for AI-based generative models that create high-quality music. By incorporating audio tokenizers with autoregressive transformer and super-resolution flow-matching modeling, InspireMusic allows controllable generation of music, song, and audio with both text and audio prompts. The toolkit currently supports music generation; song generation and audio generation will be supported in the future.

## InspireMusic
<p align="center"><table><tr><td style="text-align:center;"><img alt="Light" src="asset/InspireMusic.png" width="100%" /></td></tr><tr><td style="text-align:center;">
Figure 1: An overview of the InspireMusic framework. We introduce InspireMusic, a unified framework for music, song, and audio generation capable of producing high-quality long-form audio. InspireMusic consists of three key components. <b>Audio Tokenizers</b> convert the raw audio waveform into discrete audio tokens that the autoregressive transformer model can process and be trained on efficiently; audio at a lower sampling rate is converted to discrete tokens via a high-bitrate compression audio tokenizer<a href="https://openreview.net/forum?id=yBlVlS2Fd9" target="_blank"><sup>[1]</sup></a>. The <b>Autoregressive Transformer</b> model uses Qwen2.5<a href="https://arxiv.org/abs/2412.15115" target="_blank"><sup>[2]</sup></a> as its backbone and is trained with a next-token prediction approach on both text and audio tokens, enabling it to generate coherent and contextually relevant token sequences. The <b>Super-Resolution Flow-Matching Model</b>, based on flow modeling, maps the generated tokens to latent features with high-resolution, fine-grained acoustic details<a href="https://arxiv.org/abs/2305.02765" target="_blank"><sup>[3]</sup></a> obtained from audio at a higher sampling rate, so that acoustic information flows between the models with high fidelity. A vocoder then generates the final audio waveform from these enhanced latent features. InspireMusic supports a range of tasks including text-to-music, music continuation, music reconstruction, and music super-resolution.
</td></tr></table></p>
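
To make the component flow concrete, below is a minimal, runnable sketch of the four stages in Figure 1. Every class here is a toy stand-in, not the actual InspireMusic API; it only illustrates the order of operations (tokenize, autoregressive generation, flow-matching reconstruction, vocoding).

```python
# Toy stand-ins for the four stages in Figure 1; NOT the real InspireMusic API.
from typing import List


class ToyAudioTokenizer:
    """Stands in for the audio tokenizer: prompt -> discrete tokens."""
    def encode(self, prompt: str) -> List[int]:
        return [ord(c) % 128 for c in prompt]


class ToyTransformer:
    """Stands in for the Qwen2.5-based AR model: next-token prediction."""
    def generate(self, tokens: List[int], length: int = 16) -> List[int]:
        last = tokens[-1] if tokens else 0
        return [(last + i) % 128 for i in range(length)]


class ToyFlowMatching:
    """Stands in for super-resolution flow matching: tokens -> latents."""
    def reconstruct(self, music_tokens: List[int]) -> List[float]:
        return [t / 127.0 for t in music_tokens]


class ToyVocoder:
    """Stands in for the vocoder: latents -> waveform samples."""
    def decode(self, latents: List[float]) -> List[float]:
        return [2.0 * x - 1.0 for x in latents]


def generate_music(prompt: str) -> List[float]:
    tokens = ToyAudioTokenizer().encode(prompt)             # 1. tokenize the prompt
    music_tokens = ToyTransformer().generate(tokens)        # 2. autoregressive generation
    latents = ToyFlowMatching().reconstruct(music_tokens)   # 3. flow-matching reconstruction
    return ToyVocoder().decode(latents)                     # 4. vocode to a waveform


print(generate_music("relaxing bossa nova jazz")[:4])
```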

<a name="installation"></a>
## Installation
### Clone
- Clone the repo
```shell
git clone --recursive https://github.com/FunAudioLLM/InspireMusic.git
# If you fail to clone the submodules due to network failures, run the following command until it succeeds
cd InspireMusic
git submodule update --init --recursive
# or you can download the third_party repo Matcha-TTS manually
cd third_party && git clone https://github.com/shivammehta25/Matcha-TTS.git
```

### Install from Source
InspireMusic requires Python >= 3.8, PyTorch >= 2.0.1, flash-attn 2.6.2/2.6.3, and CUDA >= 11.2. You can install the dependencies with the following commands:

- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- Create Conda env:
```shell
conda create -n inspiremusic python=3.8
conda activate inspiremusic
cd InspireMusic
# pynini is required by WeTextProcessing; use conda to install it, as it can be installed on all platforms.
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# install flash attention to speed up training
pip install flash-attn --no-build-isolation
```

- Install within the package:
```shell
cd InspireMusic
# Run the following to install the package
python setup.py install
pip install flash-attn --no-build-isolation
```
We also recommend having `sox` or `ffmpeg` installed, either through your system or Anaconda:
```shell
# Install sox
# ubuntu
sudo apt-get install sox libsox-dev

# Install ffmpeg
# ubuntu
sudo apt-get install ffmpeg
# centos
sudo yum install ffmpeg
```

### Use Docker
Run the following command to build a docker image from the Dockerfile provided.
```shell
docker build -t inspiremusic .
```
Run the following command to start the docker container in interactive mode.
```shell
docker run -ti --gpus all -v .:/workspace/InspireMusic inspiremusic
```

### Use Docker Compose
Run the following command to build a docker compose environment and docker image from the docker-compose.yml file.
```shell
docker compose up -d --build
```
Run the following command to attach to the docker container in interactive mode.
```shell
docker exec -ti inspire-music bash
```

<a name="quick-start"></a>
### Quick Start
Here is a quick example inference script for music generation.
```shell
cd InspireMusic
mkdir -p pretrained_models
git clone https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long.git pretrained_models/InspireMusic-1.5B-Long

cd examples/music_generation
# run a quick inference example
sh infer_1.5b_long.sh
```

Here is a quick start script that runs the music generation task end to end, including the data preparation pipeline, model training, and inference.
```shell
cd InspireMusic/examples/music_generation/
sh run.sh
```

### One-line Inference
#### Text-to-music Task
One-line Shell script for the text-to-music task.
```shell
cd examples/music_generation
# with flow matching, use a one-line command for a quick try
python -m inspiremusic.cli.inference

# customize the config with a one-line command like the following
python -m inspiremusic.cli.inference --task text-to-music -m "InspireMusic-1.5B-Long" -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." -c intro -s 0.0 -e 30.0 -r "exp/inspiremusic" -o output -f wav

# without flow matching, use a one-line command for a quick try
python -m inspiremusic.cli.inference --task text-to-music -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." --fast True
```

Alternatively, you can run the inference with just a few lines of Python code.
```python
from inspiremusic.cli.inference import InspireMusic, env_variables

if __name__ == "__main__":
    env_variables()
    model = InspireMusic(model_name="InspireMusic-Base")
    model.inference("text-to-music", "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance.")
```
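
As a small extension, a sketch that reuses only the API shown above (no additional arguments are assumed) loads the model once and loops over several prompts:

```python
from inspiremusic.cli.inference import InspireMusic, env_variables

# A few example prompts; any descriptive English text prompt should work.
prompts = [
    "Uplifting acoustic folk with warm guitars, ideal for a morning drive.",
    "Slow, atmospheric ambient pads with soft piano, for late-night focus.",
]

if __name__ == "__main__":
    env_variables()
    # Load the model once and reuse it for every prompt.
    model = InspireMusic(model_name="InspireMusic-Base")
    for prompt in prompts:
        model.inference("text-to-music", prompt)
```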

#### Music Continuation Task
One-line Shell script for the music continuation task.
```shell
cd examples/music_generation
# with flow matching
python -m inspiremusic.cli.inference --task continuation -g 0 -a audio_prompt.wav
```

Alternatively, you can run the inference with just a few lines of Python code.
```python
from inspiremusic.cli.inference import InspireMusic, env_variables

if __name__ == "__main__":
    env_variables()
    model = InspireMusic(model_name="InspireMusic-Base")
    # just use audio prompt
    model.inference("continuation", None, "audio_prompt.wav")
    # use both text prompt and audio prompt
    model.inference("continuation", "Continue to generate jazz music.", "audio_prompt.wav")
```
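Note that, as the calls above show, `model.inference` for continuation takes the text prompt as its second argument and the audio prompt path as its third, so passing `None` for the text runs audio-only continuation.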
<a name="model-zoo"></a>
## Models
### Download Models
You may download our pretrained InspireMusic models for music generation.
```shell
# use git to download models; please make sure git lfs is installed
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/InspireMusic.git pretrained_models/InspireMusic
```
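
If you prefer the Hugging Face Hub, the standard `huggingface_hub` client can fetch a snapshot instead (a sketch using the `FunAudioLLM/InspireMusic-1.5B-Long` repo id from the table below):

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download InspireMusic-1.5B-Long into pretrained_models/.
snapshot_download(
    repo_id="FunAudioLLM/InspireMusic-1.5B-Long",
    local_dir="pretrained_models/InspireMusic-1.5B-Long",
)
```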

### Available Models
Currently, we open-source music generation models that support 24kHz mono and 48kHz stereo audio.
The table below presents the links to the ModelScope and Huggingface model hub.

| Model name | Model Links | Remarks |
|---|---|---|
| InspireMusic-Base-24kHz | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)](https://modelscope.cn/models/iic/InspireMusic-Base-24kHz/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-Base-24kHz) | Pre-trained Music Generation Model, 24kHz mono, 30s |
| InspireMusic-Base | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)](https://modelscope.cn/models/iic/InspireMusic/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-Base) | Pre-trained Music Generation Model, 48kHz, 30s |
| InspireMusic-1.5B-24kHz | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-24kHz/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-24kHz) | Pre-trained Music Generation 1.5B Model, 24kHz mono, 30s |
| InspireMusic-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B) | Pre-trained Music Generation 1.5B Model, 48kHz, 30s |
| InspireMusic-1.5B-Long | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/summary) [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long) | Pre-trained Music Generation 1.5B Model, 48kHz, supports long-form music generation up to several minutes |
| InspireSong-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)]() [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)]() | Pre-trained Song Generation 1.5B Model, 48kHz stereo |
| InspireAudio-1.5B | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)]() [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)]() | Pre-trained Audio Generation 1.5B Model, 48kHz stereo |
| Wavtokenizer[<sup>[1]</sup>](https://openreview.net/forum?id=yBlVlS2Fd9) (75Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/file/view/master?fileName=wavtokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long/tree/main/wavtokenizer) | An extreme low-bitrate audio tokenizer for music with one codebook at 24kHz audio. |
| Music_tokenizer (75Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-24kHz/file/view/master?fileName=music_tokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-24kHz/tree/main/music_tokenizer) | A music tokenizer based on HifiCodec<sup>[3]</sup> at 24kHz audio. |
| Music_tokenizer (150Hz) | [![model](https://img.shields.io/badge/ModelScope-Model-lightgrey.svg)](https://modelscope.cn/models/iic/InspireMusic-1.5B-Long/file/view/master?fileName=music_tokenizer%252Fmodel.pt) [![model](https://img.shields.io/badge/HuggingFace-Model-lightgrey.svg)](https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long/tree/main/music_tokenizer) | A music tokenizer based on HifiCodec<sup>[3]</sup> at 48kHz audio. |
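
As back-of-envelope arithmetic (ours, not from the repo), these frame rates set the sequence lengths the models work with:

```python
# Discrete-token / latent-frame counts for a 30-second clip at each rate.
for name, rate_hz in [("Wavtokenizer", 75), ("Music_tokenizer (48kHz)", 150)]:
    print(f"{name}: {rate_hz}/s x 30s = {rate_hz * 30} frames")
# Wavtokenizer: 75/s x 30s = 2250 frames
# Music_tokenizer (48kHz): 150/s x 30s = 4500 frames
```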

<a name="tutorial"></a>
## Basic Usage
At the moment, InspireMusic contains the training and inference code for [music generation](https://github.com/FunAudioLLM/InspireMusic/tree/main/examples/music_generation).

### Training
Here is an example of training the LLM model, which supports BF16/FP16 training.
```shell
torchrun --nnodes=1 --nproc_per_node=8 \
    --rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    inspiremusic/bin/train.py \
```

Here is an example of training the flow-matching model, which does not support FP16 training.
```shell
torchrun --nnodes=1 --nproc_per_node=8 \
    --rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    inspiremusic/bin/train.py \
```
### Inference

Here is an example script to quickly do model inference.
```shell
cd InspireMusic/examples/music_generation/
sh infer.sh
```
Here is an example of running inference in normal mode, i.e., with the flow-matching model, for the text-to-music and music continuation tasks.
```shell
pretrained_model_dir="pretrained_models/InspireMusic/"
for task in 'text-to-music' 'continuation'; do
  python inspiremusic/bin/inference.py --task $task \
      --gpu 0 \
      --music_tokenizer $pretrained_model_dir/music_tokenizer \
      --wavtokenizer $pretrained_model_dir/wavtokenizer \
      --result_dir `pwd`/exp/inspiremusic/${task}_test \
      --chorus verse
done
```
Here is an example of running inference in fast mode, i.e., without the flow-matching model, for the text-to-music and music continuation tasks.
```shell
pretrained_model_dir="pretrained_models/InspireMusic/"
for task in 'text-to-music' 'continuation'; do
  python inspiremusic/bin/inference.py --task $task \
      --gpu 0 \
      --wavtokenizer $pretrained_model_dir/wavtokenizer \
      --result_dir `pwd`/exp/inspiremusic/${task}_test \
      --chorus verse \
      --fast
done
```
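Fast mode skips the super-resolution flow-matching stage shown in Figure 1, which lowers latency and GPU memory requirements (see Hardware Requirements below) at the cost of the fine-grained acoustic detail that stage provides.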

### Hardware Requirements
In previous tests on an H800 GPU, InspireMusic could generate 30 seconds of audio with a real-time factor (RTF) of around 1.6~1.8. For normal mode, we recommend hardware with at least 24GB of GPU memory for a better experience; for fast mode, 12GB of GPU memory is enough.
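
For a rough wall-clock estimate from the reported RTF (simple arithmetic, assuming RTF = compute time / audio duration):

```python
# Estimate generation time for a 30-second clip on an H800 from the reported RTF.
audio_seconds = 30.0
for rtf in (1.6, 1.8):
    print(f"RTF {rtf}: ~{audio_seconds * rtf:.0f}s of compute for {audio_seconds:.0f}s of audio")
# RTF 1.6: ~48s of compute for 30s of audio
# RTF 1.8: ~54s of compute for 30s of audio
```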

## Citation
```bibtex
@misc{InspireMusic2025,
      title={InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation},
      author={Chong Zhang and Yukun Ma and Qian Chen and Wen Wang and Shengkui Zhao and Zexu Pan and Hao Wang and Chongjia Ni and Trung Hieu Nguyen and Kun Zhou and Yidi Jiang and Chaohong Tan and Zhifu Gao and Zhihao Du and Bin Ma},
      year={2025},
      eprint={2503.00084},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.00084},
}
```

---
## Disclaimer
The content provided above is for research purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.