Sayoyo committed on
Commit
365c1d1
·
1 Parent(s): 71922e7

[feat] add README

README.md CHANGED
@@ -1,12 +1,230 @@
1
  ---
2
- title: ACE Step
3
- emoji: 😻
4
- colorFrom: blue
5
- colorTo: pink
6
- sdk: gradio
7
- sdk_version: 5.27.0
8
- app_file: app.py
9
- pinned: false
10
- ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ <h1 align="center">✨ ACE-Step ✨</h1>
2
+ <h1 align="center">🎵 A Step Towards Music Generation Foundation Model 🎵</h1>
3
+ <p align="center">
4
+ <a href="https://ace-step.github.io/">Project</a> |
5
+ <a href="https://github.com/ace-step/ACE-Step">Code</a> |
6
+ <a href="https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B">Checkpoints</a> |
7
+ <a href="https://huggingface.co/spaces/ACE-Step/ACE-Step">Space Demo</a>
8
+ </p>
9
+
10
  ---
11
+ <p align="center">
12
+ <img src="./fig/orgnization_logos.png" width="100%" alt="Org Logo">
13
+ </p>
14
+
15
+ ## Table of Contents
16
+
17
+ - [Features](#-features)
18
+ - [Installation](#-installation)
19
+ - [Usage](#-user-interface-guide)
20
+
21
+ ## 📢 News and Updates
22
+
23
+ - 🚀 2025.05.06: Open-sourced the demo code and model
24
+
25
+ ## TODOs 📋
26
+ - [ ] ๐Ÿ” Release training code
27
+ - [ ] 🔄 Release LoRA training code & 🎤 RapMachine LoRA
28
+ - [ ] 🎮 Release ControlNet training code & 🎤 Singing2Accompaniment ControlNet
29
+
30
+ ## 🏗️ Architecture
31
+
32
+ <p align="center">
33
+ <img src="./fig/ACE-Step_framework.png" width="100%" alt="ACE-Step Framework">
34
+ </p>
35
+
36
+
37
+ ## ๐Ÿ“ Abstract
38
+
39
+ We introduce ACE-Step, a novel open-source foundation model for music generation that overcomes key limitations of existing approaches and achieves state-of-the-art performance through a holistic architectural design. Current methods face inherent trade-offs between generation speed, musical coherence, and controllability. For instance, LLM-based models (e.g., Yue, SongGen) excel at lyric alignment but suffer from slow inference and structural artifacts. Diffusion models (e.g., DiffRhythm), on the other hand, enable faster synthesis but often lack long-range structural coherence.
40
+
41
+ ACE-Step bridges this gap by integrating diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer. It further leverages MERT and m-hubert to align semantic representations (REPA) during training, enabling rapid convergence. As a result, our model synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU (15× faster than LLM-based baselines) while achieving superior musical coherence and lyric alignment across melody, harmony, and rhythm metrics. Moreover, ACE-Step preserves fine-grained acoustic details, enabling advanced control mechanisms such as voice cloning, lyric editing, remixing, and track generation (e.g., lyric2vocal, singing2accompaniment).
42
+
43
+ Rather than building yet another end-to-end text-to-music pipeline, our vision is to establish a foundation model for music AI: a fast, general-purpose, efficient yet flexible architecture that makes it easy to train sub-tasks on top of it. This paves the way for developing powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. In short, we aim to build the Stable Diffusion moment for music.
44
+
45
+ ## ✨ Features
46
+
47
+ <p align="center">
48
+ <img src="./fig/application_map.png" width="100%" alt="ACE-Step Application Map">
49
+ </p>
50
+
51
+ ### 🎯 Baseline Quality
52
+
53
+ #### 🌈 Diverse Styles & Genres
54
+ - 🎸 Supports all mainstream music styles with various description formats, including short tags, descriptive text, or use-case scenarios
55
+ - 🎷 Capable of generating music across different genres with appropriate instrumentation and style
56
+
57
+ #### ๐ŸŒ Multiple Languages
58
+ - 🗣️ Supports 19 languages; the 10 best-performing are:
59
+ - 🇺🇸 English, 🇨🇳 Chinese, 🇷🇺 Russian, 🇪🇸 Spanish, 🇯🇵 Japanese, 🇩🇪 German, 🇫🇷 French, 🇵🇹 Portuguese, 🇮🇹 Italian, 🇰🇷 Korean
60
+ - ⚠️ Due to data imbalance, less common languages may underperform
61
+
62
+ #### 🎻 Instrumental Styles
63
+ - 🎹 Supports instrumental music generation across different genres and styles
64
+ - 🎺 Capable of producing realistic instrumental tracks with appropriate timbre and expression for each instrument
65
+ - 🎼 Can generate complex arrangements with multiple instruments while maintaining musical coherence
66
+
67
+ #### 🎤 Vocal Techniques
68
+ - 🎙️ Capable of rendering various vocal styles and techniques with good quality
69
+ - 🗣️ Supports different vocal expressions, including various singing techniques and styles
70
+
71
+ ### 🎛️ Controllability
72
+
73
+ #### 🔄 Variations Generation
74
+ - ⚙️ Implemented using training-free, inference-time optimization techniques
75
+ - 🌊 The flow-matching model starts from an initial noise sample; trigFlow's noise formula is then used to blend in additional Gaussian noise
76
+ - 🎚️ The mixing ratio between the original initial noise and the new Gaussian noise is adjustable and controls the degree of variation
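
The noise-mixing idea above can be sketched in a few lines. This is only an illustration, not the project's implementation: it assumes the trigFlow noise formula is the variance-preserving interpolation `cos(theta) * original + sin(theta) * fresh` with `theta = variance * pi / 2`, and the names `mix_noise`, `gaussian_noise`, and `variance` are hypothetical.

```python
import math
import random


def gaussian_noise(n: int, seed: int) -> list[float]:
    """Draw n standard-normal samples from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def mix_noise(original: list[float], fresh: list[float], variance: float) -> list[float]:
    """Blend two noise vectors (assumed trigFlow-style cos/sin mixing).

    variance = 0.0 returns the original noise unchanged;
    variance = 1.0 returns (numerically almost exactly) the fresh noise.
    """
    if not 0.0 <= variance <= 1.0:
        raise ValueError("variance must be in [0, 1]")
    theta = variance * math.pi / 2
    # cos^2 + sin^2 = 1, so independent unit-variance inputs stay unit-variance.
    return [math.cos(theta) * o + math.sin(theta) * f for o, f in zip(original, fresh)]


if __name__ == "__main__":
    original = gaussian_noise(4, seed=0)  # noise used for the first generation
    fresh = gaussian_noise(4, seed=1)     # new noise for the retake
    print(mix_noise(original, fresh, variance=0.2))
```

A small `variance` keeps the retake close to the original noise, and therefore presumably close to the original output, while larger values move it toward an independent sample.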
77
+
78
+ #### 🎨 Repainting
79
+ - 🖌️ Implemented by adding noise to the target audio input and applying mask constraints during the ODE process
80
+ - ๐Ÿ” When input conditions change from the original generation, only specific aspects can be modified while preserving the rest
81
+ - 🔀 Can be combined with Variations Generation techniques to create localized variations in style, lyrics, or vocals
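
As a rough sketch of what a mask constraint during sampling might look like: at each step, frames inside the repaint window take the newly generated values, and frames outside it are reset to the re-noised target audio. Everything here is an assumption for illustration (the frame-level representation, the helper names `time_range_mask` and `apply_repaint_mask`, and the 0/1 mask); the README does not specify the actual mechanism.

```python
def time_range_mask(num_frames: int, total_seconds: float,
                    start: float, end: float) -> list[float]:
    """Return 1.0 for frames whose start time lies in [start, end), else 0.0."""
    frame_seconds = total_seconds / num_frames
    return [1.0 if start <= i * frame_seconds < end else 0.0
            for i in range(num_frames)]


def apply_repaint_mask(generated: list[float], noised_target: list[float],
                       mask: list[float]) -> list[float]:
    """Keep generated values where mask is 1 and target values where it is 0."""
    return [m * g + (1.0 - m) * t
            for g, t, m in zip(generated, noised_target, mask)]


if __name__ == "__main__":
    # Repaint seconds 2 to 5 of a 10-second clip represented by 10 frames.
    mask = time_range_mask(num_frames=10, total_seconds=10.0, start=2.0, end=5.0)
    generated = [1.0] * 10      # stand-in for the sampler's current state
    noised_target = [0.0] * 10  # stand-in for the re-noised source audio
    print(apply_repaint_mask(generated, noised_target, mask))
```

In a real sampler this blend would be applied after every ODE step, with the target noised to the level that matches the current step.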
82
+
83
+ #### โœ๏ธ Lyric Editing
84
+ - 💡 Applies flow-edit technology to enable localized lyric modifications while preserving melody, vocals, and accompaniment
85
+ - 🔄 Works with both generated content and uploaded audio, greatly enhancing creative possibilities
86
+ - ℹ️ Current limitation: only small segments of lyrics can be modified at once to avoid distortion, but multiple edits can be applied sequentially
87
+
88
+ ### 🚀 Applications
89
+
90
+ #### 🎤 Lyric2Vocal (LoRA)
91
+ - 🔊 Based on a LoRA fine-tuned on pure vocal data, allowing direct generation of vocal samples from lyrics
92
+ - 🛠️ Offers numerous practical applications such as vocal demos, guide tracks, songwriting assistance, and vocal arrangement experimentation
93
+ - ⏱️ Provides a quick way to test how lyrics might sound when sung, helping songwriters iterate faster
94
+
95
+ #### ๐Ÿ“ Text2Samples (LoRA)
96
+ - 🎛️ Similar to Lyric2Vocal, but fine-tuned on pure instrumental and sample data
97
+ - 🎵 Capable of generating conceptual music production samples from text descriptions
98
+ - 🧰 Useful for quickly creating instrument loops, sound effects, and musical elements for production
99
+
100
+ ### 🔮 Coming Soon
101
+
102
+ #### 🎤 RapMachine
103
+ - 🔥 Fine-tuned on pure rap data to create an AI system specialized in rap generation
104
+ - 🏆 Expected capabilities include AI rap battles and narrative expression through rap
105
+ - 📚 Rap has exceptional storytelling and expressive capabilities, offering extraordinary application potential
106
+
107
+ #### 🎛️ StemGen
108
+ - 🎚️ A ControlNet-LoRA trained on multi-track data to generate individual instrument stems
109
+ - 🎯 Takes a reference track and specified instrument (or instrument reference audio) as input
110
+ - 🎹 Outputs an instrument stem that complements the reference track, such as creating a piano accompaniment for a flute melody or adding jazz drums to a lead guitar
111
+
112
+ #### 🎤 Singing2Accompaniment
113
+ - 🔄 The reverse process of StemGen, generating a mixed master track from a single vocal track
114
+ - 🎵 Takes a vocal track and specified style as input to produce a complete vocal accompaniment
115
+ - 🎸 Creates full instrumental backing that complements the input vocals, making it easy to add professional-sounding accompaniment to any vocal recording
116
+
117
+ ## 💻 Installation
118
+
119
+ ```bash
120
+ conda create -n ace_step python=3.10
121
+ conda activate ace_step
122
+ pip install -r requirements.txt
123
+ conda install ffmpeg
124
+ ```
125
+
126
+ ## 🖥️ Hardware Performance
127
+
128
+ We've tested ACE-Step on various hardware configurations, with the following results:
129
+
130
+ | Device | 27 Steps | 60 Steps |
131
+ |--------|-------------------------|-------------------------|
132
+ | NVIDIA A100 | 0.036675 | 0.0815 |
133
+ | MacBook M2 Max | 0.44 | 0.97 |
134
+ | NVIDIA RTX 4090 | 0.029 | 0.064 |
135
+
136
+ Values are seconds of compute per second of generated audio (seconds/audio).
137
+ For example, to estimate the total time needed to generate a 180-second song, multiply 180 by the value for the desired device and step count.
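
The estimate described above is a single multiplication. The sketch below copies the figures from the table and applies them to a 180-second song; the dictionary and the function name `estimate_generation_time` are illustrative and not part of the project.

```python
# Seconds of compute per second of generated audio, copied from the table above.
SECONDS_PER_AUDIO_SECOND = {
    "NVIDIA A100": {27: 0.036675, 60: 0.0815},
    "MacBook M2 Max": {27: 0.44, 60: 0.97},
    "NVIDIA RTX 4090": {27: 0.029, 60: 0.064},
}


def estimate_generation_time(device: str, steps: int, audio_seconds: float) -> float:
    """Return the estimated wall-clock seconds needed to generate the clip."""
    return audio_seconds * SECONDS_PER_AUDIO_SECOND[device][steps]


if __name__ == "__main__":
    for device in SECONDS_PER_AUDIO_SECOND:
        for steps in (27, 60):
            seconds = estimate_generation_time(device, steps, audio_seconds=180)
            print(f"{device:16s} {steps:2d} steps: {seconds:7.2f} s")
```

For instance, a 180-second song at 27 steps on an A100 works out to 180 × 0.036675 ≈ 6.6 seconds.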
138
+
139
+ ## 🚀 Usage
140
+
141
+ ![Demo Interface](fig/demo_interface.png)
142
+
143
+ ### ๐Ÿ” Basic Usage
144
+
145
+ ```bash
146
+ python app.py
147
+ ```
148
+
149
+ ### ⚙️ Advanced Usage
150
+
151
+ ```bash
152
+ python app.py --checkpoint_path /path/to/checkpoint --port 7865 --device_id 0 --share --bf16
153
+ ```
154
+
155
+ #### 🛠️ Command Line Arguments
156
+
157
+ - `--checkpoint_path`: Path to the model checkpoint (default: downloads automatically)
158
+ - `--port`: Port to run the Gradio server on (default: 7865)
159
+ - `--device_id`: GPU device ID to use (default: 0)
160
+ - `--share`: Enable Gradio sharing link (default: False)
161
+ - `--bf16`: Use bfloat16 precision for faster inference (default: True)
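
To make the defaults above concrete, here is a minimal `argparse` sketch that mirrors the documented flags. It is not taken from `app.py`, which may parse its options differently; the function name `build_parser`, the use of `None` to mean "download automatically", and the extra `--no-bf16` switch are assumptions of this sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Launch the ACE-Step demo (sketch)")
    # None stands in for "download the checkpoint automatically".
    parser.add_argument("--checkpoint_path", type=str, default=None)
    parser.add_argument("--port", type=int, default=7865)
    parser.add_argument("--device_id", type=int, default=0)
    # store_true makes --share an on/off flag that defaults to False.
    parser.add_argument("--share", action="store_true")
    # BooleanOptionalAction (Python 3.9+) also generates a --no-bf16 switch.
    parser.add_argument("--bf16", action=argparse.BooleanOptionalAction, default=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With no arguments this yields the documented defaults: port 7865, device 0, sharing off, and bfloat16 on.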
162
+
163
+ ## 📱 User Interface Guide
164
+
165
+ The ACE-Step interface provides several tabs for different music generation and editing tasks:
166
+
167
+ ### ๐Ÿ“ Text2Music Tab
168
+
169
+ 1. **📋 Input Fields**:
170
+ - **🏷️ Tags**: Enter descriptive tags, genres, or scene descriptions separated by commas
171
+ - **📜 Lyrics**: Enter lyrics with structure tags like [verse], [chorus], and [bridge]
172
+ - **⏱️ Audio Duration**: Set the desired duration of the generated audio (-1 for random)
173
+
174
+ 2. **⚙️ Settings**:
175
+ - **🔧 Basic Settings**: Adjust inference steps, guidance scale, and seeds
176
+ - **🔬 Advanced Settings**: Fine-tune scheduler type, CFG type, ERG settings, and more
177
+
178
+ 3. **🚀 Generation**: Click "Generate" to create music based on your inputs
179
+
180
+ ### 🔄 Retake Tab
181
+
182
+ - 🎲 Regenerate music with slight variations using different seeds
183
+ - 🎚️ Adjust variance to control how much the retake differs from the original
184
+
185
+ ### 🎨 Repainting Tab
186
+
187
+ - 🖌️ Selectively regenerate specific sections of the music
188
+ - ⏱️ Specify start and end times for the section to repaint
189
+ - ๐Ÿ” Choose the source audio (text2music output, last repaint, or upload)
190
+
191
+ ### โœ๏ธ Edit Tab
192
+
193
+ - 🔄 Modify existing music by changing tags or lyrics
194
+ - 🎛️ Choose between "only_lyrics" mode (preserves melody) or "remix" mode (changes melody)
195
+ - 🎚️ Adjust edit parameters to control how much of the original is preserved
196
+
197
+ ### ๐Ÿ“ Extend Tab
198
+
199
+ - ➕ Add music to the beginning or end of an existing piece
200
+ - ๐Ÿ“ Specify left and right extension lengths
201
+ - ๐Ÿ” Choose the source audio to extend
202
+
203
+ ## Examples
204
+
205
+ The `examples/input_params` directory contains sample input parameters that can be used as references for generating music.
206
+
207
+ ## 📜 License & Disclaimer
208
+
209
+ This project is licensed under the [Apache License 2.0](./LICENSE).
210
+
211
+ ACE-Step enables original music generation across diverse genres, with applications in creative production, education, and entertainment. While designed to support positive and artistic use cases, we acknowledge potential risks such as unintentional copyright infringement due to stylistic similarity, inappropriate blending of cultural elements, and misuse for generating harmful content. To ensure responsible use, we encourage users to verify the originality of generated works, clearly disclose AI involvement, and obtain appropriate permissions when adapting protected styles or materials. By using ACE-Step, you agree to uphold these principles and respect artistic integrity, cultural diversity, and legal compliance. The authors are not responsible for any misuse of the model, including but not limited to copyright violations, cultural insensitivity, or the generation of harmful content.
212
+
213
+ ## 🙏 Acknowledgements
214
+
215
+ This project is co-led by ACE Studio and StepFun.
216
+
217
+
218
+ ## 📖 Citation
219
+
220
+ If you find this project useful for your research, please consider citing:
221
 
222
+ ```BibTeX
223
+ @misc{gong2025acestep,
224
+ title={ACE-Step: A Step Towards Music Generation Foundation Model},
225
+ author={Junmin Gong and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
226
+ howpublished={\url{https://github.com/ace-step/ACE-Step}},
227
+ year={2025},
228
+ note={GitHub repository}
229
+ }
230
+ ```
fig/ACE-Step_framework.png ADDED

Git LFS Details

  • SHA256: 013314e06bc28c3473ecc58a884a7e8b9bdb28133ad6bce0f8329dce887b6316
  • Pointer size: 132 Bytes
  • Size of remote file: 1.14 MB
fig/acestudio_logo.png ADDED

Git LFS Details

  • SHA256: e919aa57a414bf27d054fa10df264e583731642934d19c8c3167fd39e6011e3a
  • Pointer size: 130 Bytes
  • Size of remote file: 22.2 kB
fig/application_map.png ADDED

Git LFS Details

  • SHA256: 6a9ed2a3fa80d98e89df273169489dd038bcffd6f5cec6a7741c841b38dd38a4
  • Pointer size: 131 Bytes
  • Size of remote file: 259 kB
fig/demo_interface.png ADDED

Git LFS Details

  • SHA256: b8d45fbeb276262ca758d065a605fdb9d53df28eb568617b1a4af4bca27401df
  • Pointer size: 131 Bytes
  • Size of remote file: 636 kB
fig/orgnization_logos.png ADDED

Git LFS Details

  • SHA256: eb4b1cfc6a3b4f1a227bacd0e8776e0f3bb3dbe0485f0e00f586e8d4db4f9c3d
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
fig/stepfun_logo.png ADDED

Git LFS Details

  • SHA256: 9f3568c90629378c7c46451d57f3a983a5cd86670fc6f37890aab131e916c2ff
  • Pointer size: 129 Bytes
  • Size of remote file: 9.97 kB