---
license: apache-2.0
tags:
- music
- text2music
pipeline_tag: text-to-audio
language:
- en
- zh
- de
- fr
- es
- it
- pt
- pl
- tr
- ru
- cs
- nl
- ar
- ja
- hu
- ko
- hi
library_name: diffusers
---
# 🎤 Chinese Rap LoRA for ACE-Step (Rap Machine)
This is a hybrid rap voice model. We meticulously curated Chinese rap/hip-hop datasets for training, with rigorous data cleaning and recaptioning. The results demonstrate:
- Improved Chinese pronunciation accuracy
- Enhanced stylistic adherence to hip-hop and electronic genres
- Greater diversity in hip-hop vocal expressions
For audio examples, see https://ace-step.github.io/#RapMachine
## Usage Guide
1. Generate higher-quality Chinese songs
2. Create superior hip-hop tracks
3. Blend with other genres to:
   - Produce music with better vocal quality and detail
   - Add experimental flavors (e.g., underground, street culture)
4. Fine-tune the results with the following parameters:
**Vocal Controls**
**`vocal_timbre`**
- Examples: Bright, dark, warm, cold, breathy, nasal, gritty, smooth, husky, metallic, whispery, resonant, airy, smoky, sultry, light, clear, high-pitched, raspy, powerful, ethereal, flute-like, hollow, velvety, shrill, hoarse, mellow, thin, thick, reedy, silvery, twangy.
- Describes inherent vocal qualities.
**`techniques`** (List)
- Rap styles: `mumble rap`, `chopper rap`, `melodic rap`, `lyrical rap`, `trap flow`, `double-time rap`
- Vocal FX: `auto-tune`, `reverb`, `delay`, `distortion`
- Delivery: `whispered`, `shouted`, `spoken word`, `narration`, `singing`
- Other: `ad-libs`, `call-and-response`, `harmonized`
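As a rough illustration of how the controls above might be combined, the sketch below joins genre, `vocal_timbre`, and `techniques` values into one comma-separated tag string. The helper name `build_tag_prompt` and the comma-separated format are assumptions made for this example, not a documented ACE-Step interface; check the GitHub repository for the prompt format the model actually expects.

```python
from typing import Iterable, List


def build_tag_prompt(
    genres: Iterable[str],
    vocal_timbre: Iterable[str],
    techniques: Iterable[str],
) -> str:
    """Join tags into one comma-separated string (hypothetical helper).

    Tags are lower-cased and stripped; blanks and duplicates are dropped,
    and the first-seen order is preserved.
    """
    seen = set()
    tags: List[str] = []
    for tag in [*genres, *vocal_timbre, *techniques]:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            tags.append(cleaned)
    return ", ".join(tags)


# Example values are taken from the lists above.
prompt = build_tag_prompt(
    genres=["hip-hop", "electronic"],
    vocal_timbre=["gritty", "husky"],
    techniques=["chopper rap", "auto-tune", "ad-libs"],
)
print(prompt)
# prints: hip-hop, electronic, gritty, husky, chopper rap, auto-tune, ad-libs
```

Keeping the tags in separate lists until the last step makes it easy to swap a single timbre or technique between generations and compare the results.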
## Community Note
While a Chinese rap LoRA might seem niche for non-Chinese communities, we consistently demonstrate through such projects that ACE-Step, as a music generation foundation model, holds boundless potential. It not only improves pronunciation in one language but also gives rise to new styles.
The universal human appreciation of music is a precious asset. Like abstract LEGO blocks, these elements will eventually combine in more organic ways. May our open-source contributions propel the evolution of musical history forward.
---
# ACE-Step: A Step Towards Music Generation Foundation Model

## Model Description
ACE-Step is a novel open-source foundation model for music generation that overcomes key limitations of existing approaches through a holistic architectural design. It integrates diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer, achieving state-of-the-art performance in generation speed, musical coherence, and controllability.
**Key Features:**
- 15× faster than LLM-based baselines (20s for 4-minute music on A100)
- Superior musical coherence across melody, harmony, and rhythm
- Full-song generation with duration control, guided by natural language descriptions
## Uses
### Direct Use
ACE-Step can be used for:
- Generating original music from text descriptions
- Music remixing and style transfer
- Editing song lyrics
### Downstream Use
The model serves as a foundation for:
- Voice cloning applications
- Specialized music generation (rap, jazz, etc.)
- Music production tools
- Creative AI assistants
### Out-of-Scope Use
The model should not be used for:
- Generating copyrighted content without permission
- Creating harmful or offensive content
- Misrepresenting AI-generated music as human-created
## How to Get Started
See the GitHub repository: https://github.com/ace-step/ACE-Step
## Hardware Performance
| Device | 27 Steps | 60 Steps |
|---------------|----------|----------|
| NVIDIA A100 | 27.27x | 12.27x |
| RTX 4090 | 34.48x | 15.63x |
| RTX 3090 | 12.76x | 6.48x |
| M2 Max | 2.27x | 1.03x |
*Values are RTF (Real-Time Factor); higher values indicate faster generation.*
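To make the table easier to read, the sketch below converts RTF values into approximate generation times. It assumes RTF here means audio duration divided by wall-clock generation time, which matches the note that higher values are faster; the function name is illustrative only. Under that assumption, the A100 figure at 60 steps works out to roughly 20 seconds for a 4-minute track, in line with the Key Features claim.

```python
# RTF values copied from the Hardware Performance table, keyed by step count.
RTF_TABLE = {
    "NVIDIA A100": {27: 27.27, 60: 12.27},
    "RTX 4090": {27: 34.48, 60: 15.63},
    "RTX 3090": {27: 12.76, 60: 6.48},
    "M2 Max": {27: 2.27, 60: 1.03},
}


def estimated_generation_seconds(device: str, steps: int, audio_seconds: float) -> float:
    """Estimate wall-clock time, assuming RTF = audio duration / generation time."""
    return audio_seconds / RTF_TABLE[device][steps]


# Estimated time to generate a 4-minute (240 s) track on each device.
for device in RTF_TABLE:
    t27 = estimated_generation_seconds(device, 27, 240.0)
    t60 = estimated_generation_seconds(device, 60, 240.0)
    print(f"{device:12s} 27 steps: {t27:6.1f} s   60 steps: {t60:6.1f} s")
```

These are estimates derived from the table alone; actual times depend on software versions, precision settings, and track length.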
## Limitations
- Performance varies by language (top 10 languages perform best)
- Longer generations (>5 minutes) may lose structural coherence
- Rare instruments may not render perfectly
- Output Inconsistency: Highly sensitive to random seeds and input duration, leading to varied "gacha-style" results.
- Style-specific Weaknesses: Underperforms on certain genres (e.g., Chinese rap/zh_rap), with limited style adherence and a limited musicality ceiling.
- Continuity Artifacts: Unnatural transitions in repainting/extend operations
- Vocal Quality: Coarse vocal synthesis lacking nuance
- Control Granularity: Needs finer-grained musical parameter control
## Ethical Considerations
Users should:
- Verify originality of generated works
- Disclose AI involvement
- Respect cultural elements and copyrights
- Avoid harmful content generation
## Model Details
**Developed by:** ACE Studio and StepFun
**Model type:** Diffusion-based music generation with transformer conditioning
**License:** Apache 2.0
**Resources:**
- [Project Page](https://ace-step.github.io/)
- [Demo Space](https://huggingface.co/spaces/ACE-Step/ACE-Step)
- [GitHub Repository](https://github.com/ACE-Step/ACE-Step)
## Citation
```bibtex
@misc{gong2025acestep,
title={ACE-Step: A Step Towards Music Generation Foundation Model},
author={Junmin Gong and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
howpublished={\url{https://github.com/ace-step/ACE-Step}},
year={2025},
note={GitHub repository}
}
```
## Acknowledgements
This project is co-led by ACE Studio and StepFun.