---
license: apache-2.0
---
📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
## Key Features
- 🚀 **First Unified Continuous Speech Tokenizer:** The first continuous speech tokenizer to effectively integrate semantic and acoustic features, making it suitable for both understanding and generation tasks.
- 🎧 **High-Quality Reconstruction:** Achieves high-quality audio generation by modeling continuous features with a VAE (see the sketch after this list), minimizing information loss and preserving intricate acoustic textures.
- 🌐 **Convolution-Free Efficiency:** Built on a pure causal transformer architecture, completely eliminating convolutional layers for superior efficiency and a simpler design.
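To make the "continuous" point concrete: instead of quantizing each frame to a codebook index, the tokenizer keeps one real-valued latent vector per frame, sampled with the standard VAE reparameterization. Below is a minimal toy sketch of that distinction, with hypothetical sizes and tensors that are not this repository's API:
```python
import torch

# Hypothetical sizes for illustration only.
frames, latent_dim = 100, 64
mu = torch.randn(1, frames, latent_dim)      # hypothetical encoder mean
logvar = torch.zeros(1, frames, latent_dim)  # hypothetical encoder log-variance

# Standard VAE reparameterization: one continuous latent vector per frame.
latent = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# A discrete codec would instead emit one integer codebook index per frame.
discrete = torch.randint(0, 1024, (1, frames))

print(latent.shape)    # torch.Size([1, 100, 64]) -- continuous features
print(discrete.shape)  # torch.Size([1, 100])     -- integer token IDs
```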
## Installation
```bash
pip install -r requirements.txt
```
## Quick Start
```python
import torch
import torchaudio
from audio_tokenizer.modeling_audio_vae import AudioVAE

# Load the pretrained tokenizer and move it to the GPU.
model = AudioVAE.from_pretrained('inclusionAI/MingTok-Audio')
model = model.cuda()
model.eval()

# Load a 16 kHz waveform; shape is (channels, samples).
waveform, sr = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
sample = {
    'waveform': waveform.cuda(),
    'waveform_length': torch.tensor([waveform.size(-1)]).cuda(),
}

with torch.no_grad():
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        # Encode to continuous latents, then decode back to a waveform.
        latent, frame_num = model.encode_latent(**sample)
        output_waveform = model.decode(latent)

torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=16000)
```
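To sanity-check a reconstruction against the PESQ and STOI numbers reported below, you can score it locally. A minimal sketch, assuming the third-party `pesq` and `pystoi` packages (`pip install pesq pystoi`; they may not be listed in `requirements.txt`) and reusing `waveform` and `output_waveform` from the snippet above:
```python
from pesq import pesq    # third-party: pip install pesq
from pystoi import stoi  # third-party: pip install pystoi

ref = waveform[0].cpu().numpy()                    # original mono signal
deg = output_waveform[0, 0].float().cpu().numpy()  # reconstruction (cast from bfloat16)

# Trim both signals to a common length before scoring.
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]

print('PESQ (wb):', pesq(16000, ref, deg, 'wb'))       # wideband PESQ at 16 kHz
print('STOI:', stoi(ref, deg, 16000, extended=False))  # short-time objective intelligibility
```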
## Performance
### Speech reconstruction performance
Speech reconstruction performance comparison on various audio benchmark datasets. The best results are in bold.
| System | Frame Rate (Hz) | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio (ours) | 50 | **4.21** | **0.96** | **0.98** | **4.04** | **0.96** | **0.98** |
### Adaptation performance on downstream ASR tasks
ASR performance comparison on various audio benchmark datasets. The best (lowest) results are in bold.
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
### Adaptation performance on downstream TTS tasks
TTS performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | Seed-zh WER(%)↓ | Seed-zh SIM↑ | Seed-en WER(%)↓ | Seed-en SIM↑ |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | **0.80** | 2.25 | **0.76** |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | **1.39** | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | **0.95** | 0.70 | 1.85 | 0.58 |
## Acknowledgements
1. We borrowed a lot of code from [X-Codec-2.0](https://github.com/zhenye234/X-Codec-2.0.git) for tokenizer training.
2. We thank the OpenAI team for developing the [Whisper](https://github.com/openai/whisper) model and making its weights publicly available.
## License and Legal Disclaimer
This code repository is licensed under the [MIT License](./LICENSE), and the Legal Disclaimer is located in the [LEGAL.md file](./LEGAL.md) under the project's root directory.
## Citation
If you find our work helpful, please consider citing it.