---
license: apache-2.0
---

📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

## Key Features

- 🚀 **First Unified Continuous Speech Tokenizer:** The first continuous audio tokenizer to effectively integrate semantic and acoustic features, suitable for both understanding and generation tasks.
- 🎧 **High-Quality Reconstruction:** Achieves high-quality audio reconstruction by modeling continuous features with a VAE, minimizing information loss and preserving intricate acoustic textures.
- 🌐 **Convolution-Free Efficiency:** Built on a pure causal transformer architecture, completely eliminating convolutional layers for superior efficiency and a simpler design.

## Installation

```
pip install -r requirements.txt
```

## Quick start

```python
import torch
import torchaudio

from audio_tokenizer.modeling_audio_vae import AudioVAE

# Load the pretrained tokenizer and move it to the GPU.
model = AudioVAE.from_pretrained('inclusionAI/MingTok-Audio')
model = model.cuda()
model.eval()

# Load an example utterance.
waveform, sr = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
sample = {
    'waveform': waveform.cuda(),
    'waveform_length': torch.tensor([waveform.size(-1)]).cuda(),
}

# Encode to continuous latents, then decode back to a waveform.
with torch.no_grad():
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        latent, frame_num = model.encode_latent(**sample)
        output_waveform = model.decode(latent)

torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=16000)
```

## Performance

### Speech reconstruction performance
Speech reconstruction performance comparison on various audio benchmark datasets. The best results are in bold.

| System | Frame Rate (Hz) | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio (ours) | 50 | **4.21** | **0.96** | **0.98** | **4.04** | **0.96** | **0.98** |
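For reference, PESQ and STOI scores like those above can be computed with off-the-shelf packages. The snippet below is a minimal, unofficial sketch assuming the third-party `pesq` and `pystoi` packages and 16 kHz mono audio; it is not the exact evaluation pipeline behind the table, and the SIM metric (which requires a speaker-embedding model) is not shown.

```python
# Minimal sketch: score a reconstruction against its reference waveform.
# Assumes `pip install pesq pystoi torchaudio`; not the official evaluation script.
import torchaudio
from pesq import pesq     # ITU-T P.862 wideband PESQ
from pystoi import stoi   # short-time objective intelligibility

SR = 16000  # both metrics below are evaluated on 16 kHz mono audio here

ref, _ = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
rec, _ = torchaudio.load('./1089-134686-0000_reconstruct.wav')

# Trim to a common length and drop the channel dimension.
n = min(ref.size(-1), rec.size(-1))
ref_np = ref[0, :n].numpy()
rec_np = rec[0, :n].numpy()

print('PESQ:', pesq(SR, ref_np, rec_np, 'wb'))
print('STOI:', stoi(ref_np, rec_np, SR, extended=False))
```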
### Adaptation performance on downstream ASR tasks
Understanding ASR performance comparison on various audio benchmark datasets. The best results are in bold.

| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
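The ASR numbers above are recognition error rates. As a generic illustration of how such scores are typically computed (this is not the benchmarks' official scoring code, whose text normalization may differ), the widely used `jiwer` package can be applied as follows:

```python
# Generic WER/CER computation with the jiwer package (pip install jiwer).
# Illustrative only; official benchmark normalization rules may differ.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print('WER (%):', 100 * jiwer.wer(reference, hypothesis))
# For Chinese test sets such as aishell2-ios, character error rate is the usual metric.
print('CER (%):', 100 * jiwer.cer(reference, hypothesis))
```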
### Adaptation performance on downstream TTS tasks
Performance comparison on various audio benchmark datasets. The best results are in bold.

| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | **0.80** | 2.25 | **0.76** |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | **1.39** | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | **0.95** | 0.70 | 1.85 | 0.58 |
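The SIM column is conventionally the cosine similarity between speaker embeddings of the reference and the synthesized speech. The sketch below only illustrates that final step; the speaker-embedding extractor itself (e.g. a WavLM- or ECAPA-based verification model) is assumed and not part of this repository, and random tensors stand in for real embeddings.

```python
# Sketch of the SIM metric: cosine similarity between speaker embeddings.
# Real embeddings would come from a speaker-verification model; the random
# tensors here are placeholders only.
import torch
import torch.nn.functional as F

ref_emb = torch.randn(1, 256)  # embedding of the reference speaker's audio
gen_emb = torch.randn(1, 256)  # embedding of the synthesized audio

sim = F.cosine_similarity(ref_emb, gen_emb, dim=-1)
print('SIM:', sim.item())
```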
## Acknowledgements

1. Our tokenizer training code borrows heavily from [X-Codec-2.0](https://github.com/zhenye234/X-Codec-2.0.git).
2. We thank the OpenAI team for developing the [Whisper](https://github.com/openai/whisper) model and making its weights publicly available.

## License and Legal Disclaimer

This code repository is licensed under the [MIT License](./LICENSE), and the Legal Disclaimer is located in the [LEGAL.md file](./LEGAL.md) under the project's root directory.

## Citation

If you find our work helpful, please feel free to cite us.