📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
pip install -r requirements.txt
import torch
import torchaudio
from audio_tokenizer.modeling_audio_vae import AudioVAE
model = AudioVAE.from_pretrained('inclusionAI/MingTok-Audio')
model = model.cuda()
model.eval()
waveform, sr = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
sample = {'waveform': waveform.cuda(), 'waveform_length': torch.tensor([waveform.size(-1)]).cuda()}
with torch.no_grad():
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        latent, frame_num = model.encode_latent(**sample)
        output_waveform = model.decode(latent)
torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=16000)
System | Frame rate | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
---|---|---|---|---|---|---|---|
MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
MingTok-Audio(ours) | 50 | 4.21 | 0.96 | 0.98 | 4.04 | 0.96 | 0.98 |
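As a sanity check on the `frame_num` value returned by `encode_latent`, the frame rate in the table can be turned into an expected latent length. The sketch below is only an illustration: it assumes the frame rate is in frames per second, that input audio is 16 kHz (inferred from the `sample_rate=16000` used when saving the reconstruction), and that a partial final frame is rounded up. The helper name `expected_latent_frames` is hypothetical and not part of the repository.

```python
import math


def expected_latent_frames(num_samples: int,
                           sample_rate: int = 16000,
                           frame_rate: float = 50.0) -> int:
    """Estimate how many latent frames a waveform should produce.

    Assumptions (not confirmed by the repository):
    - frame_rate is expressed in frames per second,
    - a partial trailing frame is rounded up to a full frame.
    """
    if num_samples < 0 or sample_rate <= 0 or frame_rate <= 0:
        raise ValueError("num_samples must be >= 0 and rates must be > 0")
    # At 16 kHz and 50 frames per second, each frame covers 320 samples.
    samples_per_frame = sample_rate / frame_rate
    return math.ceil(num_samples / samples_per_frame)


# One second of 16 kHz audio at 50 frames per second.
one_second = expected_latent_frames(16000)
# One extra sample spills into an additional (partial) frame.
one_second_plus = expected_latent_frames(16001)
```

If the value reported by the model differs by a frame or two, the rounding or padding convention is probably different from the one assumed here.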
Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
---|---|---|---|---|---|---|---|---|
Understanding ASR | Kimi-Audio | 2.56 | 1.28 | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
Understanding ASR | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
Understanding ASR | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
Understanding ASR | Ming-UniAudio-16B-A3B(ours) | 2.84 | 1.62 | 9.80 | 16.50 | 5.51 | 5.46 | 14.65 |
Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
---|---|---|---|---|---|
Generation | Seed-TTS | 1.12 | 0.80 | 2.25 | 0.76 |
Generation | MiMo-Audio | 1.96 | - | 5.37 | - |
Generation | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | 1.39 | - |
Generation | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
Generation | Ming-UniAudio-16B-A3B(ours) | 0.95 | 0.70 | 1.85 | 0.58 |
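For readers unfamiliar with the WER(%) columns, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The snippet below is a minimal sketch of that standard definition, not the evaluation script used for the numbers above; the function name `word_error_rate` is hypothetical, and text normalization (casing, punctuation) is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

For Chinese text, evaluations commonly score per character rather than per word; with this sketch that would mean passing the characters separated by spaces.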
This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.
If you find our work helpful, please consider citing it.