CosyVoice2
This version of CosyVoice2 has been converted to run on the Axera NPU using w8a16 quantization. Compatible with Pulsar2 version: 4.2
Conversion tool links:
If you are interested in model conversion, you can export the axmodel yourself through the original repo: CosyVoice
Pulsar2 Link: How to Convert LLM from Huggingface to axmodel
Support Platform
- AX650
- AX650N DEMO Board
- M4N-Dock (爱芯派Pro)
- M.2 Accelerator card
Speech Generation
Stage | Time |
---|---|
LLM prefill (input_token_num + prompt_token_num in [0, 128]) | 104 ms |
LLM prefill (input_token_num + prompt_token_num in [128, 256]) | 234 ms |
Decode | 21.24 token/s |
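As a rough illustration of what these figures mean end to end, the sketch below (plain Python, no dependencies) estimates total LLM time from the table: a hypothetical 142-token input falls in the [128, 256] prefill bucket (~234 ms), and decoding 271 tokens proceeds at the sustained rate. These example token counts are taken from the sample log further down this page.

```python
# Back-of-the-envelope latency estimate from the table above.
PREFILL_MS = 234          # prefill latency, input in [128, 256] tokens
DECODE_TOK_PER_S = 21.24  # sustained decode rate

def estimate_llm_seconds(decoded_tokens: int,
                         prefill_ms: float = PREFILL_MS,
                         decode_rate: float = DECODE_TOK_PER_S) -> float:
    """Prefill latency plus decode time at the sustained rate."""
    return prefill_ms / 1000.0 + decoded_tokens / decode_rate

# 271 decoded tokens -> roughly 13 seconds of LLM time
print(f"~{estimate_llm_seconds(271):.1f} s")
```

This only covers the LLM stage; Token2Wav runs concurrently in a separate thread, as the sample log below shows.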
How to use
Download all files from this repository to the device
1. Text to Speech (Voice Cloning)
(1) Copy this project to the AX650 board
(2) Prepare dependencies
The HTTP Tokenizer Server and the Prompt Speech processing step require the following Python packages. If you run these two steps on a PC, install the packages on the PC.
pip3 install -r scripts/requirements.txt
2. Start HTTP Tokenizer Server
cd scripts
python cosyvoice2_tokenizer.py --host {your host} --port {your port}
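Before starting the board-side pipeline, it can help to confirm the tokenizer server is reachable from the board over the network. The helper below is a hypothetical stdlib-only check, not part of this repo:

```python
import socket
import time

def wait_for_server(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)
    return False

# e.g. wait_for_server("10.122.86.184", 12345)
```

If this returns False, check the host/port arguments and any firewall between the PC and the board before debugging the pipeline itself.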
3. Run on AX650 Board
Modify the HTTP host in run_ax650.sh, then run run_ax650.sh:
root@ax650 ~/Cosyvoice2 # bash run_ax650.sh
rm: cannot remove 'output*.wav': No such file or directory
[I][ Init][ 108]: LLM init start
[I][ Init][ 34]: connect http://10.122.86.184:12345 ok
bos_id: 0, eos_id: 1773
7% | ███ | 2 / 27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][ Init][ 138]: attr.axmodel_num:24
100% | ████████████████████████████████ | 27 / 27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok, remain_cmm(7178 MB)
[I][ Init][ 216]: max_token_len : 1023
[I][ Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
[I][ Init][ 229]: prefill_token_num : 128
[I][ Init][ 233]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 233]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 233]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 233]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 233]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 237]: prefill_max_token_num : 512
[I][ Init][ 249]: LLM init ok
[I][ Init][ 154]: Token2Wav init ok
[I][ main][ 273]:
[I][ Run][ 388]: input token num : 142, prefill_split_num : 2
[I][ Run][ 422]: input_num_token:128
[I][ Run][ 422]: input_num_token:14
[I][ Run][ 607]: ttft: 236.90 ms
[Main/Token2Wav Thread] Processing batch of 28 tokens...
Successfully saved audio to output_0.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 53 tokens...
Successfully saved audio to output_1.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_2.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_3.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_4.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_5.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_6.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_7.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_8.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_9.wav (32-bit Float PCM).
[I][ Run][ 723]: hit eos, llm finished
[I][ Run][ 753]: llm finished
[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.
[I][ Run][ 758]: total decode tokens:271
[N][ Run][ 759]: hit eos,avg 21.47 token/s
Successfully saved audio to output_10.wav (32-bit Float PCM).
Successfully saved audio to output.wav (32-bit Float PCM).
Voice generation pipeline completed.
Type "q" to exit, Ctrl+c to stop current running
text >>
Output speech: output.wav
Optional: Process Prompt Speech
If you want to clone a specific voice, do this step.
You can use the audio files in asset/.
(1) Download wetext
pip3 install modelscope
modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext
(2) Process Prompt Speech
Example:
python3 scripts/process_prompt.py --prompt_text asset/zh_man1.txt --prompt_speech asset/zh_man1.wav --output zh_man1
Adjust the parameters for your setup; the full option list is:
python3 scripts/process_prompt.py -h
usage: process_prompt.py [-h] [--model_dir MODEL_DIR] [--wetext_dir WETEXT_DIR] [--sample_rate SAMPLE_RATE] [--prompt_text PROMPT_TEXT] [--prompt_speech PROMPT_SPEECH]
[--output OUTPUT]
options:
-h, --help show this help message and exit
--model_dir MODEL_DIR
tokenizer configuration directory
--wetext_dir WETEXT_DIR
path to wetext
--sample_rate SAMPLE_RATE
Sampling rate for prompt audio
--prompt_text PROMPT_TEXT
The text content of the prompt(reference) audio. Text or file path.
--prompt_speech PROMPT_SPEECH
The path to prompt(reference) audio.
--output OUTPUT Output data storage directory
After executing the above command, files like the following will be generated:
flow_embedding.txt
flow_prompt_speech_token.txt
llm_embedding.txt
llm_prompt_speech_token.txt
prompt_speech_feat.txt
prompt_text.txt
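The exact layout of these files is defined by scripts/process_prompt.py. Assuming the embedding, token, and feature files are plain whitespace-separated numeric text (check the script for the real format), a quick sanity check of a generated file could look like:

```python
import numpy as np

def load_prompt_array(path: str) -> np.ndarray:
    """Load one of the generated .txt files and report its shape.

    Assumption: the file is whitespace-separated numbers, one row per
    token/frame. Verify against process_prompt.py before relying on this.
    """
    arr = np.loadtxt(path)
    print(f"{path}: shape={arr.shape}, dtype={arr.dtype}")
    return arr

# e.g. load_prompt_array("zh_man1/flow_embedding.txt")
```

A shape mismatch here (e.g. an empty or one-row file) is a quick way to catch a failed prompt-processing run before moving the files to the board.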
When you run run_ax650.sh, pass this output directory to the prompt_files parameter of the script.