CosyVoice2

This version of CosyVoice2 has been converted to run on the Axera NPU using w8a16 quantization. Compatible with Pulsar2 version 4.2.

Conversion tool links:

If you are interested in model conversion, you can export the axmodel through the original repo: CosyVoice

Pulsar2 documentation: How to Convert an LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platform

AX650

Speech Generation

Stage | Time
--- | ---
LLM prefill (input_token_num + prompt_token_num in [0, 128]) | 104 ms
LLM prefill (input_token_num + prompt_token_num in [128, 256]) | 234 ms
Decode | 21.24 token/s
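As a rough illustration of the figures above, end-to-end LLM time for a request can be estimated as prefill latency plus decode time. This is a back-of-the-envelope sketch using the measured numbers, not part of the runtime:

```python
# Rough latency estimate from the measured figures above (assumption:
# prefill latency depends on the input bucket, decode runs at ~21.24 token/s).

def estimate_latency_ms(input_tokens: int, output_tokens: int,
                        decode_tps: float = 21.24) -> float:
    """Estimate end-to-end LLM time in milliseconds."""
    if input_tokens <= 128:
        prefill_ms = 104.0
    elif input_tokens <= 256:
        prefill_ms = 234.0
    else:
        raise ValueError("no measurement for inputs longer than 256 tokens")
    decode_ms = output_tokens / decode_tps * 1000.0
    return prefill_ms + decode_ms

# e.g. the run shown below: 142 input tokens, 271 decoded tokens
print(round(estimate_latency_ms(142, 271)))  # → 12993 (ms)
```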

How to use

Download all files from this repository to the device

1. Text to Speech (Voice Cloning)

(1) Copy this project to AX650 Board

(2) Prepare Dependencies

Running the HTTP Tokenizer Server and Processing Prompt Speech require these Python packages. If you run these two steps on a PC, install the packages on the PC:

pip3 install -r scripts/requirements.txt

2. Start HTTP Tokenizer Server

cd scripts
python cosyvoice2_tokenizer.py --host {your host} --port {your port}   
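Before pointing the board at the server, it can help to confirm that the host and port are reachable. The sketch below is a plain TCP connectivity check (a hypothetical helper; it does not assume anything about the tokenizer server's protocol):

```python
# Simple TCP reachability check for the tokenizer server.
# This only verifies the port is open; it does not speak the tokenizer protocol.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False
```

Usage mirrors the host/port passed to cosyvoice2_tokenizer.py, e.g. `wait_for_port("10.122.86.184", 12345)` for the address seen in the log below.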

3. Run on AX650 Board

  1. Modify the HTTP host in run_ax650.sh.

  2. Run run_ax650.sh

root@ax650 ~/Cosyvoice2 # bash run_ax650.sh 
rm: cannot remove 'output*.wav': No such file or directory
[I][                            Init][ 108]: LLM init start
[I][                            Init][  34]: connect http://10.122.86.184:12345 ok
bos_id: 0, eos_id: 1773
  7% | β–ˆβ–ˆβ–ˆ                               |   2 /  27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][                            Init][ 138]: attr.axmodel_num:24
100% | β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ |  27 /  27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok,remain_cmm(7178 MB)
[I][                            Init][ 216]: max_token_len : 1023
[I][                            Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 229]: prefill_token_num : 128
[I][                            Init][ 233]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 233]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 233]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 233]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 233]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 237]: prefill_max_token_num : 512
[I][                            Init][ 249]: LLM init ok
[I][                            Init][ 154]: Token2Wav init ok
[I][                            main][ 273]: 
[I][                             Run][ 388]: input token num : 142, prefill_split_num : 2
[I][                             Run][ 422]: input_num_token:128
[I][                             Run][ 422]: input_num_token:14
[I][                             Run][ 607]: ttft: 236.90 ms
[Main/Token2Wav Thread] Processing batch of 28 tokens...
Successfully saved audio to output_0.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 53 tokens...
Successfully saved audio to output_1.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_2.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_3.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_4.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_5.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_6.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_7.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_8.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_9.wav (32-bit Float PCM).
[I][                             Run][ 723]: hit eos, llm finished
[I][                             Run][ 753]: llm finished
[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.


[I][                             Run][ 758]: total decode tokens:271
[N][                             Run][ 759]: hit eos,avg 21.47 token/s

Successfully saved audio to output_10.wav (32-bit Float PCM).
Successfully saved audio to output.wav (32-bit Float PCM).

Voice generation pipeline completed.
Type "q" to exit, Ctrl+c to stop current running
text >> 

Output Speech: output.wav
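The log reports the output as 32-bit Float PCM. If you want to sanity-check that from Python without extra dependencies, the RIFF header can be parsed directly (a minimal sketch; in the WAVE spec, format code 3 means IEEE float):

```python
# Minimal WAV header inspection using only the stdlib.
# WAVE format code 3 = IEEE float (i.e. 32-bit Float PCM); 1 = integer PCM.
import struct

def wav_format(path: str) -> tuple[int, int, int]:
    """Return (format_code, channels, sample_rate) from a WAV file header."""
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no fmt chunk found")
            chunk_id, size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                fmt, channels, rate = struct.unpack("<HHI", f.read(8))
                return fmt, channels, rate
            f.seek(size + (size & 1), 1)  # chunks are word-aligned

# e.g. wav_format("output.wav") should report format code 3 for float PCM
```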

Optional. Process Prompt Speech

Do this step if you want to clone a specific voice.
You can use the audio files in asset/ .

(1) Download wetext
pip3 install modelscope
modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext
(2) Process Prompt Speech

Example:

python3 scripts/process_prompt.py --prompt_text  asset/zh_man1.txt --prompt_speech asset/zh_man1.wav --output zh_man1

Adjust the parameters to your actual setup:

python3 scripts/process_prompt.py -h

usage: process_prompt.py [-h] [--model_dir MODEL_DIR] [--wetext_dir WETEXT_DIR] [--sample_rate SAMPLE_RATE] [--prompt_text PROMPT_TEXT] [--prompt_speech PROMPT_SPEECH]
                         [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --model_dir MODEL_DIR
                        tokenizer configuration directory
  --wetext_dir WETEXT_DIR
                        path to wetext
  --sample_rate SAMPLE_RATE
                        Sampling rate for prompt audio
  --prompt_text PROMPT_TEXT
                        The text content of the prompt (reference) audio. Accepts text or a file path.
  --prompt_speech PROMPT_SPEECH
                        The path to prompt(reference) audio.
  --output OUTPUT       Output data storage directory

After executing the above command, files like the following will be generated:

flow_embedding.txt  
flow_prompt_speech_token.txt  
llm_embedding.txt  
llm_prompt_speech_token.txt  
prompt_speech_feat.txt  
prompt_text.txt

When you run run_ax650.sh, pass the output directory generated here to the prompt_files parameter of the run_ax650.sh script.
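The exact format of these generated files is not documented here; assuming they hold plain whitespace-separated numbers (a common convention for dumped embeddings and token lists), they can be inspected like this. This is an illustrative sketch, not part of the pipeline:

```python
# Illustrative loader for the generated prompt files, assuming plain
# whitespace-separated numeric text (verify against your actual files).
from pathlib import Path

def load_numeric_txt(path: str) -> list[float]:
    """Read all whitespace-separated numbers from a text file."""
    return [float(tok) for tok in Path(path).read_text().split()]

# e.g. inspect the speaker embedding produced by process_prompt.py:
# emb = load_numeric_txt("zh_man1/llm_embedding.txt")
# print(len(emb))
```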
