> We choose to go to the moon, not because they are easy, but because they are hard.
# Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit that provides a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete tokens, and converting text into speech tokens. DSTK offers efficient, flexible, and modular building blocks for speech understanding, speech synthesis, multimodal learning, and related tasks.

## Release Notes:
V1.0

This release of DSTK contains three modules:
1. Semantic Tokenizer
   - Encodes the semantic information of speech into discrete speech tokens
   - Frame rate: 25 Hz; codebook size: 4096; supports Chinese and English
2. Semantic Detokenizer
   - Decodes discrete speech tokens back into an audible waveform, reconstructing the speech
   - Supports Chinese and English
3. Text2Token
   - Converts text into speech tokens
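As a back-of-the-envelope check on these figures: at a 25 Hz frame rate with a 4096-entry codebook, each token carries 12 bits, so one second of speech maps to 25 tokens at about 300 bits/s. A quick sketch of that arithmetic:

```python
import math

FRAME_RATE_HZ = 25    # tokens per second of audio (from the release notes)
CODEBOOK_SIZE = 4096  # distinct token values

bits_per_token = math.log2(CODEBOOK_SIZE)     # 12.0 bits
tokens_for_10s = FRAME_RATE_HZ * 10           # 250 tokens for a 10 s clip
bitrate_bps = FRAME_RATE_HZ * bits_per_token  # 300.0 bits/s

print(bits_per_token, tokens_for_10s, bitrate_bps)
```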

## TTS pipeline
Chaining the three modules above implements text-to-speech.
<p align="center"><img src="figs/TTS.jpg" width="1200"></p>

## Non-parallel Speech Reconstruction Pipeline
Chaining the tokenizer and detokenizer implements speech reconstruction.
<p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>

These pipelines achieve top-tier performance on the TTS and speech-reconstruction tasks of the seed-tts-eval dataset, while both models use far fewer parameters and far less supervised data than the baseline models:
<p align="center"><img src="figs/eval1.jpg" width="1200"></p> 
<p align="center"><img src="figs/eval2.jpg" width="1200"></p> 

We also evaluated the ASR accuracy of our speech tokenizer with an LLM-based setup; it performs on par with models that use continuous speech representations.
<p align="center"><img src="figs/eval3.jpg"  width="1200"></p> 


## More about the three modules:
- [Semantic Tokenizer](semantic_tokenizer/f40ms/README.md)
- [Semantic Detokenizer](semantic_detokenizer/README.md)
- [Text2Token](text2token/README.md)

## Installation

### Hardware: Ascend 910B with CANN 8.1 RC1 or GPU
### Create a separate environment if needed

```bash
# Create a conda env with python_version>=3.10  (you could also use virtualenv)
conda create -n dstk python=3.10
conda activate dstk

# run install_requirements.sh to set up the environment for DSTK inference on Ascend 910B
# for GPUs, just remove torch-npu==2.5.1 from requirements_npu.txt
sh install_requirements.sh

# patch for G2P
# modify the first line in thirdparty/G2P/patch_for_deps.sh:
# SITE_PATH=/path/to/your/own/site-packages
# run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
sh thirdparty/G2P/patch_for_deps.sh
```

### Download the vocos vocoder from [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)

## Usage:
### Pipelines


```python
import sys
import soundfile as sf

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from reconstuction_example import ReconstructionPipeline
from tts_example import TTSPipeline

ref_wav_path = dstk_path + "/00004557-00000030.wav"
input_wav_path = dstk_path + "/004892.wav"
vocoder_path = "/path/to/vocos-mel-24khz"

reconstructor = ReconstructionPipeline(
    detok_vocoder=vocoder_path,
)

tts = TTSPipeline(
    detok_vocoder=vocoder_path,
    max_seg_len=30,
)

# non-parallel speech reconstruction
generated_wave, target_sample_rate = reconstructor.reconstruct(
    ref_wav_path, input_wav_path
)

recon_path = "./recon.wav"
sf.write(recon_path, generated_wave, target_sample_rate)
print(f"write output to: {recon_path}")

# TTS
ref_wav_path = input_wav_path
generated_wave, target_sample_rate = tts.synthesize(
    ref_wav_path,
    "荷花未全谢,又到中秋节。家家户户把月饼切,庆中秋。美酒多欢乐,整杯盘,猜拳行令,同赏月。",
)
tts_path = "./tts.wav"
sf.write(tts_path, generated_wave, target_sample_rate)
print(f"write output to: {tts_path}")

print("Finished")
```

### Tokenization
```python
import sys
import librosa

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

input_wav_path = dstk_path + "/004892.wav"

from semantic_tokenizer.f40ms.simple_tokenizer_infer import SpeechTokenizer

tokenizer = SpeechTokenizer()

raw_wav, sr = librosa.load(input_wav_path, sr=16000)
token_list, token_info_list = tokenizer.extract([raw_wav])  # pass in waveform data
for token_info in token_info_list:
    print(token_info["unit_sequence"] + "\n")
    print(token_info["reduced_unit_sequence"] + "\n")
```
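The example above prints both a `unit_sequence` and a `reduced_unit_sequence`. A plausible reading is that the reduced form collapses consecutive repeated tokens; the sketch below illustrates that kind of reduction (the field names come from the example above, but the collapsing logic is an assumption, not taken from the library):

```python
from itertools import groupby

def reduce_units(unit_sequence: str) -> str:
    """Collapse runs of identical tokens, e.g. '5 5 7 7 7 2' -> '5 7 2'."""
    tokens = unit_sequence.split()
    return " ".join(key for key, _ in groupby(tokens))

print(reduce_units("5 5 7 7 7 2"))  # 5 7 2
```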

### Text2Token
```python
import sys
import librosa

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from text2token.simple_infer import Text2TokenGenerator

input_text = "从离散语音token重建语音波形"
MAX_SEG_LEN = 30

t2u = Text2TokenGenerator()

phones = t2u.text2phone(input_text.strip())
print("phonemes of input text: %s are [%s]" % (input_text, phones))

speech_tokens_info = t2u.generate_for_long_input_text(
    [phones], max_segment_len=MAX_SEG_LEN
)

for info in speech_tokens_info[0]:
    print(" ".join(info) + "\n")
```
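`generate_for_long_input_text` takes a `max_segment_len`, which suggests long inputs are processed in segments. The actual splitting strategy is internal to the library; as a purely illustrative sketch, a greedy split on sentence punctuation that keeps each segment under the limit might look like this (function name and logic are hypothetical):

```python
import re

def split_text(text: str, max_seg_len: int = 30) -> list[str]:
    """Greedily group punctuation-delimited pieces into segments of at most max_seg_len chars."""
    # split after common Chinese/English sentence punctuation, keeping the delimiter
    parts = re.split(r"(?<=[。,,.!?!?])", text)
    segments, cur = [], ""
    for part in parts:
        if cur and len(cur) + len(part) > max_seg_len:
            segments.append(cur)
            cur = part
        else:
            cur += part
    if cur:
        segments.append(cur)
    return segments

print(split_text("abc。defgh。", max_seg_len=5))  # ['abc。', 'defgh。']
```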
### Detokenization
```python
import sys
import soundfile as sf

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from semantic_detokenizer.chunk_infer import SpeechDetokenizer

# reconstruct a speech waveform from discrete speech tokens
input_tokens = "3953 3890 3489 456 2693 3239 3692 3810 3874 3882 2749 548 3202 4012 3490 3939 3988 411 722 826 2812 3883 3874 3810 3983 4086 3946 3747 3469 2537 3689 3434 1816 1242 2415 3942 3363 3865 2841 1700 1652 3241 3362 3363 3874 3882 2792 933 2253 2799 3692 3746 3882 2809 1001 2449 1016 3762 3882 3874 3810 3809 3983 4086 4018 3747 3461 2537 3624 3882 3382 581 1837 2413 3435 4005 2003 2890 3884 3690 3746 3938 3874 3873 3856"
vocoder_path = "/path/to/vocos-mel-24khz"
ref_wav_path = dstk_path + "/004892.wav"
# output of tokenizer given ref_wav as input
ref_tokens = "3936 3872 3809 3873 3817 3639 2591 539 1021 3641 3890 4069 2002 3537 2303 3773 3827 3875 3969 4072 2425 97 2537 3633 3690 3865 3920 3069 3582 3883 3818 3997 4031 4029 3946 3874 3733 3727 3214 506 3892 3787 3457 3552 3490 4014 991 1991 3885 3947 4069 1488 1016 3258 3710 52 2362 3961 2680 1569 1851 3897 3825 3752 3808 3800 3873 3808 3792"

token_chunk_len = 75
chunk_cond_proportion = 0.3
chunk_look_ahead = 10
max_ref_duration = 4.5
ref_audio_cut_from_head = False

detoker = SpeechDetokenizer(
    vocoder_path=vocoder_path,
)

generated_wave, target_sample_rate = detoker.chunk_generate(
    ref_wav_path,
    ref_tokens.split(),
    input_tokens.split(),
    token_chunk_len,
    chunk_cond_proportion,
    chunk_look_ahead,
    max_ref_duration,
    ref_audio_cut_from_head,
)

detok_path = "./detok.wav"
sf.write(detok_path, generated_wave, target_sample_rate)
print(f"write output to: {detok_path}")
```
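For intuition on the chunking parameters above: at the tokenizer's 25 Hz frame rate they translate into time spans as follows (arithmetic sketch only, not derived from the library internals):

```python
FRAME_RATE_HZ = 25  # tokens per second of audio

token_chunk_len = 75
chunk_cond_proportion = 0.3
chunk_look_ahead = 10

chunk_seconds = token_chunk_len / FRAME_RATE_HZ             # 3.0 s of audio per chunk
cond_tokens = int(token_chunk_len * chunk_cond_proportion)  # ~22 tokens of prior-chunk context
look_ahead_seconds = chunk_look_ahead / FRAME_RATE_HZ       # 0.4 s of look-ahead

print(chunk_seconds, cond_tokens, look_ahead_seconds)
```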



## Core Developers:
[Daxin Tan]([email protected]), [Dehua Tao]([email protected]), [Yusen Sun]([email protected]) and [Xiao Chen]([email protected])

## Contributors:
[Hanlin Zhang]([email protected])

## Former Contributors:
Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang