Commit bdecca1
Parent(s): c831dba

refine readme, add logo, and fix a punct normalization problem in tn
Files changed:
- .gitattributes +1 -0
- README.md +14 -2
- README_CN.md +3 -1
- figs/DSTK_logo.jpg +3 -0
- semantic_tokenizer/f40ms/README.md +1 -1
- thirdparty/G2P/TN_processors.py +1 -1
.gitattributes
CHANGED
@@ -48,3 +48,4 @@ figs/tokenizer.jpg filter=lfs diff=lfs merge=lfs -text
 004892.wav filter=lfs diff=lfs merge=lfs -text
 00004557-00000030.wav filter=lfs diff=lfs merge=lfs -text
 DSTK_Eval.pdf filter=lfs diff=lfs merge=lfs -text
+figs/DSTK_logo.jpg filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -12,7 +12,9 @@ tags:
 - text2token
 ---
 
-
+<p align="center"><img src="figs/DSTK_logo.jpg" width="100"></p>
+
+> We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard ...
 
 # Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
 
@@ -41,6 +43,7 @@ As shown in figure below, the tokenizer and detokenizer could also form a pipeli
 <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
 
 These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset, with fewer parameters and much less supervised data for training:
+- All our experiments were conducted on the Ascend 910B, and the experimental results may differ slightly from those obtained on GPUs.
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
 
@@ -72,6 +75,13 @@ sh install_requirements.sh
 # run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
 sh thirdparty/G2P/patch_for_deps.sh
 ```
+### Run on Ascend 910B platforms
+```bash
+# the env variable TOKENIZE_ON_NPU needs to be defined
+export TOKENIZE_ON_NPU=1
+# this env variable is not needed for GPUs; just do not define it
+```
+
 
 ### Download the vocos vocoder from [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)
 
@@ -228,4 +238,6 @@ Discrete Speech Team, HKRC, Huawei
 Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
 
 ## Acknowledgement
-We express our sincere gratitude to HKRC for their support of this project. Their contribution is gratefully acknowledged.
+We express our sincere gratitude to HKRC for their support of this project. Their contribution is gratefully acknowledged.
+
+Special thanks to the [Textless NLP Project](https://speechbot.github.io/), which has inspired us to embark on this research direction.
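The new Ascend section above only sets an environment variable; as a hedged illustration of how such a flag is typically consumed, here is a minimal Python sketch. The `torch_npu` import and the device-selection logic are assumptions about the Ascend PyTorch plugin, not code from this commit:

```python
import os

import torch

# Hypothetical sketch: branch on the TOKENIZE_ON_NPU variable documented
# in the README above. Importing torch_npu (Ascend's PyTorch plugin)
# registers the "npu" device type; this is assumed, not the toolkit's code.
if os.environ.get("TOKENIZE_ON_NPU"):
    import torch_npu  # noqa: F401
    device = torch.device("npu")
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"tokenizing on: {device}")
```

Leaving the variable undefined keeps the GPU/CPU path, which matches the comment in the added bash block.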
README_CN.md
CHANGED
@@ -1,4 +1,6 @@
-
+<p align="center"><img src="figs/DSTK_logo.jpg" width="100"></p>
+
+> We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard ...
 # Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
 Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit that aims to provide a complete speech discretization solution. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text into speech tokens. DSTK provides efficient, flexible, and modular building blocks for speech understanding, speech synthesis, multimodal learning, and other tasks.
 
figs/DSTK_logo.jpg
ADDED
Binary image file (tracked with Git LFS); preview omitted.
semantic_tokenizer/f40ms/README.md
CHANGED
@@ -1,5 +1,5 @@
 ## Speech Semantic Tokenizer
-As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq).
+As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq). On the other hand, our decoder is relatively simple, consisting of only four CNN layers. We believe that a simple and weak decoder is the key to training the tokenizer.
 <p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
 
 
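The sentence added in this hunk (a four-CNN-layer decoder as the key design choice) is compact enough to sketch. The following is an illustrative PyTorch module; layer widths, kernel sizes, and the phoneme inventory size are placeholder assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class WeakCNNDecoder(nn.Module):
    """Illustrative 'simple and weak' decoder: four CNN layers that map
    frame-level encoder features to per-frame phoneme logits."""

    def __init__(self, dim: int = 1024, n_phonemes: int = 200):
        super().__init__()
        blocks = []
        for _ in range(4):  # four CNN layers, per the README
            blocks += [nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*blocks)
        self.head = nn.Linear(dim, n_phonemes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim), e.g. hubert-large hidden states
        x = self.cnn(feats.transpose(1, 2)).transpose(1, 2)
        return self.head(x)  # (batch, time, n_phonemes)

# Smoke test with random features standing in for encoder output:
logits = WeakCNNDecoder()(torch.randn(2, 100, 1024))
```

The intuition stated in the README is that a weak decoder forces the discrete bottleneck, rather than the decoder, to carry the phonetic information.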
thirdparty/G2P/TN_processors.py
CHANGED
@@ -21,7 +21,7 @@ PUNCT_NORMALIZE = {',': ',', '。': '.', '、': ',', ';': ',', '‘': ',', '
 '︔': ',', '︓': ',', '︕': '!', '︖': '?', '︗': ',', '︘': ',', '︙': ',', '︰': ',', '︱': ',', '︳': ',', '︵': ',',
 '︶': ',', '︷': ',', '︸': ',', '︹': ',', '︺': ',', '︻': ',', '︼': ',', '︽': ',', '︾': ',', '︿': ',', '﹀': ',',
 '﹁': ',', '﹂': ',', '﹃': ',', '﹄': ',', ';': ',', '[': ',', ']': ',', '`': ',', ':': ',', '"': ',',
-'{': ',', '}': ',', '~': ',', ')': ',', '(': ',', '_': '"', '’': '\'', '^': ','}
+'{': ',', '}': ',', '~': ',', ')': ',', '(': ',', '_': '"', '’': '\'', '^': ',', '﹔': ','}
 
 ALPHABET_NORM = {'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', 'e': 'e', 'f': 'f', 'g': 'g', 'h': 'h', 'i': 'i', 'j': 'j', 'k': 'k', 'l': 'l', 'm': 'm',
 'n': 'n', 'o': 'o', 'p': 'p', 'q': 'q', 'r': 'r', 's': 's', 't': 't', 'u': 'u', 'v': 'v', 'w': 'w', 'x': 'x', 'y': 'y', 'z': 'z',