gooorillax committed
Commit bdecca1 · 1 Parent(s): c831dba

refine readme, add logo, and fix a punct normalization problem in tn
.gitattributes CHANGED
@@ -48,3 +48,4 @@ figs/tokenizer.jpg filter=lfs diff=lfs merge=lfs -text
 004892.wav filter=lfs diff=lfs merge=lfs -text
 00004557-00000030.wav filter=lfs diff=lfs merge=lfs -text
 DSTK_Eval.pdf filter=lfs diff=lfs merge=lfs -text
+figs/DSTK_logo.jpg filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -12,7 +12,9 @@ tags:
 - text2token
 ---
 
-> We choose to go to the moon, not because they are easy, but because they are hard.
+<p align="center"><img src="figs/DSTK_logo.jpg" width="100"></p>
+
+> We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard ...
 
 # Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
 
@@ -41,6 +43,7 @@ As shown in the figure below, the tokenizer and detokenizer can also form a pipeline
 <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
 
 These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset, with fewer parameters and much less supervised training data:
+- All our experiments were conducted on the Ascend 910B; the results may differ slightly from those obtained on GPUs.
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
 
@@ -72,6 +75,13 @@ sh install_requirements.sh
 # run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
 sh thirdparty/G2P/patch_for_deps.sh
 ```
+### Run on Ascend 910B platforms
+```bash
+# the env variable TOKENIZE_ON_NPU needs to be defined
+export TOKENIZE_ON_NPU=1
+# this variable is not needed on GPUs; simply leave it undefined
+```
+
 
 ### Download the vocos vocoder from [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)
 
@@ -228,4 +238,6 @@ Discrete Speech Team, HKRC, Huawei
 Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
 
 ## Acknowledgement
 We express our sincere gratitude to HKRC for their support of this project. Their contribution is gratefully acknowledged.
+
+Special thanks to the [Textless NLP Project](https://speechbot.github.io/), which has inspired us to embark on this research direction.
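Since `TOKENIZE_ON_NPU` is a plain environment variable, downstream code can branch on it roughly like this. This is only a sketch of the convention described in the README, not DSTK's actual device-selection code; the `pick_device` helper is hypothetical:

```python
import os

def use_npu() -> bool:
    # Any non-empty value of TOKENIZE_ON_NPU opts into Ascend NPU
    # execution; on GPUs the variable is simply left undefined.
    return bool(os.environ.get("TOKENIZE_ON_NPU"))

def pick_device() -> str:
    # Hypothetical helper: map the flag to a torch-style device string.
    return "npu" if use_npu() else "cuda"
```

Leaving the variable undefined (rather than setting it to `0`) is the documented way to stay on the GPU path.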
README_CN.md CHANGED
@@ -1,4 +1,6 @@
-> We choose to go to the moon, not because they are easy, but because they are hard.
+<p align="center"><img src="figs/DSTK_logo.jpg" width="100"></p>
+
+> We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard ...
 # Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
 Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit that aims to provide a complete speech discretization solution. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text into speech tokens. DSTK provides efficient, flexible, and modular building blocks for speech understanding, speech synthesis, multimodal learning, and other tasks.
 
figs/DSTK_logo.jpg ADDED

Git LFS Details

  • SHA256: e8273396415b4cf775ea445f846e025932723ffcd8500bbbcc474ff7a96f1d7f
  • Pointer size: 131 Bytes
  • Size of remote file: 183 kB
semantic_tokenizer/f40ms/README.md CHANGED
@@ -1,5 +1,5 @@
 ## Speech Semantic Tokenizer
-As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq).
+As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq). Our decoder, on the other hand, is relatively simple, consisting of only four CNN layers. We believe that a simple, weak decoder is the key to training the tokenizer.
 <p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
 
 
thirdparty/G2P/TN_processors.py CHANGED
@@ -21,7 +21,7 @@ PUNCT_NORMALIZE = {',': ',', '。': '.', '、': ',', ';': ',', '‘': ',', '
 '︔': ',', '︓': ',', '︕': '!', '︖': '?', '︗': ',', '︘': ',', '︙': ',', '︰': ',', '︱': ',', '︳': ',', '︵': ',',
 '︶': ',', '︷': ',', '︸': ',', '︹': ',', '︺': ',', '︻': ',', '︼': ',', '︽': ',', '︾': ',', '︿': ',', '﹀': ',',
 '﹁': ',', '﹂': ',', '﹃': ',', '﹄': ',', ';': ',', '[': ',', ']': ',', '`': ',', ':': ',', '"': ',',
-'{': ',', '}': ',', '~': ',', ')': ',', '(': ',', '_': '"', '’': '\'', '^': ','}
+'{': ',', '}': ',', '~': ',', ')': ',', '(': ',', '_': '"', '’': '\'', '^': ',', '﹔': ','}

 ALPHABET_NORM = {'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', 'e': 'e', 'f': 'f', 'g': 'g', 'h': 'h', 'i': 'i', 'j': 'j', 'k': 'k', 'l': 'l', 'm': 'm',
 'n': 'n', 'o': 'o', 'p': 'p', 'q': 'q', 'r': 'r', 's': 's', 't': 't', 'u': 'u', 'v': 'v', 'w': 'w', 'x': 'x', 'y': 'y', 'z': 'z',
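The one-character fix above adds the small-form semicolon `'﹔'` to `PUNCT_NORMALIZE` so it is normalized like the other CJK punctuation. A minimal sketch of how such a single-character mapping table is typically applied, using a trimmed-down copy of the table (the `normalize_punct` helper is illustrative, not the repo's actual function):

```python
# Trimmed-down subset of PUNCT_NORMALIZE from thirdparty/G2P/TN_processors.py;
# the table in the repo is much larger.
PUNCT_NORMALIZE = {
    ',': ',', '。': '.', ')': ',', '(': ',', '’': '\'',
    '﹔': ',',  # small-form semicolon: the entry this commit adds
}

# str.maketrans accepts a {char: char} dict directly.
_TABLE = str.maketrans(PUNCT_NORMALIZE)

def normalize_punct(text: str) -> str:
    """Map full-width/CJK punctuation to its ASCII counterpart."""
    return text.translate(_TABLE)
```

Before the fix, `'﹔'` had no entry and passed through normalization untouched, which is the punct normalization problem the commit message refers to.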