Commit bdecca1
Parent(s): c831dba

refine readme, add logo, and fix a punct normalization problem in tn
Files changed:
- .gitattributes +1 -0
- README.md +14 -2
- README_CN.md +3 -1
- figs/DSTK_logo.jpg +3 -0
- semantic_tokenizer/f40ms/README.md +1 -1
- thirdparty/G2P/TN_processors.py +1 -1
.gitattributes
CHANGED
@@ -48,3 +48,4 @@ figs/tokenizer.jpg filter=lfs diff=lfs merge=lfs -text
 004892.wav filter=lfs diff=lfs merge=lfs -text
 00004557-00000030.wav filter=lfs diff=lfs merge=lfs -text
 DSTK_Eval.pdf filter=lfs diff=lfs merge=lfs -text
+figs/DSTK_logo.jpg filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -12,7 +12,9 @@ tags:
 - text2token
 ---
 
-
+<p align="center"><img src="figs/DSTK_logo.jpg" width="100"></p>
+
+> We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard ...
 
 # Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
 
@@ -41,6 +43,7 @@ As shown in figure below, the tokenizer and detokenizer could also form a pipeli
 <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
 
 These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset, with fewer parameters and much less supervised data for training:
+- All our experiments were conducted on the Ascend 910B, and the experimental results may differ slightly from those obtained on GPUs.
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
 
@@ -72,6 +75,13 @@ sh install_requirements.sh
 # run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
 sh thirdparty/G2P/patch_for_deps.sh
 ```
+### Run on Ascend 910B platforms
+```bash
+# the env variable TOKENIZE_ON_NPU needs to be defined
+export TOKENIZE_ON_NPU=1
+# this env variable is not needed for GPUs; just do not define it
+```
+
 
 ### Download the vocos vocoder from [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)
 
@@ -228,4 +238,6 @@ Discrete Speech Team, HKRC, Huawei
 Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
 
 ## Acknowledgement
-We express our sincere gratitude to HKRC for their support of this project. Their contribution is gratefully acknowledged.
+We express our sincere gratitude to HKRC for their support of this project. Their contribution is gratefully acknowledged.
+
+Special thanks to the [Textless NLP Project](https://speechbot.github.io/), which has inspired us to embark on this research direction.
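The new Ascend section above only sets an environment variable; as a hedged illustration of how such a flag is typically consumed, here is a minimal Python sketch. The `torch_npu` import and the device-selection logic are assumptions about the Ascend PyTorch plugin, not code from this commit:

```python
import os

import torch

# Hypothetical sketch: branch on the TOKENIZE_ON_NPU variable documented
# in the README above. Importing torch_npu (Ascend's PyTorch plugin)
# registers the "npu" device type; this is assumed, not the toolkit's code.
if os.environ.get("TOKENIZE_ON_NPU"):
    import torch_npu  # noqa: F401
    device = torch.device("npu")
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"tokenizing on: {device}")
```

Leaving the variable undefined keeps the GPU/CPU path, which matches the comment in the added bash block.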
README_CN.md
CHANGED
@@ -1,4 +1,6 @@
-
+<p align="center"><img src="figs/DSTK_logo.jpg" width="100"></p>
+
+> We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard ...
 # Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
 Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit that aims to provide a complete speech discretization solution. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text into speech tokens. DSTK provides efficient, flexible, and modular building blocks for speech understanding, speech synthesis, multimodal learning, and other tasks.
 
figs/DSTK_logo.jpg
ADDED
Binary image file (tracked with Git LFS); preview omitted.
semantic_tokenizer/f40ms/README.md
CHANGED
@@ -1,5 +1,5 @@
 ## Speech Semantic Tokenizer
-As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq).
+As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, which was sampled from open-source datasets. The ratio between the two languages was 1:1. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq). On the other hand, our decoder is relatively simple, consisting of only four CNN layers. We believe that a simple and weak decoder is the key to training the tokenizer.
 <p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
 
 
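The sentence added in this hunk (a four-CNN-layer decoder as the key design choice) is compact enough to sketch. The following is an illustrative PyTorch module; layer widths, kernel sizes, and the phoneme inventory size are placeholder assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class WeakCNNDecoder(nn.Module):
    """Illustrative 'simple and weak' decoder: four CNN layers that map
    frame-level encoder features to per-frame phoneme logits."""

    def __init__(self, dim: int = 1024, n_phonemes: int = 200):
        super().__init__()
        blocks = []
        for _ in range(4):  # four CNN layers, per the README
            blocks += [nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*blocks)
        self.head = nn.Linear(dim, n_phonemes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim), e.g. hubert-large hidden states
        x = self.cnn(feats.transpose(1, 2)).transpose(1, 2)
        return self.head(x)  # (batch, time, n_phonemes)

# Smoke test with random features standing in for encoder output:
logits = WeakCNNDecoder()(torch.randn(2, 100, 1024))
```

The intuition stated in the README is that a weak decoder forces the discrete bottleneck, rather than the decoder, to carry the phonetic information.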
thirdparty/G2P/TN_processors.py
CHANGED
@@ -21,7 +21,7 @@ PUNCT_NORMALIZE = {',': ',', '。': '.', '、': ',', ';': ',', '‘': ',', '
 '︔': ',', '︓': ',', '︕': '!', '︖': '?', '︗': ',', '︘': ',', '︙': ',', '︰': ',', '︱': ',', '︳': ',', '︵': ',',
 '︶': ',', '︷': ',', '︸': ',', '︹': ',', '︺': ',', '︻': ',', '︼': ',', '︽': ',', '︾': ',', '︿': ',', '﹀': ',',
 '﹁': ',', '﹂': ',', '﹃': ',', '﹄': ',', ';': ',', '[': ',', ']': ',', '`': ',', ':': ',', '"': ',',
-'{': ',', '}': ',', '~': ',', ')': ',', '(': ',', '_': '"', '’': '\'', '^': ','}
+'{': ',', '}': ',', '~': ',', ')': ',', '(': ',', '_': '"', '’': '\'', '^': ',', '﹔': ','}
 
 ALPHABET_NORM = {'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', 'e': 'e', 'f': 'f', 'g': 'g', 'h': 'h', 'i': 'i', 'j': 'j', 'k': 'k', 'l': 'l', 'm': 'm',
 'n': 'n', 'o': 'o', 'p': 'p', 'q': 'q', 'r': 'r', 's': 's', 't': 't', 'u': 'u', 'v': 'v', 'w': 'w', 'x': 'x', 'y': 'y', 'z': 'z',