gooorillax committed
Commit 0e94272 · 1 Parent(s): 349aefa

refined READMEs, add eval report

.gitattributes CHANGED
@@ -47,3 +47,4 @@ figs/reconstruction.jpg filter=lfs diff=lfs merge=lfs -text
figs/tokenizer.jpg filter=lfs diff=lfs merge=lfs -text
004892.wav filter=lfs diff=lfs merge=lfs -text
00004557-00000030.wav filter=lfs diff=lfs merge=lfs -text
+ DSTK_Eval.pdf filter=lfs diff=lfs merge=lfs -text
DSTK_Eval.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b541b6ecbf5c77b78ba93b64ac4c62062869a6deabde1921c4dbff0451bacf54
+ size 115948
README.md CHANGED
@@ -4,48 +4,66 @@ language:
  - en
  - zh
 ---
- # Discrete Speech Tokenization Toolkit
+
+ > We choose to go to the moon, not because they are easy, but because they are hard.
+
+ # Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
+
+ The Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit designed to provide a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text content into speech tokens. DSTK offers efficient, flexible, and modular foundational components for tasks such as speech understanding, speech synthesis, and multimodal learning.
+
 ## Release Notes:
 V1.0

- This release consists of the following models:
- 1. A speech semantic tokenizer (25Hz, codebook size=4096) for Chinese and English
- 2. A corresponding speech detokenizer for Chinese and English
- 3. A text2token model (T2U) that converts text to speech tokens.
-
- These models achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset:
+ This release of DSTK includes three modules:
+ 1. Semantic Tokenizer
+    - Encodes the semantic information of speech into discrete speech tokens.
+    - Frame rate: 25 Hz; codebook size: 4096; supports both Chinese and English.
+ 2. Semantic Detokenizer
+    - Decodes discrete speech tokens into audible speech waveforms to reconstruct the speech.
+    - Supports both Chinese and English.
+ 3. Text2Token (T2U)
+    - Converts text content into speech tokens.
+
+ ## TTS pipeline
+ As shown in the figure below, the three modules can form a pipeline for the TTS task.
+ <p align="center"><img src="figs/TTS.jpg" width="1200"></p>
+
+ ## Non-parallel Speech Reconstruction Pipeline
+ As shown in the figure below, the tokenizer and detokenizer can also form a pipeline for the speech reconstruction task.
+ <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
+
+ These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset:
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>

 We also evaluated the ASR performance of our semantic tokenizer using an LLM as the backbone. Our model achieves performance comparable to models that use continuous speech representations.
 <p align="center"><img src="figs/eval3.jpg" width="1200"></p>

- ## Speech Semantic Tokenizer
- The speech semantic tokenizer is trained with labelled data.
- <p align="center"><img src="figs/tokenizer.jpg" width="800"></p>
-
- ## Speech Detokenizer
- Our speech detokenizer is developed based on [F5-TTS](https://github.com/SWivid/F5-TTS) with two major updates.
- 1. We adopt a DiT block with cross attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
- <p align="center"><img src="figs/CADiT.jpg" height="600"></p>
- 2. We introduced a chunk-wise streaming inference process that can generate speech of any length.
- <p align="center"><img src="figs/F5-streaming.jpg" width="1200"></p>
-
- ## Text2Token (T2U)
- Text2Token is a Transformer machine translation model, trained on about 380k hours of speech-text pairs with [fairseq](https://github.com/facebookresearch/fairseq).
-
- ## TTS pipeline
- As shown in tts_example.py, the three models can form a pipeline for the TTS task.
- <p align="center"><img src="figs/TTS.jpg" width="1200"></p>
-
- ## Non-parallel Speech Reconstruction Pipeline
- As shown in reconstruction_example.py, the tokenizer and detokenizer can form a pipeline for the speech reconstruction task.
- <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
+ ## More details about the three modules:
+ - [Semantic Tokenizer](semantic_tokenizer/f40ms/README.md)
+ - [Semantic Detokenizer](semantic_detokenizer/README.md)
+ - [Text2Token](text2token/README.md)
+
+ ## Installation
+
+ ### Create a separate environment if needed
+
+ ```bash
+ # Create a conda env with python>=3.10 (you could also use virtualenv)
+ conda create -n dstk python=3.10
+ conda activate dstk
+ ```
+
+ ## More tools to be released:
+ - 12.5Hz Streaming Semantic Tokenizer and Detokenizer
+ - Speech Normalized Tokenizer
+ - Speech Disentangled Tokenizer

 # Core Developers:
 [Daxin Tan]([email protected]), [Dehua Tao]([email protected]), [Yusen Sun]([email protected]) and [Xiao Chen]([email protected])

 ## Contributors:
- [Hanlin Zhang]([email protected])
+ [Hanlin Zhang]([email protected])
+
+ ## Former Contributors:
+ Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
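
For orientation, the TTS pipeline the new README describes (Text2Token followed by the detokenizer) can be sketched in a few lines of Python. The `dstk` imports and the `from_pretrained`/`generate`/`decode` names below are hypothetical stand-ins, not the repo's actual API; `tts_example.py` in the repo shows the real usage.

```python
# Hypothetical sketch of the TTS pipeline: text -> speech tokens -> waveform.
# Module and method names are illustrative assumptions; see tts_example.py.
import soundfile as sf  # assumed available for writing audio

from dstk import SemanticDetokenizer, Text2Token  # hypothetical imports

t2u = Text2Token.from_pretrained("text2token")  # text -> speech tokens
detokenizer = SemanticDetokenizer.from_pretrained("semantic_detokenizer")

tokens = t2u.generate("Hello, world!")     # discrete token IDs, 25 per second
waveform, sr = detokenizer.decode(tokens)  # tokens -> audio samples
sf.write("hello.wav", waveform, sr)
```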
README_CN.md CHANGED
@@ -1,48 +1,58 @@
- # Discrete Speech Tokenization Toolkit
+ > We choose to go to the moon, not because they are easy, but because they are hard.
+ # Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
+ The Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit designed to provide a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text content into speech tokens. DSTK offers efficient, flexible, and modular foundational components for tasks such as speech understanding, speech synthesis, and multimodal learning.
+
 ## Release Notes:
 V1.0

- This release of the toolkit includes the following models:
- 1. A speech tokenizer (25Hz, codebook size=4096) supporting Chinese and English
- 2. A corresponding detokenizer supporting Chinese and English
- 3. A text2token model (T2U) that converts text into speech tokens.
-
- These models achieved top-tier accuracy on TTS and speech reconstruction on the seed-tts-eval dataset:
+ This release of DSTK contains three modules:
+ 1. Semantic Tokenizer
+    - Encodes the semantic information of speech into discrete speech tokens.
+    - Frame rate: 25 Hz; codebook size: 4096; supports both Chinese and English.
+ 2. Semantic Detokenizer
+    - Decodes discrete speech tokens into audible speech waveforms to reconstruct the speech.
+    - Supports both Chinese and English.
+ 3. Text2Token (T2U)
+    - Converts text into speech tokens.
+
+ ## TTS pipeline
+ Chaining the three modules above implements TTS.
+ <p align="center"><img src="figs/TTS.jpg" width="1200"></p>
+
+ ## Non-parallel Speech Reconstruction Pipeline
+ Chaining the tokenizer and detokenizer implements speech reconstruction.
+ <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
+
+ The pipelines above achieve top-tier results on TTS and speech reconstruction on the seed-tts-eval dataset:
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>

 We evaluated the ASR accuracy of this speech tokenizer with an LLM as the backbone; our tokenizer reaches a level comparable to models that use continuous speech representations.
 <p align="center"><img src="figs/eval3.jpg" width="1200"></p>

- ## Speech Semantic Tokenizer
- This tokenizer is trained with a supervised learning method, using roughly 4,000 hours of Chinese and English speech-text paired data sampled from open-source datasets, with a 1:1 ratio between the two languages.
- <p align="center"><img src="figs/tokenizer.jpg" width="800"></p>
-
- ## Speech Detokenizer
- This detokenizer is developed based on [F5-TTS](https://github.com/SWivid/F5-TTS), with two improvements:
- 1. It adopts DiT with cross attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
- <p align="center"><img src="figs/CADiT.jpg" height="600"></p>
- 2. A chunk-based streaming inference process was developed.
- <p align="center"><img src="figs/F5-streaming.jpg" width="1200"></p>
- The released detokenizer was trained on about 6,000 hours of Chinese and English data, including Wenet4TTS (premium, standard), LibriTTS, and others.
-
- ## Text2Token (T2U)
- Text2Token is a Transformer machine translation model, trained with [fairseq](https://github.com/facebookresearch/fairseq) on about 380k hours of speech-text paired data.
-
- ## TTS pipeline
- tts_example.py gives an example that chains the three models above to implement TTS.
- <p align="center"><img src="figs/TTS.jpg" width="1200"></p>
-
- ## Non-parallel Speech Reconstruction Pipeline
- reconstruction_example.py gives another example that chains the tokenizer and detokenizer to implement speech reconstruction.
- <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
-
+ ## More details about the three modules:
+ - [Semantic Tokenizer](semantic_tokenizer/f40ms/README.md)
+ - [Semantic Detokenizer](semantic_detokenizer/README.md)
+ - [Text2Token](text2token/README.md)
+
+ ## Installation
+
+ ### Create a separate environment if needed
+
+ ```bash
+ # Create a conda env with python>=3.10 (you could also use virtualenv)
+ conda create -n dstk python=3.10
+ conda activate dstk
+ ```
+
 # Core Developers:
 [Daxin Tan]([email protected]), [Dehua Tao]([email protected]), [Yusen Sun]([email protected]) and [Xiao Chen]([email protected])

 ## Contributors:
 [Hanlin Zhang]([email protected])
+
+ ## Former Contributors:
+ Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
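
Likewise, the non-parallel reconstruction pipeline (tokenizer followed by detokenizer) can be sketched as below. All module and method names are hypothetical stand-ins; `reconstruction_example.py` in the repo shows the real usage.

```python
# Hypothetical sketch of non-parallel reconstruction:
# waveform -> semantic tokens -> waveform. Names are illustrative
# assumptions; see reconstruction_example.py for the actual usage.
import soundfile as sf

from dstk import SemanticDetokenizer, SemanticTokenizer  # hypothetical imports

tokenizer = SemanticTokenizer.from_pretrained("semantic_tokenizer/f40ms")
detokenizer = SemanticDetokenizer.from_pretrained("semantic_detokenizer")

audio, sr = sf.read("004892.wav")           # sample audio shipped with the repo
tokens = tokenizer.encode(audio, sr)        # IDs in [0, 4096), ~25 per second
recon, out_sr = detokenizer.decode(tokens)  # resynthesize from tokens alone
sf.write("004892_recon.wav", recon, out_sr)
```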
figs/eval3.jpg CHANGED

Git LFS Details (old file)
• SHA256: fbfae531059839cc3041e99a397e7385461f7c96cd7eea8b382d54cf63e9e717
• Pointer size: 131 Bytes
• Size of remote file: 306 kB

Git LFS Details (new file)
• SHA256: 12b80fdc1bebf9384af07e18bee489d19bb6cae24097e64b193c680d5b29a38a
• Pointer size: 131 Bytes
• Size of remote file: 110 kB
semantic_detokenizer/README.md ADDED
@@ -0,0 +1,10 @@
+ ## Speech Detokenizer
+ Our detokenizer is developed based on the [F5-TTS](https://github.com/SWivid/F5-TTS) framework and features two specific improvements.
+
+ 1. The DiT module has been substituted by a DiT variant with cross-attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
+ <p align="center"><img src="../figs/CADiT.jpg" height="600"></p>
+
+ 2. A chunk-based streaming inference algorithm is developed, which allows the model to generate speech of any length.
+ <p align="center"><img src="../figs/F5-streaming.jpg" width="1200"></p>
+
+ The detokenizer released this time was trained on approximately 6,000 hours of Chinese and English data, including Wenet4TTS (both premium and standard), LibriTTS, and others.
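
To make the chunk-wise streaming idea concrete, here is a minimal NumPy sketch of one common pattern: decode fixed-size token chunks and crossfade the waveform seams. The chunk length, overlap, and `decode_chunk` callable are assumptions for illustration; DSTK's actual algorithm (which, per the figure, conditions each chunk on previously generated context) may differ.

```python
# Illustrative chunk-wise streaming decode with a linear crossfade at seams.
# CHUNK_TOKENS, OVERLAP, and decode_chunk are assumed, not DSTK internals.
import numpy as np

CHUNK_TOKENS = 50  # e.g. 2 s of tokens at 25 Hz (assumed)
OVERLAP = 480      # crossfade length in samples, e.g. 20 ms at 24 kHz (assumed)

def stream_decode(tokens, decode_chunk):
    """Decode `tokens` chunk by chunk; `decode_chunk` maps a token chunk to a waveform."""
    out = np.zeros(0, dtype=np.float32)
    for start in range(0, len(tokens), CHUNK_TOKENS):
        wav = np.asarray(decode_chunk(tokens[start:start + CHUNK_TOKENS]),
                         dtype=np.float32)
        n = min(OVERLAP, len(out), len(wav))
        if n:  # blend the tail of the output with the head of the new chunk
            fade = np.linspace(0.0, 1.0, n, dtype=np.float32)
            out[-n:] = out[-n:] * (1.0 - fade) + wav[:n] * fade
            wav = wav[n:]
        out = np.concatenate([out, wav])
    return out
```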
semantic_tokenizer/f40ms/README.md CHANGED
@@ -1,7 +1,10 @@
- # SpeechTokenizerInference
-
- # 25Hz(80ms) speech semantic tokenizer based on fairseq
-
+ ## Speech Semantic Tokenizer
+ As illustrated below, this tokenizer is trained using a supervised learning method. The phoneme sequences corresponding to the text are used as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of speech-text data in Chinese and English, sampled from open-source datasets with a 1:1 ratio between the two languages.
+ <p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
+
+ To run this semantic tokenizer alone, the required packages should be installed.
+ ```bash
+ # install requirements for this semantic tokenizer
 pip install -r requirements_npu.txt
-
-
+ ```
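
As a quick sanity check on the stated specs (25 Hz frame rate, 4096-entry codebook), the semantic token stream works out to a very low bitrate:

```python
# Back-of-the-envelope bitrate implied by the stated tokenizer specs.
import math

frame_rate = 25                            # tokens per second (25 Hz)
codebook_size = 4096                       # codebook entries
bits_per_token = math.log2(codebook_size)  # 12.0 bits per token
print(frame_rate * bits_per_token)         # 300.0 bits/s for the token stream
```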
 
text2token/README.md ADDED
@@ -0,0 +1,2 @@
+ ## Text2Token (T2U)
+ The Text2Token module is a Transformer-based translation model. It takes phonemes as input, which can be converted from text using the G2P module. The Text2Token model released this time was trained on approximately 380k hours of speech-text paired data with [fairseq](https://github.com/facebookresearch/fairseq).
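
The flow described above (text, then G2P phonemes, then speech tokens) can be summarized in a small sketch; `g2p` and `t2u` below are hypothetical stand-ins for the repo's `thirdparty/G2P` module and the fairseq-trained model, and the call signatures are assumptions.

```python
# Hypothetical end-to-end Text2Token flow: text -> phonemes (G2P) -> token IDs.
# `g2p` and `t2u` stand in for thirdparty/G2P and the fairseq-trained model;
# the signatures below are illustrative assumptions, not the actual API.

def text_to_tokens(text: str, g2p, t2u) -> list[int]:
    phonemes = g2p(text)            # e.g. "hello" -> ["HH", "AH0", "L", "OW1"]
    return t2u.translate(phonemes)  # phoneme sequence -> discrete speech tokens
```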