Commit 0e94272
Parent(s): 349aefa

refined READMEs, add eval report

Changed files:
- .gitattributes +1 -0
- DSTK_Eval.pdf +3 -0
- README.md +44 -26
- README_CN.md +36 -26
- figs/eval3.jpg +2 -2
- semantic_detokenizer/README.md +10 -0
- semantic_tokenizer/f40ms/README.md +7 -4
- text2token/README.md +2 -0
.gitattributes
CHANGED
@@ -47,3 +47,4 @@ figs/reconstruction.jpg filter=lfs diff=lfs merge=lfs -text
 figs/tokenizer.jpg filter=lfs diff=lfs merge=lfs -text
 004892.wav filter=lfs diff=lfs merge=lfs -text
 00004557-00000030.wav filter=lfs diff=lfs merge=lfs -text
+DSTK_Eval.pdf filter=lfs diff=lfs merge=lfs -text
DSTK_Eval.pdf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b541b6ecbf5c77b78ba93b64ac4c62062869a6deabde1921c4dbff0451bacf54
+size 115948
README.md
CHANGED
@@ -4,48 +4,66 @@ language:
 - en
 - zh
 ---
+
+> We choose to go to the moon, not because they are easy, but because they are hard.
+
+# Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
+
+The Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit designed to provide a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text content into speech tokens. DSTK offers efficient, flexible, and modular foundational components for tasks such as speech understanding, speech synthesis, and multimodal learning.
+
 ## Release Notes:
 V1.0
 
-This release
-1.
+This release of DSTK includes three modules:
+1. Semantic Tokenizer
+   - Encodes the semantic information of speech into discrete speech tokens.
+   - Frame rate: 25 Hz; codebook size: 4096; supports both Chinese and English.
+2. Semantic Detokenizer
+   - Decodes discrete speech tokens into audible speech waveforms to reconstruct the speech.
+   - Supports both Chinese and English.
+3. Text2Token (T2U)
+   - Converts text content into speech tokens.
 
+## TTS pipeline
+As shown in the figure below, the three modules can form a pipeline for the TTS task.
+<p align="center"><img src="figs/TTS.jpg" width="1200"></p>
+
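In code, this pipeline is a two-stage chain: text is first mapped to discrete semantic tokens, which are then rendered to a waveform. The sketch below is a minimal illustration assuming hypothetical loader and method names; it is not DSTK's actual API.

```python
# Hypothetical sketch of the two-stage TTS pipeline; `t2u` and `detokenizer`
# stand in for DSTK's Text2Token and Semantic Detokenizer modules, and their
# method names are illustrative, not the toolkit's real API.

def tts(text: str, t2u, detokenizer, prompt_wav=None):
    tokens = t2u.generate(text)  # text -> discrete speech tokens
    # The detokenizer renders tokens to audio; the optional reference prompt
    # (an assumption here) would control the output voice.
    return detokenizer.synthesize(tokens, prompt=prompt_wav)
```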
+## Non-parallel Speech Reconstruction Pipeline
+As shown in the figure below, the tokenizer and detokenizer can also form a pipeline for the speech reconstruction task.
+<p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
+
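Reconstruction is the same chain run from audio instead of text; because only discrete tokens cross the boundary, the output is a re-synthesis of the input rather than a sample-level copy. Again, a minimal sketch with placeholder names:

```python
# Hypothetical sketch of non-parallel reconstruction; `tokenizer` and
# `detokenizer` are placeholders for DSTK's modules, not the real API.

def reconstruct(wav, sample_rate: int, tokenizer, detokenizer):
    tokens = tokenizer.encode(wav, sample_rate)  # speech -> 25 Hz discrete tokens
    return detokenizer.synthesize(tokens)        # tokens -> re-synthesized speech
```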
+These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset:
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
 
 We also evaluated the ASR performance of our semantic tokenizer using an LLM as the backbone. Our model achieves performance comparable to models that use continuous speech representations.
 <p align="center"><img src="figs/eval3.jpg" width="1200"></p>
 
-Our speech detokenizer is developed based on [F5-TTS](https://github.com/SWivid/F5-TTS) with two major updates added.
-1. we adopt DiT block with cross attention, which is similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
-<p align="center"><img src="figs/CADiT.jpg" height="600"></p>
-<p align="center"><img src="figs/F5-streaming.jpg" width="1200"></p>
-<p align="center"><img src="figs/TTS.jpg" width="1200"></p>
+## More details about the 3 models:
+- [Semantic Tokenizer](semantic_tokenizer/f40ms/README.md)
+- [Semantic Detokenizer](semantic_detokenizer/README.md)
+- [Text2Token](text2token/README.md)
 
+## Installation
+
+### Create a separate environment if needed
+
+```bash
+# Create a conda env with python_version>=3.10 (you could also use virtualenv)
+conda create -n dstk python=3.10
+conda activate dstk
+```
 
+## More tools to be released:
+- 12.5Hz Streaming Semantic Tokenizer and Detokenizer
+- Speech Normalized Tokenizer
+- Speech Disentangled Tokenizer
 
 # Core Developers:
 [Daxin Tan]([email protected]), [Dehua Tao]([email protected]), [Yusen Sun]([email protected]) and [Xiao Chen]([email protected])
 
 ## Contributors:
-[Hanlin Zhang]([email protected])
+[Hanlin Zhang]([email protected])
+
+## Former Contributors:
+Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
README_CN.md
CHANGED
@@ -1,48 +1,58 @@
+> We choose to go to the moon, not because they are easy, but because they are hard.
+# Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
+The Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit designed to provide a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text content into speech tokens. DSTK offers efficient, flexible, and modular foundational components for tasks such as speech understanding, speech synthesis, and multimodal learning.
+
 ## Release Notes:
 V1.0
 
-1.
+This release of DSTK includes three modules:
+1. Semantic Tokenizer
+   - Encodes the semantic information of speech into discrete speech tokens
+   - Frame rate: 25 Hz; codebook size: 4096; supports Chinese and English
+2. Semantic Detokenizer
+   - Decodes discrete speech tokens into audible speech waveforms to reconstruct the speech
+   - Supports Chinese and English
+3. Text2Token
+   - Converts text into speech tokens
+
+## TTS pipeline
+The three models above are chained to implement TTS.
+<p align="center"><img src="figs/TTS.jpg" width="1200"></p>
+
+## Non-parallel Speech Reconstruction Pipeline
+The tokenizer and detokenizer are chained to implement speech reconstruction.
+<p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
 
+These pipelines reach top-tier performance on the TTS and speech reconstruction tasks of the seed-tts-eval dataset:
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
 
 We evaluated the ASR accuracy of this speech tokenizer with an LLM backbone; our tokenizer reaches a level comparable to models that use continuous speech representations.
 <p align="center"><img src="figs/eval3.jpg" width="1200"></p>
 
-## Speech Semantic Tokenizer
-This tokenizer is trained with a supervised learning method. Training used roughly 4,000 hours of Chinese and English speech-text pair data sampled from open-source corpora, with a 1:1 language ratio.
-<p align="center"><img src="figs/tokenizer.jpg" width="800"></p>
-
-This detokenizer is developed based on [F5-TTS](https://github.com/SWivid/F5-TTS), with two improvements:
-1. It adopts DiT with cross attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
-<p align="center"><img src="figs/CADiT.jpg" height="600"></p>
-<p align="center"><img src="figs/F5-streaming.jpg" width="1200"></p>
-
-## Text2Token(T2U)
-Text2Token is a Transformer machine translation model trained with [fairseq](https://github.com/facebookresearch/fairseq) on roughly 380k hours of speech-text pair data.
-
-## TTS pipeline
-tts_example.py gives an example that chains the three models above to implement TTS.
-<p align="center"><img src="figs/TTS.jpg" width="1200"></p>
-
-## Non-parallel Speech Reconstruction Pipeline
-reconstruction_example.py gives another example that chains the tokenizer and detokenizer to implement speech reconstruction.
-<p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
+## More details about the three modules:
+- [Semantic Tokenizer](semantic_tokenizer/f40ms/README.md)
+- [Semantic Detokenizer](semantic_detokenizer/README.md)
+- [Text2Token](text2token/README.md)
 
+## Installation
+
+### Create a separate environment if needed
+
+```bash
+# Create a conda env with python_version>=3.10 (you could also use virtualenv)
+conda create -n dstk python=3.10
+conda activate dstk
+```
 
 # Core Developers:
 [Daxin Tan]([email protected]), [Dehua Tao]([email protected]), [Yusen Sun]([email protected]) and [Xiao Chen]([email protected])
 
 ## Contributors:
 [Hanlin Zhang]([email protected])
+
+## Former Contributors:
+Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
figs/eval3.jpg
CHANGED
(Binary image updated; tracked with Git LFS.)
semantic_detokenizer/README.md
ADDED
@@ -0,0 +1,10 @@
+## Speech Detokenizer
+#### Our detokenizer is developed based on the [F5-TTS](https://github.com/SWivid/F5-TTS) framework and features two specific improvements.
+
+1. The DiT module has been replaced by a DiT variant with cross-attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
+<p align="center"><img src="../figs/CADiT.jpg" height="600"></p>
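The pattern is a standard DiT block augmented with a cross-attention sub-layer that attends to the semantic-token condition. A minimal PyTorch sketch of that pattern follows; layer sizes, norm placement, and conditioning details are assumptions, not this repo's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """DiT-style block: adaLN-modulated self-attention and MLP, plus
    cross-attention from acoustic frames to semantic-token embeddings."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # scale/shift/gate for the attention and MLP paths, from the step embedding
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, tokens, t_emb):
        # x: (B, T, D) noisy acoustic frames; tokens: (B, S, D) token condition;
        # t_emb: (B, D) diffusion/flow step embedding
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, tokens, tokens, need_weights=False)[0]
        h = self.norm3(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)
```

With, say, `dim=512` and `n_heads=8`, a stack of such blocks maps `(B, T, 512)` acoustic frames conditioned on `(B, S, 512)` token embeddings and a `(B, 512)` step embedding.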
+
+2. A chunk-based streaming inference algorithm has been developed, which allows the model to generate speech of any length.
+<p align="center"><img src="../figs/F5-streaming.jpg" width="1200"></p>
+
+#### The detokenizer released this time was trained on approximately 6,000 hours of Chinese and English data. This dataset includes WenetSpeech4TTS (both premium and standard), LibriTTS, and others.
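Chunked decoding keeps memory bounded regardless of utterance length. The sketch below shows the general shape of such a loop; `detokenize_chunk` is a hypothetical callable standing in for one generation call, and the repo's actual chunk sizes and overlap scheme (see figs/F5-streaming.jpg) may differ.

```python
# Minimal sketch of chunk-based streaming inference; `detokenize_chunk` and its
# signature are assumptions, not this repo's actual API.

def stream_detokenize(tokens, detokenize_chunk, chunk_size=50, overlap=10):
    """Decode an arbitrarily long token sequence chunk by chunk.

    Each chunk is rendered with a small token overlap and conditioned on the
    previously generated audio, so timbre and prosody stay continuous across
    chunk boundaries while memory use stays bounded.
    """
    prev_audio = None
    for start in range(0, len(tokens), chunk_size):
        ctx = tokens[max(0, start - overlap):start + chunk_size]
        audio = detokenize_chunk(ctx, prompt=prev_audio)  # assumed signature
        prev_audio = audio  # the next chunk continues from this output
        yield audio
```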
semantic_tokenizer/f40ms/README.md
CHANGED
@@ -1,7 +1,10 @@
+## Speech Semantic Tokenizer
+As illustrated below, this tokenizer is trained using a supervised learning method: the phoneme sequences corresponding to the text are used as labels, with the grapheme-to-phoneme (G2P) conversion module located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of Chinese and English speech-text data sampled from open-source datasets, with a 1:1 ratio between the two languages.
+<p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
 
-# 25Hz(80ms) speech semantic tokenizer based on fairseq
 
+To run this semantic tokenizer alone, install the required packages:
+```bash
+# install requirements for this semantic tokenizer
 pip install -r requirements_npu.txt
+```
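To make the supervision concrete, the sketch below assembles one training example: the transcript is converted to phonemes by G2P, and those phonemes serve as the label sequence for the 25 Hz encoder outputs. The `g2p` callable stands in for whatever interface `thirdparty/G2P` exposes, and the alignment loss between frames and phoneme labels is not specified in this README.

```python
# Hedged sketch of preparing one supervised example; `g2p` is a stand-in for
# the thirdparty/G2P interface, and the training loss is not specified here.

def make_example(wav_path: str, transcript: str, g2p) -> dict:
    phonemes = g2p(transcript)  # e.g. "hello" -> ["HH", "AH0", "L", "OW1"]
    return {"audio": wav_path, "labels": phonemes}

# dummy g2p (whitespace split) for illustration only
example = make_example("004892.wav", "HH AH0 L OW1", g2p=str.split)
```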
text2token/README.md
ADDED
@@ -0,0 +1,2 @@
+## Text2Token (T2U)
+The Text2Token module is a Transformer-based translation model. It takes phonemes as input, which can be converted from text using the G2P module. The Text2Token model released this time was trained on approximately 380k hours of speech-text paired data with [fairseq](https://github.com/facebookresearch/fairseq).
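Since the model is trained with fairseq, one plausible way to run it is through fairseq's hub interface; the directory and file names below are placeholders, and DSTK may wrap the model differently.

```python
# Plausible way to run a fairseq-trained translation model as T2U; paths and
# file names are placeholders, not this repo's actual layout.
from fairseq.models.transformer import TransformerModel

t2u = TransformerModel.from_pretrained(
    "text2token",                          # placeholder: model directory
    checkpoint_file="checkpoint_best.pt",  # placeholder: trained T2U checkpoint
    data_name_or_path="data-bin",          # placeholder: fairseq binarized data
)

phonemes = "HH AH0 L OW1"                # phoneme sequence from the G2P module
speech_tokens = t2u.translate(phonemes)  # space-separated speech-token ids as text
ids = [int(t) for t in speech_tokens.split()]
```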