Commit 0e94272
Parent(s): 349aefa

refined READMEs, add eval report

Changed files:
- .gitattributes +1 -0
- DSTK_Eval.pdf +3 -0
- README.md +44 -26
- README_CN.md +36 -26
- figs/eval3.jpg +2 -2
- semantic_detokenizer/README.md +10 -0
- semantic_tokenizer/f40ms/README.md +7 -4
- text2token/README.md +2 -0
.gitattributes
CHANGED
@@ -47,3 +47,4 @@ figs/reconstruction.jpg filter=lfs diff=lfs merge=lfs -text
 figs/tokenizer.jpg filter=lfs diff=lfs merge=lfs -text
 004892.wav filter=lfs diff=lfs merge=lfs -text
 00004557-00000030.wav filter=lfs diff=lfs merge=lfs -text
+DSTK_Eval.pdf filter=lfs diff=lfs merge=lfs -text
DSTK_Eval.pdf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b541b6ecbf5c77b78ba93b64ac4c62062869a6deabde1921c4dbff0451bacf54
+size 115948
README.md
CHANGED
@@ -4,48 +4,66 @@ language:
 - en
 - zh
 ---
+
+> We choose to go to the moon, not because they are easy, but because they are hard.
+
+# Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
+
+The Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit designed to provide a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text content into speech tokens. DSTK offers efficient, flexible, and modular foundational components for tasks such as speech understanding, speech synthesis, and multimodal learning.
+
 ## Release Notes:
 V1.0
 
-This release
-1.
+This release of DSTK includes three modules:
+1. Semantic Tokenizer
+   - Encodes the semantic information of speech into discrete speech tokens.
+   - Frame rate: 25 Hz; codebook size: 4096; supports both Chinese and English.
+2. Semantic Detokenizer
+   - Decodes discrete speech tokens into audible speech waveforms to reconstruct the speech.
+   - Supports both Chinese and English.
+3. Text2Token (T2U)
+   - Converts text content into speech tokens.
 
+## TTS pipeline
+As shown in the figure below, the three modules can form a pipeline for the TTS task.
+<p align="center"><img src="figs/TTS.jpg" width="1200"></p>
+
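In code, this pipeline is a two-stage chain: text is first mapped to discrete semantic tokens, which are then rendered to a waveform. The sketch below is a minimal illustration assuming hypothetical loader and method names; it is not DSTK's actual API.

```python
# Hypothetical sketch of the two-stage TTS pipeline; `t2u` and `detokenizer`
# stand in for DSTK's Text2Token and Semantic Detokenizer modules, and their
# method names are illustrative, not the toolkit's real API.

def tts(text: str, t2u, detokenizer, prompt_wav=None):
    tokens = t2u.generate(text)  # text -> discrete speech tokens
    # The detokenizer renders tokens to audio; the optional reference prompt
    # (an assumption here) would control the output voice.
    return detokenizer.synthesize(tokens, prompt=prompt_wav)
```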
+## Non-parallel Speech Reconstruction Pipeline
+As shown in the figure below, the tokenizer and detokenizer can also form a pipeline for the speech reconstruction task.
+<p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
+
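Reconstruction is the same chain run from audio instead of text; because only discrete tokens cross the boundary, the output is a re-synthesis of the input rather than a sample-level copy. Again, a minimal sketch with placeholder names:

```python
# Hypothetical sketch of non-parallel reconstruction; `tokenizer` and
# `detokenizer` are placeholders for DSTK's modules, not the real API.

def reconstruct(wav, sample_rate: int, tokenizer, detokenizer):
    tokens = tokenizer.encode(wav, sample_rate)  # speech -> 25 Hz discrete tokens
    return detokenizer.synthesize(tokens)        # tokens -> re-synthesized speech
```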
+These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset:
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
 
 We also evaluated the ASR performance of our semantic tokenizer using an LLM as the backbone. Our model achieves performance comparable to models that use continuous speech representations.
 <p align="center"><img src="figs/eval3.jpg" width="1200"></p>
 
-Our speech detokenizer is developed based on [F5-TTS](https://github.com/SWivid/F5-TTS) with two major updates added.
-1. we adopt DiT block with cross attention, which is similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
-<p align="center"><img src="figs/CADiT.jpg" height="600"></p>
-<p align="center"><img src="figs/F5-streaming.jpg" width="1200"></p>
-<p align="center"><img src="figs/TTS.jpg" width="1200"></p>
+## More details about the 3 models:
+- [Semantic Tokenizer](semantic_tokenizer/f40ms/README.md)
+- [Semantic Detokenizer](semantic_detokenizer/README.md)
+- [Text2Token](text2token/README.md)
 
+## Installation
+
+### Create a separate environment if needed
+
+```bash
+# Create a conda env with python_version>=3.10 (you could also use virtualenv)
+conda create -n dstk python=3.10
+conda activate dstk
+```
 
+## More tools to be released:
+- 12.5Hz Streaming Semantic Tokenizer and Detokenizer
+- Speech Normalized Tokenizer
+- Speech Disentangled Tokenizer
 
 # Core Developers:
 [Daxin Tan]([email protected]), [Dehua Tao]([email protected]), [Yusen Sun]([email protected]) and [Xiao Chen]([email protected])
 
 ## Contributors:
-[Hanlin Zhang]([email protected])
+[Hanlin Zhang]([email protected])
+
+## Former Contributors:
+Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
README_CN.md
CHANGED
@@ -1,48 +1,58 @@
+> We choose to go to the moon, not because they are easy, but because they are hard.
+# Discrete Speech Tokenization Toolkit [[English](README.md)|[Chinese](README_CN.md)]
+The Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit designed to provide a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text content into speech tokens. DSTK offers efficient, flexible, and modular foundational components for tasks such as speech understanding, speech synthesis, and multimodal learning.
+
 ## Release Notes:
 V1.0
 
-1.
+This release of DSTK includes three modules:
+1. Semantic Tokenizer
+   - Encodes the semantic information of speech into discrete speech tokens
+   - Frame rate: 25 Hz; codebook size: 4096; supports Chinese and English
+2. Semantic Detokenizer
+   - Decodes discrete speech tokens into audible speech waveforms to reconstruct the speech
+   - Supports Chinese and English
+3. Text2Token
+   - Converts text into speech tokens
+
+## TTS pipeline
+The three models above are chained to implement TTS.
+<p align="center"><img src="figs/TTS.jpg" width="1200"></p>
+
+## Non-parallel Speech Reconstruction Pipeline
+The tokenizer and detokenizer are chained to implement speech reconstruction.
+<p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
 
+These pipelines reach top-tier performance on the TTS and speech reconstruction tasks of the seed-tts-eval dataset:
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
 
 We evaluated the ASR accuracy of this speech tokenizer with an LLM backbone; our tokenizer reaches a level comparable to models that use continuous speech representations.
 <p align="center"><img src="figs/eval3.jpg" width="1200"></p>
 
-## Speech Semantic Tokenizer
-This tokenizer is trained with a supervised learning method. Training used roughly 4,000 hours of Chinese and English speech-text pair data sampled from open-source corpora, with a 1:1 language ratio.
-<p align="center"><img src="figs/tokenizer.jpg" width="800"></p>
-
-This detokenizer is developed based on [F5-TTS](https://github.com/SWivid/F5-TTS), with two improvements:
-1. It adopts DiT with cross attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
-<p align="center"><img src="figs/CADiT.jpg" height="600"></p>
-<p align="center"><img src="figs/F5-streaming.jpg" width="1200"></p>
-
-## Text2Token(T2U)
-Text2Token is a Transformer machine translation model trained with [fairseq](https://github.com/facebookresearch/fairseq) on roughly 380k hours of speech-text pair data.
-
-## TTS pipeline
-tts_example.py gives an example that chains the three models above to implement TTS.
-<p align="center"><img src="figs/TTS.jpg" width="1200"></p>
-
-## Non-parallel Speech Reconstruction Pipeline
-reconstruction_example.py gives another example that chains the tokenizer and detokenizer to implement speech reconstruction.
-<p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
+## More details about the three modules:
+- [Semantic Tokenizer](semantic_tokenizer/f40ms/README.md)
+- [Semantic Detokenizer](semantic_detokenizer/README.md)
+- [Text2Token](text2token/README.md)
 
+## Installation
+
+### Create a separate environment if needed
+
+```bash
+# Create a conda env with python_version>=3.10 (you could also use virtualenv)
+conda create -n dstk python=3.10
+conda activate dstk
+```
 
 # Core Developers:
 [Daxin Tan]([email protected]), [Dehua Tao]([email protected]), [Yusen Sun]([email protected]) and [Xiao Chen]([email protected])
 
 ## Contributors:
 [Hanlin Zhang]([email protected])
+
+## Former Contributors:
+Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang
figs/eval3.jpg
CHANGED
(Binary image updated; tracked with Git LFS.)
semantic_detokenizer/README.md
ADDED
@@ -0,0 +1,10 @@
+## Speech Detokenizer
+#### Our detokenizer is developed based on the [F5-TTS](https://github.com/SWivid/F5-TTS) framework and features two specific improvements.
+
+1. The DiT module has been replaced by a DiT variant with cross-attention, similar to the detokenizer of [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice).
+<p align="center"><img src="../figs/CADiT.jpg" height="600"></p>
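The pattern is a standard DiT block augmented with a cross-attention sub-layer that attends to the semantic-token condition. A minimal PyTorch sketch of that pattern follows; layer sizes, norm placement, and conditioning details are assumptions, not this repo's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """DiT-style block: adaLN-modulated self-attention and MLP, plus
    cross-attention from acoustic frames to semantic-token embeddings."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # scale/shift/gate for the attention and MLP paths, from the step embedding
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, tokens, t_emb):
        # x: (B, T, D) noisy acoustic frames; tokens: (B, S, D) token condition;
        # t_emb: (B, D) diffusion/flow step embedding
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, tokens, tokens, need_weights=False)[0]
        h = self.norm3(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)
```

With, say, `dim=512` and `n_heads=8`, a stack of such blocks maps `(B, T, 512)` acoustic frames conditioned on `(B, S, 512)` token embeddings and a `(B, 512)` step embedding.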
+
+2. A chunk-based streaming inference algorithm has been developed, which allows the model to generate speech of any length.
+<p align="center"><img src="../figs/F5-streaming.jpg" width="1200"></p>
+
+#### The detokenizer released this time was trained on approximately 6,000 hours of Chinese and English data. This dataset includes WenetSpeech4TTS (both premium and standard), LibriTTS, and others.
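Chunked decoding keeps memory bounded regardless of utterance length. The sketch below shows the general shape of such a loop; `detokenize_chunk` is a hypothetical callable standing in for one generation call, and the repo's actual chunk sizes and overlap scheme (see figs/F5-streaming.jpg) may differ.

```python
# Minimal sketch of chunk-based streaming inference; `detokenize_chunk` and its
# signature are assumptions, not this repo's actual API.

def stream_detokenize(tokens, detokenize_chunk, chunk_size=50, overlap=10):
    """Decode an arbitrarily long token sequence chunk by chunk.

    Each chunk is rendered with a small token overlap and conditioned on the
    previously generated audio, so timbre and prosody stay continuous across
    chunk boundaries while memory use stays bounded.
    """
    prev_audio = None
    for start in range(0, len(tokens), chunk_size):
        ctx = tokens[max(0, start - overlap):start + chunk_size]
        audio = detokenize_chunk(ctx, prompt=prev_audio)  # assumed signature
        prev_audio = audio  # the next chunk continues from this output
        yield audio
```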
semantic_tokenizer/f40ms/README.md
CHANGED
@@ -1,7 +1,10 @@
+## Speech Semantic Tokenizer
+As illustrated below, this tokenizer is trained using a supervised learning method: the phoneme sequences corresponding to the text are used as labels, with the grapheme-to-phoneme (G2P) conversion module located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of Chinese and English speech-text data sampled from open-source datasets, with a 1:1 ratio between the two languages.
+<p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
 
-# 25Hz(80ms) speech semantic tokenizer based on fairseq
 
+To run this semantic tokenizer alone, install the required packages:
+```bash
+# install requirements for this semantic tokenizer
 pip install -r requirements_npu.txt
+```
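To make the supervision concrete, the sketch below assembles one training example: the transcript is converted to phonemes by G2P, and those phonemes serve as the label sequence for the 25 Hz encoder outputs. The `g2p` callable stands in for whatever interface `thirdparty/G2P` exposes, and the alignment loss between frames and phoneme labels is not specified in this README.

```python
# Hedged sketch of preparing one supervised example; `g2p` is a stand-in for
# the thirdparty/G2P interface, and the training loss is not specified here.

def make_example(wav_path: str, transcript: str, g2p) -> dict:
    phonemes = g2p(transcript)  # e.g. "hello" -> ["HH", "AH0", "L", "OW1"]
    return {"audio": wav_path, "labels": phonemes}

# dummy g2p (whitespace split) for illustration only
example = make_example("004892.wav", "HH AH0 L OW1", g2p=str.split)
```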
text2token/README.md
ADDED
@@ -0,0 +1,2 @@
+## Text2Token (T2U)
+The Text2Token module is a Transformer-based translation model. It takes phonemes as input, which can be converted from text using the G2P module. The Text2Token model released this time was trained on approximately 380k hours of speech-text paired data with [fairseq](https://github.com/facebookresearch/fairseq).
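Since the model is trained with fairseq, one plausible way to run it is through fairseq's hub interface; the directory and file names below are placeholders, and DSTK may wrap the model differently.

```python
# Plausible way to run a fairseq-trained translation model as T2U; paths and
# file names are placeholders, not this repo's actual layout.
from fairseq.models.transformer import TransformerModel

t2u = TransformerModel.from_pretrained(
    "text2token",                          # placeholder: model directory
    checkpoint_file="checkpoint_best.pt",  # placeholder: trained T2U checkpoint
    data_name_or_path="data-bin",          # placeholder: fairseq binarized data
)

phonemes = "HH AH0 L OW1"                # phoneme sequence from the G2P module
speech_tokens = t2u.translate(phonemes)  # space-separated speech-token ids as text
ids = [int(t) for t in speech_tokens.split()]
```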