We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard ...

Discrete Speech Tokenization Toolkit [English|Chinese]

The Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit designed to provide a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text content into speech tokens. DSTK offers efficient, flexible, and modular foundational components for tasks such as speech understanding, speech synthesis, and multimodal learning.

Release Notes:

V1.0

This release of DSTK includes three modules:

  1. Semantic Tokenizer
    • Encodes the semantic information of speech into discrete speech tokens.
    • Frame rate: 25 Hz; codebook size: 4096.
    • Supports both Chinese and English.
  2. Semantic Detokenizer
    • Decodes discrete speech tokens into audible speech waveforms to reconstruct the speech.
    • Supports both Chinese and English.
  3. Text2Token (T2U)
    • Converts text content into speech tokens.

TTS Pipeline

As shown in the figure below, the three modules can form a pipeline for the TTS task.
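For orientation, here is a minimal sketch of that composition built directly from the module-level APIs shown in the Usage section below. It is not the official pipeline implementation (the bundled tts_example.TTSPipeline wraps these steps); the reference-audio placeholders and the assumption about the layout of the T2U output are ours.

import sys
import soundfile as sf

sys.path.append("/path/to/DSTK")

from text2token.simple_infer import Text2TokenGenerator
from semantic_detokenizer.chunk_infer import SpeechDetokenizer

t2u = Text2TokenGenerator()
detoker = SpeechDetokenizer(vocoder_path="/path/to/vocos-mel-24khz")

# Stage 1: text -> phonemes -> discrete speech tokens
phones = t2u.text2phone("your input text")
tokens_info = t2u.generate_for_long_input_text([phones], max_segment_len=30)
# assumption: tokens_info[0] holds one token list per segment; flatten them
speech_tokens = [tok for seg in tokens_info[0] for tok in seg]

# Stage 2: speech tokens -> waveform, conditioned on a reference voice
ref_wav_path = "/path/to/reference.wav"       # placeholder reference audio
ref_tokens = "<reference tokens>".split()     # placeholder: tokens of the reference audio,
                                              # produced by the semantic tokenizer
generated_wave, target_sample_rate = detoker.chunk_generate(
    ref_wav_path, ref_tokens, speech_tokens, 75, 0.3, 10, 4.5, False
)
sf.write("./tts_sketch.wav", generated_wave, target_sample_rate)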

Non-parallel Speech Reconstruction Pipeline

As shown in the figure below, the tokenizer and detokenizer can also form a pipeline for the speech reconstruction task.
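Likewise, a minimal sketch of that composition from the module-level APIs (the bundled reconstuction_example.ReconstructionPipeline wraps these steps; treating unit_sequence as a space-separated token string, and feeding the full rather than the reduced sequence to the detokenizer, are our assumptions based on the examples below):

import sys
import librosa
import soundfile as sf

sys.path.append("/path/to/DSTK")

from semantic_tokenizer.f40ms.simple_tokenizer_infer import SpeechTokenizer
from semantic_detokenizer.chunk_infer import SpeechDetokenizer

tokenizer = SpeechTokenizer()
detoker = SpeechDetokenizer(vocoder_path="/path/to/vocos-mel-24khz")

def tokenize(wav_path):
    # waveform -> discrete speech tokens
    wav, _ = librosa.load(wav_path, sr=16000)
    _, token_info_list = tokenizer.extract([wav])
    return token_info_list[0]["unit_sequence"].split()

ref_wav_path = "/path/to/reference.wav"   # placeholder
input_wav_path = "/path/to/input.wav"     # placeholder

ref_tokens = tokenize(ref_wav_path)
input_tokens = tokenize(input_wav_path)

# tokens -> waveform, conditioned on the reference audio
generated_wave, target_sample_rate = detoker.chunk_generate(
    ref_wav_path, ref_tokens, input_tokens, 75, 0.3, 10, 4.5, False
)
sf.write("./recon_sketch.wav", generated_wave, target_sample_rate)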

These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset, with fewer parameters and much less supervised training data:

All our experiments were conducted on the Ascend 910B, and the experimental results may differ slightly from those obtained on GPUs.

We also evaluated the ASR performance of our semantic tokenizer using an LLM as the backbone. Our model achieves performance comparable to models that use continuous speech representations.

More details about the three modules:

Installation

Hardware: Ascend 910B with CANN 8.1 RC1 or GPU

Create a separate environment if needed

# Create a conda env with python_version>=3.10  (you could also use virtualenv)
conda create -n dstk python=3.10
conda activate dstk

# run install_requirements.sh to set up the environment for DSTK inference on Ascend 910B
# for GPUs, just remove torch-npu==2.5.1 from requirements_npu.txt
sh install_requirements.sh

# patch for G2P
# modify the first line of thirdparty/G2P/patch_for_deps.sh to point at your own site-packages:
# SITE_PATH=/path/to/your/own/site-packages
# then run thirdparty/G2P/patch_for_deps.sh to fix issues in LangSegment 0.2.0, pypinyin and tn
sh thirdparty/G2P/patch_for_deps.sh

Run on Ascend 910B platforms

# the env variable TOKENIZE_ON_NPU needs to be defined
export TOKENIZE_ON_NPU=1
# this env variable is not needed for GPUs; simply leave it undefined.
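If you prefer to configure this from Python rather than the shell, the same thing can be done before importing any DSTK module (a small sketch):

import os

# when running on Ascend 910B, define TOKENIZE_ON_NPU before importing any DSTK module;
# on GPUs, leave it unset
os.environ["TOKENIZE_ON_NPU"] = "1"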

Download the vocos vocoder from vocos-mel-24khz
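One convenient way to fetch it is via huggingface_hub (a sketch; the repo id charactr/vocos-mel-24khz is our assumption of where the checkpoint is hosted, and any local copy of the vocoder works as well):

from huggingface_hub import snapshot_download

# download the Vocos mel-24khz vocoder to a local directory and use that path as vocoder_path
vocoder_path = snapshot_download(
    repo_id="charactr/vocos-mel-24khz",
    local_dir="/path/to/vocos-mel-24khz",
)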

Usage:

Pipelines

import sys
import soundfile as sf

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from reconstuction_example import ReconstructionPipeline
from tts_example import TTSPipeline

ref_wav_path = dstk_path + "/00004557-00000030.wav"
input_wav_path = dstk_path + "/004892.wav"
vocoder_path = "/path/to/vocos-mel-24khz"

reconstructor = ReconstructionPipeline(
    detok_vocoder=vocoder_path,
)

tts = TTSPipeline(
    detok_vocoder=vocoder_path,
    max_seg_len=30,
)

# for non-parallel speech reconstruction
generated_wave, target_sample_rate = reconstructor.reconstruct(
    ref_wav_path, input_wav_path
)

with open("./recon.wav", "wb") as f:
    sf.write(f.name, generated_wave, target_sample_rate)
    print(f"write output to: {f.name}")

# for tts
ref_wav_path = input_wav_path
generated_wave, target_sample_rate = tts.synthesize(
    ref_wav_path,
    "荷花未全谢,又到中秋节。家家户户把月饼切,庆中秋。美酒多欢乐,整杯盘,猜拳行令,同赏月。",
)
with open("./tts.wav", "wb") as f:
    sf.write(f.name, generated_wave, target_sample_rate)
    print(f"write output to: {f.name}")

print("Finished")

Tokenization

import sys
import librosa

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

input_wav_path = dstk_path + "/004892.wav"

from semantic_tokenizer.f40ms.simple_tokenizer_infer import SpeechTokenizer

tokenizer = SpeechTokenizer()

raw_wav, sr = librosa.load(input_wav_path, sr=16000)
token_list, token_info_list = tokenizer.extract([raw_wav])  # pass the raw waveform(s) as input
for token_info in token_info_list:
    print(token_info["unit_sequence"] + "\n")
    print(token_info["reduced_unit_sequence"] + "\n")

Text2Token

import sys

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from text2token.simple_infer import Text2TokenGenerator

input_text = "从离散语音token重建语音波形"
MAX_SEG_LEN = 30

t2u = Text2TokenGenerator()

phones = t2u.text2phone(input_text.strip())
print("phonemes of input text: %s are [%s]" % (input_text, phones))

speech_tokens_info = t2u.generate_for_long_input_text(
    [phones], max_segment_len=MAX_SEG_LEN
)

for info in speech_tokens_info[0]:
    print(" ".join(info) + "\n")

Detokenization

import sys
import soundfile as sf

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from semantic_detokenizer.chunk_infer import SpeechDetokenizer

# reconstruct the speech waveform from these discrete speech tokens
input_tokens = "3953 3890 3489 456 2693 3239 3692 3810 3874 3882 2749 548 3202 4012 3490 3939 3988 411 722 826 2812 3883 3874 3810 3983 4086 3946 3747 3469 2537 3689 3434 1816 1242 2415 3942 3363 3865 2841 1700 1652 3241 3362 3363 3874 3882 2792 933 2253 2799 3692 3746 3882 2809 1001 2449 1016 3762 3882 3874 3810 3809 3983 4086 4018 3747 3461 2537 3624 3882 3382 581 1837 2413 3435 4005 2003 2890 3884 3690 3746 3938 3874 3873 3856"
vocoder_path = "/path/to/vocos-mel-24khz"
ref_wav_path = dstk_path + "/004892.wav"
# output of tokenizer given ref_wav as input
ref_tokens = "3936 3872 3809 3873 3817 3639 2591 539 1021 3641 3890 4069 2002 3537 2303 3773 3827 3875 3969 4072 2425 97 2537 3633 3690 3865 3920 3069 3582 3883 3818 3997 4031 4029 3946 3874 3733 3727 3214 506 3892 3787 3457 3552 3490 4014 991 1991 3885 3947 4069 1488 1016 3258 3710 52 2362 3961 2680 1569 1851 3897 3825 3752 3808 3800 3873 3808 3792"

token_chunk_len = 75
chunk_cond_proportion = 0.3
chunk_look_ahead = 10
max_ref_duration = 4.5
ref_audio_cut_from_head = False

detoker = SpeechDetokenizer(
    vocoder_path=vocoder_path,
)

generated_wave, target_sample_rate = detoker.chunk_generate(
    ref_wav_path,
    ref_tokens.split(),
    input_tokens.split(),
    token_chunk_len,
    chunk_cond_proportion,
    chunk_look_ahead,
    max_ref_duration,
    ref_audio_cut_from_head,
)

with open("./detok.wav", "wb") as f:
    sf.write(f.name, generated_wave, target_sample_rate)
    print(f"write output to: {f.name}")

More tools to be released:

  • 12.5Hz Streaming Semantic Tokenizer and Detokenizer
  • Speech Normalized Tokenizer
  • Speech Disentangled Tokenizer

Core Developers:

Discrete Speech Team, HKRC, Huawei

Daxin Tan, Dehua Tao, Yusen Sun and Xiao Chen

Contributors:

Hanlin Zhang

Former Contributors:

Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang

Acknowledgement

We express our sincere gratitude to HKRC for their support of this project.

Special thanks to the Textless NLP Project, which has inspired us to embark on this research direction.
