mrfakename committed
Commit 4dab15f · verified · 1 Parent(s): 48f948a

Sync from GitHub repo

This Space is synced from the GitHub repo: https://github.com/SWivid/F5-TTS. Please submit contributions to the Space there.

Files changed (46)
  1. Dockerfile +1 -2
  2. README_REPO.md +53 -171
  3. app.py +197 -10
  4. data/Emilia_ZH_EN_pinyin/vocab.txt +2545 -2545
  5. data/librispeech_pc_test_clean_cross_sentence.lst +0 -0
  6. pyproject.toml +59 -0
  7. src/f5_tts/api.py +138 -0
  8. src/f5_tts/eval/README.md +49 -0
  9. src/f5_tts/eval/ecapa_tdnn.py +330 -0
  10. src/f5_tts/eval/eval_infer_batch.py +197 -0
  11. src/f5_tts/eval/eval_infer_batch.sh +13 -0
  12. src/f5_tts/eval/eval_librispeech_test_clean.py +73 -0
  13. src/f5_tts/eval/eval_seedtts_testset.py +75 -0
  14. src/f5_tts/eval/utils_eval.py +397 -0
  15. src/f5_tts/infer/README.md +111 -0
  16. src/f5_tts/infer/examples/basic/basic.toml +10 -0
  17. src/f5_tts/infer/examples/basic/basic_ref_en.wav +0 -0
  18. src/f5_tts/infer/examples/basic/basic_ref_zh.wav +0 -0
  19. src/f5_tts/infer/examples/multi/country.flac +0 -0
  20. src/f5_tts/infer/examples/multi/main.flac +0 -0
  21. src/f5_tts/infer/examples/multi/story.toml +19 -0
  22. src/f5_tts/infer/examples/multi/story.txt +1 -0
  23. src/f5_tts/infer/examples/multi/town.flac +0 -0
  24. src/f5_tts/infer/examples/vocab.txt +2545 -0
  25. src/f5_tts/infer/infer_cli.py +193 -0
  26. src/f5_tts/infer/speech_edit.py +191 -0
  27. src/f5_tts/infer/utils_infer.py +417 -0
  28. src/f5_tts/model/__init__.py +10 -0
  29. src/f5_tts/model/backbones/README.md +20 -0
  30. src/f5_tts/model/backbones/dit.py +163 -0
  31. src/f5_tts/model/backbones/mmdit.py +146 -0
  32. src/f5_tts/model/backbones/unett.py +219 -0
  33. src/f5_tts/model/cfm.py +287 -0
  34. src/f5_tts/model/dataset.py +296 -0
  35. src/f5_tts/model/modules.py +581 -0
  36. src/f5_tts/model/trainer.py +300 -0
  37. src/f5_tts/model/utils.py +185 -0
  38. src/f5_tts/scripts/count_max_epoch.py +33 -0
  39. src/f5_tts/scripts/count_params_gflops.py +39 -0
  40. src/f5_tts/train/README.md +69 -0
  41. src/f5_tts/train/datasets/prepare_csv_wavs.py +140 -0
  42. src/f5_tts/train/datasets/prepare_emilia.py +230 -0
  43. src/f5_tts/train/datasets/prepare_wenetspeech4tts.py +125 -0
  44. src/f5_tts/train/finetune_cli.py +145 -0
  45. src/f5_tts/train/finetune_gradio.py +1223 -0
  46. src/f5_tts/train/train.py +96 -0
Dockerfile CHANGED
@@ -17,8 +17,7 @@ WORKDIR /workspace

  RUN git clone https://github.com/SWivid/F5-TTS.git \
  && cd F5-TTS \
- && pip install --no-cache-dir -r requirements.txt \
- && pip install --no-cache-dir -r requirements_eval.txt
+ && pip install -e .[eval]

  ENV SHELL=/bin/bash
 
README_REPO.md CHANGED
@@ -16,230 +16,112 @@

  ### Thanks to all the contributors !

- ## Installation
-
- Clone the repository:
-
- ```bash
- git clone https://github.com/SWivid/F5-TTS.git
- cd F5-TTS
- ```
-
- Install torch with your CUDA version, e.g. :
-
- ```bash
- pip install torch==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
- pip install torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
- ```
+ ## News
+ - **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN).

- Install other packages:
+ ## Installation

  ```bash
- pip install -r requirements.txt
- ```
+ # Create a python 3.10 conda env (you could also use virtualenv)
+ conda create -n f5-tts python=3.10
+ conda activate f5-tts

- **[Optional]**: We provide [Dockerfile](https://github.com/SWivid/F5-TTS/blob/main/Dockerfile) and you can use the following command to build it.
- ```bash
- docker build -t f5tts:v1 .
+ # Install pytorch with your CUDA version, e.g.
+ pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
  ```

- ### Development
+ Then you can choose from a few options below:

- When making a pull request, please use pre-commit to ensure code quality:
+ ### 1. As a pip package (if just for inference)

  ```bash
- pip install pre-commit
- pre-commit install
+ pip install git+https://github.com/SWivid/F5-TTS.git
  ```

- This will run linters and formatters automatically before each commit.
-
- Manually run using:
+ ### 2. Local editable (if also do training, finetuning)

  ```bash
- pre-commit run --all-files
- ```
-
- Note: Some model components have linting exceptions for E722 to accommodate tensor notation
-
-
- ## Prepare Dataset
-
- Example data processing scripts for Emilia and Wenetspeech4TTS, and you may tailor your own one along with a Dataset class in `model/dataset.py`.
-
- ```bash
- # prepare custom dataset up to your need
- # download corresponding dataset first, and fill in the path in scripts
-
- # Prepare the Emilia dataset
- python scripts/prepare_emilia.py
-
- # Prepare the Wenetspeech4TTS dataset
- python scripts/prepare_wenetspeech4tts.py
+ git clone https://github.com/SWivid/F5-TTS.git
+ cd F5-TTS
+ pip install -e .
  ```

- ## Training & Finetuning
-
- Once your datasets are prepared, you can start the training process.
-
+ ### 3. Build from dockerfile
  ```bash
- # setup accelerate config, e.g. use multi-gpu ddp, fp16
- # will be to: ~/.cache/huggingface/accelerate/default_config.yaml
- accelerate config
- accelerate launch train.py
- ```
- An initial guidance on Finetuning [#57](https://github.com/SWivid/F5-TTS/discussions/57).
-
- Gradio UI finetuning with `finetune_gradio.py` see [#143](https://github.com/SWivid/F5-TTS/discussions/143).
-
- ### Wandb Logging
-
- By default, the training script does NOT use logging (assuming you didn't manually log in using `wandb login`).
-
- To turn on wandb logging, you can either:
-
- 1. Manually login with `wandb login`: Learn more [here](https://docs.wandb.ai/ref/cli/wandb-login)
- 2. Automatically login programmatically by setting an environment variable: Get an API KEY at https://wandb.ai/site/ and set the environment variable as follows:
-
- On Mac & Linux:
-
- ```
- export WANDB_API_KEY=<YOUR WANDB API KEY>
- ```
-
- On Windows:
-
- ```
- set WANDB_API_KEY=<YOUR WANDB API KEY>
+ docker build -t f5tts:v1 .
  ```
- Moreover, if you couldn't access Wandb and want to log metrics offline, you can the environment variable as follows:

- ```
- export WANDB_MODE=offline
- ```

  ## Inference

- The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or automatically downloaded with `inference-cli` and `gradio_app`.
+ ### 1. Gradio App

- Currently support 30s for a single generation, which is the **TOTAL** length of prompt audio and the generated. Batch inference with chunks is supported by `inference-cli` and `gradio_app`.
- - To avoid possible inference failures, make sure you have seen through the following instructions.
- - A longer prompt audio allows shorter generated output. The part longer than 30s cannot be generated properly. Consider using a prompt audio <15s.
- - Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
- - Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses. If first few words skipped in code-switched generation (cuz different speed with different languages), this might help.
-
- ### CLI Inference
-
- Either you can specify everything in `inference-cli.toml` or override with flags. Leave `--ref_text ""` will have ASR model transcribe the reference audio automatically (use extra GPU memory). If encounter network error, consider use local ckpt, just set `ckpt_file` in `inference-cli.py`
-
- for change model use `--ckpt_file` to specify the model you want to load,
- for change vocab.txt use `--vocab_file` to provide your vocab.txt file.
-
- ```bash
- python inference-cli.py \
- --model "F5-TTS" \
- --ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
- --ref_text "Some call me nature, others call me mother nature." \
- --gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
-
- python inference-cli.py \
- --model "E2-TTS" \
- --ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
- --ref_text "对,这就是我,万人敬仰的太乙真人。" \
- --gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
-
- # Multi voice
- python inference-cli.py -c samples/story.toml
- ```
-
- ### Gradio App
  Currently supported features:
- - Chunk inference
- - Podcast Generation
- - Multiple Speech-Type Generation

- You can launch a Gradio app (web interface) to launch a GUI for inference (will load ckpt from Huggingface, you may also use local file in `gradio_app.py`). Currently load ASR model, F5-TTS and E2 TTS all in once, thus use more GPU memory than `inference-cli`.
+ - Basic TTS with Chunk Inference
+ - Multi-Style / Multi-Speaker Generation
+ - Voice Chat powered by Qwen2.5-3B-Instruct

  ```bash
- python gradio_app.py
- ```
+ # Launch a Gradio app (web interface)
+ f5-tts_infer-gradio

- You can specify the port/host:
+ # Specify the port/host
+ f5-tts_infer-gradio --port 7860 --host 0.0.0.0

- ```bash
- python gradio_app.py --port 7860 --host 0.0.0.0
+ # Launch a share link
+ f5-tts_infer-gradio --share
  ```

- Or launch a share link:
+ ### 2. CLI Inference

  ```bash
- python gradio_app.py --share
- ```
-
- ### Speech Editing
+ # Run with flags
+ # Leave --ref_text "" will have ASR model transcribe (extra GPU memory usage)
+ f5-tts_infer-cli \
+ --model "F5-TTS" \
+ --ref_audio "ref_audio.wav" \
+ --ref_text "The content, subtitle or transcription of reference audio." \
+ --gen_text "Some text you want TTS model generate for you."

- To test speech editing capabilities, use the following command.
+ # Run with default setting. src/f5_tts/infer/examples/basic/basic.toml
+ f5-tts_infer-cli
+ # Or with your own .toml file
+ f5-tts_infer-cli -c custom.toml

- ```bash
- python speech_edit.py
+ # Multi voice. See src/f5_tts/infer/README.md
+ f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
  ```

- ## Evaluation
+ ### 3. More instructions

- ### Prepare Test Datasets
+ - In order to have better generation results, take a moment to read [detailed guidance](src/f5_tts/infer).
+ - The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) are very useful, please try to find the solution by properly searching the keywords of problem encountered. If no answer found, then feel free to open an issue.

- 1. Seed-TTS test set: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
- 2. LibriSpeech test-clean: Download from [OpenSLR](http://www.openslr.org/12/).
- 3. Unzip the downloaded datasets and place them in the data/ directory.
- 4. Update the path for the test-clean data in `scripts/eval_infer_batch.py`
- 5. Our filtered LibriSpeech-PC 4-10s subset is already under data/ in this repo

- ### Batch Inference for Test Set
+ ## [Training](src/f5_tts/train)

- To run batch inference for evaluations, execute the following commands:

- ```bash
- # batch inference for evaluations
- accelerate config # if not set before
- bash scripts/eval_infer_batch.sh
- ```
-
- ### Download Evaluation Model Checkpoints
+ ## [Evaluation](src/f5_tts/eval)

- 1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
- 2. English ASR Model: [Faster-Whisper](https://huggingface.co/Systran/faster-whisper-large-v3)
- 3. WavLM Model: Download from [Google Drive](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).

- ### Objective Evaluation
+ ## Development

- Install packages for evaluation:
+ Use pre-commit to ensure code quality (will run linters and formatters automatically)

  ```bash
- pip install -r requirements_eval.txt
- ```
-
- **Some Notes**
-
- For faster-whisper with CUDA 11:
-
- ```bash
- pip install --force-reinstall ctranslate2==3.24.0
+ pip install pre-commit
+ pre-commit install
  ```

- (Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:
+ When making a pull request, before each commit, run:

  ```bash
- pip install faster-whisper==0.10.1
+ pre-commit run --all-files
  ```

- Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:
- ```bash
- # Evaluation for Seed-TTS test set
- python scripts/eval_seedtts_testset.py
+ Note: Some model components have linting exceptions for E722 to accommodate tensor notation

- # Evaluation for LibriSpeech-PC test-clean (cross-sentence)
- python scripts/eval_librispeech_test_clean.py
- ```

  ## Acknowledgements
 
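The flag-based `f5-tts_infer-cli` call added to the README can also be assembled from Python. The following is a minimal sketch, assuming the console script is on `PATH` after installation. The helper name `build_infer_command` is hypothetical, and only the flags that appear in the diff (`--model`, `--ref_audio`, `--ref_text`, `--gen_text`) are used.

```python
import shlex


def build_infer_command(ref_audio, gen_text, ref_text="", model="F5-TTS"):
    """Build the argv list for the f5-tts_infer-cli call shown in the README diff.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        # Per the README comment, an empty reference text lets the ASR model transcribe the audio.
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


cmd = build_infer_command("ref_audio.wav", "Some text to synthesize.")
print(shlex.join(cmd))
# To actually run it (requires the package to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the generated text contains quotes or non-ASCII characters.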
app.py CHANGED
@@ -11,6 +11,7 @@ import soundfile as sf
11
  import torchaudio
12
  from cached_path import cached_path
13
  from pydub import AudioSegment
 
14
 
15
  try:
16
  import spaces
@@ -27,16 +28,14 @@ def gpu_decorator(func):
27
  return func
28
 
29
 
30
- from model import DiT, UNetT
31
- from model.utils import (
32
- save_spectrogram,
33
- )
34
- from model.utils_infer import (
35
  load_vocoder,
36
  load_model,
37
  preprocess_ref_audio_text,
38
  infer_process,
39
  remove_silence_for_generated_wav,
 
40
  )
41
 
42
  vocos = load_vocoder()
@@ -53,6 +52,31 @@ E2TTS_ema_model = load_model(
53
  UNetT, E2TTS_model_cfg, str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.safetensors"))
54
  )
55

56
 
57
  @gpu_decorator
58
  def infer(ref_audio_orig, ref_text, gen_text, model, remove_silence, cross_fade_duration=0.15, speed=1):
@@ -147,8 +171,8 @@ with gr.Blocks() as app_credits:
147
  # Credits
148
 
149
  * [mrfakename](https://github.com/fakerybakery) for the original [online demo](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)
150
- * [RootingInLoad](https://github.com/RootingInLoad) for the podcast generation
151
- * [jpgallegoar](https://github.com/jpgallegoar) for multiple speech-type generation
152
  """)
153
  with gr.Blocks() as app_tts:
154
  gr.Markdown("# Batched TTS")
@@ -250,7 +274,7 @@ with gr.Blocks() as app_podcast:
250
 
251
 
252
  def parse_speechtypes_text(gen_text):
253
- # Pattern to find (Emotion)
254
  pattern = r"\{(.*?)\}"
255
 
256
  # Split the text by the pattern
@@ -324,7 +348,6 @@ with gr.Blocks() as app_emotional:
324
  # Keep track of current number of speech types
325
  speech_type_count = gr.State(value=0)
326
 
327
- # Function to add a speech type
328
  # Function to add a speech type
329
  def add_speech_type_fn(speech_type_count):
330
  if speech_type_count < max_speech_types - 1:
@@ -350,6 +373,7 @@ with gr.Blocks() as app_emotional:
350
  def delete_speech_type_fn(speech_type_count):
351
  # Prepare updates
352
  row_updates = []
 
353
  for i in range(max_speech_types - 1):
354
  if i == index:
355
  row_updates.append(gr.update(visible=False))
@@ -492,6 +516,166 @@ with gr.Blocks() as app_emotional:
492
  outputs=generate_emotional_btn,
493
  )
494

495
  with gr.Blocks() as app:
496
  gr.Markdown(
497
  """
@@ -509,7 +693,10 @@ If you're having issues, try converting your reference audio to WAV or MP3, clip
509
  **NOTE: Reference text will be automatically transcribed with Whisper if not provided. For best results, keep your reference clips short (<15s). Ensure the audio is fully uploaded before generating.**
510
  """
511
  )
512
- gr.TabbedInterface([app_tts, app_podcast, app_emotional, app_credits], ["TTS", "Podcast", "Multi-Style", "Credits"])
 
 
 
513
 
514
 
515
  @click.command()
 
11
  import torchaudio
12
  from cached_path import cached_path
13
  from pydub import AudioSegment
14
+ from transformers import AutoModelForCausalLM, AutoTokenizer
15
 
16
  try:
17
  import spaces
 
28
  return func
29
 
30
 
31
+ from f5_tts.model import DiT, UNetT
32
+ from f5_tts.infer.utils_infer import (
 
 
 
33
  load_vocoder,
34
  load_model,
35
  preprocess_ref_audio_text,
36
  infer_process,
37
  remove_silence_for_generated_wav,
38
+ save_spectrogram,
39
  )
40
 
41
  vocos = load_vocoder()
 
52
  UNetT, E2TTS_model_cfg, str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.safetensors"))
53
  )
54
 
55
+ chat_model_state = None
56
+ chat_tokenizer_state = None
57
+
58
+
59
+ def generate_response(messages, model, tokenizer):
60
+ """Generate response using Qwen"""
61
+ text = tokenizer.apply_chat_template(
62
+ messages,
63
+ tokenize=False,
64
+ add_generation_prompt=True,
65
+ )
66
+
67
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
68
+ generated_ids = model.generate(
69
+ **model_inputs,
70
+ max_new_tokens=512,
71
+ temperature=0.7,
72
+ top_p=0.95,
73
+ )
74
+
75
+ generated_ids = [
76
+ output_ids[len(input_ids) :] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
77
+ ]
78
+ return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
79
+
80
 
81
  @gpu_decorator
82
  def infer(ref_audio_orig, ref_text, gen_text, model, remove_silence, cross_fade_duration=0.15, speed=1):
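The `generate_response` function added in this hunk keeps only the newly generated tokens by slicing each output sequence at the length of its prompt. Below is a minimal sketch of that slicing step, with plain Python lists standing in for token-id tensors. The token values are made up for illustration and no model is loaded.

```python
def strip_prompt_tokens(input_ids_batch, output_ids_batch):
    """Drop the echoed prompt from each generated sequence.

    Mirrors the list comprehension in generate_response: the generated
    sequence starts with the prompt ids, so the first len(prompt) ids of
    each output are removed.
    """
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Made-up token ids: the prompt is [101, 7, 8]; the continuation is [42, 43, 102].
prompts = [[101, 7, 8]]
outputs = [[101, 7, 8, 42, 43, 102]]
new_tokens = strip_prompt_tokens(prompts, outputs)
print(new_tokens)  # [[42, 43, 102]]
```

Without this step, decoding the full output would repeat the chat template and the user's message in the spoken reply.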
 
171
  # Credits
172
 
173
  * [mrfakename](https://github.com/fakerybakery) for the original [online demo](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)
174
+ * [RootingInLoad](https://github.com/RootingInLoad) for initial chunk generation and podcast app exploration
175
+ * [jpgallegoar](https://github.com/jpgallegoar) for multiple speech-type generation & voice chat
176
  """)
177
  with gr.Blocks() as app_tts:
178
  gr.Markdown("# Batched TTS")
 
274
 
275
 
276
  def parse_speechtypes_text(gen_text):
277
+ # Pattern to find {speechtype}
278
  pattern = r"\{(.*?)\}"
279
 
280
  # Split the text by the pattern
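The `{speechtype}` pattern above can be used with `re.split` to cut the input into alternating text and marker pieces. This is a small sketch under stated assumptions: the function name `split_by_speechtype`, the default type name `"Regular"`, and the tuple output format are illustrative and not taken from `app.py`.

```python
import re

SPEECHTYPE_PATTERN = r"\{(.*?)\}"  # same pattern as in the diff


def split_by_speechtype(gen_text, default_type="Regular"):
    """Split text into (speech_type, text) pairs at {speechtype} markers.

    Illustrative only: the default type name and the tuple output are
    assumptions, not the behavior of parse_speechtypes_text.
    """
    # With one capture group, re.split alternates: text, marker, text, marker, ...
    tokens = re.split(SPEECHTYPE_PATTERN, gen_text)
    segments = []
    current_type = default_type
    for index, token in enumerate(tokens):
        token = token.strip()
        if index % 2 == 1:
            current_type = token  # odd positions hold the captured marker name
        elif token:
            segments.append((current_type, token))
    return segments


segments = split_by_speechtype("Hello there. {Sad} I lost my keys. {Happy} Found them!")
print(segments)
```

Text that appears before the first marker is assigned the default type, and empty pieces between adjacent markers are skipped.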
 
348
  # Keep track of current number of speech types
349
  speech_type_count = gr.State(value=0)
350
 
 
351
  # Function to add a speech type
352
  def add_speech_type_fn(speech_type_count):
353
  if speech_type_count < max_speech_types - 1:
 
373
  def delete_speech_type_fn(speech_type_count):
374
  # Prepare updates
375
  row_updates = []
376
+
377
  for i in range(max_speech_types - 1):
378
  if i == index:
379
  row_updates.append(gr.update(visible=False))
 
516
  outputs=generate_emotional_btn,
517
  )
518
 
519
+
520
+ with gr.Blocks() as app_chat:
521
+ gr.Markdown(
522
+ """
523
+ # Voice Chat
524
+ Have a conversation with an AI using your reference voice!
525
+ 1. Upload a reference audio clip and optionally its transcript.
526
+ 2. Load the chat model.
527
+ 3. Record your message through your microphone.
528
+ 4. The AI will respond using the reference voice.
529
+ """
530
+ )
531
+
532
+ load_chat_model_btn = gr.Button("Load Chat Model", variant="primary")
533
+
534
+ chat_interface_container = gr.Column(visible=False)
535
+
536
+ def load_chat_model():
537
+ global chat_model_state, chat_tokenizer_state
538
+ if chat_model_state is None:
539
+ show_info = gr.Info
540
+ show_info("Loading chat model...")
541
+ model_name = "Qwen/Qwen2.5-3B-Instruct"
542
+ chat_model_state = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
543
+ chat_tokenizer_state = AutoTokenizer.from_pretrained(model_name)
544
+ show_info("Chat model loaded.")
545
+
546
+ return gr.update(visible=False), gr.update(visible=True)
547
+
548
+ load_chat_model_btn.click(load_chat_model, outputs=[load_chat_model_btn, chat_interface_container])
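`load_chat_model` loads the chat model only on the first click and stores it in module-level globals, so later clicks reuse it. The load-once pattern can be sketched in isolation as follows. `fake_loader` is a hypothetical stand-in for the expensive `from_pretrained` call, and `get_chat_model` is an illustrative name, not a function from `app.py`.

```python
_chat_model = None  # module-level cache, playing the role of chat_model_state
load_calls = 0      # counts how often the (fake) expensive load runs


def fake_loader():
    """Hypothetical stand-in for the expensive model loading call."""
    global load_calls
    load_calls += 1
    return object()


def get_chat_model(loader=fake_loader):
    """Load the model on first use and reuse the cached instance afterwards."""
    global _chat_model
    if _chat_model is None:
        _chat_model = loader()
    return _chat_model


first = get_chat_model()
second = get_chat_model()
print(first is second, load_calls)  # True 1
```

Deferring the load this way keeps startup fast and avoids holding the chat model in memory for users who never open the voice chat tab.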
549
+
550
+ with chat_interface_container:
551
+ with gr.Row():
552
+ with gr.Column():
553
+ ref_audio_chat = gr.Audio(label="Reference Audio", type="filepath")
554
+ with gr.Column():
555
+ with gr.Accordion("Advanced Settings", open=False):
556
+ model_choice_chat = gr.Radio(
557
+ choices=["F5-TTS", "E2-TTS"],
558
+ label="TTS Model",
559
+ value="F5-TTS",
560
+ )
561
+ remove_silence_chat = gr.Checkbox(
562
+ label="Remove Silences",
563
+ value=True,
564
+ )
565
+ ref_text_chat = gr.Textbox(
566
+ label="Reference Text",
567
+ info="Optional: Leave blank to auto-transcribe",
568
+ lines=2,
569
+ )
570
+ system_prompt_chat = gr.Textbox(
571
+ label="System Prompt",
572
+ value="You are not an AI assistant, you are whoever the user says you are. You must stay in character. Keep your responses concise since they will be spoken out loud.",
573
+ lines=2,
574
+ )
575
+
576
+ chatbot_interface = gr.Chatbot(label="Conversation")
577
+
578
+ with gr.Row():
579
+ with gr.Column():
580
+ audio_output_chat = gr.Audio(autoplay=True)
581
+ with gr.Column():
582
+ audio_input_chat = gr.Microphone(
583
+ label="Speak your message",
584
+ type="filepath",
585
+ )
586
+
587
+ clear_btn_chat = gr.Button("Clear Conversation")
588
+
589
+ conversation_state = gr.State(
590
+ value=[
591
+ {
592
+ "role": "system",
593
+ "content": "You are not an AI assistant, you are whoever the user says you are. You must stay in character. Keep your responses concise since they will be spoken out loud.",
594
+ }
595
+ ]
596
+ )
597
+
598
+ # Modify process_audio_input to use model and tokenizer from state
599
+ def process_audio_input(audio_path, history, conv_state):
600
+ """Handle audio input from user"""
601
+ if not audio_path:
602
+ return history, conv_state, ""
603
+
604
+ text = ""
605
+ text = preprocess_ref_audio_text(audio_path, text)[1]
606
+
607
+ if not text.strip():
608
+ return history, conv_state, ""
609
+
610
+ conv_state.append({"role": "user", "content": text})
611
+ history.append((text, None))
612
+
613
+ response = generate_response(conv_state, chat_model_state, chat_tokenizer_state)
614
+
615
+ conv_state.append({"role": "assistant", "content": response})
616
+ history[-1] = (text, response)
617
+
618
+ return history, conv_state, ""
619
+
620
+ def generate_audio_response(history, ref_audio, ref_text, model, remove_silence):
621
+ """Generate TTS audio for AI response"""
622
+ if not history or not ref_audio:
623
+ return None
624
+
625
+ last_user_message, last_ai_response = history[-1]
626
+ if not last_ai_response:
627
+ return None
628
+
629
+ audio_result, _ = infer(
630
+ ref_audio,
631
+ ref_text,
632
+ last_ai_response,
633
+ model,
634
+ remove_silence,
635
+ cross_fade_duration=0.15,
636
+ speed=1.0,
637
+ )
638
+ return audio_result
639
+
640
+ def clear_conversation():
641
+ """Reset the conversation"""
642
+ return [], [
643
+ {
644
+ "role": "system",
645
+ "content": "You are not an AI assistant, you are whoever the user says you are. You must stay in character. Keep your responses concise since they will be spoken out loud.",
646
+ }
647
+ ]
648
+
649
+ def update_system_prompt(new_prompt):
650
+ """Update the system prompt and reset the conversation"""
651
+ new_conv_state = [{"role": "system", "content": new_prompt}]
652
+ return [], new_conv_state
653
+
654
+ # Handle audio input
655
+ audio_input_chat.stop_recording(
656
+ process_audio_input,
657
+ inputs=[audio_input_chat, chatbot_interface, conversation_state],
658
+ outputs=[chatbot_interface, conversation_state],
659
+ ).then(
660
+ generate_audio_response,
661
+ inputs=[chatbot_interface, ref_audio_chat, ref_text_chat, model_choice_chat, remove_silence_chat],
662
+ outputs=audio_output_chat,
663
+ )
664
+
665
+ # Handle clear button
666
+ clear_btn_chat.click(
667
+ clear_conversation,
668
+ outputs=[chatbot_interface, conversation_state],
669
+ )
670
+
671
+ # Handle system prompt change and reset conversation
672
+ system_prompt_chat.change(
673
+ update_system_prompt,
674
+ inputs=system_prompt_chat,
675
+ outputs=[chatbot_interface, conversation_state],
676
+ )
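The chat tab tracks two structures: a `history` list of `(user, assistant)` pairs for the chatbot display, and a `conv_state` list of role/content messages that is passed to the language model. A minimal sketch of that bookkeeping without Gradio or a real model is shown below. `new_conversation`, `add_turn`, and the echo reply function are hypothetical names, and the system prompt is shortened.

```python
DEFAULT_SYSTEM_PROMPT = "Keep your responses concise since they will be spoken out loud."


def new_conversation(system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Return a fresh (history, conv_state) pair, as clear_conversation does."""
    return [], [{"role": "system", "content": system_prompt}]


def add_turn(history, conv_state, user_text, reply_fn):
    """Record one user message and the model reply in both structures.

    reply_fn is a hypothetical stand-in for generate_response; it receives
    the full message list and returns the reply text.
    """
    conv_state.append({"role": "user", "content": user_text})
    response = reply_fn(conv_state)
    conv_state.append({"role": "assistant", "content": response})
    history.append((user_text, response))
    return history, conv_state


def echo_reply(messages):
    """Toy reply function that echoes the last user message."""
    return "echo: " + messages[-1]["content"]


history, conv_state = new_conversation()
history, conv_state = add_turn(history, conv_state, "hi", echo_reply)
print(history)
print([message["role"] for message in conv_state])
```

Changing the system prompt amounts to calling `new_conversation` with the new text, which is why the diff resets both structures in `update_system_prompt`.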
677
+
678
+
679
  with gr.Blocks() as app:
680
  gr.Markdown(
681
  """
 
693
  **NOTE: Reference text will be automatically transcribed with Whisper if not provided. For best results, keep your reference clips short (<15s). Ensure the audio is fully uploaded before generating.**
694
  """
695
  )
696
+ gr.TabbedInterface(
697
+ [app_tts, app_podcast, app_emotional, app_chat, app_credits],
698
+ ["TTS", "Podcast", "Multi-Style", "Voice-Chat", "Credits"],
699
+ )
700
 
701
 
702
  @click.command()
data/Emilia_ZH_EN_pinyin/vocab.txt CHANGED
@@ -1,2545 +1,2545 @@
1
-
2
- !
3
- "
4
- #
5
- $
6
- %
7
- &
8
- '
9
- (
10
- )
11
- *
12
- +
13
- ,
14
- -
15
- .
16
- /
17
- 0
18
- 1
19
- 2
20
- 3
21
- 4
22
- 5
23
- 6
24
- 7
25
- 8
26
- 9
27
- :
28
- ;
29
- =
30
- >
31
- ?
32
- @
33
- A
34
- B
35
- C
36
- D
37
- E
38
- F
39
- G
40
- H
41
- I
42
- J
43
- K
44
- L
45
- M
46
- N
47
- O
48
- P
49
- Q
50
- R
51
- S
52
- T
53
- U
54
- V
55
- W
56
- X
57
- Y
58
- Z
59
- [
60
- \
61
- ]
62
- _
63
- a
64
- a1
65
- ai1
66
- ai2
67
- ai3
68
- ai4
69
- an1
70
- an3
71
- an4
72
- ang1
73
- ang2
74
- ang4
75
- ao1
76
- ao2
77
- ao3
78
- ao4
79
- b
80
- ba
81
- ba1
82
- ba2
83
- ba3
84
- ba4
85
- bai1
86
- bai2
87
- bai3
88
- bai4
89
- ban1
90
- ban2
91
- ban3
92
- ban4
93
- bang1
94
- bang2
95
- bang3
96
- bang4
97
- bao1
98
- bao2
99
- bao3
100
- bao4
101
- bei
102
- bei1
103
- bei2
104
- bei3
105
- bei4
106
- ben1
107
- ben2
108
- ben3
109
- ben4
110
- beng
111
- beng1
112
- beng2
113
- beng3
114
- beng4
115
- bi1
116
- bi2
117
- bi3
118
- bi4
119
- bian1
120
- bian2
121
- bian3
122
- bian4
123
- biao1
124
- biao2
125
- biao3
126
- bie1
127
- bie2
128
- bie3
129
- bie4
130
- bin1
131
- bin4
132
- bing1
133
- bing2
134
- bing3
135
- bing4
136
- bo
137
- bo1
138
- bo2
139
- bo3
140
- bo4
141
- bu2
142
- bu3
143
- bu4
144
- c
145
- ca1
146
- cai1
147
- cai2
148
- cai3
149
- cai4
150
- can1
151
- can2
152
- can3
153
- can4
154
- cang1
155
- cang2
156
- cao1
157
- cao2
158
- cao3
159
- ce4
160
- cen1
161
- cen2
162
- ceng1
163
- ceng2
164
- ceng4
165
- cha1
166
- cha2
167
- cha3
168
- cha4
169
- chai1
170
- chai2
171
- chan1
172
- chan2
173
- chan3
174
- chan4
175
- chang1
176
- chang2
177
- chang3
178
- chang4
179
- chao1
180
- chao2
181
- chao3
182
- che1
183
- che2
184
- che3
185
- che4
186
- chen1
187
- chen2
188
- chen3
189
- chen4
190
- cheng1
191
- cheng2
192
- cheng3
193
- cheng4
194
- chi1
195
- chi2
196
- chi3
197
- chi4
198
- chong1
199
- chong2
200
- chong3
201
- chong4
202
- chou1
203
- chou2
204
- chou3
205
- chou4
206
- chu1
207
- chu2
208
- chu3
209
- chu4
210
- chua1
211
- chuai1
212
- chuai2
213
- chuai3
214
- chuai4
215
- chuan1
216
- chuan2
217
- chuan3
218
- chuan4
219
- chuang1
220
- chuang2
221
- chuang3
222
- chuang4
223
- chui1
224
- chui2
225
- chun1
226
- chun2
227
- chun3
228
- chuo1
229
- chuo4
230
- ci1
231
- ci2
232
- ci3
233
- ci4
234
- cong1
235
- cong2
236
- cou4
237
- cu1
238
- cu4
239
- cuan1
240
- cuan2
241
- cuan4
242
- cui1
243
- cui3
244
- cui4
245
- cun1
246
- cun2
247
- cun4
248
- cuo1
249
- cuo2
250
- cuo4
251
- d
252
- da
253
- da1
254
- da2
255
- da3
256
- da4
257
- dai1
258
- dai2
259
- dai3
260
- dai4
261
- dan1
262
- dan2
263
- dan3
264
- dan4
265
- dang1
266
- dang2
267
- dang3
268
- dang4
269
- dao1
270
- dao2
271
- dao3
272
- dao4
273
- de
274
- de1
275
- de2
276
- dei3
277
- den4
278
- deng1
279
- deng2
280
- deng3
281
- deng4
282
- di1
283
- di2
284
- di3
285
- di4
286
- dia3
287
- dian1
288
- dian2
289
- dian3
290
- dian4
291
- diao1
292
- diao3
293
- diao4
294
- die1
295
- die2
296
- die4
297
- ding1
298
- ding2
299
- ding3
300
- ding4
301
- diu1
302
- dong1
303
- dong3
304
- dong4
305
- dou1
306
- dou2
307
- dou3
308
- dou4
309
- du1
310
- du2
311
- du3
312
- du4
313
- duan1
314
- duan2
315
- duan3
316
- duan4
317
- dui1
318
- dui4
319
- dun1
320
- dun3
321
- dun4
322
- duo1
323
- duo2
324
- duo3
325
- duo4
326
- e
327
- e1
328
- e2
329
- e3
330
- e4
331
- ei2
332
- en1
333
- en4
334
- er
335
- er2
336
- er3
337
- er4
338
- f
339
- fa1
340
- fa2
341
- fa3
342
- fa4
343
- fan1
344
- fan2
345
- fan3
346
- fan4
347
- fang1
348
- fang2
349
- fang3
350
- fang4
351
- fei1
352
- fei2
353
- fei3
354
- fei4
355
- fen1
356
- fen2
357
- fen3
358
- fen4
359
- feng1
360
- feng2
361
- feng3
362
- feng4
363
- fo2
364
- fou2
365
- fou3
366
- fu1
367
- fu2
368
- fu3
369
- fu4
370
- g
371
- ga1
372
- ga2
373
- ga3
374
- ga4
375
- gai1
376
- gai2
377
- gai3
378
- gai4
379
- gan1
380
- gan2
381
- gan3
382
- gan4
383
- gang1
384
- gang2
385
- gang3
386
- gang4
387
- gao1
388
- gao2
389
- gao3
390
- gao4
391
- ge1
392
- ge2
393
- ge3
394
- ge4
395
- gei2
396
- gei3
397
- gen1
398
- gen2
399
- gen3
400
- gen4
401
- geng1
402
- geng3
403
- geng4
404
- gong1
405
- gong3
406
- gong4
407
- gou1
408
- gou2
409
- gou3
410
- gou4
411
- gu
412
- gu1
413
- gu2
414
- gu3
415
- gu4
416
- gua1
417
- gua2
418
- gua3
419
- gua4
420
- guai1
421
- guai2
422
- guai3
423
- guai4
424
- guan1
425
- guan2
426
- guan3
427
- guan4
428
- guang1
429
- guang2
430
- guang3
431
- guang4
432
- gui1
433
- gui2
434
- gui3
435
- gui4
436
- gun3
437
- gun4
438
- guo1
439
- guo2
440
- guo3
441
- guo4
442
- h
443
- ha1
444
- ha2
445
- ha3
446
- hai1
447
- hai2
448
- hai3
449
- hai4
450
- han1
451
- han2
452
- han3
453
- han4
454
- hang1
455
- hang2
456
- hang4
457
- hao1
458
- hao2
459
- hao3
460
- hao4
461
- he1
462
- he2
463
- he4
464
- hei1
465
- hen2
466
- hen3
467
- hen4
468
- heng1
469
- heng2
470
- heng4
471
- hong1
472
- hong2
473
- hong3
474
- hong4
475
- hou1
476
- hou2
477
- hou3
478
- hou4
479
- hu1
480
- hu2
481
- hu3
482
- hu4
483
- hua1
484
- hua2
485
- hua4
486
- huai2
487
- huai4
488
- huan1
489
- huan2
490
- huan3
491
- huan4
492
- huang1
493
- huang2
494
- huang3
495
- huang4
496
- hui1
497
- hui2
498
- hui3
499
- hui4
500
- hun1
501
- hun2
502
- hun4
503
- huo
504
- huo1
505
- huo2
506
- huo3
507
- huo4
508
- i
509
- j
510
- ji1
511
- ji2
512
- ji3
513
- ji4
514
- jia
515
- jia1
516
- jia2
517
- jia3
518
- jia4
519
- jian1
520
- jian2
521
- jian3
522
- jian4
523
- jiang1
524
- jiang2
525
- jiang3
526
- jiang4
527
- jiao1
528
- jiao2
529
- jiao3
530
- jiao4
531
- jie1
532
- jie2
533
- jie3
534
- jie4
535
- jin1
536
- jin2
537
- jin3
538
- jin4
539
- jing1
540
- jing2
541
- jing3
542
- jing4
543
- jiong3
544
- jiu1
545
- jiu2
546
- jiu3
547
- jiu4
548
- ju1
549
- ju2
550
- ju3
551
- ju4
552
- juan1
553
- juan2
554
- juan3
555
- juan4
556
- jue1
557
- jue2
558
- jue4
559
- jun1
560
- jun4
561
- k
562
- ka1
563
- ka2
564
- ka3
565
- kai1
566
- kai2
567
- kai3
568
- kai4
569
- kan1
570
- kan2
571
- kan3
572
- kan4
573
- kang1
574
- kang2
575
- kang4
576
- kao1
577
- kao2
578
- kao3
579
- kao4
580
- ke1
581
- ke2
582
- ke3
583
- ke4
584
- ken3
585
- keng1
586
- kong1
587
- kong3
588
- kong4
589
- kou1
590
- kou2
591
- kou3
592
- kou4
593
- ku1
594
- ku2
595
- ku3
596
- ku4
597
- kua1
598
- kua3
599
- kua4
600
- kuai3
601
- kuai4
602
- kuan1
603
- kuan2
604
- kuan3
605
- kuang1
606
- kuang2
607
- kuang4
608
- kui1
609
- kui2
610
- kui3
611
- kui4
612
- kun1
613
- kun3
614
- kun4
615
- kuo4
616
- l
617
- la
618
- la1
619
- la2
620
- la3
621
- la4
622
- lai2
623
- lai4
624
- lan2
625
- lan3
626
- lan4
627
- lang1
628
- lang2
629
- lang3
630
- lang4
631
- lao1
632
- lao2
633
- lao3
634
- lao4
635
- le
636
- le1
637
- le4
638
- lei
639
- lei1
640
- lei2
641
- lei3
642
- lei4
643
- leng1
644
- leng2
645
- leng3
646
- leng4
647
- li
648
- li1
649
- li2
650
- li3
651
- li4
652
- lia3
653
- lian2
654
- lian3
655
- lian4
656
- liang2
657
- liang3
658
- liang4
659
- liao1
660
- liao2
661
- liao3
662
- liao4
663
- lie1
664
- lie2
665
- lie3
666
- lie4
667
- lin1
668
- lin2
669
- lin3
670
- lin4
671
- ling2
672
- ling3
673
- ling4
674
- liu1
675
- liu2
676
- liu3
677
- liu4
678
- long1
679
- long2
680
- long3
681
- long4
682
- lou1
683
- lou2
684
- lou3
685
- lou4
686
- lu1
687
- lu2
688
- lu3
689
- lu4
690
- luan2
691
- luan3
692
- luan4
693
- lun1
694
- lun2
695
- lun4
696
- luo1
697
- luo2
698
- luo3
699
- luo4
700
- lv2
701
- lv3
702
- lv4
703
- lve3
704
- lve4
705
- m
706
- ma
707
- ma1
708
- ma2
709
- ma3
710
- ma4
711
- mai2
712
- mai3
713
- mai4
714
- man1
715
- man2
716
- man3
717
- man4
718
- mang2
719
- mang3
720
- mao1
721
- mao2
722
- mao3
723
- mao4
724
- me
725
- mei2
726
- mei3
727
- mei4
728
- men
729
- men1
730
- men2
731
- men4
732
- meng
733
- meng1
734
- meng2
735
- meng3
736
- meng4
737
- mi1
738
- mi2
739
- mi3
740
- mi4
741
- mian2
742
- mian3
743
- mian4
744
- miao1
745
- miao2
746
- miao3
747
- miao4
748
- mie1
749
- mie4
750
- min2
751
- min3
752
- ming2
753
- ming3
754
- ming4
755
- miu4
756
- mo1
757
- mo2
758
- mo3
759
- mo4
760
- mou1
761
- mou2
762
- mou3
763
- mu2
764
- mu3
765
- mu4
766
- n
767
- n2
768
- na1
769
- na2
770
- na3
771
- na4
772
- nai2
773
- nai3
774
- nai4
775
- nan1
776
- nan2
777
- nan3
778
- nan4
779
- nang1
780
- nang2
781
- nang3
782
- nao1
783
- nao2
784
- nao3
785
- nao4
786
- ne
787
- ne2
788
- ne4
789
- nei3
790
- nei4
791
- nen4
792
- neng2
793
- ni1
794
- ni2
795
- ni3
796
- ni4
797
- nian1
798
- nian2
799
- nian3
800
- nian4
801
- niang2
802
- niang4
803
- niao2
804
- niao3
805
- niao4
806
- nie1
807
- nie4
808
- nin2
809
- ning2
810
- ning3
811
- ning4
812
- niu1
813
- niu2
814
- niu3
815
- niu4
816
- nong2
817
- nong4
818
- nou4
819
- nu2
820
- nu3
821
- nu4
822
- nuan3
823
- nuo2
824
- nuo4
825
- nv2
826
- nv3
827
- nve4
828
- o
829
- o1
830
- o2
831
- ou1
832
- ou2
833
- ou3
834
- ou4
835
- p
836
- pa1
837
- pa2
838
- pa4
839
- pai1
840
- pai2
841
- pai3
842
- pai4
843
- pan1
844
- pan2
845
- pan4
846
- pang1
847
- pang2
848
- pang4
849
- pao1
850
- pao2
851
- pao3
852
- pao4
853
- pei1
854
- pei2
855
- pei4
856
- pen1
857
- pen2
858
- pen4
859
- peng1
860
- peng2
861
- peng3
862
- peng4
863
- pi1
864
- pi2
865
- pi3
866
- pi4
867
- pian1
868
- pian2
869
- pian4
870
- piao1
871
- piao2
872
- piao3
873
- piao4
874
- pie1
875
- pie2
876
- pie3
877
- pin1
878
- pin2
879
- pin3
880
- pin4
881
- ping1
882
- ping2
883
- po1
884
- po2
885
- po3
886
- po4
887
- pou1
888
- pu1
889
- pu2
890
- pu3
891
- pu4
892
- q
893
- qi1
894
- qi2
895
- qi3
896
- qi4
897
- qia1
898
- qia3
899
- qia4
900
- qian1
901
- qian2
902
- qian3
903
- qian4
904
- qiang1
905
- qiang2
906
- qiang3
907
- qiang4
908
- qiao1
909
- qiao2
910
- qiao3
911
- qiao4
912
- qie1
913
- qie2
914
- qie3
915
- qie4
916
- qin1
917
- qin2
918
- qin3
919
- qin4
920
- qing1
921
- qing2
922
- qing3
923
- qing4
924
- qiong1
925
- qiong2
926
- qiu1
927
- qiu2
928
- qiu3
929
- qu1
930
- qu2
931
- qu3
932
- qu4
933
- quan1
934
- quan2
935
- quan3
936
- quan4
937
- que1
938
- que2
939
- que4
940
- qun2
941
- r
942
- ran2
943
- ran3
944
- rang1
945
- rang2
946
- rang3
947
- rang4
948
- rao2
949
- rao3
950
- rao4
951
- re2
952
- re3
953
- re4
954
- ren2
955
- ren3
956
- ren4
957
- reng1
958
- reng2
959
- ri4
960
- rong1
961
- rong2
962
- rong3
963
- rou2
964
- rou4
965
- ru2
966
- ru3
967
- ru4
968
- ruan2
969
- ruan3
970
- rui3
971
- rui4
972
- run4
973
- ruo4
974
- s
975
- sa1
976
- sa2
977
- sa3
978
- sa4
979
- sai1
980
- sai4
981
- san1
982
- san2
983
- san3
984
- san4
985
- sang1
986
- sang3
987
- sang4
988
- sao1
989
- sao2
990
- sao3
991
- sao4
992
- se4
993
- sen1
994
- seng1
995
- sha1
996
- sha2
997
- sha3
998
- sha4
999
- shai1
1000
- shai2
1001
- shai3
1002
- shai4
1003
- shan1
1004
- shan3
1005
- shan4
1006
- shang
1007
- shang1
1008
- shang3
1009
- shang4
1010
- shao1
1011
- shao2
1012
- shao3
1013
- shao4
1014
- she1
1015
- she2
1016
- she3
1017
- she4
1018
- shei2
1019
- shen1
1020
- shen2
1021
- shen3
1022
- shen4
1023
- sheng1
1024
- sheng2
1025
- sheng3
1026
- sheng4
1027
- shi
1028
- shi1
1029
- shi2
1030
- shi3
1031
- shi4
1032
- shou1
1033
- shou2
1034
- shou3
1035
- shou4
1036
- shu1
1037
- shu2
1038
- shu3
1039
- shu4
1040
- shua1
1041
- shua2
1042
- shua3
1043
- shua4
1044
- shuai1
1045
- shuai3
1046
- shuai4
1047
- shuan1
1048
- shuan4
1049
- shuang1
1050
- shuang3
1051
- shui2
1052
- shui3
1053
- shui4
1054
- shun3
1055
- shun4
1056
- shuo1
1057
- shuo4
1058
- si1
1059
- si2
1060
- si3
1061
- si4
1062
- song1
1063
- song3
1064
- song4
1065
- sou1
1066
- sou3
1067
- sou4
1068
- su1
1069
- su2
1070
- su4
1071
- suan1
1072
- suan4
1073
- sui1
1074
- sui2
1075
- sui3
1076
- sui4
1077
- sun1
1078
- sun3
1079
- suo
1080
- suo1
1081
- suo2
1082
- suo3
1083
- t
1084
- ta1
1085
- ta2
1086
- ta3
1087
- ta4
1088
- tai1
1089
- tai2
1090
- tai4
1091
- tan1
1092
- tan2
1093
- tan3
1094
- tan4
1095
- tang1
1096
- tang2
1097
- tang3
1098
- tang4
1099
- tao1
1100
- tao2
1101
- tao3
1102
- tao4
1103
- te4
1104
- teng2
1105
- ti1
1106
- ti2
1107
- ti3
1108
- ti4
1109
- tian1
1110
- tian2
1111
- tian3
1112
- tiao1
1113
- tiao2
1114
- tiao3
1115
- tiao4
1116
- tie1
1117
- tie2
1118
- tie3
1119
- tie4
1120
- ting1
1121
- ting2
1122
- ting3
1123
- tong1
1124
- tong2
1125
- tong3
1126
- tong4
1127
- tou
1128
- tou1
1129
- tou2
1130
- tou4
1131
- tu1
1132
- tu2
1133
- tu3
1134
- tu4
1135
- tuan1
1136
- tuan2
1137
- tui1
1138
- tui2
1139
- tui3
1140
- tui4
1141
- tun1
1142
- tun2
1143
- tun4
1144
- tuo1
1145
- tuo2
1146
- tuo3
1147
- tuo4
1148
- u
1149
- v
1150
- w
1151
- wa
1152
- wa1
1153
- wa2
1154
- wa3
1155
- wa4
1156
- wai1
1157
- wai3
1158
- wai4
1159
- wan1
1160
- wan2
1161
- wan3
1162
- wan4
1163
- wang1
1164
- wang2
1165
- wang3
1166
- wang4
1167
- wei1
1168
- wei2
1169
- wei3
1170
- wei4
1171
- wen1
1172
- wen2
1173
- wen3
1174
- wen4
1175
- weng1
1176
- weng4
1177
- wo1
1178
- wo2
1179
- wo3
1180
- wo4
1181
- wu1
1182
- wu2
1183
- wu3
1184
- wu4
1185
- x
1186
- xi1
1187
- xi2
1188
- xi3
1189
- xi4
1190
- xia1
1191
- xia2
1192
- xia4
1193
- xian1
1194
- xian2
1195
- xian3
1196
- xian4
1197
- xiang1
1198
- xiang2
1199
- xiang3
1200
- xiang4
1201
- xiao1
1202
- xiao2
1203
- xiao3
1204
- xiao4
1205
- xie1
1206
- xie2
1207
- xie3
1208
- xie4
1209
- xin1
1210
- xin2
1211
- xin4
1212
- xing1
1213
- xing2
1214
- xing3
1215
- xing4
1216
- xiong1
1217
- xiong2
1218
- xiu1
1219
- xiu3
1220
- xiu4
1221
- xu
1222
- xu1
1223
- xu2
1224
- xu3
1225
- xu4
1226
- xuan1
1227
- xuan2
1228
- xuan3
1229
- xuan4
1230
- xue1
1231
- xue2
1232
- xue3
1233
- xue4
1234
- xun1
1235
- xun2
1236
- xun4
1237
- y
1238
- ya
1239
- ya1
1240
- ya2
1241
- ya3
1242
- ya4
1243
- yan1
1244
- yan2
1245
- yan3
1246
- yan4
1247
- yang1
1248
- yang2
1249
- yang3
1250
- yang4
1251
- yao1
1252
- yao2
1253
- yao3
1254
- yao4
1255
- ye1
1256
- ye2
1257
- ye3
1258
- ye4
1259
- yi
1260
- yi1
1261
- yi2
1262
- yi3
1263
- yi4
1264
- yin1
1265
- yin2
1266
- yin3
1267
- yin4
1268
- ying1
1269
- ying2
1270
- ying3
1271
- ying4
1272
- yo1
1273
- yong1
1274
- yong2
1275
- yong3
1276
- yong4
1277
- you1
1278
- you2
1279
- you3
1280
- you4
1281
- yu1
1282
- yu2
1283
- yu3
1284
- yu4
1285
- yuan1
1286
- yuan2
1287
- yuan3
1288
- yuan4
1289
- yue1
1290
- yue4
1291
- yun1
1292
- yun2
1293
- yun3
1294
- yun4
1295
- z
1296
- za1
1297
- za2
1298
- za3
1299
- zai1
1300
- zai3
1301
- zai4
1302
- zan1
1303
- zan2
1304
- zan3
1305
- zan4
1306
- zang1
1307
- zang4
1308
- zao1
1309
- zao2
1310
- zao3
1311
- zao4
1312
- ze2
1313
- ze4
1314
- zei2
1315
- zen3
1316
- zeng1
1317
- zeng4
1318
- zha1
1319
- zha2
1320
- zha3
1321
- zha4
1322
- zhai1
1323
- zhai2
1324
- zhai3
1325
- zhai4
1326
- zhan1
1327
- zhan2
1328
- zhan3
1329
- zhan4
1330
- zhang1
1331
- zhang2
1332
- zhang3
1333
- zhang4
1334
- zhao1
1335
- zhao2
1336
- zhao3
1337
- zhao4
1338
- zhe
1339
- zhe1
1340
- zhe2
1341
- zhe3
1342
- zhe4
1343
- zhen1
1344
- zhen2
1345
- zhen3
1346
- zhen4
1347
- zheng1
1348
- zheng2
1349
- zheng3
1350
- zheng4
1351
- zhi1
1352
- zhi2
1353
- zhi3
1354
- zhi4
1355
- zhong1
1356
- zhong2
1357
- zhong3
1358
- zhong4
1359
- zhou1
1360
- zhou2
1361
- zhou3
1362
- zhou4
1363
- zhu1
1364
- zhu2
1365
- zhu3
1366
- zhu4
1367
- zhua1
1368
- zhua2
1369
- zhua3
1370
- zhuai1
1371
- zhuai3
1372
- zhuai4
1373
- zhuan1
1374
- zhuan2
1375
- zhuan3
1376
- zhuan4
1377
- zhuang1
1378
- zhuang4
1379
- zhui1
1380
- zhui4
1381
- zhun1
1382
- zhun2
1383
- zhun3
1384
- zhuo1
1385
- zhuo2
1386
- zi
1387
- zi1
1388
- zi2
1389
- zi3
1390
- zi4
1391
- zong1
1392
- zong2
1393
- zong3
1394
- zong4
1395
- zou1
1396
- zou2
1397
- zou3
1398
- zou4
1399
- zu1
1400
- zu2
1401
- zu3
1402
- zuan1
1403
- zuan3
1404
- zuan4
1405
- zui2
1406
- zui3
1407
- zui4
1408
- zun1
1409
- zuo
1410
- zuo1
1411
- zuo2
1412
- zuo3
1413
- zuo4
1414
- {
1415
- ~
1416
- ¡
1417
- ¢
1418
- £
1419
- ¥
1420
- §
1421
- ¨
1422
- ©
1423
- «
1424
- ®
1425
- ¯
1426
- °
1427
- ±
1428
- ²
1429
- ³
1430
- ´
1431
- µ
1432
- ·
1433
- ¹
1434
- º
1435
- »
1436
- ¼
1437
- ½
1438
- ¾
1439
- ¿
1440
- À
1441
- Á
1442
- Â
1443
- Ã
1444
- Ä
1445
- Å
1446
- Æ
1447
- Ç
1448
- È
1449
- É
1450
- Ê
1451
- Í
1452
- Î
1453
- Ñ
1454
- Ó
1455
- Ö
1456
- ×
1457
- Ø
1458
- Ú
1459
- Ü
1460
- Ý
1461
- Þ
1462
- ß
1463
- à
1464
- á
1465
- â
1466
- ã
1467
- ä
1468
- å
1469
- æ
1470
- ç
1471
- è
1472
- é
1473
- ê
1474
- ë
1475
- ì
1476
- í
1477
- î
1478
- ï
1479
- ð
1480
- ñ
1481
- ò
1482
- ó
1483
- ô
1484
- õ
1485
- ö
1486
- ø
1487
- ù
1488
- ú
1489
- û
1490
- ü
1491
- ý
1492
- Ā
1493
- ā
1494
- ă
1495
- ą
1496
- ć
1497
- Č
1498
- č
1499
- Đ
1500
- đ
1501
- ē
1502
- ė
1503
- ę
1504
- ě
1505
- ĝ
1506
- ğ
1507
- ħ
1508
- ī
1509
- į
1510
- İ
1511
- ı
1512
- Ł
1513
- ł
1514
- ń
1515
- ņ
1516
- ň
1517
- ŋ
1518
- Ō
1519
- ō
1520
- ő
1521
- œ
1522
- ř
1523
- Ś
1524
- ś
1525
- Ş
1526
- ş
1527
- Š
1528
- š
1529
- Ť
1530
- ť
1531
- ũ
1532
- ū
1533
- ź
1534
- Ż
1535
- ż
1536
- Ž
1537
- ž
1538
- ơ
1539
- ư
1540
- ǎ
1541
- ǐ
1542
- ǒ
1543
- ǔ
1544
- ǚ
1545
- ș
1546
- ț
1547
- ɑ
1548
- ɔ
1549
- ɕ
1550
- ə
1551
- ɛ
1552
- ɜ
1553
- ɡ
1554
- ɣ
1555
- ɪ
1556
- ɫ
1557
- ɴ
1558
- ɹ
1559
- ɾ
1560
- ʃ
1561
- ʊ
1562
- ʌ
1563
- ʒ
1564
- ʔ
1565
- ʰ
1566
- ʷ
1567
- ʻ
1568
- ʾ
1569
- ʿ
1570
- ˈ
1571
- ː
1572
- ˙
1573
- ˜
1574
- ˢ
1575
- ́
1576
- ̅
1577
- Α
1578
- Β
1579
- Δ
1580
- Ε
1581
- Θ
1582
- Κ
1583
- Λ
1584
- Μ
1585
- Ξ
1586
- Π
1587
- Σ
1588
- Τ
1589
- Φ
1590
- Χ
1591
- Ψ
1592
- Ω
1593
- ά
1594
- έ
1595
- ή
1596
- ί
1597
- α
1598
- β
1599
- γ
1600
- δ
1601
- ε
1602
- ζ
1603
- η
1604
- θ
1605
- ι
1606
- κ
1607
- λ
1608
- μ
1609
- ν
1610
- ξ
1611
- ο
1612
- π
1613
- ρ
1614
- ς
1615
- σ
1616
- τ
1617
- υ
1618
- φ
1619
- χ
1620
- ψ
1621
- ω
1622
- ϊ
1623
- ό
1624
- ύ
1625
- ώ
1626
- ϕ
1627
- ϵ
1628
- Ё
1629
- А
1630
- Б
1631
- В
1632
- Г
1633
- Д
1634
- Е
1635
- Ж
1636
- З
1637
- И
1638
- Й
1639
- К
1640
- Л
1641
- М
1642
- Н
1643
- О
1644
- П
1645
- Р
1646
- С
1647
- Т
1648
- У
1649
- Ф
1650
- Х
1651
- Ц
1652
- Ч
1653
- Ш
1654
- Щ
1655
- Ы
1656
- Ь
1657
- Э
1658
- Ю
1659
- Я
1660
- а
1661
- б
1662
- в
1663
- г
1664
- д
1665
- е
1666
- ж
1667
- з
1668
- и
1669
- й
1670
- к
1671
- л
1672
- м
1673
- н
1674
- о
1675
- п
1676
- р
1677
- с
1678
- т
1679
- у
1680
- ф
1681
- х
1682
- ц
1683
- ч
1684
- ш
1685
- щ
1686
- ъ
1687
- ы
1688
- ь
1689
- э
1690
- ю
1691
- я
1692
- ё
1693
- і
1694
- ְ
1695
- ִ
1696
- ֵ
1697
- ֶ
1698
- ַ
1699
- ָ
1700
- ֹ
1701
- ּ
1702
- ־
1703
- ׁ
1704
- א
1705
- ב
1706
- ג
1707
- ד
1708
- ה
1709
- ו
1710
- ז
1711
- ח
1712
- ט
1713
- י
1714
- כ
1715
- ל
1716
- ם
1717
- מ
1718
- ן
1719
- נ
1720
- ס
1721
- ע
1722
- פ
1723
- ק
1724
- ר
1725
- ש
1726
- ת
1727
- أ
1728
- ب
1729
- ة
1730
- ت
1731
- ج
1732
- ح
1733
- د
1734
- ر
1735
- ز
1736
- س
1737
- ص
1738
- ط
1739
- ع
1740
- ق
1741
- ك
1742
- ل
1743
- م
1744
- ن
1745
- ه
1746
- و
1747
- ي
1748
- َ
1749
- ُ
1750
- ِ
1751
- ْ
1752
-
1753
-
1754
-
1755
-
1756
-
1757
-
1758
-
1759
-
1760
-
1761
-
1762
-
1763
-
1764
-
1765
-
1766
-
1767
-
1768
-
1769
-
1770
-
1771
-
1772
-
1773
-
1774
-
1775
-
1776
-
1777
-
1778
-
1779
-
1780
-
1781
-
1782
-
1783
-
1784
-
1785
-
1786
-
1787
-
1788
-
1789
-
1790
-
1791
-
1792
-
1793
-
1794
-
1795
-
1796
-
1797
-
1798
-
1799
-
1800
- ế
1801
-
1802
-
1803
-
1804
-
1805
-
1806
-
1807
-
1808
-
1809
-
1810
-
1811
-
1812
-
1813
-
1814
-
1815
-
1816
-
1817
-
1818
-
1819
-
1820
-
1821
-
1822
-
1823
-
1824
-
1825
-
1826
-
1827
-
1828
-
1829
-
1830
-
1831
-
1832
-
1833
-
1834
-
1835
-
1836
-
1837
-
1838
-
1839
-
1840
-
1841
-
1842
-
1843
-
1844
-
1845
-
1846
-
1847
-
1848
-
1849
-
1850
-
1851
-
1852
-
1853
-
1854
-
1855
-
1856
-
1857
-
1858
-
1859
-
1860
-
1861
-
1862
-
1863
-
1864
-
1865
-
1866
-
1867
-
1868
-
1869
-
1870
-
1871
-
1872
-
1873
-
1874
-
1875
-
1876
-
1877
-
1878
-
1879
-
1880
-
1881
-
1882
-
1883
-
1884
-
1885
-
1886
-
1887
-
1888
-
1889
-
1890
-
1891
-
1892
-
1893
-
1894
-
1895
-
1896
-
1897
-
1898
-
1899
-
1900
-
1901
-
1902
-
1903
-
1904
-
1905
-
1906
-
1907
-
1908
-
1909
-
1910
-
1911
-
1912
-
1913
-
1914
-
1915
-
1916
-
1917
-
1918
-
1919
-
1920
-
1921
-
1922
-
1923
-
1924
-
1925
-
1926
-
1927
-
1928
-
1929
-
1930
-
1931
-
1932
-
1933
-
1934
-
1935
-
1936
-
1937
-
1938
-
1939
-
1940
-
1941
-
1942
-
1943
-
1944
-
1945
-
1946
-
1947
-
1948
-
1949
-
1950
-
1951
-
1952
-
1953
-
1954
-
1955
-
1956
-
1957
-
1958
-
1959
-
1960
-
1961
-
1962
-
1963
-
1964
-
1965
-
1966
-
1967
-
1968
-
1969
-
1970
-
1971
-
1972
-
1973
-
1974
-
1975
-
1976
-
1977
-
1978
-
1979
-
1980
-
1981
-
1982
-
1983
-
1984
-
1985
-
1986
-
1987
-
1988
-
1989
-
1990
-
1991
-
1992
-
1993
-
1994
-
1995
-
1996
-
1997
-
1998
-
1999
-
2000
-
2001
-
2002
-
2003
-
2004
-
2005
-
2006
-
2007
-
2008
-
2009
-
2010
-
2011
-
2012
-
2013
-
2014
-
2015
-
2016
-
2017
-
2018
-
2019
-
2020
-
2021
-
2022
-
2023
-
2024
-
2025
-
2026
-
2027
-
2028
-
2029
-
2030
-
2031
-
2032
-
2033
-
2034
-
2035
-
2036
-
2037
-
2038
-
2039
-
2040
-
2041
-
2042
-
2043
-
2044
-
2045
-
2046
-
2047
-
2048
-
2049
-
2050
-
2051
-
2052
-
2053
-
2054
-
2055
-
2056
-
2057
-
2058
-
2059
-
2060
-
2061
-
2062
-
2063
-
2064
-
2065
-
2066
-
2067
-
2068
-
2069
-
2070
-
2071
-
2072
-
2073
-
2074
-
2075
-
2076
-
2077
-
2078
-
2079
-
2080
-
2081
-
2082
-
2083
-
2084
-
2085
-
2086
-
2087
-
2088
-
2089
-
2090
-
2091
-
2092
-
2093
-
2094
-
2095
-
2096
-
2097
-
2098
-
2099
-
2100
-
2101
-
2102
-
2103
-
2104
-
2105
-
2106
-
2107
-
2108
-
2109
-
2110
-
2111
-
2112
-
2113
-
2114
-
2115
-
2116
-
2117
-
2118
-
2119
-
2120
-
2121
-
2122
-
2123
-
2124
-
2125
-
2126
-
2127
-
2128
-
2129
-
2130
-
2131
-
2132
-
2133
-
2134
-
2135
-
2136
-
2137
-
2138
-
2139
-
2140
-
2141
-
2142
-
2143
-
2144
-
2145
-
2146
-
2147
-
2148
-
2149
-
2150
-
2151
-
2152
-
2153
-
2154
-
2155
-
2156
-
2157
-
2158
-
2159
-
2160
-
2161
-
2162
-
2163
-
2164
-
2165
-
2166
-
2167
-
2168
-
2169
-
2170
-
2171
-
2172
-
2173
-
2174
-
2175
-
2176
-
2177
-
2178
-
2179
-
2180
-
2181
-
2182
-
2183
-
2184
-
2185
-
2186
-
2187
-
2188
-
2189
-
2190
-
2191
-
2192
-
2193
-
2194
-
2195
-
2196
-
2197
-
2198
-
2199
-
2200
-
2201
-
2202
-
2203
-
2204
-
2205
-
2206
-
2207
-
2208
-
2209
-
2210
-
2211
-
2212
-
2213
-
2214
-
2215
-
2216
-
2217
-
2218
-
2219
-
2220
-
2221
-
2222
-
2223
-
2224
-
2225
-
2226
-
2227
-
2228
-
2229
-
2230
-
2231
-
2232
-
2233
-
2234
-
2235
-
2236
-
2237
-
2238
-
2239
-
2240
-
2241
-
2242
-
2243
-
2244
-
2245
-
2246
-
2247
-
2248
-
2249
-
2250
-
2251
-
2252
-
2253
-
2254
-
2255
-
2256
-
2257
-
2258
-
2259
-
2260
-
2261
-
2262
-
2263
-
2264
-
2265
-
2266
-
2267
-
2268
-
2269
-
2270
-
2271
-
2272
-
2273
-
2274
-
2275
-
2276
-
2277
-
2278
-
2279
-
2280
-
2281
-
2282
-
2283
-
2284
-
2285
-
2286
-
2287
-
2288
-
2289
-
2290
-
2291
-
2292
-
2293
-
2294
-
2295
-
2296
-
2297
-
2298
-
2299
-
2300
-
2301
-
2302
-
2303
-
2304
-
2305
-
2306
-
2307
-
2308
-
2309
-
2310
-
2311
-
2312
-
2313
-
2314
-
2315
-
2316
-
2317
-
2318
-
2319
-
2320
-
2321
-
2322
-
2323
-
2324
-
2325
-
2326
-
2327
-
2328
-
2329
-
2330
-
2331
-
2332
-
2333
-
2334
-
2335
-
2336
-
2337
-
2338
-
2339
-
2340
-
2341
-
2342
-
2343
-
2344
-
2345
-
2346
-
2347
-
2348
-
2349
-
2350
-
2351
-
2352
-
2353
-
2354
-
2355
-
2356
-
2357
-
2358
-
2359
-
2360
-
2361
-
2362
-
2363
-
2364
-
2365
-
2366
-
2367
-
2368
-
2369
-
2370
-
2371
-
2372
-
2373
-
2374
-
2375
-
2376
-
2377
-
2378
-
2379
-
2380
-
2381
-
2382
-
2383
-
2384
-
2385
-
2386
-
2387
-
2388
-
2389
-
2390
-
2391
-
2392
-
2393
-
2394
-
2395
-
2396
-
2397
-
2398
-
2399
-
2400
-
2401
-
2402
-
2403
-
2404
-
2405
-
2406
-
2407
-
2408
-
2409
-
2410
-
2411
-
2412
-
2413
-
2414
-
2415
-
2416
-
2417
-
2418
-
2419
-
2420
-
2421
-
2422
-
2423
-
2424
-
2425
-
2426
-
2427
-
2428
-
2429
-
2430
-
2431
-
2432
-
2433
-
2434
-
2435
-
2436
-
2437
-
2438
-
2439
-
2440
-
2441
-
2442
-
2443
-
2444
-
2445
-
2446
-
2447
-
2448
-
2449
-
2450
-
2451
-
2452
-
2453
-
2454
-
2455
-
2456
-
2457
-
2458
-
2459
-
2460
-
2461
-
2462
-
2463
-
2464
-
2465
-
2466
-
2467
-
2468
-
2469
-
2470
-
2471
-
2472
-
2473
-
2474
-
2475
-
2476
-
2477
-
2478
-
2479
-
2480
-
2481
-
2482
-
2483
-
2484
-
2485
-
2486
-
2487
-
2488
-
2489
-
2490
-
2491
-
2492
-
2493
-
2494
-
2495
-
2496
-
2497
-
2498
-
2499
-
2500
-
2501
-
2502
-
2503
-
2504
-
2505
-
2506
-
2507
-
2508
-
2509
-
2510
-
2511
-
2512
-
2513
-
2514
-
2515
-
2516
-
2517
-
2518
-
2519
-
2520
-
2521
-
2522
-
2523
-
2524
-
2525
-
2526
-
2527
-
2528
-
2529
-
2530
-
2531
-
2532
-
2533
-
2534
-
2535
-
2536
-
2537
-
2538
-
2539
-
2540
-
2541
-
2542
-
2543
-
2544
-
2545
- 𠮶
 
1
+
2
+ !
3
+ "
4
+ #
5
+ $
6
+ %
7
+ &
8
+ '
9
+ (
10
+ )
11
+ *
12
+ +
13
+ ,
14
+ -
15
+ .
16
+ /
17
+ 0
18
+ 1
19
+ 2
20
+ 3
21
+ 4
22
+ 5
23
+ 6
24
+ 7
25
+ 8
26
+ 9
27
+ :
28
+ ;
29
+ =
30
+ >
31
+ ?
32
+ @
33
+ A
34
+ B
35
+ C
36
+ D
37
+ E
38
+ F
39
+ G
40
+ H
41
+ I
42
+ J
43
+ K
44
+ L
45
+ M
46
+ N
47
+ O
48
+ P
49
+ Q
50
+ R
51
+ S
52
+ T
53
+ U
54
+ V
55
+ W
56
+ X
57
+ Y
58
+ Z
59
+ [
60
+ \
61
+ ]
62
+ _
63
+ a
64
+ a1
65
+ ai1
66
+ ai2
67
+ ai3
68
+ ai4
69
+ an1
70
+ an3
71
+ an4
72
+ ang1
73
+ ang2
74
+ ang4
75
+ ao1
76
+ ao2
77
+ ao3
78
+ ao4
79
+ b
80
+ ba
81
+ ba1
82
+ ba2
83
+ ba3
84
+ ba4
85
+ bai1
86
+ bai2
87
+ bai3
88
+ bai4
89
+ ban1
90
+ ban2
91
+ ban3
92
+ ban4
93
+ bang1
94
+ bang2
95
+ bang3
96
+ bang4
97
+ bao1
98
+ bao2
99
+ bao3
100
+ bao4
101
+ bei
102
+ bei1
103
+ bei2
104
+ bei3
105
+ bei4
106
+ ben1
107
+ ben2
108
+ ben3
109
+ ben4
110
+ beng
111
+ beng1
112
+ beng2
113
+ beng3
114
+ beng4
115
+ bi1
116
+ bi2
117
+ bi3
118
+ bi4
119
+ bian1
120
+ bian2
121
+ bian3
122
+ bian4
123
+ biao1
124
+ biao2
125
+ biao3
126
+ bie1
127
+ bie2
128
+ bie3
129
+ bie4
130
+ bin1
131
+ bin4
132
+ bing1
133
+ bing2
134
+ bing3
135
+ bing4
136
+ bo
137
+ bo1
138
+ bo2
139
+ bo3
140
+ bo4
141
+ bu2
142
+ bu3
143
+ bu4
144
+ c
145
+ ca1
146
+ cai1
147
+ cai2
148
+ cai3
149
+ cai4
150
+ can1
151
+ can2
152
+ can3
153
+ can4
154
+ cang1
155
+ cang2
156
+ cao1
157
+ cao2
158
+ cao3
159
+ ce4
160
+ cen1
161
+ cen2
162
+ ceng1
163
+ ceng2
164
+ ceng4
165
+ cha1
166
+ cha2
167
+ cha3
168
+ cha4
169
+ chai1
170
+ chai2
171
+ chan1
172
+ chan2
173
+ chan3
174
+ chan4
175
+ chang1
176
+ chang2
177
+ chang3
178
+ chang4
179
+ chao1
180
+ chao2
181
+ chao3
182
+ che1
183
+ che2
184
+ che3
185
+ che4
186
+ chen1
187
+ chen2
188
+ chen3
189
+ chen4
190
+ cheng1
191
+ cheng2
192
+ cheng3
193
+ cheng4
194
+ chi1
195
+ chi2
196
+ chi3
197
+ chi4
198
+ chong1
199
+ chong2
200
+ chong3
201
+ chong4
202
+ chou1
203
+ chou2
204
+ chou3
205
+ chou4
206
+ chu1
207
+ chu2
208
+ chu3
209
+ chu4
210
+ chua1
211
+ chuai1
212
+ chuai2
213
+ chuai3
214
+ chuai4
215
+ chuan1
216
+ chuan2
217
+ chuan3
218
+ chuan4
219
+ chuang1
220
+ chuang2
221
+ chuang3
222
+ chuang4
223
+ chui1
224
+ chui2
225
+ chun1
226
+ chun2
227
+ chun3
228
+ chuo1
229
+ chuo4
230
+ ci1
231
+ ci2
232
+ ci3
233
+ ci4
234
+ cong1
235
+ cong2
236
+ cou4
237
+ cu1
238
+ cu4
239
+ cuan1
240
+ cuan2
241
+ cuan4
242
+ cui1
243
+ cui3
244
+ cui4
245
+ cun1
246
+ cun2
247
+ cun4
248
+ cuo1
249
+ cuo2
250
+ cuo4
251
+ d
252
+ da
253
+ da1
254
+ da2
255
+ da3
256
+ da4
257
+ dai1
258
+ dai2
259
+ dai3
260
+ dai4
261
+ dan1
262
+ dan2
263
+ dan3
264
+ dan4
265
+ dang1
266
+ dang2
267
+ dang3
268
+ dang4
269
+ dao1
270
+ dao2
271
+ dao3
272
+ dao4
273
+ de
274
+ de1
275
+ de2
276
+ dei3
277
+ den4
278
+ deng1
279
+ deng2
280
+ deng3
281
+ deng4
282
+ di1
283
+ di2
284
+ di3
285
+ di4
286
+ dia3
287
+ dian1
288
+ dian2
289
+ dian3
290
+ dian4
291
+ diao1
292
+ diao3
293
+ diao4
294
+ die1
295
+ die2
296
+ die4
297
+ ding1
298
+ ding2
299
+ ding3
300
+ ding4
301
+ diu1
302
+ dong1
303
+ dong3
304
+ dong4
305
+ dou1
306
+ dou2
307
+ dou3
308
+ dou4
309
+ du1
310
+ du2
311
+ du3
312
+ du4
313
+ duan1
314
+ duan2
315
+ duan3
316
+ duan4
317
+ dui1
318
+ dui4
319
+ dun1
320
+ dun3
321
+ dun4
322
+ duo1
323
+ duo2
324
+ duo3
325
+ duo4
326
+ e
327
+ e1
328
+ e2
329
+ e3
330
+ e4
331
+ ei2
332
+ en1
333
+ en4
334
+ er
335
+ er2
336
+ er3
337
+ er4
338
+ f
339
+ fa1
340
+ fa2
341
+ fa3
342
+ fa4
343
+ fan1
344
+ fan2
345
+ fan3
346
+ fan4
347
+ fang1
348
+ fang2
349
+ fang3
350
+ fang4
351
+ fei1
352
+ fei2
353
+ fei3
354
+ fei4
355
+ fen1
356
+ fen2
357
+ fen3
358
+ fen4
359
+ feng1
360
+ feng2
361
+ feng3
362
+ feng4
363
+ fo2
364
+ fou2
365
+ fou3
366
+ fu1
367
+ fu2
368
+ fu3
369
+ fu4
370
+ g
371
+ ga1
372
+ ga2
373
+ ga3
374
+ ga4
375
+ gai1
376
+ gai2
377
+ gai3
378
+ gai4
379
+ gan1
380
+ gan2
381
+ gan3
382
+ gan4
383
+ gang1
384
+ gang2
385
+ gang3
386
+ gang4
387
+ gao1
388
+ gao2
389
+ gao3
390
+ gao4
391
+ ge1
392
+ ge2
393
+ ge3
394
+ ge4
395
+ gei2
396
+ gei3
397
+ gen1
398
+ gen2
399
+ gen3
400
+ gen4
401
+ geng1
402
+ geng3
403
+ geng4
404
+ gong1
405
+ gong3
406
+ gong4
407
+ gou1
408
+ gou2
409
+ gou3
410
+ gou4
411
+ gu
412
+ gu1
413
+ gu2
414
+ gu3
415
+ gu4
416
+ gua1
417
+ gua2
418
+ gua3
419
+ gua4
420
+ guai1
421
+ guai2
422
+ guai3
423
+ guai4
424
+ guan1
425
+ guan2
426
+ guan3
427
+ guan4
428
+ guang1
429
+ guang2
430
+ guang3
431
+ guang4
432
+ gui1
433
+ gui2
434
+ gui3
435
+ gui4
436
+ gun3
437
+ gun4
438
+ guo1
439
+ guo2
440
+ guo3
441
+ guo4
442
+ h
443
+ ha1
444
+ ha2
445
+ ha3
446
+ hai1
447
+ hai2
448
+ hai3
449
+ hai4
450
+ han1
451
+ han2
452
+ han3
453
+ han4
454
+ hang1
455
+ hang2
456
+ hang4
457
+ hao1
458
+ hao2
459
+ hao3
460
+ hao4
461
+ he1
462
+ he2
463
+ he4
464
+ hei1
465
+ hen2
466
+ hen3
467
+ hen4
468
+ heng1
469
+ heng2
470
+ heng4
471
+ hong1
472
+ hong2
473
+ hong3
474
+ hong4
475
+ hou1
476
+ hou2
477
+ hou3
478
+ hou4
479
+ hu1
480
+ hu2
481
+ hu3
482
+ hu4
483
+ hua1
484
+ hua2
485
+ hua4
486
+ huai2
487
+ huai4
488
+ huan1
489
+ huan2
490
+ huan3
491
+ huan4
492
+ huang1
493
+ huang2
494
+ huang3
495
+ huang4
496
+ hui1
497
+ hui2
498
+ hui3
499
+ hui4
500
+ hun1
501
+ hun2
502
+ hun4
503
+ huo
504
+ huo1
505
+ huo2
506
+ huo3
507
+ huo4
508
+ i
509
+ j
510
+ ji1
511
+ ji2
512
+ ji3
513
+ ji4
514
+ jia
515
+ jia1
516
+ jia2
517
+ jia3
518
+ jia4
519
+ jian1
520
+ jian2
521
+ jian3
522
+ jian4
523
+ jiang1
524
+ jiang2
525
+ jiang3
526
+ jiang4
527
+ jiao1
528
+ jiao2
529
+ jiao3
530
+ jiao4
531
+ jie1
532
+ jie2
533
+ jie3
534
+ jie4
535
+ jin1
536
+ jin2
537
+ jin3
538
+ jin4
539
+ jing1
540
+ jing2
541
+ jing3
542
+ jing4
543
+ jiong3
544
+ jiu1
545
+ jiu2
546
+ jiu3
547
+ jiu4
548
+ ju1
549
+ ju2
550
+ ju3
551
+ ju4
552
+ juan1
553
+ juan2
554
+ juan3
555
+ juan4
556
+ jue1
557
+ jue2
558
+ jue4
559
+ jun1
560
+ jun4
561
+ k
562
+ ka1
563
+ ka2
564
+ ka3
565
+ kai1
566
+ kai2
567
+ kai3
568
+ kai4
569
+ kan1
570
+ kan2
571
+ kan3
572
+ kan4
573
+ kang1
574
+ kang2
575
+ kang4
576
+ kao1
577
+ kao2
578
+ kao3
579
+ kao4
580
+ ke1
581
+ ke2
582
+ ke3
583
+ ke4
584
+ ken3
585
+ keng1
586
+ kong1
587
+ kong3
588
+ kong4
589
+ kou1
590
+ kou2
591
+ kou3
592
+ kou4
593
+ ku1
594
+ ku2
595
+ ku3
596
+ ku4
597
+ kua1
598
+ kua3
599
+ kua4
600
+ kuai3
601
+ kuai4
602
+ kuan1
603
+ kuan2
604
+ kuan3
605
+ kuang1
606
+ kuang2
607
+ kuang4
608
+ kui1
609
+ kui2
610
+ kui3
611
+ kui4
612
+ kun1
613
+ kun3
614
+ kun4
615
+ kuo4
616
+ l
617
+ la
618
+ la1
619
+ la2
620
+ la3
621
+ la4
622
+ lai2
623
+ lai4
624
+ lan2
625
+ lan3
626
+ lan4
627
+ lang1
628
+ lang2
629
+ lang3
630
+ lang4
631
+ lao1
632
+ lao2
633
+ lao3
634
+ lao4
635
+ le
636
+ le1
637
+ le4
638
+ lei
639
+ lei1
640
+ lei2
641
+ lei3
642
+ lei4
643
+ leng1
644
+ leng2
645
+ leng3
646
+ leng4
647
+ li
648
+ li1
649
+ li2
650
+ li3
651
+ li4
652
+ lia3
653
+ lian2
654
+ lian3
655
+ lian4
656
+ liang2
657
+ liang3
658
+ liang4
659
+ liao1
660
+ liao2
661
+ liao3
662
+ liao4
663
+ lie1
664
+ lie2
665
+ lie3
666
+ lie4
667
+ lin1
668
+ lin2
669
+ lin3
670
+ lin4
671
+ ling2
672
+ ling3
673
+ ling4
674
+ liu1
675
+ liu2
676
+ liu3
677
+ liu4
678
+ long1
679
+ long2
680
+ long3
681
+ long4
682
+ lou1
683
+ lou2
684
+ lou3
685
+ lou4
686
+ lu1
687
+ lu2
688
+ lu3
689
+ lu4
690
+ luan2
691
+ luan3
692
+ luan4
693
+ lun1
694
+ lun2
695
+ lun4
696
+ luo1
697
+ luo2
698
+ luo3
699
+ luo4
700
+ lv2
701
+ lv3
702
+ lv4
703
+ lve3
704
+ lve4
705
+ m
706
+ ma
707
+ ma1
708
+ ma2
709
+ ma3
710
+ ma4
711
+ mai2
712
+ mai3
713
+ mai4
714
+ man1
715
+ man2
716
+ man3
717
+ man4
718
+ mang2
719
+ mang3
720
+ mao1
721
+ mao2
722
+ mao3
723
+ mao4
724
+ me
725
+ mei2
726
+ mei3
727
+ mei4
728
+ men
729
+ men1
730
+ men2
731
+ men4
732
+ meng
733
+ meng1
734
+ meng2
735
+ meng3
736
+ meng4
737
+ mi1
738
+ mi2
739
+ mi3
740
+ mi4
741
+ mian2
742
+ mian3
743
+ mian4
744
+ miao1
745
+ miao2
746
+ miao3
747
+ miao4
748
+ mie1
749
+ mie4
750
+ min2
751
+ min3
752
+ ming2
753
+ ming3
754
+ ming4
755
+ miu4
756
+ mo1
757
+ mo2
758
+ mo3
759
+ mo4
760
+ mou1
761
+ mou2
762
+ mou3
763
+ mu2
764
+ mu3
765
+ mu4
766
+ n
767
+ n2
768
+ na1
769
+ na2
770
+ na3
771
+ na4
772
+ nai2
773
+ nai3
774
+ nai4
775
+ nan1
776
+ nan2
777
+ nan3
778
+ nan4
779
+ nang1
780
+ nang2
781
+ nang3
782
+ nao1
783
+ nao2
784
+ nao3
785
+ nao4
786
+ ne
787
+ ne2
788
+ ne4
789
+ nei3
790
+ nei4
791
+ nen4
792
+ neng2
793
+ ni1
794
+ ni2
795
+ ni3
796
+ ni4
797
+ nian1
798
+ nian2
799
+ nian3
800
+ nian4
801
+ niang2
802
+ niang4
803
+ niao2
804
+ niao3
805
+ niao4
806
+ nie1
807
+ nie4
808
+ nin2
809
+ ning2
810
+ ning3
811
+ ning4
812
+ niu1
813
+ niu2
814
+ niu3
815
+ niu4
816
+ nong2
817
+ nong4
818
+ nou4
819
+ nu2
820
+ nu3
821
+ nu4
822
+ nuan3
823
+ nuo2
824
+ nuo4
825
+ nv2
826
+ nv3
827
+ nve4
828
+ o
829
+ o1
830
+ o2
831
+ ou1
832
+ ou2
833
+ ou3
834
+ ou4
835
+ p
836
+ pa1
837
+ pa2
838
+ pa4
839
+ pai1
840
+ pai2
841
+ pai3
842
+ pai4
843
+ pan1
844
+ pan2
845
+ pan4
846
+ pang1
847
+ pang2
848
+ pang4
849
+ pao1
850
+ pao2
851
+ pao3
852
+ pao4
853
+ pei1
854
+ pei2
855
+ pei4
856
+ pen1
857
+ pen2
858
+ pen4
859
+ peng1
860
+ peng2
861
+ peng3
862
+ peng4
863
+ pi1
864
+ pi2
865
+ pi3
866
+ pi4
867
+ pian1
868
+ pian2
869
+ pian4
870
+ piao1
871
+ piao2
872
+ piao3
873
+ piao4
874
+ pie1
875
+ pie2
876
+ pie3
877
+ pin1
878
+ pin2
879
+ pin3
880
+ pin4
881
+ ping1
882
+ ping2
883
+ po1
884
+ po2
885
+ po3
886
+ po4
887
+ pou1
888
+ pu1
889
+ pu2
890
+ pu3
891
+ pu4
892
+ q
893
+ qi1
894
+ qi2
895
+ qi3
896
+ qi4
897
+ qia1
898
+ qia3
899
+ qia4
900
+ qian1
901
+ qian2
902
+ qian3
903
+ qian4
904
+ qiang1
905
+ qiang2
906
+ qiang3
907
+ qiang4
908
+ qiao1
909
+ qiao2
910
+ qiao3
911
+ qiao4
912
+ qie1
913
+ qie2
914
+ qie3
915
+ qie4
916
+ qin1
917
+ qin2
918
+ qin3
919
+ qin4
920
+ qing1
921
+ qing2
922
+ qing3
923
+ qing4
924
+ qiong1
925
+ qiong2
926
+ qiu1
927
+ qiu2
928
+ qiu3
929
+ qu1
930
+ qu2
931
+ qu3
932
+ qu4
933
+ quan1
934
+ quan2
935
+ quan3
936
+ quan4
937
+ que1
938
+ que2
939
+ que4
940
+ qun2
941
+ r
942
+ ran2
943
+ ran3
944
+ rang1
945
+ rang2
946
+ rang3
947
+ rang4
948
+ rao2
949
+ rao3
950
+ rao4
951
+ re2
952
+ re3
953
+ re4
954
+ ren2
955
+ ren3
956
+ ren4
957
+ reng1
958
+ reng2
959
+ ri4
960
+ rong1
961
+ rong2
962
+ rong3
963
+ rou2
964
+ rou4
965
+ ru2
966
+ ru3
967
+ ru4
968
+ ruan2
969
+ ruan3
970
+ rui3
971
+ rui4
972
+ run4
973
+ ruo4
974
+ s
975
+ sa1
976
+ sa2
977
+ sa3
978
+ sa4
979
+ sai1
980
+ sai4
981
+ san1
982
+ san2
983
+ san3
984
+ san4
985
+ sang1
986
+ sang3
987
+ sang4
988
+ sao1
989
+ sao2
990
+ sao3
991
+ sao4
992
+ se4
993
+ sen1
994
+ seng1
995
+ sha1
996
+ sha2
997
+ sha3
998
+ sha4
999
+ shai1
1000
+ shai2
1001
+ shai3
1002
+ shai4
1003
+ shan1
1004
+ shan3
1005
+ shan4
1006
+ shang
1007
+ shang1
1008
+ shang3
1009
+ shang4
1010
+ shao1
1011
+ shao2
1012
+ shao3
1013
+ shao4
1014
+ she1
1015
+ she2
1016
+ she3
1017
+ she4
1018
+ shei2
1019
+ shen1
1020
+ shen2
1021
+ shen3
1022
+ shen4
1023
+ sheng1
1024
+ sheng2
1025
+ sheng3
1026
+ sheng4
1027
+ shi
1028
+ shi1
1029
+ shi2
1030
+ shi3
1031
+ shi4
1032
+ shou1
1033
+ shou2
1034
+ shou3
1035
+ shou4
1036
+ shu1
1037
+ shu2
1038
+ shu3
1039
+ shu4
1040
+ shua1
1041
+ shua2
1042
+ shua3
1043
+ shua4
1044
+ shuai1
1045
+ shuai3
1046
+ shuai4
1047
+ shuan1
1048
+ shuan4
1049
+ shuang1
1050
+ shuang3
1051
+ shui2
1052
+ shui3
1053
+ shui4
1054
+ shun3
1055
+ shun4
1056
+ shuo1
1057
+ shuo4
1058
+ si1
1059
+ si2
1060
+ si3
1061
+ si4
1062
+ song1
1063
+ song3
1064
+ song4
1065
+ sou1
1066
+ sou3
1067
+ sou4
1068
+ su1
1069
+ su2
1070
+ su4
1071
+ suan1
1072
+ suan4
1073
+ sui1
1074
+ sui2
1075
+ sui3
1076
+ sui4
1077
+ sun1
1078
+ sun3
1079
+ suo
1080
+ suo1
1081
+ suo2
1082
+ suo3
1083
+ t
1084
+ ta1
1085
+ ta2
1086
+ ta3
1087
+ ta4
1088
+ tai1
1089
+ tai2
1090
+ tai4
1091
+ tan1
1092
+ tan2
1093
+ tan3
1094
+ tan4
1095
+ tang1
1096
+ tang2
1097
+ tang3
1098
+ tang4
1099
+ tao1
1100
+ tao2
1101
+ tao3
1102
+ tao4
1103
+ te4
1104
+ teng2
1105
+ ti1
1106
+ ti2
1107
+ ti3
1108
+ ti4
1109
+ tian1
1110
+ tian2
1111
+ tian3
1112
+ tiao1
1113
+ tiao2
1114
+ tiao3
1115
+ tiao4
1116
+ tie1
1117
+ tie2
1118
+ tie3
1119
+ tie4
1120
+ ting1
1121
+ ting2
1122
+ ting3
1123
+ tong1
1124
+ tong2
1125
+ tong3
1126
+ tong4
1127
+ tou
1128
+ tou1
1129
+ tou2
1130
+ tou4
1131
+ tu1
1132
+ tu2
1133
+ tu3
1134
+ tu4
1135
+ tuan1
1136
+ tuan2
1137
+ tui1
1138
+ tui2
1139
+ tui3
1140
+ tui4
1141
+ tun1
1142
+ tun2
1143
+ tun4
1144
+ tuo1
1145
+ tuo2
1146
+ tuo3
1147
+ tuo4
1148
+ u
1149
+ v
1150
+ w
1151
+ wa
1152
+ wa1
1153
+ wa2
1154
+ wa3
1155
+ wa4
1156
+ wai1
1157
+ wai3
1158
+ wai4
1159
+ wan1
1160
+ wan2
1161
+ wan3
1162
+ wan4
1163
+ wang1
1164
+ wang2
1165
+ wang3
1166
+ wang4
1167
+ wei1
1168
+ wei2
1169
+ wei3
1170
+ wei4
1171
+ wen1
1172
+ wen2
1173
+ wen3
1174
+ wen4
1175
+ weng1
1176
+ weng4
1177
+ wo1
1178
+ wo2
1179
+ wo3
1180
+ wo4
1181
+ wu1
1182
+ wu2
1183
+ wu3
1184
+ wu4
1185
+ x
1186
+ xi1
1187
+ xi2
1188
+ xi3
1189
+ xi4
1190
+ xia1
1191
+ xia2
1192
+ xia4
1193
+ xian1
1194
+ xian2
1195
+ xian3
1196
+ xian4
1197
+ xiang1
1198
+ xiang2
1199
+ xiang3
1200
+ xiang4
1201
+ xiao1
1202
+ xiao2
1203
+ xiao3
1204
+ xiao4
1205
+ xie1
1206
+ xie2
1207
+ xie3
1208
+ xie4
1209
+ xin1
1210
+ xin2
1211
+ xin4
1212
+ xing1
1213
+ xing2
1214
+ xing3
1215
+ xing4
1216
+ xiong1
1217
+ xiong2
1218
+ xiu1
1219
+ xiu3
1220
+ xiu4
1221
+ xu
1222
+ xu1
1223
+ xu2
1224
+ xu3
1225
+ xu4
1226
+ xuan1
1227
+ xuan2
1228
+ xuan3
1229
+ xuan4
1230
+ xue1
1231
+ xue2
1232
+ xue3
1233
+ xue4
1234
+ xun1
1235
+ xun2
1236
+ xun4
1237
+ y
1238
+ ya
1239
+ ya1
1240
+ ya2
1241
+ ya3
1242
+ ya4
1243
+ yan1
1244
+ yan2
1245
+ yan3
1246
+ yan4
1247
+ yang1
1248
+ yang2
1249
+ yang3
1250
+ yang4
1251
+ yao1
1252
+ yao2
1253
+ yao3
1254
+ yao4
1255
+ ye1
1256
+ ye2
1257
+ ye3
1258
+ ye4
1259
+ yi
1260
+ yi1
1261
+ yi2
1262
+ yi3
1263
+ yi4
1264
+ yin1
1265
+ yin2
1266
+ yin3
1267
+ yin4
1268
+ ying1
1269
+ ying2
1270
+ ying3
1271
+ ying4
1272
+ yo1
1273
+ yong1
1274
+ yong2
1275
+ yong3
1276
+ yong4
1277
+ you1
1278
+ you2
1279
+ you3
1280
+ you4
1281
+ yu1
1282
+ yu2
1283
+ yu3
1284
+ yu4
1285
+ yuan1
1286
+ yuan2
1287
+ yuan3
1288
+ yuan4
1289
+ yue1
1290
+ yue4
1291
+ yun1
1292
+ yun2
1293
+ yun3
1294
+ yun4
1295
+ z
1296
+ za1
1297
+ za2
1298
+ za3
1299
+ zai1
1300
+ zai3
1301
+ zai4
1302
+ zan1
1303
+ zan2
1304
+ zan3
1305
+ zan4
1306
+ zang1
1307
+ zang4
1308
+ zao1
1309
+ zao2
1310
+ zao3
1311
+ zao4
1312
+ ze2
1313
+ ze4
1314
+ zei2
1315
+ zen3
1316
+ zeng1
1317
+ zeng4
1318
+ zha1
1319
+ zha2
1320
+ zha3
1321
+ zha4
1322
+ zhai1
1323
+ zhai2
1324
+ zhai3
1325
+ zhai4
1326
+ zhan1
1327
+ zhan2
1328
+ zhan3
1329
+ zhan4
1330
+ zhang1
1331
+ zhang2
1332
+ zhang3
1333
+ zhang4
1334
+ zhao1
1335
+ zhao2
1336
+ zhao3
1337
+ zhao4
1338
+ zhe
1339
+ zhe1
1340
+ zhe2
1341
+ zhe3
1342
+ zhe4
1343
+ zhen1
1344
+ zhen2
1345
+ zhen3
1346
+ zhen4
1347
+ zheng1
1348
+ zheng2
1349
+ zheng3
1350
+ zheng4
1351
+ zhi1
1352
+ zhi2
1353
+ zhi3
1354
+ zhi4
1355
+ zhong1
1356
+ zhong2
1357
+ zhong3
1358
+ zhong4
1359
+ zhou1
1360
+ zhou2
1361
+ zhou3
1362
+ zhou4
1363
+ zhu1
1364
+ zhu2
1365
+ zhu3
1366
+ zhu4
1367
+ zhua1
1368
+ zhua2
1369
+ zhua3
1370
+ zhuai1
1371
+ zhuai3
1372
+ zhuai4
1373
+ zhuan1
1374
+ zhuan2
1375
+ zhuan3
1376
+ zhuan4
1377
+ zhuang1
1378
+ zhuang4
1379
+ zhui1
1380
+ zhui4
1381
+ zhun1
1382
+ zhun2
1383
+ zhun3
1384
+ zhuo1
1385
+ zhuo2
1386
+ zi
1387
+ zi1
1388
+ zi2
1389
+ zi3
1390
+ zi4
1391
+ zong1
1392
+ zong2
1393
+ zong3
1394
+ zong4
1395
+ zou1
1396
+ zou2
1397
+ zou3
1398
+ zou4
1399
+ zu1
1400
+ zu2
1401
+ zu3
1402
+ zuan1
1403
+ zuan3
1404
+ zuan4
1405
+ zui2
1406
+ zui3
1407
+ zui4
1408
+ zun1
1409
+ zuo
1410
+ zuo1
1411
+ zuo2
1412
+ zuo3
1413
+ zuo4
1414
+ {
1415
+ ~
1416
+ ¡
1417
+ ¢
1418
+ £
1419
+ ¥
1420
+ §
1421
+ ¨
1422
+ ©
1423
+ «
1424
+ ®
1425
+ ¯
1426
+ °
1427
+ ±
1428
+ ²
1429
+ ³
1430
+ ´
1431
+ µ
1432
+ ·
1433
+ ¹
1434
+ º
1435
+ »
1436
+ ¼
1437
+ ½
1438
+ ¾
1439
+ ¿
1440
+ À
1441
+ Á
1442
+ Â
1443
+ Ã
1444
+ Ä
1445
+ Å
1446
+ Æ
1447
+ Ç
1448
+ È
1449
+ É
1450
+ Ê
1451
+ Í
1452
+ Î
1453
+ Ñ
1454
+ Ó
1455
+ Ö
1456
+ ×
1457
+ Ø
1458
+ Ú
1459
+ Ü
1460
+ Ý
1461
+ Þ
1462
+ ß
1463
+ à
1464
+ á
1465
+ â
1466
+ ã
1467
+ ä
1468
+ å
1469
+ æ
1470
+ ç
1471
+ è
1472
+ é
1473
+ ê
1474
+ ë
1475
+ ì
1476
+ í
1477
+ î
1478
+ ï
1479
+ ð
1480
+ ñ
1481
+ ò
1482
+ ó
1483
+ ô
1484
+ õ
1485
+ ö
1486
+ ø
1487
+ ù
1488
+ ú
1489
+ û
1490
+ ü
1491
+ ý
1492
+ Ā
1493
+ ā
1494
+ ă
1495
+ ą
1496
+ ć
1497
+ Č
1498
+ č
1499
+ Đ
1500
+ đ
1501
+ ē
1502
+ ė
1503
+ ę
1504
+ ě
1505
+ ĝ
1506
+ ğ
1507
+ ħ
1508
+ ī
1509
+ į
1510
+ İ
1511
+ ı
1512
+ Ł
1513
+ ł
1514
+ ń
1515
+ ņ
1516
+ ň
1517
+ ŋ
1518
+ Ō
1519
+ ō
1520
+ ő
1521
+ œ
1522
+ ř
1523
+ Ś
1524
+ ś
1525
+ Ş
1526
+ ş
1527
+ Š
1528
+ š
1529
+ Ť
1530
+ ť
1531
+ ũ
1532
+ ū
1533
+ ź
1534
+ Ż
1535
+ ż
1536
+ Ž
1537
+ ž
1538
+ ơ
1539
+ ư
1540
+ ǎ
1541
+ ǐ
1542
+ ǒ
1543
+ ǔ
1544
+ ǚ
1545
+ ș
1546
+ ț
1547
+ ɑ
1548
+ ɔ
1549
+ ɕ
1550
+ ə
1551
+ ɛ
1552
+ ɜ
1553
+ ɡ
1554
+ ɣ
1555
+ ɪ
1556
+ ɫ
1557
+ ɴ
1558
+ ɹ
1559
+ ɾ
1560
+ ʃ
1561
+ ʊ
1562
+ ʌ
1563
+ ʒ
1564
+ ʔ
1565
+ ʰ
1566
+ ʷ
1567
+ ʻ
1568
+ ʾ
1569
+ ʿ
1570
+ ˈ
1571
+ ː
1572
+ ˙
1573
+ ˜
1574
+ ˢ
1575
+ ́
1576
+ ̅
1577
+ Α
1578
+ Β
1579
+ Δ
1580
+ Ε
1581
+ Θ
1582
+ Κ
1583
+ Λ
1584
+ Μ
1585
+ Ξ
1586
+ Π
1587
+ Σ
1588
+ Τ
1589
+ Φ
1590
+ Χ
1591
+ Ψ
1592
+ Ω
1593
+ ά
1594
+ έ
1595
+ ή
1596
+ ί
1597
+ α
1598
+ β
1599
+ γ
1600
+ δ
1601
+ ε
1602
+ ζ
1603
+ η
1604
+ θ
1605
+ ι
1606
+ κ
1607
+ λ
1608
+ μ
1609
+ ν
1610
+ ξ
1611
+ ο
1612
+ π
1613
+ ρ
1614
+ ς
1615
+ σ
1616
+ τ
1617
+ υ
1618
+ φ
1619
+ χ
1620
+ ψ
1621
+ ω
1622
+ ϊ
1623
+ ό
1624
+ ύ
1625
+ ώ
1626
+ ϕ
1627
+ ϵ
1628
+ Ё
1629
+ А
1630
+ Б
1631
+ В
1632
+ Г
1633
+ Д
1634
+ Е
1635
+ Ж
1636
+ З
1637
+ И
1638
+ Й
1639
+ К
1640
+ Л
1641
+ М
1642
+ Н
1643
+ О
1644
+ П
1645
+ Р
1646
+ С
1647
+ Т
1648
+ У
1649
+ Ф
1650
+ Х
1651
+ Ц
1652
+ Ч
1653
+ Ш
1654
+ Щ
1655
+ Ы
1656
+ Ь
1657
+ Э
1658
+ Ю
1659
+ Я
1660
+ а
1661
+ б
1662
+ в
1663
+ г
1664
+ д
1665
+ е
1666
+ ж
1667
+ з
1668
+ и
1669
+ й
1670
+ к
1671
+ л
1672
+ м
1673
+ н
1674
+ о
1675
+ п
1676
+ р
1677
+ с
1678
+ т
1679
+ у
1680
+ ф
1681
+ х
1682
+ ц
1683
+ ч
1684
+ ш
1685
+ щ
1686
+ ъ
1687
+ ы
1688
+ ь
1689
+ э
1690
+ ю
1691
+ я
1692
+ ё
1693
+ і
1694
+ ְ
1695
+ ִ
1696
+ ֵ
1697
+ ֶ
1698
+ ַ
1699
+ ָ
1700
+ ֹ
1701
+ ּ
1702
+ ־
1703
+ ׁ
1704
+ א
1705
+ ב
1706
+ ג
1707
+ ד
1708
+ ה
1709
+ ו
1710
+ ז
1711
+ ח
1712
+ ט
1713
+ י
1714
+ כ
1715
+ ל
1716
+ ם
1717
+ מ
1718
+ ן
1719
+ נ
1720
+ ס
1721
+ ע
1722
+ פ
1723
+ ק
1724
+ ר
1725
+ ש
1726
+ ת
1727
+ أ
1728
+ ب
1729
+ ة
1730
+ ت
1731
+ ج
1732
+ ح
1733
+ د
1734
+ ر
1735
+ ز
1736
+ س
1737
+ ص
1738
+ ط
1739
+ ع
1740
+ ق
1741
+ ك
1742
+ ل
1743
+ م
1744
+ ن
1745
+ ه
1746
+ و
1747
+ ي
1748
+ َ
1749
+ ُ
1750
+ ِ
1751
+ ْ
1752
+
1753
+
1754
+
1755
+
1756
+
1757
+
1758
+
1759
+
1760
+
1761
+
1762
+
1763
+
1764
+
1765
+
1766
+
1767
+
1768
+
1769
+
1770
+
1771
+
1772
+
1773
+
1774
+
1775
+
1776
+
1777
+
1778
+
1779
+
1780
+
1781
+
1782
+
1783
+
1784
+
1785
+
1786
+
1787
+
1788
+
1789
+
1790
+
1791
+
1792
+
1793
+
1794
+
1795
+
1796
+
1797
+
1798
+
1799
+
1800
+ ế
1801
+
1802
+
1803
+
1804
+
1805
+
1806
+
1807
+
1808
+
1809
+
1810
+
1811
+
1812
+
1813
+
1814
+
1815
+
1816
+
1817
+
1818
+
1819
+
1820
+
1821
+
1822
+
1823
+
1824
+
1825
+
1826
+
1827
+
1828
+
1829
+
1830
+
1831
+
1832
+
1833
+
1834
+
1835
+
1836
+
1837
+
1838
+
1839
+
1840
+
1841
+
1842
+
1843
+
1844
+
1845
+
1846
+
1847
+
1848
+
1849
+
1850
+
1851
+
1852
+
1853
+
1854
+
1855
+
1856
+
1857
+
1858
+
1859
+
1860
+
1861
+
1862
+
1863
+
1864
+
1865
+
1866
+
1867
+
1868
+
1869
+
1870
+
1871
+
1872
+
1873
+
1874
+
1875
+
1876
+
1877
+
1878
+
1879
+
1880
+
1881
+
1882
+
1883
+
1884
+
1885
+
1886
+
1887
+
1888
+
1889
+
1890
+
1891
+
1892
+
1893
+
1894
+
1895
+
1896
+
1897
+
1898
+
1899
+
1900
+
1901
+
1902
+
1903
+
1904
+
1905
+
1906
+
1907
+
1908
+
1909
+
1910
+
1911
+
1912
+
1913
+
1914
+
1915
+
1916
+
1917
+
1918
+
1919
+
1920
+
1921
+
1922
+
1923
+
1924
+
1925
+
1926
+
1927
+
1928
+
1929
+
1930
+
1931
+
1932
+
1933
+
1934
+
1935
+
1936
+
1937
+
1938
+
1939
+
1940
+
1941
+
1942
+
1943
+
1944
+
1945
+
1946
+
1947
+
1948
+
1949
+
1950
+
1951
+
1952
+
1953
+
1954
+
1955
+
1956
+
1957
+
1958
+
1959
+
1960
+
1961
+
1962
+
1963
+
1964
+
1965
+
1966
+
1967
+
1968
+
1969
+
1970
+
1971
+
1972
+
1973
+
1974
+
1975
+
1976
+
1977
+
1978
+
1979
+
1980
+
1981
+
1982
+
1983
+
1984
+
1985
+
1986
+
1987
+
1988
+
1989
+
1990
+
1991
+
1992
+
1993
+
1994
+
1995
+
1996
+
1997
+
1998
+
1999
+
2000
+
2001
+
2002
+
2003
+
2004
+
2005
+
2006
+
2007
+
2008
+
2009
+
2010
+
2011
+
2012
+
2013
+
2014
+
2015
+
2016
+
2017
+
2018
+
2019
+
2020
+
2021
+
2022
+
2023
+
2024
+
2025
+
2026
+
2027
+
2028
+
2029
+
2030
+
2031
+
2032
+
2033
+
2034
+
2035
+
2036
+
2037
+
2038
+
2039
+
2040
+
2041
+
2042
+
2043
+
2044
+
2045
+
2046
+
2047
+
2048
+
2049
+
2050
+
2051
+
2052
+
2053
+
2054
+
2055
+
2056
+
2057
+
2058
+
2059
+
2060
+
2061
+
2062
+
2063
+
2064
+
2065
+
2066
+
2067
+
2068
+
2069
+
2070
+
2071
+
2072
+
2073
+
2074
+
2075
+
2076
+
2077
+
2078
+
2079
+
2080
+
2081
+
2082
+
2083
+
2084
+
2085
+
2086
+
2087
+
2088
+
2089
+
2090
+
2091
+
2092
+
2093
+
2094
+
2095
+
2096
+
2097
+
2098
+
2099
+
2100
+
2101
+
2102
+
2103
+
2104
+
2105
+
2106
+
2107
+
2108
+
2109
+
2110
+
2111
+
2112
+
2113
+
2114
+
2115
+
2116
+
2117
+
2118
+
2119
+
2120
+
2121
+
2122
+
2123
+
2124
+
2125
+
2126
+
2127
+
2128
+
2129
+
2130
+
2131
+
2132
+
2133
+
2134
+
2135
+
2136
+
2137
+
2138
+
2139
+
2140
+
2141
+
2142
+
2143
+
2144
+
2145
+
2146
+
2147
+
2148
+
2149
+
2150
+
2151
+
2152
+
2153
+
2154
+
2155
+
2156
+
2157
+
2158
+
2159
+
2160
+
2161
+
2162
+
2163
+
2164
+
2165
+
2166
+
2167
+
2168
+
2169
+
2170
+
2171
+
2172
+
2173
+
2174
+
2175
+
2176
+
2177
+
2178
+
2179
+
2180
+
2181
+
2182
+
2183
+
2184
+
2185
+
2186
+
2187
+
2188
+
2189
+
2190
+
2191
+
2192
+
2193
+
2194
+
2195
+
2196
+
2197
+
2198
+
2199
+
2200
+
2201
+
2202
+
2203
+
2204
+
2205
+
2206
+
2207
+
2208
+
2209
+
2210
+
2211
+
2212
+
2213
+
2214
+
2215
+
2216
+
2217
+
2218
+
2219
+
2220
+
2221
+
2222
+
2223
+
2224
+
2225
+
2226
+
2227
+
2228
+
2229
+
2230
+
2231
+
2232
+
2233
+
2234
+
2235
+
2236
+
2237
+
2238
+
2239
+
2240
+
2241
+
2242
+
2243
+
2244
+
2245
+
2246
+
2247
+
2248
+
2249
+
2250
+
2251
+
2252
+
2253
+
2254
+
2255
+
2256
+
2257
+
2258
+
2259
+
2260
+
2261
+
2262
+
2263
+
2264
+
2265
+
2266
+
2267
+
2268
+
2269
+
2270
+
2271
+
2272
+
2273
+
2274
+
2275
+
2276
+
2277
+
2278
+
2279
+
2280
+
2281
+
2282
+
2283
+
2284
+
2285
+
2286
+
2287
+
2288
+
2289
+
2290
+
2291
+
2292
+
2293
+
2294
+
2295
+
2296
+
2297
+
2298
+
2299
+
2300
+
2301
+
2302
+
2303
+
2304
+
2305
+
2306
+
2307
+
2308
+
2309
+
2310
+
2311
+
2312
+
2313
+
2314
+
2315
+
2316
+
2317
+
2318
+
2319
+
2320
+
2321
+
2322
+
2323
+
2324
+
2325
+
2326
+
2327
+
2328
+
2329
+
2330
+
2331
+
2332
+
2333
+
2334
+
2335
+
2336
+
2337
+
2338
+
2339
+
2340
+
2341
+
2342
+
2343
+
2344
+
2345
+
2346
+
2347
+
2348
+
2349
+
2350
+
2351
+
2352
+
2353
+
2354
+
2355
+
2356
+
2357
+
2358
+
2359
+
2360
+
2361
+
2362
+
2363
+
2364
+
2365
+
2366
+
2367
+
2368
+
2369
+
2370
+
2371
+
2372
+
2373
+
2374
+
2375
+
2376
+
2377
+
2378
+
2379
+
2380
+
2381
+
2382
+
2383
+
2384
+
2385
+
2386
+
2387
+
2388
+
2389
+
2390
+
2391
+
2392
+
2393
+
2394
+
2395
+
2396
+
2397
+
2398
+
2399
+
2400
+
2401
+
2402
+
2403
+
2404
+
2405
+
2406
+
2407
+
2408
+
2409
+
2410
+
2411
+
2412
+
2413
+
2414
+
2415
+
2416
+
2417
+
2418
+
2419
+
2420
+
2421
+
2422
+
2423
+
2424
+
2425
+
2426
+
2427
+
2428
+
2429
+
2430
+
2431
+
2432
+
2433
+
2434
+
2435
+
2436
+
2437
+
2438
+
2439
+
2440
+
2441
+
2442
+
2443
+
2444
+
2445
+
2446
+
2447
+
2448
+
2449
+
2450
+
2451
+
2452
+
2453
+
2454
+
2455
+
2456
+
2457
+
2458
+
2459
+
2460
+
2461
+
2462
+
2463
+
2464
+
2465
+
2466
+
2467
+
2468
+
2469
+
2470
+
2471
+
2472
+
2473
+
2474
+
2475
+
2476
+
2477
+
2478
+
2479
+
2480
+
2481
+
2482
+
2483
+
2484
+
2485
+
2486
+
2487
+
2488
+
2489
+
2490
+
2491
+
2492
+
2493
+
2494
+
2495
+
2496
+
2497
+
2498
+
2499
+
2500
+
2501
+
2502
+
2503
+
2504
+
2505
+
2506
+
2507
+
2508
+
2509
+
2510
+
2511
+
2512
+
2513
+
2514
+
2515
+
2516
+
2517
+
2518
+
2519
+
2520
+
2521
+
2522
+
2523
+
2524
+
2525
+
2526
+
2527
+
2528
+
2529
+
2530
+
2531
+
2532
+
2533
+
2534
+
2535
+
2536
+
2537
+
2538
+
2539
+
2540
+
2541
+
2542
+
2543
+
2544
+
2545
+ 𠮶
data/librispeech_pc_test_clean_cross_sentence.lst CHANGED
The diff for this file is too large to render. See raw diff
 
pyproject.toml ADDED
@@ -0,0 +1,59 @@
+ [build-system]
+ requires = ["setuptools >= 61.0", "setuptools-scm>=8.0"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "f5-tts"
+ dynamic = ["version"]
+ description = "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
+ readme = "README.md"
+ license = {text = "MIT License"}
+ classifiers = [
+     "License :: OSI Approved :: MIT License",
+     "Operating System :: OS Independent",
+     "Programming Language :: Python :: 3",
+ ]
+ dependencies = [
+     "accelerate>=0.33.0",
+     "bitsandbytes>0.37.0",
+     "cached_path",
+     "click",
+     "datasets",
+     "ema_pytorch>=0.5.2",
+     "gradio>=3.45.2",
+     "jieba",
+     "librosa",
+     "matplotlib",
+     "numpy<=1.26.4",
+     "pydub",
+     "pypinyin",
+     "safetensors",
+     "soundfile",
+     "tomli",
+     "torch>=2.0.0",
+     "torchaudio>=2.0.0",
+     "torchdiffeq",
+     "tqdm>=4.65.0",
+     "transformers",
+     "transformers_stream_generator",
+     "vocos",
+     "wandb",
+     "x_transformers>=1.31.14",
+ ]
+
+ [project.optional-dependencies]
+ eval = [
+     "faster_whisper==0.10.1",
+     "funasr",
+     "jiwer",
+     "modelscope",
+     "zhconv",
+     "zhon",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/SWivid/F5-TTS"
+
+ [project.scripts]
+ "f5-tts_infer-cli" = "f5_tts.infer.infer_cli:main"
+ "f5-tts_infer-gradio" = "f5_tts.infer.infer_gradio:main"
src/f5_tts/api.py ADDED
@@ -0,0 +1,138 @@
+ import random
+ import sys
+ import tqdm
+ from importlib.resources import files
+
+ import soundfile as sf
+ import torch
+ from cached_path import cached_path
+
+ from f5_tts.model import DiT, UNetT
+ from f5_tts.model.utils import seed_everything
+ from f5_tts.infer.utils_infer import (
+     load_vocoder,
+     load_model,
+     infer_process,
+     remove_silence_for_generated_wav,
+     save_spectrogram,
+ )
+
+
+ class F5TTS:
+     def __init__(
+         self,
+         model_type="F5-TTS",
+         ckpt_file="",
+         vocab_file="",
+         ode_method="euler",
+         use_ema=True,
+         local_path=None,
+         device=None,
+     ):
+         # Initialize parameters
+         self.final_wave = None
+         self.target_sample_rate = 24000
+         self.n_mel_channels = 100
+         self.hop_length = 256
+         self.target_rms = 0.1
+         self.seed = -1
+
+         # Set device
+         self.device = device or (
+             "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+         )
+
+         # Load models
+         self.load_vocoder_model(local_path)
+         self.load_ema_model(model_type, ckpt_file, vocab_file, ode_method, use_ema)
+
+     def load_vocoder_model(self, local_path):
+         self.vocos = load_vocoder(local_path is not None, local_path, self.device)
+
+     def load_ema_model(self, model_type, ckpt_file, vocab_file, ode_method, use_ema):
+         if model_type == "F5-TTS":
+             if not ckpt_file:
+                 ckpt_file = str(cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors"))
+             model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+             model_cls = DiT
+         elif model_type == "E2-TTS":
+             if not ckpt_file:
+                 ckpt_file = str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.safetensors"))
+             model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
+             model_cls = UNetT
+         else:
+             raise ValueError(f"Unknown model type: {model_type}")
+
+         self.ema_model = load_model(model_cls, model_cfg, ckpt_file, vocab_file, ode_method, use_ema, self.device)
+
+     def export_wav(self, wav, file_wave, remove_silence=False):
+         sf.write(file_wave, wav, self.target_sample_rate)
+
+         if remove_silence:
+             remove_silence_for_generated_wav(file_wave)
+
+     def export_spectrogram(self, spect, file_spect):
+         save_spectrogram(spect, file_spect)
+
+     def infer(
+         self,
+         ref_file,
+         ref_text,
+         gen_text,
+         show_info=print,
+         progress=tqdm,
+         target_rms=0.1,
+         cross_fade_duration=0.15,
+         sway_sampling_coef=-1,
+         cfg_strength=2,
+         nfe_step=32,
+         speed=1.0,
+         fix_duration=None,
+         remove_silence=False,
+         file_wave=None,
+         file_spect=None,
+         seed=-1,
+     ):
+         if seed == -1:
+             seed = random.randint(0, sys.maxsize)
+         seed_everything(seed)
+         self.seed = seed
+         wav, sr, spect = infer_process(
+             ref_file,
+             ref_text,
+             gen_text,
+             self.ema_model,
+             show_info=show_info,
+             progress=progress,
+             target_rms=target_rms,
+             cross_fade_duration=cross_fade_duration,
+             nfe_step=nfe_step,
+             cfg_strength=cfg_strength,
+             sway_sampling_coef=sway_sampling_coef,
+             speed=speed,
+             fix_duration=fix_duration,
+             device=self.device,
+         )
+
+         if file_wave is not None:
+             self.export_wav(wav, file_wave, remove_silence)
+
+         if file_spect is not None:
+             self.export_spectrogram(spect, file_spect)
+
+         return wav, sr, spect
+
+
+ if __name__ == "__main__":
+     f5tts = F5TTS()
+
+     wav, sr, spect = f5tts.infer(
+         ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
+         ref_text="some call me nature, others call me mother nature.",
+         gen_text="""I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.""",
+         file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),
+         file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
+         seed=-1,  # random seed = -1
+     )
+
+     print("seed :", f5tts.seed)
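The seed handling in `F5TTS.infer` above treats `-1` as a request for a fresh random seed and records the value actually used in `self.seed`. A minimal sketch of that rule in isolation, assuming a hypothetical helper `resolve_seed` that is not part of the repository:

```python
import random
import sys


def resolve_seed(seed: int = -1) -> int:
    """Return the seed to use: a fresh random one when seed == -1, else the given seed.

    Hypothetical helper for illustration; api.py inlines this logic in infer().
    """
    if seed == -1:
        # Same range as in api.py: any integer from 0 up to sys.maxsize.
        return random.randint(0, sys.maxsize)
    return seed


if __name__ == "__main__":
    used = resolve_seed(-1)
    print("seed :", used)  # keep this value if the run needs to be repeated
    assert resolve_seed(used) == used
```

Passing the printed value back as `seed=` on a later call is what should make a generation repeatable, provided the other inputs and the environment are unchanged.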
src/f5_tts/eval/README.md ADDED
@@ -0,0 +1,49 @@
+
+ # Evaluation
+
+ Install packages for evaluation:
+
+ ```bash
+ pip install -e .[eval]
+ ```
+
+ ## Generating Samples for Evaluation
+
+ ### Prepare Test Datasets
+
+ 1. *Seed-TTS testset*: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
+ 2. *LibriSpeech test-clean*: Download from [OpenSLR](http://www.openslr.org/12/).
+ 3. Unzip the downloaded datasets and place them in the `data/` directory.
+ 4. Update the path for *LibriSpeech test-clean* data in `src/f5_tts/eval/eval_infer_batch.py`.
+ 5. Our filtered LibriSpeech-PC 4-10s subset is listed in `data/librispeech_pc_test_clean_cross_sentence.lst`.
+
+ ### Batch Inference for Test Set
+
+ To run batch inference for evaluations, execute the following commands:
+
+ ```bash
+ # batch inference for evaluations
+ accelerate config # if not set before
+ bash src/f5_tts/eval/eval_infer_batch.sh
+ ```
+
+ ## Objective Evaluation on Generated Results
+
+ ### Download Evaluation Model Checkpoints
+
+ 1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
+ 2. English ASR Model: [Faster-Whisper](https://huggingface.co/Systran/faster-whisper-large-v3)
+ 3. WavLM Model: Download from [Google Drive](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).
+
+ Then update the scripts below with the paths where you placed the evaluation model checkpoints.
+
+ ### Objective Evaluation
+
+ Update the paths to point to your batch-inference results, then run the WER / SIM evaluations:
+ ```bash
+ # Evaluation for Seed-TTS test set
+ python src/f5_tts/eval/eval_seedtts_testset.py
+
+ # Evaluation for LibriSpeech-PC test-clean (cross-sentence)
+ python src/f5_tts/eval/eval_librispeech_test_clean.py
+ ```
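For orientation, WER is a word-level edit distance between the ASR transcript of the generated audio and the target text, divided by the number of reference words. The sketch below is a plain-Python illustration only: the `eval` extras list `jiwer`, which is presumably what the evaluation scripts use, and the function name `word_error_rate` is hypothetical. Text normalization (case, punctuation) is deliberately left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev is the previous row of the DP table: the distance between the
    # reference prefix processed so far and every prefix of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("some call me nature", "some call me mature"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 1.0.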
src/f5_tts/eval/ecapa_tdnn.py ADDED
@@ -0,0 +1,330 @@
+ # just for speaker similarity evaluation, third-party code
+
+ # From https://github.com/microsoft/UniSpeech/blob/main/downstreams/speaker_verification/models/
+ # part of the code is borrowed from https://github.com/lawlict/ECAPA-TDNN
+
+ import os
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+
+ """ Res2Conv1d + BatchNorm1d + ReLU
+ """
+
+
+ class Res2Conv1dReluBn(nn.Module):
+     """
+     in_channels == out_channels == channels
+     """
+
+     def __init__(self, channels, kernel_size=1, stride=1, padding=0, dilation=1, bias=True, scale=4):
+         super().__init__()
+         assert channels % scale == 0, "{} % {} != 0".format(channels, scale)
+         self.scale = scale
+         self.width = channels // scale
+         self.nums = scale if scale == 1 else scale - 1
+
+         self.convs = []
+         self.bns = []
+         for i in range(self.nums):
+             self.convs.append(nn.Conv1d(self.width, self.width, kernel_size, stride, padding, dilation, bias=bias))
+             self.bns.append(nn.BatchNorm1d(self.width))
+         self.convs = nn.ModuleList(self.convs)
+         self.bns = nn.ModuleList(self.bns)
+
+     def forward(self, x):
+         out = []
+         spx = torch.split(x, self.width, 1)
+         for i in range(self.nums):
+             if i == 0:
+                 sp = spx[i]
+             else:
+                 sp = sp + spx[i]
+             # Order: conv -> relu -> bn
+             sp = self.convs[i](sp)
+             sp = self.bns[i](F.relu(sp))
+             out.append(sp)
+         if self.scale != 1:
+             out.append(spx[self.nums])
+         out = torch.cat(out, dim=1)
+
+         return out
+
+
+ """ Conv1d + BatchNorm1d + ReLU
+ """
+
+
+ class Conv1dReluBn(nn.Module):
+     def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, padding=0, dilation=1, bias=True):
+         super().__init__()
+         self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, stride, padding, dilation, bias=bias)
+         self.bn = nn.BatchNorm1d(out_channels)
+
+     def forward(self, x):
+         return self.bn(F.relu(self.conv(x)))
+
+
+ """ The SE connection of 1D case.
+ """
+
+
+ class SE_Connect(nn.Module):
+     def __init__(self, channels, se_bottleneck_dim=128):
+         super().__init__()
+         self.linear1 = nn.Linear(channels, se_bottleneck_dim)
+         self.linear2 = nn.Linear(se_bottleneck_dim, channels)
+
+     def forward(self, x):
+         out = x.mean(dim=2)
+         out = F.relu(self.linear1(out))
+         out = torch.sigmoid(self.linear2(out))
+         out = x * out.unsqueeze(2)
+
+         return out
+
+
+ """ SE-Res2Block of the ECAPA-TDNN architecture.
+ """
+
+ # def SE_Res2Block(channels, kernel_size, stride, padding, dilation, scale):
+ #     return nn.Sequential(
+ #         Conv1dReluBn(channels, 512, kernel_size=1, stride=1, padding=0),
+ #         Res2Conv1dReluBn(512, kernel_size, stride, padding, dilation, scale=scale),
+ #         Conv1dReluBn(512, channels, kernel_size=1, stride=1, padding=0),
+ #         SE_Connect(channels)
+ #     )
+
+
+ class SE_Res2Block(nn.Module):
+     def __init__(self, in_channels, out_channels, kernel_size, stride, padding, dilation, scale, se_bottleneck_dim):
+         super().__init__()
+         self.Conv1dReluBn1 = Conv1dReluBn(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
+         self.Res2Conv1dReluBn = Res2Conv1dReluBn(out_channels, kernel_size, stride, padding, dilation, scale=scale)
+         self.Conv1dReluBn2 = Conv1dReluBn(out_channels, out_channels, kernel_size=1, stride=1, padding=0)
+         self.SE_Connect = SE_Connect(out_channels, se_bottleneck_dim)
+
+         self.shortcut = None
+         if in_channels != out_channels:
+             self.shortcut = nn.Conv1d(
+                 in_channels=in_channels,
+                 out_channels=out_channels,
+                 kernel_size=1,
+             )
+
+     def forward(self, x):
+         residual = x
+         if self.shortcut:
+             residual = self.shortcut(x)
+
+         x = self.Conv1dReluBn1(x)
+         x = self.Res2Conv1dReluBn(x)
+         x = self.Conv1dReluBn2(x)
+         x = self.SE_Connect(x)
+
+         return x + residual
+
+
+ """ Attentive weighted mean and standard deviation pooling.
+ """
+
+
+ class AttentiveStatsPool(nn.Module):
+     def __init__(self, in_dim, attention_channels=128, global_context_att=False):
+         super().__init__()
+         self.global_context_att = global_context_att
+
+         # Use Conv1d with stride == 1 rather than Linear, then we don't need to transpose inputs.
+         if global_context_att:
+             self.linear1 = nn.Conv1d(in_dim * 3, attention_channels, kernel_size=1)  # equals W and b in the paper
+         else:
+             self.linear1 = nn.Conv1d(in_dim, attention_channels, kernel_size=1)  # equals W and b in the paper
+         self.linear2 = nn.Conv1d(attention_channels, in_dim, kernel_size=1)  # equals V and k in the paper
+
+     def forward(self, x):
+         if self.global_context_att:
+             context_mean = torch.mean(x, dim=-1, keepdim=True).expand_as(x)
+             context_std = torch.sqrt(torch.var(x, dim=-1, keepdim=True) + 1e-10).expand_as(x)
+             x_in = torch.cat((x, context_mean, context_std), dim=1)
+         else:
+             x_in = x
+
+         # DON'T use ReLU here! In experiments, I find ReLU hard to converge.
+         alpha = torch.tanh(self.linear1(x_in))
+         # alpha = F.relu(self.linear1(x_in))
+         alpha = torch.softmax(self.linear2(alpha), dim=2)
+         mean = torch.sum(alpha * x, dim=2)
+         residuals = torch.sum(alpha * (x**2), dim=2) - mean**2
+         std = torch.sqrt(residuals.clamp(min=1e-9))
+         return torch.cat([mean, std], dim=1)
+
+
+ class ECAPA_TDNN(nn.Module):
+     def __init__(
+         self,
+         feat_dim=80,
+         channels=512,
+         emb_dim=192,
+         global_context_att=False,
+         feat_type="wavlm_large",
+         sr=16000,
+         feature_selection="hidden_states",
+         update_extract=False,
+         config_path=None,
+     ):
+         super().__init__()
+
+         self.feat_type = feat_type
+         self.feature_selection = feature_selection
+         self.update_extract = update_extract
+         self.sr = sr
+
+         torch.hub._validate_not_a_forked_repo = lambda a, b, c: True
+         try:
+             local_s3prl_path = os.path.expanduser("~/.cache/torch/hub/s3prl_s3prl_main")
+             self.feature_extract = torch.hub.load(local_s3prl_path, feat_type, source="local", config_path=config_path)
+         except:  # noqa: E722
+             self.feature_extract = torch.hub.load("s3prl/s3prl", feat_type)
+
+         if len(self.feature_extract.model.encoder.layers) == 24 and hasattr(
+             self.feature_extract.model.encoder.layers[23].self_attn, "fp32_attention"
+         ):
+             self.feature_extract.model.encoder.layers[23].self_attn.fp32_attention = False
+         if len(self.feature_extract.model.encoder.layers) == 24 and hasattr(
+             self.feature_extract.model.encoder.layers[11].self_attn, "fp32_attention"
+         ):
+             self.feature_extract.model.encoder.layers[11].self_attn.fp32_attention = False
+
+         self.feat_num = self.get_feat_num()
+         self.feature_weight = nn.Parameter(torch.zeros(self.feat_num))
+
+         if feat_type != "fbank" and feat_type != "mfcc":
+             freeze_list = ["final_proj", "label_embs_concat", "mask_emb", "project_q", "quantizer"]
+             for name, param in self.feature_extract.named_parameters():
+                 for freeze_val in freeze_list:
+                     if freeze_val in name:
+                         param.requires_grad = False
+                         break
+
+         if not self.update_extract:
+             for param in self.feature_extract.parameters():
+                 param.requires_grad = False
+
+         self.instance_norm = nn.InstanceNorm1d(feat_dim)
+         # self.channels = [channels] * 4 + [channels * 3]
+         self.channels = [channels] * 4 + [1536]
+
+         self.layer1 = Conv1dReluBn(feat_dim, self.channels[0], kernel_size=5, padding=2)
+         self.layer2 = SE_Res2Block(
+             self.channels[0],
+             self.channels[1],
+             kernel_size=3,
+             stride=1,
+             padding=2,
+             dilation=2,
+             scale=8,
+             se_bottleneck_dim=128,
+         )
+         self.layer3 = SE_Res2Block(
+             self.channels[1],
+             self.channels[2],
+             kernel_size=3,
+             stride=1,
+             padding=3,
+             dilation=3,
+             scale=8,
+             se_bottleneck_dim=128,
+         )
+         self.layer4 = SE_Res2Block(
+             self.channels[2],
+             self.channels[3],
+             kernel_size=3,
+             stride=1,
+             padding=4,
+             dilation=4,
+             scale=8,
+             se_bottleneck_dim=128,
+         )
+
+         # self.conv = nn.Conv1d(self.channels[-1], self.channels[-1], kernel_size=1)
+         cat_channels = channels * 3
+         self.conv = nn.Conv1d(cat_channels, self.channels[-1], kernel_size=1)
+         self.pooling = AttentiveStatsPool(
+             self.channels[-1], attention_channels=128, global_context_att=global_context_att
+         )
+         self.bn = nn.BatchNorm1d(self.channels[-1] * 2)
+         self.linear = nn.Linear(self.channels[-1] * 2, emb_dim)
+
+     def get_feat_num(self):
+         self.feature_extract.eval()
+         wav = [torch.randn(self.sr).to(next(self.feature_extract.parameters()).device)]
+         with torch.no_grad():
+             features = self.feature_extract(wav)
+         select_feature = features[self.feature_selection]
+         if isinstance(select_feature, (list, tuple)):
+             return len(select_feature)
+         else:
+             return 1
+
+     def get_feat(self, x):
+         if self.update_extract:
+             x = self.feature_extract([sample for sample in x])
+         else:
+             with torch.no_grad():
+                 if self.feat_type == "fbank" or self.feat_type == "mfcc":
+                     x = self.feature_extract(x) + 1e-6  # B x feat_dim x time_len
+                 else:
+                     x = self.feature_extract([sample for sample in x])
+
+         if self.feat_type == "fbank":
+             x = x.log()
+
+         if self.feat_type != "fbank" and self.feat_type != "mfcc":
+             x = x[self.feature_selection]
+             if isinstance(x, (list, tuple)):
+                 x = torch.stack(x, dim=0)
+             else:
+                 x = x.unsqueeze(0)
+             norm_weights = F.softmax(self.feature_weight, dim=-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
+             x = (norm_weights * x).sum(dim=0)
+             x = torch.transpose(x, 1, 2) + 1e-6
+
+         x = self.instance_norm(x)
+         return x
+
+     def forward(self, x):
+         x = self.get_feat(x)
+
+         out1 = self.layer1(x)
+         out2 = self.layer2(out1)
+         out3 = self.layer3(out2)
+         out4 = self.layer4(out3)
+
+         out = torch.cat([out2, out3, out4], dim=1)
+         out = F.relu(self.conv(out))
+         out = self.bn(self.pooling(out))
+         out = self.linear(out)
+
+         return out
+
+
+ def ECAPA_TDNN_SMALL(
+     feat_dim,
+     emb_dim=256,
+     feat_type="wavlm_large",
+     sr=16000,
+     feature_selection="hidden_states",
+     update_extract=False,
+     config_path=None,
+ ):
+     return ECAPA_TDNN(
+         feat_dim=feat_dim,
+         channels=512,
+         emb_dim=emb_dim,
+         feat_type=feat_type,
+         sr=sr,
+         feature_selection=feature_selection,
+         update_extract=update_extract,
+         config_path=config_path,
+     )
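The pooling step in `AttentiveStatsPool.forward` reduces a sequence of frames to an attention-weighted mean and standard deviation. The sketch below reproduces that arithmetic for a single channel in plain Python so the formula can be checked by hand. The function name `attentive_stats` and the scalar `scores` input (standing in for the output of `linear2`) are assumptions made for illustration and are not part of the module.

```python
import math


def attentive_stats(frames, scores, eps=1e-9):
    """Attention-weighted mean and std of one channel over time.

    frames: per-frame feature values for one channel.
    scores: unnormalized attention scores, one per frame.
    """
    # Softmax over the time axis (shifted by the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    mean = sum(a * x for a, x in zip(alpha, frames))
    # Weighted variance as E[x^2] - E[x]^2, clamped before the square root.
    variance = sum(a * x * x for a, x in zip(alpha, frames)) - mean**2
    std = math.sqrt(max(variance, eps))
    return mean, std


if __name__ == "__main__":
    # Equal scores give uniform weights, so this is the ordinary mean and population std.
    print(attentive_stats([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```

The clamp mirrors `residuals.clamp(min=1e-9)` in the module: when all frames are nearly identical, rounding can push the variance slightly below zero, and the floor keeps the square root defined.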
src/f5_tts/eval/eval_infer_batch.py ADDED
@@ -0,0 +1,197 @@
+ import sys
+ import os
+
+ sys.path.append(os.getcwd())
+
+ import time
+ from tqdm import tqdm
+ import argparse
+ from importlib.resources import files
+
+ import torch
+ import torchaudio
+ from accelerate import Accelerator
+ from vocos import Vocos
+
+ from f5_tts.model import CFM, UNetT, DiT
+ from f5_tts.model.utils import get_tokenizer
+ from f5_tts.infer.utils_infer import load_checkpoint
+ from f5_tts.eval.utils_eval import (
+     get_seedtts_testset_metainfo,
+     get_librispeech_test_clean_metainfo,
+     get_inference_prompt,
+ )
+
+ accelerator = Accelerator()
+ device = f"cuda:{accelerator.process_index}"
+
+
+ # --------------------- Dataset Settings -------------------- #
+
+ target_sample_rate = 24000
+ n_mel_channels = 100
+ hop_length = 256
+ target_rms = 0.1
+
+ tokenizer = "pinyin"
+ rel_path = str(files("f5_tts").joinpath("../../"))
+
+
+ def main():
+     # ---------------------- infer setting ---------------------- #
+
+     parser = argparse.ArgumentParser(description="batch inference")
+
+     parser.add_argument("-s", "--seed", default=None, type=int)
+     parser.add_argument("-d", "--dataset", default="Emilia_ZH_EN")
+     parser.add_argument("-n", "--expname", required=True)
+     parser.add_argument("-c", "--ckptstep", default=1200000, type=int)
+
+     parser.add_argument("-nfe", "--nfestep", default=32, type=int)
+     parser.add_argument("-o", "--odemethod", default="euler")
+     parser.add_argument("-ss", "--swaysampling", default=-1, type=float)
+
+     parser.add_argument("-t", "--testset", required=True)
+
+     args = parser.parse_args()
+
+     seed = args.seed
+     dataset_name = args.dataset
+     exp_name = args.expname
+     ckpt_step = args.ckptstep
+     ckpt_path = rel_path + f"/ckpts/{exp_name}/model_{ckpt_step}.pt"
+
+     nfe_step = args.nfestep
+     ode_method = args.odemethod
+     sway_sampling_coef = args.swaysampling
+
+     testset = args.testset
+
+     infer_batch_size = 1  # max frames. 1 for ddp single inference (recommended)
71
+ cfg_strength = 2.0
72
+ speed = 1.0
73
+ use_truth_duration = False
74
+ no_ref_audio = False
75
+
76
+ if exp_name == "F5TTS_Base":
77
+ model_cls = DiT
78
+ model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
79
+
80
+ elif exp_name == "E2TTS_Base":
81
+ model_cls = UNetT
82
+ model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
83
+
84
+ if testset == "ls_pc_test_clean":
85
+ metalst = rel_path + "/data/librispeech_pc_test_clean_cross_sentence.lst"
86
+ librispeech_test_clean_path = "<SOME_PATH>/LibriSpeech/test-clean" # test-clean path
87
+ metainfo = get_librispeech_test_clean_metainfo(metalst, librispeech_test_clean_path)
88
+
89
+ elif testset == "seedtts_test_zh":
90
+ metalst = rel_path + "/data/seedtts_testset/zh/meta.lst"
91
+ metainfo = get_seedtts_testset_metainfo(metalst)
92
+
93
+ elif testset == "seedtts_test_en":
94
+ metalst = rel_path + "/data/seedtts_testset/en/meta.lst"
95
+ metainfo = get_seedtts_testset_metainfo(metalst)
96
+
97
+ # path to save generated wavs
98
+ output_dir = (
99
+ f"{rel_path}/"
100
+ f"results/{exp_name}_{ckpt_step}/{testset}/"
101
+ f"seed{seed}_{ode_method}_nfe{nfe_step}"
102
+ f"{f'_ss{sway_sampling_coef}' if sway_sampling_coef else ''}"
103
+ f"_cfg{cfg_strength}_speed{speed}"
104
+ f"{'_gt-dur' if use_truth_duration else ''}"
105
+ f"{'_no-ref-audio' if no_ref_audio else ''}"
106
+ )
107
+
108
+ # -------------------------------------------------#
109
+
110
+ use_ema = True
111
+
112
+ prompts_all = get_inference_prompt(
113
+ metainfo,
114
+ speed=speed,
115
+ tokenizer=tokenizer,
116
+ target_sample_rate=target_sample_rate,
117
+ n_mel_channels=n_mel_channels,
118
+ hop_length=hop_length,
119
+ target_rms=target_rms,
120
+ use_truth_duration=use_truth_duration,
121
+ infer_batch_size=infer_batch_size,
122
+ )
123
+
124
+ # Vocoder model
125
+ local = False
126
+ if local:
127
+ vocos_local_path = "../checkpoints/charactr/vocos-mel-24khz"
128
+ vocos = Vocos.from_hparams(f"{vocos_local_path}/config.yaml")
129
+ state_dict = torch.load(f"{vocos_local_path}/pytorch_model.bin", weights_only=True, map_location=device)
130
+ vocos.load_state_dict(state_dict)
131
+ vocos.eval()
132
+ else:
133
+ vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
134
+
135
+ # Tokenizer
136
+ vocab_char_map, vocab_size = get_tokenizer(dataset_name, tokenizer)
137
+
138
+ # Model
139
+ model = CFM(
140
+ transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
141
+ mel_spec_kwargs=dict(
142
+ target_sample_rate=target_sample_rate,
143
+ n_mel_channels=n_mel_channels,
144
+ hop_length=hop_length,
145
+ ),
146
+ odeint_kwargs=dict(
147
+ method=ode_method,
148
+ ),
149
+ vocab_char_map=vocab_char_map,
150
+ ).to(device)
151
+
152
+ model = load_checkpoint(model, ckpt_path, device, use_ema=use_ema)
153
+
154
+ if not os.path.exists(output_dir) and accelerator.is_main_process:
155
+ os.makedirs(output_dir)
156
+
157
+ # start batch inference
158
+ accelerator.wait_for_everyone()
159
+ start = time.time()
160
+
161
+ with accelerator.split_between_processes(prompts_all) as prompts:
162
+ for prompt in tqdm(prompts, disable=not accelerator.is_local_main_process):
163
+ utts, ref_rms_list, ref_mels, ref_mel_lens, total_mel_lens, final_text_list = prompt
164
+ ref_mels = ref_mels.to(device)
165
+ ref_mel_lens = torch.tensor(ref_mel_lens, dtype=torch.long).to(device)
166
+ total_mel_lens = torch.tensor(total_mel_lens, dtype=torch.long).to(device)
167
+
168
+ # Inference
169
+ with torch.inference_mode():
170
+ generated, _ = model.sample(
171
+ cond=ref_mels,
172
+ text=final_text_list,
173
+ duration=total_mel_lens,
174
+ lens=ref_mel_lens,
175
+ steps=nfe_step,
176
+ cfg_strength=cfg_strength,
177
+ sway_sampling_coef=sway_sampling_coef,
178
+ no_ref_audio=no_ref_audio,
179
+ seed=seed,
180
+ )
181
+ # Final result
182
+ for i, gen in enumerate(generated):
183
+ gen = gen[ref_mel_lens[i] : total_mel_lens[i], :].unsqueeze(0)
184
+ gen_mel_spec = gen.permute(0, 2, 1)
185
+ generated_wave = vocos.decode(gen_mel_spec.cpu())
186
+ if ref_rms_list[i] < target_rms:
187
+ generated_wave = generated_wave * ref_rms_list[i] / target_rms
188
+ torchaudio.save(f"{output_dir}/{utts[i]}.wav", generated_wave, target_sample_rate)
189
+
190
+ accelerator.wait_for_everyone()
191
+ if accelerator.is_main_process:
192
+ timediff = time.time() - start
193
+ print(f"Done batch inference in {timediff / 60 :.2f} minutes.")
194
+
195
+
196
+ if __name__ == "__main__":
197
+ main()
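The `output_dir` f-string in `main()` encodes the run settings in the directory name, and the `_ss…` suffix is omitted whenever `sway_sampling_coef` is falsy (for example `-ss 0`). A small sketch of that naming rule, assuming a hypothetical helper `build_run_name` that is not part of the script:

```python
def build_run_name(
    seed,
    ode_method,
    nfe_step,
    sway_sampling_coef,
    cfg_strength=2.0,
    speed=1.0,
    use_truth_duration=False,
    no_ref_audio=False,
):
    # Mirrors the last path component assembled in main() above.
    name = f"seed{seed}_{ode_method}_nfe{nfe_step}"
    if sway_sampling_coef:  # 0 / 0.0 / None -> no "_ss" suffix
        name += f"_ss{sway_sampling_coef}"
    name += f"_cfg{cfg_strength}_speed{speed}"
    if use_truth_duration:
        name += "_gt-dur"
    if no_ref_audio:
        name += "_no-ref-audio"
    return name


print(build_run_name(0, "euler", 16, -1.0))
print(build_run_name(0, "midpoint", 32, 0.0))
```

Knowing the exact name matters because the evaluation scripts take this directory as `gen_wav_dir`.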
src/f5_tts/eval/eval_infer_batch.sh ADDED
@@ -0,0 +1,13 @@
1
+ #!/bin/bash
2
+
3
+ # e.g. F5-TTS, 16 NFE
4
+ accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "seedtts_test_zh" -nfe 16
5
+ accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "seedtts_test_en" -nfe 16
6
+ accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "ls_pc_test_clean" -nfe 16
7
+
8
+ # e.g. Vanilla E2 TTS, 32 NFE
9
+ accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "seedtts_test_zh" -o "midpoint" -ss 0
10
+ accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "seedtts_test_en" -o "midpoint" -ss 0
11
+ accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "ls_pc_test_clean" -o "midpoint" -ss 0
12
+
13
+ # etc.
src/f5_tts/eval/eval_librispeech_test_clean.py ADDED
@@ -0,0 +1,73 @@
1
+ # Evaluate with LibriSpeech test-clean: ~3s prompt to generate 4-10s audio (following the VALL-E / Voicebox evaluation setup)
2
+
3
+ import sys
4
+ import os
5
+
6
+ sys.path.append(os.getcwd())
7
+
8
+ import multiprocessing as mp
9
+ from importlib.resources import files
10
+
11
+ import numpy as np
12
+
13
+ from f5_tts.eval.utils_eval import (
14
+ get_librispeech_test,
15
+ run_asr_wer,
16
+ run_sim,
17
+ )
18
+
19
+ rel_path = str(files("f5_tts").joinpath("../../"))
20
+
21
+
22
+ eval_task = "wer" # sim | wer
23
+ lang = "en"
24
+ metalst = rel_path + "/data/librispeech_pc_test_clean_cross_sentence.lst"
25
+ librispeech_test_clean_path = "<SOME_PATH>/LibriSpeech/test-clean" # test-clean path
26
+ gen_wav_dir = "PATH_TO_GENERATED" # generated wavs
27
+
28
+ gpus = [0, 1, 2, 3, 4, 5, 6, 7]
29
+ test_set = get_librispeech_test(metalst, gen_wav_dir, gpus, librispeech_test_clean_path)
30
+
31
+ ## In LibriSpeech, some speakers utilized varying voice characteristics for different characters in the book,
32
+ ## leading to a low similarity for the ground truth in some cases.
33
+ # test_set = get_librispeech_test(metalst, gen_wav_dir, gpus, librispeech_test_clean_path, eval_ground_truth = True) # eval ground truth
34
+
35
+ local = False
36
+ if local: # use local custom checkpoint dir
37
+ asr_ckpt_dir = "../checkpoints/Systran/faster-whisper-large-v3"
38
+ else:
39
+ asr_ckpt_dir = "" # auto download to cache dir
40
+
41
+ wavlm_ckpt_dir = "../checkpoints/UniSpeech/wavlm_large_finetune.pth"
42
+
43
+
44
+ # --------------------------- WER ---------------------------
45
+
46
+ if eval_task == "wer":
47
+ wers = []
48
+
49
+ with mp.Pool(processes=len(gpus)) as pool:
50
+ args = [(rank, lang, sub_test_set, asr_ckpt_dir) for (rank, sub_test_set) in test_set]
51
+ results = pool.map(run_asr_wer, args)
52
+ for wers_ in results:
53
+ wers.extend(wers_)
54
+
55
+ wer = round(np.mean(wers) * 100, 3)
56
+ print(f"\nTotal {len(wers)} samples")
57
+ print(f"WER : {wer}%")
58
+
59
+
60
+ # --------------------------- SIM ---------------------------
61
+
62
+ if eval_task == "sim":
63
+ sim_list = []
64
+
65
+ with mp.Pool(processes=len(gpus)) as pool:
66
+ args = [(rank, sub_test_set, wavlm_ckpt_dir) for (rank, sub_test_set) in test_set]
67
+ results = pool.map(run_sim, args)
68
+ for sim_ in results:
69
+ sim_list.extend(sim_)
70
+
71
+ sim = round(sum(sim_list) / len(sim_list), 3)
72
+ print(f"\nTotal {len(sim_list)} samples")
73
+ print(f"SIM : {sim}")
src/f5_tts/eval/eval_seedtts_testset.py ADDED
@@ -0,0 +1,75 @@
1
+ # Evaluate with Seed-TTS testset
2
+
3
+ import sys
4
+ import os
5
+
6
+ sys.path.append(os.getcwd())
7
+
8
+ import multiprocessing as mp
9
+ from importlib.resources import files
10
+
11
+ import numpy as np
12
+
13
+ from f5_tts.eval.utils_eval import (
14
+ get_seed_tts_test,
15
+ run_asr_wer,
16
+ run_sim,
17
+ )
18
+
19
+ rel_path = str(files("f5_tts").joinpath("../../"))
20
+
21
+
22
+ eval_task = "wer" # sim | wer
23
+ lang = "zh" # zh | en
24
+ metalst = rel_path + f"/data/seedtts_testset/{lang}/meta.lst" # seed-tts testset
25
+ # gen_wav_dir = rel_path + f"/data/seedtts_testset/{lang}/wavs" # ground truth wavs
26
+ gen_wav_dir = "PATH_TO_GENERATED" # generated wavs
27
+
28
+
29
+ # NOTE: paraformer-zh results differ slightly depending on the number of GPUs, because the batch size differs
30
+ # zh WER 1.254 appears to be the result with 4 workers (wer_seed_tts)
31
+ gpus = [0, 1, 2, 3, 4, 5, 6, 7]
32
+ test_set = get_seed_tts_test(metalst, gen_wav_dir, gpus)
33
+
34
+ local = False
35
+ if local: # use local custom checkpoint dir
36
+ if lang == "zh":
37
+ asr_ckpt_dir = "../checkpoints/funasr" # paraformer-zh dir under funasr
38
+ elif lang == "en":
39
+ asr_ckpt_dir = "../checkpoints/Systran/faster-whisper-large-v3"
40
+ else:
41
+ asr_ckpt_dir = "" # auto download to cache dir
42
+
43
+ wavlm_ckpt_dir = "../checkpoints/UniSpeech/wavlm_large_finetune.pth"
44
+
45
+
46
+ # --------------------------- WER ---------------------------
47
+
48
+ if eval_task == "wer":
49
+ wers = []
50
+
51
+ with mp.Pool(processes=len(gpus)) as pool:
52
+ args = [(rank, lang, sub_test_set, asr_ckpt_dir) for (rank, sub_test_set) in test_set]
53
+ results = pool.map(run_asr_wer, args)
54
+ for wers_ in results:
55
+ wers.extend(wers_)
56
+
57
+ wer = round(np.mean(wers) * 100, 3)
58
+ print(f"\nTotal {len(wers)} samples")
59
+ print(f"WER : {wer}%")
60
+
61
+
62
+ # --------------------------- SIM ---------------------------
63
+
64
+ if eval_task == "sim":
65
+ sim_list = []
66
+
67
+ with mp.Pool(processes=len(gpus)) as pool:
68
+ args = [(rank, sub_test_set, wavlm_ckpt_dir) for (rank, sub_test_set) in test_set]
69
+ results = pool.map(run_sim, args)
70
+ for sim_ in results:
71
+ sim_list.extend(sim_)
72
+
73
+ sim = round(sum(sim_list) / len(sim_list), 3)
74
+ print(f"\nTotal {len(sim_list)} samples")
75
+ print(f"SIM : {sim}")
src/f5_tts/eval/utils_eval.py ADDED
@@ -0,0 +1,397 @@
1
+ import math
2
+ import os
3
+ import random
4
+ import string
5
+ from tqdm import tqdm
6
+
7
+ import torch
8
+ import torch.nn.functional as F
9
+ import torchaudio
10
+
11
+ from f5_tts.model.modules import MelSpec
12
+ from f5_tts.model.utils import convert_char_to_pinyin
13
+ from f5_tts.eval.ecapa_tdnn import ECAPA_TDNN_SMALL
14
+
15
+
16
+ # seedtts testset metainfo: utt, prompt_text, prompt_wav, gt_text, gt_wav
17
+ def get_seedtts_testset_metainfo(metalst):
18
+ f = open(metalst)
19
+ lines = f.readlines()
20
+ f.close()
21
+ metainfo = []
22
+ for line in lines:
23
+ if len(line.strip().split("|")) == 5:
24
+ utt, prompt_text, prompt_wav, gt_text, gt_wav = line.strip().split("|")
25
+ elif len(line.strip().split("|")) == 4:
26
+ utt, prompt_text, prompt_wav, gt_text = line.strip().split("|")
27
+ gt_wav = os.path.join(os.path.dirname(metalst), "wavs", utt + ".wav")
28
+ if not os.path.isabs(prompt_wav):
29
+ prompt_wav = os.path.join(os.path.dirname(metalst), prompt_wav)
30
+ metainfo.append((utt, prompt_text, prompt_wav, gt_text, gt_wav))
31
+ return metainfo
32
+
33
+
34
+ # librispeech test-clean metainfo: gen_utt, ref_txt, ref_wav, gen_txt, gen_wav
35
+ def get_librispeech_test_clean_metainfo(metalst, librispeech_test_clean_path):
36
+ f = open(metalst)
37
+ lines = f.readlines()
38
+ f.close()
39
+ metainfo = []
40
+ for line in lines:
41
+ ref_utt, ref_dur, ref_txt, gen_utt, gen_dur, gen_txt = line.strip().split("\t")
42
+
43
+ # ref_txt = ref_txt[0] + ref_txt[1:].lower() + '.' # if use librispeech test-clean (no-pc)
44
+ ref_spk_id, ref_chaptr_id, _ = ref_utt.split("-")
45
+ ref_wav = os.path.join(librispeech_test_clean_path, ref_spk_id, ref_chaptr_id, ref_utt + ".flac")
46
+
47
+ # gen_txt = gen_txt[0] + gen_txt[1:].lower() + '.' # if use librispeech test-clean (no-pc)
48
+ gen_spk_id, gen_chaptr_id, _ = gen_utt.split("-")
49
+ gen_wav = os.path.join(librispeech_test_clean_path, gen_spk_id, gen_chaptr_id, gen_utt + ".flac")
50
+
51
+ metainfo.append((gen_utt, ref_txt, ref_wav, " " + gen_txt, gen_wav))
52
+
53
+ return metainfo
54
+
55
+
56
+ # padded to max length mel batch
57
+ def padded_mel_batch(ref_mels):
58
+ max_mel_length = torch.LongTensor([mel.shape[-1] for mel in ref_mels]).amax()
59
+ padded_ref_mels = []
60
+ for mel in ref_mels:
61
+ padded_ref_mel = F.pad(mel, (0, max_mel_length - mel.shape[-1]), value=0)
62
+ padded_ref_mels.append(padded_ref_mel)
63
+ padded_ref_mels = torch.stack(padded_ref_mels)
64
+ padded_ref_mels = padded_ref_mels.permute(0, 2, 1)
65
+ return padded_ref_mels
66
+
67
+
68
+ # get prompts from metainfo containing: utt, prompt_text, prompt_wav, gt_text, gt_wav
69
+
70
+
71
+ def get_inference_prompt(
72
+ metainfo,
73
+ speed=1.0,
74
+ tokenizer="pinyin",
75
+ polyphone=True,
76
+ target_sample_rate=24000,
77
+ n_mel_channels=100,
78
+ hop_length=256,
79
+ target_rms=0.1,
80
+ use_truth_duration=False,
81
+ infer_batch_size=1,
82
+ num_buckets=200,
83
+ min_secs=3,
84
+ max_secs=40,
85
+ ):
86
+ prompts_all = []
87
+
88
+ min_tokens = min_secs * target_sample_rate // hop_length
89
+ max_tokens = max_secs * target_sample_rate // hop_length
90
+
91
+ batch_accum = [0] * num_buckets
92
+ utts, ref_rms_list, ref_mels, ref_mel_lens, total_mel_lens, final_text_list = (
93
+ [[] for _ in range(num_buckets)] for _ in range(6)
94
+ )
95
+
96
+ mel_spectrogram = MelSpec(
97
+ target_sample_rate=target_sample_rate, n_mel_channels=n_mel_channels, hop_length=hop_length
98
+ )
99
+
100
+ for utt, prompt_text, prompt_wav, gt_text, gt_wav in tqdm(metainfo, desc="Processing prompts..."):
101
+ # Audio
102
+ ref_audio, ref_sr = torchaudio.load(prompt_wav)
103
+ ref_rms = torch.sqrt(torch.mean(torch.square(ref_audio)))
104
+ if ref_rms < target_rms:
105
+ ref_audio = ref_audio * target_rms / ref_rms
106
+ assert ref_audio.shape[-1] > 5000, f"Empty prompt wav: {prompt_wav}, or torchaudio backend issue."
107
+ if ref_sr != target_sample_rate:
108
+ resampler = torchaudio.transforms.Resample(ref_sr, target_sample_rate)
109
+ ref_audio = resampler(ref_audio)
110
+
111
+ # Text
112
+ if len(prompt_text[-1].encode("utf-8")) == 1:
113
+ prompt_text = prompt_text + " "
114
+ text = [prompt_text + gt_text]
115
+ if tokenizer == "pinyin":
116
+ text_list = convert_char_to_pinyin(text, polyphone=polyphone)
117
+ else:
118
+ text_list = text
119
+
120
+ # Duration, mel frame length
121
+ ref_mel_len = ref_audio.shape[-1] // hop_length
122
+ if use_truth_duration:
123
+ gt_audio, gt_sr = torchaudio.load(gt_wav)
124
+ if gt_sr != target_sample_rate:
125
+ resampler = torchaudio.transforms.Resample(gt_sr, target_sample_rate)
126
+ gt_audio = resampler(gt_audio)
127
+ total_mel_len = ref_mel_len + int(gt_audio.shape[-1] / hop_length / speed)
128
+
129
+ # # test vocoder resynthesis
130
+ # ref_audio = gt_audio
131
+ else:
132
+ ref_text_len = len(prompt_text.encode("utf-8"))
133
+ gen_text_len = len(gt_text.encode("utf-8"))
134
+ total_mel_len = ref_mel_len + int(ref_mel_len / ref_text_len * gen_text_len / speed)
135
+
136
+ # to mel spectrogram
137
+ ref_mel = mel_spectrogram(ref_audio)
138
+ ref_mel = ref_mel.squeeze(0)
139
+
140
+ # deal with batch
141
+ assert infer_batch_size > 0, "infer_batch_size should be greater than 0."
142
+ assert (
143
+ min_tokens <= total_mel_len <= max_tokens
144
+ ), f"Audio {utt} has duration {total_mel_len*hop_length//target_sample_rate}s out of range [{min_secs}, {max_secs}]."
145
+ bucket_i = math.floor((total_mel_len - min_tokens) / (max_tokens - min_tokens + 1) * num_buckets)
146
+
147
+ utts[bucket_i].append(utt)
148
+ ref_rms_list[bucket_i].append(ref_rms)
149
+ ref_mels[bucket_i].append(ref_mel)
150
+ ref_mel_lens[bucket_i].append(ref_mel_len)
151
+ total_mel_lens[bucket_i].append(total_mel_len)
152
+ final_text_list[bucket_i].extend(text_list)
153
+
154
+ batch_accum[bucket_i] += total_mel_len
155
+
156
+ if batch_accum[bucket_i] >= infer_batch_size:
157
+ # print(f"\n{len(ref_mels[bucket_i][0][0])}\n{ref_mel_lens[bucket_i]}\n{total_mel_lens[bucket_i]}")
158
+ prompts_all.append(
159
+ (
160
+ utts[bucket_i],
161
+ ref_rms_list[bucket_i],
162
+ padded_mel_batch(ref_mels[bucket_i]),
163
+ ref_mel_lens[bucket_i],
164
+ total_mel_lens[bucket_i],
165
+ final_text_list[bucket_i],
166
+ )
167
+ )
168
+ batch_accum[bucket_i] = 0
169
+ (
170
+ utts[bucket_i],
171
+ ref_rms_list[bucket_i],
172
+ ref_mels[bucket_i],
173
+ ref_mel_lens[bucket_i],
174
+ total_mel_lens[bucket_i],
175
+ final_text_list[bucket_i],
176
+ ) = [], [], [], [], [], []
177
+
178
+ # add residual
179
+ for bucket_i, bucket_frames in enumerate(batch_accum):
180
+ if bucket_frames > 0:
181
+ prompts_all.append(
182
+ (
183
+ utts[bucket_i],
184
+ ref_rms_list[bucket_i],
185
+ padded_mel_batch(ref_mels[bucket_i]),
186
+ ref_mel_lens[bucket_i],
187
+ total_mel_lens[bucket_i],
188
+ final_text_list[bucket_i],
189
+ )
190
+ )
191
+ # shuffle so that no worker is left with only the easy batches
192
+ random.seed(666)
193
+ random.shuffle(prompts_all)
194
+
195
+ return prompts_all
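When `use_truth_duration` is off, `get_inference_prompt` estimates the total mel length from the ratio of UTF-8 byte lengths of the two texts, then maps that length to one of `num_buckets` buckets. The arithmetic can be sketched on its own as below; the function names are hypothetical, and the numbers are illustrative rather than taken from a real test set.

```python
import math


def estimate_total_mel_len(ref_mel_len, prompt_text, gen_text, speed=1.0):
    # Frames per byte of prompt text, scaled to the text to generate.
    ref_text_len = len(prompt_text.encode("utf-8"))
    gen_text_len = len(gen_text.encode("utf-8"))
    return ref_mel_len + int(ref_mel_len / ref_text_len * gen_text_len / speed)


def bucket_index(total_mel_len, min_tokens, max_tokens, num_buckets=200):
    # Linear mapping of [min_tokens, max_tokens] onto 0..num_buckets-1.
    return math.floor((total_mel_len - min_tokens) / (max_tokens - min_tokens + 1) * num_buckets)


# Defaults from the function signature: 24 kHz audio, hop length 256, 3-40 s.
min_tokens = 3 * 24000 // 256   # 281 frames
max_tokens = 40 * 24000 // 256  # 3750 frames

total = estimate_total_mel_len(300, "hello ", "hello world ")
bucket = bucket_index(total, min_tokens, max_tokens)
```

The `+ 1` in the denominator keeps the index strictly below `num_buckets` even when `total_mel_len` equals `max_tokens`.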
196
+
197
+
198
+ # get wav_res_ref_text of seed-tts test metalst
199
+ # https://github.com/BytedanceSpeech/seed-tts-eval
200
+
201
+
202
+ def get_seed_tts_test(metalst, gen_wav_dir, gpus):
203
+ f = open(metalst)
204
+ lines = f.readlines()
205
+ f.close()
206
+
207
+ test_set_ = []
208
+ for line in tqdm(lines):
209
+ if len(line.strip().split("|")) == 5:
210
+ utt, prompt_text, prompt_wav, gt_text, gt_wav = line.strip().split("|")
211
+ elif len(line.strip().split("|")) == 4:
212
+ utt, prompt_text, prompt_wav, gt_text = line.strip().split("|")
213
+
214
+ if not os.path.exists(os.path.join(gen_wav_dir, utt + ".wav")):
215
+ continue
216
+ gen_wav = os.path.join(gen_wav_dir, utt + ".wav")
217
+ if not os.path.isabs(prompt_wav):
218
+ prompt_wav = os.path.join(os.path.dirname(metalst), prompt_wav)
219
+
220
+ test_set_.append((gen_wav, prompt_wav, gt_text))
221
+
222
+ num_jobs = len(gpus)
223
+ if num_jobs == 1:
224
+ return [(gpus[0], test_set_)]
225
+
226
+ wav_per_job = len(test_set_) // num_jobs + 1
227
+ test_set = []
228
+ for i in range(num_jobs):
229
+ test_set.append((gpus[i], test_set_[i * wav_per_job : (i + 1) * wav_per_job]))
230
+
231
+ return test_set
232
+
233
+
234
+ # get librispeech test-clean cross sentence test
235
+
236
+
237
+ def get_librispeech_test(metalst, gen_wav_dir, gpus, librispeech_test_clean_path, eval_ground_truth=False):
238
+ f = open(metalst)
239
+ lines = f.readlines()
240
+ f.close()
241
+
242
+ test_set_ = []
243
+ for line in tqdm(lines):
244
+ ref_utt, ref_dur, ref_txt, gen_utt, gen_dur, gen_txt = line.strip().split("\t")
245
+
246
+ if eval_ground_truth:
247
+ gen_spk_id, gen_chaptr_id, _ = gen_utt.split("-")
248
+ gen_wav = os.path.join(librispeech_test_clean_path, gen_spk_id, gen_chaptr_id, gen_utt + ".flac")
249
+ else:
250
+ if not os.path.exists(os.path.join(gen_wav_dir, gen_utt + ".wav")):
251
+ raise FileNotFoundError(f"Generated wav not found: {gen_utt}")
252
+ gen_wav = os.path.join(gen_wav_dir, gen_utt + ".wav")
253
+
254
+ ref_spk_id, ref_chaptr_id, _ = ref_utt.split("-")
255
+ ref_wav = os.path.join(librispeech_test_clean_path, ref_spk_id, ref_chaptr_id, ref_utt + ".flac")
256
+
257
+ test_set_.append((gen_wav, ref_wav, gen_txt))
258
+
259
+ num_jobs = len(gpus)
260
+ if num_jobs == 1:
261
+ return [(gpus[0], test_set_)]
262
+
263
+ wav_per_job = len(test_set_) // num_jobs + 1
264
+ test_set = []
265
+ for i in range(num_jobs):
266
+ test_set.append((gpus[i], test_set_[i * wav_per_job : (i + 1) * wav_per_job]))
267
+
268
+ return test_set
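Both `get_seed_tts_test` and `get_librispeech_test` split the collected test items into one contiguous slice per GPU using `len(test_set_) // num_jobs + 1` items per job. A standalone sketch of that split (the name `shard_for_workers` is hypothetical) makes the resulting shard sizes easy to inspect:

```python
def shard_for_workers(items, gpus):
    # Same slicing rule as in the two functions above.
    num_jobs = len(gpus)
    if num_jobs == 1:
        return [(gpus[0], items)]
    per_job = len(items) // num_jobs + 1
    return [(gpus[i], items[i * per_job : (i + 1) * per_job]) for i in range(num_jobs)]


shards = shard_for_workers(list(range(10)), [0, 1, 2, 3])
for gpu, shard in shards:
    print(gpu, shard)
```

Because the per-job size is rounded up, the last shard can be shorter than the others, and for small inputs it may even be empty (6 items over 4 GPUs gives shards of 2, 2, 2 and 0).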
269
+
270
+
271
+ # load asr model
272
+
273
+
274
+ def load_asr_model(lang, ckpt_dir=""):
275
+ if lang == "zh":
276
+ from funasr import AutoModel
277
+
278
+ model = AutoModel(
279
+ model=os.path.join(ckpt_dir, "paraformer-zh"),
280
+ # vad_model = os.path.join(ckpt_dir, "fsmn-vad"),
281
+ # punc_model = os.path.join(ckpt_dir, "ct-punc"),
282
+ # spk_model = os.path.join(ckpt_dir, "cam++"),
283
+ disable_update=True,
284
+ ) # following seed-tts setting
285
+ elif lang == "en":
286
+ from faster_whisper import WhisperModel
287
+
288
+ model_size = "large-v3" if ckpt_dir == "" else ckpt_dir
289
+ model = WhisperModel(model_size, device="cuda", compute_type="float16")
290
+ return model
291
+
292
+
293
+ # WER Evaluation, the way Seed-TTS does
294
+
295
+
296
+ def run_asr_wer(args):
297
+ rank, lang, test_set, ckpt_dir = args
298
+
299
+ if lang == "zh":
300
+ import zhconv
301
+
302
+ torch.cuda.set_device(rank)
303
+ elif lang == "en":
304
+ os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
305
+ else:
306
+ raise NotImplementedError(
307
+ "lang support only 'zh' (funasr paraformer-zh), 'en' (faster-whisper-large-v3), for now."
308
+ )
309
+
310
+ asr_model = load_asr_model(lang, ckpt_dir=ckpt_dir)
311
+
312
+ from zhon.hanzi import punctuation
313
+
314
+ punctuation_all = punctuation + string.punctuation
315
+ wers = []
316
+
317
+ from jiwer import compute_measures
318
+
319
+ for gen_wav, prompt_wav, truth in tqdm(test_set):
320
+ if lang == "zh":
321
+ res = asr_model.generate(input=gen_wav, batch_size_s=300, disable_pbar=True)
322
+ hypo = res[0]["text"]
323
+ hypo = zhconv.convert(hypo, "zh-cn")
324
+ elif lang == "en":
325
+ segments, _ = asr_model.transcribe(gen_wav, beam_size=5, language="en")
326
+ hypo = ""
327
+ for segment in segments:
328
+ hypo = hypo + " " + segment.text
329
+
330
+ # raw_truth = truth
331
+ # raw_hypo = hypo
332
+
333
+ for x in punctuation_all:
334
+ truth = truth.replace(x, "")
335
+ hypo = hypo.replace(x, "")
336
+
337
+ truth = truth.replace("  ", " ")
338
+ hypo = hypo.replace("  ", " ")
339
+
340
+ if lang == "zh":
341
+ truth = " ".join([x for x in truth])
342
+ hypo = " ".join([x for x in hypo])
343
+ elif lang == "en":
344
+ truth = truth.lower()
345
+ hypo = hypo.lower()
346
+
347
+ measures = compute_measures(truth, hypo)
348
+ wer = measures["wer"]
349
+
350
+ # ref_list = truth.split(" ")
351
+ # subs = measures["substitutions"] / len(ref_list)
352
+ # dele = measures["deletions"] / len(ref_list)
353
+ # inse = measures["insertions"] / len(ref_list)
354
+
355
+ wers.append(wer)
356
+
357
+ return wers
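`run_asr_wer` strips punctuation, normalises case (English) or splits into characters (Chinese), and then takes the `wer` value from `jiwer`. To show what that number measures, here is a dependency-free sketch: it computes word error rate as the word-level Levenshtein distance divided by the reference length. It is an illustration only; the helper names are hypothetical, and unlike the function above it removes ASCII punctuation only (no `zhon` punctuation set).

```python
import string


def normalize(text, lang):
    # Drop ASCII punctuation, then collapse double spaces.
    for ch in string.punctuation:
        text = text.replace(ch, "")
    text = text.replace("  ", " ")
    if lang == "zh":
        return " ".join(text)  # one token per character
    return text.lower()


def word_error_rate(truth, hypo):
    ref, hyp = truth.split(), hypo.split()
    # Rolling-row Levenshtein distance over word tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / len(ref)


score = word_error_rate(normalize("Hello, world!", "en"), normalize("hello word", "en"))
```

Splitting Chinese text into single characters means the same routine yields a character error rate for `lang == "zh"`.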
358
+
359
+
360
+ # SIM Evaluation
361
+
362
+
363
+ def run_sim(args):
364
+ rank, test_set, ckpt_dir = args
365
+ device = f"cuda:{rank}"
366
+
367
+ model = ECAPA_TDNN_SMALL(feat_dim=1024, feat_type="wavlm_large", config_path=None)
368
+ state_dict = torch.load(ckpt_dir, weights_only=True, map_location=lambda storage, loc: storage)
369
+ model.load_state_dict(state_dict["model"], strict=False)
370
+
371
+ use_gpu = torch.cuda.is_available()
372
+ if use_gpu:
373
+ model = model.cuda(device)
374
+ model.eval()
375
+
376
+ sim_list = []
377
+ for wav1, wav2, truth in tqdm(test_set):
378
+ wav1, sr1 = torchaudio.load(wav1)
379
+ wav2, sr2 = torchaudio.load(wav2)
380
+
381
+ resample1 = torchaudio.transforms.Resample(orig_freq=sr1, new_freq=16000)
382
+ resample2 = torchaudio.transforms.Resample(orig_freq=sr2, new_freq=16000)
383
+ wav1 = resample1(wav1)
384
+ wav2 = resample2(wav2)
385
+
386
+ if use_gpu:
387
+ wav1 = wav1.cuda(device)
388
+ wav2 = wav2.cuda(device)
389
+ with torch.no_grad():
390
+ emb1 = model(wav1)
391
+ emb2 = model(wav2)
392
+
393
+ sim = F.cosine_similarity(emb1, emb2)[0].item()
394
+ # print(f"VSim score between two audios: {sim:.4f} (-1.0, 1.0).")
395
+ sim_list.append(sim)
396
+
397
+ return sim_list
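The SIM score returned by `run_sim` is the cosine similarity between the two speaker embeddings, so it lies in the range -1.0 to 1.0. A plain-Python sketch of the metric, assuming non-zero vectors (`F.cosine_similarity` additionally guards against zero norms with a small epsilon, which this sketch does not):

```python
import math


def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); raises ZeroDivisionError for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```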
src/f5_tts/infer/README.md ADDED
@@ -0,0 +1,111 @@
1
+ # Inference
2
+
3
+ The pretrained model checkpoints are available at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), and are downloaded automatically when running the inference scripts.
4
+
5
+ A single generation pass currently supports up to **30s**, which is the **total length** of the prompt plus the output audio. You can still give `infer_cli` and `infer_gradio` longer text; they will automatically generate it in chunks. Long reference audio will be **clipped to ~15s**.
6
+
7
+ To avoid possible inference failures, make sure you have read through the following instructions.
8
+
9
+ - Uppercase letters will be uttered letter by letter, so use lowercase letters for normal words.
10
+ - Add spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses.
11
+ - Convert numbers to Chinese characters if you want them read in Chinese; otherwise they are read in English.
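Since all-caps words are spelled out letter by letter, one way to prepare input text is to lowercase them beforehand while keeping genuine acronyms. The helper below is a hypothetical preprocessing sketch, not part of F5-TTS; the `keep` set of acronyms is an assumption you would fill in yourself.

```python
import re


def lowercase_shouted_words(text, keep=()):
    """Lowercase words written fully in capitals, except those listed in `keep`."""

    def fix(match):
        word = match.group(0)
        return word if word in keep else word.lower()

    # Words of two or more consecutive capital letters.
    return re.sub(r"\b[A-Z]{2,}\b", fix, text)


cleaned = lowercase_shouted_words("PLEASE read the FAQ", keep={"FAQ"})
```

Here `PLEASE` becomes `please` so it is read as a word, while `FAQ` stays uppercase and is still spelled out.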
12
+
13
+
14
+ ## Gradio App
15
+
16
+ Currently supported features:
17
+
18
+ - Basic TTS with Chunk Inference
19
+ - Multi-Style / Multi-Speaker Generation
20
+ - Voice Chat powered by Qwen2.5-3B-Instruct
21
+
22
+ The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.
23
+
24
+ The script will load model checkpoints from Hugging Face. You can also manually download the files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS models are loaded at first; an ASR model is loaded for transcription if `ref_text` is not provided, and an LLM is loaded if you use Voice Chat.
25
+
26
+ It can also be used as a component of a larger application.
27
+ ```python
28
+ import gradio as gr
29
+ from f5_tts.infer.infer_gradio import app
30
+
31
+ with gr.Blocks() as main_app:
32
+ gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
33
+
34
+ # ... other Gradio components
35
+
36
+ app.render()
37
+
38
+ main_app.launch()
39
+ ```
40
+
41
+
42
+ ## CLI Inference
43
+
44
+ The CLI command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, which is a command-line tool for inference.
45
+
46
+ The script will load model checkpoints from Hugging Face. You can also manually download the files and use `--ckpt_file` to specify the model you want to load, or update the path directly in `infer_cli.py`.
47
+
48
+ To use a different vocabulary, pass your `vocab.txt` file with `--vocab_file`.
49
+
50
+ The simplest way to run inference is with flags:
51
+ ```bash
52
+ # Leaving --ref_text "" makes the ASR model transcribe the reference audio (extra GPU memory usage)
53
+ f5-tts_infer-cli \
54
+ --model "F5-TTS" \
55
+ --ref_audio "ref_audio.wav" \
56
+ --ref_text "The content, subtitle or transcription of reference audio." \
57
+ --gen_text "Some text you want the TTS model to generate for you."
58
+ ```
59
+
60
+ A `.toml` file allows more flexible usage.
61
+
62
+ ```bash
63
+ f5-tts_infer-cli -c custom.toml
64
+ ```
65
+
66
+ For example, you can use a `.toml` file to pass in variables; see `src/f5_tts/infer/examples/basic/basic.toml`:
67
+
68
+ ```toml
69
+ # F5-TTS | E2-TTS
70
+ model = "F5-TTS"
71
+ ref_audio = "infer/examples/basic/basic_ref_en.wav"
72
+ # If an empty "", transcribes the reference audio automatically.
73
+ ref_text = "Some call me nature, others call me mother nature."
74
+ gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
75
+ # File with text to generate. Ignores the text above.
76
+ gen_file = ""
77
+ remove_silence = false
78
+ output_dir = "tests"
79
+ ```
80
+
81
+ You can also use a `.toml` file for multi-style generation; see `src/f5_tts/infer/examples/multi/story.toml`.
82
+
83
+ ```toml
84
+ # F5-TTS | E2-TTS
85
+ model = "F5-TTS"
86
+ ref_audio = "infer/examples/multi/main.flac"
87
+ # If an empty "", transcribes the reference audio automatically.
88
+ ref_text = ""
89
+ gen_text = ""
90
+ # File with text to generate. Ignores the text above.
91
+ gen_file = "infer/examples/multi/story.txt"
92
+ remove_silence = true
93
+ output_dir = "tests"
94
+
95
+ [voices.town]
96
+ ref_audio = "infer/examples/multi/town.flac"
97
+ ref_text = ""
98
+
99
+ [voices.country]
100
+ ref_audio = "infer/examples/multi/country.flac"
101
+ ref_text = ""
102
+ ```
103
+ Mark the text with `[main]`, `[town]`, or `[country]` wherever you want to change voice; see `src/f5_tts/infer/examples/multi/story.txt`.
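To make the tag convention concrete, the sketch below splits tagged text into `(voice, text)` segments with a regular expression. It illustrates the format only and is not the parsing code used by `infer_cli.py`; the function name `split_by_voice` and the assumption that untagged leading text belongs to `main` are hypothetical.

```python
import re


def split_by_voice(text, default_voice="main"):
    """Return a list of (voice, chunk) pairs for text tagged like '[town] ...'."""
    # A capturing group makes re.split keep the tag names:
    # [text0, tag1, text1, tag2, text2, ...]
    parts = re.split(r"\[(\w+)\]", text)
    segments = []
    first = parts[0].strip()
    if first:
        segments.append((default_voice, first))
    for voice, chunk in zip(parts[1::2], parts[2::2]):
        chunk = chunk.strip()
        if chunk:
            segments.append((voice, chunk))
    return segments


segments = split_by_voice("Once upon a time. [town] Hello! [main] said he.")
```

Each segment would then be synthesised with the reference audio configured for its voice, for example `[voices.town]` in the `.toml` file.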
104
+
105
+ ## Speech Editing
106
+
107
+ To test speech editing capabilities, use the following command:
108
+
109
+ ```bash
110
+ python src/f5_tts/infer/speech_edit.py
111
+ ```
src/f5_tts/infer/examples/basic/basic.toml ADDED
@@ -0,0 +1,10 @@
1
+ # F5-TTS | E2-TTS
2
+ model = "F5-TTS"
3
+ ref_audio = "infer/examples/basic/basic_ref_en.wav"
4
+ # If left empty (""), the reference audio is transcribed automatically.
5
+ ref_text = "Some call me nature, others call me mother nature."
6
+ gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
7
+ # File with text to generate. Ignores the text above.
8
+ gen_file = ""
9
+ remove_silence = false
10
+ output_dir = "tests"
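The way such a config could be combined with command-line overrides can be sketched as follows. This is a minimal illustration, not the project's actual loader: `resolve_settings` is a hypothetical helper, only the `--ref_text` and `--gen_text` flags shown in the README are used, and the TOML text is a shortened copy of `basic.toml`. It assumes Python 3.11+ (`tomllib`) or an installed `tomli`.

```python
import argparse

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli is installed on older interpreters

# Shortened copy of basic.toml, inlined so the sketch is self-contained.
CONFIG_TEXT = '''
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me."
gen_file = ""
remove_silence = false
output_dir = "tests"
'''


def resolve_settings(config_text, argv):
    """Hypothetical helper: TOML values are defaults, CLI flags override them."""
    config = tomllib.loads(config_text)
    parser = argparse.ArgumentParser()
    parser.add_argument("--ref_text")
    parser.add_argument("--gen_text")
    args = parser.parse_args(argv)

    settings = dict(config)
    for key, value in vars(args).items():
        if value is not None:  # only flags that were actually passed
            settings[key] = value
    return settings


settings = resolve_settings(CONFIG_TEXT, ["--gen_text", "Hello there."])
print(settings["model"], "|", settings["gen_text"], "|", settings["remove_silence"])
```

Keeping the file values as defaults and applying only the flags that were explicitly passed matches the CLI epilog's description ("override one or more settings from config"), though the real script may resolve them differently.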
src/f5_tts/infer/examples/basic/basic_ref_en.wav ADDED
Binary file (256 kB).
src/f5_tts/infer/examples/basic/basic_ref_zh.wav ADDED
Binary file (325 kB).
src/f5_tts/infer/examples/multi/country.flac ADDED
Binary file (180 kB).
src/f5_tts/infer/examples/multi/main.flac ADDED
Binary file (279 kB).
src/f5_tts/infer/examples/multi/story.toml ADDED
@@ -0,0 +1,19 @@
1
+ # F5-TTS | E2-TTS
2
+ model = "F5-TTS"
3
+ ref_audio = "infer/examples/multi/main.flac"
4
+ # If left empty (""), the reference audio is transcribed automatically.
5
+ ref_text = ""
6
+ gen_text = ""
7
+ # File with text to generate. Ignores the text above.
8
+ gen_file = "infer/examples/multi/story.txt"
9
+ remove_silence = true
10
+ output_dir = "tests"
11
+
12
+ [voices.town]
13
+ ref_audio = "infer/examples/multi/town.flac"
14
+ ref_text = ""
15
+
16
+ [voices.country]
17
+ ref_audio = "infer/examples/multi/country.flac"
18
+ ref_text = ""
19
+
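To make the structure of this file concrete, the sketch below parses a trimmed copy of `story.toml` and gathers the reference voices into one mapping. It is an illustration under stated assumptions: `collect_voices` is a hypothetical helper, and treating the top-level `ref_audio`/`ref_text` pair as the `main` voice is inferred from the `[main]` tag used in `story.txt`, not confirmed by this diff.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli is installed on older interpreters

# Trimmed copy of story.toml, inlined so the sketch is self-contained.
STORY_CONFIG = '''
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
ref_text = ""
gen_file = "infer/examples/multi/story.txt"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""
'''


def collect_voices(config):
    """Hypothetical helper: map each voice name to its reference audio/text."""
    # Each [voices.NAME] table becomes config["voices"]["NAME"] after parsing.
    voices = dict(config.get("voices", {}))
    # Assumption: the top-level reference pair acts as the "main" voice.
    voices["main"] = {
        "ref_audio": config["ref_audio"],
        "ref_text": config["ref_text"],
    }
    return voices


voices = collect_voices(tomllib.loads(STORY_CONFIG))
for name in sorted(voices):
    print(name, "->", voices[name]["ref_audio"])
```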
src/f5_tts/infer/examples/multi/story.txt ADDED
@@ -0,0 +1 @@
1
+ A Town Mouse and a Country Mouse were acquaintances, and the Country Mouse one day invited his friend to come and see him at his home in the fields. The Town Mouse came, and they sat down to a dinner of barleycorns and roots, the latter of which had a distinctly earthy flavour. The fare was not much to the taste of the guest, and presently he broke out with [town] “My poor dear friend, you live here no better than the ants. Now, you should just see how I fare! My larder is a regular horn of plenty. You must come and stay with me, and I promise you you shall live on the fat of the land.” [main] So when he returned to town he took the Country Mouse with him, and showed him into a larder containing flour and oatmeal and figs and honey and dates. The Country Mouse had never seen anything like it, and sat down to enjoy the luxuries his friend provided: but before they had well begun, the door of the larder opened and someone came in. The two Mice scampered off and hid themselves in a narrow and exceedingly uncomfortable hole. Presently, when all was quiet, they ventured out again; but someone else came in, and off they scuttled again. This was too much for the visitor. [country] “Goodbye,” [main] said he, [country] “I’m off. You live in the lap of luxury, I can see, but you are surrounded by dangers; whereas at home I can enjoy my simple dinner of roots and corn in peace.”
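The inline `[main]`, `[town]`, and `[country]` markers in the story above can be split into per-voice segments roughly as follows. This is a minimal sketch, not the project's parser: `split_by_voice` is a hypothetical helper, and the rule that text before the first marker belongs to `main` is an assumption based on how `story.txt` begins.

```python
import re

# A voice marker is a word in square brackets, e.g. [town].
TAG_PATTERN = re.compile(r"\[(\w+)\]")


def split_by_voice(text, default_voice="main"):
    """Hypothetical helper: return (voice, text) pairs in reading order."""
    segments = []
    voice = default_voice  # assumption: untagged leading text uses the main voice
    position = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[position:match.start()].strip()
        if chunk:
            segments.append((voice, chunk))
        voice = match.group(1)  # switch voice for the text that follows
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append((voice, tail))
    return segments


sample = (
    'He broke out with [town] "My poor dear friend." '
    '[main] So he returned. [country] "Goodbye," [main] said he.'
)
for voice, chunk in split_by_voice(sample):
    print(f"{voice}: {chunk}")
```

Each resulting pair could then be synthesized with the reference audio registered under that voice name.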
src/f5_tts/infer/examples/multi/town.flac ADDED
Binary file (229 kB).
src/f5_tts/infer/examples/vocab.txt ADDED
@@ -0,0 +1,2545 @@
1
+
2
+ !
3
+ "
4
+ #
5
+ $
6
+ %
7
+ &
8
+ '
9
+ (
10
+ )
11
+ *
12
+ +
13
+ ,
14
+ -
15
+ .
16
+ /
17
+ 0
18
+ 1
19
+ 2
20
+ 3
21
+ 4
22
+ 5
23
+ 6
24
+ 7
25
+ 8
26
+ 9
27
+ :
28
+ ;
29
+ =
30
+ >
31
+ ?
32
+ @
33
+ A
34
+ B
35
+ C
36
+ D
37
+ E
38
+ F
39
+ G
40
+ H
41
+ I
42
+ J
43
+ K
44
+ L
45
+ M
46
+ N
47
+ O
48
+ P
49
+ Q
50
+ R
51
+ S
52
+ T
53
+ U
54
+ V
55
+ W
56
+ X
57
+ Y
58
+ Z
59
+ [
60
+ \
61
+ ]
62
+ _
63
+ a
64
+ a1
65
+ ai1
66
+ ai2
67
+ ai3
68
+ ai4
69
+ an1
70
+ an3
71
+ an4
72
+ ang1
73
+ ang2
74
+ ang4
75
+ ao1
76
+ ao2
77
+ ao3
78
+ ao4
79
+ b
80
+ ba
81
+ ba1
82
+ ba2
83
+ ba3
84
+ ba4
85
+ bai1
86
+ bai2
87
+ bai3
88
+ bai4
89
+ ban1
90
+ ban2
91
+ ban3
92
+ ban4
93
+ bang1
94
+ bang2
95
+ bang3
96
+ bang4
97
+ bao1
98
+ bao2
99
+ bao3
100
+ bao4
101
+ bei
102
+ bei1
103
+ bei2
104
+ bei3
105
+ bei4
106
+ ben1
107
+ ben2
108
+ ben3
109
+ ben4
110
+ beng
111
+ beng1
112
+ beng2
113
+ beng3
114
+ beng4
115
+ bi1
116
+ bi2
117
+ bi3
118
+ bi4
119
+ bian1
120
+ bian2
121
+ bian3
122
+ bian4
123
+ biao1
124
+ biao2
125
+ biao3
126
+ bie1
127
+ bie2
128
+ bie3
129
+ bie4
130
+ bin1
131
+ bin4
132
+ bing1
133
+ bing2
134
+ bing3
135
+ bing4
136
+ bo
137
+ bo1
138
+ bo2
139
+ bo3
140
+ bo4
141
+ bu2
142
+ bu3
143
+ bu4
144
+ c
145
+ ca1
146
+ cai1
147
+ cai2
148
+ cai3
149
+ cai4
150
+ can1
151
+ can2
152
+ can3
153
+ can4
154
+ cang1
155
+ cang2
156
+ cao1
157
+ cao2
158
+ cao3
159
+ ce4
160
+ cen1
161
+ cen2
162
+ ceng1
163
+ ceng2
164
+ ceng4
165
+ cha1
166
+ cha2
167
+ cha3
168
+ cha4
169
+ chai1
170
+ chai2
171
+ chan1
172
+ chan2
173
+ chan3
174
+ chan4
175
+ chang1
176
+ chang2
177
+ chang3
178
+ chang4
179
+ chao1
180
+ chao2
181
+ chao3
182
+ che1
183
+ che2
184
+ che3
185
+ che4
186
+ chen1
187
+ chen2
188
+ chen3
189
+ chen4
190
+ cheng1
191
+ cheng2
192
+ cheng3
193
+ cheng4
194
+ chi1
195
+ chi2
196
+ chi3
197
+ chi4
198
+ chong1
199
+ chong2
200
+ chong3
201
+ chong4
202
+ chou1
203
+ chou2
204
+ chou3
205
+ chou4
206
+ chu1
207
+ chu2
208
+ chu3
209
+ chu4
210
+ chua1
211
+ chuai1
212
+ chuai2
213
+ chuai3
214
+ chuai4
215
+ chuan1
216
+ chuan2
217
+ chuan3
218
+ chuan4
219
+ chuang1
220
+ chuang2
221
+ chuang3
222
+ chuang4
223
+ chui1
224
+ chui2
225
+ chun1
226
+ chun2
227
+ chun3
228
+ chuo1
229
+ chuo4
230
+ ci1
231
+ ci2
232
+ ci3
233
+ ci4
234
+ cong1
235
+ cong2
236
+ cou4
237
+ cu1
238
+ cu4
239
+ cuan1
240
+ cuan2
241
+ cuan4
242
+ cui1
243
+ cui3
244
+ cui4
245
+ cun1
246
+ cun2
247
+ cun4
248
+ cuo1
249
+ cuo2
250
+ cuo4
251
+ d
252
+ da
253
+ da1
254
+ da2
255
+ da3
256
+ da4
257
+ dai1
258
+ dai2
259
+ dai3
260
+ dai4
261
+ dan1
262
+ dan2
263
+ dan3
264
+ dan4
265
+ dang1
266
+ dang2
267
+ dang3
268
+ dang4
269
+ dao1
270
+ dao2
271
+ dao3
272
+ dao4
273
+ de
274
+ de1
275
+ de2
276
+ dei3
277
+ den4
278
+ deng1
279
+ deng2
280
+ deng3
281
+ deng4
282
+ di1
283
+ di2
284
+ di3
285
+ di4
286
+ dia3
287
+ dian1
288
+ dian2
289
+ dian3
290
+ dian4
291
+ diao1
292
+ diao3
293
+ diao4
294
+ die1
295
+ die2
296
+ die4
297
+ ding1
298
+ ding2
299
+ ding3
300
+ ding4
301
+ diu1
302
+ dong1
303
+ dong3
304
+ dong4
305
+ dou1
306
+ dou2
307
+ dou3
308
+ dou4
309
+ du1
310
+ du2
311
+ du3
312
+ du4
313
+ duan1
314
+ duan2
315
+ duan3
316
+ duan4
317
+ dui1
318
+ dui4
319
+ dun1
320
+ dun3
321
+ dun4
322
+ duo1
323
+ duo2
324
+ duo3
325
+ duo4
326
+ e
327
+ e1
328
+ e2
329
+ e3
330
+ e4
331
+ ei2
332
+ en1
333
+ en4
334
+ er
335
+ er2
336
+ er3
337
+ er4
338
+ f
339
+ fa1
340
+ fa2
341
+ fa3
342
+ fa4
343
+ fan1
344
+ fan2
345
+ fan3
346
+ fan4
347
+ fang1
348
+ fang2
349
+ fang3
350
+ fang4
351
+ fei1
352
+ fei2
353
+ fei3
354
+ fei4
355
+ fen1
356
+ fen2
357
+ fen3
358
+ fen4
359
+ feng1
360
+ feng2
361
+ feng3
362
+ feng4
363
+ fo2
364
+ fou2
365
+ fou3
366
+ fu1
367
+ fu2
368
+ fu3
369
+ fu4
370
+ g
371
+ ga1
372
+ ga2
373
+ ga3
374
+ ga4
375
+ gai1
376
+ gai2
377
+ gai3
378
+ gai4
379
+ gan1
380
+ gan2
381
+ gan3
382
+ gan4
383
+ gang1
384
+ gang2
385
+ gang3
386
+ gang4
387
+ gao1
388
+ gao2
389
+ gao3
390
+ gao4
391
+ ge1
392
+ ge2
393
+ ge3
394
+ ge4
395
+ gei2
396
+ gei3
397
+ gen1
398
+ gen2
399
+ gen3
400
+ gen4
401
+ geng1
402
+ geng3
403
+ geng4
404
+ gong1
405
+ gong3
406
+ gong4
407
+ gou1
408
+ gou2
409
+ gou3
410
+ gou4
411
+ gu
412
+ gu1
413
+ gu2
414
+ gu3
415
+ gu4
416
+ gua1
417
+ gua2
418
+ gua3
419
+ gua4
420
+ guai1
421
+ guai2
422
+ guai3
423
+ guai4
424
+ guan1
425
+ guan2
426
+ guan3
427
+ guan4
428
+ guang1
429
+ guang2
430
+ guang3
431
+ guang4
432
+ gui1
433
+ gui2
434
+ gui3
435
+ gui4
436
+ gun3
437
+ gun4
438
+ guo1
439
+ guo2
440
+ guo3
441
+ guo4
442
+ h
443
+ ha1
444
+ ha2
445
+ ha3
446
+ hai1
447
+ hai2
448
+ hai3
449
+ hai4
450
+ han1
451
+ han2
452
+ han3
453
+ han4
454
+ hang1
455
+ hang2
456
+ hang4
457
+ hao1
458
+ hao2
459
+ hao3
460
+ hao4
461
+ he1
462
+ he2
463
+ he4
464
+ hei1
465
+ hen2
466
+ hen3
467
+ hen4
468
+ heng1
469
+ heng2
470
+ heng4
471
+ hong1
472
+ hong2
473
+ hong3
474
+ hong4
475
+ hou1
476
+ hou2
477
+ hou3
478
+ hou4
479
+ hu1
480
+ hu2
481
+ hu3
482
+ hu4
483
+ hua1
484
+ hua2
485
+ hua4
486
+ huai2
487
+ huai4
488
+ huan1
489
+ huan2
490
+ huan3
491
+ huan4
492
+ huang1
493
+ huang2
494
+ huang3
495
+ huang4
496
+ hui1
497
+ hui2
498
+ hui3
499
+ hui4
500
+ hun1
501
+ hun2
502
+ hun4
503
+ huo
504
+ huo1
505
+ huo2
506
+ huo3
507
+ huo4
508
+ i
509
+ j
510
+ ji1
511
+ ji2
512
+ ji3
513
+ ji4
514
+ jia
515
+ jia1
516
+ jia2
517
+ jia3
518
+ jia4
519
+ jian1
520
+ jian2
521
+ jian3
522
+ jian4
523
+ jiang1
524
+ jiang2
525
+ jiang3
526
+ jiang4
527
+ jiao1
528
+ jiao2
529
+ jiao3
530
+ jiao4
531
+ jie1
532
+ jie2
533
+ jie3
534
+ jie4
535
+ jin1
536
+ jin2
537
+ jin3
538
+ jin4
539
+ jing1
540
+ jing2
541
+ jing3
542
+ jing4
543
+ jiong3
544
+ jiu1
545
+ jiu2
546
+ jiu3
547
+ jiu4
548
+ ju1
549
+ ju2
550
+ ju3
551
+ ju4
552
+ juan1
553
+ juan2
554
+ juan3
555
+ juan4
556
+ jue1
557
+ jue2
558
+ jue4
559
+ jun1
560
+ jun4
561
+ k
562
+ ka1
563
+ ka2
564
+ ka3
565
+ kai1
566
+ kai2
567
+ kai3
568
+ kai4
569
+ kan1
570
+ kan2
571
+ kan3
572
+ kan4
573
+ kang1
574
+ kang2
575
+ kang4
576
+ kao1
577
+ kao2
578
+ kao3
579
+ kao4
580
+ ke1
581
+ ke2
582
+ ke3
583
+ ke4
584
+ ken3
585
+ keng1
586
+ kong1
587
+ kong3
588
+ kong4
589
+ kou1
590
+ kou2
591
+ kou3
592
+ kou4
593
+ ku1
594
+ ku2
595
+ ku3
596
+ ku4
597
+ kua1
598
+ kua3
599
+ kua4
600
+ kuai3
601
+ kuai4
602
+ kuan1
603
+ kuan2
604
+ kuan3
605
+ kuang1
606
+ kuang2
607
+ kuang4
608
+ kui1
609
+ kui2
610
+ kui3
611
+ kui4
612
+ kun1
613
+ kun3
614
+ kun4
615
+ kuo4
616
+ l
617
+ la
618
+ la1
619
+ la2
620
+ la3
621
+ la4
622
+ lai2
623
+ lai4
624
+ lan2
625
+ lan3
626
+ lan4
627
+ lang1
628
+ lang2
629
+ lang3
630
+ lang4
631
+ lao1
632
+ lao2
633
+ lao3
634
+ lao4
635
+ le
636
+ le1
637
+ le4
638
+ lei
639
+ lei1
640
+ lei2
641
+ lei3
642
+ lei4
643
+ leng1
644
+ leng2
645
+ leng3
646
+ leng4
647
+ li
648
+ li1
649
+ li2
650
+ li3
651
+ li4
652
+ lia3
653
+ lian2
654
+ lian3
655
+ lian4
656
+ liang2
657
+ liang3
658
+ liang4
659
+ liao1
660
+ liao2
661
+ liao3
662
+ liao4
663
+ lie1
664
+ lie2
665
+ lie3
666
+ lie4
667
+ lin1
668
+ lin2
669
+ lin3
670
+ lin4
671
+ ling2
672
+ ling3
673
+ ling4
674
+ liu1
675
+ liu2
676
+ liu3
677
+ liu4
678
+ long1
679
+ long2
680
+ long3
681
+ long4
682
+ lou1
683
+ lou2
684
+ lou3
685
+ lou4
686
+ lu1
687
+ lu2
688
+ lu3
689
+ lu4
690
+ luan2
691
+ luan3
692
+ luan4
693
+ lun1
694
+ lun2
695
+ lun4
696
+ luo1
697
+ luo2
698
+ luo3
699
+ luo4
700
+ lv2
701
+ lv3
702
+ lv4
703
+ lve3
704
+ lve4
705
+ m
706
+ ma
707
+ ma1
708
+ ma2
709
+ ma3
710
+ ma4
711
+ mai2
712
+ mai3
713
+ mai4
714
+ man1
715
+ man2
716
+ man3
717
+ man4
718
+ mang2
719
+ mang3
720
+ mao1
721
+ mao2
722
+ mao3
723
+ mao4
724
+ me
725
+ mei2
726
+ mei3
727
+ mei4
728
+ men
729
+ men1
730
+ men2
731
+ men4
732
+ meng
733
+ meng1
734
+ meng2
735
+ meng3
736
+ meng4
737
+ mi1
738
+ mi2
739
+ mi3
740
+ mi4
741
+ mian2
742
+ mian3
743
+ mian4
744
+ miao1
745
+ miao2
746
+ miao3
747
+ miao4
748
+ mie1
749
+ mie4
750
+ min2
751
+ min3
752
+ ming2
753
+ ming3
754
+ ming4
755
+ miu4
756
+ mo1
757
+ mo2
758
+ mo3
759
+ mo4
760
+ mou1
761
+ mou2
762
+ mou3
763
+ mu2
764
+ mu3
765
+ mu4
766
+ n
767
+ n2
768
+ na1
769
+ na2
770
+ na3
771
+ na4
772
+ nai2
773
+ nai3
774
+ nai4
775
+ nan1
776
+ nan2
777
+ nan3
778
+ nan4
779
+ nang1
780
+ nang2
781
+ nang3
782
+ nao1
783
+ nao2
784
+ nao3
785
+ nao4
786
+ ne
787
+ ne2
788
+ ne4
789
+ nei3
790
+ nei4
791
+ nen4
792
+ neng2
793
+ ni1
794
+ ni2
795
+ ni3
796
+ ni4
797
+ nian1
798
+ nian2
799
+ nian3
800
+ nian4
801
+ niang2
802
+ niang4
803
+ niao2
804
+ niao3
805
+ niao4
806
+ nie1
807
+ nie4
808
+ nin2
809
+ ning2
810
+ ning3
811
+ ning4
812
+ niu1
813
+ niu2
814
+ niu3
815
+ niu4
816
+ nong2
817
+ nong4
818
+ nou4
819
+ nu2
820
+ nu3
821
+ nu4
822
+ nuan3
823
+ nuo2
824
+ nuo4
825
+ nv2
826
+ nv3
827
+ nve4
828
+ o
829
+ o1
830
+ o2
831
+ ou1
832
+ ou2
833
+ ou3
834
+ ou4
835
+ p
836
+ pa1
837
+ pa2
838
+ pa4
839
+ pai1
840
+ pai2
841
+ pai3
842
+ pai4
843
+ pan1
844
+ pan2
845
+ pan4
846
+ pang1
847
+ pang2
848
+ pang4
849
+ pao1
850
+ pao2
851
+ pao3
852
+ pao4
853
+ pei1
854
+ pei2
855
+ pei4
856
+ pen1
857
+ pen2
858
+ pen4
859
+ peng1
860
+ peng2
861
+ peng3
862
+ peng4
863
+ pi1
864
+ pi2
865
+ pi3
866
+ pi4
867
+ pian1
868
+ pian2
869
+ pian4
870
+ piao1
871
+ piao2
872
+ piao3
873
+ piao4
874
+ pie1
875
+ pie2
876
+ pie3
877
+ pin1
878
+ pin2
879
+ pin3
880
+ pin4
881
+ ping1
882
+ ping2
883
+ po1
884
+ po2
885
+ po3
886
+ po4
887
+ pou1
888
+ pu1
889
+ pu2
890
+ pu3
891
+ pu4
892
+ q
893
+ qi1
894
+ qi2
895
+ qi3
896
+ qi4
897
+ qia1
898
+ qia3
899
+ qia4
900
+ qian1
901
+ qian2
902
+ qian3
903
+ qian4
904
+ qiang1
905
+ qiang2
906
+ qiang3
907
+ qiang4
908
+ qiao1
909
+ qiao2
910
+ qiao3
911
+ qiao4
912
+ qie1
913
+ qie2
914
+ qie3
915
+ qie4
916
+ qin1
917
+ qin2
918
+ qin3
919
+ qin4
920
+ qing1
921
+ qing2
922
+ qing3
923
+ qing4
924
+ qiong1
925
+ qiong2
926
+ qiu1
927
+ qiu2
928
+ qiu3
929
+ qu1
930
+ qu2
931
+ qu3
932
+ qu4
933
+ quan1
934
+ quan2
935
+ quan3
936
+ quan4
937
+ que1
938
+ que2
939
+ que4
940
+ qun2
941
+ r
942
+ ran2
943
+ ran3
944
+ rang1
945
+ rang2
946
+ rang3
947
+ rang4
948
+ rao2
949
+ rao3
950
+ rao4
951
+ re2
952
+ re3
953
+ re4
954
+ ren2
955
+ ren3
956
+ ren4
957
+ reng1
958
+ reng2
959
+ ri4
960
+ rong1
961
+ rong2
962
+ rong3
963
+ rou2
964
+ rou4
965
+ ru2
966
+ ru3
967
+ ru4
968
+ ruan2
969
+ ruan3
970
+ rui3
971
+ rui4
972
+ run4
973
+ ruo4
974
+ s
975
+ sa1
976
+ sa2
977
+ sa3
978
+ sa4
979
+ sai1
980
+ sai4
981
+ san1
982
+ san2
983
+ san3
984
+ san4
985
+ sang1
986
+ sang3
987
+ sang4
988
+ sao1
989
+ sao2
990
+ sao3
991
+ sao4
992
+ se4
993
+ sen1
994
+ seng1
995
+ sha1
996
+ sha2
997
+ sha3
998
+ sha4
999
+ shai1
1000
+ shai2
1001
+ shai3
1002
+ shai4
1003
+ shan1
1004
+ shan3
1005
+ shan4
1006
+ shang
1007
+ shang1
1008
+ shang3
1009
+ shang4
1010
+ shao1
1011
+ shao2
1012
+ shao3
1013
+ shao4
1014
+ she1
1015
+ she2
1016
+ she3
1017
+ she4
1018
+ shei2
1019
+ shen1
1020
+ shen2
1021
+ shen3
1022
+ shen4
1023
+ sheng1
1024
+ sheng2
1025
+ sheng3
1026
+ sheng4
1027
+ shi
1028
+ shi1
1029
+ shi2
1030
+ shi3
1031
+ shi4
1032
+ shou1
1033
+ shou2
1034
+ shou3
1035
+ shou4
1036
+ shu1
1037
+ shu2
1038
+ shu3
1039
+ shu4
1040
+ shua1
1041
+ shua2
1042
+ shua3
1043
+ shua4
1044
+ shuai1
1045
+ shuai3
1046
+ shuai4
1047
+ shuan1
1048
+ shuan4
1049
+ shuang1
1050
+ shuang3
1051
+ shui2
1052
+ shui3
1053
+ shui4
1054
+ shun3
1055
+ shun4
1056
+ shuo1
1057
+ shuo4
1058
+ si1
1059
+ si2
1060
+ si3
1061
+ si4
1062
+ song1
1063
+ song3
1064
+ song4
1065
+ sou1
1066
+ sou3
1067
+ sou4
1068
+ su1
1069
+ su2
1070
+ su4
1071
+ suan1
1072
+ suan4
1073
+ sui1
1074
+ sui2
1075
+ sui3
1076
+ sui4
1077
+ sun1
1078
+ sun3
1079
+ suo
1080
+ suo1
1081
+ suo2
1082
+ suo3
1083
+ t
1084
+ ta1
1085
+ ta2
1086
+ ta3
1087
+ ta4
1088
+ tai1
1089
+ tai2
1090
+ tai4
1091
+ tan1
1092
+ tan2
1093
+ tan3
1094
+ tan4
1095
+ tang1
1096
+ tang2
1097
+ tang3
1098
+ tang4
1099
+ tao1
1100
+ tao2
1101
+ tao3
1102
+ tao4
1103
+ te4
1104
+ teng2
1105
+ ti1
1106
+ ti2
1107
+ ti3
1108
+ ti4
1109
+ tian1
1110
+ tian2
1111
+ tian3
1112
+ tiao1
1113
+ tiao2
1114
+ tiao3
1115
+ tiao4
1116
+ tie1
1117
+ tie2
1118
+ tie3
1119
+ tie4
1120
+ ting1
1121
+ ting2
1122
+ ting3
1123
+ tong1
1124
+ tong2
1125
+ tong3
1126
+ tong4
1127
+ tou
1128
+ tou1
1129
+ tou2
1130
+ tou4
1131
+ tu1
1132
+ tu2
1133
+ tu3
1134
+ tu4
1135
+ tuan1
1136
+ tuan2
1137
+ tui1
1138
+ tui2
1139
+ tui3
1140
+ tui4
1141
+ tun1
1142
+ tun2
1143
+ tun4
1144
+ tuo1
1145
+ tuo2
1146
+ tuo3
1147
+ tuo4
1148
+ u
1149
+ v
1150
+ w
1151
+ wa
1152
+ wa1
1153
+ wa2
1154
+ wa3
1155
+ wa4
1156
+ wai1
1157
+ wai3
1158
+ wai4
1159
+ wan1
1160
+ wan2
1161
+ wan3
1162
+ wan4
1163
+ wang1
1164
+ wang2
1165
+ wang3
1166
+ wang4
1167
+ wei1
1168
+ wei2
1169
+ wei3
1170
+ wei4
1171
+ wen1
1172
+ wen2
1173
+ wen3
1174
+ wen4
1175
+ weng1
1176
+ weng4
1177
+ wo1
1178
+ wo2
1179
+ wo3
1180
+ wo4
1181
+ wu1
1182
+ wu2
1183
+ wu3
1184
+ wu4
1185
+ x
1186
+ xi1
1187
+ xi2
1188
+ xi3
1189
+ xi4
1190
+ xia1
1191
+ xia2
1192
+ xia4
1193
+ xian1
1194
+ xian2
1195
+ xian3
1196
+ xian4
1197
+ xiang1
1198
+ xiang2
1199
+ xiang3
1200
+ xiang4
1201
+ xiao1
1202
+ xiao2
1203
+ xiao3
1204
+ xiao4
1205
+ xie1
1206
+ xie2
1207
+ xie3
1208
+ xie4
1209
+ xin1
1210
+ xin2
1211
+ xin4
1212
+ xing1
1213
+ xing2
1214
+ xing3
1215
+ xing4
1216
+ xiong1
1217
+ xiong2
1218
+ xiu1
1219
+ xiu3
1220
+ xiu4
1221
+ xu
1222
+ xu1
1223
+ xu2
1224
+ xu3
1225
+ xu4
1226
+ xuan1
1227
+ xuan2
1228
+ xuan3
1229
+ xuan4
1230
+ xue1
1231
+ xue2
1232
+ xue3
1233
+ xue4
1234
+ xun1
1235
+ xun2
1236
+ xun4
1237
+ y
1238
+ ya
1239
+ ya1
1240
+ ya2
1241
+ ya3
1242
+ ya4
1243
+ yan1
1244
+ yan2
1245
+ yan3
1246
+ yan4
1247
+ yang1
1248
+ yang2
1249
+ yang3
1250
+ yang4
1251
+ yao1
1252
+ yao2
1253
+ yao3
1254
+ yao4
1255
+ ye1
1256
+ ye2
1257
+ ye3
1258
+ ye4
1259
+ yi
1260
+ yi1
1261
+ yi2
1262
+ yi3
1263
+ yi4
1264
+ yin1
1265
+ yin2
1266
+ yin3
1267
+ yin4
1268
+ ying1
1269
+ ying2
1270
+ ying3
1271
+ ying4
1272
+ yo1
1273
+ yong1
1274
+ yong2
1275
+ yong3
1276
+ yong4
1277
+ you1
1278
+ you2
1279
+ you3
1280
+ you4
1281
+ yu1
1282
+ yu2
1283
+ yu3
1284
+ yu4
1285
+ yuan1
1286
+ yuan2
1287
+ yuan3
1288
+ yuan4
1289
+ yue1
1290
+ yue4
1291
+ yun1
1292
+ yun2
1293
+ yun3
1294
+ yun4
1295
+ z
1296
+ za1
1297
+ za2
1298
+ za3
1299
+ zai1
1300
+ zai3
1301
+ zai4
1302
+ zan1
1303
+ zan2
1304
+ zan3
1305
+ zan4
1306
+ zang1
1307
+ zang4
1308
+ zao1
1309
+ zao2
1310
+ zao3
1311
+ zao4
1312
+ ze2
1313
+ ze4
1314
+ zei2
1315
+ zen3
1316
+ zeng1
1317
+ zeng4
1318
+ zha1
1319
+ zha2
1320
+ zha3
1321
+ zha4
1322
+ zhai1
1323
+ zhai2
1324
+ zhai3
1325
+ zhai4
1326
+ zhan1
1327
+ zhan2
1328
+ zhan3
1329
+ zhan4
1330
+ zhang1
1331
+ zhang2
1332
+ zhang3
1333
+ zhang4
1334
+ zhao1
1335
+ zhao2
1336
+ zhao3
1337
+ zhao4
1338
+ zhe
1339
+ zhe1
1340
+ zhe2
1341
+ zhe3
1342
+ zhe4
1343
+ zhen1
1344
+ zhen2
1345
+ zhen3
1346
+ zhen4
1347
+ zheng1
1348
+ zheng2
1349
+ zheng3
1350
+ zheng4
1351
+ zhi1
1352
+ zhi2
1353
+ zhi3
1354
+ zhi4
1355
+ zhong1
1356
+ zhong2
1357
+ zhong3
1358
+ zhong4
1359
+ zhou1
1360
+ zhou2
1361
+ zhou3
1362
+ zhou4
1363
+ zhu1
1364
+ zhu2
1365
+ zhu3
1366
+ zhu4
1367
+ zhua1
1368
+ zhua2
1369
+ zhua3
1370
+ zhuai1
1371
+ zhuai3
1372
+ zhuai4
1373
+ zhuan1
1374
+ zhuan2
1375
+ zhuan3
1376
+ zhuan4
1377
+ zhuang1
1378
+ zhuang4
1379
+ zhui1
1380
+ zhui4
1381
+ zhun1
1382
+ zhun2
1383
+ zhun3
1384
+ zhuo1
1385
+ zhuo2
1386
+ zi
1387
+ zi1
1388
+ zi2
1389
+ zi3
1390
+ zi4
1391
+ zong1
1392
+ zong2
1393
+ zong3
1394
+ zong4
1395
+ zou1
1396
+ zou2
1397
+ zou3
1398
+ zou4
1399
+ zu1
1400
+ zu2
1401
+ zu3
1402
+ zuan1
1403
+ zuan3
1404
+ zuan4
1405
+ zui2
1406
+ zui3
1407
+ zui4
1408
+ zun1
1409
+ zuo
1410
+ zuo1
1411
+ zuo2
1412
+ zuo3
1413
+ zuo4
1414
+ {
1415
+ ~
1416
+ ¡
1417
+ ¢
1418
+ £
1419
+ ¥
1420
+ §
1421
+ ¨
1422
+ ©
1423
+ «
1424
+ ®
1425
+ ¯
1426
+ °
1427
+ ±
1428
+ ²
1429
+ ³
1430
+ ´
1431
+ µ
1432
+ ·
1433
+ ¹
1434
+ º
1435
+ »
1436
+ ¼
1437
+ ½
1438
+ ¾
1439
+ ¿
1440
+ À
1441
+ Á
1442
+ Â
1443
+ Ã
1444
+ Ä
1445
+ Å
1446
+ Æ
1447
+ Ç
1448
+ È
1449
+ É
1450
+ Ê
1451
+ Í
1452
+ Î
1453
+ Ñ
1454
+ Ó
1455
+ Ö
1456
+ ×
1457
+ Ø
1458
+ Ú
1459
+ Ü
1460
+ Ý
1461
+ Þ
1462
+ ß
1463
+ à
1464
+ á
1465
+ â
1466
+ ã
1467
+ ä
1468
+ å
1469
+ æ
1470
+ ç
1471
+ è
1472
+ é
1473
+ ê
1474
+ ë
1475
+ ì
1476
+ í
1477
+ î
1478
+ ï
1479
+ ð
1480
+ ñ
1481
+ ò
1482
+ ó
1483
+ ô
1484
+ õ
1485
+ ö
1486
+ ø
1487
+ ù
1488
+ ú
1489
+ û
1490
+ ü
1491
+ ý
1492
+ Ā
1493
+ ā
1494
+ ă
1495
+ ą
1496
+ ć
1497
+ Č
1498
+ č
1499
+ Đ
1500
+ đ
1501
+ ē
1502
+ ė
1503
+ ę
1504
+ ě
1505
+ ĝ
1506
+ ğ
1507
+ ħ
1508
+ ī
1509
+ į
1510
+ İ
1511
+ ı
1512
+ Ł
1513
+ ł
1514
+ ń
1515
+ ņ
1516
+ ň
1517
+ ŋ
1518
+ Ō
1519
+ ō
1520
+ ő
1521
+ œ
1522
+ ř
1523
+ Ś
1524
+ ś
1525
+ Ş
1526
+ ş
1527
+ Š
1528
+ š
1529
+ Ť
1530
+ ť
1531
+ ũ
1532
+ ū
1533
+ ź
1534
+ Ż
1535
+ ż
1536
+ Ž
1537
+ ž
1538
+ ơ
1539
+ ư
1540
+ ǎ
1541
+ ǐ
1542
+ ǒ
1543
+ ǔ
1544
+ ǚ
1545
+ ș
1546
+ ț
1547
+ ɑ
1548
+ ɔ
1549
+ ɕ
1550
+ ə
1551
+ ɛ
1552
+ ɜ
1553
+ ɡ
1554
+ ɣ
1555
+ ɪ
1556
+ ɫ
1557
+ ɴ
1558
+ ɹ
1559
+ ɾ
1560
+ ʃ
1561
+ ʊ
1562
+ ʌ
1563
+ ʒ
1564
+ ʔ
1565
+ ʰ
1566
+ ʷ
1567
+ ʻ
1568
+ ʾ
1569
+ ʿ
1570
+ ˈ
1571
+ ː
1572
+ ˙
1573
+ ˜
1574
+ ˢ
1575
+ ́
1576
+ ̅
1577
+ Α
1578
+ Β
1579
+ Δ
1580
+ Ε
1581
+ Θ
1582
+ Κ
1583
+ Λ
1584
+ Μ
1585
+ Ξ
1586
+ Π
1587
+ Σ
1588
+ Τ
1589
+ Φ
1590
+ Χ
1591
+ Ψ
1592
+ Ω
1593
+ ά
1594
+ έ
1595
+ ή
1596
+ ί
1597
+ α
1598
+ β
1599
+ γ
1600
+ δ
1601
+ ε
1602
+ ζ
1603
+ η
1604
+ θ
1605
+ ι
1606
+ κ
1607
+ λ
1608
+ μ
1609
+ ν
1610
+ ξ
1611
+ ο
1612
+ π
1613
+ ρ
1614
+ ς
1615
+ σ
1616
+ τ
1617
+ υ
1618
+ φ
1619
+ χ
1620
+ ψ
1621
+ ω
1622
+ ϊ
1623
+ ό
1624
+ ύ
1625
+ ώ
1626
+ ϕ
1627
+ ϵ
1628
+ Ё
1629
+ А
1630
+ Б
1631
+ В
1632
+ Г
1633
+ Д
1634
+ Е
1635
+ Ж
1636
+ З
1637
+ И
1638
+ Й
1639
+ К
1640
+ Л
1641
+ М
1642
+ Н
1643
+ О
1644
+ П
1645
+ Р
1646
+ С
1647
+ Т
1648
+ У
1649
+ Ф
1650
+ Х
1651
+ Ц
1652
+ Ч
1653
+ Ш
1654
+ Щ
1655
+ Ы
1656
+ Ь
1657
+ Э
1658
+ Ю
1659
+ Я
1660
+ а
1661
+ б
1662
+ в
1663
+ г
1664
+ д
1665
+ е
1666
+ ж
1667
+ з
1668
+ и
1669
+ й
1670
+ к
1671
+ л
1672
+ м
1673
+ н
1674
+ о
1675
+ п
1676
+ р
1677
+ с
1678
+ т
1679
+ у
1680
+ ф
1681
+ х
1682
+ ц
1683
+ ч
1684
+ ш
1685
+ щ
1686
+ ъ
1687
+ ы
1688
+ ь
1689
+ э
1690
+ ю
1691
+ я
1692
+ ё
1693
+ і
1694
+ ְ
1695
+ ִ
1696
+ ֵ
1697
+ ֶ
1698
+ ַ
1699
+ ָ
1700
+ ֹ
1701
+ ּ
1702
+ ־
1703
+ ׁ
1704
+ א
1705
+ ב
1706
+ ג
1707
+ ד
1708
+ ה
1709
+ ו
1710
+ ז
1711
+ ח
1712
+ ט
1713
+ י
1714
+ כ
1715
+ ל
1716
+ ם
1717
+ מ
1718
+ ן
1719
+ נ
1720
+ ס
1721
+ ע
1722
+ פ
1723
+ ק
1724
+ ר
1725
+ ש
1726
+ ת
1727
+ أ
1728
+ ب
1729
+ ة
1730
+ ت
1731
+ ج
1732
+ ح
1733
+ د
1734
+ ر
1735
+ ز
1736
+ س
1737
+ ص
1738
+ ط
1739
+ ع
1740
+ ق
1741
+ ك
1742
+ ل
1743
+ م
1744
+ ن
1745
+ ه
1746
+ و
1747
+ ي
1748
+ َ
1749
+ ُ
1750
+ ِ
1751
+ ْ
1752
+
1753
+
1754
+
1755
+
1756
+
1757
+
1758
+
1759
+
1760
+
1761
+
1762
+
1763
+
1764
+
1765
+
1766
+
1767
+
1768
+
1769
+
1770
+
1771
+
1772
+
1773
+
1774
+
1775
+
1776
+
1777
+
1778
+
1779
+
1780
+
1781
+
1782
+
1783
+
1784
+
1785
+
1786
+
1787
+
1788
+
1789
+
1790
+
1791
+
1792
+
1793
+
1794
+
1795
+
1796
+
1797
+
1798
+
1799
+
1800
+ ế
1801
+
1802
+
1803
+
1804
+
1805
+
1806
+
1807
+
1808
+
1809
+
1810
+
1811
+
1812
+
1813
+
1814
+
1815
+
1816
+
1817
+
1818
+
1819
+
1820
+
1821
+
1822
+
1823
+
1824
+
1825
+
1826
+
1827
+
1828
+
1829
+
1830
+ ���
1831
+
1832
+
1833
+
1834
+
1835
+
1836
+
1837
+
1838
+
1839
+
1840
+
1841
+
1842
+
1843
+
1844
+
1845
+
1846
+
1847
+
1848
+
1849
+
1850
+
1851
+
1852
+
1853
+
1854
+
1855
+
1856
+
1857
+
1858
+
1859
+
1860
+
1861
+
1862
+
1863
+
1864
+
1865
+
1866
+
1867
+
1868
+
1869
+
1870
+
1871
+
1872
+
1873
+
1874
+
1875
+
1876
+
1877
+
1878
+
1879
+
1880
+
1881
+
1882
+
1883
+
1884
+
1885
+
1886
+
1887
+
1888
+
1889
+
1890
+
1891
+
1892
+
1893
+
1894
+
1895
+
1896
+
1897
+
1898
+
1899
+
1900
+
1901
+
1902
+
1903
+
1904
+
1905
+
1906
+
1907
+
1908
+
1909
+
1910
+
1911
+
1912
+
1913
+
1914
+
1915
+
1916
+
1917
+
1918
+
1919
+
1920
+
1921
+
1922
+
1923
+
1924
+
1925
+
1926
+
1927
+
1928
+
1929
+
1930
+
1931
+
1932
+
1933
+
1934
+
1935
+
1936
+
1937
+
1938
+
1939
+
1940
+
1941
+
1942
+
1943
+
1944
+
1945
+
1946
+
1947
+
1948
+
1949
+
1950
+
1951
+
1952
+
1953
+
1954
+
1955
+
1956
+
1957
+
1958
+
1959
+
1960
+
1961
+
1962
+
1963
+
1964
+
1965
+
1966
+
1967
+
1968
+
1969
+
1970
+
1971
+
1972
+
1973
+
1974
+
1975
+
1976
+
1977
+
1978
+
1979
+
1980
+
1981
+
1982
+
1983
+
1984
+
1985
+
1986
+
1987
+
1988
+
1989
+
1990
+
1991
+
1992
+
1993
+
1994
+
1995
+
1996
+
1997
+
1998
+
1999
+
2000
+
2001
+
2002
+
2003
+
2004
+
2005
+
2006
+
2007
+
2008
+
2009
+
2010
+
2011
+
2012
+
2013
+
2014
+
2015
+
2016
+
2017
+
2018
+
2019
+
2020
+
2021
+
2022
+
2023
+
2024
+
2025
+
2026
+
2027
+
2028
+
2029
+
2030
+
2031
+
2032
+
2033
+
2034
+
2035
+
2036
+
2037
+
2038
+
2039
+
2040
+
2041
+
2042
+
2043
+
2044
+
2045
+
2046
+
2047
+
2048
+
2049
+
2050
+
2051
+
2052
+
2053
+
2054
+
2055
+
2056
+
2057
+
2058
+
2059
+
2060
+
2061
+
2062
+
2063
+
2064
+
2065
+
2066
+
2067
+
2068
+
2069
+
2070
+
2071
+
2072
+
2073
+
2074
+
2075
+
2076
+
2077
+
2078
+
2079
+
2080
+
2081
+
2082
+
2083
+
2084
+
2085
+
2086
+
2087
+
2088
+
2089
+
2090
+
2091
+
2092
+
2093
+
2094
+
2095
+
2096
+
2097
+
2098
+
2099
+
2100
+
2101
+
2102
+
2103
+
2104
+
2105
+
2106
+
2107
+
2108
+
2109
+
2110
+
2111
+
2112
+
2113
+
2114
+
2115
+
2116
+
2117
+
2118
+
2119
+
2120
+
2121
+
2122
+
2123
+
2124
+
2125
+
2126
+
2127
+
2128
+
2129
+
2130
+
2131
+
2132
+
2133
+
2134
+
2135
+
2136
+
2137
+
2138
+
2139
+
2140
+
2141
+
2142
+
2143
+
2144
+
2145
+
2146
+
2545
+ 𠮶
src/f5_tts/infer/infer_cli.py ADDED
@@ -0,0 +1,193 @@
1
+ import argparse
2
+ import codecs
3
+ import os
4
+ import re
5
+ from pathlib import Path
6
+ from importlib.resources import files
7
+
8
+ import numpy as np
9
+ import soundfile as sf
10
+ import tomli
11
+ from cached_path import cached_path
12
+
13
+ from f5_tts.model import DiT, UNetT
14
+ from f5_tts.infer.utils_infer import (
15
+ load_vocoder,
16
+ load_model,
17
+ preprocess_ref_audio_text,
18
+ infer_process,
19
+ remove_silence_for_generated_wav,
20
+ )
21
+
22
+
23
+ parser = argparse.ArgumentParser(
24
+ prog="python3 infer_cli.py",
25
+ description="Commandline interface for E2/F5 TTS with Advanced Batch Processing.",
26
+ epilog="Specify options above to override one or more settings from config.",
27
+ )
28
+ parser.add_argument(
29
+ "-c",
30
+ "--config",
31
+ help="Configuration file. Default=infer/examples/basic/basic.toml",
32
+ default=os.path.join(files("f5_tts").joinpath("infer/examples/basic"), "basic.toml"),
33
+ )
34
+ parser.add_argument(
35
+ "-m",
36
+ "--model",
37
+ help="F5-TTS | E2-TTS",
38
+ )
39
+ parser.add_argument(
40
+ "-p",
41
+ "--ckpt_file",
42
+ help="The Checkpoint .pt",
43
+ )
44
+ parser.add_argument(
45
+ "-v",
46
+ "--vocab_file",
47
+ help="The vocab .txt",
48
+ )
49
+ parser.add_argument("-r", "--ref_audio", type=str, help="Reference audio file < 15 seconds.")
50
+ parser.add_argument("-s", "--ref_text", type=str, default="666", help="Subtitle for the reference audio.")
51
+ parser.add_argument(
52
+ "-t",
53
+ "--gen_text",
54
+ type=str,
55
+ help="Text to generate.",
56
+ )
57
+ parser.add_argument(
58
+ "-f",
59
+ "--gen_file",
60
+ type=str,
61
+ help="File with text to generate. Ignores --gen_text",
62
+ )
63
+ parser.add_argument(
64
+ "-o",
65
+ "--output_dir",
66
+ type=str,
67
+ help="Path to output folder.",
68
+ )
69
+ parser.add_argument(
70
+ "--remove_silence",
71
+ help="Remove silence.",
72
+ )
73
+ parser.add_argument(
74
+ "--load_vocoder_from_local",
75
+ action="store_true",
76
+ help="load vocoder from local. Default: ../checkpoints/charactr/vocos-mel-24khz",
77
+ )
78
+ args = parser.parse_args()
79
+
80
+ config = tomli.load(open(args.config, "rb"))
81
+
82
+ ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
83
+ ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
84
+ gen_text = args.gen_text if args.gen_text else config["gen_text"]
85
+ gen_file = args.gen_file if args.gen_file else config["gen_file"]
86
+
87
+ # patches for pip pkg user
88
+ if "infer/examples/" in ref_audio:
89
+ ref_audio = str(files("f5_tts").joinpath(f"{ref_audio}"))
90
+ if "infer/examples/" in gen_file:
91
+ gen_file = str(files("f5_tts").joinpath(f"{gen_file}"))
92
+ if "voices" in config:
93
+ for voice in config["voices"]:
94
+ voice_ref_audio = config["voices"][voice]["ref_audio"]
95
+ if "infer/examples/" in voice_ref_audio:
96
+ config["voices"][voice]["ref_audio"] = str(files("f5_tts").joinpath(f"{voice_ref_audio}"))
97
+
98
+ if gen_file:
99
+ gen_text = codecs.open(gen_file, "r", "utf-8").read()
100
+ output_dir = args.output_dir if args.output_dir else config["output_dir"]
101
+ model = args.model if args.model else config["model"]
102
+ ckpt_file = args.ckpt_file if args.ckpt_file else ""
103
+ vocab_file = args.vocab_file if args.vocab_file else ""
104
+ remove_silence = args.remove_silence if args.remove_silence else config["remove_silence"]
105
+ wave_path = Path(output_dir) / "infer_cli_out.wav"
106
+ # spectrogram_path = Path(output_dir) / "infer_cli_out.png"
107
+ vocos_local_path = "../checkpoints/charactr/vocos-mel-24khz"
108
+
109
+ vocos = load_vocoder(is_local=args.load_vocoder_from_local, local_path=vocos_local_path)
110
+
111
+
112
+ # load models
113
+ if model == "F5-TTS":
114
+ model_cls = DiT
115
+ model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
116
+ if ckpt_file == "":
117
+ repo_name = "F5-TTS"
118
+ exp_name = "F5TTS_Base"
119
+ ckpt_step = 1200000
120
+ ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{exp_name}/model_{ckpt_step}.safetensors"))
121
+ # ckpt_file = f"ckpts/{exp_name}/model_{ckpt_step}.pt" # .pt | .safetensors; local path
122
+
123
+ elif model == "E2-TTS":
124
+ model_cls = UNetT
125
+ model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
126
+ if ckpt_file == "":
127
+ repo_name = "E2-TTS"
128
+ exp_name = "E2TTS_Base"
129
+ ckpt_step = 1200000
130
+ ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{exp_name}/model_{ckpt_step}.safetensors"))
131
+ # ckpt_file = f"ckpts/{exp_name}/model_{ckpt_step}.pt" # .pt | .safetensors; local path
132
+
133
+ print(f"Using {model}...")
134
+ ema_model = load_model(model_cls, model_cfg, ckpt_file, vocab_file)
135
+
136
+
137
+ def main_process(ref_audio, ref_text, text_gen, model_obj, remove_silence):
138
+ main_voice = {"ref_audio": ref_audio, "ref_text": ref_text}
139
+ if "voices" not in config:
140
+ voices = {"main": main_voice}
141
+ else:
142
+ voices = config["voices"]
143
+ voices["main"] = main_voice
144
+ for voice in voices:
145
+ voices[voice]["ref_audio"], voices[voice]["ref_text"] = preprocess_ref_audio_text(
146
+ voices[voice]["ref_audio"], voices[voice]["ref_text"]
147
+ )
148
+ print("Voice:", voice)
149
+ print("Ref_audio:", voices[voice]["ref_audio"])
150
+ print("Ref_text:", voices[voice]["ref_text"])
151
+
152
+ generated_audio_segments = []
153
+ reg1 = r"(?=\[\w+\])"
154
+ chunks = re.split(reg1, text_gen)
155
+ reg2 = r"\[(\w+)\]"
156
+ for text in chunks:
157
+ match = re.match(reg2, text)
158
+ if match:
159
+ voice = match[1]
160
+ else:
161
+ print("No voice tag found, using main.")
162
+ voice = "main"
163
+ if voice not in voices:
164
+ print(f"Voice {voice} not found, using main.")
165
+ voice = "main"
166
+ text = re.sub(reg2, "", text)
167
+ gen_text = text.strip()
168
+ ref_audio = voices[voice]["ref_audio"]
169
+ ref_text = voices[voice]["ref_text"]
170
+ print(f"Voice: {voice}")
171
+ audio, final_sample_rate, spectrogram = infer_process(ref_audio, ref_text, gen_text, model_obj)
172
+ generated_audio_segments.append(audio)
173
+
174
+ if generated_audio_segments:
175
+ final_wave = np.concatenate(generated_audio_segments)
176
+
177
+ if not os.path.exists(output_dir):
178
+ os.makedirs(output_dir)
179
+
180
+ with open(wave_path, "wb") as f:
181
+ sf.write(f.name, final_wave, final_sample_rate)
182
+ # Remove silence
183
+ if remove_silence:
184
+ remove_silence_for_generated_wav(f.name)
185
+ print(f.name)
186
+
187
+
188
+ def main():
189
+ main_process(ref_audio, ref_text, gen_text, ema_model, remove_silence)
190
+
191
+
192
+ if __name__ == "__main__":
193
+ main()
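The multi-voice handling in `main_process` splits the generation text on `[voice]` tags and falls back to `main` when a tag is missing or unknown. A minimal standalone sketch of that step, using the same two regular expressions as the script; the helper name `split_by_voice` and the choice to skip empty chunks are assumptions made for illustration and are not part of the repo:

```python
import re

VOICE_SPLIT = r"(?=\[\w+\])"  # zero-width split point just before each [tag]
VOICE_TAG = r"\[(\w+)\]"      # captures the tag name


def split_by_voice(text, known_voices, default="main"):
    """Return (voice, text) pairs; missing or unknown tags fall back to `default`."""
    pairs = []
    for chunk in re.split(VOICE_SPLIT, text):
        match = re.match(VOICE_TAG, chunk)
        voice = match[1] if match else default
        if voice not in known_voices:
            voice = default
        body = re.sub(VOICE_TAG, "", chunk).strip()
        if body:  # assumption: drop empty chunks, which the CLI does not filter
            pairs.append((voice, body))
    return pairs
```

Each returned pair corresponds to one `infer_process` call in the script, with the reference audio and text looked up from the chosen voice.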
src/f5_tts/infer/speech_edit.py ADDED
@@ -0,0 +1,191 @@
1
+ import os
2
+
3
+ import torch
4
+ import torch.nn.functional as F
5
+ import torchaudio
6
+ from vocos import Vocos
7
+
8
+ from f5_tts.model import CFM, UNetT, DiT
9
+ from f5_tts.model.utils import (
10
+ get_tokenizer,
11
+ convert_char_to_pinyin,
12
+ )
13
+ from f5_tts.infer.utils_infer import (
14
+ load_checkpoint,
15
+ save_spectrogram,
16
+ )
17
+
18
+ device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
19
+
20
+
21
+ # --------------------- Dataset Settings -------------------- #
22
+
23
+ target_sample_rate = 24000
24
+ n_mel_channels = 100
25
+ hop_length = 256
26
+ target_rms = 0.1
27
+
28
+ tokenizer = "pinyin"
29
+ dataset_name = "Emilia_ZH_EN"
30
+
31
+
32
+ # ---------------------- infer setting ---------------------- #
33
+
34
+ seed = None # int | None
35
+
36
+ exp_name = "F5TTS_Base" # F5TTS_Base | E2TTS_Base
37
+ ckpt_step = 1200000
38
+
39
+ nfe_step = 32 # 16, 32
40
+ cfg_strength = 2.0
41
+ ode_method = "euler" # euler | midpoint
42
+ sway_sampling_coef = -1.0
43
+ speed = 1.0
44
+
45
+ if exp_name == "F5TTS_Base":
46
+ model_cls = DiT
47
+ model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
48
+
49
+ elif exp_name == "E2TTS_Base":
50
+ model_cls = UNetT
51
+ model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
52
+
53
+ ckpt_path = f"ckpts/{exp_name}/model_{ckpt_step}.safetensors"
54
+ output_dir = "tests"
55
+
56
+ # [leverage https://github.com/MahmoudAshraf97/ctc-forced-aligner to get char level alignment]
57
+ # pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
58
+ # [write the origin_text into a file, e.g. tests/test_edit.txt]
59
+ # ctc-forced-aligner --audio_path "src/f5_tts/infer/examples/basic/basic_ref_en.wav" --text_path "tests/test_edit.txt" --language "zho" --romanize --split_size "char"
60
+ # [result will be saved at same path of audio file]
61
+ # [--language "zho" for Chinese, "eng" for English]
62
+ # [if local ckpt, set --alignment_model "../checkpoints/mms-300m-1130-forced-aligner"]
63
+
64
+ audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"
65
+ origin_text = "Some call me nature, others call me mother nature."
66
+ target_text = "Some call me optimist, others call me realist."
67
+ parts_to_edit = [
68
+ [1.42, 2.44],
69
+ [4.04, 4.9],
70
+ ] # start_ends of "nature" & "mother nature", in seconds
71
+ fix_duration = [
72
+ 1.2,
73
+ 1,
74
+ ] # fix duration for "optimist" & "realist", in seconds
75
+
76
+ # audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_zh.wav"
77
+ # origin_text = "对,这就是我,万人敬仰的太乙真人。"
78
+ # target_text = "对,那就是你,万人敬仰的太白金星。"
79
+ # parts_to_edit = [[0.84, 1.4], [1.92, 2.4], [4.26, 6.26], ]
80
+ # fix_duration = None # use origin text duration
81
+
82
+
83
+ # -------------------------------------------------#
84
+
85
+ use_ema = True
86
+
87
+ if not os.path.exists(output_dir):
88
+ os.makedirs(output_dir)
89
+
90
+ # Vocoder model
91
+ local = False
92
+ if local:
93
+ vocos_local_path = "../checkpoints/charactr/vocos-mel-24khz"
94
+ vocos = Vocos.from_hparams(f"{vocos_local_path}/config.yaml")
95
+ state_dict = torch.load(f"{vocos_local_path}/pytorch_model.bin", weights_only=True, map_location=device)
96
+ vocos.load_state_dict(state_dict)
97
+
98
+ vocos.eval()
99
+ else:
100
+ vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
101
+
102
+ # Tokenizer
103
+ vocab_char_map, vocab_size = get_tokenizer(dataset_name, tokenizer)
104
+
105
+ # Model
106
+ model = CFM(
107
+ transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
108
+ mel_spec_kwargs=dict(
109
+ target_sample_rate=target_sample_rate,
110
+ n_mel_channels=n_mel_channels,
111
+ hop_length=hop_length,
112
+ ),
113
+ odeint_kwargs=dict(
114
+ method=ode_method,
115
+ ),
116
+ vocab_char_map=vocab_char_map,
117
+ ).to(device)
118
+
119
+ model = load_checkpoint(model, ckpt_path, device, use_ema=use_ema)
120
+
121
+ # Audio
122
+ audio, sr = torchaudio.load(audio_to_edit)
123
+ if audio.shape[0] > 1:
124
+ audio = torch.mean(audio, dim=0, keepdim=True)
125
+ rms = torch.sqrt(torch.mean(torch.square(audio)))
126
+ if rms < target_rms:
127
+ audio = audio * target_rms / rms
128
+ if sr != target_sample_rate:
129
+ resampler = torchaudio.transforms.Resample(sr, target_sample_rate)
130
+ audio = resampler(audio)
131
+ offset = 0
132
+ audio_ = torch.zeros(1, 0)
133
+ edit_mask = torch.zeros(1, 0, dtype=torch.bool)
134
+ for part in parts_to_edit:
135
+ start, end = part
136
+ part_dur = end - start if fix_duration is None else fix_duration.pop(0)
137
+ part_dur = part_dur * target_sample_rate
138
+ start = start * target_sample_rate
139
+ audio_ = torch.cat((audio_, audio[:, round(offset) : round(start)], torch.zeros(1, round(part_dur))), dim=-1)
140
+ edit_mask = torch.cat(
141
+ (
142
+ edit_mask,
143
+ torch.ones(1, round((start - offset) / hop_length), dtype=torch.bool),
144
+ torch.zeros(1, round(part_dur / hop_length), dtype=torch.bool),
145
+ ),
146
+ dim=-1,
147
+ )
148
+ offset = end * target_sample_rate
149
+ # audio = torch.cat((audio_, audio[:, round(offset):]), dim = -1)
150
+ edit_mask = F.pad(edit_mask, (0, audio.shape[-1] // hop_length - edit_mask.shape[-1] + 1), value=True)
151
+ audio = audio.to(device)
152
+ edit_mask = edit_mask.to(device)
153
+
154
+ # Text
155
+ text_list = [target_text]
156
+ if tokenizer == "pinyin":
157
+ final_text_list = convert_char_to_pinyin(text_list)
158
+ else:
159
+ final_text_list = text_list
160
+ print(f"text : {text_list}")
161
+ print(f"pinyin: {final_text_list}")
162
+
163
+ # Duration
164
+ ref_audio_len = 0
165
+ duration = audio.shape[-1] // hop_length
166
+
167
+ # Inference
168
+ with torch.inference_mode():
169
+ generated, trajectory = model.sample(
170
+ cond=audio,
171
+ text=final_text_list,
172
+ duration=duration,
173
+ steps=nfe_step,
174
+ cfg_strength=cfg_strength,
175
+ sway_sampling_coef=sway_sampling_coef,
176
+ seed=seed,
177
+ edit_mask=edit_mask,
178
+ )
179
+ print(f"Generated mel: {generated.shape}")
180
+
181
+ # Final result
182
+ generated = generated.to(torch.float32)
183
+ generated = generated[:, ref_audio_len:, :]
184
+ generated_mel_spec = generated.permute(0, 2, 1)
185
+ generated_wave = vocos.decode(generated_mel_spec.cpu())
186
+ if rms < target_rms:
187
+ generated_wave = generated_wave * rms / target_rms
188
+
189
+ save_spectrogram(generated_mel_spec[0].cpu().numpy(), f"{output_dir}/speech_edit_out.png")
190
+ torchaudio.save(f"{output_dir}/speech_edit_out.wav", generated_wave, target_sample_rate)
191
+ print(f"Generated wav: {generated_wave.shape}")
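The edit mask built in `speech_edit.py` marks, per mel frame, which parts of the original audio are kept (`True`) and which are regenerated (`False`). A simplified sketch of the same arithmetic using plain Python lists instead of tensors; the function name `build_edit_mask` and its signature are hypothetical, and unlike the script it copies `fix_duration` rather than popping from the caller's list:

```python
def build_edit_mask(parts_to_edit, total_samples, sample_rate=24000,
                    hop_length=256, fix_duration=None):
    """Frame-level mask: True = keep original audio, False = regenerate."""
    mask = []
    offset = 0.0  # position in samples where the previous edited part ended
    durations = list(fix_duration) if fix_duration is not None else None
    for start_s, end_s in parts_to_edit:
        # Use the original span length unless a fixed duration is given.
        part_s = end_s - start_s if durations is None else durations.pop(0)
        part = part_s * sample_rate
        start = start_s * sample_rate
        mask += [True] * round((start - offset) / hop_length)  # untouched audio
        mask += [False] * round(part / hop_length)             # span to regenerate
        offset = end_s * sample_rate
    # Pad with True up to the frame count of the original audio (+1, as in the script).
    target_frames = total_samples // hop_length + 1
    mask += [True] * max(0, target_frames - len(mask))
    return mask
```

With the defaults above, one second of audio corresponds to roughly 94 frames (24000 / 256), so a one-second edit inside a three-second clip yields about 94 `False` frames surrounded by `True` frames.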
src/f5_tts/infer/utils_infer.py ADDED
@@ -0,0 +1,417 @@
1
+ # A unified script for inference process
2
+ # Make adjustments inside the functions; if you need to change a function's output format, update both the gradio and cli scripts
3
+
4
+ import hashlib
5
+ import re
6
+ import tempfile
7
+ from importlib.resources import files
8
+
9
+ import matplotlib
10
+
11
+ matplotlib.use("Agg")
12
+
13
+ import matplotlib.pylab as plt
14
+ import numpy as np
15
+ import torch
16
+ import torchaudio
17
+ import tqdm
18
+ from pydub import AudioSegment, silence
19
+ from transformers import pipeline
20
+ from vocos import Vocos
21
+
22
+ from f5_tts.model import CFM
23
+ from f5_tts.model.utils import (
24
+ get_tokenizer,
25
+ convert_char_to_pinyin,
26
+ )
27
+
28
+ _ref_audio_cache = {}
29
+
30
+ device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
31
+
32
+ vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
33
+
34
+
35
+ # -----------------------------------------
36
+
37
+ target_sample_rate = 24000
38
+ n_mel_channels = 100
39
+ hop_length = 256
40
+ target_rms = 0.1
41
+ cross_fade_duration = 0.15
42
+ ode_method = "euler"
43
+ nfe_step = 32 # 16, 32
44
+ cfg_strength = 2.0
45
+ sway_sampling_coef = -1.0
46
+ speed = 1.0
47
+ fix_duration = None
48
+
49
+ # -----------------------------------------
50
+
51
+
52
+ # chunk text into smaller pieces
53
+
54
+
55
+ def chunk_text(text, max_chars=135):
56
+ """
57
+ Splits the input text into chunks, each with a maximum number of characters.
58
+
59
+ Args:
60
+ text (str): The text to be split.
61
+ max_chars (int): The maximum number of characters per chunk.
62
+
63
+ Returns:
64
+ List[str]: A list of text chunks.
65
+ """
66
+ chunks = []
67
+ current_chunk = ""
68
+ # Split the text into sentences based on punctuation followed by whitespace
69
+ sentences = re.split(r"(?<=[;:,.!?])\s+|(?<=[;:,。!?])", text)
70
+
71
+ for sentence in sentences:
72
+ if len(current_chunk.encode("utf-8")) + len(sentence.encode("utf-8")) <= max_chars:
73
+ current_chunk += sentence + " " if sentence and len(sentence[-1].encode("utf-8")) == 1 else sentence
74
+ else:
75
+ if current_chunk:
76
+ chunks.append(current_chunk.strip())
77
+ current_chunk = sentence + " " if sentence and len(sentence[-1].encode("utf-8")) == 1 else sentence
78
+
79
+ if current_chunk:
80
+ chunks.append(current_chunk.strip())
81
+
82
+ return chunks
83
+
84
+
85
+ # load vocoder
86
+ def load_vocoder(is_local=False, local_path="", device=device):
87
+ if is_local:
88
+ print(f"Load vocos from local path {local_path}")
89
+ vocos = Vocos.from_hparams(f"{local_path}/config.yaml")
90
+ state_dict = torch.load(f"{local_path}/pytorch_model.bin", map_location=device)
91
+ vocos.load_state_dict(state_dict)
92
+ vocos.eval()
93
+ else:
94
+ print("Download Vocos from huggingface charactr/vocos-mel-24khz")
95
+ vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
96
+ return vocos
97
+
98
+
99
+ # load asr pipeline
100
+
101
+ asr_pipe = None
102
+
103
+
104
+ def initialize_asr_pipeline(device=device):
105
+ global asr_pipe
106
+ asr_pipe = pipeline(
107
+ "automatic-speech-recognition",
108
+ model="openai/whisper-large-v3-turbo",
109
+ torch_dtype=torch.float16,
110
+ device=device,
111
+ )
112
+
113
+
114
+ # load model checkpoint for inference
115
+
116
+
117
+ def load_checkpoint(model, ckpt_path, device, use_ema=True):
118
+ if device == "cuda":
119
+ model = model.half()
120
+
121
+ ckpt_type = ckpt_path.split(".")[-1]
122
+ if ckpt_type == "safetensors":
123
+ from safetensors.torch import load_file
124
+
125
+ checkpoint = load_file(ckpt_path)
126
+ else:
127
+ checkpoint = torch.load(ckpt_path, weights_only=True)
128
+
129
+ if use_ema:
130
+ if ckpt_type == "safetensors":
131
+ checkpoint = {"ema_model_state_dict": checkpoint}
132
+ checkpoint["model_state_dict"] = {
133
+ k.replace("ema_model.", ""): v
134
+ for k, v in checkpoint["ema_model_state_dict"].items()
135
+ if k not in ["initted", "step"]
136
+ }
137
+ model.load_state_dict(checkpoint["model_state_dict"])
138
+ else:
139
+ if ckpt_type == "safetensors":
140
+ checkpoint = {"model_state_dict": checkpoint}
141
+ model.load_state_dict(checkpoint["model_state_dict"])
142
+
143
+ return model.to(device)
144
+
145
+
146
+ # load model for inference
147
+
148
+
149
+ def load_model(model_cls, model_cfg, ckpt_path, vocab_file="", ode_method=ode_method, use_ema=True, device=device):
150
+ if vocab_file == "":
151
+ vocab_file = str(files("f5_tts").joinpath("infer/examples/vocab.txt"))
152
+ tokenizer = "custom"
153
+
154
+ print("\nvocab : ", vocab_file)
155
+ print("tokenizer : ", tokenizer)
156
+ print("model : ", ckpt_path, "\n")
157
+
158
+ vocab_char_map, vocab_size = get_tokenizer(vocab_file, tokenizer)
159
+ model = CFM(
160
+ transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
161
+ mel_spec_kwargs=dict(
162
+ target_sample_rate=target_sample_rate,
163
+ n_mel_channels=n_mel_channels,
164
+ hop_length=hop_length,
165
+ ),
166
+ odeint_kwargs=dict(
167
+ method=ode_method,
168
+ ),
169
+ vocab_char_map=vocab_char_map,
170
+ ).to(device)
171
+
172
+ model = load_checkpoint(model, ckpt_path, device, use_ema=use_ema)
173
+
174
+ return model
175
+
176
+
177
+ # preprocess reference audio and text
178
+
179
+
180
+ def preprocess_ref_audio_text(ref_audio_orig, ref_text, show_info=print, device=device):
181
+ show_info("Converting audio...")
182
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
183
+ aseg = AudioSegment.from_file(ref_audio_orig)
184
+
185
+ non_silent_segs = silence.split_on_silence(aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000)
186
+ non_silent_wave = AudioSegment.silent(duration=0)
187
+ for non_silent_seg in non_silent_segs:
188
+ if len(non_silent_wave) > 10000 and len(non_silent_wave + non_silent_seg) > 18000:
189
+ show_info("Audio is over 18s, clipping short.")
190
+ break
191
+ non_silent_wave += non_silent_seg
192
+ aseg = non_silent_wave
193
+
194
+ aseg.export(f.name, format="wav")
195
+ ref_audio = f.name
196
+
197
+ # Compute a hash of the reference audio file
198
+ with open(ref_audio, "rb") as audio_file:
199
+ audio_data = audio_file.read()
200
+ audio_hash = hashlib.md5(audio_data).hexdigest()
201
+
202
+ global _ref_audio_cache
203
+ if audio_hash in _ref_audio_cache:
204
+ # Use cached reference text
205
+ show_info("Using cached reference text...")
206
+ ref_text = _ref_audio_cache[audio_hash]
207
+ else:
208
+ if not ref_text.strip():
209
+ global asr_pipe
210
+ if asr_pipe is None:
211
+ initialize_asr_pipeline(device=device)
212
+ show_info("No reference text provided, transcribing reference audio...")
213
+ ref_text = asr_pipe(
214
+ ref_audio,
215
+ chunk_length_s=30,
216
+ batch_size=128,
217
+ generate_kwargs={"task": "transcribe"},
218
+ return_timestamps=False,
219
+ )["text"].strip()
220
+ show_info("Finished transcription")
221
+ else:
222
+ show_info("Using custom reference text...")
223
+ # Cache the transcribed text
224
+ _ref_audio_cache[audio_hash] = ref_text
225
+
226
+ # Ensure ref_text ends with a proper sentence-ending punctuation
227
+ if not ref_text.endswith(". ") and not ref_text.endswith("。"):
228
+ if ref_text.endswith("."):
229
+ ref_text += " "
230
+ else:
231
+ ref_text += ". "
232
+
233
+ return ref_audio, ref_text
234
+
235
+
236
+ # infer process: chunk text -> infer batches [i.e. infer_batch_process()]
237
+
238
+
239
+ def infer_process(
240
+ ref_audio,
241
+ ref_text,
242
+ gen_text,
243
+ model_obj,
244
+ show_info=print,
245
+ progress=tqdm,
246
+ target_rms=target_rms,
247
+ cross_fade_duration=cross_fade_duration,
248
+ nfe_step=nfe_step,
249
+ cfg_strength=cfg_strength,
250
+ sway_sampling_coef=sway_sampling_coef,
251
+ speed=speed,
252
+ fix_duration=fix_duration,
253
+ device=device,
254
+ ):
255
+ # Split the input text into batches
256
+ audio, sr = torchaudio.load(ref_audio)
257
+ max_chars = int(len(ref_text.encode("utf-8")) / (audio.shape[-1] / sr) * (25 - audio.shape[-1] / sr))
258
+ gen_text_batches = chunk_text(gen_text, max_chars=max_chars)
259
+ for i, gen_text in enumerate(gen_text_batches):
260
+ print(f"gen_text {i}", gen_text)
261
+
262
+ show_info(f"Generating audio in {len(gen_text_batches)} batches...")
263
+ return infer_batch_process(
264
+ (audio, sr),
265
+ ref_text,
266
+ gen_text_batches,
267
+ model_obj,
268
+ progress=progress,
269
+ target_rms=target_rms,
270
+ cross_fade_duration=cross_fade_duration,
271
+ nfe_step=nfe_step,
272
+ cfg_strength=cfg_strength,
273
+ sway_sampling_coef=sway_sampling_coef,
274
+ speed=speed,
275
+ fix_duration=fix_duration,
276
+ device=device,
277
+ )
278
+
279
+
280
+ # infer batches
281
+
282
+
283
+ def infer_batch_process(
284
+ ref_audio,
285
+ ref_text,
286
+ gen_text_batches,
287
+ model_obj,
288
+ progress=tqdm,
289
+ target_rms=0.1,
290
+ cross_fade_duration=0.15,
291
+ nfe_step=32,
292
+ cfg_strength=2.0,
293
+ sway_sampling_coef=-1,
294
+ speed=1,
295
+ fix_duration=None,
296
+ device=None,
297
+ ):
298
+ audio, sr = ref_audio
299
+ if audio.shape[0] > 1:
300
+ audio = torch.mean(audio, dim=0, keepdim=True)
301
+
302
+ rms = torch.sqrt(torch.mean(torch.square(audio)))
303
+ if rms < target_rms:
304
+ audio = audio * target_rms / rms
305
+ if sr != target_sample_rate:
306
+ resampler = torchaudio.transforms.Resample(sr, target_sample_rate)
307
+ audio = resampler(audio)
308
+ audio = audio.to(device)
309
+
310
+ generated_waves = []
311
+ spectrograms = []
312
+
313
+ if len(ref_text[-1].encode("utf-8")) == 1:
314
+ ref_text = ref_text + " "
315
+ for i, gen_text in enumerate(progress.tqdm(gen_text_batches)):
316
+ # Prepare the text
317
+ text_list = [ref_text + gen_text]
318
+ final_text_list = convert_char_to_pinyin(text_list)
319
+
320
+ ref_audio_len = audio.shape[-1] // hop_length
321
+ if fix_duration is not None:
322
+ duration = int(fix_duration * target_sample_rate / hop_length)
323
+ else:
324
+ # Calculate duration
325
+ ref_text_len = len(ref_text.encode("utf-8"))
326
+ gen_text_len = len(gen_text.encode("utf-8"))
327
+ duration = ref_audio_len + int(ref_audio_len / ref_text_len * gen_text_len / speed)
328
+
329
+ # inference
330
+ with torch.inference_mode():
331
+ generated, _ = model_obj.sample(
332
+ cond=audio,
333
+ text=final_text_list,
334
+ duration=duration,
335
+ steps=nfe_step,
336
+ cfg_strength=cfg_strength,
337
+ sway_sampling_coef=sway_sampling_coef,
338
+ )
339
+
340
+ generated = generated.to(torch.float32)
341
+ generated = generated[:, ref_audio_len:, :]
342
+ generated_mel_spec = generated.permute(0, 2, 1)
343
+ generated_wave = vocos.decode(generated_mel_spec.cpu())
344
+ if rms < target_rms:
345
+ generated_wave = generated_wave * rms / target_rms
346
+
347
+ # wav -> numpy
348
+ generated_wave = generated_wave.squeeze().cpu().numpy()
349
+
350
+ generated_waves.append(generated_wave)
351
+ spectrograms.append(generated_mel_spec[0].cpu().numpy())
352
+
353
+ # Combine all generated waves with cross-fading
354
+ if cross_fade_duration <= 0:
355
+ # Simply concatenate
356
+ final_wave = np.concatenate(generated_waves)
357
+ else:
358
+ final_wave = generated_waves[0]
359
+ for i in range(1, len(generated_waves)):
360
+ prev_wave = final_wave
361
+ next_wave = generated_waves[i]
362
+
363
+ # Calculate cross-fade samples, ensuring it does not exceed wave lengths
364
+ cross_fade_samples = int(cross_fade_duration * target_sample_rate)
365
+ cross_fade_samples = min(cross_fade_samples, len(prev_wave), len(next_wave))
366
+
367
+ if cross_fade_samples <= 0:
368
+ # No overlap possible, concatenate
369
+ final_wave = np.concatenate([prev_wave, next_wave])
370
+ continue
371
+
372
+ # Overlapping parts
373
+ prev_overlap = prev_wave[-cross_fade_samples:]
374
+ next_overlap = next_wave[:cross_fade_samples]
375
+
376
+ # Fade out and fade in
377
+ fade_out = np.linspace(1, 0, cross_fade_samples)
378
+ fade_in = np.linspace(0, 1, cross_fade_samples)
379
+
380
+ # Cross-faded overlap
381
+ cross_faded_overlap = prev_overlap * fade_out + next_overlap * fade_in
382
+
383
+ # Combine
384
+ new_wave = np.concatenate(
385
+ [prev_wave[:-cross_fade_samples], cross_faded_overlap, next_wave[cross_fade_samples:]]
386
+ )
387
+
388
+ final_wave = new_wave
389
+
390
+ # Create a combined spectrogram
391
+ combined_spectrogram = np.concatenate(spectrograms, axis=1)
392
+
393
+ return final_wave, target_sample_rate, combined_spectrogram
394
+
395
+
396
+ # remove silence from generated wav
397
+
398
+
399
+ def remove_silence_for_generated_wav(filename):
400
+ aseg = AudioSegment.from_file(filename)
401
+ non_silent_segs = silence.split_on_silence(aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=500)
402
+ non_silent_wave = AudioSegment.silent(duration=0)
403
+ for non_silent_seg in non_silent_segs:
404
+ non_silent_wave += non_silent_seg
405
+ aseg = non_silent_wave
406
+ aseg.export(filename, format="wav")
407
+
408
+
409
+ # save spectrogram
410
+
411
+
412
+ def save_spectrogram(spectrogram, path):
413
+ plt.figure(figsize=(12, 4))
414
+ plt.imshow(spectrogram, origin="lower", aspect="auto")
415
+ plt.colorbar()
416
+ plt.savefig(path)
417
+ plt.close()
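Both `infer_process` and `infer_batch_process` size their work from the ratio of reference text bytes to reference audio length. A small sketch of the two formulas as they appear in the code above, pulled out into standalone functions; the names `max_chars_per_chunk` and `estimate_duration_frames` are assumptions for illustration, and the 25-second budget is the constant used in `infer_process`:

```python
def max_chars_per_chunk(ref_text, ref_seconds, total_budget_seconds=25):
    """Byte budget per generated chunk, scaled by the reference speaking rate."""
    bytes_per_second = len(ref_text.encode("utf-8")) / ref_seconds
    return int(bytes_per_second * (total_budget_seconds - ref_seconds))


def estimate_duration_frames(ref_audio_len, ref_text, gen_text, speed=1.0):
    """Total mel frames: reference frames plus frames estimated for the new text."""
    ref_text_len = len(ref_text.encode("utf-8"))
    gen_text_len = len(gen_text.encode("utf-8"))
    return ref_audio_len + int(ref_audio_len / ref_text_len * gen_text_len / speed)
```

Lengths are counted in UTF-8 bytes rather than characters, so a Chinese character (3 bytes) weighs as much as three ASCII letters in both estimates.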
src/f5_tts/model/__init__.py ADDED
@@ -0,0 +1,10 @@
+ from f5_tts.model.cfm import CFM
+
+ from f5_tts.model.backbones.unett import UNetT
+ from f5_tts.model.backbones.dit import DiT
+ from f5_tts.model.backbones.mmdit import MMDiT
+
+ from f5_tts.model.trainer import Trainer
+
+
+ __all__ = ["CFM", "UNetT", "DiT", "MMDiT", "Trainer"]
src/f5_tts/model/backbones/README.md ADDED
@@ -0,0 +1,20 @@
+ ## Backbones quick introduction
+
+
+ ### unett.py
+ - flat unet transformer
+ - same structure as in the e2-tts & voicebox papers, except that it uses rotary pos emb
+ - update: allow possible abs pos emb & convnextv2 blocks for embedded text before concat
+
+ ### dit.py
+ - adaln-zero dit
+ - embedded timestep as condition
+ - concatenated noised_input + masked_cond + embedded_text, then a linear projection in
+ - possible abs pos emb & convnextv2 blocks for embedded text before concat
+ - possible long skip connection (first layer to last layer)
+
+ ### mmdit.py
+ - sd3 structure
+ - timestep as condition
+ - left stream: embedded text with an abs pos emb applied
+ - right stream: masked_cond & noised_input concatenated, with the same conv pos emb as unett
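The `dit.py` entry above describes concatenating `noised_input`, `masked_cond`, and `embedded_text` per frame before a linear projection. A toy sketch of that concatenation with plain lists, only to show the resulting feature width; the function name `mix_inputs` is hypothetical, the batch axis is omitted, and the sizes 100 and 512 are taken from the `mel_dim` and `text_dim` values used for F5-TTS elsewhere in this commit:

```python
def mix_inputs(noised_input, masked_cond, embedded_text):
    """Concatenate per-frame feature vectors along the feature axis."""
    assert len(noised_input) == len(masked_cond) == len(embedded_text)
    return [x + c + t for x, c, t in zip(noised_input, masked_cond, embedded_text)]


n_frames, mel_dim, text_dim = 3, 100, 512
noised = [[0.0] * mel_dim for _ in range(n_frames)]
cond = [[0.0] * mel_dim for _ in range(n_frames)]
text = [[0.0] * text_dim for _ in range(n_frames)]
mixed = mix_inputs(noised, cond, text)  # each frame now has mel_dim * 2 + text_dim features
```

The width `mel_dim * 2 + text_dim` is the input size of the projection layer in `InputEmbedding` in `dit.py`.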
src/f5_tts/model/backbones/dit.py ADDED
@@ -0,0 +1,163 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import torch
13
+ from torch import nn
14
+ import torch.nn.functional as F
15
+
16
+ from x_transformers.x_transformers import RotaryEmbedding
17
+
18
+ from f5_tts.model.modules import (
19
+ TimestepEmbedding,
20
+ ConvNeXtV2Block,
21
+ ConvPositionEmbedding,
22
+ DiTBlock,
23
+ AdaLayerNormZero_Final,
24
+ precompute_freqs_cis,
25
+ get_pos_embed_indices,
26
+ )
27
+
28
+
29
+ # Text embedding
30
+
31
+
32
+ class TextEmbedding(nn.Module):
33
+ def __init__(self, text_num_embeds, text_dim, conv_layers=0, conv_mult=2):
34
+ super().__init__()
35
+ self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
36
+
37
+ if conv_layers > 0:
38
+ self.extra_modeling = True
39
+ self.precompute_max_pos = 4096 # ~44s of 24khz audio
40
+ self.register_buffer("freqs_cis", precompute_freqs_cis(text_dim, self.precompute_max_pos), persistent=False)
41
+ self.text_blocks = nn.Sequential(
42
+ *[ConvNeXtV2Block(text_dim, text_dim * conv_mult) for _ in range(conv_layers)]
43
+ )
44
+ else:
45
+ self.extra_modeling = False
46
+
47
+ def forward(self, text: int["b nt"], seq_len, drop_text=False): # noqa: F722
48
+ text = text + 1 # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
49
+ text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
50
+ batch, text_len = text.shape[0], text.shape[1]
51
+ text = F.pad(text, (0, seq_len - text_len), value=0)
52
+
53
+ if drop_text: # cfg for text
54
+ text = torch.zeros_like(text)
55
+
56
+ text = self.text_embed(text) # b n -> b n d
57
+
58
+ # possible extra modeling
59
+ if self.extra_modeling:
60
+ # sinus pos emb
61
+ batch_start = torch.zeros((batch,), dtype=torch.long)
62
+ pos_idx = get_pos_embed_indices(batch_start, seq_len, max_pos=self.precompute_max_pos)
63
+ text_pos_embed = self.freqs_cis[pos_idx]
64
+ text = text + text_pos_embed
65
+
66
+ # convnextv2 blocks
67
+ text = self.text_blocks(text)
68
+
69
+ return text
70
+
71
+
72
+ # noised input audio and context mixing embedding
73
+
74
+
75
+ class InputEmbedding(nn.Module):
76
+ def __init__(self, mel_dim, text_dim, out_dim):
77
+ super().__init__()
78
+ self.proj = nn.Linear(mel_dim * 2 + text_dim, out_dim)
79
+ self.conv_pos_embed = ConvPositionEmbedding(dim=out_dim)
80
+
81
+ def forward(self, x: float["b n d"], cond: float["b n d"], text_embed: float["b n d"], drop_audio_cond=False): # noqa: F722
82
+ if drop_audio_cond: # cfg for cond audio
83
+ cond = torch.zeros_like(cond)
84
+
85
+ x = self.proj(torch.cat((x, cond, text_embed), dim=-1))
86
+ x = self.conv_pos_embed(x) + x
87
+ return x
88
+
89
+
90
+ # Transformer backbone using DiT blocks
91
+
92
+
93
+ class DiT(nn.Module):
94
+ def __init__(
95
+ self,
96
+ *,
97
+ dim,
98
+ depth=8,
99
+ heads=8,
100
+ dim_head=64,
101
+ dropout=0.1,
102
+ ff_mult=4,
103
+ mel_dim=100,
104
+ text_num_embeds=256,
105
+ text_dim=None,
106
+ conv_layers=0,
107
+ long_skip_connection=False,
108
+ ):
109
+ super().__init__()
110
+
111
+ self.time_embed = TimestepEmbedding(dim)
112
+ if text_dim is None:
113
+ text_dim = mel_dim
114
+ self.text_embed = TextEmbedding(text_num_embeds, text_dim, conv_layers=conv_layers)
115
+ self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
116
+
117
+ self.rotary_embed = RotaryEmbedding(dim_head)
118
+
119
+ self.dim = dim
120
+ self.depth = depth
121
+
122
+ self.transformer_blocks = nn.ModuleList(
123
+ [DiTBlock(dim=dim, heads=heads, dim_head=dim_head, ff_mult=ff_mult, dropout=dropout) for _ in range(depth)]
124
+ )
125
+ self.long_skip_connection = nn.Linear(dim * 2, dim, bias=False) if long_skip_connection else None
126
+
127
+ self.norm_out = AdaLayerNormZero_Final(dim) # final modulation
128
+ self.proj_out = nn.Linear(dim, mel_dim)
129
+
130
+ def forward(
131
+ self,
132
+ x: float["b n d"], # noised input audio # noqa: F722
133
+ cond: float["b n d"], # masked cond audio # noqa: F722
134
+ text: int["b nt"], # text # noqa: F722
135
+ time: float["b"] | float[""], # time step # noqa: F821 F722
136
+ drop_audio_cond, # cfg for cond audio
137
+ drop_text, # cfg for text
138
+ mask: bool["b n"] | None = None, # noqa: F722
139
+ ):
140
+ batch, seq_len = x.shape[0], x.shape[1]
141
+ if time.ndim == 0:
142
+ time = time.repeat(batch)
143
+
144
+ # t: conditioning time, c: context (text + masked cond audio), x: noised input audio
145
+ t = self.time_embed(time)
146
+ text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
147
+ x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
148
+
149
+ rope = self.rotary_embed.forward_from_seq_len(seq_len)
150
+
151
+ if self.long_skip_connection is not None:
152
+ residual = x
153
+
154
+ for block in self.transformer_blocks:
155
+ x = block(x, t, mask=mask, rope=rope)
156
+
157
+ if self.long_skip_connection is not None:
158
+ x = self.long_skip_connection(torch.cat((x, residual), dim=-1))
159
+
160
+ x = self.norm_out(x, t)
161
+ output = self.proj_out(x)
162
+
163
+ return output
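
The index handling at the top of `TextEmbedding.forward` (shift by one, truncate, pad with the filler token) can be sketched without tensors. The following is a minimal pure-Python illustration for a single sequence, not the module itself; `prepare_text_ids` is a hypothetical name, and plain lists stand in for the `b nt` tensor.

```python
def prepare_text_ids(text_ids, seq_len, drop_text=False):
    """Mirror the id preparation for one sequence.

    text_ids: token ids, where -1 marks batch padding.
    seq_len:  number of mel frames the text must be aligned to.
    """
    shifted = [tok + 1 for tok in text_ids]             # -1 padding becomes filler id 0
    shifted = shifted[:seq_len]                         # cut text longer than the mel sequence
    shifted = shifted + [0] * (seq_len - len(shifted))  # pad with filler up to seq_len
    if drop_text:                                       # classifier-free guidance: drop all text
        shifted = [0] * seq_len
    return shifted


print(prepare_text_ids([5, 7, -1], seq_len=5))  # [6, 8, 0, 0, 0]
```

Because real tokens are shifted to start at 1, the embedding table needs `text_num_embeds + 1` rows, which matches the `nn.Embedding` constructor call.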
src/f5_tts/model/backbones/mmdit.py ADDED
@@ -0,0 +1,146 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import torch
13
+ from torch import nn
14
+
15
+ from x_transformers.x_transformers import RotaryEmbedding
16
+
17
+ from f5_tts.model.modules import (
18
+ TimestepEmbedding,
19
+ ConvPositionEmbedding,
20
+ MMDiTBlock,
21
+ AdaLayerNormZero_Final,
22
+ precompute_freqs_cis,
23
+ get_pos_embed_indices,
24
+ )
25
+
26
+
27
+ # text embedding
28
+
29
+
30
+ class TextEmbedding(nn.Module):
31
+ def __init__(self, out_dim, text_num_embeds):
32
+ super().__init__()
33
+ self.text_embed = nn.Embedding(text_num_embeds + 1, out_dim) # will use 0 as filler token
34
+
35
+ self.precompute_max_pos = 1024
36
+ self.register_buffer("freqs_cis", precompute_freqs_cis(out_dim, self.precompute_max_pos), persistent=False)
37
+
38
+ def forward(self, text: int["b nt"], drop_text=False) -> float["b nt d"]: # noqa: F722
39
+ text = text + 1
40
+ if drop_text:
41
+ text = torch.zeros_like(text)
42
+ text = self.text_embed(text)
43
+
44
+ # sinus pos emb
45
+ batch_start = torch.zeros((text.shape[0],), dtype=torch.long)
46
+ batch_text_len = text.shape[1]
47
+ pos_idx = get_pos_embed_indices(batch_start, batch_text_len, max_pos=self.precompute_max_pos)
48
+ text_pos_embed = self.freqs_cis[pos_idx]
49
+
50
+ text = text + text_pos_embed
51
+
52
+ return text
53
+
54
+
55
+ # noised input & masked cond audio embedding
56
+
57
+
58
+ class AudioEmbedding(nn.Module):
59
+ def __init__(self, in_dim, out_dim):
60
+ super().__init__()
61
+ self.linear = nn.Linear(2 * in_dim, out_dim)
62
+ self.conv_pos_embed = ConvPositionEmbedding(out_dim)
63
+
64
+ def forward(self, x: float["b n d"], cond: float["b n d"], drop_audio_cond=False): # noqa: F722
65
+ if drop_audio_cond:
66
+ cond = torch.zeros_like(cond)
67
+ x = torch.cat((x, cond), dim=-1)
68
+ x = self.linear(x)
69
+ x = self.conv_pos_embed(x) + x
70
+ return x
71
+
72
+
73
+ # Transformer backbone using MM-DiT blocks
74
+
75
+
76
+ class MMDiT(nn.Module):
77
+ def __init__(
78
+ self,
79
+ *,
80
+ dim,
81
+ depth=8,
82
+ heads=8,
83
+ dim_head=64,
84
+ dropout=0.1,
85
+ ff_mult=4,
86
+ text_num_embeds=256,
87
+ mel_dim=100,
88
+ ):
89
+ super().__init__()
90
+
91
+ self.time_embed = TimestepEmbedding(dim)
92
+ self.text_embed = TextEmbedding(dim, text_num_embeds)
93
+ self.audio_embed = AudioEmbedding(mel_dim, dim)
94
+
95
+ self.rotary_embed = RotaryEmbedding(dim_head)
96
+
97
+ self.dim = dim
98
+ self.depth = depth
99
+
100
+ self.transformer_blocks = nn.ModuleList(
101
+ [
102
+ MMDiTBlock(
103
+ dim=dim,
104
+ heads=heads,
105
+ dim_head=dim_head,
106
+ dropout=dropout,
107
+ ff_mult=ff_mult,
108
+ context_pre_only=i == depth - 1,
109
+ )
110
+ for i in range(depth)
111
+ ]
112
+ )
113
+ self.norm_out = AdaLayerNormZero_Final(dim) # final modulation
114
+ self.proj_out = nn.Linear(dim, mel_dim)
115
+
116
+ def forward(
117
+ self,
118
+ x: float["b n d"], # noised input audio # noqa: F722
119
+ cond: float["b n d"], # masked cond audio # noqa: F722
120
+ text: int["b nt"], # text # noqa: F722
121
+ time: float["b"] | float[""], # time step # noqa: F821 F722
122
+ drop_audio_cond, # cfg for cond audio
123
+ drop_text, # cfg for text
124
+ mask: bool["b n"] | None = None, # noqa: F722
125
+ ):
126
+ batch = x.shape[0]
127
+ if time.ndim == 0:
128
+ time = time.repeat(batch)
129
+
130
+ # t: conditioning (time), c: context (text + masked cond audio), x: noised input audio
131
+ t = self.time_embed(time)
132
+ c = self.text_embed(text, drop_text=drop_text)
133
+ x = self.audio_embed(x, cond, drop_audio_cond=drop_audio_cond)
134
+
135
+ seq_len = x.shape[1]
136
+ text_len = text.shape[1]
137
+ rope_audio = self.rotary_embed.forward_from_seq_len(seq_len)
138
+ rope_text = self.rotary_embed.forward_from_seq_len(text_len)
139
+
140
+ for block in self.transformer_blocks:
141
+ c, x = block(x, c, t, mask=mask, rope=rope_audio, c_rope=rope_text)
142
+
143
+ x = self.norm_out(x, t)
144
+ output = self.proj_out(x)
145
+
146
+ return output
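
The keyword argument `context_pre_only=i == depth - 1` in the block list is easy to misread: the comparison is evaluated first, so the flag is `True` only for the last block. A tiny sketch of that evaluation follows; `context_pre_only_flags` is a hypothetical helper, and what the flag does inside `MMDiTBlock` is not shown in this section.

```python
def context_pre_only_flags(depth):
    """Return the value passed as context_pre_only for each block index."""
    return [i == depth - 1 for i in range(depth)]


print(context_pre_only_flags(4))  # [False, False, False, True]
```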
src/f5_tts/model/backbones/unett.py ADDED
@@ -0,0 +1,219 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+ from typing import Literal
12
+
13
+ import torch
14
+ from torch import nn
15
+ import torch.nn.functional as F
16
+
17
+ from x_transformers import RMSNorm
18
+ from x_transformers.x_transformers import RotaryEmbedding
19
+
20
+ from f5_tts.model.modules import (
21
+ TimestepEmbedding,
22
+ ConvNeXtV2Block,
23
+ ConvPositionEmbedding,
24
+ Attention,
25
+ AttnProcessor,
26
+ FeedForward,
27
+ precompute_freqs_cis,
28
+ get_pos_embed_indices,
29
+ )
30
+
31
+
32
+ # Text embedding
33
+
34
+
35
+ class TextEmbedding(nn.Module):
36
+ def __init__(self, text_num_embeds, text_dim, conv_layers=0, conv_mult=2):
37
+ super().__init__()
38
+ self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
39
+
40
+ if conv_layers > 0:
41
+ self.extra_modeling = True
42
+ self.precompute_max_pos = 4096 # ~44s of 24khz audio
43
+ self.register_buffer("freqs_cis", precompute_freqs_cis(text_dim, self.precompute_max_pos), persistent=False)
44
+ self.text_blocks = nn.Sequential(
45
+ *[ConvNeXtV2Block(text_dim, text_dim * conv_mult) for _ in range(conv_layers)]
46
+ )
47
+ else:
48
+ self.extra_modeling = False
49
+
50
+ def forward(self, text: int["b nt"], seq_len, drop_text=False): # noqa: F722
51
+ text = text + 1 # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
52
+ text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
53
+ batch, text_len = text.shape[0], text.shape[1]
54
+ text = F.pad(text, (0, seq_len - text_len), value=0)
55
+
56
+ if drop_text: # cfg for text
57
+ text = torch.zeros_like(text)
58
+
59
+ text = self.text_embed(text) # b n -> b n d
60
+
61
+ # possible extra modeling
62
+ if self.extra_modeling:
63
+ # sinus pos emb
64
+ batch_start = torch.zeros((batch,), dtype=torch.long)
65
+ pos_idx = get_pos_embed_indices(batch_start, seq_len, max_pos=self.precompute_max_pos)
66
+ text_pos_embed = self.freqs_cis[pos_idx]
67
+ text = text + text_pos_embed
68
+
69
+ # convnextv2 blocks
70
+ text = self.text_blocks(text)
71
+
72
+ return text
73
+
74
+
75
+ # noised input audio and context mixing embedding
76
+
77
+
78
+ class InputEmbedding(nn.Module):
79
+ def __init__(self, mel_dim, text_dim, out_dim):
80
+ super().__init__()
81
+ self.proj = nn.Linear(mel_dim * 2 + text_dim, out_dim)
82
+ self.conv_pos_embed = ConvPositionEmbedding(dim=out_dim)
83
+
84
+ def forward(self, x: float["b n d"], cond: float["b n d"], text_embed: float["b n d"], drop_audio_cond=False): # noqa: F722
85
+ if drop_audio_cond: # cfg for cond audio
86
+ cond = torch.zeros_like(cond)
87
+
88
+ x = self.proj(torch.cat((x, cond, text_embed), dim=-1))
89
+ x = self.conv_pos_embed(x) + x
90
+ return x
91
+
92
+
93
+ # Flat UNet Transformer backbone
94
+
95
+
96
+ class UNetT(nn.Module):
97
+ def __init__(
98
+ self,
99
+ *,
100
+ dim,
101
+ depth=8,
102
+ heads=8,
103
+ dim_head=64,
104
+ dropout=0.1,
105
+ ff_mult=4,
106
+ mel_dim=100,
107
+ text_num_embeds=256,
108
+ text_dim=None,
109
+ conv_layers=0,
110
+ skip_connect_type: Literal["add", "concat", "none"] = "concat",
111
+ ):
112
+ super().__init__()
113
+ assert depth % 2 == 0, "UNet-Transformer's depth should be even."
114
+
115
+ self.time_embed = TimestepEmbedding(dim)
116
+ if text_dim is None:
117
+ text_dim = mel_dim
118
+ self.text_embed = TextEmbedding(text_num_embeds, text_dim, conv_layers=conv_layers)
119
+ self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
120
+
121
+ self.rotary_embed = RotaryEmbedding(dim_head)
122
+
123
+ # transformer layers & skip connections
124
+
125
+ self.dim = dim
126
+ self.skip_connect_type = skip_connect_type
127
+ needs_skip_proj = skip_connect_type == "concat"
128
+
129
+ self.depth = depth
130
+ self.layers = nn.ModuleList([])
131
+
132
+ for idx in range(depth):
133
+ is_later_half = idx >= (depth // 2)
134
+
135
+ attn_norm = RMSNorm(dim)
136
+ attn = Attention(
137
+ processor=AttnProcessor(),
138
+ dim=dim,
139
+ heads=heads,
140
+ dim_head=dim_head,
141
+ dropout=dropout,
142
+ )
143
+
144
+ ff_norm = RMSNorm(dim)
145
+ ff = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
146
+
147
+ skip_proj = nn.Linear(dim * 2, dim, bias=False) if needs_skip_proj and is_later_half else None
148
+
149
+ self.layers.append(
150
+ nn.ModuleList(
151
+ [
152
+ skip_proj,
153
+ attn_norm,
154
+ attn,
155
+ ff_norm,
156
+ ff,
157
+ ]
158
+ )
159
+ )
160
+
161
+ self.norm_out = RMSNorm(dim)
162
+ self.proj_out = nn.Linear(dim, mel_dim)
163
+
164
+ def forward(
165
+ self,
166
+ x: float["b n d"], # noised input audio # noqa: F722
167
+ cond: float["b n d"], # masked cond audio # noqa: F722
168
+ text: int["b nt"], # text # noqa: F722
169
+ time: float["b"] | float[""], # time step # noqa: F821 F722
170
+ drop_audio_cond, # cfg for cond audio
171
+ drop_text, # cfg for text
172
+ mask: bool["b n"] | None = None, # noqa: F722
173
+ ):
174
+ batch, seq_len = x.shape[0], x.shape[1]
175
+ if time.ndim == 0:
176
+ time = time.repeat(batch)
177
+
178
+ # t: conditioning time, c: context (text + masked cond audio), x: noised input audio
179
+ t = self.time_embed(time)
180
+ text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
181
+ x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
182
+
183
+ # prepend time t to input x, [b n d] -> [b n+1 d]
184
+ x = torch.cat([t.unsqueeze(1), x], dim=1) # pack t to x
185
+ if mask is not None:
186
+ mask = F.pad(mask, (1, 0), value=1)
187
+
188
+ rope = self.rotary_embed.forward_from_seq_len(seq_len + 1)
189
+
190
+ # flat unet transformer
191
+ skip_connect_type = self.skip_connect_type
192
+ skips = []
193
+ for idx, (maybe_skip_proj, attn_norm, attn, ff_norm, ff) in enumerate(self.layers):
194
+ layer = idx + 1
195
+
196
+ # skip connection logic
197
+ is_first_half = layer <= (self.depth // 2)
198
+ is_later_half = not is_first_half
199
+
200
+ if is_first_half:
201
+ skips.append(x)
202
+
203
+ if is_later_half:
204
+ skip = skips.pop()
205
+ if skip_connect_type == "concat":
206
+ x = torch.cat((x, skip), dim=-1)
207
+ x = maybe_skip_proj(x)
208
+ elif skip_connect_type == "add":
209
+ x = x + skip
210
+
211
+ # attention and feedforward blocks
212
+ x = attn(attn_norm(x), rope=rope, mask=mask) + x
213
+ x = ff(ff_norm(x)) + x
214
+
215
+ assert len(skips) == 0
216
+
217
+ x = self.norm_out(x)[:, 1:, :] # unpack t from x
218
+
219
+ return self.proj_out(x)
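
The push/pop logic in `UNetT.forward` pairs each layer in the second half with a layer in the first half, innermost first. The sketch below reproduces only that bookkeeping with 1-based layer numbers instead of tensors; `skip_pairs` is a hypothetical name used for illustration.

```python
def skip_pairs(depth):
    """Return (later_layer, earlier_layer) pairs linked by a skip connection."""
    assert depth % 2 == 0, "depth should be even"
    stack, pairs = [], []
    for layer in range(1, depth + 1):
        if layer <= depth // 2:
            stack.append(layer)                # first half: remember this layer's input
        else:
            pairs.append((layer, stack.pop())) # later half: consume the most recent skip
    assert not stack                           # every stored skip is used exactly once
    return pairs


print(skip_pairs(4))  # [(3, 2), (4, 1)]
```

Note that the stored tensor is the input to the earlier layer, since `skips.append(x)` runs before that layer's attention and feed-forward blocks.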
src/f5_tts/model/cfm.py ADDED
@@ -0,0 +1,287 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+ from typing import Callable
12
+ from random import random
13
+
14
+ import torch
15
+ from torch import nn
16
+ import torch.nn.functional as F
17
+ from torch.nn.utils.rnn import pad_sequence
18
+
19
+ from torchdiffeq import odeint
20
+
21
+ from f5_tts.model.modules import MelSpec
22
+ from f5_tts.model.utils import (
23
+ default,
24
+ exists,
25
+ list_str_to_idx,
26
+ list_str_to_tensor,
27
+ lens_to_mask,
28
+ mask_from_frac_lengths,
29
+ )
30
+
31
+
32
+ class CFM(nn.Module):
33
+ def __init__(
34
+ self,
35
+ transformer: nn.Module,
36
+ sigma=0.0,
37
+ odeint_kwargs: dict = dict(
38
+ # atol = 1e-5,
39
+ # rtol = 1e-5,
40
+ method="euler" # 'midpoint'
41
+ ),
42
+ audio_drop_prob=0.3,
43
+ cond_drop_prob=0.2,
44
+ num_channels=None,
45
+ mel_spec_module: nn.Module | None = None,
46
+ mel_spec_kwargs: dict = dict(),
47
+ frac_lengths_mask: tuple[float, float] = (0.7, 1.0),
48
+ vocab_char_map: dict[str, int] | None = None,
49
+ ):
50
+ super().__init__()
51
+
52
+ self.frac_lengths_mask = frac_lengths_mask
53
+
54
+ # mel spec
55
+ self.mel_spec = default(mel_spec_module, MelSpec(**mel_spec_kwargs))
56
+ num_channels = default(num_channels, self.mel_spec.n_mel_channels)
57
+ self.num_channels = num_channels
58
+
59
+ # classifier-free guidance
60
+ self.audio_drop_prob = audio_drop_prob
61
+ self.cond_drop_prob = cond_drop_prob
62
+
63
+ # transformer
64
+ self.transformer = transformer
65
+ dim = transformer.dim
66
+ self.dim = dim
67
+
68
+ # conditional flow related
69
+ self.sigma = sigma
70
+
71
+ # sampling related
72
+ self.odeint_kwargs = odeint_kwargs
73
+
74
+ # vocab map for tokenization
75
+ self.vocab_char_map = vocab_char_map
76
+
77
+ @property
78
+ def device(self):
79
+ return next(self.parameters()).device
80
+
81
+ @torch.no_grad()
82
+ def sample(
83
+ self,
84
+ cond: float["b n d"] | float["b nw"], # noqa: F722
85
+ text: int["b nt"] | list[str], # noqa: F722
86
+ duration: int | int["b"], # noqa: F821
87
+ *,
88
+ lens: int["b"] | None = None, # noqa: F821
89
+ steps=32,
90
+ cfg_strength=1.0,
91
+ sway_sampling_coef=None,
92
+ seed: int | None = None,
93
+ max_duration=4096,
94
+ vocoder: Callable[[float["b d n"]], float["b nw"]] | None = None, # noqa: F722
95
+ no_ref_audio=False,
96
+ duplicate_test=False,
97
+ t_inter=0.1,
98
+ edit_mask=None,
99
+ ):
100
+ self.eval()
101
+
102
+ if next(self.parameters()).dtype == torch.float16:
103
+ cond = cond.half()
104
+
105
+ # raw wave
106
+
107
+ if cond.ndim == 2:
108
+ cond = self.mel_spec(cond)
109
+ cond = cond.permute(0, 2, 1)
110
+ assert cond.shape[-1] == self.num_channels
111
+
112
+ batch, cond_seq_len, device = *cond.shape[:2], cond.device
113
+ if not exists(lens):
114
+ lens = torch.full((batch,), cond_seq_len, device=device, dtype=torch.long)
115
+
116
+ # text
117
+
118
+ if isinstance(text, list):
119
+ if exists(self.vocab_char_map):
120
+ text = list_str_to_idx(text, self.vocab_char_map).to(device)
121
+ else:
122
+ text = list_str_to_tensor(text).to(device)
123
+ assert text.shape[0] == batch
124
+
125
+ if exists(text):
126
+ text_lens = (text != -1).sum(dim=-1)
127
+ lens = torch.maximum(text_lens, lens) # make sure lengths are at least those of the text characters
128
+
129
+ # duration
130
+
131
+ cond_mask = lens_to_mask(lens)
132
+ if edit_mask is not None:
133
+ cond_mask = cond_mask & edit_mask
134
+
135
+ if isinstance(duration, int):
136
+ duration = torch.full((batch,), duration, device=device, dtype=torch.long)
137
+
138
+ duration = torch.maximum(lens + 1, duration) # just add one token so something is generated
139
+ duration = duration.clamp(max=max_duration)
140
+ max_duration = duration.amax()
141
+
142
+ # duplicate test corner for inner time step observation
143
+ if duplicate_test:
144
+ test_cond = F.pad(cond, (0, 0, cond_seq_len, max_duration - 2 * cond_seq_len), value=0.0)
145
+
146
+ cond = F.pad(cond, (0, 0, 0, max_duration - cond_seq_len), value=0.0)
147
+ cond_mask = F.pad(cond_mask, (0, max_duration - cond_mask.shape[-1]), value=False)
148
+ cond_mask = cond_mask.unsqueeze(-1)
149
+ step_cond = torch.where(
150
+ cond_mask, cond, torch.zeros_like(cond)
151
+ ) # allow direct control (cut cond audio) with lens passed in
152
+
153
+ if batch > 1:
154
+ mask = lens_to_mask(duration)
155
+ else: # save memory and speed up, as single inference needs no mask currently
156
+ mask = None
157
+
158
+ # test for no ref audio
159
+ if no_ref_audio:
160
+ cond = torch.zeros_like(cond)
161
+
162
+ # neural ode
163
+
164
+ def fn(t, x):
165
+ # at each step, conditioning is fixed
166
+ # step_cond = torch.where(cond_mask, cond, torch.zeros_like(cond))
167
+
168
+ # predict flow
169
+ pred = self.transformer(
170
+ x=x, cond=step_cond, text=text, time=t, mask=mask, drop_audio_cond=False, drop_text=False
171
+ )
172
+ if cfg_strength < 1e-5:
173
+ return pred
174
+
175
+ null_pred = self.transformer(
176
+ x=x, cond=step_cond, text=text, time=t, mask=mask, drop_audio_cond=True, drop_text=True
177
+ )
178
+ return pred + (pred - null_pred) * cfg_strength
179
+
180
+ # noise input
181
+ # seed per sample so that batch inference gives the same result for different batch sizes, including single inference
182
+ # some difference may still remain, possibly due to convolutional layers
183
+ y0 = []
184
+ for dur in duration:
185
+ if exists(seed):
186
+ torch.manual_seed(seed)
187
+ y0.append(torch.randn(dur, self.num_channels, device=self.device, dtype=step_cond.dtype))
188
+ y0 = pad_sequence(y0, padding_value=0, batch_first=True)
189
+
190
+ t_start = 0
191
+
192
+ # duplicate test corner for inner time step observation
193
+ if duplicate_test:
194
+ t_start = t_inter
195
+ y0 = (1 - t_start) * y0 + t_start * test_cond
196
+ steps = int(steps * (1 - t_start))
197
+
198
+ t = torch.linspace(t_start, 1, steps, device=self.device, dtype=step_cond.dtype)
199
+ if sway_sampling_coef is not None:
200
+ t = t + sway_sampling_coef * (torch.cos(torch.pi / 2 * t) - 1 + t)
201
+
202
+ trajectory = odeint(fn, y0, t, **self.odeint_kwargs)
203
+
204
+ sampled = trajectory[-1]
205
+ out = sampled
206
+ out = torch.where(cond_mask, cond, out)
207
+
208
+ if exists(vocoder):
209
+ out = out.permute(0, 2, 1)
210
+ out = vocoder(out)
211
+
212
+ return out, trajectory
213
+
214
+ def forward(
215
+ self,
216
+ inp: float["b n d"] | float["b nw"], # mel or raw wave # noqa: F722
217
+ text: int["b nt"] | list[str], # noqa: F722
218
+ *,
219
+ lens: int["b"] | None = None, # noqa: F821
220
+ noise_scheduler: str | None = None,
221
+ ):
222
+ # handle raw wave
223
+ if inp.ndim == 2:
224
+ inp = self.mel_spec(inp)
225
+ inp = inp.permute(0, 2, 1)
226
+ assert inp.shape[-1] == self.num_channels
227
+
228
+ batch, seq_len, dtype, device, _σ1 = *inp.shape[:2], inp.dtype, self.device, self.sigma
229
+
230
+ # handle text as string
231
+ if isinstance(text, list):
232
+ if exists(self.vocab_char_map):
233
+ text = list_str_to_idx(text, self.vocab_char_map).to(device)
234
+ else:
235
+ text = list_str_to_tensor(text).to(device)
236
+ assert text.shape[0] == batch
237
+
238
+ # lens and mask
239
+ if not exists(lens):
240
+ lens = torch.full((batch,), seq_len, device=device)
241
+
242
+ mask = lens_to_mask(lens, length=seq_len) # useless here, as collate_fn will pad to max length in batch
243
+
244
+ # get a random span to mask out for training conditionally
245
+ frac_lengths = torch.zeros((batch,), device=self.device).float().uniform_(*self.frac_lengths_mask)
246
+ rand_span_mask = mask_from_frac_lengths(lens, frac_lengths)
247
+
248
+ if exists(mask):
249
+ rand_span_mask &= mask
250
+
251
+ # mel is x1
252
+ x1 = inp
253
+
254
+ # x0 is gaussian noise
255
+ x0 = torch.randn_like(x1)
256
+
257
+ # time step
258
+ time = torch.rand((batch,), dtype=dtype, device=self.device)
259
+ # TODO. noise_scheduler
260
+
261
+ # sample xt (φ_t(x) in the paper)
262
+ t = time.unsqueeze(-1).unsqueeze(-1)
263
+ φ = (1 - t) * x0 + t * x1
264
+ flow = x1 - x0
265
+
266
+ # only predict what is within the random mask span for infilling
267
+ cond = torch.where(rand_span_mask[..., None], torch.zeros_like(x1), x1)
268
+
269
+ # transformer and cfg training with a drop rate
270
+ drop_audio_cond = random() < self.audio_drop_prob # p_drop in voicebox paper
271
+ if random() < self.cond_drop_prob: # p_uncond in voicebox paper
272
+ drop_audio_cond = True
273
+ drop_text = True
274
+ else:
275
+ drop_text = False
276
+
277
+ # to rigorously mask out padding, record the mask in collate_fn in dataset.py and pass it in here
278
+ # adding mask will use more memory, thus also need to adjust batchsampler with scaled down threshold for long sequences
279
+ pred = self.transformer(
280
+ x=φ, cond=cond, text=text, time=time, drop_audio_cond=drop_audio_cond, drop_text=drop_text
281
+ )
282
+
283
+ # flow matching loss
284
+ loss = F.mse_loss(pred, flow, reduction="none")
285
+ loss = loss[rand_span_mask]
286
+
287
+ return loss.mean(), cond, pred
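
Two small formulas in `CFM.sample` are worth seeing in isolation: the sway sampling warp applied to the time grid, and the classifier-free guidance combination inside `fn`. The sketch below restates both with plain floats, assuming `steps >= 2`; `time_grid` and `cfg_combine` are hypothetical names, and the coefficient `-1.0` is only an illustrative value, not one taken from this section.

```python
import math


def time_grid(steps, t_start=0.0, sway_sampling_coef=None):
    """Evenly spaced times in [t_start, 1], optionally warped by sway sampling."""
    ts = [t_start + (1.0 - t_start) * i / (steps - 1) for i in range(steps)]
    if sway_sampling_coef is not None:
        # same expression as in sample(): t + s * (cos(pi/2 * t) - 1 + t)
        ts = [t + sway_sampling_coef * (math.cos(math.pi / 2 * t) - 1 + t) for t in ts]
    return ts


def cfg_combine(pred, null_pred, cfg_strength):
    """Extrapolate from the unconditional prediction towards the conditional one."""
    return pred + (pred - null_pred) * cfg_strength


uniform = time_grid(5)
warped = time_grid(5, sway_sampling_coef=-1.0)
print([round(t, 3) for t in uniform])
print([round(t, 3) for t in warped])
```

The warp leaves the endpoints at 0 and 1 (up to floating-point error) and, for a negative coefficient, moves interior points towards 0, so more solver steps are spent early in the trajectory. A grid of `steps` points gives `steps - 1` intervals for a fixed-step solver.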
src/f5_tts/model/dataset.py ADDED
@@ -0,0 +1,296 @@
1
+ import json
2
+ import random
3
+ from importlib.resources import files
4
+ from tqdm import tqdm
5
+
6
+ import torch
7
+ import torch.nn.functional as F
8
+ import torchaudio
9
+ from torch import nn
10
+ from torch.utils.data import Dataset, Sampler
11
+ from datasets import load_from_disk
12
+ from datasets import Dataset as Dataset_
13
+
14
+ from f5_tts.model.modules import MelSpec
15
+ from f5_tts.model.utils import default
16
+
17
+
18
+ class HFDataset(Dataset):
19
+ def __init__(
20
+ self,
21
+ hf_dataset: Dataset,
22
+ target_sample_rate=24_000,
23
+ n_mel_channels=100,
24
+ hop_length=256,
25
+ ):
26
+ self.data = hf_dataset
27
+ self.target_sample_rate = target_sample_rate
28
+ self.hop_length = hop_length
29
+ self.mel_spectrogram = MelSpec(
30
+ target_sample_rate=target_sample_rate, n_mel_channels=n_mel_channels, hop_length=hop_length
31
+ )
32
+
33
+ def get_frame_len(self, index):
34
+ row = self.data[index]
35
+ audio = row["audio"]["array"]
36
+ sample_rate = row["audio"]["sampling_rate"]
37
+ return audio.shape[-1] / sample_rate * self.target_sample_rate / self.hop_length
38
+
39
+ def __len__(self):
40
+ return len(self.data)
41
+
42
+ def __getitem__(self, index):
43
+ row = self.data[index]
44
+ audio = row["audio"]["array"]
45
+
46
+ # logger.info(f"Audio shape: {audio.shape}")
47
+
48
+ sample_rate = row["audio"]["sampling_rate"]
49
+ duration = audio.shape[-1] / sample_rate
50
+
51
+ if duration > 30 or duration < 0.3:
52
+ return self.__getitem__((index + 1) % len(self.data))
53
+
54
+ audio_tensor = torch.from_numpy(audio).float()
55
+
56
+ if sample_rate != self.target_sample_rate:
57
+ resampler = torchaudio.transforms.Resample(sample_rate, self.target_sample_rate)
58
+ audio_tensor = resampler(audio_tensor)
59
+
60
+ audio_tensor = audio_tensor.unsqueeze(0) # 't -> 1 t'
61
+
62
+ mel_spec = self.mel_spectrogram(audio_tensor)
63
+
64
+ mel_spec = mel_spec.squeeze(0) # '1 d t -> d t'
65
+
66
+ text = row["text"]
67
+
68
+ return dict(
69
+ mel_spec=mel_spec,
70
+ text=text,
71
+ )
72
+
73
+
74
+ class CustomDataset(Dataset):
75
+ def __init__(
76
+ self,
77
+ custom_dataset: Dataset,
78
+ durations=None,
79
+ target_sample_rate=24_000,
80
+ hop_length=256,
81
+ n_mel_channels=100,
82
+ preprocessed_mel=False,
83
+ mel_spec_module: nn.Module | None = None,
84
+ ):
85
+ self.data = custom_dataset
86
+ self.durations = durations
87
+ self.target_sample_rate = target_sample_rate
88
+ self.hop_length = hop_length
89
+ self.preprocessed_mel = preprocessed_mel
90
+
91
+ if not preprocessed_mel:
92
+ self.mel_spectrogram = default(
93
+ mel_spec_module,
94
+ MelSpec(
95
+ target_sample_rate=target_sample_rate,
96
+ hop_length=hop_length,
97
+ n_mel_channels=n_mel_channels,
98
+ ),
99
+ )
100
+
101
+ def get_frame_len(self, index):
102
+ if (
103
+ self.durations is not None
104
+ ): # Please make sure the separately provided durations are correct, otherwise 99.99% OOM
105
+ return self.durations[index] * self.target_sample_rate / self.hop_length
106
+ return self.data[index]["duration"] * self.target_sample_rate / self.hop_length
107
+
108
+ def __len__(self):
109
+ return len(self.data)
110
+
111
+ def __getitem__(self, index):
112
+ row = self.data[index]
113
+ audio_path = row["audio_path"]
114
+ text = row["text"]
115
+ duration = row["duration"]
116
+
117
+ if self.preprocessed_mel:
118
+ mel_spec = torch.tensor(row["mel_spec"])
119
+
120
+ else:
121
+ audio, source_sample_rate = torchaudio.load(audio_path)
122
+ if audio.shape[0] > 1:
123
+ audio = torch.mean(audio, dim=0, keepdim=True)
124
+
125
+ if duration > 30 or duration < 0.3:
126
+ return self.__getitem__((index + 1) % len(self.data))
127
+
128
+ if source_sample_rate != self.target_sample_rate:
129
+ resampler = torchaudio.transforms.Resample(source_sample_rate, self.target_sample_rate)
130
+ audio = resampler(audio)
131
+
132
+ mel_spec = self.mel_spectrogram(audio)
133
+ mel_spec = mel_spec.squeeze(0) # '1 d t -> d t'
134
+
135
+ return dict(
136
+ mel_spec=mel_spec,
137
+ text=text,
138
+ )
139
+
140
+
141
+ # Dynamic Batch Sampler
142
+
143
+
144
+ class DynamicBatchSampler(Sampler[list[int]]):
145
+ """Extension of Sampler that will do the following:
146
+ 1. Change the batch size (essentially number of sequences)
147
+ in a batch to ensure that the total number of frames are less
148
+ than a certain threshold.
149
+ 2. Make sure the padding efficiency in the batch is high.
150
+ """
151
+
152
+ def __init__(
153
+ self, sampler: Sampler[int], frames_threshold: int, max_samples=0, random_seed=None, drop_last: bool = False
154
+ ):
155
+ self.sampler = sampler
156
+ self.frames_threshold = frames_threshold
157
+ self.max_samples = max_samples
158
+
159
+ indices, batches = [], []
160
+ data_source = self.sampler.data_source
161
+
162
+ for idx in tqdm(
163
+ self.sampler, desc="Sorting with sampler... if slow, check whether dataset is provided with duration"
164
+ ):
165
+ indices.append((idx, data_source.get_frame_len(idx)))
166
+ indices.sort(key=lambda elem: elem[1])
167
+
168
+ batch = []
169
+ batch_frames = 0
170
+ for idx, frame_len in tqdm(
171
+ indices, desc=f"Creating dynamic batches with {frames_threshold} audio frames per gpu"
172
+ ):
173
+ if batch_frames + frame_len <= self.frames_threshold and (max_samples == 0 or len(batch) < max_samples):
174
+ batch.append(idx)
175
+ batch_frames += frame_len
176
+ else:
177
+ if len(batch) > 0:
178
+ batches.append(batch)
179
+ if frame_len <= self.frames_threshold:
180
+ batch = [idx]
181
+ batch_frames = frame_len
182
+ else:
183
+ batch = []
184
+ batch_frames = 0
185
+
186
+ if not drop_last and len(batch) > 0:
187
+ batches.append(batch)
188
+
189
+ del indices
190
+
191
+ # to get different batches between epochs, set a seed here and log it in the checkpoint
192
+ # in multi-GPU training the batches on each GPU do not change between epochs, but the overall minibatch formed across GPUs does
193
+ # e.g. for epoch n, use (random_seed + n)
194
+ random.seed(random_seed)
195
+ random.shuffle(batches)
196
+
197
+ self.batches = batches
198
+
199
+ def __iter__(self):
200
+ return iter(self.batches)
201
+
202
+ def __len__(self):
203
+ return len(self.batches)
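Review note: the batching rule in `DynamicBatchSampler.__init__` can be read in isolation as a greedy pass over length-sorted items. The sketch below is a minimal pure-Python restatement under stated assumptions: `make_dynamic_batches` and the `frame_lens` list are hypothetical stand-ins for the sampler and `data_source.get_frame_len(idx)`, and the final shuffle and `drop_last` handling are left out.

```python
def make_dynamic_batches(frame_lens, frames_threshold, max_samples=0):
    """Group item indices so that each batch stays within a frame budget.

    frame_lens is a hypothetical list of per-item frame counts, standing in
    for data_source.get_frame_len(idx) in the diff above.
    """
    # Sort by length so neighbouring items are similar in size (less padding).
    order = sorted(range(len(frame_lens)), key=lambda i: frame_lens[i])

    batches, batch, batch_frames = [], [], 0
    for idx in order:
        n = frame_lens[idx]
        fits = batch_frames + n <= frames_threshold
        has_room = max_samples == 0 or len(batch) < max_samples
        if fits and has_room:
            batch.append(idx)
            batch_frames += n
            continue
        # Current batch is full: close it and start a new one.
        if batch:
            batches.append(batch)
        if n <= frames_threshold:
            batch, batch_frames = [idx], n
        else:
            # An item longer than the whole budget is skipped entirely.
            batch, batch_frames = [], 0
    if batch:
        batches.append(batch)
    return batches


lens = [50, 300, 120, 80, 1200, 200]
batches = make_dynamic_batches(lens, frames_threshold=400, max_samples=3)
print(batches)
```

With these example lengths the 1200-frame item never appears in any batch, which mirrors the `else` branch in the diff that resets `batch` when a single item exceeds `frames_threshold`.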
204
+
205
+
206
+ # Load dataset
207
+
208
+
209
+ def load_dataset(
210
+ dataset_name: str,
211
+ tokenizer: str = "pinyin",
212
+ dataset_type: str = "CustomDataset",
213
+ audio_type: str = "raw",
214
+ mel_spec_module: nn.Module | None = None,
215
+ mel_spec_kwargs: dict = dict(),
216
+ ) -> CustomDataset | HFDataset:
217
+ """
218
+ dataset_type - "CustomDataset" if you want to use tokenizer name and default data path to load for train_dataset
219
+ - "CustomDatasetPath" if you just want to pass the full path to a preprocessed dataset without relying on tokenizer
220
+ """
221
+
222
+ print("Loading dataset ...")
223
+
224
+ if dataset_type == "CustomDataset":
225
+ rel_data_path = str(files("f5_tts").joinpath(f"../../data/{dataset_name}_{tokenizer}"))
226
+ if audio_type == "raw":
227
+ try:
228
+ train_dataset = load_from_disk(f"{rel_data_path}/raw")
229
+ except: # noqa: E722
230
+ train_dataset = Dataset_.from_file(f"{rel_data_path}/raw.arrow")
231
+ preprocessed_mel = False
232
+ elif audio_type == "mel":
233
+ train_dataset = Dataset_.from_file(f"{rel_data_path}/mel.arrow")
234
+ preprocessed_mel = True
235
+ with open(f"{rel_data_path}/duration.json", "r", encoding="utf-8") as f:
236
+ data_dict = json.load(f)
237
+ durations = data_dict["duration"]
238
+ train_dataset = CustomDataset(
239
+ train_dataset,
240
+ durations=durations,
241
+ preprocessed_mel=preprocessed_mel,
242
+ mel_spec_module=mel_spec_module,
243
+ **mel_spec_kwargs,
244
+ )
245
+
246
+ elif dataset_type == "CustomDatasetPath":
247
+ try:
248
+ train_dataset = load_from_disk(f"{dataset_name}/raw")
249
+ except: # noqa: E722
250
+ train_dataset = Dataset_.from_file(f"{dataset_name}/raw.arrow")
251
+
252
+ with open(f"{dataset_name}/duration.json", "r", encoding="utf-8") as f:
253
+ data_dict = json.load(f)
254
+ durations = data_dict["duration"]
255
+ train_dataset = CustomDataset(
256
+ train_dataset, durations=durations, preprocessed_mel=False, **mel_spec_kwargs  # this branch loads raw audio; preprocessed_mel was otherwise unbound here
257
+ )
258
+
259
+ elif dataset_type == "HFDataset":
260
+ print(
261
+ "You should manually modify the path of the Hugging Face dataset to suit your needs.\n"
262
+ + "The corresponding script may also need changes, since different datasets may have different formats."
263
+ )
264
+ pre, post = dataset_name.split("_")
265
+ train_dataset = HFDataset(
266
+ load_dataset(f"{pre}/{pre}", split=f"train.{post}", cache_dir=str(files("f5_tts").joinpath("../../data"))),
267
+ )
268
+
269
+ return train_dataset
270
+
271
+
272
+ # collation
273
+
274
+
275
+ def collate_fn(batch):
276
+ mel_specs = [item["mel_spec"].squeeze(0) for item in batch]
277
+ mel_lengths = torch.LongTensor([spec.shape[-1] for spec in mel_specs])
278
+ max_mel_length = mel_lengths.amax()
279
+
280
+ padded_mel_specs = []
281
+ for spec in mel_specs: # TODO. maybe records mask for attention here
282
+ padding = (0, max_mel_length - spec.size(-1))
283
+ padded_spec = F.pad(spec, padding, value=0)
284
+ padded_mel_specs.append(padded_spec)
285
+
286
+ mel_specs = torch.stack(padded_mel_specs)
287
+
288
+ text = [item["text"] for item in batch]
289
+ text_lengths = torch.LongTensor([len(item) for item in text])
290
+
291
+ return dict(
292
+ mel=mel_specs,
293
+ mel_lengths=mel_lengths,
294
+ text=text,
295
+ text_lengths=text_lengths,
296
+ )
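Review note: `collate_fn` right-pads every mel spectrogram to the longest one in the batch and returns the true lengths alongside. The sketch below shows the same idea with plain Python lists instead of tensors; `pad_batch` is a hypothetical helper, and the boolean mask is an assumption about what the `TODO` in `collate_fn` might record, not something the diff implements.

```python
def pad_batch(sequences, pad_value=0.0):
    """Right-pad variable-length sequences to the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # Hypothetical padding mask: True for real frames, False for padding.
    mask = [[t < n for t in range(max_len)] for n in lengths]
    return padded, lengths, mask


frames = [[0.1, 0.2, 0.3], [0.5], [0.7, 0.8]]
padded, lengths, mask = pad_batch(frames)
print(padded)
print(lengths)
```

Keeping `lengths` next to the padded batch is what lets later code tell real frames from padding, which is why `collate_fn` returns `mel_lengths` and `text_lengths`.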
src/f5_tts/model/modules.py ADDED
@@ -0,0 +1,581 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+ from typing import Optional
12
+ import math
13
+
14
+ import torch
15
+ from torch import nn
16
+ import torch.nn.functional as F
17
+ import torchaudio
18
+
19
+ from x_transformers.x_transformers import apply_rotary_pos_emb
20
+
21
+
22
+ # raw wav to mel spec
23
+
24
+
25
+ class MelSpec(nn.Module):
26
+ def __init__(
27
+ self,
28
+ filter_length=1024,
29
+ hop_length=256,
30
+ win_length=1024,
31
+ n_mel_channels=100,
32
+ target_sample_rate=24_000,
33
+ normalize=False,
34
+ power=1,
35
+ norm=None,
36
+ center=True,
37
+ ):
38
+ super().__init__()
39
+ self.n_mel_channels = n_mel_channels
40
+
41
+ self.mel_stft = torchaudio.transforms.MelSpectrogram(
42
+ sample_rate=target_sample_rate,
43
+ n_fft=filter_length,
44
+ win_length=win_length,
45
+ hop_length=hop_length,
46
+ n_mels=n_mel_channels,
47
+ power=power,
48
+ center=center,
49
+ normalized=normalize,
50
+ norm=norm,
51
+ )
52
+
53
+ self.register_buffer("dummy", torch.tensor(0), persistent=False)
54
+
55
+ def forward(self, inp):
56
+ if len(inp.shape) == 3:
57
+ inp = inp.squeeze(1) # 'b 1 nw -> b nw'
58
+
59
+ assert len(inp.shape) == 2
60
+
61
+ if self.dummy.device != inp.device:
62
+ self.to(inp.device)
63
+
64
+ mel = self.mel_stft(inp)
65
+ mel = mel.clamp(min=1e-5).log()
66
+ return mel
67
+
68
+
69
+ # sinusoidal position embedding
70
+
71
+
72
+ class SinusPositionEmbedding(nn.Module):
73
+ def __init__(self, dim):
74
+ super().__init__()
75
+ self.dim = dim
76
+
77
+ def forward(self, x, scale=1000):
78
+ device = x.device
79
+ half_dim = self.dim // 2
80
+ emb = math.log(10000) / (half_dim - 1)
81
+ emb = torch.exp(torch.arange(half_dim, device=device).float() * -emb)
82
+ emb = scale * x.unsqueeze(1) * emb.unsqueeze(0)
83
+ emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
84
+ return emb
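Review note: `SinusPositionEmbedding.forward` maps a scalar (here the flow-matching timestep) to `dim` features, sines in the first half and cosines in the second. A minimal scalar sketch in pure Python, assuming `dim >= 4` so that `half_dim - 1` is non-zero; `sinus_embedding` is a hypothetical name and not part of the repository.

```python
import math


def sinus_embedding(x, dim, scale=1000):
    """Scalar version of the sinusoidal embedding in the diff above."""
    half_dim = dim // 2
    # Frequencies decay geometrically from 1 down to 1/10000.
    step = math.log(10000) / (half_dim - 1)
    freqs = [math.exp(-step * i) for i in range(half_dim)]
    angles = [scale * x * f for f in freqs]
    # Sines first, then cosines, matching torch.cat((sin, cos), dim=-1).
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


emb = sinus_embedding(0.0, dim=8)
print(emb)
```

At `x = 0` every angle is zero, so the sine half is all zeros and the cosine half is all ones, which is a quick sanity check on the layout.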
85
+
86
+
87
+ # convolutional position embedding
88
+
89
+
90
+ class ConvPositionEmbedding(nn.Module):
91
+ def __init__(self, dim, kernel_size=31, groups=16):
92
+ super().__init__()
93
+ assert kernel_size % 2 != 0
94
+ self.conv1d = nn.Sequential(
95
+ nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=kernel_size // 2),
96
+ nn.Mish(),
97
+ nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=kernel_size // 2),
98
+ nn.Mish(),
99
+ )
100
+
101
+ def forward(self, x: float["b n d"], mask: bool["b n"] | None = None): # noqa: F722
102
+ if mask is not None:
103
+ mask = mask[..., None]
104
+ x = x.masked_fill(~mask, 0.0)
105
+
106
+ x = x.permute(0, 2, 1)
107
+ x = self.conv1d(x)
108
+ out = x.permute(0, 2, 1)
109
+
110
+ if mask is not None:
111
+ out = out.masked_fill(~mask, 0.0)
112
+
113
+ return out
114
+
115
+
116
+ # rotary positional embedding related
117
+
118
+
119
+ def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, theta_rescale_factor=1.0):
120
+ # proposed by reddit user bloc97, to rescale rotary embeddings to longer sequence length without fine-tuning
121
+ # has some connection to NTK literature
122
+ # https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
123
+ # https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py
124
+ theta *= theta_rescale_factor ** (dim / (dim - 2))
125
+ freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
126
+ t = torch.arange(end, device=freqs.device) # type: ignore
127
+ freqs = torch.outer(t, freqs).float() # type: ignore
128
+ freqs_cos = torch.cos(freqs) # real part
129
+ freqs_sin = torch.sin(freqs) # imaginary part
130
+ return torch.cat([freqs_cos, freqs_sin], dim=-1)
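Review note: the only non-standard step in `precompute_freqs_cis` is the `theta_rescale_factor` line, which enlarges the rotary base before the inverse frequencies are computed. The pure-Python sketch below reproduces just that arithmetic; `rotary_inv_freqs` is a hypothetical helper and the factor of 4.0 is an arbitrary example value.

```python
def rotary_inv_freqs(dim, theta=10000.0, theta_rescale_factor=1.0):
    """Inverse rotary frequencies with the NTK-style base rescaling."""
    # Same rescaling as in the diff: theta *= factor ** (dim / (dim - 2)).
    theta = theta * theta_rescale_factor ** (dim / (dim - 2))
    # One frequency per pair of channels: 1 / theta^(i / dim) for even i.
    return [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]


base = rotary_inv_freqs(64)
scaled = rotary_inv_freqs(64, theta_rescale_factor=4.0)
print(base[:3])
print(scaled[:3])
```

A factor above 1 leaves the first frequency untouched and lowers all the others, so positions rotate more slowly and longer sequences stay within the angle range seen in training.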
131
+
132
+
133
+ def get_pos_embed_indices(start, length, max_pos, scale=1.0):
134
+ # length = length if isinstance(length, int) else length.max()
135
+ scale = scale * torch.ones_like(start, dtype=torch.float32) # in case scale is a scalar
136
+ pos = (
137
+ start.unsqueeze(1)
138
+ + (torch.arange(length, device=start.device, dtype=torch.float32).unsqueeze(0) * scale.unsqueeze(1)).long()
139
+ )
140
+ # avoid extra long error.
141
+ pos = torch.where(pos < max_pos, pos, max_pos - 1)
142
+ return pos
143
+
144
+
145
+ # Global Response Normalization layer (Instance Normalization ?)
146
+
147
+
148
+ class GRN(nn.Module):
149
+ def __init__(self, dim):
150
+ super().__init__()
151
+ self.gamma = nn.Parameter(torch.zeros(1, 1, dim))
152
+ self.beta = nn.Parameter(torch.zeros(1, 1, dim))
153
+
154
+ def forward(self, x):
155
+ Gx = torch.norm(x, p=2, dim=1, keepdim=True)
156
+ Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)
157
+ return self.gamma * (x * Nx) + self.beta + x
158
+
159
+
160
+ # ConvNeXt-V2 Block https://github.com/facebookresearch/ConvNeXt-V2/blob/main/models/convnextv2.py
161
+ # ref: https://github.com/bfs18/e2_tts/blob/main/rfwave/modules.py#L108
162
+
163
+
164
+ class ConvNeXtV2Block(nn.Module):
165
+ def __init__(
166
+ self,
167
+ dim: int,
168
+ intermediate_dim: int,
169
+ dilation: int = 1,
170
+ ):
171
+ super().__init__()
172
+ padding = (dilation * (7 - 1)) // 2
173
+ self.dwconv = nn.Conv1d(
174
+ dim, dim, kernel_size=7, padding=padding, groups=dim, dilation=dilation
175
+ ) # depthwise conv
176
+ self.norm = nn.LayerNorm(dim, eps=1e-6)
177
+ self.pwconv1 = nn.Linear(dim, intermediate_dim) # pointwise/1x1 convs, implemented with linear layers
178
+ self.act = nn.GELU()
179
+ self.grn = GRN(intermediate_dim)
180
+ self.pwconv2 = nn.Linear(intermediate_dim, dim)
181
+
182
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
183
+ residual = x
184
+ x = x.transpose(1, 2) # b n d -> b d n
185
+ x = self.dwconv(x)
186
+ x = x.transpose(1, 2) # b d n -> b n d
187
+ x = self.norm(x)
188
+ x = self.pwconv1(x)
189
+ x = self.act(x)
190
+ x = self.grn(x)
191
+ x = self.pwconv2(x)
192
+ return residual + x
193
+
194
+
195
+ # AdaLayerNormZero
196
+ # return with modulated x for attn input, and params for later mlp modulation
197
+
198
+
199
+ class AdaLayerNormZero(nn.Module):
200
+ def __init__(self, dim):
201
+ super().__init__()
202
+
203
+ self.silu = nn.SiLU()
204
+ self.linear = nn.Linear(dim, dim * 6)
205
+
206
+ self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
207
+
208
+ def forward(self, x, emb=None):
209
+ emb = self.linear(self.silu(emb))
210
+ shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = torch.chunk(emb, 6, dim=1)
211
+
212
+ x = self.norm(x) * (1 + scale_msa[:, None]) + shift_msa[:, None]
213
+ return x, gate_msa, shift_mlp, scale_mlp, gate_mlp
214
+
215
+
216
+ # AdaLayerNormZero for final layer
217
+ # return only with modulated x for attn input, cuz no more mlp modulation
218
+
219
+
220
+ class AdaLayerNormZero_Final(nn.Module):
221
+ def __init__(self, dim):
222
+ super().__init__()
223
+
224
+ self.silu = nn.SiLU()
225
+ self.linear = nn.Linear(dim, dim * 2)
226
+
227
+ self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
228
+
229
+ def forward(self, x, emb):
230
+ emb = self.linear(self.silu(emb))
231
+ scale, shift = torch.chunk(emb, 2, dim=1)
232
+
233
+ x = self.norm(x) * (1 + scale)[:, None, :] + shift[:, None, :]
234
+ return x
235
+
236
+
237
+ # FeedForward
238
+
239
+
240
+ class FeedForward(nn.Module):
241
+ def __init__(self, dim, dim_out=None, mult=4, dropout=0.0, approximate: str = "none"):
242
+ super().__init__()
243
+ inner_dim = int(dim * mult)
244
+ dim_out = dim_out if dim_out is not None else dim
245
+
246
+ activation = nn.GELU(approximate=approximate)
247
+ project_in = nn.Sequential(nn.Linear(dim, inner_dim), activation)
248
+ self.ff = nn.Sequential(project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out))
249
+
250
+ def forward(self, x):
251
+ return self.ff(x)
252
+
253
+
254
+ # Attention with possible joint part
255
+ # modified from diffusers/src/diffusers/models/attention_processor.py
256
+
257
+
258
+ class Attention(nn.Module):
259
+ def __init__(
260
+ self,
261
+ processor: JointAttnProcessor | AttnProcessor,
262
+ dim: int,
263
+ heads: int = 8,
264
+ dim_head: int = 64,
265
+ dropout: float = 0.0,
266
+ context_dim: Optional[int] = None, # if not None -> joint attention
267
+ context_pre_only=None,
268
+ ):
269
+ super().__init__()
270
+
271
+ if not hasattr(F, "scaled_dot_product_attention"):
272
+ raise ImportError("Attention requires PyTorch 2.0. To use it, please upgrade PyTorch to 2.0 or later.")
273
+
274
+ self.processor = processor
275
+
276
+ self.dim = dim
277
+ self.heads = heads
278
+ self.inner_dim = dim_head * heads
279
+ self.dropout = dropout
280
+
281
+ self.context_dim = context_dim
282
+ self.context_pre_only = context_pre_only
283
+
284
+ self.to_q = nn.Linear(dim, self.inner_dim)
285
+ self.to_k = nn.Linear(dim, self.inner_dim)
286
+ self.to_v = nn.Linear(dim, self.inner_dim)
287
+
288
+ if self.context_dim is not None:
289
+ self.to_k_c = nn.Linear(context_dim, self.inner_dim)
290
+ self.to_v_c = nn.Linear(context_dim, self.inner_dim)
291
+ if self.context_pre_only is not None:
292
+ self.to_q_c = nn.Linear(context_dim, self.inner_dim)
293
+
294
+ self.to_out = nn.ModuleList([])
295
+ self.to_out.append(nn.Linear(self.inner_dim, dim))
296
+ self.to_out.append(nn.Dropout(dropout))
297
+
298
+ if self.context_pre_only is not None and not self.context_pre_only:
299
+ self.to_out_c = nn.Linear(self.inner_dim, dim)
300
+
301
+ def forward(
302
+ self,
303
+ x: float["b n d"], # noised input x # noqa: F722
304
+ c: float["b n d"] = None, # context c # noqa: F722
305
+ mask: bool["b n"] | None = None, # noqa: F722
306
+ rope=None, # rotary position embedding for x
307
+ c_rope=None, # rotary position embedding for c
308
+ ) -> torch.Tensor:
309
+ if c is not None:
310
+ return self.processor(self, x, c=c, mask=mask, rope=rope, c_rope=c_rope)
311
+ else:
312
+ return self.processor(self, x, mask=mask, rope=rope)
313
+
314
+
315
+ # Attention processor
316
+
317
+
318
+ class AttnProcessor:
319
+ def __init__(self):
320
+ pass
321
+
322
+ def __call__(
323
+ self,
324
+ attn: Attention,
325
+ x: float["b n d"], # noised input x # noqa: F722
326
+ mask: bool["b n"] | None = None, # noqa: F722
327
+ rope=None, # rotary position embedding
328
+ ) -> torch.FloatTensor:
329
+ batch_size = x.shape[0]
330
+
331
+ # `sample` projections.
332
+ query = attn.to_q(x)
333
+ key = attn.to_k(x)
334
+ value = attn.to_v(x)
335
+
336
+ # apply rotary position embedding
337
+ if rope is not None:
338
+ freqs, xpos_scale = rope
339
+ q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
340
+
341
+ query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
342
+ key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
343
+
344
+ # attention
345
+ inner_dim = key.shape[-1]
346
+ head_dim = inner_dim // attn.heads
347
+ query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
348
+ key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
349
+ value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
350
+
351
+ # mask. e.g. inference got a batch with different target durations, mask out the padding
352
+ if mask is not None:
353
+ attn_mask = mask
354
+ attn_mask = attn_mask.unsqueeze(1).unsqueeze(1) # 'b n -> b 1 1 n'
355
+ attn_mask = attn_mask.expand(batch_size, attn.heads, query.shape[-2], key.shape[-2])
356
+ else:
357
+ attn_mask = None
358
+
359
+ x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
360
+ x = x.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
361
+ x = x.to(query.dtype)
362
+
363
+ # linear proj
364
+ x = attn.to_out[0](x)
365
+ # dropout
366
+ x = attn.to_out[1](x)
367
+
368
+ if mask is not None:
369
+ mask = mask.unsqueeze(-1)
370
+ x = x.masked_fill(~mask, 0.0)
371
+
372
+ return x
373
+
374
+
375
+ # Joint Attention processor for MM-DiT
376
+ # modified from diffusers/src/diffusers/models/attention_processor.py
377
+
378
+
379
+ class JointAttnProcessor:
380
+ def __init__(self):
381
+ pass
382
+
383
+ def __call__(
384
+ self,
385
+ attn: Attention,
386
+ x: float["b n d"], # noised input x # noqa: F722
387
+ c: float["b nt d"] = None, # context c, here text # noqa: F722
388
+ mask: bool["b n"] | None = None, # noqa: F722
389
+ rope=None, # rotary position embedding for x
390
+ c_rope=None, # rotary position embedding for c
391
+ ) -> torch.FloatTensor:
392
+ residual = x
393
+
394
+ batch_size = c.shape[0]
395
+
396
+ # `sample` projections.
397
+ query = attn.to_q(x)
398
+ key = attn.to_k(x)
399
+ value = attn.to_v(x)
400
+
401
+ # `context` projections.
402
+ c_query = attn.to_q_c(c)
403
+ c_key = attn.to_k_c(c)
404
+ c_value = attn.to_v_c(c)
405
+
406
+ # apply rope for context and noised input independently
407
+ if rope is not None:
408
+ freqs, xpos_scale = rope
409
+ q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
410
+ query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
411
+ key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
412
+ if c_rope is not None:
413
+ freqs, xpos_scale = c_rope
414
+ q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
415
+ c_query = apply_rotary_pos_emb(c_query, freqs, q_xpos_scale)
416
+ c_key = apply_rotary_pos_emb(c_key, freqs, k_xpos_scale)
417
+
418
+ # attention
419
+ query = torch.cat([query, c_query], dim=1)
420
+ key = torch.cat([key, c_key], dim=1)
421
+ value = torch.cat([value, c_value], dim=1)
422
+
423
+ inner_dim = key.shape[-1]
424
+ head_dim = inner_dim // attn.heads
425
+ query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
426
+ key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
427
+ value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
428
+
429
+ # mask. e.g. inference got a batch with different target durations, mask out the padding
430
+ if mask is not None:
431
+ attn_mask = F.pad(mask, (0, c.shape[1]), value=True) # no mask for c (text)
432
+ attn_mask = attn_mask.unsqueeze(1).unsqueeze(1) # 'b n -> b 1 1 n'
433
+ attn_mask = attn_mask.expand(batch_size, attn.heads, query.shape[-2], key.shape[-2])
434
+ else:
435
+ attn_mask = None
436
+
437
+ x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
438
+ x = x.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
439
+ x = x.to(query.dtype)
440
+
441
+ # Split the attention outputs.
442
+ x, c = (
443
+ x[:, : residual.shape[1]],
444
+ x[:, residual.shape[1] :],
445
+ )
446
+
447
+ # linear proj
448
+ x = attn.to_out[0](x)
449
+ # dropout
450
+ x = attn.to_out[1](x)
451
+ if not attn.context_pre_only:
452
+ c = attn.to_out_c(c)
453
+
454
+ if mask is not None:
455
+ mask = mask.unsqueeze(-1)
456
+ x = x.masked_fill(~mask, 0.0)
457
+ # c = c.masked_fill(~mask, 0.) # no mask for c (text)
458
+
459
+ return x, c
460
+
461
+
462
+ # DiT Block
463
+
464
+
465
+ class DiTBlock(nn.Module):
466
+ def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1):
467
+ super().__init__()
468
+
469
+ self.attn_norm = AdaLayerNormZero(dim)
470
+ self.attn = Attention(
471
+ processor=AttnProcessor(),
472
+ dim=dim,
473
+ heads=heads,
474
+ dim_head=dim_head,
475
+ dropout=dropout,
476
+ )
477
+
478
+ self.ff_norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
479
+ self.ff = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
480
+
481
+ def forward(self, x, t, mask=None, rope=None): # x: noised input, t: time embedding
482
+ # pre-norm & modulation for attention input
483
+ norm, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.attn_norm(x, emb=t)
484
+
485
+ # attention
486
+ attn_output = self.attn(x=norm, mask=mask, rope=rope)
487
+
488
+ # process attention output for input x
489
+ x = x + gate_msa.unsqueeze(1) * attn_output
490
+
491
+ norm = self.ff_norm(x) * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
492
+ ff_output = self.ff(norm)
493
+ x = x + gate_mlp.unsqueeze(1) * ff_output
494
+
495
+ return x
496
+
497
+
498
+ # MMDiT Block https://arxiv.org/abs/2403.03206
499
+
500
+
501
+ class MMDiTBlock(nn.Module):
502
+ r"""
503
+ modified from diffusers/src/diffusers/models/attention.py
504
+
505
+ notes.
506
+ _c: context related. text, cond, etc. (left part in sd3 fig2.b)
507
+ _x: noised input related. (right part)
508
+ context_pre_only: last layer only do prenorm + modulation cuz no more ffn
509
+ """
510
+
511
+ def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1, context_pre_only=False):
512
+ super().__init__()
513
+
514
+ self.context_pre_only = context_pre_only
515
+
516
+ self.attn_norm_c = AdaLayerNormZero_Final(dim) if context_pre_only else AdaLayerNormZero(dim)
517
+ self.attn_norm_x = AdaLayerNormZero(dim)
518
+ self.attn = Attention(
519
+ processor=JointAttnProcessor(),
520
+ dim=dim,
521
+ heads=heads,
522
+ dim_head=dim_head,
523
+ dropout=dropout,
524
+ context_dim=dim,
525
+ context_pre_only=context_pre_only,
526
+ )
527
+
528
+ if not context_pre_only:
529
+ self.ff_norm_c = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
530
+ self.ff_c = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
531
+ else:
532
+ self.ff_norm_c = None
533
+ self.ff_c = None
534
+ self.ff_norm_x = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
535
+ self.ff_x = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
536
+
537
+ def forward(self, x, c, t, mask=None, rope=None, c_rope=None): # x: noised input, c: context, t: time embedding
538
+ # pre-norm & modulation for attention input
539
+ if self.context_pre_only:
540
+ norm_c = self.attn_norm_c(c, t)
541
+ else:
542
+ norm_c, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.attn_norm_c(c, emb=t)
543
+ norm_x, x_gate_msa, x_shift_mlp, x_scale_mlp, x_gate_mlp = self.attn_norm_x(x, emb=t)
544
+
545
+ # attention
546
+ x_attn_output, c_attn_output = self.attn(x=norm_x, c=norm_c, mask=mask, rope=rope, c_rope=c_rope)
547
+
548
+ # process attention output for context c
549
+ if self.context_pre_only:
550
+ c = None
551
+ else: # if not last layer
552
+ c = c + c_gate_msa.unsqueeze(1) * c_attn_output
553
+
554
+ norm_c = self.ff_norm_c(c) * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]
555
+ c_ff_output = self.ff_c(norm_c)
556
+ c = c + c_gate_mlp.unsqueeze(1) * c_ff_output
557
+
558
+ # process attention output for input x
559
+ x = x + x_gate_msa.unsqueeze(1) * x_attn_output
560
+
561
+ norm_x = self.ff_norm_x(x) * (1 + x_scale_mlp[:, None]) + x_shift_mlp[:, None]
562
+ x_ff_output = self.ff_x(norm_x)
563
+ x = x + x_gate_mlp.unsqueeze(1) * x_ff_output
564
+
565
+ return c, x
566
+
567
+
568
+ # time step conditioning embedding
569
+
570
+
571
+ class TimestepEmbedding(nn.Module):
572
+ def __init__(self, dim, freq_embed_dim=256):
573
+ super().__init__()
574
+ self.time_embed = SinusPositionEmbedding(freq_embed_dim)
575
+ self.time_mlp = nn.Sequential(nn.Linear(freq_embed_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
576
+
577
+ def forward(self, timestep: float["b"]): # noqa: F821
578
+ time_hidden = self.time_embed(timestep)
579
+ time_hidden = time_hidden.to(timestep.dtype)
580
+ time = self.time_mlp(time_hidden) # b d
581
+ return time
src/f5_tts/model/trainer.py ADDED
@@ -0,0 +1,300 @@
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ import gc
5
+ from tqdm import tqdm
6
+ import wandb
7
+
8
+ import torch
9
+ from torch.optim import AdamW
10
+ from torch.utils.data import DataLoader, Dataset, SequentialSampler
11
+ from torch.optim.lr_scheduler import LinearLR, SequentialLR
12
+
13
+ from accelerate import Accelerator
14
+ from accelerate.utils import DistributedDataParallelKwargs
15
+
16
+ from ema_pytorch import EMA
17
+
18
+ from f5_tts.model import CFM
19
+ from f5_tts.model.utils import exists, default
20
+ from f5_tts.model.dataset import DynamicBatchSampler, collate_fn
21
+
22
+
23
+ # trainer
24
+
25
+
26
+ class Trainer:
27
+ def __init__(
28
+ self,
29
+ model: CFM,
30
+ epochs,
31
+ learning_rate,
32
+ num_warmup_updates=20000,
33
+ save_per_updates=1000,
34
+ checkpoint_path=None,
35
+ batch_size=32,
36
+ batch_size_type: str = "sample",
37
+ max_samples=32,
38
+ grad_accumulation_steps=1,
39
+ max_grad_norm=1.0,
40
+ noise_scheduler: str | None = None,
41
+ duration_predictor: torch.nn.Module | None = None,
42
+ wandb_project="test_e2-tts",
43
+ wandb_run_name="test_run",
44
+ wandb_resume_id: str = None,
45
+ last_per_steps=None,
46
+ accelerate_kwargs: dict = dict(),
47
+ ema_kwargs: dict = dict(),
48
+ bnb_optimizer: bool = False,
49
+ ):
50
+ ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
51
+
52
+ logger = "wandb" if wandb.api.api_key else None
53
+ print(f"Using logger: {logger}")
54
+
55
+ self.accelerator = Accelerator(
56
+ log_with=logger,
57
+ kwargs_handlers=[ddp_kwargs],
58
+ gradient_accumulation_steps=grad_accumulation_steps,
59
+ **accelerate_kwargs,
60
+ )
61
+
62
+ if logger == "wandb":
63
+ if exists(wandb_resume_id):
64
+ init_kwargs = {"wandb": {"resume": "allow", "name": wandb_run_name, "id": wandb_resume_id}}
65
+ else:
66
+ init_kwargs = {"wandb": {"resume": "allow", "name": wandb_run_name}}
67
+ self.accelerator.init_trackers(
68
+ project_name=wandb_project,
69
+ init_kwargs=init_kwargs,
70
+ config={
71
+ "epochs": epochs,
72
+ "learning_rate": learning_rate,
73
+ "num_warmup_updates": num_warmup_updates,
74
+ "batch_size": batch_size,
75
+ "batch_size_type": batch_size_type,
76
+ "max_samples": max_samples,
77
+ "grad_accumulation_steps": grad_accumulation_steps,
78
+ "max_grad_norm": max_grad_norm,
79
+ "gpus": self.accelerator.num_processes,
80
+ "noise_scheduler": noise_scheduler,
81
+ },
82
+ )
83
+
84
+ self.model = model
85
+
86
+ if self.is_main:
87
+ self.ema_model = EMA(model, include_online_model=False, **ema_kwargs)
88
+
89
+ self.ema_model.to(self.accelerator.device)
90
+
91
+ self.epochs = epochs
92
+ self.num_warmup_updates = num_warmup_updates
93
+ self.save_per_updates = save_per_updates
94
+ self.last_per_steps = default(last_per_steps, save_per_updates * grad_accumulation_steps)
95
+ self.checkpoint_path = default(checkpoint_path, "ckpts/test_e2-tts")
96
+
97
+ self.batch_size = batch_size
98
+ self.batch_size_type = batch_size_type
99
+ self.max_samples = max_samples
100
+ self.grad_accumulation_steps = grad_accumulation_steps
101
+ self.max_grad_norm = max_grad_norm
102
+
103
+ self.noise_scheduler = noise_scheduler
104
+
105
+ self.duration_predictor = duration_predictor
106
+
107
+ if bnb_optimizer:
108
+ import bitsandbytes as bnb
109
+
110
+ self.optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=learning_rate)
111
+ else:
112
+ self.optimizer = AdamW(model.parameters(), lr=learning_rate)
113
+ self.model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
114
+
115
+ @property
116
+ def is_main(self):
117
+ return self.accelerator.is_main_process
118
+
119
+ def save_checkpoint(self, step, last=False):
120
+ self.accelerator.wait_for_everyone()
121
+ if self.is_main:
122
+ checkpoint = dict(
123
+ model_state_dict=self.accelerator.unwrap_model(self.model).state_dict(),
124
+ optimizer_state_dict=self.accelerator.unwrap_model(self.optimizer).state_dict(),
125
+ ema_model_state_dict=self.ema_model.state_dict(),
126
+ scheduler_state_dict=self.scheduler.state_dict(),
127
+ step=step,
128
+ )
129
+ if not os.path.exists(self.checkpoint_path):
130
+ os.makedirs(self.checkpoint_path)
131
+ if last:
132
+ self.accelerator.save(checkpoint, f"{self.checkpoint_path}/model_last.pt")
133
+ print(f"Saved last checkpoint at step {step}")
134
+ else:
135
+ self.accelerator.save(checkpoint, f"{self.checkpoint_path}/model_{step}.pt")
136
+
137
+ def load_checkpoint(self):
138
+ if (
139
+ not exists(self.checkpoint_path)
140
+ or not os.path.exists(self.checkpoint_path)
141
+ or not os.listdir(self.checkpoint_path)
142
+ ):
143
+ return 0
144
+
145
+ self.accelerator.wait_for_everyone()
146
+ if "model_last.pt" in os.listdir(self.checkpoint_path):
147
+ latest_checkpoint = "model_last.pt"
148
+ else:
149
+ latest_checkpoint = sorted(
150
+ [f for f in os.listdir(self.checkpoint_path) if f.endswith(".pt")],
151
+ key=lambda x: int("".join(filter(str.isdigit, x))),
152
+ )[-1]
153
+ # checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", map_location=self.accelerator.device) # rather use accelerator.load_state ಥ_ಥ
154
+ checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", weights_only=True, map_location="cpu")
155
+
156
+ if self.is_main:
157
+ self.ema_model.load_state_dict(checkpoint["ema_model_state_dict"])
158
+
159
+ if "step" in checkpoint:
160
+ self.accelerator.unwrap_model(self.model).load_state_dict(checkpoint["model_state_dict"])
161
+ self.accelerator.unwrap_model(self.optimizer).load_state_dict(checkpoint["optimizer_state_dict"])
162
+ if self.scheduler:
163
+ self.scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
164
+ step = checkpoint["step"]
165
+ else:
166
+ checkpoint["model_state_dict"] = {
167
+ k.replace("ema_model.", ""): v
168
+ for k, v in checkpoint["ema_model_state_dict"].items()
169
+ if k not in ["initted", "step"]
170
+ }
171
+ self.accelerator.unwrap_model(self.model).load_state_dict(checkpoint["model_state_dict"])
172
+ step = 0
173
+
174
+ del checkpoint
175
+ gc.collect()
176
+ return step
177
+
178
+ def train(self, train_dataset: Dataset, num_workers=16, resumable_with_seed: int = None):
179
+ if exists(resumable_with_seed):
180
+ generator = torch.Generator()
181
+ generator.manual_seed(resumable_with_seed)
182
+ else:
183
+ generator = None
184
+
185
+ if self.batch_size_type == "sample":
186
+ train_dataloader = DataLoader(
187
+ train_dataset,
188
+ collate_fn=collate_fn,
189
+ num_workers=num_workers,
190
+ pin_memory=True,
191
+ persistent_workers=True,
192
+ batch_size=self.batch_size,
193
+ shuffle=True,
194
+ generator=generator,
195
+ )
196
+ elif self.batch_size_type == "frame":
197
+ self.accelerator.even_batches = False
198
+ sampler = SequentialSampler(train_dataset)
199
+ batch_sampler = DynamicBatchSampler(
200
+ sampler, self.batch_size, max_samples=self.max_samples, random_seed=resumable_with_seed, drop_last=False
201
+ )
202
+ train_dataloader = DataLoader(
203
+ train_dataset,
204
+ collate_fn=collate_fn,
205
+ num_workers=num_workers,
206
+ pin_memory=True,
207
+ persistent_workers=True,
208
+ batch_sampler=batch_sampler,
209
+ )
210
+ else:
211
+ raise ValueError(f"batch_size_type must be either 'sample' or 'frame', but received {self.batch_size_type}")
212
+
213
+ # accelerator.prepare() dispatches batches to devices;
214
+ # so the dataloader length computed before prepare() must account for the number of devices
215
+ warmup_steps = (
216
+ self.num_warmup_updates * self.accelerator.num_processes
217
+ ) # keep the number of warmup updates fixed when using accelerate multi-GPU DDP
218
+ # otherwise by default with split_batches=False, warmup steps change with num_processes
219
+ total_steps = len(train_dataloader) * self.epochs / self.grad_accumulation_steps
220
+ decay_steps = total_steps - warmup_steps
221
+ warmup_scheduler = LinearLR(self.optimizer, start_factor=1e-8, end_factor=1.0, total_iters=warmup_steps)
222
+ decay_scheduler = LinearLR(self.optimizer, start_factor=1.0, end_factor=1e-8, total_iters=decay_steps)
223
+ self.scheduler = SequentialLR(
224
+ self.optimizer, schedulers=[warmup_scheduler, decay_scheduler], milestones=[warmup_steps]
225
+ )
226
+ train_dataloader, self.scheduler = self.accelerator.prepare(
227
+ train_dataloader, self.scheduler
228
+ ) # actual steps = 1 gpu steps / gpus
229
+ start_step = self.load_checkpoint()
230
+ global_step = start_step
231
+
232
+ if exists(resumable_with_seed):
233
+ orig_epoch_step = len(train_dataloader)
234
+ skipped_epoch = int(start_step // orig_epoch_step)
235
+ skipped_batch = start_step % orig_epoch_step
236
+ skipped_dataloader = self.accelerator.skip_first_batches(train_dataloader, num_batches=skipped_batch)
237
+ else:
238
+ skipped_epoch = 0
239
+
240
+ for epoch in range(skipped_epoch, self.epochs):
241
+ self.model.train()
242
+ if exists(resumable_with_seed) and epoch == skipped_epoch:
243
+ progress_bar = tqdm(
244
+ skipped_dataloader,
245
+ desc=f"Epoch {epoch+1}/{self.epochs}",
246
+ unit="step",
247
+ disable=not self.accelerator.is_local_main_process,
248
+ initial=skipped_batch,
249
+ total=orig_epoch_step,
250
+ )
251
+ else:
252
+ progress_bar = tqdm(
253
+ train_dataloader,
254
+ desc=f"Epoch {epoch+1}/{self.epochs}",
255
+ unit="step",
256
+ disable=not self.accelerator.is_local_main_process,
257
+ )
258
+
259
+ for batch in progress_bar:
260
+ with self.accelerator.accumulate(self.model):
261
+ text_inputs = batch["text"]
262
+ mel_spec = batch["mel"].permute(0, 2, 1)
263
+ mel_lengths = batch["mel_lengths"]
264
+
265
+ # TODO. add duration predictor training
266
+ if self.duration_predictor is not None and self.accelerator.is_local_main_process:
267
+ dur_loss = self.duration_predictor(mel_spec, lens=batch.get("durations"))
268
+ self.accelerator.log({"duration loss": dur_loss.item()}, step=global_step)
269
+
270
+ loss, cond, pred = self.model(
271
+ mel_spec, text=text_inputs, lens=mel_lengths, noise_scheduler=self.noise_scheduler
272
+ )
273
+ self.accelerator.backward(loss)
274
+
275
+ if self.max_grad_norm > 0 and self.accelerator.sync_gradients:
276
+ self.accelerator.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
277
+
278
+ self.optimizer.step()
279
+ self.scheduler.step()
280
+ self.optimizer.zero_grad()
281
+
282
+ if self.is_main:
283
+ self.ema_model.update()
284
+
285
+ global_step += 1
286
+
287
+ if self.accelerator.is_local_main_process:
288
+ self.accelerator.log({"loss": loss.item(), "lr": self.scheduler.get_last_lr()[0]}, step=global_step)
289
+
290
+ progress_bar.set_postfix(step=str(global_step), loss=loss.item())
291
+
292
+ if global_step % (self.save_per_updates * self.grad_accumulation_steps) == 0:
293
+ self.save_checkpoint(global_step)
294
+
295
+ if global_step % self.last_per_steps == 0:
296
+ self.save_checkpoint(global_step, last=True)
297
+
298
+ self.save_checkpoint(global_step, last=True)
299
+
300
+ self.accelerator.end_training()
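The scheduler built in `train()` chains two `LinearLR` schedules through `SequentialLR`, and the resume branch splits `start_step` into an epoch count and a batch offset. The sketch below reproduces both pieces of arithmetic without PyTorch. `lr_factor` and `resume_position` are hypothetical helper names, and the factor formula is an approximation of what the two schedulers compute, not a call into the library.

```python
def lr_factor(step: int, warmup_steps: int, decay_steps: int, floor: float = 1e-8) -> float:
    """Multiplier applied to the base learning rate at a given scheduler step."""
    if step < warmup_steps:
        # linear warmup: floor -> 1.0
        return floor + (1.0 - floor) * step / warmup_steps
    # linear decay: 1.0 -> floor, clamped once the decay phase is over
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return 1.0 + (floor - 1.0) * progress


def resume_position(start_step: int, steps_per_epoch: int) -> tuple[int, int]:
    """Split a global step into (epochs already finished, batches to skip in the current epoch)."""
    return divmod(start_step, steps_per_epoch)


if __name__ == "__main__":
    for s in (0, 500, 1000, 5500, 10000):
        print(s, lr_factor(s, warmup_steps=1000, decay_steps=9000))
    print(resume_position(2500, 1000))  # (2, 500)
```

The sketch assumes `decay_steps` is positive; a run whose warmup already covers every step would need a guard before the division.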
src/f5_tts/model/utils.py ADDED
@@ -0,0 +1,185 @@
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ import random
5
+ from collections import defaultdict
6
+ from importlib.resources import files
7
+
8
+ import torch
9
+ from torch.nn.utils.rnn import pad_sequence
10
+
11
+ import jieba
12
+ from pypinyin import lazy_pinyin, Style
13
+
14
+
15
+ # seed everything
16
+
17
+
18
+ def seed_everything(seed=0):
19
+ random.seed(seed)
20
+ os.environ["PYTHONHASHSEED"] = str(seed)
21
+ torch.manual_seed(seed)
22
+ torch.cuda.manual_seed(seed)
23
+ torch.cuda.manual_seed_all(seed)
24
+ torch.backends.cudnn.deterministic = True
25
+ torch.backends.cudnn.benchmark = False
26
+
27
+
28
+ # helpers
29
+
30
+
31
+ def exists(v):
32
+ return v is not None
33
+
34
+
35
+ def default(v, d):
36
+ return v if exists(v) else d
37
+
38
+
39
+ # tensor helpers
40
+
41
+
42
+ def lens_to_mask(t: int["b"], length: int | None = None) -> bool["b n"]: # noqa: F722 F821
43
+ if not exists(length):
44
+ length = t.amax()
45
+
46
+ seq = torch.arange(length, device=t.device)
47
+ return seq[None, :] < t[:, None]
48
+
49
+
50
+ def mask_from_start_end_indices(seq_len: int["b"], start: int["b"], end: int["b"]): # noqa: F722 F821
51
+ max_seq_len = seq_len.max().item()
52
+ seq = torch.arange(max_seq_len, device=start.device).long()
53
+ start_mask = seq[None, :] >= start[:, None]
54
+ end_mask = seq[None, :] < end[:, None]
55
+ return start_mask & end_mask
56
+
57
+
58
+ def mask_from_frac_lengths(seq_len: int["b"], frac_lengths: float["b"]): # noqa: F722 F821
59
+ lengths = (frac_lengths * seq_len).long()
60
+ max_start = seq_len - lengths
61
+
62
+ rand = torch.rand_like(frac_lengths)
63
+ start = (max_start * rand).long().clamp(min=0)
64
+ end = start + lengths
65
+
66
+ return mask_from_start_end_indices(seq_len, start, end)
67
+
68
+
69
+ def maybe_masked_mean(t: float["b n d"], mask: bool["b n"] = None) -> float["b d"]: # noqa: F722
70
+ if not exists(mask):
71
+ return t.mean(dim=1)
72
+
73
+ t = torch.where(mask[:, :, None], t, torch.tensor(0.0, device=t.device))
74
+ num = t.sum(dim=1)
75
+ den = mask.float().sum(dim=1)
76
+
77
+ return num / den.clamp(min=1.0)
78
+
79
+
80
+ # simple utf-8 tokenizer, since paper went character based
81
+ def list_str_to_tensor(text: list[str], padding_value=-1) -> int["b nt"]: # noqa: F722
82
+ list_tensors = [torch.tensor([*bytes(t, "UTF-8")]) for t in text] # ByT5 style
83
+ text = pad_sequence(list_tensors, padding_value=padding_value, batch_first=True)
84
+ return text
85
+
86
+
87
+ # char tokenizer, based on custom dataset's extracted .txt file
88
+ def list_str_to_idx(
89
+ text: list[str] | list[list[str]],
90
+ vocab_char_map: dict[str, int], # {char: idx}
91
+ padding_value=-1,
92
+ ) -> int["b nt"]: # noqa: F722
93
+ list_idx_tensors = [torch.tensor([vocab_char_map.get(c, 0) for c in t]) for t in text] # pinyin or char style
94
+ text = pad_sequence(list_idx_tensors, padding_value=padding_value, batch_first=True)
95
+ return text
96
+
97
+
98
+ # Get tokenizer
99
+
100
+
101
+ def get_tokenizer(dataset_name, tokenizer: str = "pinyin"):
102
+ """
103
+ tokenizer - "pinyin" does g2p for Chinese characters only, needs a .txt vocab_file
104
+ - "char" for a char-wise tokenizer, needs a .txt vocab_file
105
+ - "byte" for a UTF-8 byte tokenizer
106
+ - "custom" if you're directly passing in a path to the vocab.txt you want to use
107
+ vocab_size - if using "pinyin", covers all available pinyin types, common alphabets (including accented letters) and symbols
108
+ - if using "char", derived from the unfiltered character & symbol counts of the custom dataset
109
+ - if using "byte", set to 256 (the range of a single byte)
110
+ """
111
+ if tokenizer in ["pinyin", "char"]:
112
+ tokenizer_path = os.path.join(files("f5_tts").joinpath("../../data"), f"{dataset_name}_{tokenizer}/vocab.txt")
113
+ with open(tokenizer_path, "r", encoding="utf-8") as f:
114
+ vocab_char_map = {}
115
+ for i, char in enumerate(f):
116
+ vocab_char_map[char[:-1]] = i
117
+ vocab_size = len(vocab_char_map)
118
+ assert vocab_char_map[" "] == 0, "make sure space is of idx 0 in vocab.txt, cuz 0 is used for unknown char"
119
+
120
+ elif tokenizer == "byte":
121
+ vocab_char_map = None
122
+ vocab_size = 256
123
+
124
+ elif tokenizer == "custom":
125
+ with open(dataset_name, "r", encoding="utf-8") as f:
126
+ vocab_char_map = {}
127
+ for i, char in enumerate(f):
128
+ vocab_char_map[char[:-1]] = i
129
+ vocab_size = len(vocab_char_map)
130
+
131
+ return vocab_char_map, vocab_size
132
+
133
+
134
+ # convert char to pinyin
135
+
136
+
137
+ def convert_char_to_pinyin(text_list, polyphone=True):
138
+ final_text_list = []
139
+ god_knows_why_en_testset_contains_zh_quote = str.maketrans(
140
+ {"“": '"', "”": '"', "‘": "'", "’": "'"}
141
+ ) # in case librispeech (orig no-pc) test-clean
142
+ custom_trans = str.maketrans({";": ","}) # add custom trans here, to address oov
143
+ for text in text_list:
144
+ char_list = []
145
+ text = text.translate(god_knows_why_en_testset_contains_zh_quote)
146
+ text = text.translate(custom_trans)
147
+ for seg in jieba.cut(text):
148
+ seg_byte_len = len(bytes(seg, "UTF-8"))
149
+ if seg_byte_len == len(seg): # if pure alphabets and symbols
150
+ if char_list and seg_byte_len > 1 and char_list[-1] not in " :'\"":
151
+ char_list.append(" ")
152
+ char_list.extend(seg)
153
+ elif polyphone and seg_byte_len == 3 * len(seg): # if pure chinese characters
154
+ seg = lazy_pinyin(seg, style=Style.TONE3, tone_sandhi=True)
155
+ for c in seg:
156
+ if c not in "。,、;:?!《》【】—…":
157
+ char_list.append(" ")
158
+ char_list.append(c)
159
+ else: # if mixed chinese characters, alphabets and symbols
160
+ for c in seg:
161
+ if ord(c) < 256:
162
+ char_list.extend(c)
163
+ else:
164
+ if c not in "。,、;:?!《》【】—…":
165
+ char_list.append(" ")
166
+ char_list.extend(lazy_pinyin(c, style=Style.TONE3, tone_sandhi=True))
167
+ else: # if is zh punc
168
+ char_list.append(c)
169
+ final_text_list.append(char_list)
170
+
171
+ return final_text_list
172
+
173
+
174
+ # filter func for dirty data with many repetitions
175
+
176
+
177
+ def repetition_found(text, length=2, tolerance=10):
178
+ pattern_count = defaultdict(int)
179
+ for i in range(len(text) - length + 1):
180
+ pattern = text[i : i + length]
181
+ pattern_count[pattern] += 1
182
+ for pattern, count in pattern_count.items():
183
+ if count > tolerance:
184
+ return True
185
+ return False
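`get_tokenizer` asserts that the space character sits at index 0 because `list_str_to_idx` maps every unknown character to 0. The sketch below shows that behaviour with plain lists instead of tensors. `load_vocab` and `encode` are hypothetical stand-ins, not functions from this file, and the three-entry vocabulary is made up for illustration.

```python
def load_vocab(lines: list[str]) -> dict[str, int]:
    """Map each vocab entry to its line index (stand-in for reading vocab.txt)."""
    return {entry: i for i, entry in enumerate(lines)}


def encode(texts: list[str], vocab: dict[str, int], padding_value: int = -1) -> list[list[int]]:
    """Look up every character, fall back to index 0 for unknown ones, then right-pad."""
    rows = [[vocab.get(c, 0) for c in t] for t in texts]
    width = max((len(r) for r in rows), default=0)
    return [r + [padding_value] * (width - len(r)) for r in rows]


if __name__ == "__main__":
    vocab = load_vocab([" ", "a", "b"])
    # "?" and "!" are not in the vocab, so they encode exactly like a space
    print(encode(["ab", "a b", "a?b!"], vocab))
```

Since out-of-vocabulary characters and spaces share index 0, a vocab file whose first line is anything other than a space would silently turn unknown symbols into that token.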
src/f5_tts/scripts/count_max_epoch.py ADDED
@@ -0,0 +1,33 @@
1
+ """ADAPTIVE BATCH SIZE"""
2
+
3
+ print("Adaptive batch size: using grouping batch sampler, frames_per_gpu fixed fed in")
4
+ print(" -> least padding, gather wavs with accumulated frames in a batch\n")
5
+
6
+ # data
7
+ total_hours = 95282
8
+ mel_hop_length = 256
9
+ mel_sampling_rate = 24000
10
+
11
+ # target
12
+ wanted_max_updates = 1000000
13
+
14
+ # train params
15
+ gpus = 8
16
+ frames_per_gpu = 38400 # 8 * 38400 = 307200
17
+ grad_accum = 1
18
+
19
+ # intermediate
20
+ mini_batch_frames = frames_per_gpu * grad_accum * gpus
21
+ mini_batch_hours = mini_batch_frames * mel_hop_length / mel_sampling_rate / 3600
22
+ updates_per_epoch = total_hours / mini_batch_hours
23
+ steps_per_epoch = updates_per_epoch * grad_accum
24
+
25
+ # result
26
+ epochs = wanted_max_updates / updates_per_epoch
27
+ print(f"epochs should be set to: {epochs:.0f} ({epochs/grad_accum:.1f} x grad_accum {grad_accum})")
28
+ print(f"progress_bar should show approx. 0/{updates_per_epoch:.0f} updates")
29
+ print(f" or approx. 0/{steps_per_epoch:.0f} steps")
30
+
31
+ # others
32
+ print(f"total {total_hours:.0f} hours")
33
+ print(f"mini-batch of {mini_batch_frames:.0f} frames, {mini_batch_hours:.2f} hours per mini-batch")
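The same arithmetic can be wrapped in a function so it is reusable for other dataset sizes or GPU counts. This is a sketch under the script's own assumptions (frame-based batching, fixed hop length and sampling rate); `epochs_for_updates` is a hypothetical name, not part of the repository.

```python
def epochs_for_updates(
    total_hours: float,
    wanted_max_updates: int,
    gpus: int,
    frames_per_gpu: int,
    grad_accum: int = 1,
    mel_hop_length: int = 256,
    mel_sampling_rate: int = 24000,
) -> dict[str, float]:
    """Return hours per mini-batch, updates per epoch, and the epoch count to configure."""
    mini_batch_frames = frames_per_gpu * grad_accum * gpus
    # one mel frame covers hop_length / sampling_rate seconds of audio
    mini_batch_hours = mini_batch_frames * mel_hop_length / mel_sampling_rate / 3600
    updates_per_epoch = total_hours / mini_batch_hours
    return {
        "mini_batch_hours": mini_batch_hours,
        "updates_per_epoch": updates_per_epoch,
        "epochs": wanted_max_updates / updates_per_epoch,
    }


if __name__ == "__main__":
    stats = epochs_for_updates(95282, 1_000_000, gpus=8, frames_per_gpu=38400)
    for name, value in stats.items():
        print(f"{name}: {value:.4f}")
```

With the values from the script, a mini-batch holds roughly 0.91 hours of audio and one epoch is roughly 104,680 updates, so one million updates take about 9.55 epochs.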
src/f5_tts/scripts/count_params_gflops.py ADDED
@@ -0,0 +1,39 @@
1
+ import sys
2
+ import os
3
+
4
+ sys.path.append(os.getcwd())
5
+
6
+ from f5_tts.model import CFM, DiT
7
+
8
+ import torch
9
+ import thop
10
+
11
+
12
+ """ ~155M """
13
+ # transformer = UNetT(dim = 768, depth = 20, heads = 12, ff_mult = 4)
14
+ # transformer = UNetT(dim = 768, depth = 20, heads = 12, ff_mult = 4, text_dim = 512, conv_layers = 4)
15
+ # transformer = DiT(dim = 768, depth = 18, heads = 12, ff_mult = 2)
16
+ # transformer = DiT(dim = 768, depth = 18, heads = 12, ff_mult = 2, text_dim = 512, conv_layers = 4)
17
+ # transformer = DiT(dim = 768, depth = 18, heads = 12, ff_mult = 2, text_dim = 512, conv_layers = 4, long_skip_connection = True)
18
+ # transformer = MMDiT(dim = 512, depth = 16, heads = 16, ff_mult = 2)
19
+
20
+ """ ~335M """
21
+ # FLOPs: 622.1 G, Params: 333.2 M
22
+ # transformer = UNetT(dim = 1024, depth = 24, heads = 16, ff_mult = 4)
23
+ # FLOPs: 363.4 G, Params: 335.8 M
24
+ transformer = DiT(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
25
+
26
+
27
+ model = CFM(transformer=transformer)
28
+ target_sample_rate = 24000
29
+ n_mel_channels = 100
30
+ hop_length = 256
31
+ duration = 20
32
+ frame_length = int(duration * target_sample_rate / hop_length)
33
+ text_length = 150
34
+
35
+ flops, params = thop.profile(
36
+ model, inputs=(torch.randn(1, frame_length, n_mel_channels), torch.zeros(1, text_length, dtype=torch.long))
37
+ )
38
+ print(f"FLOPs: {flops / 1e9} G")
39
+ print(f"Params: {params / 1e6} M")
src/f5_tts/train/README.md ADDED
@@ -0,0 +1,69 @@
1
+ # Training
2
+
3
+ ## Prepare Dataset
4
+
5
+ Example data processing scripts are provided for Emilia and Wenetspeech4TTS. You can also write your own, together with a matching Dataset class in `src/f5_tts/model/dataset.py`.
6
+
7
+ ### 1. Datasets used for pretrained models
8
+ Download the corresponding dataset first, then fill in its path in the script.
9
+
10
+ ```bash
11
+ # Prepare the Emilia dataset
12
+ python src/f5_tts/train/datasets/prepare_emilia.py
13
+
14
+ # Prepare the Wenetspeech4TTS dataset
15
+ python src/f5_tts/train/datasets/prepare_wenetspeech4tts.py
16
+ ```
17
+
18
+ ### 2. Create custom dataset with metadata.csv
19
+ For usage guidance, see [discussion #57](https://github.com/SWivid/F5-TTS/discussions/57#discussioncomment-10959029).
20
+
21
+ ```bash
22
+ python src/f5_tts/train/datasets/prepare_csv_wavs.py
23
+ ```
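A minimal sketch of the layout that `prepare_csv_wavs.py` in this commit checks for: a `metadata.csv` next to a `wavs/` directory, pipe-delimited, with a header row that is skipped on read. The header names, the file names, and the `write_metadata` helper are assumptions for illustration; only the delimiter, the skipped header, and the two-column order come from the script.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(dataset_dir: Path, rows: list[tuple[str, str]]) -> Path:
    """Create <dataset_dir>/wavs and a pipe-delimited metadata.csv with one header row."""
    (dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)
    metadata = dataset_dir / "metadata.csv"
    with open(metadata, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_file", "text"])  # header names are an assumption
        writer.writerows(rows)
    return metadata


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_metadata(Path(tmp), [("wavs/a.wav", "hello there"), ("wavs/b.wav", "你好")])
        print(path.read_text(encoding="utf-8"))
```

Audio paths in the first column are resolved relative to the directory that holds `metadata.csv`, so `wavs/a.wav` points into the sibling `wavs/` folder.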
24
+
25
+ ## Training & Finetuning
26
+
27
+ Once your datasets are prepared, you can start the training process.
28
+
29
+ ### 1. Training script used for pretrained model
30
+
31
+ ```bash
32
+ # set up the accelerate config, e.g. multi-GPU DDP, fp16
33
+ # it is saved to: ~/.cache/huggingface/accelerate/default_config.yaml
34
+ accelerate config
35
+ accelerate launch src/f5_tts/train/train.py
36
+ ```
37
+
38
+ ### 2. Finetuning practice
39
+ See the finetuning discussion board: [#57](https://github.com/SWivid/F5-TTS/discussions/57).
40
+
41
+ For Gradio UI training/finetuning with `src/f5_tts/train/finetune_gradio.py`, see [#143](https://github.com/SWivid/F5-TTS/discussions/143).
42
+
43
+ ### 3. Wandb Logging
44
+
45
+ The `wandb/` directory is created under the path from which you run the training/finetuning scripts.
46
+
47
+ By default, the training script does NOT use logging (assuming you didn't manually log in using `wandb login`).
48
+
49
+ To turn on wandb logging, you can either:
50
+
51
+ 1. Log in manually with `wandb login`: learn more [here](https://docs.wandb.ai/ref/cli/wandb-login)
52
+ 2. Log in automatically by setting an environment variable: get an API key at https://wandb.ai/site/ and set the environment variable as follows:
53
+
54
+ On Mac & Linux:
55
+
56
+ ```
57
+ export WANDB_API_KEY=<YOUR WANDB API KEY>
58
+ ```
59
+
60
+ On Windows:
61
+
62
+ ```
63
+ set WANDB_API_KEY=<YOUR WANDB API KEY>
64
+ ```
65
+ If you cannot access Wandb and want to log metrics offline, set the environment variable as follows:
66
+
67
+ ```
68
+ export WANDB_MODE=offline
69
+ ```
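The same variables can also be set from Python, which avoids the shell differences between Mac/Linux and Windows. This is a sketch, assuming it runs before wandb is initialized in the same process; `configure_wandb_env` is a hypothetical helper, not part of the training scripts.

```python
import os
from typing import Optional


def configure_wandb_env(offline: bool = False, api_key: Optional[str] = None) -> None:
    """Set WANDB_* variables for the current process without overriding existing values."""
    if api_key is not None:
        os.environ.setdefault("WANDB_API_KEY", api_key)
    if offline:
        os.environ.setdefault("WANDB_MODE", "offline")


if __name__ == "__main__":
    configure_wandb_env(offline=True)
    print(os.environ.get("WANDB_MODE"))
```

`setdefault` keeps any value already exported in the shell, so an explicit `export WANDB_MODE=...` still takes precedence.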
src/f5_tts/train/datasets/prepare_csv_wavs.py ADDED
@@ -0,0 +1,140 @@
1
+ import os
2
+ import sys
3
+
4
+ sys.path.append(os.getcwd())
5
+
6
+ import argparse
7
+ import csv
8
+ import json
9
+ import shutil
10
+ from importlib.resources import files
11
+ from pathlib import Path
12
+
13
+ import torchaudio
14
+ from tqdm import tqdm
15
+ from datasets.arrow_writer import ArrowWriter
16
+
17
+ from f5_tts.model.utils import (
18
+ convert_char_to_pinyin,
19
+ )
20
+
21
+
22
+ PRETRAINED_VOCAB_PATH = files("f5_tts").joinpath("../../data/Emilia_ZH_EN_pinyin/vocab.txt")
23
+
24
+
25
+ def is_csv_wavs_format(input_dataset_dir):
26
+ fpath = Path(input_dataset_dir)
27
+ metadata = fpath / "metadata.csv"
28
+ wavs = fpath / "wavs"
29
+ return metadata.exists() and metadata.is_file() and wavs.exists() and wavs.is_dir()
30
+
31
+
32
+ def prepare_csv_wavs_dir(input_dir):
33
+ assert is_csv_wavs_format(input_dir), f"not csv_wavs format: {input_dir}"
34
+ input_dir = Path(input_dir)
35
+ metadata_path = input_dir / "metadata.csv"
36
+ audio_path_text_pairs = read_audio_text_pairs(metadata_path.as_posix())
37
+
38
+ sub_result, durations = [], []
39
+ vocab_set = set()
40
+ polyphone = True
41
+ for audio_path, text in audio_path_text_pairs:
42
+ if not Path(audio_path).exists():
43
+ print(f"audio {audio_path} not found, skipping")
44
+ continue
45
+ audio_duration = get_audio_duration(audio_path)
46
+ # assume tokenizer = "pinyin" ("pinyin" | "char")
47
+ text = convert_char_to_pinyin([text], polyphone=polyphone)[0]
48
+ sub_result.append({"audio_path": audio_path, "text": text, "duration": audio_duration})
49
+ durations.append(audio_duration)
50
+ vocab_set.update(list(text))
51
+
52
+ return sub_result, durations, vocab_set
53
+
54
+
55
+ def get_audio_duration(audio_path):
56
+ audio, sample_rate = torchaudio.load(audio_path)
57
+ # torchaudio.load returns a [channels, frames] tensor; duration is frames / sample_rate
58
+ return audio.shape[1] / sample_rate
59
+
60
+
61
+ def read_audio_text_pairs(csv_file_path):
62
+ audio_text_pairs = []
63
+
64
+ parent = Path(csv_file_path).parent
65
+ with open(csv_file_path, mode="r", newline="", encoding="utf-8-sig") as csvfile:
66
+ reader = csv.reader(csvfile, delimiter="|")
67
+ next(reader) # Skip the header row
68
+ for row in reader:
69
+ if len(row) >= 2:
70
+ audio_file = row[0].strip() # First column: audio file path
71
+ text = row[1].strip() # Second column: text
72
+ audio_file_path = parent / audio_file
73
+ audio_text_pairs.append((audio_file_path.as_posix(), text))
74
+
75
+ return audio_text_pairs
76
+
77
+
78
+ def save_prepped_dataset(out_dir, result, duration_list, text_vocab_set, is_finetune):
79
+ out_dir = Path(out_dir)
80
+ # save preprocessed dataset to disk
81
+ out_dir.mkdir(exist_ok=True, parents=True)
82
+ print(f"\nSaving to {out_dir} ...")
83
+
84
+ # dataset = Dataset.from_dict({"audio_path": audio_path_list, "text": text_list, "duration": duration_list}) # oom
85
+ # dataset.save_to_disk(f"{out_dir}/raw", max_shard_size="2GB")
86
+ raw_arrow_path = out_dir / "raw.arrow"
87
+ with ArrowWriter(path=raw_arrow_path.as_posix(), writer_batch_size=1) as writer:
88
+ for line in tqdm(result, desc="Writing to raw.arrow ..."):
89
+ writer.write(line)
90
+
91
+ # dup a json separately saving duration in case for DynamicBatchSampler ease
92
+ dur_json_path = out_dir / "duration.json"
93
+ with open(dur_json_path.as_posix(), "w", encoding="utf-8") as f:
94
+ json.dump({"duration": duration_list}, f, ensure_ascii=False)
95
+
96
+ # vocab map, i.e. tokenizer
97
+ # add alphabets and symbols (optional, if plan to ft on de/fr etc.)
98
+ # if tokenizer == "pinyin":
99
+ # text_vocab_set.update([chr(i) for i in range(32, 127)] + [chr(i) for i in range(192, 256)])
100
+ voca_out_path = out_dir / "vocab.txt"
101
+ with open(voca_out_path.as_posix(), "w", encoding="utf-8") as f:
102
+ for vocab in sorted(text_vocab_set):
103
+ f.write(vocab + "\n")
104
+
105
+ if is_finetune:
106
+ file_vocab_finetune = PRETRAINED_VOCAB_PATH.as_posix()
107
+ shutil.copy2(file_vocab_finetune, voca_out_path)
108
+ else:
109
+ with open(voca_out_path, "w", encoding="utf-8") as f:
110
+ for vocab in sorted(text_vocab_set):
111
+ f.write(vocab + "\n")
112
+
113
+ dataset_name = out_dir.stem
114
+ print(f"\nFor {dataset_name}, sample count: {len(result)}")
115
+ print(f"For {dataset_name}, vocab size is: {len(text_vocab_set)}")
116
+ print(f"For {dataset_name}, total {sum(duration_list)/3600:.2f} hours")
117
+
118
+
119
+ def prepare_and_save_set(inp_dir, out_dir, is_finetune: bool = True):
120
+ if is_finetune:
121
+ assert PRETRAINED_VOCAB_PATH.exists(), f"pretrained vocab.txt not found: {PRETRAINED_VOCAB_PATH}"
122
+ sub_result, durations, vocab_set = prepare_csv_wavs_dir(inp_dir)
123
+ save_prepped_dataset(out_dir, sub_result, durations, vocab_set, is_finetune)
124
+
125
+
126
+ def cli():
127
+ # finetune: python src/f5_tts/train/datasets/prepare_csv_wavs.py /path/to/input_dir /path/to/output_dir_pinyin
128
+ # pretrain: python src/f5_tts/train/datasets/prepare_csv_wavs.py /path/to/input_dir /path/to/output_dir_pinyin --pretrain
129
+ parser = argparse.ArgumentParser(description="Prepare and save dataset.")
130
+ parser.add_argument("inp_dir", type=str, help="Input directory containing the data.")
131
+ parser.add_argument("out_dir", type=str, help="Output directory to save the prepared data.")
132
+ parser.add_argument("--pretrain", action="store_true", help="Enable for new pretrain, otherwise is a fine-tune")
133
+
134
+ args = parser.parse_args()
135
+
136
+ prepare_and_save_set(args.inp_dir, args.out_dir, is_finetune=not args.pretrain)
137
+
138
+
139
+ if __name__ == "__main__":
140
+ cli()
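`get_audio_duration` measures each clip by loading it with torchaudio. As a cross-check that needs no audio decoding stack, the standard-library `wave` module can read the frame count and sample rate from a PCM WAV header. This is a sketch: `wav_duration_seconds` and `write_silence` are hypothetical helpers, and only uncompressed WAV files are handled.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration = frames per channel / sample rate; the channel count does not enter."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_silence(path: str, seconds: float, sample_rate: int = 8000, channels: int = 2) -> None:
    """Write 16-bit PCM silence so the example has a file to measure."""
    n_frames = int(seconds * sample_rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00" * (n_frames * channels * 2))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stereo = os.path.join(tmp, "stereo.wav")
        write_silence(stereo, seconds=1.5, channels=2)
        print(wav_duration_seconds(stereo))
```

A mono and a stereo file of the same length report the same duration, because a frame already groups one sample per channel.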
src/f5_tts/train/datasets/prepare_emilia.py ADDED
@@ -0,0 +1,230 @@
1
+ # Emilia Dataset: https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07
2
+ # if you use the updated version, i.e. WebDataset, feel free to modify this script or draft your own
3
+
4
+ # generate audio text map for Emilia ZH & EN
5
+ # evaluate for vocab size
6
+
7
+ import os
8
+ import sys
9
+
10
+ sys.path.append(os.getcwd())
11
+
12
+ import json
13
+ from concurrent.futures import ProcessPoolExecutor
14
+ from importlib.resources import files
15
+ from pathlib import Path
16
+ from tqdm import tqdm
17
+
18
+ from datasets.arrow_writer import ArrowWriter
19
+
20
+ from f5_tts.model.utils import (
21
+ repetition_found,
22
+ convert_char_to_pinyin,
23
+ )
24
+
25
+
26
+ out_zh = {
27
+ "ZH_B00041_S06226",
28
+ "ZH_B00042_S09204",
29
+ "ZH_B00065_S09430",
30
+ "ZH_B00065_S09431",
31
+ "ZH_B00066_S09327",
32
+ "ZH_B00066_S09328",
33
+ }
34
+ zh_filters = ["い", "て"]
35
+ # seems synthesized audios, or heavily code-switched
36
+ out_en = {
37
+ "EN_B00013_S00913",
38
+ "EN_B00042_S00120",
39
+ "EN_B00055_S04111",
40
+ "EN_B00061_S00693",
41
+ "EN_B00061_S01494",
42
+ "EN_B00061_S03375",
43
+ "EN_B00059_S00092",
44
+ "EN_B00111_S04300",
45
+ "EN_B00100_S03759",
46
+ "EN_B00087_S03811",
47
+ "EN_B00059_S00950",
48
+ "EN_B00089_S00946",
49
+ "EN_B00078_S05127",
50
+ "EN_B00070_S04089",
51
+ "EN_B00074_S09659",
52
+ "EN_B00061_S06983",
53
+ "EN_B00061_S07060",
54
+ "EN_B00059_S08397",
55
+ "EN_B00082_S06192",
56
+ "EN_B00091_S01238",
57
+ "EN_B00089_S07349",
58
+ "EN_B00070_S04343",
59
+ "EN_B00061_S02400",
60
+ "EN_B00076_S01262",
61
+ "EN_B00068_S06467",
62
+ "EN_B00076_S02943",
63
+ "EN_B00064_S05954",
64
+ "EN_B00061_S05386",
65
+ "EN_B00066_S06544",
66
+ "EN_B00076_S06944",
67
+ "EN_B00072_S08620",
68
+ "EN_B00076_S07135",
69
+ "EN_B00076_S09127",
70
+ "EN_B00065_S00497",
71
+ "EN_B00059_S06227",
72
+ "EN_B00063_S02859",
73
+ "EN_B00075_S01547",
74
+ "EN_B00061_S08286",
75
+ "EN_B00079_S02901",
76
+ "EN_B00092_S03643",
77
+ "EN_B00096_S08653",
78
+ "EN_B00063_S04297",
79
+ "EN_B00063_S04614",
80
+ "EN_B00079_S04698",
81
+ "EN_B00104_S01666",
82
+ "EN_B00061_S09504",
83
+ "EN_B00061_S09694",
84
+ "EN_B00065_S05444",
85
+ "EN_B00063_S06860",
86
+ "EN_B00065_S05725",
87
+ "EN_B00069_S07628",
88
+ "EN_B00083_S03875",
89
+ "EN_B00071_S07665",
90
+ "EN_B00071_S07665",
91
+ "EN_B00062_S04187",
92
+ "EN_B00065_S09873",
93
+ "EN_B00065_S09922",
94
+ "EN_B00084_S02463",
95
+ "EN_B00067_S05066",
96
+ "EN_B00106_S08060",
97
+ "EN_B00073_S06399",
98
+ "EN_B00073_S09236",
99
+ "EN_B00087_S00432",
100
+ "EN_B00085_S05618",
101
+ "EN_B00064_S01262",
102
+ "EN_B00072_S01739",
103
+ "EN_B00059_S03913",
104
+ "EN_B00069_S04036",
105
+ "EN_B00067_S05623",
106
+ "EN_B00060_S05389",
107
+ "EN_B00060_S07290",
108
+ "EN_B00062_S08995",
109
+ }
110
+ en_filters = ["ا", "い", "て"]
111
+
112
+
113
+ def deal_with_audio_dir(audio_dir):
114
+ audio_jsonl = audio_dir.with_suffix(".jsonl")
115
+ sub_result, durations = [], []
116
+ vocab_set = set()
117
+ bad_case_zh = 0
118
+ bad_case_en = 0
119
+ with open(audio_jsonl, "r") as f:
120
+ lines = f.readlines()
121
+ for line in tqdm(lines, desc=f"{audio_jsonl.stem}"):
122
+ obj = json.loads(line)
123
+ text = obj["text"]
124
+ if obj["language"] == "zh":
125
+ if obj["wav"].split("/")[1] in out_zh or any(f in text for f in zh_filters) or repetition_found(text):
126
+ bad_case_zh += 1
127
+ continue
128
+ else:
129
+ text = text.translate(
130
+ str.maketrans({",": ",", "!": "!", "?": "?"})
131
+ ) # not "。" cuz much code-switched
132
+ if obj["language"] == "en":
133
+ if (
134
+ obj["wav"].split("/")[1] in out_en
135
+ or any(f in text for f in en_filters)
136
+ or repetition_found(text, length=4)
137
+ ):
138
+ bad_case_en += 1
139
+ continue
140
+ if tokenizer == "pinyin":
141
+ text = convert_char_to_pinyin([text], polyphone=polyphone)[0]
142
+ duration = obj["duration"]
143
+ sub_result.append({"audio_path": str(audio_dir.parent / obj["wav"]), "text": text, "duration": duration})
144
+ durations.append(duration)
145
+ vocab_set.update(list(text))
146
+ return sub_result, durations, vocab_set, bad_case_zh, bad_case_en
147
+
148
+
149
+ def main():
150
+ assert tokenizer in ["pinyin", "char"]
151
+ result = []
152
+ duration_list = []
153
+ text_vocab_set = set()
154
+ total_bad_case_zh = 0
155
+ total_bad_case_en = 0
156
+
157
+ # process raw data
158
+ executor = ProcessPoolExecutor(max_workers=max_workers)
159
+ futures = []
160
+ for lang in langs:
161
+ dataset_path = Path(os.path.join(dataset_dir, lang))
162
+ [
163
+ futures.append(executor.submit(deal_with_audio_dir, audio_dir))
164
+ for audio_dir in dataset_path.iterdir()
165
+ if audio_dir.is_dir()
166
+ ]
167
+ for future in tqdm(futures, total=len(futures)):
168
+ sub_result, durations, vocab_set, bad_case_zh, bad_case_en = future.result()
169
+ result.extend(sub_result)
170
+ duration_list.extend(durations)
171
+ text_vocab_set.update(vocab_set)
172
+ total_bad_case_zh += bad_case_zh
173
+ total_bad_case_en += bad_case_en
174
+ executor.shutdown()
175
+
176
+ # save preprocessed dataset to disk
177
+ if not os.path.exists(f"{save_dir}"):
178
+ os.makedirs(f"{save_dir}")
179
+ print(f"\nSaving to {save_dir} ...")
180
+
181
+ # dataset = Dataset.from_dict({"audio_path": audio_path_list, "text": text_list, "duration": duration_list}) # oom
182
+ # dataset.save_to_disk(f"{save_dir}/raw", max_shard_size="2GB")
183
+ with ArrowWriter(path=f"{save_dir}/raw.arrow") as writer:
184
+ for line in tqdm(result, desc="Writing to raw.arrow ..."):
185
+ writer.write(line)
186
+
187
+ # dup a json separately saving duration in case for DynamicBatchSampler ease
188
+ with open(f"{save_dir}/duration.json", "w", encoding="utf-8") as f:
189
+ json.dump({"duration": duration_list}, f, ensure_ascii=False)
190
+
191
+ # vocab map, i.e. tokenizer
192
+ # add alphabets and symbols (optional, if plan to ft on de/fr etc.)
193
+ # if tokenizer == "pinyin":
194
+ # text_vocab_set.update([chr(i) for i in range(32, 127)] + [chr(i) for i in range(192, 256)])
195
+ with open(f"{save_dir}/vocab.txt", "w", encoding="utf-8") as f:
196
+ for vocab in sorted(text_vocab_set):
197
+ f.write(vocab + "\n")
198
+
199
+ print(f"\nFor {dataset_name}, sample count: {len(result)}")
200
+ print(f"For {dataset_name}, vocab size is: {len(text_vocab_set)}")
201
+ print(f"For {dataset_name}, total {sum(duration_list)/3600:.2f} hours")
202
+ if "ZH" in langs:
203
+ print(f"Bad zh transcription case: {total_bad_case_zh}")
204
+ if "EN" in langs:
205
+ print(f"Bad en transcription case: {total_bad_case_en}\n")
206
+
207
+
208
+ if __name__ == "__main__":
209
+ max_workers = 32
210
+
211
+ tokenizer = "pinyin" # "pinyin" | "char"
212
+ polyphone = True
213
+
214
+ langs = ["ZH", "EN"]
215
+ dataset_dir = "<SOME_PATH>/Emilia_Dataset/raw"
216
+ dataset_name = f"Emilia_{'_'.join(langs)}_{tokenizer}"
217
+ save_dir = str(files("f5_tts").joinpath("../../")) + f"/data/{dataset_name}"
218
+ print(f"\nPrepare for {dataset_name}, will save to {save_dir}\n")
219
+
220
+ main()
221
+
222
+ # Emilia ZH & EN
223
+ # samples count 37837916 (after removal)
224
+ # pinyin vocab size 2543 (polyphone)
225
+ # total duration 95281.87 (hours)
226
+ # bad zh asr cnt 230435 (samples)
227
+ # bad en asr cnt 37217 (samples)
228
+
229
+ # vocab size may differ slightly due to jieba tokenization and pypinyin (e.g. how polyphones are handled)
230
+ # if you use a pretrained model, make sure the vocab.txt is the same
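The per-sample filtering for Chinese transcripts in `deal_with_audio_dir` can be sketched in isolation: drop a sample if its ID is blocklisted, if it contains a filtered character, or if it repeats a short pattern too often; otherwise normalize punctuation. `clean_zh` and `FULLWIDTH_TO_ASCII` are hypothetical names, and the sketch assumes the translate step is meant to map full-width punctuation to ASCII.

```python
from collections import defaultdict

# assumed intent: full-width comma / exclamation / question mark -> ASCII
FULLWIDTH_TO_ASCII = str.maketrans({",": ",", "!": "!", "?": "?"})


def repetition_found(text: str, length: int = 2, tolerance: int = 10) -> bool:
    """True if any substring of the given length occurs more than `tolerance` times."""
    counts = defaultdict(int)
    for i in range(len(text) - length + 1):
        counts[text[i : i + length]] += 1
    return any(c > tolerance for c in counts.values())


def clean_zh(utt_id, text, blocked_ids, blocked_chars):
    """Return the normalized text, or None when the sample should be dropped."""
    if utt_id in blocked_ids or any(ch in text for ch in blocked_chars) or repetition_found(text):
        return None
    return text.translate(FULLWIDTH_TO_ASCII)


if __name__ == "__main__":
    print(clean_zh("ZH_X", "你好,世界!", set(), ["い", "て"]))
    print(clean_zh("ZH_X", "哈" * 24, set(), ["い", "て"]))
```

Returning `None` for rejected samples keeps the bad-case counter in the caller a simple `if result is None` check.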
src/f5_tts/train/datasets/prepare_wenetspeech4tts.py ADDED
@@ -0,0 +1,125 @@
+ # generate audio text map for WenetSpeech4TTS
+ # evaluate for vocab size
+
+ import os
+ import sys
+
+ sys.path.append(os.getcwd())
+
+ import json
+ from concurrent.futures import ProcessPoolExecutor
+ from importlib.resources import files
+ from tqdm import tqdm
+
+ import torchaudio
+ from datasets import Dataset
+
+ from f5_tts.model.utils import convert_char_to_pinyin
+
+
+ def deal_with_sub_path_files(dataset_path, sub_path):
+     print(f"Dealing with: {sub_path}")
+
+     text_dir = os.path.join(dataset_path, sub_path, "txts")
+     audio_dir = os.path.join(dataset_path, sub_path, "wavs")
+     text_files = os.listdir(text_dir)
+
+     audio_paths, texts, durations = [], [], []
+     for text_file in tqdm(text_files):
+         with open(os.path.join(text_dir, text_file), "r", encoding="utf-8") as file:
+             first_line = file.readline().split("\t")
+             audio_nm = first_line[0]
+             audio_path = os.path.join(audio_dir, audio_nm + ".wav")
+             text = first_line[1].strip()
+
+             audio_paths.append(audio_path)
+
+             if tokenizer == "pinyin":
+                 texts.extend(convert_char_to_pinyin([text], polyphone=polyphone))
+             elif tokenizer == "char":
+                 texts.append(text)
+
+             audio, sample_rate = torchaudio.load(audio_path)
+             durations.append(audio.shape[-1] / sample_rate)
+
+     return audio_paths, texts, durations
+
+
+ def main():
+     assert tokenizer in ["pinyin", "char"]
+
+     audio_path_list, text_list, duration_list = [], [], []
+
+     executor = ProcessPoolExecutor(max_workers=max_workers)
+     futures = []
+     for dataset_path in dataset_paths:
+         sub_items = os.listdir(dataset_path)
+         sub_paths = [item for item in sub_items if os.path.isdir(os.path.join(dataset_path, item))]
+         for sub_path in sub_paths:
+             futures.append(executor.submit(deal_with_sub_path_files, dataset_path, sub_path))
+     for future in tqdm(futures, total=len(futures)):
+         audio_paths, texts, durations = future.result()
+         audio_path_list.extend(audio_paths)
+         text_list.extend(texts)
+         duration_list.extend(durations)
+     executor.shutdown()
+
+     if not os.path.exists("data"):
+         os.makedirs("data")
+
+     print(f"\nSaving to {save_dir} ...")
+     dataset = Dataset.from_dict({"audio_path": audio_path_list, "text": text_list, "duration": duration_list})
+     dataset.save_to_disk(f"{save_dir}/raw", max_shard_size="2GB")  # arrow format
+
+     with open(f"{save_dir}/duration.json", "w", encoding="utf-8") as f:
+         json.dump(
+             {"duration": duration_list}, f, ensure_ascii=False
+         )  # also save durations as a separate json, for ease of use with DynamicBatchSampler
+
+     print("\nEvaluating vocab size (all characters and symbols / all phonemes) ...")
+     text_vocab_set = set()
+     for text in tqdm(text_list):
+         text_vocab_set.update(list(text))
+
+     # add alphabets and symbols (optional, if planning to fine-tune on de/fr etc.)
+     if tokenizer == "pinyin":
+         text_vocab_set.update([chr(i) for i in range(32, 127)] + [chr(i) for i in range(192, 256)])
+
+     # write as UTF-8 explicitly so the vocab does not depend on the platform default encoding
+     with open(f"{save_dir}/vocab.txt", "w", encoding="utf-8") as f:
+         for vocab in sorted(text_vocab_set):
+             f.write(vocab + "\n")
+     print(f"\nFor {dataset_name}, sample count: {len(text_list)}")
+     print(f"For {dataset_name}, vocab size is: {len(text_vocab_set)}\n")
+
+
+ if __name__ == "__main__":
+     max_workers = 32
+
+     tokenizer = "pinyin"  # "pinyin" | "char"
+     polyphone = True
+     dataset_choice = 1  # 1: Premium, 2: Standard, 3: Basic
+
+     dataset_name = (
+         ["WenetSpeech4TTS_Premium", "WenetSpeech4TTS_Standard", "WenetSpeech4TTS_Basic"][dataset_choice - 1]
+         + "_"
+         + tokenizer
+     )
+     dataset_paths = [
+         "<SOME_PATH>/WenetSpeech4TTS/Basic",
+         "<SOME_PATH>/WenetSpeech4TTS/Standard",
+         "<SOME_PATH>/WenetSpeech4TTS/Premium",
+     ][-dataset_choice:]
+     save_dir = str(files("f5_tts").joinpath("../../")) + f"/data/{dataset_name}"
+     print(f"\nChoose Dataset: {dataset_name}, will save to {save_dir}\n")
+
+     main()
+
+     # Results (if adding alphabets with accents and symbols):
+     # WenetSpeech4TTS       Basic     Standard     Premium
+     # samples count       3932473      1941220      407494
+     # pinyin vocab size      1349         1348        1344   (no polyphone)
+     #                           -            -        1459   (polyphone)
+     # char   vocab size      5264         5219        5042
+
+     # vocab size may differ slightly depending on the jieba tokenizer and pypinyin (e.g. how polyphones are handled)
+     # please be careful if using a pretrained model: make sure the vocab.txt is the same
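
The vocabulary step in `main()` above comes down to collecting every distinct symbol in the prepared texts and, for the pinyin tokenizer, adding printable ASCII plus the accented Latin-1 range. A minimal sketch of that idea follows; `build_vocab` is a hypothetical helper written for illustration and is not part of the repository.

```python
def build_vocab(texts, add_latin=True):
    """Return the sorted list of distinct symbols found in `texts`.

    Hypothetical helper: each item of `texts` may be a string (char tokenizer)
    or a list of tokens (pinyin tokenizer); both are iterated symbol by symbol.
    """
    vocab = set()
    for text in texts:
        vocab.update(text)
    if add_latin:
        # printable ASCII (code points 32..126) plus accented Latin-1 letters (192..255)
        vocab.update(chr(i) for i in range(32, 127))
        vocab.update(chr(i) for i in range(192, 256))
    return sorted(vocab)


# Usage: two short texts, without the extra Latin symbols
small_vocab = build_vocab(["ab", "bc"], add_latin=False)
```

Sorting the set keeps the file order deterministic, which matters because a token's line number in `vocab.txt` becomes its index; this is why the script warns that a pretrained model needs the identical `vocab.txt`.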
src/f5_tts/train/finetune_cli.py ADDED
@@ -0,0 +1,145 @@
+ import argparse
+ import os
+ import shutil
+
+ from cached_path import cached_path
+ from f5_tts.model import CFM, UNetT, DiT, Trainer
+ from f5_tts.model.utils import get_tokenizer
+ from f5_tts.model.dataset import load_dataset
+
+
+ # -------------------------- Dataset Settings --------------------------- #
+ target_sample_rate = 24000
+ n_mel_channels = 100
+ hop_length = 256
+
+
+ # -------------------------- Argument Parsing --------------------------- #
+ def parse_args():
+     # batch_size_per_gpu = 1000 setting for gpu 8GB
+     # batch_size_per_gpu = 1600 setting for gpu 12GB
+     # batch_size_per_gpu = 2000 setting for gpu 16GB
+     # batch_size_per_gpu = 3200 setting for gpu 24GB
+
+     # num_warmup_updates = 300 for 5000 samples (about 10 hours)
+
+     # adjust save_per_updates and last_per_steps to suit your needs
+
+     parser = argparse.ArgumentParser(description="Train CFM Model")
+
+     parser.add_argument(
+         "--exp_name", type=str, default="F5TTS_Base", choices=["F5TTS_Base", "E2TTS_Base"], help="Experiment name"
+     )
+     parser.add_argument("--dataset_name", type=str, default="Emilia_ZH_EN", help="Name of the dataset to use")
+     parser.add_argument("--learning_rate", type=float, default=1e-5, help="Learning rate for training")
+     parser.add_argument("--batch_size_per_gpu", type=int, default=3200, help="Batch size per GPU")
+     parser.add_argument(
+         "--batch_size_type", type=str, default="frame", choices=["frame", "sample"], help="Batch size type"
+     )
+     parser.add_argument("--max_samples", type=int, default=64, help="Max sequences per batch")
+     parser.add_argument("--grad_accumulation_steps", type=int, default=1, help="Gradient accumulation steps")
+     parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max gradient norm for clipping")
+     parser.add_argument("--epochs", type=int, default=10, help="Number of training epochs")
+     parser.add_argument("--num_warmup_updates", type=int, default=300, help="Warmup steps")
+     parser.add_argument("--save_per_updates", type=int, default=10000, help="Save checkpoint every X steps")
+     parser.add_argument("--last_per_steps", type=int, default=50000, help="Save last checkpoint every X steps")
+     parser.add_argument("--finetune", type=bool, default=True, help="Use Finetune")
+     parser.add_argument("--pretrain", type=str, default=None, help="Use pretrain model for finetune")
+     parser.add_argument(
+         "--tokenizer", type=str, default="pinyin", choices=["pinyin", "char", "custom"], help="Tokenizer type"
+     )
+     parser.add_argument(
+         "--tokenizer_path",
+         type=str,
+         default=None,
+         help="Path to custom tokenizer vocab file (only used if tokenizer = 'custom')",
+     )
+
+     return parser.parse_args()
+
+
+ # -------------------------- Training Settings -------------------------- #
+
+
+ def main():
+     args = parse_args()
+
+     # Model parameters based on experiment name
+     if args.exp_name == "F5TTS_Base":
+         wandb_resume_id = None
+         model_cls = DiT
+         model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+         if args.finetune:
+             if args.pretrain is None:
+                 ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.pt"))
+             else:
+                 ckpt_path = args.pretrain
+     elif args.exp_name == "E2TTS_Base":
+         wandb_resume_id = None
+         model_cls = UNetT
+         model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
+         if args.finetune:
+             if args.pretrain is None:
+                 ckpt_path = str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.pt"))
+             else:
+                 ckpt_path = args.pretrain
+
+     if args.finetune:
+         path_ckpt = os.path.join("ckpts", args.dataset_name)
+         if not os.path.isdir(path_ckpt):
+             os.makedirs(path_ckpt, exist_ok=True)
+             shutil.copy2(ckpt_path, os.path.join(path_ckpt, os.path.basename(ckpt_path)))
+
+     checkpoint_path = os.path.join("ckpts", args.dataset_name)
+
+     # Use the tokenizer and tokenizer_path provided in the command line arguments
+     tokenizer = args.tokenizer
+     if tokenizer == "custom":
+         if not args.tokenizer_path:
+             raise ValueError("Custom tokenizer selected, but no tokenizer_path provided.")
+         tokenizer_path = args.tokenizer_path
+     else:
+         tokenizer_path = args.dataset_name
+
+     vocab_char_map, vocab_size = get_tokenizer(tokenizer_path, tokenizer)
+
+     mel_spec_kwargs = dict(
+         target_sample_rate=target_sample_rate,
+         n_mel_channels=n_mel_channels,
+         hop_length=hop_length,
+     )
+
+     model = CFM(
+         transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
+         mel_spec_kwargs=mel_spec_kwargs,
+         vocab_char_map=vocab_char_map,
+     )
+
+     trainer = Trainer(
+         model,
+         args.epochs,
+         args.learning_rate,
+         num_warmup_updates=args.num_warmup_updates,
+         save_per_updates=args.save_per_updates,
+         checkpoint_path=checkpoint_path,
+         batch_size=args.batch_size_per_gpu,
+         batch_size_type=args.batch_size_type,
+         max_samples=args.max_samples,
+         grad_accumulation_steps=args.grad_accumulation_steps,
+         max_grad_norm=args.max_grad_norm,
+         wandb_project=args.dataset_name,
+         wandb_run_name=args.exp_name,
+         wandb_resume_id=wandb_resume_id,
+         last_per_steps=args.last_per_steps,
+     )
+
+     train_dataset = load_dataset(args.dataset_name, tokenizer, mel_spec_kwargs=mel_spec_kwargs)
+
+     trainer.train(
+         train_dataset,
+         resumable_with_seed=666,  # seed for shuffling dataset
+     )
+
+
+ if __name__ == "__main__":
+     main()
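
Boolean command-line flags such as `--finetune` are easy to get wrong with `argparse`: a converter of `type=bool` calls `bool()` on the raw string, and every non-empty string (including `"False"`) is truthy. The sketch below contrasts that behaviour with an explicit parser. `str2bool` is a hypothetical helper for illustration, not something the repository defines.

```python
import argparse


def str2bool(value):
    """Parse common spellings of true/false (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes", "y"):
        return True
    if lowered in ("false", "0", "no", "n"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# Parser using the explicit converter
strict_parser = argparse.ArgumentParser()
strict_parser.add_argument("--finetune", type=str2bool, default=True)

# Parser using type=bool, for comparison
naive_parser = argparse.ArgumentParser()
naive_parser.add_argument("--finetune", type=bool, default=True)

strict_value = strict_parser.parse_args(["--finetune", "False"]).finetune
naive_value = naive_parser.parse_args(["--finetune", "False"]).finetune
```

With `type=bool`, the only way to obtain `False` from the command line is to pass an empty string, so callers that want to disable fine-tuning would need a converter like the one above or a `store_true`/`store_false` style flag.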
src/f5_tts/train/finetune_gradio.py ADDED
@@ -0,0 +1,1223 @@
+ import gc
+ import json
+ import os
+ import platform
+ import psutil
+ import random
+ import signal
+ import shutil
+ import subprocess
+ import sys
+ import tempfile
+ import time
+ from glob import glob
+
+ import click
+ import gradio as gr
+ import librosa
+ import numpy as np
+ import torch
+ import torchaudio
+ from datasets import Dataset as Dataset_
+ from datasets.arrow_writer import ArrowWriter
+ from safetensors.torch import save_file
+ from scipy.io import wavfile
+ from transformers import pipeline
+
+ from f5_tts.api import F5TTS
+ from f5_tts.model.utils import convert_char_to_pinyin
+
+
+ training_process = None
+ system = platform.system()
+ python_executable = sys.executable or "python"
+ tts_api = None
+ last_checkpoint = ""
+ last_device = ""
+
+ path_data = "data"
+
+ device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+
+ pipe = None
+
+
+ # Load metadata
+ def get_audio_duration(audio_path):
+     """Calculate the duration of an audio file in seconds."""
+     audio, sample_rate = torchaudio.load(audio_path)
+     # audio has shape (channels, frames), so frames / sample_rate is the duration
+     # regardless of the channel count
+     return audio.shape[1] / sample_rate
+
+
+ def clear_text(text):
+     """Clean and prepare text by lowering the case and stripping whitespace."""
+     return text.lower().strip()
+
+
+ def get_rms(
+     y,
+     frame_length=2048,
+     hop_length=512,
+     pad_mode="constant",
+ ):  # https://github.com/RVC-Boss/GPT-SoVITS/blob/main/tools/slicer2.py
+     padding = (int(frame_length // 2), int(frame_length // 2))
+     y = np.pad(y, padding, mode=pad_mode)
+
+     axis = -1
+     # put our new within-frame axis at the end for now
+     out_strides = y.strides + tuple([y.strides[axis]])
+     # Reduce the shape on the framing axis
+     x_shape_trimmed = list(y.shape)
+     x_shape_trimmed[axis] -= frame_length - 1
+     out_shape = tuple(x_shape_trimmed) + tuple([frame_length])
+     xw = np.lib.stride_tricks.as_strided(y, shape=out_shape, strides=out_strides)
+     if axis < 0:
+         target_axis = axis - 1
+     else:
+         target_axis = axis + 1
+     xw = np.moveaxis(xw, -1, target_axis)
+     # Downsample along the target axis
+     slices = [slice(None)] * xw.ndim
+     slices[axis] = slice(0, None, hop_length)
+     x = xw[tuple(slices)]
+
+     # Calculate power
+     power = np.mean(np.abs(x) ** 2, axis=-2, keepdims=True)
+
+     return np.sqrt(power)
+
+
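
`get_rms` above computes one root-mean-square value per analysis frame: the signal is zero-padded by half a frame on each side, cut into overlapping windows of `frame_length` samples that start every `hop_length` samples, and each window is reduced to the square root of its mean power. The strided-array code does this without copying; the same arithmetic can be sketched in plain Python for a 1-D signal. `frame_rms` is a hypothetical name used only for this illustration.

```python
import math


def frame_rms(samples, frame_length, hop_length):
    """Frame-wise RMS of a 1-D sequence, using constant (zero) padding.

    Illustrative re-statement of the framing arithmetic, not the repository code.
    """
    half = frame_length // 2
    padded = [0.0] * half + [float(s) for s in samples] + [0.0] * half
    values = []
    # windows start at 0, hop_length, 2 * hop_length, ... while a full frame still fits
    for start in range(0, len(padded) - frame_length + 1, hop_length):
        frame = padded[start : start + frame_length]
        values.append(math.sqrt(sum(x * x for x in frame) / frame_length))
    return values


# Usage: eight samples of amplitude 1, frames of 4 samples, hop of 2
rms_values = frame_rms([1.0] * 8, frame_length=4, hop_length=2)
```

The first and last frames overlap the zero padding, so their RMS is lower than that of the interior frames even though the signal itself is constant.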
+ class Slicer:  # https://github.com/RVC-Boss/GPT-SoVITS/blob/main/tools/slicer2.py
+     def __init__(
+         self,
+         sr: int,
+         threshold: float = -40.0,
+         min_length: int = 2000,
+         min_interval: int = 300,
+         hop_size: int = 20,
+         max_sil_kept: int = 2000,
+     ):
+         if not min_length >= min_interval >= hop_size:
+             raise ValueError("The following condition must be satisfied: min_length >= min_interval >= hop_size")
+         if not max_sil_kept >= hop_size:
+             raise ValueError("The following condition must be satisfied: max_sil_kept >= hop_size")
+         min_interval = sr * min_interval / 1000
+         self.threshold = 10 ** (threshold / 20.0)
+         self.hop_size = round(sr * hop_size / 1000)
+         self.win_size = min(round(min_interval), 4 * self.hop_size)
+         self.min_length = round(sr * min_length / 1000 / self.hop_size)
+         self.min_interval = round(min_interval / self.hop_size)
+         self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size)
+
+     def _apply_slice(self, waveform, begin, end):
+         if len(waveform.shape) > 1:
+             return waveform[:, begin * self.hop_size : min(waveform.shape[1], end * self.hop_size)]
+         else:
+             return waveform[begin * self.hop_size : min(waveform.shape[0], end * self.hop_size)]
+
+     # @timeit
+     def slice(self, waveform):
+         if len(waveform.shape) > 1:
+             samples = waveform.mean(axis=0)
+         else:
+             samples = waveform
+         if samples.shape[0] <= self.min_length:
+             # return the same [audio, start, end] shape as every other path,
+             # since callers unpack three values per chunk
+             return [[waveform, 0, int(samples.shape[0])]]
+         rms_list = get_rms(y=samples, frame_length=self.win_size, hop_length=self.hop_size).squeeze(0)
+         sil_tags = []
+         silence_start = None
+         clip_start = 0
+         for i, rms in enumerate(rms_list):
+             # Keep looping while frame is silent.
+             if rms < self.threshold:
+                 # Record start of silent frames.
+                 if silence_start is None:
+                     silence_start = i
+                 continue
+             # Keep looping while frame is not silent and silence start has not been recorded.
+             if silence_start is None:
+                 continue
+             # Clear recorded silence start if interval is not enough or clip is too short
+             is_leading_silence = silence_start == 0 and i > self.max_sil_kept
+             need_slice_middle = i - silence_start >= self.min_interval and i - clip_start >= self.min_length
+             if not is_leading_silence and not need_slice_middle:
+                 silence_start = None
+                 continue
+             # Need slicing. Record the range of silent frames to be removed.
+             if i - silence_start <= self.max_sil_kept:
+                 pos = rms_list[silence_start : i + 1].argmin() + silence_start
+                 if silence_start == 0:
+                     sil_tags.append((0, pos))
+                 else:
+                     sil_tags.append((pos, pos))
+                 clip_start = pos
+             elif i - silence_start <= self.max_sil_kept * 2:
+                 pos = rms_list[i - self.max_sil_kept : silence_start + self.max_sil_kept + 1].argmin()
+                 pos += i - self.max_sil_kept
+                 pos_l = rms_list[silence_start : silence_start + self.max_sil_kept + 1].argmin() + silence_start
+                 pos_r = rms_list[i - self.max_sil_kept : i + 1].argmin() + i - self.max_sil_kept
+                 if silence_start == 0:
+                     sil_tags.append((0, pos_r))
+                     clip_start = pos_r
+                 else:
+                     sil_tags.append((min(pos_l, pos), max(pos_r, pos)))
+                     clip_start = max(pos_r, pos)
+             else:
+                 pos_l = rms_list[silence_start : silence_start + self.max_sil_kept + 1].argmin() + silence_start
+                 pos_r = rms_list[i - self.max_sil_kept : i + 1].argmin() + i - self.max_sil_kept
+                 if silence_start == 0:
+                     sil_tags.append((0, pos_r))
+                 else:
+                     sil_tags.append((pos_l, pos_r))
+                 clip_start = pos_r
+             silence_start = None
+         # Deal with trailing silence.
+         total_frames = rms_list.shape[0]
+         if silence_start is not None and total_frames - silence_start >= self.min_interval:
+             silence_end = min(total_frames, silence_start + self.max_sil_kept)
+             pos = rms_list[silence_start : silence_end + 1].argmin() + silence_start
+             sil_tags.append((pos, total_frames + 1))
+         # Apply and return slices.
+         # Each chunk is [audio, start, end], with start and end given in samples
+         if len(sil_tags) == 0:
+             return [[waveform, 0, int(total_frames * self.hop_size)]]
+         else:
+             chunks = []
+             if sil_tags[0][0] > 0:
+                 chunks.append([self._apply_slice(waveform, 0, sil_tags[0][0]), 0, int(sil_tags[0][0] * self.hop_size)])
+             for i in range(len(sil_tags) - 1):
+                 chunks.append(
+                     [
+                         self._apply_slice(waveform, sil_tags[i][1], sil_tags[i + 1][0]),
+                         int(sil_tags[i][1] * self.hop_size),
+                         int(sil_tags[i + 1][0] * self.hop_size),
+                     ]
+                 )
+             if sil_tags[-1][1] < total_frames:
+                 chunks.append(
+                     [
+                         self._apply_slice(waveform, sil_tags[-1][1], total_frames),
+                         int(sil_tags[-1][1] * self.hop_size),
+                         int(total_frames * self.hop_size),
+                     ]
+                 )
+             return chunks
+
+
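
The `Slicer` constructor above takes its settings in human units (decibels and milliseconds) and converts them once into the units the loop works in: a linear amplitude threshold and counts of RMS frames. The two conversions can be sketched on their own; `db_to_amplitude` and `ms_to_frames` are hypothetical names for this illustration.

```python
def db_to_amplitude(db):
    """Convert a level in dB to a linear amplitude ratio (20 dB per factor of 10)."""
    return 10 ** (db / 20.0)


def ms_to_frames(duration_ms, sample_rate, hop_ms):
    """Convert a duration in milliseconds to a number of hop-sized frames."""
    hop_size = round(sample_rate * hop_ms / 1000)  # samples per hop
    return round(sample_rate * duration_ms / 1000 / hop_size)


# Usage with the defaults seen above at a 24 kHz sample rate
threshold = db_to_amplitude(-40.0)          # silence threshold as an amplitude
min_length = ms_to_frames(2000, 24000, 20)  # minimum clip length in frames
min_interval = ms_to_frames(300, 24000, 20)  # minimum silence length in frames
```

At 24 kHz a 20 ms hop is 480 samples, so every frame index in `slice` maps back to a sample offset by multiplying with `hop_size`, which is what `_apply_slice` does.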
+ # terminal
+ def terminate_process_tree(pid, including_parent=True):
+     try:
+         parent = psutil.Process(pid)
+     except psutil.NoSuchProcess:
+         # Process already terminated
+         return
+
+     children = parent.children(recursive=True)
+     for child in children:
+         try:
+             os.kill(child.pid, signal.SIGTERM)  # or signal.SIGKILL
+         except OSError:
+             pass
+     if including_parent:
+         try:
+             os.kill(parent.pid, signal.SIGTERM)  # or signal.SIGKILL
+         except OSError:
+             pass
+
+
+ def terminate_process(pid):
+     if system == "Windows":
+         cmd = f"taskkill /t /f /pid {pid}"
+         os.system(cmd)
+     else:
+         terminate_process_tree(pid)
+
+
+ def start_training(
+     dataset_name="",
+     exp_name="F5TTS_Base",
+     learning_rate=1e-4,
+     batch_size_per_gpu=400,
+     batch_size_type="frame",
+     max_samples=64,
+     grad_accumulation_steps=1,
+     max_grad_norm=1.0,
+     epochs=11,
+     num_warmup_updates=200,
+     save_per_updates=400,
+     last_per_steps=800,
+     finetune=True,
+     file_checkpoint_train="",
+     tokenizer_type="pinyin",
+     tokenizer_file="",
+     mixed_precision="fp16",
+ ):
+     global training_process, tts_api
+
+     if tts_api is not None:
+         del tts_api
+         gc.collect()
+         torch.cuda.empty_cache()
+         tts_api = None
+
+     path_project = os.path.join(path_data, dataset_name)
+
+     if not os.path.isdir(path_project):
+         yield (
+             f"There is no project named {dataset_name}",
+             gr.update(interactive=True),
+             gr.update(interactive=False),
+         )
+         return
+
+     file_raw = os.path.join(path_project, "raw.arrow")
+     if not os.path.isfile(file_raw):
+         yield f"There is no file {file_raw}", gr.update(interactive=True), gr.update(interactive=False)
+         return
+
+     # Check if a training process is already running
+     if training_process is not None:
+         # this function is a generator, so the status must be yielded rather than returned
+         yield "Training is already running!", gr.update(interactive=False), gr.update(interactive=True)
+         return
+
+     yield "start train", gr.update(interactive=False), gr.update(interactive=False)
+
+     # Command to run the training script with the specified arguments
+
+     if tokenizer_file == "":
+         if dataset_name.endswith("_pinyin"):
+             tokenizer_type = "pinyin"
+         elif dataset_name.endswith("_char"):
+             tokenizer_type = "char"
+     else:
+         # a vocab file was supplied, so the CLI must be told to use the custom tokenizer
+         tokenizer_type = "custom"
+
+     dataset_name = dataset_name.replace("_pinyin", "").replace("_char", "")
+
+     if mixed_precision != "none":
+         fp16 = f"--mixed_precision={mixed_precision}"
+     else:
+         fp16 = ""
+
+     # the training script added in this commit lives at src/f5_tts/train/finetune_cli.py
+     cmd = (
+         f"accelerate launch {fp16} src/f5_tts/train/finetune_cli.py --exp_name {exp_name} "
+         f"--learning_rate {learning_rate} "
+         f"--batch_size_per_gpu {batch_size_per_gpu} "
+         f"--batch_size_type {batch_size_type} "
+         f"--max_samples {max_samples} "
+         f"--grad_accumulation_steps {grad_accumulation_steps} "
+         f"--max_grad_norm {max_grad_norm} "
+         f"--epochs {epochs} "
+         f"--num_warmup_updates {num_warmup_updates} "
+         f"--save_per_updates {save_per_updates} "
+         f"--last_per_steps {last_per_steps} "
+         f"--dataset_name {dataset_name}"
+     )
+     if finetune:
+         cmd += f" --finetune {finetune}"
+
+     if file_checkpoint_train != "":
+         # finetune_cli.py accepts the checkpoint path through --pretrain
+         cmd += f" --pretrain {file_checkpoint_train}"
+
+     if tokenizer_file != "":
+         cmd += f" --tokenizer_path {tokenizer_file}"
+
+     cmd += f" --tokenizer {tokenizer_type} "
+
+     print(cmd)
+
+     try:
+         # Start the training process
+         training_process = subprocess.Popen(cmd, shell=True)
+
+         time.sleep(5)
+         yield "train start", gr.update(interactive=False), gr.update(interactive=True)
+
+         # Wait for the training process to finish
+         training_process.wait()
+         time.sleep(1)
+
+         if training_process is None:
+             text_info = "train stop"
+         else:
+             text_info = "train complete!"
+
+     except Exception as e:  # Catch all exceptions
+         # Ensure that we reset the training process variable in case of an error
+         text_info = f"An error occurred: {str(e)}"
+
+     training_process = None
+
+     yield text_info, gr.update(interactive=True), gr.update(interactive=False)
+
+
+ def stop_training():
+     global training_process
+     if training_process is None:
+         return "Training is not running!", gr.update(interactive=True), gr.update(interactive=False)
+     terminate_process_tree(training_process.pid)
+     training_process = None
+     return "train stop", gr.update(interactive=True), gr.update(interactive=False)
+
+
+ def get_list_projects():
+     project_list = []
+     for folder in os.listdir("data"):
+         path_folder = os.path.join("data", folder)
+         if not os.path.isdir(path_folder):
+             continue
+         folder = folder.lower()
+         if folder == "emilia_zh_en_pinyin":
+             continue
+         project_list.append(folder)
+
+     projects_selected = None if not project_list else project_list[-1]
+
+     return project_list, projects_selected
+
+
+ def create_data_project(name, tokenizer_type):
+     name += "_" + tokenizer_type
+     os.makedirs(os.path.join(path_data, name), exist_ok=True)
+     os.makedirs(os.path.join(path_data, name, "dataset"), exist_ok=True)
+     project_list, projects_selected = get_list_projects()
+     return gr.update(choices=project_list, value=name)
+
+
+ def transcribe(file_audio, language="english"):
+     global pipe
+
+     if pipe is None:
+         pipe = pipeline(
+             "automatic-speech-recognition",
+             model="openai/whisper-large-v3-turbo",
+             torch_dtype=torch.float16,
+             device=device,
+         )
+
+     text_transcribe = pipe(
+         file_audio,
+         chunk_length_s=30,
+         batch_size=128,
+         generate_kwargs={"task": "transcribe", "language": language},
+         return_timestamps=False,
+     )["text"].strip()
+     return text_transcribe
+
+
+ def transcribe_all(name_project, audio_files, language, user=False, progress=gr.Progress()):
+     path_project = os.path.join(path_data, name_project)
+     path_dataset = os.path.join(path_project, "dataset")
+     path_project_wavs = os.path.join(path_project, "wavs")
+     file_metadata = os.path.join(path_project, "metadata.csv")
+
+     if not user:
+         if audio_files is None:
+             return "You need to load an audio file."
+
+     if os.path.isdir(path_project_wavs):
+         shutil.rmtree(path_project_wavs)
+
+     if os.path.isfile(file_metadata):
+         os.remove(file_metadata)
+
+     os.makedirs(path_project_wavs, exist_ok=True)
+
+     if user:
+         file_audios = [
+             file
+             for format in ("*.wav", "*.ogg", "*.opus", "*.mp3", "*.flac")
+             for file in glob(os.path.join(path_dataset, format))
+         ]
+         if file_audios == []:
+             return "No audio file was found in the dataset."
+     else:
+         file_audios = audio_files
+
+     alpha = 0.5
+     _max = 1.0
+     slicer = Slicer(24000)
+
+     num = 0
+     error_num = 0
+     data = ""
+     for file_audio in progress.tqdm(file_audios, desc="transcribe files", total=len(file_audios)):
+         audio, _ = librosa.load(file_audio, sr=24000, mono=True)
+
+         list_slicer = slicer.slice(audio)
+         for chunk, start, end in progress.tqdm(list_slicer, total=len(list_slicer), desc="slicer files"):
+             name_segment = f"segment_{num}"
+             file_segment = os.path.join(path_project_wavs, f"{name_segment}.wav")
+
+             tmp_max = np.abs(chunk).max()
+             if tmp_max > 1:
+                 chunk /= tmp_max
+             chunk = (chunk / tmp_max * (_max * alpha)) + (1 - alpha) * chunk
+             wavfile.write(file_segment, 24000, (chunk * 32767).astype(np.int16))
+
+             try:
+                 text = transcribe(file_segment, language)
+                 text = text.lower().strip().replace('"', "")
+
+                 data += f"{name_segment}|{text}\n"
+
+                 num += 1
+             except:  # noqa: E722
+                 error_num += 1
+
+     with open(file_metadata, "w", encoding="utf-8-sig") as f:
+         f.write(data)
+
+     # error_num is an integer counter, so compare against 0 rather than an empty list
+     if error_num != 0:
+         error_text = f"\nerror files : {error_num}"
+     else:
+         error_text = ""
+
+     return f"transcribe complete samples : {num}\npath : {path_project_wavs}{error_text}"
+
+
+ def format_seconds_to_hms(seconds):
+     hours = int(seconds / 3600)
+     minutes = int((seconds % 3600) / 60)
+     seconds = seconds % 60
+     return "{:02d}:{:02d}:{:02d}".format(hours, minutes, int(seconds))
+
+
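
Before each sliced chunk is written to disk, `transcribe_all` blends the chunk with a peak-normalised copy of itself, weighted by `alpha`. The arithmetic is easier to follow on a plain list of samples. The sketch below is an illustration only: `blend_normalize` is a hypothetical name, it skips the preliminary `tmp_max > 1` rescaling, and the guard for an all-zero chunk is an addition that the original code does not have.

```python
def blend_normalize(chunk, alpha=0.5, max_amp=1.0):
    """Mix a peak-normalised copy of `chunk` with the original samples.

    alpha = 1 gives full peak normalisation to `max_amp`; alpha = 0 leaves the chunk unchanged.
    """
    peak = max((abs(x) for x in chunk), default=0.0)
    if peak == 0.0:
        # assumption for this sketch: leave silent chunks untouched instead of dividing by zero
        return list(chunk)
    return [x / peak * (max_amp * alpha) + (1 - alpha) * x for x in chunk]


# Usage: a quiet chunk whose loudest sample is 0.5
blended = blend_normalize([0.5, -0.25, 0.0])
```

With `alpha = 0.5` a chunk peaking at 0.5 ends up peaking at 0.75, halfway between its original level and full scale, so quiet segments are raised without forcing every segment to the same peak.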
+ def create_metadata(name_project, ch_tokenizer, progress=gr.Progress()):
+     path_project = os.path.join(path_data, name_project)
+     path_project_wavs = os.path.join(path_project, "wavs")
+     file_metadata = os.path.join(path_project, "metadata.csv")
+     file_raw = os.path.join(path_project, "raw.arrow")
+     file_duration = os.path.join(path_project, "duration.json")
+     file_vocab = os.path.join(path_project, "vocab.txt")
+
+     if not os.path.isfile(file_metadata):
+         return "The file was not found in " + file_metadata, ""
+
+     with open(file_metadata, "r", encoding="utf-8-sig") as f:
+         data = f.read()
+
+     audio_path_list = []
+     text_list = []
+     duration_list = []
+
+     lines = data.split("\n")
+     length = 0
+     result = []
+     error_files = []
+     text_vocab_set = set()
+     for line in progress.tqdm(lines, total=len(lines)):
+         sp_line = line.split("|")
+         if len(sp_line) != 2:
+             continue
+         name_audio, text = sp_line[:2]
+
+         file_audio = os.path.join(path_project_wavs, name_audio + ".wav")
+
+         if not os.path.isfile(file_audio):
+             error_files.append([file_audio, "error path"])
+             continue
+
+         try:
+             duration = get_audio_duration(file_audio)
+         except Exception as e:
+             error_files.append([file_audio, "duration"])
+             print(f"Error processing {file_audio}: {e}")
+             continue
+
+         # reject clips outside the 1 to 25 second range
+         if duration < 1 or duration > 25:
+             error_files.append([file_audio, "duration < 1 or > 25"])
+             continue
+         if len(text) < 4:
+             error_files.append([file_audio, "text too short (fewer than 4 characters)"])
+             continue
+
+         text = clear_text(text)
+         text = convert_char_to_pinyin([text], polyphone=True)[0]
+
+         audio_path_list.append(file_audio)
+         duration_list.append(duration)
+         text_list.append(text)
+
+         result.append({"audio_path": file_audio, "text": text, "duration": duration})
+         if ch_tokenizer:
+             text_vocab_set.update(list(text))
+
+         length += duration
+
+     if duration_list == []:
+         return f"Error: No audio files found in the specified path : {path_project_wavs}", ""
+
+     min_second = round(min(duration_list), 2)
+     max_second = round(max(duration_list), 2)
+
+     with ArrowWriter(path=file_raw, writer_batch_size=1) as writer:
+         for line in progress.tqdm(result, total=len(result), desc="prepare data"):
+             writer.write(line)
+
+     with open(file_duration, "w") as f:
+         json.dump({"duration": duration_list}, f, ensure_ascii=False)
+
+     new_vocal = ""
+     if not ch_tokenizer:
+         file_vocab_finetune = "data/Emilia_ZH_EN_pinyin/vocab.txt"
+         if not os.path.isfile(file_vocab_finetune):
+             # every return path yields (message, vocab text)
+             return "Error: Vocabulary file 'Emilia_ZH_EN_pinyin' not found!", ""
+         shutil.copy2(file_vocab_finetune, file_vocab)
+
+         with open(file_vocab, "r", encoding="utf-8-sig") as f:
+             vocab_char_map = {}
+             for i, char in enumerate(f):
+                 vocab_char_map[char[:-1]] = i
+         vocab_size = len(vocab_char_map)
+
+     else:
+         with open(file_vocab, "w", encoding="utf-8-sig") as f:
+             for vocab in sorted(text_vocab_set):
+                 f.write(vocab + "\n")
+                 new_vocal += vocab + "\n"
+         vocab_size = len(text_vocab_set)
+
+     if error_files != []:
+         error_text = "\n".join([" = ".join(item) for item in error_files])
+     else:
+         error_text = ""
+
+     return (
+         f"prepare complete \nsamples : {len(text_list)}\ntime data : {format_seconds_to_hms(length)}\nmin sec : {min_second}\nmax sec : {max_second}\nfile_arrow : {file_raw}\nvocab : {vocab_size}\n{error_text}",
+         new_vocal,
+     )
+
+
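
`create_metadata` above reads `metadata.csv` as one `name|text` pair per line, silently skips lines that do not split into exactly two fields, and rejects entries whose text is shorter than four characters. The parsing and filtering can be sketched without any audio or Gradio dependencies; `parse_metadata` is a hypothetical helper for illustration and leaves out the duration check and pinyin conversion.

```python
def parse_metadata(data, min_chars=4):
    """Split `name|text` lines into accepted rows and rejected names.

    Illustrative only: the real function also checks the audio file and its duration.
    """
    rows, rejected = [], []
    for line in data.split("\n"):
        parts = line.split("|")
        if len(parts) != 2:
            continue  # blank or malformed line, skipped without an error entry
        name, text = parts
        if len(text) < min_chars:
            rejected.append(name)
            continue
        rows.append((name, text.lower().strip()))
    return rows, rejected


# Usage: one good line, one line with too little text, one blank line, one malformed line
rows, rejected = parse_metadata("segment_0|Hello world\nsegment_1|hi\n\nno separator here")
```

Because the separator is a bare `|`, a transcript that itself contains a pipe character splits into more than two fields and is dropped by the same length check.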
592
+ def check_user(value):
593
+ return gr.update(visible=not value), gr.update(visible=value)
594
+
595
+
596
+ def calculate_train(
597
+ name_project,
598
+ batch_size_type,
599
+ max_samples,
600
+ learning_rate,
601
+ num_warmup_updates,
602
+ save_per_updates,
603
+ last_per_steps,
604
+ finetune,
605
+ ):
606
+ path_project = os.path.join(path_data, name_project)
607
+ file_duraction = os.path.join(path_project, "duration.json")
608
+
609
+ if not os.path.isfile(file_duraction):
610
+ return (
611
+ 1000,
612
+ max_samples,
613
+ num_warmup_updates,
614
+ save_per_updates,
615
+ last_per_steps,
616
+ "project not found !",
617
+ learning_rate,
618
+ )
619
+
620
+ with open(file_duraction, "r") as file:
621
+ data = json.load(file)
622
+
623
+ duration_list = data["duration"]
624
+ samples = len(duration_list)
625
+ hours = sum(duration_list) / 3600
626
+
627
+ # if torch.cuda.is_available():
628
+ # gpu_properties = torch.cuda.get_device_properties(0)
629
+ # total_memory = gpu_properties.total_memory / (1024**3)
630
+ # elif torch.backends.mps.is_available():
631
+ # total_memory = psutil.virtual_memory().available / (1024**3)
632
+
633
+ if torch.cuda.is_available():
634
+ gpu_count = torch.cuda.device_count()
635
+ total_memory = 0
636
+ for i in range(gpu_count):
637
+ gpu_properties = torch.cuda.get_device_properties(i)
638
+ total_memory += gpu_properties.total_memory / (1024**3) # in GB
639
+
640
+ elif torch.backends.mps.is_available():
641
+ gpu_count = 1
642
+ total_memory = psutil.virtual_memory().available / (1024**3)
+ else:
+ # CPU-only fallback so gpu_count and total_memory are always defined
+ gpu_count = 1
+ total_memory = psutil.virtual_memory().available / (1024**3)
643
+
644
+ if batch_size_type == "frame":
645
+ batch = int(total_memory * 0.5)
646
+ batch = (lambda num: num + 1 if num % 2 != 0 else num)(batch)
647
+ batch_size_per_gpu = int(38400 / batch)
648
+ else:
649
+ batch_size_per_gpu = int(total_memory / 8)
650
+ batch_size_per_gpu = (lambda num: num + 1 if num % 2 != 0 else num)(batch_size_per_gpu)
651
+ batch = batch_size_per_gpu
652
+
653
+ if batch_size_per_gpu <= 0:
654
+ batch_size_per_gpu = 1
655
+
656
+ if samples < 64:
657
+ max_samples = int(samples * 0.25)
658
+ else:
659
+ max_samples = 64
660
+
661
+ num_warmup_updates = int(samples * 0.05)
662
+ save_per_updates = int(samples * 0.10)
663
+ last_per_steps = int(save_per_updates * 5)
664
+
665
+ max_samples = (lambda num: num + 1 if num % 2 != 0 else num)(max_samples)
666
+ num_warmup_updates = (lambda num: num + 1 if num % 2 != 0 else num)(num_warmup_updates)
667
+ save_per_updates = (lambda num: num + 1 if num % 2 != 0 else num)(save_per_updates)
668
+ last_per_steps = (lambda num: num + 1 if num % 2 != 0 else num)(last_per_steps)
669
+
670
+ total_hours = hours
671
+ mel_hop_length = 256
672
+ mel_sampling_rate = 24000
673
+
674
+ # target
675
+ wanted_max_updates = 1000000
676
+
677
+ # train params
678
+ gpus = gpu_count
679
+ frames_per_gpu = batch_size_per_gpu # 8 * 38400 = 307200
680
+ grad_accum = 1
681
+
682
+ # intermediate
683
+ mini_batch_frames = frames_per_gpu * grad_accum * gpus
684
+ mini_batch_hours = mini_batch_frames * mel_hop_length / mel_sampling_rate / 3600
685
+ updates_per_epoch = total_hours / mini_batch_hours
686
+ # steps_per_epoch = updates_per_epoch * grad_accum
687
+ epochs = wanted_max_updates / updates_per_epoch
688
+
689
+ if finetune:
690
+ learning_rate = 1e-5
691
+ else:
692
+ learning_rate = 7.5e-5
693
+
694
+ return (
695
+ batch_size_per_gpu,
696
+ max_samples,
697
+ num_warmup_updates,
698
+ save_per_updates,
699
+ last_per_steps,
700
+ samples,
701
+ learning_rate,
702
+ int(epochs),
703
+ )
704
+
705
+
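The epoch estimate in `calculate_train` is plain arithmetic: frames per update are converted to hours of audio, and the dataset length divided by that gives updates per epoch. A minimal sketch of the same calculation follows; the function names `make_even` and `estimate_epochs` are hypothetical and not part of the repository, and the defaults mirror the constants used above.

```python
def make_even(num: int) -> int:
    """Round an odd integer up to the next even one (same rule as the lambdas above)."""
    return num + 1 if num % 2 != 0 else num


def estimate_epochs(
    total_hours: float,
    frames_per_gpu: int,
    gpus: int = 1,
    grad_accum: int = 1,
    wanted_max_updates: int = 1_000_000,
    mel_hop_length: int = 256,
    mel_sampling_rate: int = 24_000,
) -> float:
    """Return how many epochs are needed to reach wanted_max_updates."""
    # Mel frames consumed by one optimizer update across all GPUs.
    mini_batch_frames = frames_per_gpu * grad_accum * gpus
    # Each frame covers hop_length / sampling_rate seconds of audio.
    mini_batch_hours = mini_batch_frames * mel_hop_length / mel_sampling_rate / 3600
    updates_per_epoch = total_hours / mini_batch_hours
    return wanted_max_updates / updates_per_epoch


if __name__ == "__main__":
    # 3375 frames * 256 / 24000 = 36 s per update, so 1 hour is about 100 updates.
    print(estimate_epochs(total_hours=1.0, frames_per_gpu=3375, wanted_max_updates=1000))
```

Note that the estimate treats `frames_per_gpu` as a frame count, so it only matches the real schedule when the batch size type is "frame".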
706
+ def extract_and_save_ema_model(checkpoint_path: str, new_checkpoint_path: str, safetensors: bool) -> str:
707
+ try:
708
+ checkpoint = torch.load(checkpoint_path)
709
+ print("Original Checkpoint Keys:", checkpoint.keys())
710
+
711
+ ema_model_state_dict = checkpoint.get("ema_model_state_dict", None)
712
+ if ema_model_state_dict is None:
713
+ return "No 'ema_model_state_dict' found in the checkpoint."
714
+
715
+ if safetensors:
716
+ new_checkpoint_path = new_checkpoint_path.replace(".pt", ".safetensors")
717
+ save_file(ema_model_state_dict, new_checkpoint_path)
718
+ else:
719
+ new_checkpoint_path = new_checkpoint_path.replace(".safetensors", ".pt")
720
+ new_checkpoint = {"ema_model_state_dict": ema_model_state_dict}
721
+ torch.save(new_checkpoint, new_checkpoint_path)
722
+
723
+ return f"New checkpoint saved at: {new_checkpoint_path}"
724
+
725
+ except Exception as e:
726
+ return f"An error occurred: {e}"
727
+
728
+
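`extract_and_save_ema_model` switches the output extension with `str.replace`, which rewrites every occurrence of the substring, including one inside a directory name. A minimal sketch of a suffix-only alternative using `pathlib` is shown below; `with_checkpoint_suffix` is a hypothetical helper, not something the repository defines, and unlike the original it also replaces extensions other than `.pt` and `.safetensors`.

```python
from pathlib import Path


def with_checkpoint_suffix(path: str, safetensors: bool) -> str:
    """Return path with its final extension set to .safetensors or .pt."""
    suffix = ".safetensors" if safetensors else ".pt"
    # with_suffix only touches the last extension of the file name,
    # so a ".pt" inside a parent directory name is left alone.
    return str(Path(path).with_suffix(suffix))


if __name__ == "__main__":
    print(with_checkpoint_suffix("ckpts/my.pt_run/model_1200.pt", safetensors=True))
```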
729
+ def vocab_check(project_name):
730
+ name_project = project_name
731
+ path_project = os.path.join(path_data, name_project)
732
+
733
+ file_metadata = os.path.join(path_project, "metadata.csv")
734
+
735
+ file_vocab = "data/Emilia_ZH_EN_pinyin/vocab.txt"
736
+ if not os.path.isfile(file_vocab):
737
+ return f"the file {file_vocab} not found !"
738
+
739
+ with open(file_vocab, "r", encoding="utf-8-sig") as f:
740
+ data = f.read()
741
+ vocab = data.split("\n")
742
+ vocab = set(vocab)
743
+
744
+ if not os.path.isfile(file_metadata):
745
+ return f"the file {file_metadata} not found !"
746
+
747
+ with open(file_metadata, "r", encoding="utf-8-sig") as f:
748
+ data = f.read()
749
+
750
+ miss_symbols = []
751
+ miss_symbols_keep = {}
752
+ for item in data.split("\n"):
753
+ sp = item.split("|")
754
+ if len(sp) != 2:
755
+ continue
756
+
757
+ text = sp[1].lower().strip()
758
+
759
+ for t in text:
760
+ if t not in vocab and t not in miss_symbols_keep:
761
+ miss_symbols.append(t)
762
+ miss_symbols_keep[t] = t
763
+ if miss_symbols == []:
764
+ info = "You can train using your language !"
765
+ else:
766
+ info = f"The following symbols are missing in your language : {len(miss_symbols)}\n\n" + "\n".join(miss_symbols)
767
+
768
+ return info
769
+
770
+
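The check in `vocab_check` can be exercised without any files by passing the metadata text and the vocabulary directly. The sketch below does that; `find_missing_symbols` is a hypothetical name, and it uses a set for the "already seen" bookkeeping where the original uses a dict.

```python
def find_missing_symbols(metadata_text, vocab):
    """Return characters used in 'name|text' rows that are absent from vocab, in first-seen order."""
    missing = []
    seen = set()
    for row in metadata_text.split("\n"):
        parts = row.split("|")
        if len(parts) != 2:
            continue  # skip blank or malformed rows, as the original does
        for ch in parts[1].lower().strip():
            if ch not in vocab and ch not in seen:
                seen.add(ch)
                missing.append(ch)
    return missing


if __name__ == "__main__":
    vocab = set("abc ")
    print(find_missing_symbols("a1|Abc d\nbad row\na2|cab é", vocab))
```

Because the text is lowercased before the lookup, an uppercase-only vocabulary entry would never match.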
771
+ def get_random_sample_prepare(project_name):
772
+ name_project = project_name
773
+ path_project = os.path.join(path_data, name_project)
774
+ file_arrow = os.path.join(path_project, "raw.arrow")
775
+ if not os.path.isfile(file_arrow):
776
+ return "", None
777
+ dataset = Dataset_.from_file(file_arrow)
778
+ random_sample = dataset.shuffle(seed=random.randint(0, 1000)).select([0])
779
+ text = "[" + " , ".join(["' " + t + " '" for t in random_sample["text"][0]]) + "]"
780
+ audio_path = random_sample["audio_path"][0]
781
+ return text, audio_path
782
+
783
+
784
+ def get_random_sample_transcribe(project_name):
785
+ name_project = project_name
786
+ path_project = os.path.join(path_data, name_project)
787
+ file_metadata = os.path.join(path_project, "metadata.csv")
788
+ if not os.path.isfile(file_metadata):
789
+ return "", None
790
+
791
+ data = ""
792
+ with open(file_metadata, "r", encoding="utf-8-sig") as f:
793
+ data = f.read()
794
+
795
+ list_data = []
796
+ for item in data.split("\n"):
797
+ sp = item.split("|")
798
+ if len(sp) != 2:
799
+ continue
800
+ list_data.append([os.path.join(path_project, "wavs", sp[0] + ".wav"), sp[1]])
801
+
802
+ if list_data == []:
803
+ return "", None
804
+
805
+ random_item = random.choice(list_data)
806
+
807
+ return random_item[1], random_item[0]
808
+
809
+
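`get_random_sample_transcribe` reads `metadata.csv` as `name|text` rows and silently drops any row that does not split into exactly two fields. A small sketch of that parsing step, with the hypothetical name `parse_metadata`, makes the consequence visible: a transcript that itself contains `|` is skipped.

```python
import os


def parse_metadata(metadata_text, path_project):
    """Turn 'name|text' rows into (wav_path, text) pairs, skipping malformed rows."""
    pairs = []
    for row in metadata_text.split("\n"):
        parts = row.split("|")
        if len(parts) != 2:
            continue  # blank lines and rows with extra '|' are ignored
        wav_path = os.path.join(path_project, "wavs", parts[0] + ".wav")
        pairs.append((wav_path, parts[1]))
    return pairs


if __name__ == "__main__":
    for wav_path, text in parse_metadata("audio1|hello\n\naudio2|a|b\naudio3|bye", "my_speak"):
        print(wav_path, "->", text)
```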
810
+ def get_random_sample_infer(project_name):
811
+ text, audio = get_random_sample_transcribe(project_name)
812
+ return (
813
+ text,
814
+ text,
815
+ audio,
816
+ )
817
+
818
+
819
+ def infer(file_checkpoint, exp_name, ref_text, ref_audio, gen_text, nfe_step):
820
+ global last_checkpoint, last_device, tts_api
821
+
822
+ if not os.path.isfile(file_checkpoint):
823
+ return None, "checkpoint not found!"
824
+
825
+ if training_process is not None:
826
+ device_test = "cpu"
827
+ else:
828
+ device_test = None
829
+
830
+ if last_checkpoint != file_checkpoint or last_device != device_test:
831
+ if last_checkpoint != file_checkpoint:
832
+ last_checkpoint = file_checkpoint
833
+ if last_device != device_test:
834
+ last_device = device_test
835
+
836
+ tts_api = F5TTS(model_type=exp_name, ckpt_file=file_checkpoint, device=device_test)
837
+
838
+ print("update", device_test, file_checkpoint)
839
+
840
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
841
+ tts_api.infer(gen_text=gen_text, ref_text=ref_text, ref_file=ref_audio, nfe_step=nfe_step, file_wave=f.name)
842
+ return f.name, tts_api.device
843
+
844
+
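`infer` rebuilds the `F5TTS` object only when the checkpoint path or the target device changes. The same reload-on-change pattern can be sketched without globals; `ModelCache` and its `loader` callable are hypothetical stand-ins, and the real code constructs `F5TTS(...)` at that point instead.

```python
class ModelCache:
    """Keep one loaded model and rebuild it only when (checkpoint, device) changes."""

    def __init__(self, loader):
        self._loader = loader  # callable taking (checkpoint, device)
        self._key = None
        self._model = None
        self.loads = 0  # counts how often the loader actually ran

    def get(self, checkpoint, device):
        key = (checkpoint, device)
        if self._model is None or key != self._key:
            self._model = self._loader(checkpoint, device)
            self._key = key
            self.loads += 1
        return self._model


if __name__ == "__main__":
    cache = ModelCache(lambda ckpt, dev: {"ckpt": ckpt, "device": dev})
    cache.get("model_1200.pt", None)
    cache.get("model_1200.pt", None)   # reused, no reload
    cache.get("model_1200.pt", "cpu")  # device changed, reload
    print(cache.loads)
```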
845
+ def check_finetune(finetune):
846
+ return gr.update(interactive=finetune), gr.update(interactive=finetune), gr.update(interactive=finetune)
847
+
848
+
849
+ def get_checkpoints_project(project_name, is_gradio=True):
850
+ if project_name is None:
851
+ return [], ""
852
+ project_name = project_name.replace("_pinyin", "").replace("_char", "")
853
+ path_project_ckpts = os.path.join("ckpts", project_name)
854
+
855
+ if os.path.isdir(path_project_ckpts):
856
+ files_checkpoints = glob(os.path.join(path_project_ckpts, "*.pt"))
857
+ files_checkpoints = sorted(
858
+ files_checkpoints,
859
+ key=lambda x: int(os.path.basename(x).split("_")[1].split(".")[0])
860
+ if os.path.basename(x) != "model_last.pt"
861
+ else float("inf"),
862
+ )
863
+ else:
864
+ files_checkpoints = []
865
+
866
+ selected_checkpoint = None if not files_checkpoints else files_checkpoints[0]
866
+
867
+ if is_gradio:
868
+ return gr.update(choices=files_checkpoints, value=selected_checkpoint)
869
+
870
+ return files_checkpoints, selected_checkpoint
872
+
873
+
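The sort key in `get_checkpoints_project` orders files named `model_<step>.pt` by step number and pushes `model_last.pt` to the end. The sketch below isolates that key as a named function (`checkpoint_sort_key` is a hypothetical name) and assumes every other file follows the `model_<step>.pt` pattern, as the original does.

```python
import os


def checkpoint_sort_key(path):
    """Sort numbered checkpoints by step; 'model_last.pt' always sorts last."""
    name = os.path.basename(path)
    if name == "model_last.pt":
        return float("inf")
    # "model_1200.pt" -> "1200.pt" -> "1200" -> 1200
    return int(name.split("_")[1].split(".")[0])


names = ["ckpts/demo/model_last.pt", "ckpts/demo/model_1200.pt", "ckpts/demo/model_400.pt"]
ordered = sorted(names, key=checkpoint_sort_key)

if __name__ == "__main__":
    print(ordered)
```

With this ordering, taking the first element (as the function above does for the default selection) picks the earliest numbered checkpoint rather than the most recent one.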
874
+ def get_gpu_stats():
875
+ gpu_stats = ""
876
+
877
+ if torch.cuda.is_available():
878
+ gpu_count = torch.cuda.device_count()
879
+ for i in range(gpu_count):
880
+ gpu_name = torch.cuda.get_device_name(i)
881
+ gpu_properties = torch.cuda.get_device_properties(i)
882
+ total_memory = gpu_properties.total_memory / (1024**3) # in GB
883
+ allocated_memory = torch.cuda.memory_allocated(i) / (1024**2) # in MB
884
+ reserved_memory = torch.cuda.memory_reserved(i) / (1024**2) # in MB
885
+
886
+ gpu_stats += (
887
+ f"GPU {i} Name: {gpu_name}\n"
888
+ f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
889
+ f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
890
+ f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
891
+ )
892
+
893
+ elif torch.backends.mps.is_available():
894
+ gpu_count = 1
895
+ gpu_stats += "MPS GPU\n"
896
+ total_memory = psutil.virtual_memory().total / (
897
+ 1024**3
898
+ ) # Total system memory (MPS doesn't have its own memory)
899
+ allocated_memory = 0
900
+ reserved_memory = 0
901
+
902
+ gpu_stats += (
903
+ f"Total system memory: {total_memory:.2f} GB\n"
904
+ f"Allocated GPU memory (MPS): {allocated_memory:.2f} MB\n"
905
+ f"Reserved GPU memory (MPS): {reserved_memory:.2f} MB\n"
906
+ )
907
+
908
+ else:
909
+ gpu_stats = "No GPU available"
910
+
911
+ return gpu_stats
912
+
913
+
914
+ def get_cpu_stats():
915
+ cpu_usage = psutil.cpu_percent(interval=1)
916
+ memory_info = psutil.virtual_memory()
917
+ memory_used = memory_info.used / (1024**2)
918
+ memory_total = memory_info.total / (1024**2)
919
+ memory_percent = memory_info.percent
920
+
921
+ pid = os.getpid()
922
+ process = psutil.Process(pid)
923
+ nice_value = process.nice()
924
+
925
+ cpu_stats = (
926
+ f"CPU Usage: {cpu_usage:.2f}%\n"
927
+ f"System Memory: {memory_used:.2f} MB used / {memory_total:.2f} MB total ({memory_percent}% used)\n"
928
+ f"Process Priority (Nice value): {nice_value}"
929
+ )
930
+
931
+ return cpu_stats
932
+
933
+
934
+ def get_combined_stats():
935
+ gpu_stats = get_gpu_stats()
936
+ cpu_stats = get_cpu_stats()
937
+ combined_stats = f"### GPU Stats\n{gpu_stats}\n\n### CPU Stats\n{cpu_stats}"
938
+ return combined_stats
939
+
940
+
941
+ with gr.Blocks() as app:
942
+ gr.Markdown(
943
+ """
944
+ # E2/F5 TTS AUTOMATIC FINETUNE
945
+
946
+ This is a local web UI for F5 TTS with advanced batch processing support. This app supports the following TTS models:
947
+
948
+ * [F5-TTS](https://arxiv.org/abs/2410.06885) (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching)
949
+ * [E2 TTS](https://arxiv.org/abs/2406.18009) (Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS)
950
+
951
+ The checkpoints support English and Chinese.
952
+
953
+ For tutorials and updates, see https://github.com/SWivid/F5-TTS/discussions/143
954
+ """
955
+ )
956
+
957
+ with gr.Row():
958
+ projects, projects_selelect = get_list_projects()
959
+ tokenizer_type = gr.Radio(label="Tokenizer Type", choices=["pinyin", "char"], value="pinyin")
960
+ project_name = gr.Textbox(label="project name", value="my_speak")
961
+ bt_create = gr.Button("create new project")
962
+
963
+ cm_project = gr.Dropdown(choices=projects, value=projects_selelect, label="Project", allow_custom_value=True)
964
+
965
+ bt_create.click(fn=create_data_project, inputs=[project_name, tokenizer_type], outputs=[cm_project])
966
+
967
+ with gr.Tabs():
968
+ with gr.TabItem("transcribe Data"):
969
+ ch_manual = gr.Checkbox(label="audio from path", value=False)
970
+
971
+ mark_info_transcribe = gr.Markdown(
972
+ """```plaintext
973
+ Place your 'wavs' folder and 'metadata.csv' file in the '{your_project_name}' directory.
974
+
975
+ my_speak/
976
+
977
+ └── dataset/
978
+ ├── audio1.wav
979
+ └── audio2.wav
980
+ ...
981
+ ```""",
982
+ visible=False,
983
+ )
984
+
985
+ audio_speaker = gr.File(label="voice", type="filepath", file_count="multiple")
986
+ txt_lang = gr.Text(label="Language", value="english")
987
+ bt_transcribe = gr.Button("transcribe")
988
+ txt_info_transcribe = gr.Text(label="info", value="")
989
+ bt_transcribe.click(
990
+ fn=transcribe_all,
991
+ inputs=[cm_project, audio_speaker, txt_lang, ch_manual],
992
+ outputs=[txt_info_transcribe],
993
+ )
994
+ ch_manual.change(fn=check_user, inputs=[ch_manual], outputs=[audio_speaker, mark_info_transcribe])
995
+
996
+ random_sample_transcribe = gr.Button("random sample")
997
+
998
+ with gr.Row():
999
+ random_text_transcribe = gr.Text(label="Text")
1000
+ random_audio_transcribe = gr.Audio(label="Audio", type="filepath")
1001
+
1002
+ random_sample_transcribe.click(
1003
+ fn=get_random_sample_transcribe,
1004
+ inputs=[cm_project],
1005
+ outputs=[random_text_transcribe, random_audio_transcribe],
1006
+ )
1007
+
1008
+ with gr.TabItem("prepare Data"):
1009
+ gr.Markdown(
1010
+ """```plaintext
1011
+ Place your 'wavs' folder and 'metadata.csv' file in the '{your_project_name}' directory.
1012
+ my_speak/
1013
+
1014
+ ├── wavs/
1015
+ │ ├── audio1.wav
1016
+ │ └── audio2.wav
1017
+ │ ...
1018
+
1019
+ └── metadata.csv
1020
+
1021
+ file format metadata.csv
1022
+
1023
+ audio1|text1
1024
+ audio2|text2
1025
+ ...
1026
+
1027
+ ```"""
1028
+ )
1029
+ ch_tokenizern = gr.Checkbox(label="create vocabulary from dataset", value=False)
1030
+ bt_prepare = gr.Button("prepare")
1031
+ txt_info_prepare = gr.Text(label="info", value="")
1032
+ txt_vocab_prepare = gr.Text(label="vocab", value="")
1033
+ bt_prepare.click(
1034
+ fn=create_metadata, inputs=[cm_project, ch_tokenizern], outputs=[txt_info_prepare, txt_vocab_prepare]
1035
+ )
1036
+
1037
+ random_sample_prepare = gr.Button("random sample")
1038
+
1039
+ with gr.Row():
1040
+ random_text_prepare = gr.Text(label="Pinyin")
1041
+ random_audio_prepare = gr.Audio(label="Audio", type="filepath")
1042
+
1043
+ random_sample_prepare.click(
1044
+ fn=get_random_sample_prepare, inputs=[cm_project], outputs=[random_text_prepare, random_audio_prepare]
1045
+ )
1046
+
1047
+ with gr.TabItem("train Data"):
1048
+ with gr.Row():
1049
+ bt_calculate = gr.Button("Auto Settings")
1050
+ lb_samples = gr.Label(label="samples")
1051
+ batch_size_type = gr.Radio(label="Batch Size Type", choices=["frame", "sample"], value="frame")
1052
+
1053
+ with gr.Row():
1054
+ ch_finetune = gr.Checkbox(label="finetune", value=True)
1055
+ tokenizer_file = gr.Textbox(label="Tokenizer File", value="")
1056
+ file_checkpoint_train = gr.Textbox(label="Pretrain Model", value="")
1057
+
1058
+ with gr.Row():
1059
+ exp_name = gr.Radio(label="Model", choices=["F5TTS_Base", "E2TTS_Base"], value="F5TTS_Base")
1060
+ learning_rate = gr.Number(label="Learning Rate", value=1e-5, step=1e-5)
1061
+
1062
+ with gr.Row():
1063
+ batch_size_per_gpu = gr.Number(label="Batch Size per GPU", value=1000)
1064
+ max_samples = gr.Number(label="Max Samples", value=64)
1065
+
1066
+ with gr.Row():
1067
+ grad_accumulation_steps = gr.Number(label="Gradient Accumulation Steps", value=1)
1068
+ max_grad_norm = gr.Number(label="Max Gradient Norm", value=1.0)
1069
+
1070
+ with gr.Row():
1071
+ epochs = gr.Number(label="Epochs", value=10)
1072
+ num_warmup_updates = gr.Number(label="Warmup Updates", value=5)
1073
+
1074
+ with gr.Row():
1075
+ save_per_updates = gr.Number(label="Save per Updates", value=10)
1076
+ last_per_steps = gr.Number(label="Last per Steps", value=50)
1077
+
1078
+ with gr.Row():
1079
+ mixed_precision = gr.Radio(label="mixed_precision", choices=["none", "fp16", "bf16"], value="none")
1080
+ start_button = gr.Button("Start Training")
1081
+ stop_button = gr.Button("Stop Training", interactive=False)
1082
+
1083
+ txt_info_train = gr.Text(label="info", value="")
1084
+ start_button.click(
1085
+ fn=start_training,
1086
+ inputs=[
1087
+ cm_project,
1088
+ exp_name,
1089
+ learning_rate,
1090
+ batch_size_per_gpu,
1091
+ batch_size_type,
1092
+ max_samples,
1093
+ grad_accumulation_steps,
1094
+ max_grad_norm,
1095
+ epochs,
1096
+ num_warmup_updates,
1097
+ save_per_updates,
1098
+ last_per_steps,
1099
+ ch_finetune,
1100
+ file_checkpoint_train,
1101
+ tokenizer_type,
1102
+ tokenizer_file,
1103
+ mixed_precision,
1104
+ ],
1105
+ outputs=[txt_info_train, start_button, stop_button],
1106
+ )
1107
+ stop_button.click(fn=stop_training, outputs=[txt_info_train, start_button, stop_button])
1108
+
1109
+ bt_calculate.click(
1110
+ fn=calculate_train,
1111
+ inputs=[
1112
+ cm_project,
1113
+ batch_size_type,
1114
+ max_samples,
1115
+ learning_rate,
1116
+ num_warmup_updates,
1117
+ save_per_updates,
1118
+ last_per_steps,
1119
+ ch_finetune,
1120
+ ],
1121
+ outputs=[
1122
+ batch_size_per_gpu,
1123
+ max_samples,
1124
+ num_warmup_updates,
1125
+ save_per_updates,
1126
+ last_per_steps,
1127
+ lb_samples,
1128
+ learning_rate,
1129
+ epochs,
1130
+ ],
1131
+ )
1132
+
1133
+ ch_finetune.change(
1134
+ check_finetune, inputs=[ch_finetune], outputs=[file_checkpoint_train, tokenizer_file, tokenizer_type]
1135
+ )
1136
+
1137
+ with gr.TabItem("reduce checkpoint"):
1138
+ txt_path_checkpoint = gr.Text(label="path checkpoint :")
1139
+ txt_path_checkpoint_small = gr.Text(label="path output :")
1140
+ ch_safetensors = gr.Checkbox(label="safetensors", value=False)
1141
+ txt_info_reduse = gr.Text(label="info", value="")
1142
+ reduse_button = gr.Button("reduce")
1143
+ reduse_button.click(
1144
+ fn=extract_and_save_ema_model,
1145
+ inputs=[txt_path_checkpoint, txt_path_checkpoint_small, ch_safetensors],
1146
+ outputs=[txt_info_reduse],
1147
+ )
1148
+
1149
+ with gr.TabItem("vocab check"):
1150
+ check_button = gr.Button("check vocab")
1151
+ txt_info_check = gr.Text(label="info", value="")
1152
+ check_button.click(fn=vocab_check, inputs=[cm_project], outputs=[txt_info_check])
1153
+
1154
+ with gr.TabItem("test model"):
1155
+ exp_name = gr.Radio(label="Model", choices=["F5-TTS", "E2-TTS"], value="F5-TTS")
1156
+ list_checkpoints, checkpoint_select = get_checkpoints_project(projects_selelect, False)
1157
+
1158
+ nfe_step = gr.Number(label="n_step", value=32)
1159
+
1160
+ with gr.Row():
1161
+ cm_checkpoint = gr.Dropdown(
1162
+ choices=list_checkpoints, value=checkpoint_select, label="checkpoints", allow_custom_value=True
1163
+ )
1164
+ bt_checkpoint_refresh = gr.Button("refresh")
1165
+
1166
+ random_sample_infer = gr.Button("random sample")
1167
+
1168
+ ref_text = gr.Textbox(label="ref text")
1169
+ ref_audio = gr.Audio(label="audio ref", type="filepath")
1170
+ gen_text = gr.Textbox(label="gen text")
1171
+ random_sample_infer.click(
1172
+ fn=get_random_sample_infer, inputs=[cm_project], outputs=[ref_text, gen_text, ref_audio]
1173
+ )
1174
+
1175
+ with gr.Row():
1176
+ txt_info_gpu = gr.Textbox("", label="device")
1177
+ check_button_infer = gr.Button("infer")
1178
+
1179
+ gen_audio = gr.Audio(label="audio gen", type="filepath")
1180
+
1181
+ check_button_infer.click(
1182
+ fn=infer,
1183
+ inputs=[cm_checkpoint, exp_name, ref_text, ref_audio, gen_text, nfe_step],
1184
+ outputs=[gen_audio, txt_info_gpu],
1185
+ )
1186
+
1187
+ bt_checkpoint_refresh.click(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
1188
+ cm_project.change(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
1189
+
1190
+ with gr.TabItem("system info"):
1191
+ output_box = gr.Textbox(label="GPU and CPU Information", lines=20)
1192
+
1193
+ def update_stats():
1194
+ return get_combined_stats()
1195
+
1196
+ update_button = gr.Button("Update Stats")
1197
+ update_button.click(fn=update_stats, outputs=output_box)
1198
+
1199
+ def auto_update():
1200
+ yield gr.update(value=update_stats())
1201
+
1202
+ gr.update(fn=auto_update, inputs=[], outputs=output_box)
1203
+
1204
+
1205
+ @click.command()
1206
+ @click.option("--port", "-p", default=None, type=int, help="Port to run the app on")
1207
+ @click.option("--host", "-H", default=None, help="Host to run the app on")
1208
+ @click.option(
1209
+ "--share",
1210
+ "-s",
1211
+ default=False,
1212
+ is_flag=True,
1213
+ help="Share the app via Gradio share link",
1214
+ )
1215
+ @click.option("--api", "-a", default=True, is_flag=True, help="Allow API access")
1216
+ def main(port, host, share, api):
1217
+ global app
1218
+ print("Starting app...")
1219
+ app.queue(api_open=api).launch(server_name=host, server_port=port, share=share, show_api=api)
1220
+
1221
+
1222
+ if __name__ == "__main__":
1223
+ main()
src/f5_tts/train/train.py ADDED
@@ -0,0 +1,96 @@
1
+ # training script.
2
+
3
+ from importlib.resources import files
4
+
5
+ from f5_tts.model import CFM, UNetT, DiT, Trainer
6
+ from f5_tts.model.utils import get_tokenizer
7
+ from f5_tts.model.dataset import load_dataset
8
+
9
+
10
+ # -------------------------- Dataset Settings --------------------------- #
11
+
12
+ target_sample_rate = 24000
13
+ n_mel_channels = 100
14
+ hop_length = 256
15
+
16
+ tokenizer = "pinyin" # 'pinyin', 'char', or 'custom'
17
+ tokenizer_path = None # if tokenizer = 'custom', define the path to the tokenizer you want to use (should be vocab.txt)
18
+ dataset_name = "Emilia_ZH_EN"
19
+
20
+ # -------------------------- Training Settings -------------------------- #
21
+
22
+ exp_name = "F5TTS_Base" # F5TTS_Base | E2TTS_Base
23
+
24
+ learning_rate = 7.5e-5
25
+
26
+ batch_size_per_gpu = 38400 # 8 GPUs, 8 * 38400 = 307200
27
+ batch_size_type = "frame" # "frame" or "sample"
28
+ max_samples = 64 # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
29
+ grad_accumulation_steps = 1 # note: updates = steps / grad_accumulation_steps
30
+ max_grad_norm = 1.0
31
+
32
+ epochs = 11 # use linear decay, thus epochs control the slope
33
+ num_warmup_updates = 20000 # warmup steps
34
+ save_per_updates = 50000 # save checkpoint per steps
35
+ last_per_steps = 5000 # save last checkpoint per steps
36
+
37
+ # model params
38
+ if exp_name == "F5TTS_Base":
39
+ wandb_resume_id = None
40
+ model_cls = DiT
41
+ model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
42
+ elif exp_name == "E2TTS_Base":
43
+ wandb_resume_id = None
44
+ model_cls = UNetT
45
+ model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
46
+
47
+
48
+ # ----------------------------------------------------------------------- #
49
+
50
+
51
+ def main():
52
+ if tokenizer == "custom":
53
+ vocab_source = tokenizer_path # assigning to tokenizer_path here would make it local and raise UnboundLocalError
54
+ else:
55
+ vocab_source = dataset_name
56
+ vocab_char_map, vocab_size = get_tokenizer(vocab_source, tokenizer)
57
+
58
+ mel_spec_kwargs = dict(
59
+ target_sample_rate=target_sample_rate,
60
+ n_mel_channels=n_mel_channels,
61
+ hop_length=hop_length,
62
+ )
63
+
64
+ model = CFM(
65
+ transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
66
+ mel_spec_kwargs=mel_spec_kwargs,
67
+ vocab_char_map=vocab_char_map,
68
+ )
69
+
70
+ trainer = Trainer(
71
+ model,
72
+ epochs,
73
+ learning_rate,
74
+ num_warmup_updates=num_warmup_updates,
75
+ save_per_updates=save_per_updates,
76
+ checkpoint_path=str(files("f5_tts").joinpath(f"../../ckpts/{exp_name}")),
77
+ batch_size=batch_size_per_gpu,
78
+ batch_size_type=batch_size_type,
79
+ max_samples=max_samples,
80
+ grad_accumulation_steps=grad_accumulation_steps,
81
+ max_grad_norm=max_grad_norm,
82
+ wandb_project="CFM-TTS",
83
+ wandb_run_name=exp_name,
84
+ wandb_resume_id=wandb_resume_id,
85
+ last_per_steps=last_per_steps,
86
+ )
87
+
88
+ train_dataset = load_dataset(dataset_name, tokenizer, mel_spec_kwargs=mel_spec_kwargs)
89
+ trainer.train(
90
+ train_dataset,
91
+ resumable_with_seed=666, # seed for shuffling dataset
92
+ )
93
+
94
+
95
+ if __name__ == "__main__":
96
+ main()
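The first step of `main()` decides what to hand to `get_tokenizer`: the custom vocab path when `tokenizer` is "custom", otherwise the dataset name. A minimal sketch of that decision as a pure function is given below. `resolve_vocab_source` is a hypothetical name, and the explicit check for a missing path is an added assumption, not something the script does.

```python
def resolve_vocab_source(tokenizer, tokenizer_path, dataset_name):
    """Return the value to pass to get_tokenizer as its first argument."""
    if tokenizer == "custom":
        if tokenizer_path is None:
            # Assumption: fail early instead of passing None downstream.
            raise ValueError("tokenizer='custom' requires tokenizer_path (a vocab.txt file)")
        return tokenizer_path
    return dataset_name


if __name__ == "__main__":
    print(resolve_vocab_source("pinyin", None, "Emilia_ZH_EN"))
    print(resolve_vocab_source("custom", "data/my_speak/vocab.txt", "Emilia_ZH_EN"))
```

Taking the settings as parameters keeps the function independent of module-level names, which also sidesteps Python's rule that a name assigned anywhere in a function body is local to that function.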