---
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
tags:
- jellybox
---
# F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

[![python](https://img.shields.io/badge/Python-3.10-brightgreen)](https://github.com/SWivid/F5-TTS)
[![arXiv](https://img.shields.io/badge/arXiv-2410.06885-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.06885)
[![demo](https://img.shields.io/badge/GitHub-Demo%20page-orange.svg)](https://swivid.github.io/F5-TTS/)
[![hfspace](https://img.shields.io/badge/🤗-Space%20demo-yellow)](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)
[![msspace](https://img.shields.io/badge/🤖-Space%20demo-blue)](https://modelscope.cn/studios/modelscope/E2-F5-TTS)
[![lab](https://img.shields.io/badge/X--LANCE-Lab-grey?labelColor=lightgrey)](https://x-lance.sjtu.edu.cn/)
[![lab](https://img.shields.io/badge/Peng%20Cheng-Lab-grey?labelColor=lightgrey)](https://www.pcl.ac.cn)

**F5-TTS**: Diffusion Transformer with ConvNeXt V2, offering faster training and inference.

**E2 TTS**: Flat-UNet Transformer, the closest reproduction of the [paper](https://arxiv.org/abs/2406.18009).

**Sway Sampling**: An inference-time flow-step sampling strategy that greatly improves performance.

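The idea behind sway sampling can be sketched in a few lines. This assumes the sway function $f(u; s) = u + s\,(\cos(\tfrac{\pi}{2}u) - 1 + u)$ from the F5-TTS paper, with $s < 0$ bending uniform steps toward $t = 0$; treat it as an illustration, since the repository's implementation may differ in coefficient range and conventions:

```python
import math

def sway_sample(u: float, s: float = -1.0) -> float:
    # Sway function f(u; s) = u + s * (cos(pi/2 * u) - 1 + u).
    # s < 0 bends uniform steps u in [0, 1] toward t = 0, spending
    # more ODE steps early in the flow; endpoints 0 and 1 map to themselves.
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)

# Map 9 uniform flow steps onto sway-sampled steps
steps = [i / 8 for i in range(9)]
swayed = [sway_sample(u) for u in steps]
```

With `s = -1` the midpoint `u = 0.5` maps to `1 - cos(pi/4) ≈ 0.29`, i.e. the ODE solver spends more of its step budget early in the flow, which the paper reports helps quality at low NFE.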
### Thanks to all the contributors!

## News
- **2025/03/12**: 🔥 F5-TTS v1 base model with better training and inference performance. [A few demos](https://swivid.github.io/F5-TTS_updates).
- **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), [🟣 Wisemodel](https://wisemodel.cn/models/SJTU_X-LANCE/F5-TTS_Emilia-ZH-EN).

## Installation

### Create a separate environment if needed

```bash
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n f5-tts python=3.10
conda activate f5-tts
```

### Install PyTorch matching your device

<details>
<summary>NVIDIA GPU</summary>

> ```bash
> # Install pytorch matching your CUDA version, e.g.
> pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124
> ```

</details>

<details>
<summary>AMD GPU</summary>

> ```bash
> # Install pytorch matching your ROCm version (Linux only), e.g.
> pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
> ```

</details>

<details>
<summary>Intel GPU</summary>

> ```bash
> # Install pytorch matching your XPU version, e.g.
> # Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
> pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu
>
> # Intel GPU support is also available through IPEX (Intel® Extension for PyTorch)
> # IPEX does not require the Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit
> # See: https://pytorch-extension.intel.com/installation?request=platform
> ```

</details>

<details>
<summary>Apple Silicon</summary>

> ```bash
> # Install the stable pytorch, e.g.
> pip install torch torchaudio
> ```

</details>

### Then choose one of the following:

> ### 1. As a pip package (if just for inference)
>
> ```bash
> pip install f5-tts
> ```
>
> ### 2. Local editable install (if you also plan to train or finetune)
>
> ```bash
> git clone https://github.com/SWivid/F5-TTS.git
> cd F5-TTS
> # git submodule update --init --recursive # (optional, if you need bigvgan)
> pip install -e .
> ```

### Docker usage is also available

```bash
# Build from the Dockerfile
docker build -t f5tts:v1 .

# Run from the GitHub Container Registry
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/swivid/f5-tts:main

# Quickstart if you just want to run the web interface (not the CLI)
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/swivid/f5-tts:main f5-tts_infer-gradio --host 0.0.0.0
```


## Inference

### 1. Gradio App

Currently supported features:

- Basic TTS with chunked inference
- Multi-style / multi-speaker generation
- Voice chat powered by Qwen2.5-3B-Instruct
- [Custom inference with more language support](src/f5_tts/infer/SHARED.md)

```bash
# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share
```

<details>
<summary>NVIDIA device docker compose file example</summary>

```yaml
services:
  f5-tts:
    image: ghcr.io/swivid/f5-tts:main
    ports:
      - "7860:7860"
    environment:
      GRADIO_SERVER_PORT: 7860
    entrypoint: ["f5-tts_infer-gradio", "--port", "7860", "--host", "0.0.0.0"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  f5-tts:
    driver: local
```

</details>

### 2. CLI Inference

```bash
# Run with flags
# Leaving --ref_text "" will have an ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli --model F5TTS_v1_Base \
--ref_audio "provide_prompt_wav_path_here.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."

# Run with default settings from src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml

# Multi-voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
```

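For reference, a `custom.toml` passed via `-c` follows roughly this shape. The keys below simply mirror the CLI flags shown above and are illustrative only; the authoritative schema is the shipped `src/f5_tts/infer/examples/basic/basic.toml`:

```toml
# Illustrative sketch of a custom.toml, mirroring the CLI flags above;
# see src/f5_tts/infer/examples/basic/basic.toml for the authoritative keys
model = "F5TTS_v1_Base"
ref_audio = "provide_prompt_wav_path_here.wav"
ref_text = "The content, subtitle or transcription of reference audio."
gen_text = "Some text you want TTS model generate for you."
```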
### 3. More instructions

- For better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer).
- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is very useful; try searching for keywords related to the problem you encountered first. If no answer is found, feel free to open an issue.

## Training

### 1. With Hugging Face Accelerate

Refer to the [training & finetuning guidance](src/f5_tts/train) for best practices.

### 2. With Gradio App

```bash
# Quick start with the Gradio web interface
f5-tts_finetune-gradio
```

Read the [training & finetuning guidance](src/f5_tts/train) for more instructions.

## [Evaluation](src/f5_tts/eval)


## Development

Use pre-commit to ensure code quality (it will run linters and formatters automatically):

```bash
pip install pre-commit
pre-commit install
```

When making a pull request, before each commit, run:

```bash
pre-commit run --all-files
```

Note: Some model components have linting exceptions for E722 to accommodate tensor notation.

## Acknowledgements

- [E2-TTS](https://arxiv.org/abs/2406.18009) brilliant work, simple and effective
- [Emilia](https://arxiv.org/abs/2407.05361), [WenetSpeech4TTS](https://arxiv.org/abs/2406.05763), [LibriTTS](https://arxiv.org/abs/1904.02882), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) valuable datasets
- [lucidrains](https://github.com/lucidrains) initial CFM structure, with [bfs18](https://github.com/bfs18) for discussion
- [SD3](https://arxiv.org/abs/2403.03206) & [Hugging Face diffusers](https://github.com/huggingface/diffusers) DiT and MMDiT code structure
- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) as vocoders
- [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech), [SpeechMOS](https://github.com/tarepan/SpeechMOS) for evaluation tools
- [ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner) for the speech edit test
- [mrfakename](https://x.com/realmrfakename) for the Hugging Face Space demo
- [f5-tts-mlx](https://github.com/lucasnewman/f5-tts-mlx/tree/main) implementation with the MLX framework by [Lucas Newman](https://github.com/lucasnewman)
- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX) ONNX Runtime version by [DakeQQ](https://github.com/DakeQQ)

241
+ ## Citation
242
+ If our work and codebase is useful for you, please cite as:
243
+ ```
244
+ @article{chen-etal-2024-f5tts,
245
+ title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
246
+ author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
247
+ journal={arXiv preprint arXiv:2410.06885},
248
+ year={2024},
249
+ }
250
+ ```
## License

Our code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.