---
license: mit
tags:
- vocoder
- audio
- speech
- tts
---

# Model Card for HiFormer Vocoder

[HuggingFace 🤗 - Repository](https://huggingface.co/Respair/HiFormer_Vocoder)

**DDP is very unstable; please use the single-GPU training script.** If you still want to use DDP, I suggest uncommenting the grad-clipping lines; that should help a lot.
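For reference, here is a minimal sketch of what such grad-clipping lines typically look like in a PyTorch training step; a dummy model stands in for the actual generator, and the variable names are placeholders, not the training script's own:

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the real generator/optimizer in the training script.
model = nn.Linear(128, 128)
optim = torch.optim.AdamW(model.parameters(), lr=2e-4)

x = torch.randn(8, 128)
loss = model(x).pow(2).mean()

optim.zero_grad()
loss.backward()
# Clip the gradient norm before stepping the optimizer; this is the kind of
# line to uncomment in the training script to stabilize DDP runs.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1000.0)
optim.step()
```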

This vocoder is a combination of [HiFTnet](https://github.com/yl4579/HiFTNet) and [RingFormer](https://github.com/seongho608/RingFormer). It supports Ring Attention, a Conformer-based architecture, Neural Source Filtering, and more.
This repository is experimental; expect some bugs and some hardcoded params.

The default setting is 44.1 kHz with 128 mel bins. If you want to change it to 24 kHz, copy the config from HiFTnet (make sure to copy its pitch extractor as well, both the model and the checkpoint), then change 128 to 80 at line 384 of models.py, and uncomment the `multiscale_subband_cfg` for the 24 kHz version.
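For illustration, here is a sketch of the two mel settings using torchaudio. The FFT/hop/window values below are placeholders (take the real ones from config_v1.json and the HiFTnet config); the `n_mels` change is the one mirrored at line 384 of models.py:

```python
import torchaudio

# Default HiFormer setting: 44.1 kHz audio, 128 mel bins.
# n_fft / hop_length / win_length here are illustrative placeholders.
mel_44k = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100, n_fft=2048, hop_length=512, win_length=2048, n_mels=128
)

# 24 kHz HiFTnet-style setting: the key change is n_mels=128 -> 80.
mel_24k = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=2048, hop_length=300, win_length=1200, n_mels=80
)
```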

Huge thanks to [Johnathan Duering](https://github.com/duerig) for his help. I mostly implemented this based on his [StyleTTS2 fork](https://github.com/duerig/StyleTTS2/tree/main).

**This is highly experimental, and I have not conducted a full training session. I have only verified that the loss goes down and the eval samples sound reasonable after ~10K steps of minimal training.**

____________________________________________________________________________________


**NOTE**: I have uploaded two checkpoints so far. One is a 24 kHz HiFormer checkpoint, trained for roughly 117K steps on LibriTTS (360 + 100) and 40 hours of other English datasets.

The other checkpoint is HiFTNet at 44.1 kHz, trained on more than 1,100 hours of multilingual data that I sourced privately; it includes Arabic, Persian, Japanese, English, and Russian. This one is trained for ~100K steps.
Ideally, both should be trained for up to 1M steps, so I strongly recommend further fine-tuning on your own downstream task until I pre-train these for more steps.
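
If you want to fetch one of these checkpoints programmatically, something along these lines should work; the filename below is hypothetical, so check the repository's file listing for the real names:

```python
import torch
from huggingface_hub import hf_hub_download

# "g_00117000" is a hypothetical filename; look up the actual checkpoint
# names on the Respair/HiFormer_Vocoder repository page.
ckpt_path = hf_hub_download(
    repo_id="Respair/HiFormer_Vocoder",
    filename="g_00117000",
)

# HiFi-GAN-family training scripts usually save plain dicts, often with the
# generator weights nested under a key such as "generator".
state = torch.load(ckpt_path, map_location="cpu")
print(state.keys() if isinstance(state, dict) else type(state))
```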

## Pre-requisites
1. Python >= 3.10
2. Clone this repository:
```bash
git clone https://github.com/Respaired/HiFormer_Vocoder
cd HiFormer_Vocoder/Ringformer
```
3. Install Python requirements:
```bash
pip install -r requirements.txt
```

## Training
```bash
CUDA_VISIBLE_DEVICES=0 python train_single_gpu.py --config config_v1.json --[args]
```
For the F0 model training, please refer to [yl4579/PitchExtractor](https://github.com/yl4579/PitchExtractor). This repo includes an F0 model pre-trained on a mixture of multilingual data for the previously mentioned configuration. To quote HiFTnet's author: "Still, you may want to train your own F0 model for the best performance, particularly for noisy or non-speech data, as we found that F0 estimation accuracy is essential for the vocoder performance."
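For orientation, loading the bundled F0 model in HiFTnet-style code usually looks roughly like the following; the import path, checkpoint path, and state-dict key are assumptions carried over from HiFTnet/StyleTTS2 and may differ in this repo:

```python
import torch

# Assumption: the pitch extractor follows HiFTnet's JDCNet interface from
# yl4579/PitchExtractor; adjust the import to this repository's layout.
from Utils.JDC.model import JDCNet

F0_model = JDCNet(num_class=1, seq_len=192)
# The path and the "net" key are assumptions based on HiFTnet's loading code.
params = torch.load("Utils/JDC/bst.t7", map_location="cpu")["net"]
F0_model.load_state_dict(params)
F0_model.eval()
```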

## Inference
Please refer to the notebook [inference.ipynb](https://github.com/Respaired/HiFormer_Vocoder/blob/main/RingFormer/inference.ipynb) for details.
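
As a rough outline of what the notebook does, mel-to-waveform inference for HiFi-GAN-family vocoders generally follows the pattern below; treat this as a sketch and defer to inference.ipynb for the actual generator signature (NSF-based models like this one also involve an F0 track):

```python
import torch

@torch.no_grad()
def vocode(generator, mel):
    """Sketch: mel (1, n_mels, frames) -> waveform (1, samples).

    Assumes a HiFi-GAN-style forward pass; HiFormer/HiFTnet may compute or
    expect an F0 track as well -- check inference.ipynb for the exact call.
    """
    generator.eval()
    audio = generator(mel)
    return audio.squeeze(1).cpu()
```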