---
license: apache-2.0
language:
- en
- ca
---

# Wavenext-encodec

## Model Details

### Model Description

WaveNeXt is a modification of Vocos in which the final ISTFT layer is replaced by a trainable linear layer that directly predicts speech waveform samples.

This version of WaveNeXt uses EnCodec tokens as input features and is trained with the EnCodec bandwidths 1.5, 3.0, 6.0, and 12.0 kbps.
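
As a rough illustration of the idea (a hypothetical sketch, not the repo's actual implementation), the backbone's frame-level hidden states are projected by a single trainable linear layer to `hop_length` waveform samples per frame, and the frames are then concatenated into a waveform:

```python
import torch
import torch.nn as nn

class LinearWaveHead(nn.Module):
    """WaveNeXt-style head: a trainable linear projection replaces the
    ISTFT head of Vocos, mapping each frame directly to waveform samples."""

    def __init__(self, hidden_dim: int = 512, hop_length: int = 256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hop_length)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, hidden_dim) -> (batch, frames * hop_length)
        return self.proj(x).flatten(start_dim=1)

head = LinearWaveHead()
features = torch.randn(1, 200, 512)  # 200 frames of backbone features
waveform = head(features)            # (1, 51200) predicted samples
```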

## Intended uses and limitations

The model is intended to serve as a vocoder that synthesizes audio waveforms from EnCodec discrete codes. It is trained to generate speech, so if it is used on other audio domains it may not produce high-quality samples.

## Usage

### Installation

To use WaveNeXt in inference mode only, install it with:

```bash
pip install git+https://github.com/langtech-bsc/wavenext_pytorch
```

### Reconstruct audio from EnCodec tokens

You need to provide a `bandwidth_id` that selects the bandwidth embedding, indexing into the list [1.5, 3.0, 6.0, 12.0] kbps.

```python
import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("BSC-LT/wavenext-encodec")

audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))  # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2])  # index 2 -> 6 kbps

audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```
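
To vocode real audio instead of random tokens, the codes can be obtained with the `encodec` package (a sketch assuming `pip install encodec`; at 6 kbps the 24 kHz EnCodec model yields 8 codebooks):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # matches bandwidth_id = 2 above

wav, sr = torchaudio.load(YOUR_AUDIO_FILE)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([codebook for codebook, _ in encoded_frames], dim=-1)[0]  # (8, frames)

features = vocos.codes_to_features(codes)
audio = vocos.decode(features, bandwidth_id=torch.tensor([2]))
```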

Copy-synthesis from a file (`vocos` and `bandwidth_id` as defined above):

```python
import torchaudio

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)  # EnCodec operates at 24 kHz
y_hat = vocos(y, bandwidth_id=bandwidth_id)
```
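
The reconstructed waveform can then be written to disk with `torchaudio` (the output filename here is arbitrary):

```python
import torchaudio

# y_hat from the snippet above: shape (1, num_samples) at 24 kHz
torchaudio.save("reconstruction.wav", y_hat, sample_rate=24000)
```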

## Training Details

### Training Data

The model was trained on four speech datasets:

| Dataset    | Language | Hours |
|------------|----------|-------|
| LibriTTS-R | en       | 585   |
| LJSpeech   | en       | 24    |
| Festcat    | ca       | 22    |
| OpenSLR69  | ca       | 5     |

### Training Procedure

The model was trained for 1M steps (183 epochs) with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 1e-4, as reflected in the hyperparameters below.

#### Training Hyperparameters

* initial_learning_rate: 1e-4
* scheduler: cosine, without warmup or restarts
* mel_loss_coeff: 45
* mrd_loss_coeff: 0.1
* batch_size: 16
* num_samples: 16384
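
A minimal sketch of how the optimizer settings above could be instantiated in PyTorch, assuming AdamW and `CosineAnnealingLR` (the actual training code in the repo may differ):

```python
import torch

model = torch.nn.Linear(512, 256)  # placeholder for the WaveNeXt generator

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # initial_learning_rate
# Cosine decay across the full 1M training steps, no warmup or restarts.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000_000)

for step in range(3):  # per training step: optimize, then decay the LR
    optimizer.zero_grad()
    loss = model(torch.randn(16, 512)).pow(2).mean()  # batch_size = 16
    loss.backward()
    optimizer.step()
    scheduler.step()
```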

## Evaluation

Evaluation was done using the metrics from the original repo. After 183 epochs we achieve:

* val_loss: 3.79
* f1_score: 0.94
* mel_loss: 0.27
* periodicity_loss: 0.128
* pesq_score: 3.27
* pitch_loss: 31.33
* utmos_score: 3.20

## Citation

If this code contributes to your research, please cite the work:

```bibtex
@inproceedings{10389765,
  author={Okamoto, Takuma and Yamashita, Haruki and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  title={WaveNeXt: ConvNeXt-Based Fast Neural Vocoder Without ISTFT Layer},
  year={2023},
  pages={1-8},
  doi={10.1109/ASRU57964.2023.10389765}
}

@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
```

## Additional information

### Author
The Language Technologies Unit of the Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).