---
license: apache-2.0
language:
- en
- ca
---

# Wavenext-encodec

## Model Details

### Model Description

WaveNeXt is a modification of Vocos in which the final ISTFT layer is replaced by a trainable linear layer that directly predicts speech waveform samples.

This version of WaveNeXt uses EnCodec tokens as input features and is trained with the EnCodec bandwidths 1.5, 3.0, 6.0, and 12.0 kbps.
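
As a rough illustration of the idea (a hypothetical sketch, not the repo's actual implementation), the backbone's frame-level hidden states are projected by a single trainable linear layer to `hop_length` waveform samples per frame, and the frames are then concatenated into a waveform:

```python
import torch
import torch.nn as nn

class LinearWaveHead(nn.Module):
    """WaveNeXt-style head: a trainable linear projection replaces the
    ISTFT head of Vocos, mapping each frame directly to waveform samples."""

    def __init__(self, hidden_dim: int = 512, hop_length: int = 256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hop_length)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, hidden_dim) -> (batch, frames * hop_length)
        return self.proj(x).flatten(start_dim=1)

head = LinearWaveHead()
features = torch.randn(1, 200, 512)  # 200 frames of backbone features
waveform = head(features)            # (1, 51200) predicted samples
```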

## Intended uses and limitations

The model is intended to serve as a vocoder that synthesizes audio waveforms from EnCodec discrete codes. It is trained to generate speech, so if it is used on other audio domains it may not produce high-quality samples.

## Usage

### Installation

To use WaveNeXt in inference mode only, install it with:

```bash
pip install git+https://github.com/langtech-bsc/wavenext_pytorch
```

### Reconstruct audio from EnCodec tokens

You need to provide a `bandwidth_id` that selects the bandwidth embedding, indexing into the list [1.5, 3.0, 6.0, 12.0] kbps.

```python
import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("BSC-LT/wavenext-encodec")

audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))  # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2])  # index 2 -> 6 kbps

audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```
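
To vocode real audio instead of random tokens, the codes can be obtained with the `encodec` package (a sketch assuming `pip install encodec`; at 6 kbps the 24 kHz EnCodec model yields 8 codebooks):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # matches bandwidth_id = 2 above

wav, sr = torchaudio.load(YOUR_AUDIO_FILE)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([codebook for codebook, _ in encoded_frames], dim=-1)[0]  # (8, frames)

features = vocos.codes_to_features(codes)
audio = vocos.decode(features, bandwidth_id=torch.tensor([2]))
```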

Copy-synthesis from a file (`vocos` and `bandwidth_id` as defined above):

```python
import torchaudio

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)  # EnCodec operates at 24 kHz
y_hat = vocos(y, bandwidth_id=bandwidth_id)
```
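
The reconstructed waveform can then be written to disk with `torchaudio` (the output filename here is arbitrary):

```python
import torchaudio

# y_hat from the snippet above: shape (1, num_samples) at 24 kHz
torchaudio.save("reconstruction.wav", y_hat, sample_rate=24000)
```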

## Training Details

### Training Data

The model was trained on four speech datasets:

| Dataset    | Language | Hours |
|------------|----------|-------|
| LibriTTS-R | en       | 585   |
| LJSpeech   | en       | 24    |
| Festcat    | ca       | 22    |
| OpenSLR69  | ca       | 5     |

### Training Procedure

The model was trained for 1M steps (183 epochs) with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 1e-4, as reflected in the hyperparameters below.

#### Training Hyperparameters

* initial_learning_rate: 1e-4
* scheduler: cosine, without warmup or restarts
* mel_loss_coeff: 45
* mrd_loss_coeff: 0.1
* batch_size: 16
* num_samples: 16384
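
A minimal sketch of how the optimizer settings above could be instantiated in PyTorch, assuming AdamW and `CosineAnnealingLR` (the actual training code in the repo may differ):

```python
import torch

model = torch.nn.Linear(512, 256)  # placeholder for the WaveNeXt generator

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # initial_learning_rate
# Cosine decay across the full 1M training steps, no warmup or restarts.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000_000)

for step in range(3):  # per training step: optimize, then decay the LR
    optimizer.zero_grad()
    loss = model(torch.randn(16, 512)).pow(2).mean()  # batch_size = 16
    loss.backward()
    optimizer.step()
    scheduler.step()
```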

## Evaluation

Evaluation was done using the metrics from the original repo. After 183 epochs we achieve:

* val_loss: 3.79
* f1_score: 0.94
* mel_loss: 0.27
* periodicity_loss: 0.128
* pesq_score: 3.27
* pitch_loss: 31.33
* utmos_score: 3.20

## Citation

If this code contributes to your research, please cite the work:

```bibtex
@inproceedings{10389765,
  author={Okamoto, Takuma and Yamashita, Haruki and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  title={WaveNeXt: ConvNeXt-Based Fast Neural Vocoder Without ISTFT Layer},
  year={2023},
  pages={1-8},
  doi={10.1109/ASRU57964.2023.10389765}
}

@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
```

## Additional information

### Author
The Language Technologies Unit of the Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).