Spaces:

AlexK-PL
/

Tacotron2_GST_eng

Sleeping

App Files Files Community

Tacotron2_GST_eng / melgan /README.md

AlexK-PL

Upload 72 files

c61c48a almost 2 years ago

preview code

raw

history blame

3.12 kB

	# MelGAN
	Unofficial PyTorch implementation of [MelGAN vocoder](https://arxiv.org/abs/1910.06711)

	## Key Features

	- MelGAN is lighter, faster, and better at generalizing to unseen speakers than [WaveGlow](https://github.com/NVIDIA/waveglow).
	- This repository use identical mel-spectrogram function from [NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2), so this can be directly used to convert output from NVIDIA's tacotron2 into raw-audio.
	- Pretrained model on LJSpeech-1.1 via [PyTorch Hub](https://pytorch.org/hub).

	![](./assets/gd.png)

	## Prerequisites

	Tested on Python 3.6
	```bash
	pip install -r requirements.txt
	```

	## Prepare Dataset

	- Download dataset for training. This can be any wav files with sample rate 22050Hz. (e.g. LJSpeech was used in paper)
	- preprocess: `python preprocess.py -c config/default.yaml -d [data's root path]`
	- Edit configuration `yaml` file

	## Train & Tensorboard

	- `python trainer.py -c [config yaml file] -n [name of the run]`
	- `cp config/default.yaml config/config.yaml` and then edit `config.yaml`
	- Write down the root path of train/validation files to 2nd/3rd line.
	- Each path should contain pairs of `.wav` with corresponding (preprocessed) `.mel` file.
	- The data loader parses list of files within the path recursively.
	- `tensorboard --logdir logs/`

	## Pretrained model

	Try with Google Colab: TODO

	```python
	import torch
	vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')
	vocoder.eval()
	mel = torch.randn(1, 80, 234) # use your own mel-spectrogram here

	if torch.cuda.is_available():
	vocoder = vocoder.cuda()
	mel = mel.cuda()

	with torch.no_grad():
	audio = vocoder.inference(mel)
	```

	## Inference

	- `python inference.py -p [checkpoint path] -i [input mel path]`

	## Results

	See audio samples at: http://swpark.me/melgan/.
	Model was trained at V100 GPU for 14 days using LJSpeech-1.1.

	![](./assets/lj-tensorboard-v0.3-alpha.png)


	## Implementation Authors

	- [Seungwon Park](http://swpark.me) @ MINDsLab Inc. ([email protected], [email protected])
	- Myunchul Joe @ MINDsLab Inc.
	- [Rishikesh](https://github.com/rishikksh20) @ DeepSync Technologies Pvt Ltd.

	## License

	BSD 3-Clause License.

	- [utils/stft.py](./utils/stft.py) by Prem Seetharaman (BSD 3-Clause License)
	- [datasets/mel2samp.py](./datasets/mel2samp.py) from https://github.com/NVIDIA/waveglow (BSD 3-Clause License)
	- [utils/hparams.py](./utils/hparams.py) from https://github.com/HarryVolek/PyTorch_Speaker_Verification (No License specified)

	## Useful resources

	- [How to Train a GAN? Tips and tricks to make GANs work](https://github.com/soumith/ganhacks) by Soumith Chintala
	- [Official MelGAN implementation by original authors](https://github.com/descriptinc/melgan-neurips)
	- [Reproduction of MelGAN - NeurIPS 2019 Reproducibility Challenge (Ablation Track)](https://openreview.net/pdf?id=9jTbNbBNw0) by Yifei Zhao, Yichao Yang, and Yang Gao
	- "replacing the average pooling layer with max pooling layer and replacing reflection padding with replication padding improves the performance significantly, while combining them produces worse results"