VAE Tacotron-2:

Unofficial implementation of "Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis"

Repository Structure:

Tacotron-2
├── datasets
├── LJSpeech-1.1    (0)
│   └── wavs
├── logs-Tacotron   (2)
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   └── wavs
├── papers
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output (3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
└── training_data   (1)
    ├── audio
    └── mels

The tree above shows the current state of the repository. The numbers in parentheses indicate the pipeline step that produces each folder: the dataset (0), preprocessing output (1), training logs (2), and synthesis output (3).

Requirements:

First, you need to have Python 3.5 installed along with TensorFlow v1.6.

Next, you can install the requirements:

pip install -r requirements.txt

or, if pip points to Python 2 on your system:

pip3 install -r requirements.txt
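
To quickly sanity-check the environment, a minimal version check like the one below can be run (illustrative only; it simply asserts the versions stated above):

import sys
import tensorflow as tf

# This repo targets Python 3.5 and TensorFlow 1.6 (see Requirements above).
assert sys.version_info[:2] == (3, 5), sys.version_info
assert tf.__version__.startswith('1.6'), tf.__version__
print('Environment OK')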

Dataset:

This repository has been tested on the LJSpeech dataset, which contains almost 24 hours of labeled recordings from a single female speaker. The dataset is expected to live in the LJSpeech-1.1 folder at the repository root (see the tree above).

Preprocessing:

Before running the following steps, please make sure you are inside the Tacotron-2 folder:

cd Tacotron-2

Preprocessing can then be started using:

python preprocess.py

or

python3 preprocess.py

The dataset can be chosen using the --dataset argument. The default is LJSpeech.
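
Preprocessing writes the extracted features under training_data/ (audio and mels, per the tree above). As a quick sanity check, a mel file can be loaded with numpy; the filename below is hypothetical, so substitute any .npy file that preprocessing actually produced:

import numpy as np

# Hypothetical filename; use any file that exists in training_data/mels.
mel = np.load('training_data/mels/mel-LJ001-0001.npy')
print(mel.shape)  # expected to be (frames, num_mels)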

Training:

The feature prediction model can be trained using:

python train.py --model='Tacotron'

or

python3 train.py --model='Tacotron'
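
Checkpoints, plots, and audio samples are saved under logs-Tacotron (see the tree above). Assuming train.py also writes TensorBoard summaries there, training can be monitored with:

tensorboard --logdir logs-Tacotron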

Synthesis:

There are three types of mel-spectrogram synthesis for the spectrogram prediction network (Tacotron), matching the eval, gta, and natural folders under tacotron_output in the tree above. For example, evaluation synthesis with a reference audio for style transfer:

python synthesize.py --model='Tacotron' --mode='eval' --reference_audio='ref_1.wav'

or

python3 synthesize.py --model='Tacotron' --mode='eval' --reference_audio='ref_1.wav'
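
For context on the --reference_audio flag: following the paper, style control comes from a VAE reference encoder that maps the reference mel spectrogram to a Gaussian posterior, from which a latent style vector is sampled via the standard reparameterization trick. The numpy sketch below illustrates only that sampling step (names and the latent size are illustrative, not taken from this repo):

import numpy as np

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, eps ~ N(0, I): the standard VAE
    # reparameterization trick used to sample the latent style vector.
    eps = np.random.randn(*mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy example with an arbitrarily chosen 16-dimensional latent.
mu, log_var = np.zeros(16), np.zeros(16)
z = reparameterize(mu, log_var)
print(z.shape)  # (16,)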

Pretrained Model and Samples:

TODO. Samples claimed in the research paper are available at: http://home.ustc.edu.cn/~zyj008/ICASSP2019

References and Resources:

Work in progress