TTS
Transformer-based parameter estimator + Parallel WaveGAN vocoder
ref: Transformer TTS & FastSpeech
Base: FastSpeech
- i/o: phoneme sequences + accent labels -> mel-spectrograms
- model: six-layer encoder and six-layer decoder (each layer uses 8-head multi-head attention)
- opt: RAdam optimizer
- lr: warmup learning rate scheduling (1.0 at start)
- dynamic batch size (average 64) strategy
- training: 1000 epochs
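The warmup scheduling noted above is commonly the Noam/Transformer schedule (LR rises linearly during warmup, then decays as step^-0.5); a minimal sketch under that assumption — `d_model`, `warmup_steps`, and `scale` values here are illustrative placeholders, not taken from the source:

```python
def warmup_lr(step: int, d_model: int = 384, warmup_steps: int = 4000,
              scale: float = 1.0) -> float:
    """Noam-style warmup schedule (assumed form; hyperparameters are placeholders).

    Linear warmup for `warmup_steps` steps, then inverse-square-root decay.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5,
                                         step * warmup_steps ** -1.5)
```

The peak LR occurs exactly at `step == warmup_steps`, where the two terms of the `min` coincide.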
accent is given as an external input for pitch-accent languages (e.g., Japanese) [29]
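One common way to feed accent as an external input is to embed the accent label per phoneme and add it to the phoneme embedding before the encoder; a minimal sketch of that idea — the lookup tables, dimensions, and function name are illustrative assumptions, not the source's implementation:

```python
def embed_with_accent(phonemes, accents, phoneme_table, accent_table):
    """Combine per-phoneme accent labels with phoneme embeddings by
    element-wise addition (a sketch; tables/dims are illustrative)."""
    assert len(phonemes) == len(accents), "one accent label per phoneme"
    return [
        [p + a for p, a in zip(phoneme_table[ph], accent_table[ac])]
        for ph, ac in zip(phonemes, accents)
    ]
```

In practice the summed sequence would be what the six-layer encoder consumes; concatenation instead of addition is an equally plausible variant.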