たれぱんのびぼーろく

My notes; mostly biology and programming


Close to Human Quality TTS with Transformer (2018)
An end-to-end neural TTS system built from a Transformer (phoneme-to-spectrogram model) plus a WaveNet vocoder.
TTS state of the art in 2018 (MOS: this system 4.39 vs. human 4.44).

Overview

The Tacotron-style encoder and decoder 1 are replaced with Transformer blocks.
input sequence (of words or phonemes) => (encoder) => projected into a semantic space as a sequence of encoder hidden states

hidden states => (decoder) => mel frames

Traditional encoder & decoder: RNNs (LSTM, GRU)
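The swap from RNN to Transformer boils down to replacing recurrence with scaled dot-product self-attention over the whole sequence. A minimal NumPy sketch of that core operation (the sequence length and model dimension below are made-up toy values, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core of the Transformer
    layers that replace the RNN encoder/decoder."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T_q, T_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (T_q, d_v)

# Toy example: 6 phoneme hidden states, model dimension 8
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(H, H, H)  # self-attention: Q = K = V
print(out.shape)  # (6, 8)
```

Because every position attends to every other position directly, there is no sequential dependency like in an LSTM, which is what enables the parallel training the paper relies on.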

(Figure: model architecture, from "Close to Human Quality TTS with Transformer")

Architecture

Phoneme generation

Text-to-Phoneme Converter
This could be done with a neural network, but when training data is scarce the coverage becomes incomplete, so the paper uses a rule-based system.
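The idea of a rule system can be sketched as a lexicon of word-level exceptions plus fallback letter rules. This is purely illustrative (the lexicon entries and letter rules below are toy examples I made up; the paper does not publish its rule set):

```python
# Toy rule-based text-to-phoneme converter (illustrative only).
LEXICON = {                       # hand-written word-level exceptions
    "the": ["DH", "AH"],
    "close": ["K", "L", "OW", "S"],
}
LETTER_RULES = {                  # fallback letter-to-phoneme rules
    "a": "AE", "b": "B", "c": "K", "d": "D",
    "e": "EH", "o": "AA", "t": "T",
}

def text_to_phonemes(text):
    phonemes = []
    for word in text.lower().split():
        if word in LEXICON:        # exact lexicon match takes priority
            phonemes.extend(LEXICON[word])
        else:                      # otherwise apply per-letter rules
            phonemes.extend(LETTER_RULES.get(ch, ch.upper()) for ch in word)
    return phonemes

print(text_to_phonemes("the cab"))  # ['DH', 'AH', 'K', 'AE', 'B']
```

Unlike a learned G2P model, a rule system like this never produces garbage for words outside its training distribution, which matches the coverage argument above.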

Training notes

A large batch size is essential.
A single Tesla P100 (16 GB or 12 GB of memory) is not enough for training to succeed 2.
The V100 also has 16 GB, so as of 2019 multi-GPU training is effectively mandatory.
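One standard workaround when a single GPU cannot hold a large batch is gradient accumulation: run several micro-batches, sum their gradients, then take one optimizer step. The paper does not mention this technique; the following is a generic sketch on a toy linear-regression problem, not the paper's training code:

```python
import numpy as np

# Gradient accumulation: emulate a large effective batch (64) using
# small micro-batches (16), as one might on a memory-limited GPU.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)
micro_batch, accum_steps, lr = 16, 4, 0.1   # 16 * 4 = effective batch 64
for epoch in range(200):
    grad = np.zeros_like(w)
    for i in range(accum_steps):            # forward/backward per micro-batch
        xb = X[i * micro_batch:(i + 1) * micro_batch]
        yb = y[i * micro_batch:(i + 1) * micro_batch]
        err = xb @ w - yb
        grad += xb.T @ err / len(X)         # accumulate, do not update yet
    w -= lr * grad                          # one optimizer step per full batch
print(np.round(w, 2))  # approaches true_w
```

The update is mathematically identical to a full-batch step, at the cost of more forward/backward passes; it trades compute time for memory, whereas the authors chose multiple GPUs instead.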


  1. "The end-to-end neural TTS models contain two components, an encoder and a decoder."

  2. "We try training on a single GPU, but the procedures are quite unstable or even failed, by which synthesized audios were like raving and incomprehensible."