Paper Explained: Wang (2021) fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit

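A paper by the Fairseq team that releases open implementations of a variety of speech synthesis models and evaluates them¹.
It performs speech synthesis with character/phoneme/unit-to-Mel models followed by a vocoder, then evaluates the results objectively with the set of automatic metrics that Fairseq is known for.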

Models

Models implemented by Fairseq S²
- Text-to-Mel: Tacotron 2, Transformer TTS, FastSpeech 2²
- Mel-to-Wave: Griffin-Lim, WaveGlow, HiFiGAN³
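
The two-stage structure in the list above (a text-to-Mel model followed by a Mel-to-wave vocoder) can be sketched as a simple function composition. This is only an illustrative sketch, not the toolkit's API: `synthesize`, `dummy_text_to_mel`, and `dummy_vocoder` are hypothetical stand-ins, and the 80 mel bins and hop length of 256 are assumed values chosen for the example.

```python
from typing import Callable, List

Mel = List[List[float]]  # frames x mel bins
Wave = List[float]       # waveform samples

# Model names as listed in this article, grouped by stage.
TEXT_TO_MEL = ("Tacotron 2", "Transformer TTS", "FastSpeech 2")
MEL_TO_WAVE = ("Griffin-Lim", "WaveGlow", "HiFiGAN")


def synthesize(
    text: str,
    text_to_mel: Callable[[str], Mel],
    vocoder: Callable[[Mel], Wave],
) -> Wave:
    """Stage 1 predicts a mel spectrogram; stage 2 turns it into a waveform."""
    return vocoder(text_to_mel(text))


def dummy_text_to_mel(text: str, n_mels: int = 80) -> Mel:
    # Hypothetical stand-in: one silent frame per input character.
    # A real model would predict frame count and values from the text.
    return [[0.0] * n_mels for _ in text]


def dummy_vocoder(mel: Mel, hop_length: int = 256) -> Wave:
    # Hypothetical stand-in: hop_length silent samples per mel frame.
    return [0.0] * (len(mel) * hop_length)


wave = synthesize("hello", dummy_text_to_mel, dummy_vocoder)
print(len(wave))  # 5 frames * 256 samples per frame
```

Because the two stages only meet at the mel spectrogram, either stand-in can be swapped independently of the other, which is the point of splitting the list into Text-to-Mel and Mel-to-Wave.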

¹ “We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants.”
² “FAIRSEQ S² adds state-of-the-art text-to-spectrogram prediction models, Tacotron 2 … and Transformer … which are AR with encoder-decoder model architecture. For … non-AR modeling, we provide FastSpeech 2” from the paper
³ “For spectrogram-to-waveform conversion (vocoding), FAIRSEQ S² has a built-in Griffin-Lim … vocoder for fast model-free generation. It also provides examples … WaveGlow … and HiFiGAN” from the paper