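A paper from the Fairseq team that releases public implementations of various speech synthesis models and evaluates them.[1]
Speech is synthesized with character/phoneme/unit-to-mel models plus a vocoder, then evaluated objectively with Fairseq's usual suite of metrics.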
Models
- Models implemented by Fairseq S2
Method
Mel generation
- Input: characters / g2pE phonemes / Phonemizer phonemes / HuBERT units
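
To make the difference between these input types concrete, here is a minimal sketch of how one utterance becomes a token-ID sequence under each option. Everything in it is an assumption for illustration: `TOY_LEXICON` is a hand-written stand-in for a real G2P front end (it does not reproduce the g2pE or Phonemizer API or output format), the helper names are hypothetical, and the unit IDs are made up, since real HuBERT units are derived from audio rather than text.

```python
from typing import Dict, List

# Hypothetical toy lexicon standing in for a real G2P front end.
# ARPAbet-style symbols are used purely for illustration.
TOY_LEXICON: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY1", "CH"],
    "synthesis": ["S", "IH1", "N", "TH", "AH0", "S", "AH0", "S"],
}


def to_characters(text: str) -> List[str]:
    """Character input: one token per character, spaces included."""
    return list(text.lower())


def to_phonemes(text: str) -> List[str]:
    """Phoneme input: look each word up in the toy lexicon."""
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.extend(TOY_LEXICON[word])
    return tokens


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign integer IDs in first-seen order; 0 is reserved for padding."""
    vocab: Dict[str, int] = {"<pad>": 0}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to the integer IDs a text-to-mel model would consume."""
    return [vocab[tok] for tok in tokens]


text = "speech synthesis"
chars = to_characters(text)
phones = to_phonemes(text)

# Discrete units are already integer IDs; these values are made up.
units: List[int] = [17, 17, 4, 92, 92, 8]

char_ids = encode(chars, build_vocab([chars]))
phone_ids = encode(phones, build_vocab([phones]))
```

Whichever front end is chosen, the mel-generation model only ever sees an integer sequence, so switching input type mainly changes the vocabulary and the sequence length.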
1. “We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants.”
2. “FAIRSEQ S2 adds state-of-the-art text-to-spectrogram prediction models, Tacotron 2 … and Transformer … which are AR with encoder-decoder model architecture. For … non-AR modeling, we provide FastSpeech 2” (from the paper)
3. “For spectrogram-to-waveform conversion (vocoding), FAIRSEQ S2 has a built-in Griffin-Lim … vocoder for fast model-free generation. It also provides examples … WaveGlow … and HiFiGAN” (from the paper)