たれぱんのびぼーろく (Tarepan's Memos)

My personal notes, probably mostly about biology and programming

Paper walkthrough: Wang (2021) fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit

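A paper from the Fairseq team that releases open implementations of a range of speech synthesis models and evaluates them [1].
Speech synthesis is done with a character/phoneme/unit-to-Mel model followed by a vocoder, and the output is evaluated objectively with the set of automatic metrics Fairseq is known for.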

Models

  • Models implemented by fairseq S^2
    • Text-to-Mel: Tacotron 2, Transformer TTS, FastSpeech 2 [2]
    • Mel-to-Wave: Griffin-Lim, WaveGlow, HiFiGAN [3]
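
The two-stage structure above (a Text-to-Mel model followed by a Mel-to-Wave vocoder) means any acoustic model can in principle be paired with any vocoder. As a rough illustration only, the sketch below enumerates those pairings. The `STAGES` dict and `enumerate_pipelines` function are hypothetical names made up for this note; they are not part of the fairseq API, and the sketch does not claim the paper evaluated every pairing.

```python
from itertools import product

# Model names are the ones listed in this post. The dict layout is a
# hypothetical illustration, not a fairseq data structure.
STAGES = {
    "text_to_mel": ["Tacotron 2", "Transformer TTS", "FastSpeech 2"],
    "mel_to_wave": ["Griffin-Lim", "WaveGlow", "HiFiGAN"],
}


def enumerate_pipelines(stages):
    """Return every (text-to-mel, mel-to-wave) pairing as a list of tuples."""
    return list(product(stages["text_to_mel"], stages["mel_to_wave"]))


pipelines = enumerate_pipelines(STAGES)
for acoustic_model, vocoder in pipelines:
    # Each pipeline passes a mel-spectrogram from stage 1 to stage 2.
    print(f"{acoustic_model} -> mel-spectrogram -> {vocoder} -> waveform")
```

Keeping the mel-spectrogram as the only interface between the two stages is what lets the acoustic model and the vocoder be swapped independently.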

Method

Mel generation

  • Input: characters / g2pE phonemes / Phonemizer phonemes / HuBERT units

  1. “We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants.”

  2. “FAIRSEQ S2 adds state-of-the-art text-to-spectrogram prediction models, Tacotron 2 … and Transformer … which are AR with encoder-decoder model architecture. For … non-AR modeling, we provide FastSpeech 2” from the paper

  3. “For spectrogram-to-waveform conversion (vocoding), FAIRSEQ S2 has a built-in Griffin-Lim … vocoder for fast model-free generation. It also provides examples … WaveGlow … and HiFiGAN” from the paper