
Tacotron 2

Claim: "If you want to do TTS, instead of conditioning WaveNet directly on complicated linguistic features, go with 'a good char2spec model + a spec2wave WaveNet'."

Overview

An attention-based seq-to-seq model generates a mel spectrogram from the character sequence, and WaveNet generates the waveform.

The LSTM encoder swallows the whole sentence and its final outputs become z; attention then builds a context vector from the encoder outputs across all time steps.
The LSTM decoder generates mel-spec frames autoregressively, using (z, context).
The AR input does pass through a pre-net and the whole output spec through a post-net, but in essence that is the picture.
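A toy, shape-level illustration of that flow for a single decoder step (NumPy). Plain dot-product attention stands in for the real location-sensitive attention, the pre-net and post-net are skipped, and every name here is mine, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
T_text, enc_dim, mel_dim = 12, 512, 80

enc_out = rng.standard_normal((T_text, enc_dim))   # encoder outputs for every character
prev_frame = np.zeros(mel_dim)                     # previous mel frame (all-zero GO frame at t=0)
query = rng.standard_normal(enc_dim)               # decoder state acting as the attention query

scores = enc_out @ query                           # one score per character position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                           # attention weights over the text (sum to 1)
context = weights @ enc_out                        # context = weighted sum of all encoder outputs

decoder_input = np.concatenate([prev_frame, context])   # what the AR decoder consumes this step
print(decoder_input.shape)                               # (592,) == mel_dim + enc_dim
```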

Because it is seq-to-seq, there is no constraint that the number of characters match the number of spec frames.
Put differently, the model also has to learn the per-phoneme duration assignment internally.
A character sequence carries no phoneme durations, but it does carry ordering, so exploiting this structure should pay off (it is efficient if you can enforce a structure where characters cannot be swapped or skipped).
=> location-sensitive attention

Model

  • encoder
    • input: learned_embedding512(character) [1]
    • Conv: [Conv1d_k5s1c512-BN-ReLU]x3 [2]
    • RNN: 1-layer biLSTM256 [3]
  • attention
  • decoder
    • pre-net
      • input: o_main[t-1] [4]
      • net: [segFC256-ReLU]x2 [5]
    • main
      • input: concat(prenet(o_main[t-1]), context_enc) [6]
      • net: 2-layer uniLSTM1024 [7] + segFC80(concat(o_LSTM, context_enc)) [8]
    • post-net
      • input: concat(all predicted frames) == (80, Time) as 1d time-series w/ 80 channels [9]
      • net: Res[(Conv1d_k5s1c512-BN-tanh)x4 - Conv1d_k5s1c80-BN] [10] (sketched right after this list)
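As a concreteness check, a minimal PyTorch sketch of the post-net line above. Layer sizes follow footnotes [9]-[10]; dropout and other training details are omitted, and the class name is mine:

```python
import torch
import torch.nn as nn

class PostNet(nn.Module):
    """5-layer Conv1d over the (80, T) mel spectrogram, added back as a residual."""
    def __init__(self, mel_dim=80, channels=512, kernel=5, n_layers=5):
        super().__init__()
        pad = (kernel - 1) // 2
        layers, in_ch = [], mel_dim
        for _ in range(n_layers - 1):
            layers += [nn.Conv1d(in_ch, channels, kernel, padding=pad),
                       nn.BatchNorm1d(channels),
                       nn.Tanh()]                       # tanh on all but the final layer
            in_ch = channels
        layers += [nn.Conv1d(in_ch, mel_dim, kernel, padding=pad),
                   nn.BatchNorm1d(mel_dim)]             # final layer: back to 80 mel channels
        self.net = nn.Sequential(*layers)

    def forward(self, mel):                             # mel: (batch, 80, T)
        return mel + self.net(mel)                      # residual refinement

mel = torch.randn(2, 80, 100)
print(PostNet()(mel).shape)                             # torch.Size([2, 80, 100])
```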

Features & interpretation

Encoder

In the encoder, the embeddings are convolved along the time axis and then fed into the bi-directional LSTM.
One reading: the conv stack pulls in information from a few neighboring characters [11], and the LSTM adds longer-range dependencies.
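A minimal PyTorch sketch of this encoder, assuming the sizes in footnotes [1]-[3] (the vocabulary size and the absence of dropout are my simplifications):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """512-dim char embedding -> 3x (Conv1d k5 c512 - BN - ReLU) -> 1-layer biLSTM (256/direction)."""
    def __init__(self, n_chars=100, emb_dim=512, lstm_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(n_chars, emb_dim)
        convs = []
        for _ in range(3):
            convs += [nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                      nn.BatchNorm1d(emb_dim),
                      nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                  # (batch, T_text) int64
        x = self.embedding(char_ids)              # (batch, T_text, 512)
        x = self.convs(x.transpose(1, 2))         # conv over time: (batch, 512, T_text)
        out, _ = self.lstm(x.transpose(1, 2))     # (batch, T_text, 512) = 2 directions x 256
        return out

enc_out = Encoder()(torch.randint(0, 100, (2, 17)))
print(enc_out.shape)                              # torch.Size([2, 17, 512])
```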

Attention

We use the location-sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature.
Attention probabilities are computed after projecting inputs and location features to 128-dimensional hidden representations.
Location features are computed using 32 1-D convolution filters of length 31.

This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder.
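A PyTorch sketch of this attention, reading the quoted numbers literally (128-dim projections, 32 location filters of length 31, additive scoring). Class, argument, and variable names are mine; masking and initialization details are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Additive attention whose extra feature is a conv over cumulative attention weights."""
    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128,
                 n_filters=32, kernel=31):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_proj = nn.Linear(memory_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, n_filters, kernel, padding=(kernel - 1) // 2)
        self.location_proj = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cum_weights):
        # query: (B, 1024) decoder state; memory: (B, T, 512) encoder outputs;
        # cum_weights: (B, T) sum of attention weights from previous decoder steps.
        loc = self.location_conv(cum_weights.unsqueeze(1))            # (B, 32, T)
        loc = self.location_proj(loc.transpose(1, 2))                 # (B, T, 128)
        energy = self.v(torch.tanh(
            self.query_proj(query).unsqueeze(1)                       # (B, 1, 128)
            + self.memory_proj(memory)                                # (B, T, 128)
            + loc)).squeeze(-1)                                       # (B, T)
        weights = F.softmax(energy, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # (B, 512)
        return context, weights

attn = LocationSensitiveAttention()
ctx, w = attn(torch.randn(2, 1024), torch.randn(2, 17, 512), torch.zeros(2, 17))
print(ctx.shape, w.shape)   # torch.Size([2, 512]) torch.Size([2, 17])
```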

Decoder

The takeaway is roughly: "a unidirectional autoregressive RNN is enough to generate excellent mel spectrograms."

The paper says the pre-net acting as an information bottleneck was essential for learning attention [12], but that has not fully clicked for me yet.
An information bottleneck, even though the 80-dim output is processed by two 256-unit FC layers...?
Maybe the intent behind the word "bottleneck" is that the decoder should not become self-sufficient through AR and should instead properly consult the conditioning when synthesizing...?
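For reference, one decoder step sketched in PyTorch with the sizes from footnotes [4]-[8]. Stop-token prediction and dropout are omitted, and all names are mine:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """prev frame -> pre-net -> concat with context -> 2x LSTMCell(1024) -> next 80-dim frame."""
    def __init__(self, mel_dim=80, prenet_dim=256, ctx_dim=512, lstm_dim=1024):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(mel_dim, prenet_dim), nn.ReLU(),
                                    nn.Linear(prenet_dim, prenet_dim), nn.ReLU())
        self.lstm1 = nn.LSTMCell(prenet_dim + ctx_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)
        self.frame_proj = nn.Linear(lstm_dim + ctx_dim, mel_dim)

    def forward(self, prev_frame, context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)   # (B, 256 + 512)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        frame = self.frame_proj(torch.cat([h2, context], dim=-1))   # (B, 80)
        return frame, ((h1, c1), (h2, c2))

B = 2
step = DecoderStep()
zeros = lambda d: (torch.zeros(B, d), torch.zeros(B, d))
frame, states = step(torch.zeros(B, 80), torch.randn(B, 512), (zeros(1024), zeros(1024)))
print(frame.shape)   # torch.Size([2, 80])
```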

Results

demo

Interesting behaviors

  • Surprisingly robust to English spelling mistakes (demo available)
  • Picky about intonation and stress
    • A single "," can change the intonation (demo available)
    • If the training data includes "read uppercase words with emphasis", uppercase input does get emphasized (demo available)
    • Questions properly change the intonation of the whole sentence (demo available)

Other papers by the 🐙 team

Info on the web

There is not much information in Japanese, sadly.
Google https://www.google.com/search?q=%22Tacotron+2%22&lr=lang_ja @2022-06-28


  1. “Input characters are represented using a learned 512-dimensional character embedding” from original paper

  2. “a stack of 3 convolutional layers each containing 512 filters with shape 5 × 1 … followed by batch normalization … and ReLU activations.” from original paper

  3. “The output of the final convolutional layer is passed into a single bi-directional … LSTM … layer containing 512 units (256 in each direction)” from original paper

  4. “The prediction from the previous time step is first passed through a small pre-net” from original paper

  5. “a small pre-net containing 2 fully connected layers of 256 hidden ReLU units.” from original paper

  6. “The prenet output and attention context vector are concatenated and passed through a stack of 2 uni-directional LSTM layers with 1024 units.” from original paper

  7. “2 uni-directional LSTM layers with 1024 units.” from original paper

  8. “The concatenation of the LSTM output and the attention context vector is projected through a linear transform to predict the target spectrogram frame.” from original paper

  9. “the predicted mel spectrogram is passed through a 5-layer convolutional post-net which predicts a residual to add to the prediction” from original paper

  10. “Each post-net layer is comprised of 512 filters with shape 5 × 1 with batch normalization, followed by tanh activations on all but the final layer” from original paper

  11. “these convolutional layers model longer-term context (e.g., N-grams) in the input character sequence.” from original paper

  12. “We found that the pre-net acting as an information bottleneck was essential for learning attention” from original paper