Tacotron 2 - たれぱんのびぼーろく

Claim: "If you want to do TTS, 'a good char2spec model + a spec2wave WaveNet' beats conditioning WaveNet directly on complex features."

Overview

An attention-based seq2seq model generates a mel spectrogram from a character sequence, and WaveNet then generates the waveform.

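A biLSTM encoder reads the whole sentence, and attention builds a context vector from the encoder outputs at every time step.
An LSTM decoder uses (previous frame, context) to generate mel-spec frames autoregressively.
The AR input goes through a Pre-Net and a Post-Net is applied to the whole output spectrogram, but this is the essence.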
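
The data flow above can be sketched as a plain loop. This is only an illustration of the autoregressive structure: `decode` and all the callables passed to it are hypothetical stand-ins, not the paper's networks, and the fixed `n_frames` replaces the stop-token prediction the real model uses to decide when to stop.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]

def decode(
    encoder_outputs: Sequence[Sequence[float]],
    prenet: Callable[[Frame], List[float]],
    attend: Callable[[Sequence[Sequence[float]], object], List[float]],
    step: Callable[[List[float], List[float], object], Tuple[Frame, object]],
    postnet: Callable[[List[Frame]], List[Frame]],
    n_frames: int,
    n_mels: int = 80,
) -> List[Frame]:
    """Autoregressive decoding: frame t is computed from frame t-1 and a context."""
    prev: Frame = [0.0] * n_mels  # all-zero first input frame (assumption)
    state = None                  # recurrent state, opaque to this sketch
    frames: List[Frame] = []
    for _ in range(n_frames):
        context = attend(encoder_outputs, state)           # summary of the text
        frame, state = step(prenet(prev), context, state)  # one decoder step
        frames.append(frame)
        prev = frame                                       # feed the output back in
    # The post-net sees the whole spectrogram and predicts a residual to add.
    residual = postnet(frames)
    return [[x + r for x, r in zip(f, res)] for f, res in zip(frames, residual)]

# Toy stand-ins, only to exercise the data flow.
enc = [[1.0, 2.0], [3.0, 4.0]]
mels = decode(
    enc,
    prenet=lambda prev: prev[:2],
    attend=lambda outs, state: [sum(col) / len(outs) for col in zip(*outs)],
    step=lambda p, c, s: ([p[0] + c[0] + 1.0] * 80, s),
    postnet=lambda frames: [[0.0] * len(f) for f in frames],
    n_frames=3,
)
```

With these toy functions each frame is the previous frame plus a constant, which makes the dependence on the previous output easy to see.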

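Because it is seq2seq, the character length and the spec frame length are not required to match.
Put the other way, the model has to learn the phoneme duration assignment internally as well.
A character sequence carries no durations but it does carry ordering, so exploiting that structure pays off (enforcing a structure in which characters cannot be swapped or skipped makes learning more efficient).
=> location-sensitive attention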

Model

encoder
- input: learned_embedding512(character)¹
- Conv: [Conv1d_k5s1c512-BN-ReLU]x3 ²
- RNN: 1-layer biLSTM256 ³
attention
decoder
- pre-net
  - input: o_main/t-1⁴
  - net: [segFC256-ReLU]x2⁵
- main
  - input: concat(prenet(o_main/t-1), context_enc)⁶
  - net: 2-layer uniLSTM1024⁷ + segFC80(concat(o_LSTM, context_enc)⁸)
- post-net
  - input: concat(all o_main/t) == (80, Time) as 1d time-series w/ 80 channels⁹
  - net: Res[(Conv1d_k5s1c512-BN-tanh)x4 - Conv1d_k5s1c80-BN].¹⁰
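
As a sanity check on the layer list above, the sizes can be traced through the model with simple bookkeeping. This is a rough sketch: `trace_shapes` and its key names are hypothetical, and it assumes the convolutions keep the sequence length (stride 1 with "same" padding) and that the context vector has the same width as the encoder output.

```python
def trace_shapes(n_chars: int, n_frames: int) -> dict:
    """Return the tensor shape at each stage for one utterance (no batch axis)."""
    emb = 512            # character embedding size
    enc_dim = 2 * 256    # biLSTM: 256 units per direction
    prenet_dim = 256     # width of both pre-net FC layers
    lstm_dim = 1024      # decoder LSTM units
    n_mels = 80          # mel channels per frame
    post_ch = 512        # post-net hidden channels
    return {
        "embedding": (n_chars, emb),
        "encoder_conv": (n_chars, 512),     # length preserved (assumption)
        "encoder_out": (n_chars, enc_dim),
        "context": (enc_dim,),              # weighted sum of encoder outputs
        "prenet_out": (prenet_dim,),
        "decoder_lstm_in": (prenet_dim + enc_dim,),
        "decoder_lstm_out": (lstm_dim,),
        "projection_in": (lstm_dim + enc_dim,),
        "frame": (n_mels,),
        "postnet_in": (n_mels, n_frames),
        "postnet_hidden": (post_ch, n_frames),
        "postnet_out": (n_mels, n_frames),  # residual added to postnet_in
    }

shapes = trace_shapes(n_chars=20, n_frames=100)
```

The two concatenations are where the context enters: 256 + 512 = 768 values go into the decoder LSTM, and 1024 + 512 = 1536 values go into the projection to 80 mel channels.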

Characteristics and interpretation

Encoder

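The encoder feeds embeddings that have been convolved along the time axis into a biLSTM.
Interpretation: the conv stack gathers information from a few neighboring characters¹¹, and the LSTM adds long-range dependencies.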
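
How many characters "a few neighboring characters" amounts to can be worked out from the kernel sizes. The helper below is a small sketch (the name `receptive_field` is made up here) using the standard receptive-field recurrence for stacked, undilated convolutions.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input positions) of stacked 1-D convolutions."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the view by (k - 1) input steps
        jump *= s             # stride multiplies the spacing between outputs
    return rf

# Three layers with kernel size 5 and stride 1, as in the encoder conv stack.
encoder_rf = receptive_field([5, 5, 5])
```

For the encoder this gives 1 + 3 * 4 = 13, so each position entering the biLSTM has seen itself plus 6 characters on either side.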

Attention

We use the location-sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature.
Attention probabilities are computed after projecting inputs and location features to 128-dimensional hidden representations.
Location features are computed using 32 1-D convolution filters of length 31.

This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder.
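
The quoted description can be sketched in plain Python as follows. Treat it as a rough illustration under assumptions, not the paper's implementation: the function and parameter names, the random weights, the toy sizes `T`, `ENC_DIM` and `QUERY_DIM`, and the omission of bias terms are all choices made here. Only the 128-dimensional projection and the 32 filters of length 31 come from the text.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def conv1d_same(signal, filters):
    """Zero-padded 1-D cross-correlation; one feature vector per position."""
    k = len(filters[0])
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [[sum(f[i] * padded[j + i] for i in range(k)) for f in filters]
            for j in range(len(signal))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def location_sensitive_attention(query, memory, cum_weights, p):
    """Additive attention with location features from cumulative weights."""
    loc = conv1d_same(cum_weights, p["F"])  # (T, 32) location features
    q_proj = matvec(p["W"], query)          # decoder state -> 128 dims
    energies = []
    for h_j, f_j in zip(memory, loc):
        m_proj = matvec(p["V"], h_j)        # encoder output -> 128 dims
        l_proj = matvec(p["U"], f_j)        # location features -> 128 dims
        hidden = [math.tanh(a + b + c) for a, b, c in zip(q_proj, m_proj, l_proj)]
        energies.append(sum(v * x for v, x in zip(p["v"], hidden)))
    weights = softmax(energies)
    context = [sum(w * h[d] for w, h in zip(weights, memory))
               for d in range(len(memory[0]))]
    return context, weights

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

T, ENC_DIM, QUERY_DIM, ATT_DIM = 6, 4, 8, 128  # toy sizes except ATT_DIM
params = {
    "W": rand(ATT_DIM, QUERY_DIM),
    "V": rand(ATT_DIM, ENC_DIM),
    "U": rand(ATT_DIM, 32),
    "F": rand(32, 31),          # 32 filters of length 31
    "v": rand(1, ATT_DIM)[0],
}
memory = rand(T, ENC_DIM)
query = rand(1, QUERY_DIM)[0]
cum = [0.0] * T
for _ in range(3):              # three decoder steps with a fixed toy query
    context, weights = location_sensitive_attention(query, memory, cum, params)
    cum = [c + w for c, w in zip(cum, weights)]  # accumulate past alignments
```

Because the accumulated weights enter the energy through the convolution, the scorer can tell which input positions have already been attended to, which is the hook that lets a trained model prefer moving forward.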

Decoder

The gist: "a unidirectional autoregressive RNN can generate excellent mel-specs."

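The paper says the Pre-Net was essential for learning attention¹², but it has not clicked for me yet.
An information bottleneck, even though the 80-dim output is processed by two 256-unit FC layers...?
Perhaps the intent of "bottleneck" is that synthesis cannot be self-contained in the AR path and has to actually consult the conditioning...?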

Results

demo

Interesting behaviors

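Oddly robust to English misspellings (demo available)
Particular about intonation and accent
- A single "," can change the intonation (demo available)
- If the training data includes "read capitalized words with emphasis", capitalized input is properly emphasized (demo available)
- For questions, the intonation of the whole sentence changes appropriately (demo available)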

Other papers from the 🐙 team

Information on the web

There is not much information in Japanese, which is sad.
Google https://www.google.com/search?q=%22Tacotron+2%22&lr=lang_ja @2022-06-28

ESPNet: one-click inference, distributed weights, and training, ja models included. Fantastic!
- pre-trained / training conf / InferenceNotebook
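DeNA: ja demo. Excellent quality
シロワニ-san (needs no introduction)
- en reproduction: succeeded in the NVIDIA reproduction experiment. Demo available. Includes details such as the number of epochs
- ja training: reportedly failed
- ja transfer learning: succeeded, demo available. Does the voice end up flat-ish after all?
- ja transfer learning 2: en->ja->specific ja speaker transfer learning. Great success. Cute
- Improvement of 2: part of the model changed, vocoder changed. Cute
Lento-san: en->ja transfer learning. Ayanami voice; the voice quality is a close match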
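npaka-san
- Japanese Single Speaker Speech transfer learning: en->ja transfer learning. No demo
- JSUT transfer learning: en->ja (JSUT) transfer learning. Synthesizes remarkably cleanly
- つくよみちゃん transfer learning: double transfer, en->ja->つくよみちゃん. Demo at an early step count. The voice quality does come through
intage tech: en training. Has a funny demo where Japanese text is forced in and the result sounds like a foreigner speaking Japanese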
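ttslearn Chapter 9: model-code walkthrough notebook by r9y9-san. Really good.
Qiita@mana_b: pretrained en model -> ja transfer learning. Demo available. Speaks in a mumbly way?
Qiita@tset-tset-tset: ja training with both NVIDIA and Mozilla. Demo available. Fairly noisy, possibly because of the vocoder
Qiita@omuram: ヒカキン speech synthesis. Looks like ja transfer learning. Demo available. Sounds like ヒカキン, with slight noise
Qiita@sujoyu: synthesis + VC. ja training. Demo available
tosaka-san / details: en->ja transfer learning. Demo available
Tacotron 2で音声合成: appears to be TTS for a language called Maya K'iche'; interesting
[西野, et al. (2021). 韻律を考慮した end-to-end 方式に基づく自発音声合成.]: succeeded in training on Japanese...? No demo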

¹ “Input characters are represented using a learned 512-dimensional character embedding” from original paper
² “a stack of 3 convolutional layers each containing 512 filters with shape 5 × 1 … followed by batch normalization … and ReLU activations.” from original paper
³ “The output of the final convolutional layer is passed into a single bi-directional … LSTM … layer containing 512 units (256 in each direction)” from original paper
⁴ “The prediction from the previous time step is first passed through a small pre-net” from original paper
⁵ “a small pre-net containing 2 fully connected layers of 256 hidden ReLU units.” from original paper
⁶ “The prenet output and attention context vector are concatenated and passed through a stack of 2 uni-directional LSTM layers with 1024 units.” from original paper
⁷ “2 uni-directional LSTM layers with 1024 units.” from original paper
⁸ “The concatenation of the LSTM output and the attention context vector is projected through a linear transform to predict the target spectrogram frame.” from original paper
⁹ “the predicted mel spectrogram is passed through a 5-layer convolutional post-net which predicts a residual to add to the prediction” from original paper
¹⁰ “Each post-net layer is comprised of 512 filters with shape 5 × 1 with batch normalization, followed by tanh activations on all but the final layer” from original paper
¹¹ “these convolutional layers model longer-term context (e.g., N-grams) in the input character sequence.” from original paper
¹² “We found that the pre-net acting as an information bottleneck was essential for learning attention” from original paper