Tarepan's Notes

My personal memos, mostly about biology and programming

Paper review: Liu (2020) Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

The VCC2020 T10 model[1] (top score). An ASR-based recognition-synthesis (rec-syn) system that achieves MOS 4.0 & similarity 3.6.

Models

ASR

SI-ASR (same as N10?)

Conversion model

Encoder-decoder model (≠ S2S).

Encoder

LSTM -> 2x time-compressing concat[2] -> LSTM
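A minimal sketch of this pyramid-style time compression, assuming PyTorch; only the concat-two-consecutive-steps-and-halve-the-time-axis step follows footnote 2, while the layer count and hidden sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PyramidEncoder(nn.Module):
    """BiLSTM -> concat consecutive frame pairs (2x time compression) -> BiLSTM."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # After concatenating two consecutive hidden frames, the feature size doubles.
        self.lstm2 = nn.LSTM(2 * 2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                    # x: (batch, T, feat_dim)
        h, _ = self.lstm1(x)                 # (batch, T, 2*hidden)
        if h.size(1) % 2:                    # drop the last frame if T is odd
            h = h[:, :-1]
        b, t, d = h.shape
        h = h.reshape(b, t // 2, 2 * d)      # concat pairs -> halve the time axis
        h, _ = self.lstm2(h)                 # (batch, T//2, 2*hidden)
        return h
```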

Decoder

AR-RNN with attention | without attention
Outputs 2 frames per step[3] and reconstructs a signal of the same length[4].
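A minimal sketch of the decoder loop, written for the no-attention variant the paper reports as comparable (conditioning each step directly on the corresponding encoder hidden feature); the LSTMCell, dimensions, and zero-initialized previous frames are my assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class ARDecoder(nn.Module):
    """Autoregressive decoder emitting 2 acoustic frames per step."""
    def __init__(self, enc_dim=256, mel_dim=80, hidden=512):
        super().__init__()
        self.cell = nn.LSTMCell(enc_dim + 2 * mel_dim, hidden)
        self.proj = nn.Linear(hidden, 2 * mel_dim)   # 2 frames per decoder step
        self.mel_dim = mel_dim

    def forward(self, enc):                           # enc: (batch, T_enc, enc_dim)
        b, t_enc, _ = enc.shape
        h = enc.new_zeros(b, self.cell.hidden_size)
        c = enc.new_zeros(b, self.cell.hidden_size)
        prev = enc.new_zeros(b, 2 * self.mel_dim)     # previously generated 2 frames
        outs = []
        for i in range(t_enc):                        # one decoder step per encoder frame
            h, c = self.cell(torch.cat([enc[:, i], prev], dim=-1), (h, c))
            prev = self.proj(h)
            outs.append(prev.view(b, 2, self.mel_dim))
        return torch.cat(outs, dim=1)                 # (batch, 2 * T_enc, mel_dim)
```

With 2x time compression in the encoder and 2 frames per decoder step, the acoustic output ends up the same length as the input, consistent with footnote 4.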

Duration

Goal: higher similarity => when source and target speakers have clearly different speech rates, similarity drops without correction[5] => ad hoc adjustment of the content-feature length at inference time[6]

Since this is independent of model training, it is tuned by ear like a craftsman (rough estimate from the dataset-average speech-rate ratio[7] -> generate with the trained model while varying the coefficient, listen, and fine-tune[8]).
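A minimal sketch of that duration adjustment: a rough stretch coefficient from the average speech-rate ratio, then interpolation of the content feature along the time axis (footnotes 6–8). The function names, the direction of the ratio, and the use of linear torch interpolation are my assumptions; the paper describes the procedure but not the code:

```python
import torch
import torch.nn.functional as F

def rough_coefficient(src_rate: float, tgt_rate: float) -> float:
    """Dataset-average speech rates (e.g. phones/sec) -> time-stretch factor.
    A slower target speaker (tgt_rate < src_rate) gives a factor > 1, i.e. longer features."""
    return src_rate / tgt_rate

def adjust_duration(content: torch.Tensor, coeff: float) -> torch.Tensor:
    """content: (T, D) content-related feature of the source speech."""
    x = content.t().unsqueeze(0)                           # (1, D, T) for interpolate
    new_len = max(1, round(content.size(0) * coeff))
    y = F.interpolate(x, size=new_len, mode="linear", align_corners=False)
    return y.squeeze(0).t()                                # (new_len, D)

# Start from the rough estimate, then nudge coeff up/down while listening.
feat = torch.randn(200, 256)                               # placeholder content feature
coeff = rough_coefficient(src_rate=4.2, tgt_rate=3.8)      # placeholder speech rates
stretched = adjust_duration(feat, coeff)
```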

Results

Attention

Even with attention, the alignment is nearly diagonal and attends only to local frames[9].
No duration conversion dependent on speaker ID was observed[10].
Removing attention entirely made no difference[11].

(Since time compression is only about 2x and the output is resynthesized at equal length, isn't it rather natural for the model to learn not to alter durations?)
(In phoneme-to-spectrogram TTS, my impression is that attention in S2S models heavily controls phoneme durations.)

Duration

Both naturalness and similarity improve by about +0.1 from 3.7.
(Interesting that naturalness improves too. Perhaps because the input distribution becomes closer to what was seen during training?)


  1. “Our system is denoted as T10.” from original paper

  2. “the hidden cells at consecutive two steps of a layer in PBiLSTM are concatenated and fed into next layer to generate a new hidden cell” from original paper

  3. “At each decoder step, two frames of acoustic features are predicted.” from original paper

  4. “It should be noticed that the lengths of encoder and decoder outputs are equal.” from original paper

  5. “when source and target speakers have obviously different speech rates, the conversion similarity will be decreased.” from original paper

  6. “we perform duration adjustment in conversion stage by interpolating the content-related feature of source speech.” from original paper

  7. “First, a rough coefficient value was estimated by calculating the speech rate ratio between target and source speakers.” from original paper

  8. “Then, … The coefficient value would be modified if speech rate difference was perceived between converted and target speech. This process repeated until no obvious speech rate difference was perceived.” from original paper

  9. “The obtained attention alignment was nearly diagonal and only neighboring frames were focused on.” from original paper

  10. “Although attention was used in our model, we observed no duration conversion was learned.” from original paper

  11. “Furthermore, our internal experiment showed that discarding attention module made no difference to conversion performance. We conducted an experiment by conditioning on the encoder hidden features directly. Comparable performance was obtained.” from original paper