Paper review: Polyak (2021) Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Paper

A challenge: can a neural vocoder do speech synthesis / conversion / compression from neural acoustic features (content, F0, speaker)?

Representation learning and vocoder training are completely separated (representation models: pretraining -> fix). The vocoder is then trained on the outputs of the fixed models.
Three content representation models are tried: CPC, HuBERT, and VQ-VAE.
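
A minimal sketch of that two-stage split, with placeholder modules standing in for the actual encoder and vocoder; only the freeze-then-train flow is taken from the description above.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: the representation model is pretrained elsewhere and frozen;
# only the vocoder is optimized on its outputs. Both modules are stand-ins,
# not the paper's actual networks.
encoder = torch.nn.Conv1d(1, 256, kernel_size=320, stride=320)            # stand-in for CPC/HuBERT/VQ-VAE
vocoder = torch.nn.ConvTranspose1d(256, 1, kernel_size=320, stride=320)   # stand-in for HiFi-GAN

encoder.requires_grad_(False).eval()                  # pretraining -> fix
opt = torch.optim.Adam(vocoder.parameters(), lr=2e-4)

wave = torch.randn(8, 1, 16000)                       # dummy 16 kHz batch
with torch.no_grad():
    feats = encoder(wave)                             # fixed representations
recon = vocoder(feats)
loss = F.l1_loss(recon, wave)                         # stand-in reconstruction loss
loss.backward()
opt.step()
```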

Formal summary

Showed that speech synthesis from acoustic units (unit-to-speech), voice conversion, and compression are possible.
With an emphasis on controllability, three separate acoustic units (content, F0 unit, speaker) are taken as input, and speech is synthesized with HiFi-GAN, manipulating the F0 unit for prosody control and the speaker identity for VC.

Demo

link

Background

Research question: "which discrete SSL representation gives high quality, high controllability, and high compression on the generation side?"

Acoustic units have mostly been evaluated through ASR.
Speech resynthesis is shown as a demo, but the properties of the units (what information they contain (e.g. phoneme, F0, speaker), their disentanglement, their contribution to synthesis quality, etc.) have not been studied much.

If the goal of synthesis is pure reconstruction, it does not matter whether content and F0 are entangled.
But if you want to use the units for pitch manipulation or voice conversion, they have to be disentangled (pitch controllability is required).

Proposed method

A content/F0/speaker-to-wave synthesis model.
Training it serves to verify whether various SSL units hold sufficient content in a disentangled1 2 3 4 and highly compressed form.

S2u (speech-to-unit)

  • content encoder E_c: waveform::R^T -> discrete representation seq::R^(n×T')
    • CPC | HuBERT | VQ-VAE
    • units z_c::{0, 1, ..., K}^L: k-means over the continuous E_c output, or the direct discrete E_c output
  • F0 encoder E_F0: waveform::R^T -> (YAAPT) -> F0 series p::?^(T') -> discrete representation z_F0::{0, 1, ..., K'}^L'
  • speaker identity encoder E_spk: ? -> single global representation z_spk::R^256: d-vector
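
A shape-level sketch of this stage. The encoder bodies are random placeholders; only the interfaces (discrete content units, discrete F0 units, one global 256-dim speaker vector) follow the list above, with hop sizes taken from the Model section below.

```python
import torch

# Shape-level sketch of the S2u stage; encoder bodies are placeholders.
K, K_F0, D_SPK = 100, 20, 256          # content codebook, F0 codebook, d-vector dim

def content_encoder(wave: torch.Tensor, hop: int = 320) -> torch.Tensor:
    """wave (T,) -> content units z_c (L,) in {0..K-1}; hop=320 matches HuBERT, CPC/VQ-VAE use 160."""
    return torch.randint(0, K, (wave.shape[0] // hop,))      # placeholder

def f0_encoder(wave: torch.Tensor, hop: int = 1280) -> torch.Tensor:
    """wave (T,) -> F0 units z_F0 (L',) in {0..K'-1} (YAAPT -> VQ-VAE)."""
    return torch.randint(0, K_F0, (wave.shape[0] // hop,))   # placeholder

def speaker_encoder(wave: torch.Tensor) -> torch.Tensor:
    """wave (T,) -> one global speaker embedding z_spk (256,) (d-vector)."""
    return torch.randn(D_SPK)                                # placeholder

wave = torch.randn(16000)                                    # 1 s of 16 kHz audio
z_c, z_f0, z_spk = content_encoder(wave), f0_encoder(wave), speaker_encoder(wave)
print(z_c.shape, z_f0.shape, z_spk.shape)                    # (50,) (12,) (256,)
```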

u2S (unit-to-speech)

Base model: HiFi-GAN

  • PreNet (sketched below, after the loss settings)
    • inputs: (z_c, z_F0, z_spk)
    • step1 / Embedding: discrete AU z_c | prosody z_F0 => (LUT_c | LUT_F0) => embedding vector
    • step2 / time-scaling: upsample emb(z_c) / emb(z_F0) / z_spk to the HiFi-GAN input frame rate, then concat
    • outputs: (Feat, Frame)
  • HiFi-GAN
    • G: original HiFi-GAN V1
      • one extra 'up-MRF' layer added; the upsampling factor at each layer is kept modest
    • D
      • MPD(period=2|3|5|7|11)
      • MSD(scale=x1|x2|x4)
  • loss
    • adversarial loss
    • mel-spec L1 loss (mel-STFT loss)
    • feature matching loss

loss balance: λ_fm = 2, λ_r = 45.
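
A sketch of the PreNet as I read it: lookup-table embeddings for the discrete content and F0 units, nearest-neighbor upsampling of each stream to a common frame length, then concatenation with the broadcast speaker vector. The embedding width (128) and the target frame length are my assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNet(nn.Module):
    """Sketch of the PreNet: embed discrete units, upsample to a common frame
    rate, concat with the broadcast speaker vector. Embedding sizes are assumptions."""

    def __init__(self, n_content=100, n_f0=20, d_emb=128):
        super().__init__()
        self.lut_c = nn.Embedding(n_content, d_emb)   # LUT_c
        self.lut_f0 = nn.Embedding(n_f0, d_emb)       # LUT_F0

    def forward(self, z_c, z_f0, z_spk, n_frames):
        e_c = self.lut_c(z_c).transpose(1, 2)                  # (B, d_emb, L)
        e_f0 = self.lut_f0(z_f0).transpose(1, 2)               # (B, d_emb, L')
        # time-scale every stream to the decoder's frame rate
        e_c = F.interpolate(e_c, size=n_frames, mode="nearest")
        e_f0 = F.interpolate(e_f0, size=n_frames, mode="nearest")
        spk = z_spk.unsqueeze(-1).expand(-1, -1, n_frames)     # (B, d_spk, n_frames)
        return torch.cat([e_c, e_f0, spk], dim=1)              # HiFi-GAN generator input

prenet = PreNet()
z_c = torch.randint(0, 100, (1, 50))     # 50 content frames
z_f0 = torch.randint(0, 20, (1, 12))     # 12 F0 frames
z_spk = torch.randn(1, 256)
feats = prenet(z_c, z_f0, z_spk, n_frames=50)
print(feats.shape)                       # torch.Size([1, 512, 50])
```

With the weights above, the generator objective amounts to L_adv + 2 * L_fm + 45 * L_recon, on top of the usual HiFi-GAN discriminator training.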

Experiments

Data

All audio is 16kHz.

  • training
    • Content Encoder
      • CPC: Libri-light clean 6k hours (LL6k-e-loCTC)5
      • HuBERT: LibriSpeech 960 hours6
      • VQVAE: Libri-light clean 6k hours (LL6k-e-loCTC)7
      • k-means: LibriSpeech clean-100h
    • F0 Encoder: VCTK
    • Decoder
  • evaluation8
    • data
      • LJSpeech 16kHz (single speaker)
      • VCTK 16kHz (multiple speakers)
    • metrics
      • VDE / Voicing Decision Error: rate of voiced/unvoiced misclassification9 (see the metric sketch after this list)
      • FFE / F0 Frame Error: pitch error rate (a V/UV error or an F0 deviation of more than 20%)10
    • viewpoint
      • reconstruction
        • metrics: VDE↓ / FFE↓
      • controllability
        • F0: feed in a flattened F0 feature
          • metrics: VDE↑ / FFE↑11 (if controllability is high, the flat F0 is reflected and the error against GT grows)
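
A sketch of the two metrics computed from reference and synthesized F0 tracks, following the definitions quoted in the footnotes (frames with F0 = 0 are treated as unvoiced). This is my own implementation, not the paper's evaluation code.

```python
import numpy as np

def vde(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    """Voicing Decision Error: fraction of frames whose V/UV decision differs."""
    v_ref, v_syn = f0_ref > 0, f0_syn > 0
    return float(np.mean(v_ref != v_syn))

def ffe(f0_ref: np.ndarray, f0_syn: np.ndarray, tol: float = 0.2) -> float:
    """F0 Frame Error: fraction of frames with a V/UV error or >20% pitch deviation."""
    v_ref, v_syn = f0_ref > 0, f0_syn > 0
    voicing_err = v_ref != v_syn
    both_voiced = v_ref & v_syn
    pitch_err = np.zeros_like(voicing_err)
    pitch_err[both_voiced] = (
        np.abs(f0_syn[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced] > tol
    )
    return float(np.mean(voicing_err | pitch_err))

# toy example: 4 frames, one voicing error and one >20% pitch error
ref = np.array([0.0, 100.0, 110.0, 120.0])
syn = np.array([90.0, 100.0, 140.0, 121.0])
print(vde(ref, syn), ffe(ref, syn))   # 0.25 0.5
```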

Model

CPC

  • model: modified CPC (CPC2) 12
  • hop size: 160 (10msec)
  • k-means: K=100

HuBERT

  • HuBERT base
  • n_hop: 320
  • encoder output: features from transformer layer 6
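
For reference, one way to pull layer-6 features is torchaudio's pretrained HuBERT Base bundle; whether this checkpoint and its layer indexing match exactly what the paper used is an assumption on my part.

```python
import torch
import torchaudio

# HuBERT Base layer-6 features via torchaudio's pretrained bundle
# (assumption: this checkpoint/layer indexing corresponds to the paper's setup).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

wave = torch.randn(1, 16000)            # 1 s of 16 kHz audio (batch, time)
with torch.inference_mode():
    feats, _ = model.extract_features(wave, num_layers=6)
layer6 = feats[-1]                      # (1, ~49, 768): 20 ms frames (n_hop = 320)
print(layer6.shape)
```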

VQVAE

  • n_hop: 160
  • n_z: 256 tokens
  • model: Conv Encoder + HiFi Decoder
  • training: VQ-VAE + GAN? ("Finally, we used HiFiGAN (architecture and objective)" per the paper)

k-means

Since CPC and HuBERT output continuous representations, k-means is applied to turn them into discrete units.13
Discrete unit series z_c::{0, 1, ..., K}^L, K = dictionary size, L = series length.
The k-means model is trained on LibriSpeech clean-100h.14
K is fixed at 100.15
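
A sketch of this discretization with scikit-learn's k-means; the feature arrays are random stand-ins for pooled CPC/HuBERT frame outputs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Fit k-means (K = 100) on continuous frame features, then map every frame of
# an utterance to its nearest centroid id. `features` is a random stand-in.
rng = np.random.default_rng(0)
features = rng.standard_normal((5000, 256)).astype(np.float32)   # (frames, dim)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)

utterance = rng.standard_normal((120, 256)).astype(np.float32)   # one utterance
z_c = kmeans.predict(utterance)        # discrete unit series, values in {0..99}
print(z_c[:10])
```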

F0 Encoder

  • total n_hop = 1280 samples (80 x 16; a 12.5 Hz frame rate at 16 kHz, see also Fig. 1)
  • YAAPT
    • window: 20 msec
    • n_hop: 80 (5msec hop in sr=16000Hz -> 80sample/frame)
    • output: 200 Hz F0 series
  • VQ-VAE
    • down sampling: x16 (4 layers of x2↓ Conv)
    • n_z K': 20 tokens
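
A sketch of the F0 path as I read Fig. 1: a 200 Hz F0 contour goes through four stride-2 convolutions (x16 downsampling, so 1280 samples per code at 16 kHz) and is matched against a 20-entry codebook. The channel width and the plain nearest-codeword lookup are my assumptions, not the paper's exact VQ-VAE.

```python
import torch
import torch.nn as nn

class F0Encoder(nn.Module):
    """Sketch of the F0 VQ-VAE encoder: x16 temporal downsampling with four
    stride-2 convs, then nearest-codeword lookup in a 20-entry codebook.
    Channel width (64) and the plain nearest-neighbor VQ are assumptions."""

    def __init__(self, d=64, n_codes=20):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(4):                                   # 4 layers of x2 downsampling
            layers += [nn.Conv1d(in_ch, d, kernel_size=4, stride=2, padding=1), nn.ReLU()]
            in_ch = d
        self.net = nn.Sequential(*layers)
        self.codebook = nn.Parameter(torch.randn(n_codes, d))

    def forward(self, f0):                                   # f0: (B, 1, T') at 200 Hz
        h = self.net(f0).transpose(1, 2)                     # (B, T'/16, d), i.e. 12.5 Hz
        dist = (h.unsqueeze(2) - self.codebook).pow(2).sum(-1)   # (B, T'/16, n_codes)
        return dist.argmin(dim=-1)                           # z_F0: values in {0..19}

enc = F0Encoder()
f0 = torch.rand(1, 1, 200) * 200.0                           # 1 s of F0 values at 200 Hz
print(enc(f0).shape)                                         # torch.Size([1, 12])
```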

Results

Compared with VQ-VAE, k-CPC and k-HuBERT end up with more disentangled content.

On the reconstruction task there is no notable difference in naturalness MOS.
Looking at the individual factors, VQ-VAE has mediocre content accuracy but very high F0 accuracy.
Between k-CPC and k-HuBERT, HuBERT is slightly better on each metric, depending on the dataset.

On the conversion tasks, clear differences appear.
Under VC, VQ-VAE's MOS drops below 3 and its content accuracy is also low.
With F0 conversion, conversely, the manipulation barely takes effect, suggesting that F0 is entangled in the content representation.
Between k-CPC and k-HuBERT, HuBERT tends to come out ahead.

Additional information from the authors

  • No pretrained models are released16
  • You have to prepare the F0 statistics yourself17

Original Paper

Paper

@misc{2104.00355,
Author = {Adam Polyak and Yossi Adi and Jade Copet and Eugene Kharitonov and Kushal Lakhotia and Wei-Ning Hsu and Abdelrahman Mohamed and Emmanuel Dupoux},
Title = {Speech Resynthesis from Discrete Disentangled Self-Supervised Representations},
Year = {2021},
Eprint = {arXiv:2104.00355},
}

  1. "To generate disentangled representation, we separately extract ... speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner." from the paper
  2. "in the context of expressive and controllable generation, it is unknown to what extent the speaker identity and F0 information are encoded in the learned representations." from the paper
  3. "proposed method ... allows the evaluation of the learned units with respect to speech content, speaker identity, and F0 information, as well as better control the audio synthesis." from the paper
  4. "The study by [23] ... The authors demonstrated ... overall generation quality. In contrast, we ... show these representations are better disentangled" from the paper
  5. "CPC ... was trained on a 'clean' 6k hour sub-sample of the LibriLight dataset [Kahn2020Libri-light, Riviere2020Towards]." from the paper
  6. "HuBERT ... trained ... on 960 hours of LibriSpeech corpus" from the paper
  7. "Similarly to CPC models, we trained the VQ-VAE content encoder model on the “clean” 6K hours subset from the LibriLight dataset." from the paper
  8. "We employ two datasets: LJ ... single speaker dataset and VCTK ... multi-speaker dataset. All datasets were resampled to a 16kHz sample rate." from the paper
  9. "Voicing Decision Error (VDE) ... which measures the portion of frames with voicing decision error" from the paper
  10. "F0 Frame Error (FFE) ...measures the percentage of frames that contain a deviation of more than 20% in pitch value or have a voicing decision error" from the paper
  11. "F0 manipulation results ... for VDE, and FFE higher is the better since F0 was flattened." from the paper
  12. "For CPC, we used the model from [Riviere2020towards] ... We extract a downsampled representation from an intermediate layer with a 256-dimensional embedding and a hop size of 160 audio samples." Polyak, et al. (2021).
  13. "Since the representations learned by CPC and HuBERT are continuous, a k-means algorithm is applied over the models’ outputs to generate discrete units" from the paper
  14. "the k-means algorithm is trained on LibriSpeech clean-100h" from the paper
  15. "We quantize both learned representations with K = 100 centroids" from the paper
  16. "Unfortunately, we are unable to release pre-trained models." official GitHub issue
  17. "Yes you need to get the statistics for the f0" official GitHub issue