Paper review: Polyak (2021) Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Paper

A challenge: can a neural vocoder do speech synthesis / conversion / compression from neural acoustic features (content, F0, speaker)?

Representation learning and vocoder training are completely separated (representation models: pretraining -> fix). The vocoder is then trained on the outputs of the fixed models.
Three content representation models are tried: CPC, HuBERT, and VQ-VAE.
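
A minimal sketch of that two-stage split, with placeholder modules standing in for the actual encoder and vocoder; only the freeze-then-train flow is taken from the description above.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: the representation model is pretrained elsewhere and frozen;
# only the vocoder is optimized on its outputs. Both modules are stand-ins,
# not the paper's actual networks.
encoder = torch.nn.Conv1d(1, 256, kernel_size=320, stride=320)            # stand-in for CPC/HuBERT/VQ-VAE
vocoder = torch.nn.ConvTranspose1d(256, 1, kernel_size=320, stride=320)   # stand-in for HiFi-GAN

encoder.requires_grad_(False).eval()                  # pretraining -> fix
opt = torch.optim.Adam(vocoder.parameters(), lr=2e-4)

wave = torch.randn(8, 1, 16000)                       # dummy 16 kHz batch
with torch.no_grad():
    feats = encoder(wave)                             # fixed representations
recon = vocoder(feats)
loss = F.l1_loss(recon, wave)                         # stand-in reconstruction loss
loss.backward()
opt.step()
```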

Formal summary

Showed that speech synthesis from acoustic units (unit-to-speech), voice conversion, and compression are possible.
With an emphasis on controllability, three separate acoustic units (content, F0 unit, speaker) are taken as input, and speech is synthesized with HiFi-GAN, manipulating the F0 unit for prosody control and the speaker identity for VC.

Demo

link

Background

Research question: "which discrete SSL representation gives high quality, high controllability, and high compression on the generation side?"

Acoustic units have mostly been evaluated through ASR.
Speech resynthesis is shown as a demo, but the properties of the units (what information they contain (e.g. phoneme, F0, speaker), their disentanglement, their contribution to synthesis quality, etc.) have not been studied much.

If the goal of synthesis is pure reconstruction, it does not matter whether content and F0 are entangled.
But if you want to use the units for pitch manipulation or voice conversion, they have to be disentangled (pitch controllability is required).

Proposed method

A content/F0/speaker-to-wave synthesis model.
Training it serves to verify whether various SSL units hold sufficient content in a disentangled1 2 3 4 and highly compressed form.

S2u (speech-to-unit)

  • content encoder E_c: waveform::R^T -> discrete representation seq::R^(n×T')
    • CPC | HuBERT | VQ-VAE
    • units z_c::{0, 1, ..., K}^L: k-means over the continuous E_c output, or the direct discrete E_c output
  • F0 encoder E_F0: waveform::R^T -> (YAAPT) -> F0 series p::?^(T') -> discrete representation z_F0::{0, 1, ..., K'}^L'
  • speaker identity encoder E_spk: ? -> single global representation z_spk::R^256: d-vector
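
A shape-level sketch of this stage. The encoder bodies are random placeholders; only the interfaces (discrete content units, discrete F0 units, one global 256-dim speaker vector) follow the list above, with hop sizes taken from the Model section below.

```python
import torch

# Shape-level sketch of the S2u stage; encoder bodies are placeholders.
K, K_F0, D_SPK = 100, 20, 256          # content codebook, F0 codebook, d-vector dim

def content_encoder(wave: torch.Tensor, hop: int = 320) -> torch.Tensor:
    """wave (T,) -> content units z_c (L,) in {0..K-1}; hop=320 matches HuBERT, CPC/VQ-VAE use 160."""
    return torch.randint(0, K, (wave.shape[0] // hop,))      # placeholder

def f0_encoder(wave: torch.Tensor, hop: int = 1280) -> torch.Tensor:
    """wave (T,) -> F0 units z_F0 (L',) in {0..K'-1} (YAAPT -> VQ-VAE)."""
    return torch.randint(0, K_F0, (wave.shape[0] // hop,))   # placeholder

def speaker_encoder(wave: torch.Tensor) -> torch.Tensor:
    """wave (T,) -> one global speaker embedding z_spk (256,) (d-vector)."""
    return torch.randn(D_SPK)                                # placeholder

wave = torch.randn(16000)                                    # 1 s of 16 kHz audio
z_c, z_f0, z_spk = content_encoder(wave), f0_encoder(wave), speaker_encoder(wave)
print(z_c.shape, z_f0.shape, z_spk.shape)                    # (50,) (12,) (256,)
```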

u2S (unit-to-speech)

Base model: HiFi-GAN

  • PreNet (sketched below, after the loss settings)
    • inputs: (z_c, z_F0, z_spk)
    • step1 / Embedding: discrete AU z_c | prosody z_F0 => (LUT_c | LUT_F0) => embedding vector
    • step2 / time-scaling: upsample emb(z_c) / emb(z_F0) / z_spk to the HiFi-GAN input frame rate, then concat
    • outputs: (Feat, Frame)
  • HiFi-GAN
    • G: original HiFi-GAN V1
      • one extra 'up-MRF' layer added; the upsampling factor at each layer is kept modest
    • D
      • MPD(period=2|3|5|7|11)
      • MSD(scale=x1|x2|x4)
  • loss
    • adversarial loss
    • mel-spec L1 loss (mel-STFT loss)
    • feature matching loss

loss balance: λ_fm = 2, λ_r = 45.
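
A sketch of the PreNet as I read it: lookup-table embeddings for the discrete content and F0 units, nearest-neighbor upsampling of each stream to a common frame length, then concatenation with the broadcast speaker vector. The embedding width (128) and the target frame length are my assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNet(nn.Module):
    """Sketch of the PreNet: embed discrete units, upsample to a common frame
    rate, concat with the broadcast speaker vector. Embedding sizes are assumptions."""

    def __init__(self, n_content=100, n_f0=20, d_emb=128):
        super().__init__()
        self.lut_c = nn.Embedding(n_content, d_emb)   # LUT_c
        self.lut_f0 = nn.Embedding(n_f0, d_emb)       # LUT_F0

    def forward(self, z_c, z_f0, z_spk, n_frames):
        e_c = self.lut_c(z_c).transpose(1, 2)                  # (B, d_emb, L)
        e_f0 = self.lut_f0(z_f0).transpose(1, 2)               # (B, d_emb, L')
        # time-scale every stream to the decoder's frame rate
        e_c = F.interpolate(e_c, size=n_frames, mode="nearest")
        e_f0 = F.interpolate(e_f0, size=n_frames, mode="nearest")
        spk = z_spk.unsqueeze(-1).expand(-1, -1, n_frames)     # (B, d_spk, n_frames)
        return torch.cat([e_c, e_f0, spk], dim=1)              # HiFi-GAN generator input

prenet = PreNet()
z_c = torch.randint(0, 100, (1, 50))     # 50 content frames
z_f0 = torch.randint(0, 20, (1, 12))     # 12 F0 frames
z_spk = torch.randn(1, 256)
feats = prenet(z_c, z_f0, z_spk, n_frames=50)
print(feats.shape)                       # torch.Size([1, 512, 50])
```

With the weights above, the generator objective amounts to L_adv + 2 * L_fm + 45 * L_recon, on top of the usual HiFi-GAN discriminator training.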

Experiments

Data

All audio is 16kHz.

  • training
    • Content Encoder
      • CPC: Libri-light clean 6k hours (LL6k-e-loCTC)5
      • HuBERT: LibriSpeech 960 hours6
      • VQVAE: Libri-light clean 6k hours (LL6k-e-loCTC)7
      • k-means: LibriSpeech clean-100h
    • F0 Encoder: VCTK
    • Decoder
  • evaluation8
    • data
      • LJSpeech 16kHz (single speaker)
      • VCTK 16kHz (multiple speakers)
    • metrics
      • VDE / Voicing Decision Error: rate of voiced/unvoiced misclassification9 (see the metric sketch after this list)
      • FFE / F0 Frame Error: pitch error rate (a V/UV error or an F0 deviation of more than 20%)10
    • viewpoint
      • reconstruction
        • metrics: VDE↓ / FFE↓
      • controllability
        • F0: feed in a flattened F0 feature
          • metrics: VDE↑ / FFE↑11 (if controllability is high, the flat F0 is reflected and the error against GT grows)
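
A sketch of the two metrics computed from reference and synthesized F0 tracks, following the definitions quoted in the footnotes (frames with F0 = 0 are treated as unvoiced). This is my own implementation, not the paper's evaluation code.

```python
import numpy as np

def vde(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    """Voicing Decision Error: fraction of frames whose V/UV decision differs."""
    v_ref, v_syn = f0_ref > 0, f0_syn > 0
    return float(np.mean(v_ref != v_syn))

def ffe(f0_ref: np.ndarray, f0_syn: np.ndarray, tol: float = 0.2) -> float:
    """F0 Frame Error: fraction of frames with a V/UV error or >20% pitch deviation."""
    v_ref, v_syn = f0_ref > 0, f0_syn > 0
    voicing_err = v_ref != v_syn
    both_voiced = v_ref & v_syn
    pitch_err = np.zeros_like(voicing_err)
    pitch_err[both_voiced] = (
        np.abs(f0_syn[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced] > tol
    )
    return float(np.mean(voicing_err | pitch_err))

# toy example: 4 frames, one voicing error and one >20% pitch error
ref = np.array([0.0, 100.0, 110.0, 120.0])
syn = np.array([90.0, 100.0, 140.0, 121.0])
print(vde(ref, syn), ffe(ref, syn))   # 0.25 0.5
```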

Model

CPC

  • model: modified CPC (CPC2) 12
  • hop size: 160 (10msec)
  • k-means: K=100

HuBERT

  • HuBERT base
  • n_hop: 320
  • encoder output: features from transformer layer 6
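
For reference, one way to pull layer-6 features is torchaudio's pretrained HuBERT Base bundle; whether this checkpoint and its layer indexing match exactly what the paper used is an assumption on my part.

```python
import torch
import torchaudio

# HuBERT Base layer-6 features via torchaudio's pretrained bundle
# (assumption: this checkpoint/layer indexing corresponds to the paper's setup).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

wave = torch.randn(1, 16000)            # 1 s of 16 kHz audio (batch, time)
with torch.inference_mode():
    feats, _ = model.extract_features(wave, num_layers=6)
layer6 = feats[-1]                      # (1, ~49, 768): 20 ms frames (n_hop = 320)
print(layer6.shape)
```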

VQVAE

  • n_hop: 160
  • n_z: 256 tokens
  • model: Conv Encoder + HiFi Decoder
  • training: VQ-VAE + GAN? ("Finally, we used HiFiGAN (architecture and objective)" per the paper)

k-means

Since CPC and HuBERT output continuous representations, k-means is applied to turn them into discrete units.13
Discrete unit series z_c::{0, 1, ..., K}^L, K = dictionary size, L = series length.
The k-means model is trained on LibriSpeech clean-100h.14
K is fixed at 100.15
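
A sketch of this discretization with scikit-learn's k-means; the feature arrays are random stand-ins for pooled CPC/HuBERT frame outputs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Fit k-means (K = 100) on continuous frame features, then map every frame of
# an utterance to its nearest centroid id. `features` is a random stand-in.
rng = np.random.default_rng(0)
features = rng.standard_normal((5000, 256)).astype(np.float32)   # (frames, dim)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)

utterance = rng.standard_normal((120, 256)).astype(np.float32)   # one utterance
z_c = kmeans.predict(utterance)        # discrete unit series, values in {0..99}
print(z_c[:10])
```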

F0 Encoder

  • total n_hop = 1280 samples (80 x 16; a 12.5 Hz frame rate at 16 kHz, see also Fig. 1)
  • YAAPT
    • window: 20 msec
    • n_hop: 80 (5msec hop in sr=16000Hz -> 80sample/frame)
    • output: 200 Hz F0 series
  • VQ-VAE
    • down sampling: x16 (4 layers of x2↓ Conv)
    • n_z K': 20 tokens
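
A sketch of the F0 path as I read Fig. 1: a 200 Hz F0 contour goes through four stride-2 convolutions (x16 downsampling, so 1280 samples per code at 16 kHz) and is matched against a 20-entry codebook. The channel width and the plain nearest-codeword lookup are my assumptions, not the paper's exact VQ-VAE.

```python
import torch
import torch.nn as nn

class F0Encoder(nn.Module):
    """Sketch of the F0 VQ-VAE encoder: x16 temporal downsampling with four
    stride-2 convs, then nearest-codeword lookup in a 20-entry codebook.
    Channel width (64) and the plain nearest-neighbor VQ are assumptions."""

    def __init__(self, d=64, n_codes=20):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(4):                                   # 4 layers of x2 downsampling
            layers += [nn.Conv1d(in_ch, d, kernel_size=4, stride=2, padding=1), nn.ReLU()]
            in_ch = d
        self.net = nn.Sequential(*layers)
        self.codebook = nn.Parameter(torch.randn(n_codes, d))

    def forward(self, f0):                                   # f0: (B, 1, T') at 200 Hz
        h = self.net(f0).transpose(1, 2)                     # (B, T'/16, d), i.e. 12.5 Hz
        dist = (h.unsqueeze(2) - self.codebook).pow(2).sum(-1)   # (B, T'/16, n_codes)
        return dist.argmin(dim=-1)                           # z_F0: values in {0..19}

enc = F0Encoder()
f0 = torch.rand(1, 1, 200) * 200.0                           # 1 s of F0 values at 200 Hz
print(enc(f0).shape)                                         # torch.Size([1, 12])
```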

Results

Compared with VQ-VAE, k-CPC and k-HuBERT end up with more disentangled content.

On the reconstruction task there is no notable difference in naturalness MOS.
Looking at the individual factors, VQ-VAE has mediocre content accuracy but very high F0 accuracy.
Between k-CPC and k-HuBERT, HuBERT is slightly better on each metric, depending on the dataset.

On the conversion tasks, clear differences appear.
Under VC, VQ-VAE's MOS drops below 3 and its content accuracy is also low.
With F0 conversion, conversely, the manipulation barely takes effect, suggesting that F0 is entangled in the content representation.
Between k-CPC and k-HuBERT, HuBERT tends to come out ahead.

Additional information from the authors

  • No pretrained models are released16
  • You have to prepare the F0 statistics yourself17

Original Paper

Paper

@misc{2104.00355,
Author = {Adam Polyak and Yossi Adi and Jade Copet and Eugene Kharitonov and Kushal Lakhotia and Wei-Ning Hsu and Abdelrahman Mohamed and Emmanuel Dupoux},
Title = {Speech Resynthesis from Discrete Disentangled Self-Supervised Representations},
Year = {2021},
Eprint = {arXiv:2104.00355},
}

  1. "To generate disentangled representation, we separately extract ... speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner." from the paper
  2. "in the context of expressive and controllable generation, it is unknown to what extent the speaker identity and F0 information are encoded in the learned representations." from the paper
  3. "proposed method ... allows the evaluation of the learned units with respect to speech content, speaker identity, and F0 information, as well as better control the audio synthesis." from the paper
  4. "The study by [23] ... The authors demonstrated ... overall generation quality. In contrast, we ... show these representations are better disentangled" from the paper
  5. "CPC ... was trained on a 'clean' 6k hour sub-sample of the LibriLight dataset [Kahn2020Libri-light, Riviere2020Towards]." from the paper
  6. "HuBERT ... trained ... on 960 hours of LibriSpeech corpus" from the paper
  7. "Similarly to CPC models, we trained the VQ-VAE content encoder model on the “clean” 6K hours subset from the LibriLight dataset." from the paper
  8. "We employ two datasets: LJ ... single speaker dataset and VCTK ... multi-speaker dataset. All datasets were resampled to a 16kHz sample rate." from the paper
  9. "Voicing Decision Error (VDE) ... which measures the portion of frames with voicing decision error" from the paper
  10. "F0 Frame Error (FFE) ...measures the percentage of frames that contain a deviation of more than 20% in pitch value or have a voicing decision error" from the paper
  11. "F0 manipulation results ... for VDE, and FFE higher is the better since F0 was flattened." from the paper
  12. "For CPC, we used the model from [Riviere2020towards] ... We extract a downsampled representation from an intermediate layer with a 256-dimensional embedding and a hop size of 160 audio samples." Polyak, et al. (2021).
  13. "Since the representations learned by CPC and HuBERT are continuous, a k-means algorithm is applied over the models’ outputs to generate discrete units" from the paper
  14. "the k-means algorithm is trained on LibriSpeech clean-100h" from the paper
  15. "We quantize both learned representations with K = 100 centroids" from the paper
  16. "Unfortunately, we are unable to release pre-trained models." official GitHub issue
  17. "Yes you need to get the statistics for the f0" official GitHub issue