CycleGAN + linear spectrogram + WaveRNN vocoder => similarity MOS 4.5, naturalness MOS in the high 3s
[Article for readers already familiar with the field]
Overview
Masaya Tanaka, Takashi Nose, Aoi Kanagaki, Ryohei Shimizu, and Akira Ito (2020) Scyclone: High-Quality and Parallel-Data-Free Voice Conversion Using Spectrogram and Cycle-Consistent Adversarial Networks.
linear-Spec conversion with CycleGAN + simplified-WaveRNN Vocoder
Similarity is super good (demo)
Prior works
- CycleGAN-VC (2 groups)
- CycleGAN-VC2
- WaveNet family (especially WaveRNN)
Philosophies
- prefer E2E: avoids vocoding error
- prefer unified conversion: exploits feature-to-feature correlation
- a too-strong Discriminator causes problems
- Encoder-Decoder structures may destroy linguistic info & time structure¹
Why not mel-spec
Works better in practice with the WaveRNN-based vocoder:
"the low-dimensional linear spectrogram gives a better result than the mel spectrogram as the input of the following WaveRNN-based vocoder" (from the Scyclone paper)
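A minimal numpy sketch of the framing these settings imply (the function name and framing code are mine, not from the paper). One detail worth noting: an rfft over a 254-sample window yields exactly 254 // 2 + 1 = 128 linear-frequency bins, matching the 128-dim spectrogram.

```python
import numpy as np

def linear_spectrogram(wav, win_len=254, hop=128):
    """Magnitude spectrogram with the paper's settings: a 254-sample
    Hanning window (~16 ms at 16 kHz) and a 128-sample hop (~50% overlap).
    rfft of a 254-point frame gives 254 // 2 + 1 = 128 frequency bins."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(wav) - win_len) // hop
    frames = np.stack([wav[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))  # (n_frames, 128)

# ~1.3 s of 16 kHz audio, sized to produce exactly 160 frames
wav = np.random.randn(160 * 128 + 254 - 128)
spec = linear_spectrogram(wav)
print(spec.shape)  # (160, 128)
```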
Setups
- architecture
- spec2spec: CycleGAN
- input: 1D spec (channel = frequency)
- losses: hinge adversarial, L1 cycle-consistency, L1 identity
- commons
- network: 1D conv ResNet
- no Enc-Dec structure (stride = 1)
- first layer: channel upsampling (channel doubling with pointwiseConv)
- activation: LReLU
- G specific
- last layer: channel downsampling (channel halving with pointwiseConv)
- D specific
- normalization: SN
- final layer: channel downsampling + global average pooling (to 1 channel with pointwiseConv => pooling)
- additional input noise: N(0, 0.01)
- spec2wave: simplified-WaveRNN
- output: single Gaussian probability density function
- training
- data
- format
- 16 kHz
- linear spec
- 254-sample Hanning window (~16 ms)
- 128-sample shift (~50% overlap)
- size: 160 frames (~1.3 sec)
- for D, head/tail each 16 frames discarded
- speakers
- Ayanami: 4,973 utterances
- F009: 4,973 utterances
- params
- Conv kernel: (5,)
- ResNet layerNum: nG=7, nD=6
- m_hinge (hinge-loss margin): 0.5
- λ_cycle_consistency: 10
- λ_identity: 1
- Adam (CycleGAN): (α, β₁, β₂) = (2.0 × 10⁻⁴, 0.5, 0.999)
- Adam (WaveRNN): (α, β₁, β₂) = (1.0 × 10⁻⁴, 0.5, 0.999)
- batch size: 64 (CycleGAN) / 160 (WaveRNN)
- evaluation
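The losses and weights listed above (hinge adversarial with margin m_hinge = 0.5, L1 cycle-consistency with λ = 10, L1 identity with λ = 1) can be sketched in numpy. The exact generator-side adversarial form is my assumption (plain -E[D(fake)]), not confirmed by the paper.

```python
import numpy as np

M_HINGE = 0.5                 # hinge-loss margin from the paper
LAM_CYC, LAM_ID = 10.0, 1.0   # loss weights from the paper

def d_hinge_loss(d_real, d_fake):
    # Discriminator hinge loss with margin M_HINGE instead of the usual 1.0
    return (np.mean(np.maximum(0.0, M_HINGE - d_real))
            + np.mean(np.maximum(0.0, M_HINGE + d_fake)))

def g_adv_loss(d_fake):
    # Generator pushes D's score on converted spectrograms upward
    # (plain -E[D(fake)]; the exact form here is my assumption)
    return -np.mean(d_fake)

def l1(a, b):
    return np.mean(np.abs(a - b))

# toy spectrograms standing in for the generators' outputs
rng = np.random.default_rng(0)
x = rng.normal(size=(160, 128))       # source-domain spectrogram
cyc_x, id_x = x + 0.05, x + 0.01      # G_BA(G_AB(x)) and G_BA(x) stand-ins

g_total = (g_adv_loss(np.array([0.3]))
           + LAM_CYC * l1(cyc_x, x) + LAM_ID * l1(id_x, x))
print(round(float(g_total), 2))  # -0.3 + 10*0.05 + 1*0.01 = 0.21
```

With the reduced margin of 0.5, the discriminator stops receiving gradient once its real/fake scores clear ±0.5, which is in line with the paper's philosophy of keeping the discriminator from becoming too strong.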
Results
Note for implementation
Dimension
G: (input) 160x1x128 => (channelx2) => 160x1x256 => (ResNetLoop) => 160x1x256 => (channel 1/2) => 160x1x128
D: (input) 160x1x128 => (head/tail cut) => 128x1x128 => (channelx2) => 128x1x256 => (ResNetLoop) => 128x1x256 => (channel 1) => 128x1x1 => (global average pooling) => 1
V: (input) 160x1x128 =>
8 frames (~64msec) => 8192 units => (reshape) => 128 sampling point * 64-dim vector
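The shape flows above can be checked with plain numpy, treating each pointwise (1×1) conv as a per-frame matmul. The weights are random and purely illustrative; the ResNet body is omitted since it is shape-preserving (stride 1).

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 160, 128                      # frames x frequency channels

def pointwise_conv(x, w):
    # a 1x1 conv over time is just a per-frame matmul: (T, Cin) @ (Cin, Cout)
    return x @ w

x = rng.normal(size=(T, C))

# G: channel x2 -> shape-preserving ResNet body -> channel /2
h = pointwise_conv(x, rng.normal(size=(C, 2 * C)))       # (160, 256)
g_out = pointwise_conv(h, rng.normal(size=(2 * C, C)))   # (160, 128)

# D: cut 16 head/tail frames -> channel x2 -> 1 channel -> global avg pool
d_in = x[16:-16]                                         # (128, 128)
d_h = pointwise_conv(d_in, rng.normal(size=(C, 2 * C)))  # (128, 256)
d_out = pointwise_conv(d_h, rng.normal(size=(2 * C, 1))).mean()  # scalar

# Vocoder conditioning: 8192 units reshaped into one 64-dim vector
# per each of the 128 waveform samples in a hop
cond = rng.normal(size=(8192,)).reshape(128, 64)
print(g_out.shape, d_in.shape, cond.shape)
```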
"More detailed description and evaluation of Scyclone will be presented in our next article." (from the Scyclone paper)
Information in the wild
Google search
No useful information found
The presentation only evaluated female-to-female conversion, but male-to-female conversion also works without problems. Real-time conversion with a sped-up WaveRNN is also more or less working, so I hope to present it at some point.
tweet
from poster in tweet:
- the poster lists frame counts as G240/D240, which differs from the paper
- the paper says G160/D128
- iter == 400K
- scheduler: LR ×0.1 per 100K iterations
This is the poster for our presentation of Scyclone, a non-parallel voice conversion method. Although the presentation only evaluated female-to-female conversion, male-to-female conversion also works without problems. Real-time conversion with a sped-up WaveRNN is also more or less working, so I hope to present it at some point. https://t.co/cTzbpX7Pct pic.twitter.com/cpm2KFTsyv
— 能勢 隆 (Takashi Nose) (@takashi_nose) September 14, 2020
My paper read
Read thoroughly, except chapter III, which I only skimmed (I will use other methods)
- check PPG
- check "simplified WaveRNN with a single Gaussian probability density function"
- check [16] for hinge loss
- check [18]: info destroyed by Enc-Dec structures
- noise effect: noise added to the discriminator input fixes the instability and vanishing-gradient issues [21]
My result
My reimplementation (now private)
preliminary: 2.37 iter/sec on a P100 (Google Colab Tesla P100-PCIE-16GB), so 400K iterations take ~46 h
Subjectively not drastically different from CycleGAN-VC1 (which seems reasonable given the model size)
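As a sanity check on the training-time estimate, assuming constant throughput over the whole 400K-iteration schedule:

```python
# Throughput measured on a Colab P100: 2.37 iter/s
iters, iter_per_sec = 400_000, 2.37
hours = iters / iter_per_sec / 3600
print(f"{hours:.1f} h")  # 46.9 h
```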
“we think that such high-level abstraction increases the risk of destroying linguistic information and the time structure of input speech.” from Scyclone paper↩