Tarepan's Memorandum

My personal notes, mostly about biology and programming.

Scyclone VC

CycleGAN + linear spectrogram + WaveRNN vocoder => similarity MOS 4.5, naturalness MOS in the high 3s

[An article for readers already familiar with the field]

Overview

Masaya Tanaka, Takashi Nose, Aoi Kanagaki, Ryohei Shimizu, and Akira Ito (2020) Scyclone: High-Quality and Parallel-Data-Free Voice Conversion Using Spectrogram and Cycle-Consistent Adversarial Networks.

linear-Spec conversion with CycleGAN + simplified-WaveRNN Vocoder
Similarity is super good (demo)

prior works

  • CycleGAN-VC (2 groups)
  • CycleGAN-VC2
  • WaveNet family (especially WaveRNN)

Philosophies

  • prefer E2E: avoids vocoding error
  • prefer unified conversion: keeps feature-feature correlation
  • a too-strong Discriminator causes problems
  • an Encoder-Decoder structure may lose linguistic info & time structure 1

Why not mel-spec

Works better in practice as input to a WaveRNN-based vocoder:

the low-dimensional linear spectrogram gives a better result than the mel spectrogram as the input of the following WaveRNN-based vocoder
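As a concrete illustration of this input format, below is a minimal NumPy sketch of the low-dimensional linear spectrogram (parameters follow the data format described later in this note: 16 kHz audio, 254-point Hanning window, 128-sample shift). Note that a 254-point FFT yields exactly 254 // 2 + 1 = 128 frequency bins. The function name and framing details are my own, not the authors' code:

```python
import numpy as np

def linear_spectrogram(wave, n_fft=254, hop=128):
    """Low-dimensional linear (not mel) magnitude spectrogram.

    A 254-point FFT gives n_fft // 2 + 1 = 128 frequency bins,
    matching the 128-channel network input.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft over each windowed frame -> (n_frames, 128), transpose to (freq, time)
    return np.abs(np.fft.rfft(frames, axis=1)).T

# ~1.3 s at 16 kHz -> 160 frames, the training segment size
wave = np.random.randn(254 + 159 * 128)
spec = linear_spectrogram(wave)
print(spec.shape)  # (128, 160)
```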

Setups

  • architecture
    • spec2spec: CycleGAN
      • input: 1D spec (channel = frequency)
      • losses: hinge adversarial, L1 cycle-consistency, L1 identity
      • commons
        • network: 1D conv ResNet
          • no Enc-Dec structure (stride=1)
          • first layer: channel upsampling (channel doubling with pointwiseConv)
        • activation: LReLU
      • G specific
        • last layer: channel downsampling (channel halving with pointwiseConv)
      • D specific
        • normalization: SN
        • final layer: channel downsampling + global average pooling (to 1 channel with pointwiseConv => pooling)
        • additional input noise: N(0, 0.01)
    • spec2wave: simplified-WaveRNN
      • Gaussian probability density function
  • training
    • data
      • format
        • 16 kHz
        • linear spec
          • 254-point Hanning window (~16 msec)
          • 128-sample shift (1/2 overlap)
        • size: 160 frames (~1.3 sec)
          • for D, 16 frames are discarded from each of head and tail
      • datum
        • Ayanami: 4,973 utterances
        • F009: 4,973 utterances
    • params
      • Conv kernel: (5,)
      • ResNet layerNum: nG=7, nD=6
      • mhinge: 0.5
      • λcycle_consistency: 10
      • λidentity: 1
      • Adam (CycleGAN): (α, β1, β2) = (2.0 × 10^-4, 0.5, 0.999)
      • Adam (WaveRNN): (α, β1, β2) = (1.0 × 10^-4, 0.5, 0.999)
      • batch size: 64 & 160 (cycleGAN, WaveRNN)
  • evaluation
    • subjective metrics (9 Japanese listeners)
      • naturalness MOS
      • similarity MOS

Results

  • naturalness MOS: 3.9 (vs 4.6) & 3.4 (vs 4.8)
  • similarity MOS: 4.4 & 4.5

Note for implementation

Dimension

G: (input) 160x1x128 => (channelx2) => 160x1x256 => (ResNetLoop) => 160x1x256 => (channel 1/2) => 160x1x128
D: (input) 160x1x128 => (head/tail cut) => 128x1x128 => (channelx2) => 128x1x256 => (ResNetLoop) => 128x1x256 => (channel 1) => 128x1x1 => (global average pooling) => 1
V: (input) 160x1x128 =>

8 frames (~64msec) => 8192 units => (reshape) => 128 sampling point * 64-dim vector
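The 8-frame-to-8192-unit note above can be read as a conditioning upsampler for the WaveRNN vocoder. This is my guess at the intended reshape; the linear layer and its placement are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical: map 8 spectrogram frames (~64 ms) of 128 bins each to
# 8192 units, then reshape into a 64-dim conditioning vector per sampling point.
upsampler = nn.Linear(8 * 128, 8192)

frames = torch.randn(1, 8, 128)    # 8 frames of the 128-bin linear spec
h = upsampler(frames.flatten(1))   # (1, 8192)
cond = h.view(1, 128, 64)          # 128 sampling points x 64-dim vectors
print(cond.shape)  # torch.Size([1, 128, 64])
```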

" More detailed description and evaluation of Scylone will be presented in our next article." from Scyclone paper

Community information

Google search

No useful information found.

Twitter

"The presentation only evaluated female-to-female conversion, but male-to-female conversion also works without problems. Real-time conversion with a sped-up WaveRNN is also more or less working, so I hope to introduce it at some point." (translated)
tweet

From the poster in the tweet:

  • the frame counts read G240/D240, which differs from the paper
    • the paper says G160/D128
  • iter==400K
  • scheduler: x0.1 per 100K
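The poster's schedule can be sketched with a standard PyTorch StepLR. The parameter below is a stand-in; only the learning rate, betas, x0.1 decay factor, and 100K step size come from the paper/poster:

```python
import torch

# Stand-in parameter; hyperparameters from the notes:
# Adam(alpha=2.0e-4, beta1=0.5, beta2=0.999) for the CycleGAN side,
# lr x0.1 every 100K iterations (poster, not the paper).
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.Adam(params, lr=2.0e-4, betas=(0.5, 0.999))
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100_000, gamma=0.1)

opt.step()  # one optimizer step before scheduling (avoids a PyTorch warning)
for _ in range(100_000):
    sched.step()
print(sched.get_last_lr())  # ~[2e-05] after the first x0.1 decay
```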

My paper read

Read thoroughly enough, except for Chapter III, which I only skimmed (I will use other methods).

  • check PPG
  • check "simplified WaveRNN with a single Gaussian probability density function"
  • check [16] for hinge loss
  • check [18], info destroy by EncDec
  • noise effect: input of the discriminator to fix the instability and vanishing gradients issues [21].

My result

My reimplementation (now private)

preliminary: 2.37 iter/sec on a P100 (Google Colab Tesla P100-PCIE-16GB), so 400K iterations take roughly 47 h.
Subjectively not that drastically different from CycleGAN-VC1 (which seems reasonable given the model sizes).


  1. “we think that such high-level abstraction increases the risk of destroying linguistic information and the time structure of input speech.” from Scyclone paper