たれぱんのびぼーろく

My personal notes, probably mostly biology and programming

Scyclone VC

CycleGAN + linear spectrogram + WaveRNN Vocoder => similarity MOS 4.5, naturalness MOS in the high 3s

[Article for readers who already know the background]

Overview

Masaya Tanaka, Takashi Nose, Aoi Kanagaki, Ryohei Shimizu, and Akira Ito (2020) Scyclone: High-Quality and Parallel-Data-Free Voice Conversion Using Spectrogram and Cycle-Consistent Adversarial Networks.

linear-Spec conversion with CycleGAN + simplified-WaveRNN Vocoder
Similarity is super good (demo)

Prior works

  • CycleGAN-VC (2 groups)
  • CycleGAN-VC2
  • WaveNet family (especially WaveRNN)

Philosophies

  • prefer E2E: to avoid vocoding error
  • prefer unified conversion: to keep feature-feature correlation
  • a strong Discriminator causes problems
  • an Encoder-Decoder structure loses linguistic info & time structure

    Why not mel-spec?

    In practice, linear-spec works better with the WaveRNN-based vocoder:

    the low-dimensional linear spectrogram gives a better result than the mel spectrogram as the input of the following WaveRNN-based vocoder
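To make "low-dimensional linear spectrogram" concrete, here is a minimal NumPy sketch that uses the framing listed under Setups (16 kHz audio, 254-point Hann window, 128-point shift). The function name `linear_spectrogram` and the 440 Hz test tone are hypothetical. Padding, log compression, and normalization are left out because these notes do not specify them, so this is an illustration of the feature shape, not the paper's exact front end.

```python
import numpy as np

def linear_spectrogram(wave, n_fft=254, hop=128):
    """Magnitude spectrogram of a mono waveform, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping frames of n_fft samples.
    frames = np.stack([wave[i * hop : i * hop + n_fft] for i in range(n_frames)])
    # A 254-point real FFT yields 254 // 2 + 1 = 128 linear-frequency bins.
    return np.abs(np.fft.rfft(frames * window, axis=1))

sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440.0 * t)  # 1 second of a 440 Hz tone (test signal)
spec = linear_spectrogram(wave)
print(spec.shape)  # (frames, 128 bins)
```

With a 254-point window the feature has only 128 channels per frame, which is why no mel filterbank is needed to keep the dimension small.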

Setups

  • architecture
    • spec2spec: CycleGAN
      • input: 1D spec (channel = frequency)
      • losses: hinge adversarial, L1 cycle-consistency, L1 identity
      • commons
        • network: 1D conv ResNet
          • no EncDec structure (stride=1)
          • first layer: channel upsampling (1x1 kernel with channel doubling)
        • activation: LReLU
      • G specific
        • last layer: channel downsampling (1x1 kernel with channel halving)
      • D specific
        • normalization: SN
        • final layer: channel downsampling + global average pooling (1x1 kernel with 1 channel => pooling)
        • additional input noise: N(0, 0.01)
    • spec2wave: simplified-WaveRNN
      • Gaussian probability density function
  • training
    • data
      • format
        • 16 kHz
        • linear spec
          • 254-point Hann window (~16 msec)
          • 128-point shift (~1/2 window)
        • size: 160 frames (~1.3 sec)
          • for D, the first and last 16 frames are discarded
      • datasets
        • Ayanami: 4,973 utterances
        • F009: 4,973 utterances
    • params
      • Conv kernel: (5, 1)
      • ResNet layerNum: nG=7, nD=6
      • m_hinge: 0.5
      • λ_cycle_consistency: 10
      • λ_identity: 1
      • Adam (CycleGAN): (α, β1, β2) = (2.0 × 10^−4, 0.5, 0.999)
      • Adam (WaveRNN): (α, β1, β2) = (1.0 × 10^−4, 0.5, 0.999)
      • batch size: 64 (CycleGAN), 160 (WaveRNN)
  • evaluation
    • subjective metrics (9 Japanese listeners)
      • naturalness MOS
      • similarity MOS
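The loss terms and weights listed above (hinge adversarial with margin 0.5, L1 cycle-consistency weighted by 10, L1 identity weighted by 1) can be sketched as follows for a single conversion direction. This assumes the common hinge-GAN formulation with the margin swapped in; the exact equations are not reproduced in these notes, so the generator's adversarial term in particular is an assumption. All function names are hypothetical.

```python
import numpy as np

M_HINGE = 0.5       # hinge margin from the params list
LAMBDA_CYCLE = 10   # weight of the L1 cycle-consistency term
LAMBDA_ID = 1       # weight of the L1 identity term

def d_hinge_loss(d_real, d_fake, m=M_HINGE):
    """Discriminator loss: push real scores above +m and fake scores below -m."""
    return np.mean(np.maximum(0.0, m - d_real)) + np.mean(np.maximum(0.0, m + d_fake))

def g_adv_loss(d_fake):
    """Generator adversarial term (assumed form): raise D's score on converted specs."""
    return -np.mean(d_fake)

def l1(a, b):
    return np.mean(np.abs(a - b))

def g_total_loss(d_fake, x, x_cycled, x_identity):
    """x_cycled = G_BA(G_AB(x)), x_identity = G_BA(x), both compared against x."""
    return (g_adv_loss(d_fake)
            + LAMBDA_CYCLE * l1(x_cycled, x)
            + LAMBDA_ID * l1(x_identity, x))

# Toy numbers, only to exercise the functions.
d_real = np.array([1.0, 0.2])
d_fake = np.array([-1.0, 0.0])
x = np.zeros((2, 3))
loss_d = d_hinge_loss(d_real, d_fake)
loss_g = g_total_loss(d_fake, x, np.full((2, 3), 0.1), np.full((2, 3), 0.2))
print(loss_d, loss_g)
```

In a full CycleGAN the same terms are computed for both directions (A to B and B to A) and summed.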

Results

  • naturalness MOS: 3.9 (vs 4.6) & 3.4 (vs 4.8)
  • similarity MOS: 4.4 & 4.5

Note for implementation

Dimension

G: (input) 160x1x128 => (channelx2) => 160x1x256 => (ResNetLoop) => 160x1x256 => (channel 1/2) => 160x1x128
D: (input) 160x1x128 => (head/tail cut) => 128x1x128 => (channelx2) => 128x1x256 => (ResNetLoop) => 128x1x256 => (channel 1) => 128x1x1 => (global average pooling) => 1
V: (input) 160x1x128 =>

8 frames (~64 msec) => 8192 units => (reshape) => 128 sampling points * 64-dim vector
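The dimension notes above can be checked with a small NumPy forward pass using random placeholder weights. Several details are assumptions: arrays are laid out as (channels, frames) instead of frames x 1 x channels, the residual block is a two-conv layout, and the LReLU slope is 0.2. Spectral normalization and the discriminator's input noise are omitted, and the vocoder part only shows the 8192 to 128 x 64 reshape. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_w(c_out, c_in, k):
    """Random placeholder weights, shape (c_out, c_in, k)."""
    return rng.normal(0.0, 0.01, size=(c_out, c_in, k))

def conv1d_same(x, w):
    """Stride-1 conv along time with zero padding. x: (c_in, T) -> (c_out, T)."""
    k = w.shape[2]
    t = x.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    return sum(w[:, :, j] @ xp[:, j:j + t] for j in range(k))

def lrelu(x, slope=0.2):  # slope is an assumed value
    return np.where(x > 0, x, slope * x)

def res_block(x, k=5):
    """Residual block with kernel size 5; the two-conv layout is an assumption."""
    c = x.shape[0]
    h = conv1d_same(lrelu(x), rand_w(c, c, k))
    h = conv1d_same(lrelu(h), rand_w(c, c, k))
    return x + h

def generator(spec, n_blocks=7):
    c = spec.shape[0]
    h = conv1d_same(spec, rand_w(2 * c, c, 1))           # 128 -> 256 channels
    for _ in range(n_blocks):
        h = res_block(h)                                  # stays 256 x T
    return conv1d_same(lrelu(h), rand_w(c, 2 * c, 1))     # 256 -> 128 channels

def discriminator(spec, n_blocks=6, crop=16):
    spec = spec[:, crop:-crop]                            # 160 -> 128 frames
    c = spec.shape[0]
    h = conv1d_same(spec, rand_w(2 * c, c, 1))            # 128 -> 256 channels
    for _ in range(n_blocks):
        h = res_block(h)
    score_map = conv1d_same(lrelu(h), rand_w(1, 2 * c, 1))  # 1 x 128
    return score_map.mean()                               # global average pooling

x = rng.normal(size=(128, 160))                 # 128 bins x 160 frames
g_out = generator(x)
d_out = discriminator(x)
vocoder_view = np.zeros(8192).reshape(128, 64)  # 8192 units -> 128 x 64
print(g_out.shape, np.ndim(d_out), vocoder_view.shape)
```

Because every layer uses stride 1, the time axis is never shortened inside G; only the discriminator's head/tail crop changes the frame count.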