CycleGAN + linear spectrogram + WaveRNN vocoder => similarity MOS 4.5, naturalness MOS in the high 3s
[Article for readers already familiar with the field]
Overview
Masaya Tanaka, Takashi Nose, Aoi Kanagaki, Ryohei Shimizu, and Akira Ito (2020) Scyclone: High-Quality and Parallel-Data-Free Voice Conversion Using Spectrogram and Cycle-Consistent Adversarial Networks.
linear-Spec conversion with CycleGAN + simplified-WaveRNN Vocoder
Similarity is super good (demo)
Prior works
- CycleGAN-VC (2 groups)
- CycleGAN-VC2
- WaveNet family (especially WaveRNN)
Philosophies
- prefer E2E: avoids vocoding error
- prefer unified conversion: exploits feature-to-feature correlation
- a too-strong Discriminator causes problems
- Encoder-Decoder structures may destroy linguistic info & time structure¹
Why not mel-spec
Works better in practice with the WaveRNN-based vocoder:
"the low-dimensional linear spectrogram gives a better result than the mel spectrogram as the input of the following WaveRNN-based vocoder" (from the Scyclone paper)
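A minimal numpy sketch of the framing these settings imply (the function name and framing code are mine, not from the paper). One detail worth noting: an rfft over a 254-sample window yields exactly 254 // 2 + 1 = 128 linear-frequency bins, matching the 128-dim spectrogram.

```python
import numpy as np

def linear_spectrogram(wav, win_len=254, hop=128):
    """Magnitude spectrogram with the paper's settings: a 254-sample
    Hanning window (~16 ms at 16 kHz) and a 128-sample hop (~50% overlap).
    rfft of a 254-point frame gives 254 // 2 + 1 = 128 frequency bins."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(wav) - win_len) // hop
    frames = np.stack([wav[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))  # (n_frames, 128)

# ~1.3 s of 16 kHz audio, sized to produce exactly 160 frames
wav = np.random.randn(160 * 128 + 254 - 128)
spec = linear_spectrogram(wav)
print(spec.shape)  # (160, 128)
```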
Setups
- architecture
- spec2spec: CycleGAN
- input: 1D spec (channel = frequency)
- losses: hinge adversarial, L1 cycle-consistency, L1 identity
- commons
- network: 1D conv ResNet
- no Enc-Dec structure (stride = 1)
- first layer: channel upsampling (channel doubling with pointwiseConv)
- activation: LReLU
- G specific
- last layer: channel downsampling (channel halving with pointwiseConv)
- D specific
- normalization: SN
- final layer: channel downsampling + global average pooling (to 1 channel with pointwiseConv => pooling)
- additional input noise: N(0, 0.01)
- spec2wave: simplified-WaveRNN
- output: single Gaussian probability density function
- training
- data
- format
- 16 kHz
- linear spec
- 254-sample Hanning window (~16 ms)
- 128-sample shift (~50% overlap)
- size: 160 frames (~1.3 sec)
- for D, head/tail each 16 frames discarded
- speakers
- Ayanami: 4,973 utterances
- F009: 4,973 utterances
- params
- Conv kernel: (5,)
- ResNet layerNum: nG=7, nD=6
- m_hinge (hinge-loss margin): 0.5
- λ_cycle_consistency: 10
- λ_identity: 1
- Adam (CycleGAN): (α, β₁, β₂) = (2.0 × 10⁻⁴, 0.5, 0.999)
- Adam (WaveRNN): (α, β₁, β₂) = (1.0 × 10⁻⁴, 0.5, 0.999)
- batch size: 64 (CycleGAN) / 160 (WaveRNN)
- evaluation
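The losses and weights listed above (hinge adversarial with margin m_hinge = 0.5, L1 cycle-consistency with λ = 10, L1 identity with λ = 1) can be sketched in numpy. The exact generator-side adversarial form is my assumption (plain -E[D(fake)]), not confirmed by the paper.

```python
import numpy as np

M_HINGE = 0.5                 # hinge-loss margin from the paper
LAM_CYC, LAM_ID = 10.0, 1.0   # loss weights from the paper

def d_hinge_loss(d_real, d_fake):
    # Discriminator hinge loss with margin M_HINGE instead of the usual 1.0
    return (np.mean(np.maximum(0.0, M_HINGE - d_real))
            + np.mean(np.maximum(0.0, M_HINGE + d_fake)))

def g_adv_loss(d_fake):
    # Generator pushes D's score on converted spectrograms upward
    # (plain -E[D(fake)]; the exact form here is my assumption)
    return -np.mean(d_fake)

def l1(a, b):
    return np.mean(np.abs(a - b))

# toy spectrograms standing in for the generators' outputs
rng = np.random.default_rng(0)
x = rng.normal(size=(160, 128))       # source-domain spectrogram
cyc_x, id_x = x + 0.05, x + 0.01      # G_BA(G_AB(x)) and G_BA(x) stand-ins

g_total = (g_adv_loss(np.array([0.3]))
           + LAM_CYC * l1(cyc_x, x) + LAM_ID * l1(id_x, x))
print(round(float(g_total), 2))  # -0.3 + 10*0.05 + 1*0.01 = 0.21
```

With the reduced margin of 0.5, the discriminator stops receiving gradient once its real/fake scores clear ±0.5, which is in line with the paper's philosophy of keeping the discriminator from becoming too strong.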
Results
Note for implementation
Dimension
G: (input) 160x1x128 => (channelx2) => 160x1x256 => (ResNetLoop) => 160x1x256 => (channel 1/2) => 160x1x128
D: (input) 160x1x128 => (head/tail cut) => 128x1x128 => (channelx2) => 128x1x256 => (ResNetLoop) => 128x1x256 => (channel 1) => 128x1x1 => (global average pooling) => 1
V: (input) 160x1x128 =>
8 frames (~64msec) => 8192 units => (reshape) => 128 sampling point * 64-dim vector
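The shape flows above can be checked with plain numpy, treating each pointwise (1×1) conv as a per-frame matmul. The weights are random and purely illustrative; the ResNet body is omitted since it is shape-preserving (stride 1).

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 160, 128                      # frames x frequency channels

def pointwise_conv(x, w):
    # a 1x1 conv over time is just a per-frame matmul: (T, Cin) @ (Cin, Cout)
    return x @ w

x = rng.normal(size=(T, C))

# G: channel x2 -> shape-preserving ResNet body -> channel /2
h = pointwise_conv(x, rng.normal(size=(C, 2 * C)))       # (160, 256)
g_out = pointwise_conv(h, rng.normal(size=(2 * C, C)))   # (160, 128)

# D: cut 16 head/tail frames -> channel x2 -> 1 channel -> global avg pool
d_in = x[16:-16]                                         # (128, 128)
d_h = pointwise_conv(d_in, rng.normal(size=(C, 2 * C)))  # (128, 256)
d_out = pointwise_conv(d_h, rng.normal(size=(2 * C, 1))).mean()  # scalar

# Vocoder conditioning: 8192 units reshaped into one 64-dim vector
# per each of the 128 waveform samples in a hop
cond = rng.normal(size=(8192,)).reshape(128, 64)
print(g_out.shape, d_in.shape, cond.shape)
```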
"More detailed description and evaluation of Scyclone will be presented in our next article." (from the Scyclone paper)
Information in the wild
Google search
No useful information found
The presentation only evaluated female-to-female conversion, but male-to-female conversion also works without problems. Real-time conversion with a sped-up WaveRNN is also more or less working, so I hope to present it at some point.
tweet
from poster in tweet:
- the poster lists frame counts as G240/D240, which differs from the paper
- the paper says G160/D128
- iter == 400K
- scheduler: LR ×0.1 per 100K iterations
This is the poster for our presentation of Scyclone, a non-parallel voice conversion method. Although the presentation only evaluated female-to-female conversion, male-to-female conversion also works without problems. Real-time conversion with a sped-up WaveRNN is also more or less working, so I hope to present it at some point. https://t.co/cTzbpX7Pct pic.twitter.com/cpm2KFTsyv
— 能勢 隆 (Takashi Nose) (@takashi_nose) September 14, 2020
My paper read
Read thoroughly, except chapter III, which I only skimmed (I will use other methods)
- check PPG
- check "simplified WaveRNN with a single Gaussian probability density function"
- check [16] for hinge loss
- check [18]: info destroyed by Enc-Dec structures
- noise effect: noise added to the discriminator input fixes the instability and vanishing-gradient issues [21]
My result
My reimplementation (now private)
preliminary: 2.37 iter/sec on a P100 (Google Colab Tesla P100-PCIE-16GB), so 400K iterations take ~46 h
Subjectively not drastically different from CycleGAN-VC1 (which seems reasonable given the model size)
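As a sanity check on the training-time estimate, assuming constant throughput over the whole 400K-iteration schedule:

```python
# Throughput measured on a Colab P100: 2.37 iter/s
iters, iter_per_sec = 400_000, 2.37
hours = iters / iter_per_sec / 3600
print(f"{hours:.1f} h")  # 46.9 h
```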
“we think that such high-level abstraction increases the risk of destroying linguistic information and the time structure of input speech.” from Scyclone paper↩