An in-depth explanation of GANSynth, which achieves music generation with outstanding quality and conditioning

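Three-line summary

Pitch-conditional, multi-instrument music generation using the NSynth dataset
The decisive factor is the data representation: a compressionless, mel-scale, log-magnitude rainbowgram
The generator only does Conv & upsampling and the discriminator only does Conv, yet the model generates strikingly natural-sounding music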

Overview

Jesse Engel, et al. GANSynth: Adversarial Neural Audio Synthesis. ICLR2019 conference paper.
official implementation(magenta)
The authors are mainly from Google (Brain, AI).

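Underlying problem: efficient audio synthesis is hard for machine learning because human perception is sensitive to both global structure and fine-scale local waveform coherence.
Limits of existing methods: autoregressive models (e.g. WaveNet) can model local structure, but they sacrifice global latent conditioning and are slow because of iterative sampling. GANs allow global latent conditioning and efficient parallel sampling, but struggle to synthesize locally coherent audio waveforms.
Key insight: there should be a data representation that is easier to learn.
Method and results: model log magnitudes and instantaneous frequencies in the spectral domain with sufficient frequency resolution. Result: metrics that surpass the WaveNet baseline, with generation about 54,000 times faster.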

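Key points

Using magnitude and phase angle, and quantities derived from them, as the data representation is a fairly simple idea.
However, it had not worked well until now (c.f. NSynth).
The key to success was raising the frequency resolution in the low-frequency range (Table 1).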

Refs

WaveNet (Global conditioning signal)
- Parallel-WaveNet
- NVIDIA WaveNet
- Engel et al. 2017 ??

Background

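Success of autoregressive (AR) audio models, and their weaknesses in efficiency and global structure
Success of image GANs (high efficiency, learning of global features)
Audio GANs are still inadequate
Characteristic of the audio domain: periodicity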

AR

Audio generation with autoregressive models: works well, but has problems with global structure and training efficiency

GAN

Huge success in image generation (GAN, DCGAN, WGAN, WGAN-GP, BEGAN?, DRAGAN, PGGAN, Spectral Norm)
Also successful in domain translation (pix2pix, CycleGAN, Unsupervised Creation of Parameterized Avatars, Towards the Automatic Anime Characters Creation with Generative Adversarial Networks)
Audio generation with GANs is still so-so (WaveGAN, SpecGAN)
(Author's note: a string of music-generation GANs appeared around the same time as GANSynth)

Circumstances specific to the audio domain

Audio is periodic

Learned filters often end up as log-frequency filters (Dieleman 2014, Zhu 2016)

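Humans are picky about local coherence

Humans are sensitive to local coherence (time scales of 1 ms - 100 ms), i.e. to periodicity and to its breakdowns and irregularities
Hence the emphasis on periodicity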

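A periodic continuous signal precesses when discretized

Coherence matters, but generating coherent waves with an STFT or a CNN is hard because of phase precession.
(Even though the absolute phase value itself is not very important as long as the signal is coherent,) the phase precesses from frame to frame, so the phase at every point has to be estimated accurately to preserve coherence.
Because many frequency components are mixed together (or can be interpreted that way), some of them inevitably have a periodicity that does not line up with the frame stride.
(Is this a phase-estimation problem, or a frame-misalignment problem (c.f. window functions)?)
This is exactly the phase precession seen in the STFT (the paper notes that an STFT is a strided filterbank and in that sense resembles a CNN)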

Phase precession also occurs in situations where filterbanks overlap (window or kernel size < stride). ??

The phase precesses at a rate set by the angular-frequency difference between the frame stride and the signal periodicity
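To make the precession concrete, here is a minimal NumPy sketch (not taken from the paper or the magenta code). It frames a pure sinusoid with the frame size (1024), stride (256) and sample rate (16 kHz) that appear in this post, and measures how much the phase of the strongest bin advances from one frame to the next. The 440 Hz and 500 Hz test tones and the helper name `peak_bin_phase_steps` are assumptions chosen for illustration.

```python
import numpy as np

SR = 16000     # sample rate in Hz
N_FFT = 1024   # frame size
HOP = 256      # frame stride

def peak_bin_phase_steps(freq_hz, n_frames=8):
    """Wrapped per-frame phase advance of the strongest STFT bin of a pure tone."""
    n = np.arange(N_FFT + HOP * (n_frames - 1))
    x = np.cos(2 * np.pi * freq_hz * n / SR)
    window = np.hanning(N_FFT)
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)
    k = int(np.argmax(np.abs(spec[0])))      # strongest bin in the first frame
    phases = np.angle(spec[:, k])
    # Wrap the frame-to-frame difference back into (-pi, pi]
    return np.angle(np.exp(1j * np.diff(phases)))

# 440 Hz completes 7.04 cycles per stride, so the phase drifts every frame.
steps_440 = peak_bin_phase_steps(440.0)
# 500 Hz completes exactly 8 cycles per stride, so the phase stays put.
steps_500 = peak_bin_phase_steps(500.0)

# Predicted drift: 2*pi * f * HOP / SR, wrapped into (-pi, pi]
predicted_440 = np.angle(np.exp(1j * 2 * np.pi * 440.0 * HOP / SR))
```

Only tones whose period divides the stride keep a constant phase. Every other component drifts by its own amount per frame, which is what a network predicting raw phase would have to track.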

A better phase representation: instantaneous frequency

c.f. phase vocoder (Dolson, 1986)

It is a way of analyzing the true signal oscillation.

a time varying measure of the true signal oscillation

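(Characteristics of discrete signals)

Generating log magnitude and phase with a GAN yields more coherent waveforms than generating waveforms with strided convolutions
Estimating IF yields more coherent audio than estimating phase
Higher frequency resolution in the low-frequency range is important
Generates better audio about 54,000 times faster than the WaveNet baseline
Conditioning on a latent representation and a pitch vector enables timbre interpolation and a consistent timbre across pitches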
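The conditioning in the last item can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the official magenta code: the latent size of 256, the 61 pitch classes (MIDI 24-84, the range used for the dataset subset), the linear interpolation, and the helper names `one_hot_pitch` and `generator_input` are all assumptions made here.

```python
import numpy as np

LATENT_DIM = 256                        # assumed latent size, illustration only
MIDI_LOW, MIDI_HIGH = 24, 84            # pitch range of the dataset subset
N_PITCHES = MIDI_HIGH - MIDI_LOW + 1    # 61 pitch classes

def one_hot_pitch(midi_pitch):
    """One-hot vector for a MIDI pitch in [MIDI_LOW, MIDI_HIGH]."""
    v = np.zeros(N_PITCHES)
    v[midi_pitch - MIDI_LOW] = 1.0
    return v

def generator_input(z, midi_pitch):
    """Append the pitch one-hot to the latent vector (hypothetical helper)."""
    return np.concatenate([z, one_hot_pitch(midi_pitch)])

rng = np.random.default_rng(0)
z_a = rng.standard_normal(LATENT_DIM)   # "timbre" A
z_b = rng.standard_normal(LATENT_DIM)   # "timbre" B

# Interpolate the latent part while holding the pitch fixed at MIDI 60:
# the pitch one-hot stays the same and only the timbre part moves.
inputs = np.stack([
    generator_input((1 - t) * z_a + t * z_b, 60)
    for t in np.linspace(0.0, 1.0, 5)
])
```

Swapping the pitch argument while keeping `z` fixed gives the other direction: the same timbre played at a different pitch.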

Prior Works

Audio Synthesis

WaveGAN/SpecGAN

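Reproduction

About 3-4 days on a single V100 GPU for the NSynth dataset¹ (the paper says 4.5 days²)
V100: FP32 14.90 TFLOPS,
K80: FP32 4.368 TFLOPS/core,
so a full training run would take about 15 days on Google Colaboratory.
At roughly $2/hour, one training run on a V100 comes to about 20,000 yen. Surprisingly cheap?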
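The estimate above can be checked with a few lines of arithmetic. This sketch assumes that training time scales inversely with peak FP32 throughput, which is a crude approximation. The exchange rate of 110 yen per dollar is an assumed value for illustration.

```python
V100_TFLOPS = 14.90      # FP32, from the note above
K80_TFLOPS = 4.368       # FP32 per core, from the note above
USD_PER_HOUR = 2.0       # rough V100 price, from the note above
JPY_PER_USD = 110.0      # assumed exchange rate

def k80_days(v100_days):
    """Scale V100 training time by the ratio of peak FP32 throughput."""
    return v100_days * V100_TFLOPS / K80_TFLOPS

def v100_cost_jpy(v100_days):
    """Cost in yen of renting one V100 for the whole run."""
    return v100_days * 24 * USD_PER_HOUR * JPY_PER_USD

days_on_k80 = k80_days(4.5)        # paper's 4.5 days on a V100
cost_low = v100_cost_jpy(3.0)      # 3-day run
cost_high = v100_cost_jpy(4.5)     # 4.5-day run

print(f"K80: about {days_on_k80:.1f} days")
print(f"V100 cost: {cost_low:,.0f} to {cost_high:,.0f} yen")
```

Under these assumptions the 3 to 4.5 day range brackets the "about 20,000 yen" figure.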

Model

PGGAN (just Conv / TransConv all the way)
WGAN-GP
pixel normalization
Progressive training makes little difference either way (slightly better with it) ³.
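Pixel normalization, as defined in the PGGAN paper, rescales the feature vector at each spatial position so that its mean square over channels is 1. A minimal NumPy sketch follows. The channels-last layout and the function name `pixel_norm` are assumptions made here, not code from the official implementation.

```python
import numpy as np

def pixel_norm(x, eps=1e-8):
    """Normalize each pixel's feature vector to unit mean-square.

    x: array of shape (batch, height, width, channels), channels last.
    """
    mean_sq = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps)

features = np.random.default_rng(0).standard_normal((2, 4, 4, 8))
normalized = pixel_norm(features)
```

There are no learned parameters: the operation only keeps activation magnitudes from drifting.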

Network

simple decoder-encoder by CNN with upsampling/downsampling

Conv kernel: (4, 4) or (3, 3)
Decoder: Upsampling (2x2 element replication) => (conv channel reduction & LReLU) x2
Encoder: (conv channel increasing & LReLU) x2 => Downsampling (average pooling)
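The two resampling steps above are simple enough to write out directly. The following NumPy sketch shows 2x2 element replication and 2x2 average pooling on a channels-last array; the convolutions are omitted. The function names and the (2, 16) starting grid used in the shape check are assumptions for illustration.

```python
import numpy as np

def upsample_2x(x):
    """2x2 element replication. x: (height, width, channels)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def downsample_2x(x):
    """2x2 average pooling. Height and width must be even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

x = np.arange(2 * 16 * 3, dtype=np.float64).reshape(2, 16, 3)
up = upsample_2x(x)         # each element becomes a 2x2 block
down = downsample_2x(up)    # averaging the blocks recovers x

# Starting from an assumed (2, 16) grid, six doublings reach (128, 1024).
shape = (2, 16)
for _ in range(6):
    shape = (shape[0] * 2, shape[1] * 2)
```

Average pooling exactly undoes element replication, so the two networks mirror each other in resolution at every stage.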

from latent (512, 1, 1) =>
STFT: (256, 512, 2)
STFT: (128, 1024, 2)

Key terms

perceptual fidelity
coherent waveform
Phase coherence
phase irregularities
waveform modulo

Rainbowgrams

tarepan.hatenablog.com

harmonic frequency

IF spectra forms solid bold lines where the harmonic frequencies are present. The IF is noisy outside these regions but they have very little effect on the resynthesized sound as there is little magnitude present at those times and frequencies.

data representation (features)

compressionless-mel-scale log magnitude rainbowgram
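Notes on the naming convention
 For a spectrogram, both magnitude and power are natural choices
 The wording in Deep Voice 3 is close to this

Acoustic features

The overall pipeline is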
waveform -> STFT magnitudes & phase angles -> scaling -> GANSynth -> re-scaling -> ISTFT -> waveform

waveform

4 sec / 16 kHz => 64,000 dim

wave to features

STFT using TensorFlow's built-in function (returns complex64)

STFT
- stride : 256
- frame size: 1024
  - => 75% overlap

Frequency bins are truncated at the Nyquist frequency (x: 513 => 512)
Time bins are padded to 256 (251 => 256)

Magnitude

Take the log and rescale.
Scale to (-1, 1).
Convert to mel scale (keeping the number of frequency bins unchanged) (by interpolation?)
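The shape bookkeeping above (64,000 samples => 251 frames x 513 bins => 256 x 512) can be reproduced with a small NumPy STFT. This is a sketch under stated assumptions, not the TensorFlow code used by the official implementation. It assumes centered framing with reflect padding (which yields the 251 frames), a Hann window, zero padding at the end of the time axis, and min-max scaling. The mel-scale conversion is left out.

```python
import numpy as np

SR, SECONDS = 16000, 4
N_FFT, HOP = 1024, 256              # 1 - HOP / N_FFT = 75% overlap

def stft(x):
    """Centered STFT: returns (frames, N_FFT // 2 + 1) complex values."""
    padded = np.pad(x, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    return np.fft.rfft(padded[idx] * np.hanning(N_FFT), axis=-1)

# A 4-second test tone stands in for an NSynth note.
x = np.sin(2 * np.pi * 440.0 * np.arange(SR * SECONDS) / SR)

raw = stft(x)                                             # (251, 513)
spec = raw[:, :-1]                                        # drop Nyquist bin: (251, 512)
spec = np.pad(spec, ((0, 256 - spec.shape[0]), (0, 0)))   # pad time: (256, 512)

# Log magnitude, then min-max scaling (an assumed scaling scheme).
log_mag = np.log(np.abs(spec) + 1e-6)
log_mag_scaled = 2 * (log_mag - log_mag.min()) / (log_mag.max() - log_mag.min()) - 1

phase = np.angle(spec)                                    # wrapped phase angle
features = np.stack([log_mag_scaled, phase / np.pi], axis=-1)   # (256, 512, 2)
```

The resulting array has the (256, 512, 2) "image" layout that the network consumes: one channel for magnitude and one for phase.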

Phase angle

Experiments were run with both a phase model and an instantaneous frequency (IF) model.

waveform => an "image" of size (256, 512, 2) (x, x, |z|&∠z)
unroll the phase angle and take the finite difference -> "Instantaneous frequency" model

IF

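IF is obtained by taking the finite difference ⁴
In Figure 1, the IF of frame 0 (which starts at angle 0) appears to be treated as 0, so would it be reasonable to keep the initial phase as the IF of frame 0 and express the rest as differences?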
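A minimal NumPy sketch of this computation follows. It adopts the guess in the note above: the first frame keeps its initial phase and the remaining frames hold the finite difference of the unwrapped phase. The division by pi, which maps the values into [-1, 1], and the function name `instantaneous_frequency` are assumptions made here.

```python
import numpy as np

def instantaneous_frequency(phase, axis=0):
    """Unwrap the phase along time, then take the finite difference.

    phase: wrapped angles in radians, with time along `axis`.
    Frame 0 keeps the initial phase so the operation can be inverted.
    """
    unwrapped = np.unwrap(phase, axis=axis)
    first = np.take(unwrapped, [0], axis=axis)
    dphase = np.diff(unwrapped, axis=axis)
    return np.concatenate([first, dphase], axis=axis) / np.pi

# A phase that advances by a constant 2.0 rad per frame.
true_phase = 0.3 + 2.0 * np.arange(10)
wrapped = np.angle(np.exp(1j * true_phase))    # what np.angle(stft) would give

inst_freq = instantaneous_frequency(wrapped)   # constant after frame 0
recovered = np.cumsum(inst_freq * np.pi)       # cumulative sum inverts the difference
```

The wrapped phase jumps around between frames, while the IF of a steady tone is a flat line, which is a much easier target for a convolutional generator.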

Dataset

subset of NSynth (70,379 samples from total 300,000 musical notes)
criterion: acoustic instruments and fundamental pitches ranging from MIDI 24-84 (∼32-1000Hz) (these sound natural to an average listener)
80:20 = train : test
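The pitch range quoted above can be checked with the standard equal-temperament conversion, where MIDI note 69 is A4 = 440 Hz. The function name `midi_to_hz` is chosen here for illustration.

```python
def midi_to_hz(midi_note, a4_hz=440.0):
    """Convert a MIDI note number to frequency in Hz (12-tone equal temperament)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12)

low_hz = midi_to_hz(24)      # C1, roughly 32.7 Hz
high_hz = midi_to_hz(84)     # C6, roughly 1046.5 Hz
n_pitches = 84 - 24 + 1      # number of distinct pitches in the range

print(f"MIDI 24-84 spans {low_hz:.1f} Hz to {high_hz:.1f} Hz ({n_pitches} pitches)")
```

The range covers five octaves, so every fundamental sits well below the 8 kHz Nyquist frequency of 16 kHz audio.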

NSynth

They tried estimating phase with a neural network, but it did not work well

We also explored several representations of phase,
including instantaneous frequency and circular normal cost functions ...,
but in each case independently estimating phase and magnitude led to poor sample quality due to phase errors.

In the end: FFT -> estimate magnitude -> iterative phase estimation with Griffin-Lim

estimating only the magnitude
reconstruct the phase (Griffin & Lim, 1984)
a large FFT size (1024) relative to the hop size (256) and ran the algorithm for 1000 iterations

FFT size: analogous to kernel size
hop size: analogous to stride
one segment corresponds to one frame

numpyでスペクトログラムによる音楽信号の可視化 - Qiita

To be organized

Audio synthesis

(Neural) Audio synthesis

¹ GANSynth can train on the NSynth dataset in ~3-4 days on a single V100 GPU. ref
² We train each GAN variant for 4.5 days on a single V100 GPU, with a batch size of 8. (from paper)
³ We also try training both progressive and nonprogressive variants, and see comparable quality in both.
⁴ We optionally unroll the phase angle and take the finite difference
