Survey: STFT loss in the speech waveform domain

A survey of work on speech waveform generation that applies an STFT-based loss to the generated waveform.

Parallel WaveGAN
NSF
HiFi-GAN
MultiBand-MelGAN
StyleMelGAN

My impression is that every SoTA GAN-based vocoder adopts one.

model	loss name	reference	loss	intent
PWG¹	multi-resolution STFT auxiliary loss	spec	L_sc & L_mag	stability and speed¹
HiFi-GAN¹	Mel-Spectrogram Loss	mel-spec	L1	stability¹ & perceptual quality²
MB-MelGAN¹	multi-resolution STFT loss	spec	L_sc & L_mag	speed¹
StyleMelGAN¹	multi-scale spectral reconstruction loss	spec	L_sc & L_mag	prevent adversarial artifacts¹

L_sc: spectral convergence loss
L_mag: log STFT magnitude loss
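
For reference, the losses in the table can be written out roughly as below. This is my paraphrase of the definitions in the PWG and HiFi-GAN papers, so the notation is an assumption: x is the reference waveform, \hat{x} the generated one, |STFT(.)| the STFT magnitude, N the number of elements in that magnitude, and \phi the transform from a waveform to a mel-spectrogram. The last line is the L1 entry in the HiFi-GAN row (the paper takes its expectation over the data).

```latex
\begin{aligned}
L_{\mathrm{sc}}(x, \hat{x}) &= \frac{\big\| \, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \, \big\|_F}{\big\| \, |\mathrm{STFT}(x)| \, \big\|_F} \\
L_{\mathrm{mag}}(x, \hat{x}) &= \frac{1}{N} \big\| \log|\mathrm{STFT}(x)| - \log|\mathrm{STFT}(\hat{x})| \big\|_1 \\
L_{\mathrm{mel}}(x, \hat{x}) &= \big\| \phi(x) - \phi(\hat{x}) \big\|_1
\end{aligned}
```

L_sc is a relative error, so it is dominated by high-energy bins, while L_mag works in the log domain and gives low-energy bins more weight.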

“Not only amplitude spectra but also phase spectra obtained from generated speech waveforms are used to calculate the proposed loss.” from “STFT Spectral Loss for Training a Neural Speech Waveform Model”

Adversarial loss + reconstruction loss is a standard recipe for GANs¹.
multi-resolution STFT loss: STFT losses computed with different n_fft are combined into a single overall loss.
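
A minimal sketch of the multi-resolution idea, standard library only so that it runs anywhere. All function names, the (n_fft, hop) pairs, and the eps value are hypothetical choices for illustration, not taken from any of the papers. It also ties the window length to n_fft, whereas PWG varies the window length separately. A real training loop would use a differentiable STFT such as torch.stft instead of this naive DFT.

```python
import cmath
import math


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram from a naive Hann-windowed DFT (no padding)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    # Precompute the DFT twiddle factors exp(-2*pi*j*m/n_fft).
    twiddle = [cmath.exp(-2j * math.pi * m / n_fft) for m in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        # Keep only the non-negative frequency bins 0 .. n_fft/2.
        frames.append([
            abs(sum(frame[n] * twiddle[(k * n) % n_fft] for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def _flatten(frames):
    return [value for frame in frames for value in frame]


def spectral_convergence(mag_ref, mag_gen):
    """L_sc: Frobenius norm of the magnitude error, relative to the reference."""
    ref, gen = _flatten(mag_ref), _flatten(mag_gen)
    error = math.sqrt(sum((r - g) ** 2 for r, g in zip(ref, gen)))
    return error / math.sqrt(sum(r ** 2 for r in ref))  # assumes a non-silent reference


def log_magnitude_loss(mag_ref, mag_gen, eps=1e-7):
    """L_mag: mean absolute difference of log magnitudes (eps avoids log(0))."""
    ref, gen = _flatten(mag_ref), _flatten(mag_gen)
    return sum(abs(math.log(r + eps) - math.log(g + eps)) for r, g in zip(ref, gen)) / len(ref)


def multi_resolution_stft_loss(ref_wave, gen_wave, resolutions=((32, 8), (64, 16), (128, 32))):
    """Average L_sc + L_mag over several (n_fft, hop) analysis settings."""
    total = 0.0
    for n_fft, hop in resolutions:
        mag_ref = stft_magnitude(ref_wave, n_fft, hop)
        mag_gen = stft_magnitude(gen_wave, n_fft, hop)
        total += spectral_convergence(mag_ref, mag_gen) + log_magnitude_loss(mag_ref, mag_gen)
    return total / len(resolutions)


# Toy check: a 256-sample sinusoid against itself and against a quieter copy.
ref = [math.sin(2 * math.pi * 0.05 * t) for t in range(256)]
quieter = [0.9 * x for x in ref]
print(multi_resolution_stft_loss(ref, ref))      # 0.0 for identical waveforms
print(multi_resolution_stft_loss(ref, quieter))  # positive: the magnitudes differ
```

Small n_fft gives fine time resolution and large n_fft gives fine frequency resolution, so averaging over several settings keeps the generator from fitting only one fixed time-frequency trade-off.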

It contributes to stabilizing and speeding up GAN training².
Whether it improves final quality is unclear (an STFT loss is admittedly a fairly coarse one).

cf. the MB-MelGAN ablation study
cf. the HiFi-GAN ablation study (the MOS is only 3.25, so I suspect that training run partly failed)

“we propose a multi-resolution STFT auxiliary loss.” from the PWG paper
“Referring to previous work (Isola et al., 2017), applying a reconstruction loss to GAN model helps to generate realistic results” from the HiFi-GAN paper
“In addition to the GAN objective, we add a mel-spectrogram loss to improve the training efficiency of the generator and the fidelity of the generated audio” from the HiFi-GAN paper
“The mel-spectrogram loss helps the generator to … stabilizes the adversarial training process from the early stages.” from the HiFi-GAN paper
“we adopt the multi-resolution STFT loss” from the MB-MelGAN paper
“To improve the stability and efficiency of the adversarial training process” from the PWG paper
“the convergence process extremely slow. To solve this problem, we adopt” from the MB-MelGAN paper
“regularized by a multi-scale spectral reconstruction loss.” from the StyleMelGAN paper
“to prevent the emergence of adversarial artifacts.” from the StyleMelGAN paper
“also be expected to have the effect of focusing more on improving the perceptual quality” from the HiFi-GAN paper
“we observed that the quality improves more stably when the loss is applied.” from the HiFi-GAN paper

たれぱんのびぼーろく