A survey of work that uses the STFT of the generated waveform in the loss function for speech waveform generation:
- Parallel WaveGAN
- NSF
- HiFi-GAN
- MultiBand-MelGAN
- StyleMelGAN
My impression is that every SoTA GAN-based vocoder adopts it.
| model | loss name | reference | loss | intent |
|---|---|---|---|---|
| PWG | multi-resolution STFT auxiliary loss | spec | Lsc & Lmag | stability and speed |
| HiFi-GAN | Mel-Spectrogram Loss | mel-spec | L1 | stability & perceptual quality |
| MB-MelGAN | multi-resolution STFT loss | spec | Lsc & Lmag | speed |
| StyleMelGAN | multi-scale spectral reconstruction loss | spec | Lsc & Lmag | prevent adversarial artifacts |
- Lsc: spectral convergence loss
- Lmag: log STFT magnitude loss
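
For reference, the two terms are usually written as below. This is the form as I recall it from the PWG paper, so treat the notation as an assumption: x is the target waveform, x̂ the generated one, |STFT(·)| the STFT magnitude, ‖·‖_F the Frobenius norm, ‖·‖_1 the L1 norm, and N the number of elements in the magnitude spectrogram.

```latex
L_{sc}(x, \hat{x}) = \frac{\left\lVert\, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \,\right\rVert_F}{\left\lVert\, |\mathrm{STFT}(x)| \,\right\rVert_F}
\qquad
L_{mag}(x, \hat{x}) = \frac{1}{N} \left\lVert\, \log|\mathrm{STFT}(x)| - \log|\mathrm{STFT}(\hat{x})| \,\right\rVert_1
```

Lsc is normalized by the energy of the target, so it is dominated by the large spectral components, while Lmag works in the log domain and so weights the small components more.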
“Not only amplitude spectra but also phase spectra obtained from generated speech waveforms are used to calculate the proposed loss.” from the paper “STFT Spectral Loss for Training a Neural Speech Waveform Model”
Adversarial loss + reconstruction loss is a standard recipe for GANs (the HiFi-GAN paper cites Isola et al., 2017 for this).
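
As a rough sketch, the generator objective then has the shape below. L_adv, L_rec, and the weight λ are generic placeholders I am assuming here, not the notation of any one paper; each model chooses its own reconstruction term (STFT-based or mel-based) and its own weighting.

```latex
L_G = L_{adv}(G; D) + \lambda \, L_{rec}(G)
```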
multi-resolution STFT loss: STFT losses computed with several different n_fft values are combined into the overall loss.
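
The idea can be sketched in plain Python as follows. This is a minimal, dependency-free illustration rather than any paper's implementation: the function names (`stft_magnitude`, `stft_loss`, `multi_resolution_stft_loss`), the naive DFT, the window length being equal to n_fft, the equal-weight average over resolutions, and the tiny toy resolutions are all assumptions made for readability. Real training code would use a GPU STFT and much larger FFT sizes.

```python
import cmath
import math


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram from a naive DFT with a Hann window (no padding)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        mags = []
        for k in range(n_fft // 2 + 1):  # keep non-negative frequencies only
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def stft_loss(target, generated, n_fft, hop, eps=1e-7):
    """Return (Lsc, Lmag) for a single STFT resolution."""
    ref = stft_magnitude(target, n_fft, hop)
    est = stft_magnitude(generated, n_fft, hop)
    if not ref or len(ref) != len(est):
        raise ValueError("signals must have equal length of at least n_fft")
    pairs = [(r, e) for fr, fe in zip(ref, est) for r, e in zip(fr, fe)]
    # Spectral convergence: Frobenius norm of the difference over that of the target.
    diff_norm = math.sqrt(sum((r - e) ** 2 for r, e in pairs))
    ref_norm = math.sqrt(sum(r ** 2 for r, _ in pairs))
    l_sc = diff_norm / (ref_norm + eps)
    # Log STFT magnitude: mean absolute difference of log magnitudes.
    l_mag = sum(abs(math.log(r + eps) - math.log(e + eps)) for r, e in pairs) / len(pairs)
    return l_sc, l_mag


def multi_resolution_stft_loss(target, generated, resolutions):
    """Average Lsc + Lmag over a list of (n_fft, hop) resolutions."""
    total = 0.0
    for n_fft, hop in resolutions:
        l_sc, l_mag = stft_loss(target, generated, n_fft, hop)
        total += l_sc + l_mag
    return total / len(resolutions)


# Toy usage: a sine wave versus the same wave with an extra component.
clean = [math.sin(2 * math.pi * 5 * n / 128) for n in range(128)]
noisy = [s + 0.1 * math.cos(2 * math.pi * 20 * n / 128) for n, s in enumerate(clean)]
toy_resolutions = [(16, 4), (32, 8), (64, 16)]  # (n_fft, hop), far smaller than in practice
print(multi_resolution_stft_loss(clean, noisy, toy_resolutions))
```

Mixing resolutions means that no single time-frequency trade-off dominates: a small n_fft gives fine time resolution and a large n_fft gives fine frequency resolution.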
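It helps stabilize and speed up GAN training.
Whether it contributes to the final quality is unclear (an STFT-based loss is admittedly a rather coarse one).
cf. the MB-MelGAN ablation study
cf. the HiFi-GAN ablation study (the MOS is only 3.25, so I suspect training more or less failed)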
- “we propose a multi-resolution STFT auxiliary loss.” from the PWG paper
- “Referring to previous work (Isola et al., 2017), applying a reconstruction loss to GAN model helps to generate realistic results” from the HiFi-GAN paper
- “In addition to the GAN objective, we add a mel-spectrogram loss to improve the training efficiency of the generator and the fidelity of the generated audio” from the HiFi-GAN paper
- “The mel-spectrogram loss helps the generator to … stabilizes the adversarial training process from the early stages.” from the HiFi-GAN paper
- “we adopt the multi-resolution STFT loss” from the MB-MelGAN paper
- “To improve the stability and efficiency of the adversarial training process” from the PWG paper
- “the convergence process extremely slow. To solve this problem, we adopt” from the MB-MelGAN paper
- “regularized by a multi-scale spectral reconstruction loss.” from the StyleMelGAN paper
- “to prevent the emergence of adversarial artifacts.” from the StyleMelGAN paper
- “also be expected to have the effect of focusing more on improving the perceptual quality” from the HiFi-GAN paper
- “we observed that the quality improves more stably when the loss is applied.” from the HiFi-GAN paper