たれぱんのびぼーろく

My personal notes, probably mostly biology and programming

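Survey: STFT loss in the audio waveform domain

A survey of work on speech waveform generation that applies an STFT to the generated waveform and uses the result in the loss function.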

  • Parallel WaveGAN
  • NSF
  • HiFi-GAN
  • MultiBand-MelGAN
  • StyleMelGAN

My impression is that every SoTA GAN-based vocoder adopts it.

| model | loss name | reference | loss | intent |
|---|---|---|---|---|
| PWG [1] | multi-resolution STFT auxiliary loss | spec | Lsc & Lmag | stability and speed [6] |
| HiFi-GAN [3] | Mel-Spectrogram Loss | mel-spec | L1 | stability [4] & perceptual quality [10] |
| MB-MelGAN [5] | multi-resolution STFT loss | spec | Lsc & Lmag | speed [7] |
| StyleMelGAN [8] | multi-scale spectral reconstruction loss | spec | Lsc & Lmag | prevent adversarial artifacts [9] |

Lsc: spectral convergence loss
Lmag: log STFT magnitude loss
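For reference, the loss terms in the table can be written out as below. This is a sketch of the definitions as I understand them from the PWG and HiFi-GAN papers, not a quotation. The symbols are my own notation: x is the reference waveform, x̂ the generated waveform, |STFT(·)| the STFT magnitude, N the number of elements in the magnitude, and φ the waveform-to-mel-spectrogram transform. Expectations over the training data are omitted.

```latex
L_{\mathrm{sc}}(x, \hat{x})
  = \frac{\big\| \, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \, \big\|_F}
         {\big\| \, |\mathrm{STFT}(x)| \, \big\|_F}

L_{\mathrm{mag}}(x, \hat{x})
  = \frac{1}{N} \big\| \log|\mathrm{STFT}(x)| - \log|\mathrm{STFT}(\hat{x})| \big\|_1

L_{\mathrm{mel}}(x, \hat{x})
  = \big\| \phi(x) - \phi(\hat{x}) \big\|_1
```

Lsc is a relative error, so it is dominated by large-magnitude components, while Lmag works in the log domain and therefore also weights small-magnitude components.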

“Not only amplitude spectra but also phase spectra obtained from generated speech waveforms are used to calculate the proposed loss.”
from “STFT Spectral Loss for Training a Neural Speech Waveform Model”

Adversarial loss + reconstruction loss is a standard recipe for GANs [2].
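Written as a formula, the generator objective has roughly the following shape. This is a sketch from my reading of the PWG and HiFi-GAN papers; the λ weights are hyperparameters, and the exact values and normalization should be checked against each paper.

```latex
% Generic form: adversarial term plus a weighted reconstruction term
L_G = L_{\mathrm{adv}} + \lambda \, L_{\mathrm{rec}}

% PWG: the STFT auxiliary loss is the base term, the adversarial term is weighted
L_G = L_{\mathrm{aux}} + \lambda_{\mathrm{adv}} \, L_{\mathrm{adv}}

% HiFi-GAN: adversarial + feature matching + mel-spectrogram loss
L_G = L_{\mathrm{adv}} + \lambda_{\mathrm{fm}} \, L_{\mathrm{fm}} + \lambda_{\mathrm{mel}} \, L_{\mathrm{mel}}
```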
multi-resolution STFT loss: STFT losses computed with several different n_fft values are combined into one overall loss
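A minimal NumPy sketch of this idea, assuming the loss at each resolution is Lsc + Lmag and that the per-resolution losses are simply averaged. The function names and the (n_fft, hop) pairs in `RESOLUTIONS` are hypothetical and chosen only for illustration. The window length is set equal to n_fft for simplicity. An actual training setup would use differentiable tensor operations instead of NumPy.

```python
import numpy as np


def stft_magnitude(x, n_fft, hop, eps=1e-7):
    """Return |STFT(x)| with a Hann window, shape (frames, n_fft // 2 + 1)."""
    if len(x) < n_fft:
        raise ValueError("signal is shorter than n_fft")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Row i of `idx` holds the sample indices of frame i.
    idx = hop * np.arange(n_frames)[:, None] + np.arange(n_fft)[None, :]
    spec = np.fft.rfft(x[idx] * window, axis=-1)
    # Clamp so that the log in Lmag stays finite.
    return np.maximum(np.abs(spec), eps)


def spectral_convergence(mag_ref, mag_gen):
    """Lsc: Frobenius norm of the magnitude error, relative to the reference."""
    return np.linalg.norm(mag_ref - mag_gen) / np.linalg.norm(mag_ref)


def log_stft_magnitude(mag_ref, mag_gen):
    """Lmag: mean absolute difference of the log magnitudes."""
    return np.mean(np.abs(np.log(mag_ref) - np.log(mag_gen)))


# Hypothetical (n_fft, hop) pairs; real configurations are tuned per model.
RESOLUTIONS = ((512, 128), (1024, 256), (2048, 512))


def multi_resolution_stft_loss(ref, gen, resolutions=RESOLUTIONS):
    """Average of (Lsc + Lmag) over several STFT resolutions."""
    total = 0.0
    for n_fft, hop in resolutions:
        mag_ref = stft_magnitude(ref, n_fft, hop)
        mag_gen = stft_magnitude(gen, n_fft, hop)
        total += spectral_convergence(mag_ref, mag_gen)
        total += log_stft_magnitude(mag_ref, mag_gen)
    return total / len(resolutions)


# Example: a 440 Hz tone at 16 kHz versus a noisy copy of it.
rng = np.random.default_rng(0)
t = np.arange(8192) / 16000.0
reference = np.sin(2 * np.pi * 440.0 * t)
generated = reference + 0.1 * rng.standard_normal(t.shape)
print(multi_resolution_stft_loss(reference, generated))
```

A short n_fft gives good time resolution and a long n_fft gives good frequency resolution, so mixing several sizes keeps the generator from fitting only one fixed time-frequency trade-off.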

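It helps stabilize and speed up GAN training (see notes 4, 6, 7, and 11).
Whether it also contributes to the final quality is unclear to me (an STFT is, admittedly, a fairly coarse loss).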

cf. the MB-MelGAN ablation study
cf. the HiFi-GAN ablation study (the MOS is only 3.25, so I suspect the training more or less failed)


  1. “we propose a multi-resolution STFT auxiliary loss.” from the PWG paper

  2. “Referring to previous work (Isola et al., 2017), applying a reconstruction loss to GAN model helps to generate realistic results” from the HiFi-GAN paper

  3. “In addition to the GAN objective, we add a mel-spectrogram loss to improve the training efficiency of the generator and the fidelity of the generated audio” from the HiFi-GAN paper

  4. “The mel-spectrogram loss helps the generator to … stabilizes the adversarial training process from the early stages.” from the HiFi-GAN paper

  5. “we adopt the multi-resolution STFT loss” from the MB-MelGAN paper

  6. “To improve the stability and efficiency of the adversarial training process” from the PWG paper

  7. “the convergence process extremely slow. To solve this problem, we adopt” from the MB-MelGAN paper

  8. “regularized by a multi-scale spectral reconstruction loss.” from the StyleMelGAN paper

  9. “to prevent the emergence of adversarial artifacts.” from the StyleMelGAN paper

  10. “also be expected to have the effect of focusing more on improving the perceptual quality” from the HiFi-GAN paper

  11. “we observed that the quality improves more stably when the loss is applied.” from the HiFi-GAN paper