Tarepan's Memorandum

My personal notes; mostly biology and programming.

VQ-VAE audio (van den Oord et al., 2017)

Section 4.3 of the paper covers audio. Official samples are available.

Architecture

Encoder

[Conv1d (k4, s2)] x6 [1]
Information content: ±16 msec with half-overlap
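To make the shape arithmetic concrete, here is a minimal sketch of such an encoder stack, assuming PyTorch; the channel width (128) and the ReLU activations are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the encoder above: 6 strided Conv1d layers (kernel 4,
# stride 2), i.e. 2^6 = 64x temporal downsampling, so a 16 kHz waveform
# becomes 250 Hz latents. Channel width and activation are assumptions.
class Encoder(nn.Module):
    def __init__(self, channels: int = 128):
        super().__init__()
        layers, in_ch = [], 1  # raw waveform has 1 channel
        for _ in range(6):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=4, stride=2, padding=1),
                       nn.ReLU()]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) -> (batch, channels, samples // 64)
        return self.net(wav)

z = Encoder()(torch.randn(1, 1, 16000))  # 1 second at 16 kHz
print(z.shape)  # torch.Size([1, 128, 250]) -> 250 latent frames per second
```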

Decoder

DilatedConv, similar to WaveNet

From the paper: a "dilated convolutional architecture similar to WaveNet decoder".

"The decoder is conditioned on both the latents and a one-hot embedding for the speaker."
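A rough sketch of how that conditioning can be wired up, assuming PyTorch. The dense speaker embedding (equivalent to a one-hot vector times a weight matrix), the layer sizes, and the missing gated/residual units are simplifications; this is not the paper's actual WaveNet decoder.

```python
import torch
import torch.nn as nn

# Sketch of a WaveNet-style stack of dilated 1-D convolutions where every
# layer is conditioned on the latents (upsampled to sample rate) plus a
# speaker embedding. nn.Embedding == one-hot vector times a weight matrix.
class ConditionedDilatedStack(nn.Module):
    def __init__(self, channels=64, latent_dim=128, n_speakers=109, n_layers=6):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, channels)
        self.cond_proj = nn.Conv1d(latent_dim, channels, kernel_size=1)
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=2,
                      dilation=2 ** i, padding=2 ** i)
            for i in range(n_layers)  # dilations 1, 2, 4, ..., 32
        ])

    def forward(self, x, latents, speaker_id):
        # x: (batch, channels, samples); latents: (batch, latent_dim, samples)
        cond = self.cond_proj(latents) + self.speaker_emb(speaker_id).unsqueeze(-1)
        for conv in self.convs:
            # trim the padding tail so the sequence length stays constant
            x = torch.tanh(conv(x)[..., :x.size(-1)] + cond)
        return x

stack = ConditionedDilatedStack()
y = stack(torch.randn(1, 64, 16000), torch.randn(1, 128, 16000),
          torch.tensor([3]))  # speaker id 3 of VCTK's 109 speakers
print(y.shape)  # torch.Size([1, 64, 16000])
```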

Results

Because the dimensionality of the discrete representation is 64 times smaller, the original sample cannot be perfectly reconstructed sample by sample.

The reconstruction has the same content (same text contents), but the waveform is quite different and prosody in the voice is altered.

This experiment confirms our observations from before that important features are often those that span many dimensions in the input data space (in this case phoneme and other high-level content in waveform).

| 16 kHz → latent rate | latent shape | dataset | reconst. | VC | phoneme match |
| --- | --- | --- | --- | --- | --- |
| 250 Hz (x64) | 1 x 512 x time | VCTK | - | - | |
| 125 Hz (x128) | 1 x 128 x time | LibriSpeech | - | - | |
| ? Hz (x?) | ? | ? | - | - | |
| 25 Hz (x640) | 1 x 128 x time | - | | | 49.3% |
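The first column is just 16 kHz divided by the compression factor; a quick sanity check:

```python
# Latent frame rate for each compression factor in the table above.
for factor in (64, 128, 640):
    print(f"x{factor}: {16_000 / factor:g} Hz")
# x64: 250 Hz, x128: 125 Hz, x640: 25 Hz
```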

VC (voice conversion): keep the latents, change the speaker ID.

The latents extracted from one speaker are reconstructed with "the decoder using a separate speaker id."
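As a procedure this is just an encode/decode round trip with the id swapped. In the sketch below, `encoder`, `quantize`, and `wavenet_decode` are hypothetical stand-ins for the trained modules (the last hides the whole autoregressive sampling loop), not the paper's actual API:

```python
import torch

def convert_voice(wav_a: torch.Tensor, target_speaker: int) -> torch.Tensor:
    # `encoder`, `quantize`, `wavenet_decode` are hypothetical stand-ins
    # for the trained models.
    z_q = quantize(encoder(wav_a))  # discrete latents: content, not voice
    return wavenet_decode(z_q, speaker_id=torch.tensor([target_speaker]))
```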


  1. “We train a VQ-VAE where the encoder has 6 strided convolutions with stride 2 and window-size 4. This yields a latent space 64x smaller than the original waveform. The latents consist of one feature map and the discrete space is 512-dimensional.”
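For reference, a sketch of the nearest-neighbour lookup behind "the discrete space is 512-dimensional" (i.e. 512 codebook entries), assuming PyTorch; the code dimension D = 128 is an assumption and must match the encoder's output channels.

```python
import torch

K, D = 512, 128          # 512 codebook entries; D is an assumed code dim
codebook = torch.randn(K, D)

def quantize(z_e: torch.Tensor):
    # z_e: (batch, D, time) encoder output, one feature map per timestep
    z = z_e.permute(0, 2, 1)                      # (batch, time, D)
    dist = torch.cdist(z, codebook.unsqueeze(0))  # (batch, time, K) distances
    idx = dist.argmin(dim=-1)                     # nearest-neighbour index
    z_q = codebook[idx].permute(0, 2, 1)          # quantized (batch, D, time)
    return idx, z_q

idx, z_q = quantize(torch.randn(1, D, 250))       # 1 second of 250 Hz latents
print(idx.shape, z_q.shape)  # torch.Size([1, 250]) torch.Size([1, 128, 250])
```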