Section 4.3 of the paper covers audio. Official samples are available.
Architecture
Encoder
[Conv1d (k4, s2)] x6¹
Information per latent: about ±16 ms, with half-overlapping windows
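The downsampling arithmetic can be checked with a short script. A minimal sketch in plain Python, using the kernel/stride sizes from the footnoted quote; the paper does not specify the padding scheme, so valid (no-padding) convolution is assumed here:

```python
def conv_out_len(n, k=4, s=2):
    # output length of a 1-D valid convolution with kernel size k, stride s
    return (n - k) // s + 1

n = 16000  # one second of 16 kHz audio, in samples
for _ in range(6):  # six strided convolutions
    n = conv_out_len(n)
print(n)  # close to 250 latent steps per second, i.e. ~64x downsampling
```

Each stride-2 layer roughly halves the sequence length, so six layers give the 64x reduction quoted in the table below.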
Decoder
Dilated convolutional architecture, similar to the WaveNet decoder
The decoder is conditioned on both the latents and a one-hot speaker embedding.
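This kind of global speaker conditioning can be illustrated in a few lines. A toy NumPy sketch (shapes and names are illustrative, not from the paper): the one-hot speaker ID selects a learned per-channel bias that is broadcast across all timesteps, WaveNet-style.

```python
import numpy as np

rng = np.random.default_rng(0)
num_speakers, channels, T = 4, 8, 100  # toy sizes for illustration

# learned projection of the one-hot speaker ID to a per-channel bias;
# indexing row `speaker_id` is equivalent to multiplying by the one-hot vector
speaker_proj = rng.standard_normal((num_speakers, channels))

def add_speaker_conditioning(h, speaker_id):
    # h: (channels, T) activations; the speaker bias is broadcast over time
    return h + speaker_proj[speaker_id][:, None]

h = rng.standard_normal((channels, T))
out = add_speaker_conditioning(h, speaker_id=2)
```

Because the conditioning is constant over time, the same decoder can resynthesize the latents as any speaker it saw during training.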
Results
Because the dimensionality of the discrete representation is 64 times smaller, the original sample cannot be perfectly reconstructed sample by sample.
The reconstruction has the same content (the same text), but the waveform is quite different and the prosody of the voice is altered.
This experiment confirms our observations from before that important features are often those that span many dimensions in the input data space (in this case phoneme and other high-level content in waveform).
| 16 kHz to | latent | dataset | reconst. | VC | phoneme match |
|---|---|---|---|---|---|
| 250 Hz (x64) | 1 x 512 x time | VCTK | ☑ | - | - |
| 125 Hz (x128) | 1 x 128 x time | LibriSpeech | ☑ | - | - |
| ? Hz (x?) | ? | ? | - | ☑ | - |
| 25 Hz (x640) | 1 x 128 x time | ? | ☑ | - | 49.3% |
VC (voice conversion): resynthesize by feeding the decoder a different speaker ID.
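For context, the bitrate implied by the discrete code is easy to work out. A quick check, assuming one 512-way code per latent step (as in the footnoted quote) for the 64x setting:

```python
import math

latent_rate = 16000 / 64            # 250 latent steps per second (64x downsampling)
bits_per_code = math.log2(512)      # 9 bits to index a 512-entry codebook
print(latent_rate * bits_per_code)  # 2250.0 bits/s, vs. 256 kbit/s for 16-bit PCM
```

At ~2.25 kbit/s the code cannot carry the waveform sample by sample, which is consistent with the reconstruction preserving content but not the exact waveform.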
---
“We train a VQ-VAE where the encoder has 6 strided convolutions with stride 2 and window-size 4. This yields a latent space 64x smaller than the original waveform. The latents consist of one feature map and the discrete space is 512-dimensional.” ↩