The idea is said to come from FFTNet [25].
This is similar to the effect of analysis-by-synthesis in CELP [18, 19] and greatly reduces the artifacts in the synthesized speech.
“We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis.” from the paper↩
“In this work, we limit the input of the synthesis to just 20 features: 18 Bark-scale … cepstral coefficients, and 2 pitch parameters (period, correlation).” from the original paper↩
“For low-bitrate coding applications, the cepstrum and pitch parameters would be quantized …, whereas for text-to-speech, they would be computed from the text using another neural network” from the original paper↩
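To make the "18 + 2" layout concrete, here is a tiny numpy sketch of assembling that 20-dimensional conditioning vector (the function and variable names are my own, not from the reference implementation):

```python
import numpy as np

NUM_CEPS = 18  # Bark-scale cepstral coefficients per frame

def make_frame_features(bark_cepstrum, pitch_period, pitch_correlation):
    """Pack the 20 per-frame features the synthesis network is conditioned on:
    18 Bark-scale cepstral coefficients plus the 2 pitch parameters."""
    assert len(bark_cepstrum) == NUM_CEPS
    return np.concatenate([bark_cepstrum, [pitch_period, pitch_correlation]])

features = make_frame_features(np.zeros(NUM_CEPS), pitch_period=160, pitch_correlation=0.8)
assert features.shape == (20,)
```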
“As a second step, we subtract a constant from the distribution” from the paper↩
“renormalizes the distribution to unity, both between the two steps and on the result.” from the paper↩
“T = 0.002 provides a good trade-off” from the paper↩
“This prevents impulse noise caused by low probabilities.” from the paper↩
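Taken together, these quotes describe how the output distribution is modified before sampling: a constant T is subtracted so that very unlikely µ-law values can never be drawn, with renormalization in between and afterwards. A minimal numpy sketch of that second step (the helper names are mine, and the paper's first step, which sharpens the distribution based on the pitch correlation, is left out):

```python
import numpy as np

T = 0.002  # threshold quoted from the paper

def renormalize(p):
    """R(.): rescale a non-negative vector so it sums to one."""
    return p / np.sum(p)

def constrained_sample(p, rng=np.random.default_rng()):
    """Subtract a constant from the distribution, renormalize, then sample.
    Low-probability excitation values become unreachable, which avoids
    the impulse noise mentioned above."""
    p = renormalize(p)            # renormalize between the two steps
    p = np.maximum(p - T, 0.0)    # subtract the constant, clip at zero
    p = renormalize(p)            # renormalize the result
    return rng.choice(len(p), p=p)

# Example: sample from a 256-way distribution over 8-bit mu-law values.
probs = renormalize(np.random.rand(256))
sample = constrained_sample(probs)
```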
“To keep the complexity low, we use sparse matrices for the largest GRU … We find that 16x1 blocks provide good accuracy, while making it easy to vectorize the products.” from the paper↩
“Training starts with dense matrices and the blocks with the lowest magnitudes are progressively forced to zero until the desired sparseness is achieved” from the paper↩
“In addition to the non-zero blocks, we also include all the diagonal terms in the sparse matrix”↩
“since those are the most likely to be non-zero.” from the paper↩
“Including the diagonal terms avoids forcing 16x1 non-zero blocks only for a single element on the diagonal.” from the paper↩
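A toy sketch of what that block-sparse pruning might look like for one recurrent weight matrix. This is my own one-shot simplification (the paper prunes progressively during training); the 16x1 blocks and the kept diagonal follow the quotes above:

```python
import numpy as np

def block_sparsify(W, density, block=16):
    """Zero out the lowest-magnitude 16x1 blocks of a square matrix W,
    always keeping the diagonal terms (one-shot version of the paper's
    progressive pruning)."""
    n, m = W.shape
    block_mags = np.abs(W).reshape(n // block, block, m).sum(axis=1)  # magnitude per 16x1 block
    keep = int(round(density * block_mags.size))
    threshold = np.sort(block_mags, axis=None)[::-1][keep - 1]
    mask = np.repeat(block_mags >= threshold, block, axis=0)          # expand block mask to elements
    W_sparse = np.where(mask, W, 0.0)
    idx = np.arange(n)
    W_sparse[idx, idx] = W[idx, idx]                                  # include all diagonal terms
    return W_sparse

W = np.random.randn(384, 384)
W_sparse = block_sparsify(W, density=0.1)
```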
“α = 0.85 providing good results.” from the paper↩
“This significantly reduces the perceived noise … and makes 8-bit µ-law output viable for high-quality synthesis.” from the paper↩
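In the paper, α is the coefficient of the first-order pre-emphasis filter E(z) = 1 − αz⁻¹ applied before 8-bit µ-law quantization, with the matching de-emphasis filter 1/E(z) applied to the synthesis output so the quantization noise is shaped away from where it is most audible. A hedged sketch of that filter pair (helper names are mine):

```python
import numpy as np

ALPHA = 0.85  # pre-emphasis coefficient quoted from the paper

def pre_emphasis(x, alpha=ALPHA):
    """E(z) = 1 - alpha * z^-1, applied to the signal before mu-law quantization."""
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]
    return y

def de_emphasis(y, alpha=ALPHA):
    """Inverse filter 1 / (1 - alpha * z^-1), applied to the synthesized output."""
    x = np.zeros_like(y)
    prev = 0.0
    for i, v in enumerate(y):
        prev = v + alpha * prev
        x[i] = prev
    return x

# Round-trip check on a toy signal (no mu-law quantization in between here).
sig = np.sin(np.linspace(0, 20, 1000))
assert np.allclose(de_emphasis(pre_emphasis(sig)), sig, atol=1e-6)
```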
“When synthesizing speech, the network operates in conditions that are different from those of the training because the generated samples are different (more imperfect) than those used during training.” from the paper↩
“The use of linear prediction makes the details of the noise injection particularly important.” from the paper↩
“When injecting noise in the signal, but training the network on the clean excitation, we find that the system produces artifacts similar to those of the pre-analysis-by-synthesis vocoder era, where the noise has the same shape as the synthesis filter 1/(1 − P(z)).” from the paper↩
“To make the network more robust to the mismatch, we add noise to the input during training” from the paper↩
“To make the noise proportional to the signal amplitude, we inject it directly in the µ-law domain.”↩
“we find that by adding the noise as shown in Fig. 2, the network effectively learns to minimize the error in the signal domain” ↩
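A minimal sketch of that training-time noise injection in the µ-law domain. The µ-law companding helpers and the ±2-step uniform noise are my own choices for illustration; the quotes only establish that the noise is added in the µ-law domain so that it scales with the signal amplitude:

```python
import numpy as np

def lin_to_mulaw(x, mu=255):
    """Compand a [-1, 1] signal to 8-bit mu-law indices (0..255)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.clip(np.round((y + 1) * 127.5), 0, 255).astype(np.int32)

def mulaw_to_lin(u, mu=255):
    """Expand 8-bit mu-law indices back to a [-1, 1] signal."""
    y = u / 127.5 - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def inject_mulaw_noise(signal, max_steps=2, rng=np.random.default_rng()):
    """Add small integer noise to the mu-law codes of the network input.
    Because mu-law steps are roughly logarithmic, the same +/- few steps
    corresponds to noise proportional to the local signal amplitude."""
    codes = lin_to_mulaw(signal)
    noisy = np.clip(codes + rng.integers(-max_steps, max_steps + 1, size=codes.shape), 0, 255)
    return mulaw_to_lin(noisy)

noisy_input = inject_mulaw_noise(np.sin(np.linspace(0, 20, 1000)) * 0.5)
```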
“we add noise … as suggested in [6]. … [6] Z. Jin, … ‘FFTNet: A real-time speaker-dependent neural vocoder.’”↩
“The simplifications above essentially make the computational cost of all the non-recurrent inputs to the main GRU negligible” from the paper↩
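Part of why that cost becomes negligible (as I understand the paper) is that the frame conditioning vector is constant over an entire frame, so its contribution to the main GRU input can be computed once per frame rather than once per sample. A rough sketch of the idea, with shapes and names of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_SIZE, FEAT_DIM, GRU_UNITS = 160, 20, 384

# Hypothetical input weights of the main GRU for the frame conditioning vector.
U_f = rng.standard_normal((3 * GRU_UNITS, FEAT_DIM))  # 3 gates: update, reset, new
f = rng.standard_normal(FEAT_DIM)                      # per-frame conditioning features

# Naive: multiply U_f @ f at every sample of the frame.
naive = [U_f @ f for _ in range(FRAME_SIZE)]

# Since f is constant over the frame, compute its contribution once and reuse it.
frame_contrib = U_f @ f
cheap = [frame_contrib for _ in range(FRAME_SIZE)]

assert all(np.allclose(a, b) for a, b in zip(naive, cheap))
```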