Paper walkthrough: Valin (2018) LPCNet: Improving Neural Speech Synthesis Through Linear Prediction

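LPCNet: combines a linear-prediction vocoder with a WaveRNN that predicts the excitation/residual¹, reaching the same quality as fully neural vocoders with fewer parameters.
With sparsification, noise injection during training, a tweaked fully connected layer, and various other optimizations, it achieves real-time synthesis even on a modest CPU.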


Abstract

Paper: LPCNET: Improving Neural Speech Synthesis through Linear Prediction
- doi: 10.1109/ICASSP.2019.8682804
- arXiv: 1810.11846
Demo: jmvalin.ca

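A paper focused on vocoder efficiency.
RNN + LPC improves parameter efficiency and achieves real-time CPU synthesis of 16 kHz waveforms.

Existing models

Prior work on efficiency: FFTNet, WaveRNN
Classical models: linear-prediction vocoder / source-filter model

Idea: borrow the part of LPC that is both good and cheap

Waveform modeling = spectral envelope modeling + excitation modeling
LPC: approximates the spectral envelope (the linear part) well, but cannot model or predict the excitation/residual well.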

Model

x_t = linearPredictor(x_<t|θ) + ResidualByWaveRNN()
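
The decomposition above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: `lpc_predict` and the order-2 coefficients are hypothetical, and the residual that LPCNet would sample from its network is here simply computed from the known signal.

```python
import numpy as np

def lpc_predict(history, coeffs):
    """Linear prediction p_t = sum_k a_k * s_{t-k}; history[-1] is s_{t-1}."""
    order = len(coeffs)
    past = history[-order:][::-1]          # s_{t-1}, s_{t-2}, ..., s_{t-order}
    return float(np.dot(coeffs, past))

# Toy order-2 predictor (hypothetical coefficients, chosen for illustration).
coeffs = np.array([1.8, -0.9])
signal = np.sin(0.1 * np.arange(200))

# Analysis: the residual (excitation) is whatever the linear part misses.
residual = np.array([signal[t] - lpc_predict(signal[:t], coeffs)
                     for t in range(2, len(signal))])

# Synthesis: x_t = prediction + residual, run autoregressively on its own output.
out = list(signal[:2])
for e in residual:
    out.append(lpc_predict(np.array(out), coeffs) + e)
out = np.array(out)
```

With the exact residual the signal is reconstructed sample by sample; in LPCNet the residual would instead be drawn from the network's output distribution.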

inputs²
- 18 Bark-scale cepstrum³
- pitch parameters
  - 1 period
  - 1 correlation

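For compression, one frame can be reduced to 20 dimensions, and for TTS, generating only 20 dimensions is easy; that seems to be the rationale (the feature set is not tailored to vocoder use)⁴.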

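Computing the coefficients

The coefficients are estimated through a chain of spectral transforms (BFC -> (iDCT) -> BarkSpec -> (interp) -> PSD -> (iFT) -> Autocorr -> (Yule–Walker with Levinson-Durbin) -> Coeffs).
Because the linear-frequency PSD is obtained by linearly interpolating the Bark bands, LPC accuracy drops. But LPC is only a means to an end: as long as the RNN's residual prediction compensates and overall efficiency improves, that is fine.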
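
The last two stages of that chain (PSD -> autocorrelation -> coefficients) can be sketched as follows. This is a minimal sketch, not the reference C code: the function names are hypothetical, the earlier stages (iDCT and Bark-band interpolation) are left out, and the PSD is assumed to be sampled on an `rfft` grid.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns a with p_t = sum_k a[k-1] * s_{t-k}."""
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient from the current prediction error.
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lpc_from_psd(psd, order):
    """psd: power spectrum on the rfft grid (length n_fft // 2 + 1)."""
    autocorr = np.fft.irfft(psd)           # inverse FT of the PSD = autocorrelation
    return levinson_durbin(autocorr[:order + 1], order)

# Toy check: the PSD of a known order-2 all-pole filter gives its coefficients back.
true_a = np.array([1.8, -0.9])             # hypothetical, for illustration only
w = np.linspace(0.0, np.pi, 257)           # rfft grid for n_fft = 512
denom = 1.0 - true_a[0] * np.exp(-1j * w) - true_a[1] * np.exp(-2j * w)
psd = 1.0 / np.abs(denom) ** 2
est_a, err = lpc_from_psd(psd, order=2)
```

On an exact all-pole PSD the recursion recovers the coefficients; with a PSD interpolated from 18 Bark bands the result would only be approximate, which is the accuracy loss noted above.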

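Computing the excitation

In a pure LPC view, the excitation is independent of the spectral envelope and of the previously generated samples.
Accuracy was mediocre when tried that way, so both are fed in as features.

The inputs are f, Q(p_t), Q(s_t−1), e_t−1 (quantized by nature) ⁵
I do not fully understand the point of quantizing p and s (to reduce the embedding size?)

The concatenated features go through GRU-GRU-DualFC-softmax, and the output is sampled.
The GRU-GRU pair is BigSparse-SmallDense.
Sampling uses a temperature to avoid noise caused by rare samples.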

DualFC

Not essential, but a trick that improves quality slightly at the same computational cost.

Conventional: softmax(W₂ ReLU(W₁h_t))
This paper: softmax(DualFC(h_t)) = softmax(a₁⦿tanh(W₁h_t)+a₂⦿tanh(W₂h_t))
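
A NumPy sketch of the DualFC formula above, for illustration only. The hidden size, the random weights, and the helper names are assumptions, and bias terms are omitted because the formula as written has none.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def dual_fc(h, W1, W2, a1, a2):
    """a1 * tanh(W1 h) + a2 * tanh(W2 h), with element-wise scaling by a1 and a2."""
    return a1 * np.tanh(W1 @ h) + a2 * np.tanh(W2 @ h)

rng = np.random.default_rng(0)
hidden, n_bins = 16, 256                   # 256 bins for 8-bit output; hidden size is arbitrary
h = rng.standard_normal(hidden)
W1, W2 = rng.standard_normal((2, n_bins, hidden)) * 0.1
a1, a2 = rng.standard_normal((2, n_bins))
probs = softmax(dual_fc(h, W1, W2, a1, a2))
```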

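Intuition: combine two step functions to estimate the bin
The task is to predict which quantization bin the sample falls into, so ideally we would feed the one-hot vector of the correct bin into the softmax.
Given a step function with threshold x ("x or above") and another with threshold x+1 ("x+1 or below"), their difference is one-hot.
Instead of training a nonlinearity to be large only on a specific interval, two step functions facing opposite directions are enough.
This is only an intuitive explanation, so whether the network actually learns this is unknown (the authors report that monitoring the weights showed behavior along these lines; no figure is given).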
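
The two-step intuition can be checked numerically. In this sketch steep `tanh` units stand in for step functions; the bin count, the target bin, the steepness `k`, and the thresholds at target ± 0.5 are all arbitrary choices made for the illustration, not values from the paper.

```python
import numpy as np

x = np.arange(8)                     # bin indices 0..7
target = 3                           # the bin we want to single out (arbitrary)
k = 20.0                             # steepness; large k makes tanh close to a step

rising = np.tanh(k * (x - (target - 0.5)))    # about +1 for x >= target, -1 below
falling = np.tanh(-k * (x - (target + 0.5)))  # about +1 for x <= target, -1 above
bump = 0.5 * (rising + falling)               # about 1 only where both are +1
```

The sum has the same shape as a₁⦿tanh(·)+a₂⦿tanh(·) with a₁ = a₂ = 0.5, and it is close to a one-hot vector at the target bin.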

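Sampling

Tempering

Sampling directly from the distribution p(e_t) occasionally introduces loud noise⁶.
LPCNet tempers the distribution by "raise to the power c → renormalize → subtract a threshold and clip → renormalize"⁷ ⁸ ⁹.
The exponent is controlled explicitly by the pitch correlation g_p (0 to 1) (cf. binary control by voiced/unvoiced), raising or lowering the randomness.
It is set as c = 1 + max(0, 1.5 g_p - 0.5), so corr=0 gives c=1 and corr=1 gives c=2.
A threshold of 0.002 (0.2%) ¹⁰ is applied to prevent rare impulses¹¹.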
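
The tempering steps can be written down directly. This is a sketch assuming a 256-bin output distribution; `temper` is a hypothetical name, and the toy distribution is made up to show the effect of the threshold.

```python
import numpy as np

def temper(probs, pitch_corr, threshold=0.002):
    c = 1.0 + max(0.0, 1.5 * pitch_corr - 0.5)
    p = probs ** c                        # equivalent to multiplying the logits by c
    p = p / p.sum()                       # renormalize
    p = np.maximum(p - threshold, 0.0)    # subtract a constant and clip at zero
    return p / p.sum()                    # renormalize again

# Toy distribution: one dominant bin plus a thin tail of 0.1% per bin.
probs = np.full(256, 0.001)
probs[100] = 1.0 - 0.001 * 255
tempered = temper(probs, pitch_corr=0.0)  # corr=0 gives c=1, so only the threshold acts
sample = np.random.default_rng(0).choice(256, p=tempered)
```

Every tail bin sits below the 0.002 threshold and is removed, so the rare samples that would produce impulses can no longer be drawn. With 256 bins the largest probability is at least 1/256, which exceeds the threshold, so the final renormalization never divides by zero.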

My reading: in voiced regions, where the pitch is periodic, randomness matters less, so c is raised to lower the temperature.

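Sparsification

16x1 block sparsification is applied to GRU_A¹². Blocks are pruned gradually as training progresses¹³.

In addition, the diagonal elements are protected¹⁴. Empirically the diagonal elements tend to be non-zero¹⁵, which during block sparsification leaves blocks made of "near-zero elements plus one non-zero diagonal element" unpruned¹⁶. So the diagonal elements are excluded when deciding which blocks to prune, and even when a block on the diagonal is pruned, its diagonal element is still computed separately.
For example, h192 consists of 2304 blocks, of which only 230 survive at density_A=0.1. There are 192 16x1 blocks (roughly column vectors) along the row direction, so 192 blocks contain a diagonal element. Without diagonal protection, in the worst case 192 of the 230 blocks would be taken up by "near-zero diagonal blocks". For h640 it would be 640/2560 blocks (25%), so the effect depends on size but is likely useful either way.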
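
A sketch of block pruning with diagonal protection, reproducing the h192 block counts above. It is not the training-time procedure (which prunes progressively): `block_sparsify` is a hypothetical one-shot version, it treats a single 192x192 matrix, and ranking blocks by their sum of squares is an assumption.

```python
import numpy as np

def block_sparsify(W, density, block=16):
    """Keep the highest-energy 16x1 blocks; diagonal entries are always kept."""
    n = W.shape[0]
    off_diag = W - np.diag(np.diag(W))               # ignore the diagonal when ranking
    # Block b of column j covers rows b*block .. (b+1)*block - 1.
    energy = (off_diag.reshape(n // block, block, n) ** 2).sum(axis=1)
    n_keep = int(density * energy.size)
    cutoff = np.sort(energy, axis=None)[-n_keep]
    keep = np.repeat(energy >= cutoff, block, axis=0)  # back to element resolution
    return np.where(keep | np.eye(n, dtype=bool), W, 0.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((192, 192))
S = block_sparsify(W, density=0.1)
n_blocks = W.size // 16                              # 2304 blocks for h192
off = S - np.diag(np.diag(S))
kept_blocks = int((off.reshape(12, 16, 192) != 0).any(axis=1).sum())
```

Because the diagonal is stored separately, a block whose only significant entry lies on the diagonal no longer competes for one of the 230 surviving slots.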

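Preprocessing

Pre-emphasis

Pre-emphasis is applied with α=0.85¹⁷.
This greatly suppresses the noise audible in the high-frequency band when 8-bit quantization is used¹⁸.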
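
A sketch of a first-order pre-emphasis filter and its inverse with α=0.85. The function names are hypothetical; the form y[t] = x[t] - α·x[t-1] is the usual first-order pre-emphasis and is assumed here.

```python
import numpy as np

ALPHA = 0.85

def pre_emphasis(x, alpha=ALPHA):
    """y[t] = x[t] - alpha * x[t-1]: boosts high frequencies before modeling."""
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]
    return y

def de_emphasis(y, alpha=ALPHA):
    """Inverse filter x[t] = y[t] + alpha * x[t-1], applied to the synthesized output."""
    x = np.zeros_like(y)
    prev = 0.0
    for t, v in enumerate(y):
        prev = v + alpha * prev
        x[t] = prev
    return x

rng = np.random.default_rng(0)
speech = rng.standard_normal(1000)        # stand-in for a float waveform
restored = de_emphasis(pre_emphasis(speech))
```

The de-emphasis filter is a low-pass, so quantization noise added in the pre-emphasized domain is attenuated at high frequencies when the output is restored.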

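Noise injection

Autoregressive models (both the LP part and the residual-prediction part are AR) always suffer from exposure bias¹⁹. LP in particular is sensitive to noise²⁰.
When the model was trained on noise-free signals (i.e. on the e_t_clean derived from them) and given noisy AR inputs at inference, noise with the same shape as the LP filter appeared²¹ (white noise became colored noise shaped by the LP filter: unless SampleNet learns some correction, the superposition property of linear prediction simply yields colored noise).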

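So, to train a robust model, noise is injected into the input²². Since the s_t-1 input is assumed to be noisy, the model becomes
p_{t,noisy} = LP(a, s_{t-k,noisy}); ŝ_t = p_{t,noisy} + NN(s_{t-1,noisy}, p_{t,noisy}, e_{t-1})
We want the ideal output under these conditions, so the model is trained to output ŝ_t = s_{t,clean}.
From the model assumptions, e_t-1 is s_t-1 minus p_{t-1,noisy}. The paper uses e_{t-1} = s_{t-1,clean} - p_{t-1,noisy}. I do not understand the criteria for deciding whether s_t-1 should be clean or noisy, or whether e_t-1 needs no noise injection. Perhaps the nuance is that the network learns to correct the distortion introduced by LP, rather than this being a countermeasure against the NN's exposure bias?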
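
The relations above can be laid out as array operations. This is only my reading of the setup expressed as a sketch: `training_signals`, the order-2 coefficients, and the Gaussian stand-in for the injected noise are all assumptions.

```python
import numpy as np

def training_signals(s_clean, s_noisy, coeffs):
    """Return (p_noisy, e) such that p_noisy[t] + e[t] == s_clean[t]."""
    order = len(coeffs)
    p_noisy = np.zeros_like(s_clean)
    for t in range(order, len(s_clean)):
        # Prediction from the *noisy* past samples s_{t-1}, ..., s_{t-order}.
        p_noisy[t] = np.dot(coeffs, s_noisy[t - order:t][::-1])
    e = s_clean - p_noisy            # excitation relative to the noisy prediction
    return p_noisy, e

rng = np.random.default_rng(0)
coeffs = np.array([1.8, -0.9])       # hypothetical LPC coefficients
s_clean = np.sin(0.1 * np.arange(100))
s_noisy = s_clean + 0.01 * rng.standard_normal(100)   # stand-in for the injected noise
p_noisy, e = training_signals(s_clean, s_noisy, coeffs)

t = 50
net_inputs = (s_noisy[t - 1], p_noisy[t], e[t - 1])   # what the network sees at step t
target = e[t]                        # chosen so that p_noisy[t] + target == s_clean[t]
```

Defining the excitation against the noisy prediction means that a network which hits its target exactly cancels the error the noisy history put into the LP prediction.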

The noise should scale with the amplitude of s_t-1, so it is added in the µ-law domain²³.
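
A sketch of why adding noise in the µ-law domain makes it proportional to the signal amplitude. The continuous µ-law formulas with µ=255 are standard; the uniform noise and its `width` are arbitrary illustrative choices, not the distribution used in the paper.

```python
import numpy as np

MU = 255.0

def mulaw_encode(x):
    """Compand x in [-1, 1] to the mu-law domain, also in [-1, 1]."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_decode(y):
    """Inverse of mulaw_encode."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def add_mulaw_noise(x, rng, width=1.0 / 128):
    """Add uniform noise in the mu-law domain; `width` is an illustrative value."""
    y = mulaw_encode(x) + rng.uniform(-width, width, size=x.shape)
    return mulaw_decode(np.clip(y, -1.0, 1.0))

rng = np.random.default_rng(0)
quiet = np.full(10000, 0.01)
loud = np.full(10000, 0.5)
quiet_err = np.abs(add_mulaw_noise(quiet, rng) - quiet).mean()
loud_err = np.abs(add_mulaw_noise(loud, rng) - loud).mean()
```

The same perturbation in the companded domain maps to a much larger linear-domain error for the loud signal than for the quiet one.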

As intended, noise injection did effectively reduce distortion²⁴.

The idea reportedly comes from FFTNet²⁵.
“This is similar to the effect of analysis-by-synthesis in CELP [18, 19] and greatly reduces the artifacts in the synthesized speech.” from the paper

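Precomputing embeddings

An embedding is a lookup table, so the computation downstream of it can be precomputed as well. Turning W_ih s_t-1, W_ih e_t-1, and W_ih p_t into lookup tables removes almost all of the computation and the W_ih memory traffic.
W_ih f is also constant within a frame, so it can be memoized.
As a result, the computation and memory-transfer cost of the input side of GRU_A becomes almost negligible²⁶.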
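
The lookup-table trick amounts to folding the embedding table into the input weights once, offline. In this sketch all sizes and names (`E`, `W_s`, `lut`) are illustrative assumptions; only the idea of replacing a matrix-vector product with a row read comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_levels, embed_dim, gru_in = 256, 128, 3 * 384   # illustrative sizes

E = rng.standard_normal((n_levels, embed_dim))    # embedding table
W_s = rng.standard_normal((gru_in, embed_dim))    # slice of W_ih acting on this embedding

lut = E @ W_s.T                                   # (n_levels, gru_in), computed once offline

s_prev = 200                                      # a quantized sample index
direct = W_s @ E[s_prev]                          # matrix-vector product per sample
looked_up = lut[s_prev]                           # a single row read per sample
```

Both paths give the same vector, so the per-sample matrix-vector product can be dropped. The per-frame term W_ih f can likewise be computed once per frame and reused for every sample in that frame.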

Experiments

Data

NTT Multi-Lingual Speech Database for Telephonometry: 21 languages, 4 hours of speech in total.
Hence a speaker-independent task.

official docs

jmvalin.ca

derivatives

Matsubara, et al. (2020). Full-band LPCNet: 48 kHz real-time neural vocoder.
FeatherWave
An Efficient Subband Linear Prediction for LPCNet-based Neural Synthesis

IBM High quality, lightweight and adaptable Text-to-Speech (TTS) using LPCNet | IBM Research Blog

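Perhaps partly because LPCNet itself is capable enough and even ships with a C implementation, my impression is that its derivatives have not taken off much (personal opinion).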

Full-band LPCNet

model	input for analy/synth	input for TTS
WORLD	input for analy/synth	acoustic feat. (57 dim)
WaveNet	mel-spec (80 dim, 0-7.6kHz)	acoustic feat. (80 dim)
PWG	mel-spec (80 dim, 0-7.6kHz)	acoustic feat. (80 dim)
LPCNet	BarkCep. (50 dim), fo, pitch corr.	acoustic feat. (52 dim)

The original speech scores a MOS of about 4.25. The TTS model scores exactly the same. Analysis/synthesis, on the other hand, tops out at about 4.0.

“We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis.” from the paper↩
“In this work, we limit the input of the synthesis to just 20 features: 18 Bark-scale … cepstral coefficients, and 2 pitch parameters (period, correlation).” from original paper↩
BFC rather than BFCC (the reduction to 18 dimensions already happens at the Bark-scale step)↩
“ For low-bitrate coding applications, the cepstrum and pitch parameters would be quantized …, whereas for text-to-speech, they would be computed from the text using another neural network” from original paper↩
See Fig. 2 (confirmed against the implementation) ↩
“Directly sampling from the output distribution can sometimes cause excessive noise.” from the paper↩
“multiplying the logits” from the paper↩
“As a second step, we subtract a constant from the distribution” from the paper↩
“renormalizes the distribution to unity, both between the two steps and on the result.” from the paper↩
“… T = 0.002 provides a good trade-off” from the paper↩
“This prevents impulse noise caused by low probabilities.” from the paper↩
“To keep the complexity low, we use sparse matrices for the largest GRU … We find that 16x1 blocks provide good accuracy, while making it easy to vectorize the products.” from the paper↩
“Training starts with dense matrices and the blocks with the lowest magnitudes are progressively forced to zero until the desired sparseness is achieved” from the paper↩
“In addition to the non-zero blocks, we also include all the diagonal terms in the sparse matrix”↩
“since those are the most likely to be non-zero.” from the paper↩
“Including the diagonal terms avoids forcing 16x1 non-zero blocks only for a single element on the diagonal.” from the paper↩
“α = 0.85 providing good results.” from the paper↩
“This significantly reduces the perceived noise … and makes 8-bit µ-law output viable for high-quality synthesis.” from the paper↩
“When synthesizing speech, the network operates in conditions that are different from those of the training because the generated samples are different (more imperfect) than those used during training.” from the paper↩
“The use of linear prediction makes the details of the noise injection particularly important.” from the paper↩
“When injecting noise in the signal, but training the network on the clean excitation, we find that the system produces artifacts similar to those of the pre-analysis-by-synthesis vocoder era, where the noise has the same shape as the synthesis filter 1/(1−P(z)).” from the paper↩
“To make the network more robust to the mismatch, we add noise to the input during training” from the paper↩
“To make the noise proportional to the signal amplitude, we inject it directly in the µ-law domain.”↩
“we find that by adding the noise as shown in Fig. 2, the network effectively learns to minimize the error in the signal domain” ↩
“we add noise … as suggested in [6]. … [6] Z. Jin, … ‘FFTNet: A real-time speaker-dependent neural vocoder.’”↩
“The simplifications above essentially make the computational cost of all the non-recurrent inputs to the main GRU negligible” from the paper↩
