Paper walkthrough: Valin (2022) Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet

LPCNet, now more efficient (2.5x and up).

Background - we know where the bottlenecks are, so give up already

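LPCNet is fast enough for real-time inference on a mobile CPU, and its quality is good if you scale it up. But there is still room to improve the quality achievable within the speed constraint¹, so further efficiency gains are needed.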

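In the original LPCNet, SamplingRateNet accounts for 98.2% of the compute time, with GRUa/GRUb/DualFC taking 47.8%/26.9%/21.3%². Only GRUa had been sparsified so far, so there is room for improvement.
In addition, the weight transfers that SamplingRateNet incurs at every step saturate the L2 cache bandwidth, and that is a bottleneck³. On top of that, the weights do not fit in the L2 cache, which is thought to make the bottleneck even tighter⁴. The standard weight-reduction techniques can be applied here.
The second bottleneck is the activation functions: with N_A=384, about 2000 activations are computed at every step⁵. Cheaper activation functions can relieve this.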

LPCNet can still get a lot faster; it should give up and let itself be made efficient⁶.

Proposed methods

Model improvements
- Hierarchical Probability Distribution
- Increasing second GRU capacity
Compute improvements
- Quantization
  - int8 dynamic quantization
  - Quantization aware training
- tanh approximation

Hierarchical Probability Distribution

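Principle

A way of splitting up the distribution, inspired by the dual softmax of WaveRNN and the bit bunching of Bunched LPCNet. It is formulated as
P(s) = Π_k B(L_k|L_<k)
where the product runs over the Q bits of s.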

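The basic element is the question "does bit L_k take the value 0 or 1, given that the upper bits L_<k are already fixed?"
This is formulated as B(L_k|L_<k), the Bernoulli distribution of a bit conditioned on the upper bits.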

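Next, regard the discrete value s_t as having a hierarchical structure of bits.
That is, 0 ≦ s_t ≦ 2^Q-1 (Q bits) can be seen as a bit tree with Q levels (e.g. 5 with Q=3 is 1_0_1).
Looking at level k alone, the bit is either 0 or 1, but its probability changes depending on whether the upper levels are 0 or 1; in other words, it is conditioned on the upper bits.
So the basic element B(L_k|L_<k) shows up naturally in the hierarchy.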

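The probability of s_t is the joint probability of all of its bits, and viewing s_t as a bit tree lets us factorize that distribution. That is, we arrive at
P(s) = B(L₁|-) * B(L₂|L₁) * ... * B(L_Q|L₁, L₂,...,L_Q-1) = Π_k B(L_k|L_<k)
Putting it together, the Hierarchical Probability Distribution is a method that "regards a discrete value as a bit tree, uses the Bernoulli distribution B(L_k|L_<k) of each bit conditioned on the upper bits, and models the distribution by factorizing the joint probability".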

For example, with Q=3 bits, the probability of s_t=5 is obtained as
P(s_t=5=1_0_1) = B(L₁=1|-) * B(L₂=0|L₁=1) * B(L₃=1|L₁=1,L₂=0)
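The factorization above can be checked numerically with a small sketch. The table `p_one` and all of its numbers are made up for illustration (they are not from the paper); it just holds P(L_k = 1 | upper bits) for every prefix of upper bits in a Q=3 tree.

```python
Q = 3  # number of bits, as in the example above

# Hypothetical values of P(L_k = 1 | upper bits), keyed by the tuple of
# already-decided upper bits. The numbers are invented for illustration.
p_one = {
    (): 0.6,
    (0,): 0.3, (1,): 0.2,
    (0, 0): 0.5, (0, 1): 0.9, (1, 0): 0.7, (1, 1): 0.1,
}

def bits_of(s, q=Q):
    """Return the bits of s, most significant first (5 -> (1, 0, 1))."""
    return tuple((s >> (q - 1 - k)) & 1 for k in range(q))

def joint_prob(s, q=Q):
    """P(s) as the product of the conditional Bernoulli terms B(L_k | L_<k)."""
    bits = bits_of(s, q)
    prob = 1.0
    for k, bit in enumerate(bits):
        p1 = p_one[bits[:k]]          # P(L_k = 1 | L_<k)
        prob *= p1 if bit == 1 else 1.0 - p1
    return prob

# P(5) = B(L1=1) * B(L2=0|L1=1) * B(L3=1|L1=1,L2=0) = 0.6 * 0.8 * 0.7
print(joint_prob(5))
# The factorized probabilities of all 2^Q values still sum to 1 (up to rounding).
print(sum(joint_prob(s) for s in range(2 ** Q)))
```

Because every level is a proper Bernoulli distribution, the product is automatically normalized; no partition function is needed.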

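Benefits

Why bother with such a fiddly (but straightforward) formulation? Because it cuts the amount of computation dramatically.

To sample from a Q-bit distribution the usual way, you first compute 2^Q energies (e^x), then compute their sum (the partition function) for the softmax and divide every element by it. Only then can you sample from the resulting distribution.

With a hierarchy it is simple. Feed the input for L₁ (a scalar) into a sigmoid to get B(L₁), and sample. Next, feed the energy for L₂ given L₁=0/1 (a scalar) into a sigmoid, sample, and so on down the tree.

So the cost of computing the distribution plummets from 2^Q plus some overhead (the sum and the divisions) to Q.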
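A rough sketch of the two sampling procedures, counting how many nonlinearities each one evaluates. The heap-style layout of `node_logits` (root at index 1, children of node n at 2n and 2n+1) is my own assumption for illustration, not something taken from the paper or the LPCNet code, and the logits are random numbers.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_softmax(energies, rng):
    """Flat sampling: exponentiate all 2^Q energies, normalize, then draw."""
    m = max(energies)
    weights = [math.exp(e - m) for e in energies]  # 2^Q exponentials
    total = sum(weights)                            # partition function
    u = rng.random() * total
    acc = 0.0
    for s, w in enumerate(weights):
        acc += w
        if u < acc:
            return s, len(weights)
    return len(weights) - 1, len(weights)

def sample_hierarchical(node_logits, q, rng):
    """Hierarchical sampling: one sigmoid per bit, from the top bit down.

    node_logits uses a binary-heap layout (an assumption of this sketch):
    index 1 is the root (L1), the children of node n are 2n (bit 0) and
    2n+1 (bit 1), and index 0 is unused.
    """
    node, n_sigmoid = 1, 0
    for _ in range(q):
        p1 = sigmoid(node_logits[node])
        n_sigmoid += 1
        bit = 1 if rng.random() < p1 else 0
        node = 2 * node + bit
    return node - (1 << q), n_sigmoid   # strip the leading 1 to get s

Q = 8
rng = random.Random(0)
energies = [rng.gauss(0.0, 1.0) for _ in range(2 ** Q)]
node_logits = [0.0] + [rng.gauss(0.0, 1.0) for _ in range(2 ** Q - 1)]
s_flat, n_exp = sample_softmax(energies, rng)
s_hier, n_sig = sample_hierarchical(node_logits, Q, rng)
print(n_exp, n_sig)  # 256 8
```

The flat version also needs the sum and the cumulative scan, which is the "overhead" part; the hierarchical version needs neither.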

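That is only the beginning; the biggest saving is in the FC part.
The cost of DualFC (two single-layer FCs in parallel) is 2 * N_B * 2^Q, because it has to output a 2^Q-dimensional vector of energies.
With hierarchical sampling, however, each element of the output vector is the value for B(L_k|L_<k=Bs_<k), i.e. for one particular pattern Bs_<k of upper bits.
So if the upper levels are sampled first and it turns out that L_<k != Bs_<k, that element is never used and need not be computed at all⁷.
The cost of the FC part therefore plummets from 2 * N_B * 2^Q to 2 * N_B * Q⁸. For Q=8 that is 256 -> 8, i.e. 1/32. The memory traffic drops to 1/32 as well, of course.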
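To make the FC saving concrete, here is a minimal sketch that evaluates only the output rows lying on the sampled path and counts multiply-accumulates. It simplifies DualFC to a single plain FC without bias, uses a tiny N_B = 16 and random weights, and again assumes a heap-style node layout; all of that is my own choice for illustration, not the paper's implementation.

```python
import math
import random

def dot(row, h):
    return sum(w * x for w, x in zip(row, h))

def sample_lazy(weights, h, q, rng):
    """Sample q bits, evaluating only the output rows on the sampled path.

    weights[n] is the weight row of tree node n (heap layout, index 0 unused).
    Returns the sampled value and the number of multiply-accumulates spent.
    """
    node, macs = 1, 0
    for _ in range(q):
        logit = dot(weights[node], h)   # only this node's row is touched
        macs += len(h)
        p1 = 1.0 / (1.0 + math.exp(-logit))
        node = 2 * node + (1 if rng.random() < p1 else 0)
    return node - (1 << q), macs

Q, N_B = 8, 16                          # N_B kept small for the example
rng = random.Random(0)
h = [rng.uniform(-1, 1) for _ in range(N_B)]   # stand-in for the GRUb output
weights = [None] + [[rng.uniform(-1, 1) for _ in range(N_B)]
                    for _ in range(2 ** Q - 1)]
s, macs = sample_lazy(weights, h, Q, rng)
full_macs = N_B * (2 ** Q - 1)          # cost of computing every output row
print(macs, full_macs)  # 128 4080
```

Only Q of the weight rows are read per sample, so the memory traffic shrinks by the same factor as the arithmetic.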

Temperature

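The original LPCNet used the pitch to tweak the energies in order to apply a temperature-like bias.
With the hierarchy, a single bias can no longer be applied to the joint probability⁹ (trying to do so would throw away the benefit of partial evaluation), so instead each B(L_k|L_<k) is biased individually by cutting it off at a threshold¹⁰.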
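One possible reading of the per-branch bias, as a sketch only: snap a branch probability to 0 or 1 when one side is very unlikely. The hard clamp and the threshold value 0.05 are my assumptions; the paper only says that each branching decision is biased so that very low probability events become impossible, and the actual rule in LPCNet may differ.

```python
def bias_branch(p1, threshold=0.05):
    """Snap P(L_k = 1 | L_<k) to 0 or 1 when one branch is very unlikely.

    threshold is a hypothetical value chosen for illustration.
    """
    if p1 < threshold:
        return 0.0          # the "1" branch becomes impossible
    if p1 > 1.0 - threshold:
        return 1.0          # the "0" branch becomes impossible
    return p1

print(bias_branch(0.01), bias_branch(0.5), bias_branch(0.99))  # 0.0 0.5 1.0
```

Since the rule only looks at the one branch probability being sampled, it keeps the partial-evaluation benefit: no other node of the tree has to be computed.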

GRU capacity

For small models, the complexity shifts away from the main GRU. For large models, the activation functions start taking an increasing fraction of the complexity, again suggesting that we can increase the density at little cost.

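Hierarchical sampling made the FC cost quite small, so the output size of GRUb (N_B) no longer matters much (though it still needs care if you want the FC weights, including the unused part, to fit in L2).

GRUa is no longer the bottleneck, so its sparsity is lowered (it is made denser).
GRUb is made larger, and sparsity is introduced to it.
As a result, the effective number of weights goes up.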

Demo

Methods

Task: Speaker-independent, Language-independent speech synthesis
Models
- B192/B384/B640 (Baseline model, h_GRUa = 192/384/640)
- P192/P384/P640 (Proposed model, h_GRUa = 192/384/640)
Data
- Train: 205 hours of 16-kHz speech from a combination of TTS datasets [19, 20, 21, 22, 23, 24, 25, 26, 27] including more than 900 speakers in 34 languages and dialects
  - To make the data more consistent, we ensure that all training samples have a negative polarity. This is done by estimating the skew of the residual, in a way similar to [28].
- Val
  - PTDB-TUG (en, 10 male, 10 female)
  - NTT: NTT Multi-Lingual Speech Database for Telephonometry (en-US, en, 8 male, 8 female, 12 samples per speaker)
Evaluation
- Speed
  - measure: Synthesis speed
  - env: single core of various CPUs
    - x86: Intel i7-10810U (w/ AVX2, not AVX512-VNNI)
    - N1: a 2.5 GHz ARM Neoverse N1 (similar single-core performance as recent smartphones)
    - A72: a 1.5 GHz ARM Cortex-A72 (similar to older smartphones)
    - A53: a 1.4 GHz ARM Cortex-A53
- Quality
  - measure: naturalness MOS

Minor differences

FrameRateNetwork residual connection removed

The residual in FrameRateNetwork's Res[Conv]-FC, which existed in the original LPCNet, has quietly disappeared from Fig.1, and the text does not mention it.
The residual connection has also been removed in official LPCNet@master (see: tarepan/LPCNet - /training_tf2/lpcnet.py).

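Conditioning input destinations made explicit

In Fig.1 of the original LPCNet paper, the conditioning f is drawn as if it were fed only to GRUa.
In practice, in official LPCNet @0ddcda0, which was used for that paper, it is also concatenated into the input of GRUb.
Fig.1 of this paper reflects that properly: f branches and is fed to both GRUa and GRUb.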

Filter augmentation

To improve robustness to recording environments, the spectrum is augmented with a (so-called) second-order filter¹¹. The formula is Eq.7 of Valin (2018)¹².
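A minimal sketch of what such an augmentation could look like. The filter form H(z) = (1 + r1 z^-1 + r2 z^-2) / (1 + r3 z^-1 + r4 z^-2) with each r_i drawn uniformly from [-3/8, 3/8] is my recollection of Eq.7 in Valin (2018), so treat both the form and the range as assumptions and check the paper before relying on them. The function names are hypothetical.

```python
import random

def random_biquad(rng, limit=3.0 / 8.0):
    """Draw coefficients (b, a) of a random second-order filter.

    limit is the assumed range of the random coefficients r1..r4.
    """
    r1, r2, r3, r4 = (rng.uniform(-limit, limit) for _ in range(4))
    return (1.0, r1, r2), (1.0, r3, r4)

def biquad_filter(x, b, a):
    """Direct-form I second-order IIR filter (a[0] is assumed to be 1)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

rng = random.Random(0)
b, a = random_biquad(rng)
signal = [rng.uniform(-1.0, 1.0) for _ in range(160)]  # stand-in for speech
augmented = biquad_filter(signal, b, a)
print(len(augmented))  # 160
```

Drawing a fresh (b, a) pair per training utterance would give each one a slightly different spectral tilt, which is the point of the augmentation.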

Original Paper

@misc{2202.11169,
Author = {Jean-Marc Valin and Umut Isik and Paris Smaragdis and Arvindh Krishnaswamy},
Title = {Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet},
Year = {2022},
Eprint = {arXiv:2202.11169},
}

¹ “there is still an inherent tradeoff between synthesis quality and complexity.” from original paper
² Fig.2 of Kanagawa & Ijima. (2020). Lightweight LPCNet-based Neural Vocoder with Tensor Decomposition.
³ “According to our analysis, the main performance bottleneck is the L2 cache bandwidth required for the matrix-vector products.” from Valin & Skoglund. (2019). A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet.
⁴ “This is compounded by the fact that these weights often do not fit in the L2 cache of CPUs.” from original paper
⁵ “A secondary bottleneck includes about 2000 activation function evaluations per sample (for NA = 384).” from original paper
⁶ “In this work, we improve on LPCNet with the goal of making it even more efficient in terms of quality/complexity tradeoff.” from original paper
⁷ “Even though we still have 255 outputs in the last layer, we only need to sequentially compute 8 of them when sampling” from the paper
⁸ “compute 8 of them when sampling, making the sampling O(log Q) instead of O(Q).” from the paper
⁹ “With hierarchical sampling, we cannot directly manipulate individual sample probabilities.” from the paper
¹⁰ “each branching decision is biased to render very low probability events impossible” from the paper
¹¹ “To ensure robustness against unseen recording environments, we apply random spectral augmentation filtering using a second-order filter” from the paper
¹² “as described in Eq. (7) of [15] … [15] J.-M. Valin, “A hybrid DSP/deep learning approach to realtime full-band speech enhancement,”” from the paper
