Paper walkthrough: Multiband-WaveRNN - たれぱんのびぼーろく

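Multiband-WaveRNN is built on the hypothesis that "WaveRNN has expressive capacity to spare": it asks a WaveRNN of unchanged size to predict N subbands simultaneously¹.
Remarkably, it actually succeeds at N-band prediction with no difference in MOS. Because the operating frequency can be cut to 1/N, the RTF improves substantially.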

Background and model

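There are several studies on sparsifying WaveRNN, and they have shown that most of its parameters can be set to 0 while keeping the impact on MOS minimal.
This fact suggests that WaveRNN has expressive capacity to spare.
"Well then, go do some more work" is the idea behind this study ².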

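Because the WaveRNN model size is held fixed (the same-size model is made to do more work), the approach is to lower WaveRNN's operating frequency.
And lowering the operating frequency means subband processing.
A group of low-sampling-rate subbands, produced with a filter bank and decimation, is fed into WaveRNN all at once, and the model generates the subband group at t+1³.
Splitting into N bands reduces the sampling frequency to 1/N, so WaveRNN's operating frequency drops as well, and this module's computation cost falls to roughly 1/N.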
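The split described above can be sketched with NumPy. This is a minimal illustration, not the paper's filter bank: the cosine-modulated design, the 63-tap Kaiser-windowed prototype, and the names `analysis_filters` and `split_into_subbands` are all assumptions made for this example. The point is only that N band-limited signals, each decimated by N, leave the autoregressive model with 1/N as many time steps.

```python
import numpy as np

def analysis_filters(n_bands=4, taps=63, beta=9.0):
    """Build n_bands band-pass filters by cosine-modulating one low-pass prototype.

    Assumed design (pseudo-QMF style); the paper's actual filter bank may differ.
    """
    n = np.arange(taps) - (taps - 1) / 2          # time index centered on 0
    cutoff = 1.0 / (4 * n_bands)                  # cycles/sample, half of one band's width
    prototype = 2 * cutoff * np.sinc(2 * cutoff * n) * np.kaiser(taps, beta)
    k = np.arange(n_bands)[:, None]               # band index as a column vector
    phase = (-1.0) ** k * np.pi / 4
    return 2 * prototype * np.cos((2 * k + 1) * np.pi / (2 * n_bands) * n + phase)

def split_into_subbands(x, n_bands=4):
    """Filter x into n_bands bands, then keep every n_bands-th sample of each."""
    filters = analysis_filters(n_bands)
    filtered = np.stack([np.convolve(x, h, mode="same") for h in filters])
    return filtered[:, ::n_bands]                 # decimation: rate drops to 1/n_bands

# A low-frequency test tone: its energy should land in the lowest band.
t = np.arange(16000)
x = np.sin(2 * np.pi * 0.02 * t)
subbands = split_into_subbands(x, n_bands=4)
energy = (subbands ** 2).sum(axis=1)
print(subbands.shape)      # 4 bands, each a quarter of the original length
print(energy.argmax())     # index of the band holding most of the energy
```

With 4 bands, 16000 input samples become 4000 steps of 4 values each, so the recurrent network is evaluated a quarter as often.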

Implementation and experiments

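The original WaveRNN (not sparsified) is used, i.e. GRU-FC₁-ReLU-FC₂-softmax + 8/8-bit dual softmax.
All bands are concatenated (or combined in some similar way) and fed into the GRU, followed by FC₁-ReLU; FC₂ is prepared per band, each with its own softmax⁴,⁵.
(Perhaps because sparsification is not assumed) the GRU and FC layers are dense (h_GRU=192, h_FC=192⁶).
The theoretical complexity is 9.8 GFLOP/s for fullband and 3.6 GFLOP/s for 4 bands⁷.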
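The layout described above (one shared trunk, one output head per band) can be sketched as a NumPy forward pass. This is a rough illustration under several assumptions, not the paper's implementation: it uses a single GRU cell, replaces the 8/8-bit dual softmax with one 256-way softmax per band, omits the acoustic conditioning features, and runs on random untrained weights. The names `params`, `gru_step`, and `step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BANDS, HIDDEN, FC, CLASSES = 4, 192, 192, 256   # 192/192 follow the post; 256 is assumed

def init(shape):
    return rng.standard_normal(shape) * 0.05

params = {
    "W": init((3 * HIDDEN, N_BANDS)),             # input weights: update, reset, candidate
    "U": init((3 * HIDDEN, HIDDEN)),              # recurrent weights, same ordering
    "b": np.zeros(3 * HIDDEN),
    "fc1_w": init((FC, HIDDEN)), "fc1_b": np.zeros(FC),
    "heads_w": init((N_BANDS, CLASSES, FC)),      # one FC2 matrix per band
    "heads_b": np.zeros((N_BANDS, CLASSES)),
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One standard GRU update on the concatenated subband samples x."""
    wx = p["W"] @ x + p["b"]
    z = sigmoid(wx[:HIDDEN] + p["U"][:HIDDEN] @ h)
    r = sigmoid(wx[HIDDEN:2 * HIDDEN] + p["U"][HIDDEN:2 * HIDDEN] @ h)
    cand = np.tanh(wx[2 * HIDDEN:] + p["U"][2 * HIDDEN:] @ (r * h))
    return (1 - z) * h + z * cand

def step(prev_samples, h, p):
    """Shared GRU and FC1-ReLU, then a separate FC2 + softmax for every band."""
    h = gru_step(prev_samples, h, p)
    hidden = np.maximum(p["fc1_w"] @ h + p["fc1_b"], 0.0)
    logits = np.einsum("bcf,f->bc", p["heads_w"], hidden) + p["heads_b"]
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs, h

# Autoregressive loop: each step emits one sample for all bands at once.
h = np.zeros(HIDDEN)
prev = np.zeros(N_BANDS)
for _ in range(3):
    probs, h = step(prev, h, params)
    classes = np.array([rng.choice(CLASSES, p=band_probs) for band_probs in probs])
    prev = classes / (CLASSES - 1) * 2 - 1        # map class index back to [-1, 1]
```

Only the input width and the number of output heads depend on the band count; the GRU and FC₁ keep the same size as in the fullband model, which is the sense in which the model size stays fixed.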

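RTF improves from 1.3 to 0.5 (2.4 GHz single core⁸, using avx2). Incidentally, combining this with int8 improves it further to 0.17 (QAT appears to be used⁹).
Between original vs multi-band (vs multi-band int8) there is no difference at all in naturalness MOS, and listeners could not tell them apart subjectively either.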
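The reported figures can be turned into speedup ratios with a few lines of arithmetic. The sketch assumes the usual definition of RTF (processing time divided by audio duration, so values below 1 are faster than real time); the post does not define it. Variable names are made up for this example.

```python
# Figures quoted in the post.
flops_fullband, flops_4band = 9.8, 3.6             # GFLOP/s
rtf_fullband, rtf_4band, rtf_4band_int8 = 1.3, 0.5, 0.17

flops_reduction = flops_fullband / flops_4band     # theoretical cost ratio
speedup_multiband = rtf_fullband / rtf_4band       # measured, float
speedup_int8 = rtf_fullband / rtf_4band_int8       # measured, with int8

def is_faster_than_real_time(rtf):
    """Assumed RTF convention: time spent / audio length, lower is better."""
    return rtf < 1.0

print(f"theoretical: {flops_reduction:.2f}x")
print(f"multi-band:  {speedup_multiband:.2f}x")
print(f"with int8:   {speedup_int8:.2f}x")
```

The theoretical reduction comes out near 2.7x rather than 4x. One plausible reading is that the per-band FC₂ heads do not shrink: there are N of them, each running at 1/N the rate, so only the shared GRU and FC₁ cost is divided by N.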

Other multiband approaches

1. T. Okamoto, T. Toda, Y. Shiga, and H. Kawai, “Improving FFTNet vocoder with noise shaping and subband approaches,” in 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 304–311, IEEE, 2018.
2. T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, “An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5654–5658, IEEE, 2018.

Original Paper

@misc{1909.01700,
Author = {Chengzhu Yu and Heng Lu and Na Hu and Meng Yu and Chao Weng and Kun Xu and Peng Liu and Deyi Tuo and Shiyin Kang and Guangzhi Lei and Dan Su and Dong Yu},
Title = {DurIAN: Duration Informed Attention Network For Multimodal Synthesis},
Year = {2019},
Eprint = {arXiv:1909.01700},
}

¹ “we propose the Multiband-WaveRNN alogorithm” DurIAN paper.
² “our proposed multi-band WaveRNN algorithm exploits the sparseness of the neural network model” DurIAN paper.
³ “uses a single shared WaveRNN model for all subband signal predictions. More specifically, the shared WaveRNN model takes all subband samples predicted from the previous step as input and predicts next samples in all subbands in one inference step” DurIAN paper.
⁴ From the computational cost given in the paper, I read it as preparing a separate FC₂ for each band. Incidentally, the cost formula is probably wrong (shouldn't it be N_F * N_B rather than N_G * N_B?).
⁵ “predict samples for all subbands simultanously through multiple output (and softmax) layers.” DurIAN paper.
⁶ “N_G is the size of the two GRUs, N_F is the width of affine layer connected with final fully-connected layer … Using N_G = 192, N_G = 192” from the paper.
⁷ “we obtain a total complexity around 9.8 GFLOPS. When we set NB = 4, the total complexity is 3.6 GFLOPS.” from the paper.
⁸ “All the RTF values were measured on a single Intel Xeon CPU E5-2680 v4 core.” from the paper.
⁹ “Since direct quantization of the network causes the deterioration of synthesized sound quality, we use a quantitative loss learning mechanism during training to minimize the deterioration caused by quantization.”