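have those meanings³, and there is data for 10 speakers: SF1 to SF3, SM1 and SM2, TF1 to TF3, TM1 and TM2.
Files with the same name (e.g. 100001.wav) contain the same utterance⁴.
The audio is 16 kHz⁵, 16-bit⁶, in RIFF/WAVE format⁷. There are also 54 utterances for evaluation from each of the 5 source and 5 target speakers (ref).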
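
The format described above (16 kHz, 16-bit, RIFF/WAVE) can be checked with the Python standard library. The following is only a minimal sketch: `describe_wav` and `matches_vcc2016_format` are hypothetical helper names, and the demo writes a tiny silent WAV into memory instead of reading a real VCC2016 file.

```python
import io
import wave


def describe_wav(source):
    """Read the header of a RIFF/WAVE file (a path or a file-like object)."""
    with wave.open(source, "rb") as wf:
        return {
            "sample_rate": wf.getframerate(),
            "bit_depth": wf.getsampwidth() * 8,
            "channels": wf.getnchannels(),
            "n_frames": wf.getnframes(),
        }


def matches_vcc2016_format(info):
    """True if the header matches the 16 kHz / 16-bit description above."""
    return info["sample_rate"] == 16000 and info["bit_depth"] == 16


# Demo on an in-memory WAV (assumption: mono, 160 silent samples).
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wf.setframerate(16000)   # 16 kHz
    wf.writeframes(b"\x00\x00" * 160)
buf.seek(0)

info = describe_wav(buf)
print(info, matches_vcc2016_format(info))
```

For a real file, a path such as `describe_wav("100001.wav")` could be passed instead of the in-memory buffer.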

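Download

Here
The archive "VCC training data: training data released to participants during the challenge (23.30Mb)" contains 162 utterances for each of the 10 speakers (the data used for training during the challenge).
The evaluation data seems to be included only in part, with gaps, and I do not understand why.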

url_prefix = 'https://datashare.is.ed.ac.uk/bitstream/handle/10283/2211/'
data_files = ['vcc2016_training.zip', 'evaluation_all.zip']

Downloading via these links does fetch everything, though... so what is going on?
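
A minimal sketch of how `url_prefix` and `data_files` could be combined into download URLs. `build_urls` and `download_all` are hypothetical helpers, the `vcc2016` destination directory is an assumption, and nothing is downloaded unless `download_all()` is called explicitly.

```python
from pathlib import Path
from urllib.request import urlretrieve

url_prefix = 'https://datashare.is.ed.ac.uk/bitstream/handle/10283/2211/'
data_files = ['vcc2016_training.zip', 'evaluation_all.zip']


def build_urls(prefix, names):
    """Join the common prefix with each archive name."""
    return [prefix + name for name in names]


def download_all(dest_dir="vcc2016"):
    """Fetch every archive into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, url in zip(data_files, build_urls(url_prefix, data_files)):
        target = dest / name
        if not target.exists():
            urlretrieve(url, str(target))
        paths.append(target)
    return paths


urls = build_urls(url_prefix, data_files)
print(urls)
```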

Footnotes 1 to 6 are quoted from here
ref

> Each speaker utters the same sentence ↩
> a common dataset consisting of 162 utterances for training↩
> 'S' denotes 'source', 'T' denotes 'target', while 'M' and 'F' for 'male' and 'female', respectively. ↩
> The same file name means the same linguistic content ↩
> The sampling rate is 16 kHz↩
> stored in 16-bit format.↩
> The waveforms in the directory are in RIFF/WAVE format. ↩

2018-10-25

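Visualizing sound with Rainbowgrams

A rainbowgram is a figure that visualizes, over time, the strength of the frequency components that make up a sound together with the rate of change of their phase. More precisely, it visualizes the magnitude and the instantaneous frequency (IF) of the sound's time series in the frequency domain.
Compared with methods such as the (power) spectrogram, which shows how magnitude changes over time, a rainbowgram has the advantage of showing phase, an important component of sound, at the same time as magnitude¹.
Time is on the horizontal axis and frequency on the vertical axis; the log power at each point is shown as brightness² and the IF as color³. In terms of existing audio plots, it is a spectrogram with an added color encoding of IF⁴.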


How to read it

dark to bright == weak to strong magnitude
constant color == constant phase (coherent)
changing color == changing phase
speckled color changes (speckled noise) == discontinuities, incoherent

Why not a spectrogram

Even when two spectrograms look very similar, the corresponding audio can sound completely different, and that difference comes from phase⁵.

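Why instantaneous frequency rather than phase

Because raw phase looks scattered when it is plotted.
To obtain a time series of frequency representations, the waveform is always split into a sequence of frames (frames may partly overlap). If the frame stride (hop size) is not an integer multiple of a component's period, the start of the wave shifts from frame to frame, i.e. the initial phase shifts.
If this is drawn as color, the color keeps changing with that shift, which makes the change of the phase itself over time hard to see (because there are two causes of change).
Fortunately this shift is linear, so if the phase itself does not change, the rate of change of the phase, i.e. its derivative, should be constant (a deviation in the derivative indicates a phase change).
This value, the derivative of the unwrapped phase, is the instantaneous frequency. Because phase changes are easier to perceive in the IF, the IF is used as the color-coded indicator of phase change.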
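
The argument above can be checked numerically. The sketch below is an illustration under assumptions, not the NSynth implementation: it uses a naive NumPy STFT (hypothetical helpers `stft` and `instantaneous_frequency`), a 450 Hz sine at 16 kHz, and a hop of 128 samples, which is not an integer multiple of the sine's period. The raw phase at the dominant bin jumps from frame to frame, while the difference of the unwrapped phase stays nearly constant.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Naive STFT: Hann-windowed frames -> complex array of shape (freq, time)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T


def instantaneous_frequency(spec):
    """Unwrap the phase along time, then take the frame-to-frame difference."""
    phase = np.angle(spec)
    unwrapped = np.unwrap(phase, axis=1)
    return np.diff(unwrapped, axis=1)  # radians per hop


sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 450.0 * t)  # stationary tone, constant phase

spec = stft(x)
phase = np.angle(spec)
inst_freq = instantaneous_frequency(spec)

# Frequency bin with the largest average magnitude (the bin nearest the tone).
peak_bin = int(np.argmax(np.abs(spec).mean(axis=1)))

print("raw phase spread over time:", np.ptp(phase[peak_bin]))
print("IF spread over time       :", np.ptp(inst_freq[peak_bin]))
```

In a rainbowgram, `inst_freq` would be mapped to color and the log power of `spec` to brightness.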

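Origin of the name

The name "rainbowgram" comes from the fact that a spectrogram with IF encoded as color (a rainbowgram) tends to shift through the colors of the rainbow along the vertical axis⁶.
(Note to self: the IF comes from the difference between f_frame and f_{signal component}. So naturally the neighboring frequency components… hm? Can the initial phase not be represented…?)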

plots of the constant-q transform (CQT) (Brown, 1991), which is useful because it is shift invariant to changes in the fundamental frequency.

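Phase-change and phase-detection experiments

Root of the difficulty

The difficulty of frame-based estimation of audio waveforms.

Some subtypes

Mapping log(power) to brightness and IF to color is common to all of them.
However, there seem to be several choices of which power is used.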

NSynth: constant-q transform (CQT)⁷

References

NSynth paper
- especially Section 4, Evaluation
GANSynth paper

As phase plays such an essential part in sample quality, we have attempted to show both magnitude and phase on the same plot. (from NSynth paper) ↩
The intensity of lines is proportional to the log magnitude of the power spectrum (from NSynth paper) ↩
the color is given by the derivative of the unrolled phase (‘instantaneous frequency’) (from NSynth paper) ↩
instantaneous frequency colored spectrograms (from NSynth paper) ↩
two spectrograms that appear very similar to the eye can correspond to audio that sound drastically different due to phase differences. ↩
We affectionately refer to these instantaneous frequency colored spectrograms as "Rainbowgrams" due to their tendency to form rainbows as the instantaneous frequencies modulate up and down.↩
in our analysis we present examples as plots of the constant-q transform (CQT) ↩

2018-10-22

What is voice conversion (声質変換, Voice Conversion, "boichen")?


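Voice conversion (声質変換, read "koeshitsu henkan" or "seishitsu henkan"¹) means changing only the quality of a voice without changing its meaning. More precisely, it is "processing that deliberately converts other desired information in the input speech while preserving the utterance content"².
In English it is called "Voice Conversion" or "Voice Transformation" [^1].
Converting speaker characteristics (e.g. a male voice to a female voice) is commonly called "voice change" (ボイチェン, boichen).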

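Overview

Voice conversion is processing that deliberately converts speaker information, the emotion carried by the voice, intonation and so on, while preserving the meaning of the words in the speech.
The so-called "boichen" is conversion of speaker information, so it is one kind of voice conversion.
Voice conversion is a kind of audio transformation and combines a variety of techniques and knowledge.

Principle

Voice conversion is based on the principle of converting the non-linguistic components of speech while preserving its linguistic components.
Speech is considered to contain linguistic, para-linguistic, and non-linguistic information.
Speech (the waveform) containing these has sequential structures and hierarchical structures.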
voiced/unvoiced segments, phonemes/morphemes

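Techniques and methods

Many methods have been proposed based on acoustics, signal processing, statistical processing^p.9 and so on, but no definitive method exists yet (fully flexible voice conversion has not been achieved).

Signal processing
- Pitch and formant conversion
  - Example: 恋声 (Koigoe)
Statistical models
- GMM
- neural network
  - restricted Boltzmann machine (RBM)
  - FFNN
  - RNN
  - CNN
- exemplar-based methods
  - non-negative matrix factorization (NMF)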

not enough

simple conversion function (for modifying the spectral envelope)
- global linear transformation
- frequency warping with a constant warping rate

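Types of data

Relationship of the source-domain data
- parallel: basically needs alignment
- non-parallel
System configuration
- w/ extra data
  - transcripts
  - reference speech
- w/ modules
  - ASR
    - Can find corresponding frames even with non-parallel data. Drawbacks: everything other than verbal info is dropped, and ASR accuracy is always a limiting factor.
- w/o extra data

Problems that often occur: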
* over-smoothing:
+ explicit density estimation

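Data processing in statistical models

Converting the raw speech waveform directly with a statistical model (end2end, wave2wave)
Converting the waveform (by signal processing) into hand-designed acoustic features, then converting those with a statistical model (Vocoder)

These are the two broad approaches.

Conversion to acoustic features (Vocoder)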

vocoder
- WORLD
- STFT
- WaveNet Vocoder

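Evaluation methods

Ultimately evaluation comes down to human perception, but various metrics have been proposed for evaluating voice conversion efficiently and objectively.

Mel-cepstral distortion
global variance (GV)
modulation spectra (MS)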
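
As an illustration of the first metric, mel-cepstral distortion is commonly defined per frame as (10 / ln 10) * sqrt(2 * Σ_d (c_d − c'_d)²) and then averaged over frames. The sketch below assumes that the two mel-cepstrum sequences are already time-aligned and that the 0th (energy) coefficient is excluded. `mel_cepstral_distortion` is a hypothetical helper name, and the arrays are toy values, not real features.

```python
import numpy as np


def mel_cepstral_distortion(mcep_a, mcep_b):
    """Mean MCD in dB between two aligned (frames, dims) mel-cepstrum arrays."""
    a = np.asarray(mcep_a, dtype=float)
    b = np.asarray(mcep_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("sequences must be time-aligned and of equal shape")
    diff = a[:, 1:] - b[:, 1:]  # drop the 0th (energy) coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())


# Toy example: 3 frames, 5 coefficients, differing by 1.0 in one coefficient.
reference = np.zeros((3, 5))
converted = reference.copy()
converted[:, 1] = 1.0

mcd_same = mel_cepstral_distortion(reference, reference)
mcd_diff = mel_cepstral_distortion(reference, converted)
print(mcd_same, mcd_diff)
```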

Libraries

sprocket: GitHub, Hands on

Methods

CVAE-VC
VAE-GAN
CycleGAN-VC
StarGAN-VC
VQ-VAE

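Is a label for the input needed as part of the input?

Applications

Voice conversion can be used for a variety of purposes.

Entertainment, body augmentation
- Voice actors
- Voice changing (boichen), babiniku
Medical
- Speech assistance (congenital or acquired vocal cord dysfunction)
Utility
- Volume-independent speech (turning a whisper into a loud voice)

History

Research has been conducted since at least the late 1980s³. It originally advanced as a way to add speaker characteristics to TTS output, in particular to carry the source-language speaker's voice over to the target language in TTS for translation⁴.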

Copyright

社団法人日本音響学会 -- The Acoustical Society of Japan --

References

To be organized later

?
vocoder-free VC
K. Kobayashi, T. Toda, and S. Nakamura, “F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential,” in Proc. SLT, 2016, pp. 693–700.

? adaptation techniques
incorporating pre-constructed speaker space
-> needs parallel data among reference speakers

low-dimensional embeddings

contextual information

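Fluctuation components
Phoneme-dependent factors, voice-quality-dependent factors
Perceptual characteristics

A ref that discusses the effect of the vocoder

Tomoki Toda. Journal of the Acoustical Society of Japan, vol. 72, no. 6 (2016), pp. 324–331 ↩
Tomoki Toda. 2017 Annual Conference of the Japanese Society for Artificial Intelligence ↩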
> VC research has a relatively long history from the late 1980s onwards (VCC2016 paper)↩
Originally it was studied to achieve speaker conversion to make it possible to synthesize various speakers’ voices in a TTS system, in particular focusing on cross-language VC enabling a user to produce his/her own voice in a different language for speech-to-speech translation (VCC2016 paper)↩

2018-10-15

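Reading through PyTorch's nn.Module

Why layers must be set as attributes

Because __setattr__ is hooked and does the work there.
Inside the hook, the value is registered internally according to its type.
module.parameters() returns not only the module's own params but also recursively visits child modules and collects theirs. Module registration through the __setattr__ hook is what makes this possible.
Modules also provide .to(device), which transfers the Parameters under them to the GPU.
The lookup of those params is likewise implemented by recursing into child modules.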

torch.nn.modules.module — PyTorch master documentation

    def __setattr__(self, name, value):
        def remove_from(*dicts):
            for d in dicts:
                if name in d:
                    del d[name]
        params = self.__dict__.get('_parameters')
        # Below, the assigned value is checked case by case
        # Case: a Parameter
        if isinstance(value, Parameter):
            if params is None:
                raise AttributeError(
                    "cannot assign parameters before Module.__init__() call")
            remove_from(self.__dict__, self._buffers, self._modules)
            self.register_parameter(name, value)
        # Reassigning an existing param name (only None gets past the check)
        elif params is not None and name in params:
            if value is not None:
                raise TypeError("cannot assign '{}' as parameter '{}' "
                                "(torch.nn.Parameter or None expected)"
                                .format(torch.typename(value), name))
            self.register_parameter(name, value)
        # A new attribute that is not a Parameter
        else:
            modules = self.__dict__.get('_modules')
            ## Case: a Module inside a Module
            if isinstance(value, Module):
                if modules is None:
                    raise AttributeError(
                        "cannot assign module before Module.__init__() call")
                remove_from(self.__dict__, self._parameters, self._buffers)
                modules[name] = value
            ## Reassigning an existing child module name (only None gets past the check)
            elif modules is not None and name in modules:
                if value is not None:
                    raise TypeError("cannot assign '{}' as child module '{}' "
                                    "(torch.nn.Module or None expected)"
                                    .format(torch.typename(value), name))
                modules[name] = value
            ## Otherwise: update an existing buffer, or fall back to a plain attribute
            else:
                buffers = self.__dict__.get('_buffers')
                if buffers is not None and name in buffers:
                    if value is not None and not isinstance(value, torch.Tensor):
                        raise TypeError("cannot assign '{}' as buffer '{}' "
                                        "(torch.Tensor or None expected)"
                                        .format(torch.typename(value), name))
                    buffers[name] = value
                else:
                    object.__setattr__(self, name, value)

2018-10-14

PyTorch's learning rate schedulers

PyTorch has a utility for dynamically changing the optimizer's learning rate.
This utility is called a scheduler, and the classes are named ○○LR.

Types of scheduler

They are classified by how they update the rate, as follows.

LambdaLR:
StepLR: multiplies lr by γ every x epochs (e.g. by 0.1 every 30 epochs)
MultiStepLR: like StepLR, but the epoch boundaries can be set freely
ExponentialLR: multiplies lr by γ every epoch, i.e. exponential decay
CosineAnnealingLR:
ReduceLROnPlateau:
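
The three decay rules described above can be written as closed-form functions of the epoch. This is a plain-Python sketch of the arithmetic, not the PyTorch classes themselves; the function names `step_lr`, `multi_step_lr`, and `exponential_lr` are hypothetical.

```python
from bisect import bisect_right


def step_lr(base_lr, gamma, step_size, epoch):
    """StepLR-style rule: multiply by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def multi_step_lr(base_lr, gamma, milestones, epoch):
    """MultiStepLR-style rule: multiply by gamma at each milestone passed."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


def exponential_lr(base_lr, gamma, epoch):
    """ExponentialLR-style rule: multiply by gamma every epoch."""
    return base_lr * gamma ** epoch


for epoch in (0, 29, 30, 80):
    print(epoch,
          step_lr(0.1, 0.1, 30, epoch),
          multi_step_lr(0.1, 0.1, [30, 80], epoch),
          exponential_lr(0.1, 0.9, epoch))
```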

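Behavior

A scheduler is created as an instance that holds the optimizer, and scheduler.step() notifies it that an epoch has passed. The optimizer's learning rate is then updated automatically according to the configured epochs.

Implementation

_LRScheduler is the base class.
step() increments the internal epoch counter and then updates param_group["lr"] in optimizer.param_groups with the result of get_lr().

LambdaLR

It appears that lr becomes base_lr * lambda(epoch).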
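
The step() flow and the LambdaLR rule can be imitated without PyTorch. The following is a toy sketch under assumptions: `ToyOptimizer` and `ToyLambdaLR` are hypothetical classes that only mirror the behavior described above (a param_groups list of dicts, an internal epoch counter, get_lr()), and they are not the library's implementation.

```python
class ToyOptimizer:
    """Hypothetical stand-in that exposes only param_groups."""
    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]


class ToyLambdaLR:
    """Toy scheduler: lr = base_lr * lr_lambda(epoch)."""
    def __init__(self, optimizer, lr_lambda):
        self.optimizer = optimizer
        self.lr_lambda = lr_lambda
        # Remember the initial lr of every group as its base_lr.
        self.base_lrs = [group["lr"] for group in optimizer.param_groups]
        self.last_epoch = 0

    def get_lr(self):
        return [base * self.lr_lambda(self.last_epoch) for base in self.base_lrs]

    def step(self):
        # Advance the internal epoch, then write the new lr into each group.
        self.last_epoch += 1
        for group, lr in zip(self.optimizer.param_groups, self.get_lr()):
            group["lr"] = lr


opt = ToyOptimizer(lr=0.1)
sched = ToyLambdaLR(opt, lr_lambda=lambda epoch: 0.5 ** epoch)

history = []
for _ in range(3):
    sched.step()  # one call per finished epoch
    history.append(opt.param_groups[0]["lr"])
print(history)
```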

2018-10-13

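mnet: basic philosophy

A network is a network

The essence of a network lies in its structure.
Training is a separate matter.
The same network structure can be trained in different ways (backprop + optimizer vs. non-BP methods).
Those who only want to run inference do not need the training machinery in the first place.
PyTorch itself strongly supports describing network structure, so that part is left to PyTorch.

Training can be standardized

Training can be broadly divided into:

Inference by the model (forward pass)
Loss computation using a loss function
Parameter update by the optimizer

If the model, loss function, and optimizer are abstracted as such, they can be supplied through dependency injection (DI).
GAN training is a special case, so it is abstracted separately as train_GAN.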
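
The three-part split above can be sketched as a single function that receives the model, loss function, and optimizer from outside. This is a plain-Python toy and not mnet's actual code: `ScalarLinear`, `squared_error`, `FiniteDiffSGD`, and `train_step` are hypothetical names, and the optimizer estimates gradients numerically instead of using backprop, purely to keep the example dependency-free.

```python
class ScalarLinear:
    """Toy model: y = w * x with a single parameter."""
    def __init__(self, w=0.0):
        self.params = [w]

    def __call__(self, x):
        return self.params[0] * x


def squared_error(pred, target):
    return (pred - target) ** 2


class FiniteDiffSGD:
    """Toy optimizer: central-difference gradient, then an SGD update."""
    def __init__(self, lr=0.1, eps=1e-6):
        self.lr = lr
        self.eps = eps

    def update(self, model, loss_of_model):
        # Parameters are updated one at a time, in order.
        for i in range(len(model.params)):
            original = model.params[i]
            model.params[i] = original + self.eps
            loss_plus = loss_of_model()
            model.params[i] = original - self.eps
            loss_minus = loss_of_model()
            grad = (loss_plus - loss_minus) / (2 * self.eps)
            model.params[i] = original - self.lr * grad


def train_step(model, loss_fn, optimizer, x, target):
    """Forward pass, loss computation, parameter update; all three injected."""
    loss = loss_fn(model(x), target)
    optimizer.update(model, lambda: loss_fn(model(x), target))
    return loss


model = ScalarLinear()
optimizer = FiniteDiffSGD(lr=0.1)
losses = [train_step(model, squared_error, optimizer, x=1.0, target=2.0)
          for _ in range(50)]
print(losses[0], losses[-1], model.params[0])
```

Because `train_step` only depends on the three interfaces, a different model, loss, or update rule could be swapped in without touching the loop.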
