PyTorch Profilers - たれぱんのびぼーろく

PyTorch Lightning:

| Class | device | unit | internal |
|---|---|---|---|
| `SimpleProfiler` | | Lightning's action | |
| `PyTorchProfiler` | | | PyTorch's Autograd Profiler |
| `AdvancedProfiler` | | | |
| `XLAProfiler` | TPU | | |
| (`PassThroughProfiler`) | any | | (default) Pass-through, does nothing |
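
As a rough sketch of how one of these profilers gets selected: `Trainer` takes a `profiler` argument, which can be a profiler instance or a shorthand string. The import path `lightning.pytorch` assumes Lightning 2.x (older installs use `pytorch_lightning`), and `PROFILER_SHORTHANDS` and `build_trainer` are hypothetical names used only for illustration.

```python
# Shorthand strings for Trainer(profiler=...) and the class each one selects.
# Assumption: these match the Lightning version in use.
PROFILER_SHORTHANDS = {
    "simple": "SimpleProfiler",
    "advanced": "AdvancedProfiler",
    "pytorch": "PyTorchProfiler",
    "xla": "XLAProfiler",
}


def build_trainer(profiler="simple", **trainer_kwargs):
    """Create a Trainer with the given profiler (shorthand string or instance)."""
    # Imported lazily so this sketch can be loaded without Lightning installed.
    import lightning.pytorch as pl

    return pl.Trainer(profiler=profiler, **trainer_kwargs)
```

Leaving `profiler` unset corresponds to the `PassThroughProfiler` row: nothing is measured.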

SimpleProfiler and DataLoader

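The time spent loading data is recorded as the `[_TrainingEpochLoop].train_dataloader_next` action.
A DataLoader with workers prefetches batches, so ideally this action takes almost zero time.
If it is noticeably non-zero, the data supply is not keeping up with training.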
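
To see why prefetching pushes this action toward zero, here is a minimal sketch using only the standard library. A background thread stands in for DataLoader workers. `LOAD_SECONDS`, `STEP_SECONDS`, `prefetching`, and `time_spent_fetching` are hypothetical names for illustration, not Lightning or PyTorch APIs.

```python
import queue
import threading
import time

LOAD_SECONDS = 0.02  # hypothetical cost of producing one batch
STEP_SECONDS = 0.04  # hypothetical cost of one training step
NUM_BATCHES = 5


def slow_batches():
    """Yield batches, each taking LOAD_SECONDS to produce."""
    for i in range(NUM_BATCHES):
        time.sleep(LOAD_SECONDS)
        yield i


def prefetching(iterable, depth=2):
    """Produce items in a background thread, roughly like DataLoader workers."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def worker():
        for item in iterable:
            q.put(item)
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def time_spent_fetching(batches):
    """Sum the wall-clock time spent inside next(), as the profiler action would."""
    iterator = iter(batches)
    waited = 0.0
    while True:
        start = time.perf_counter()
        try:
            next(iterator)
        except StopIteration:
            break
        waited += time.perf_counter() - start
        time.sleep(STEP_SECONDS)  # stand-in for training_step
    return waited


synchronous_wait = time_spent_fetching(slow_batches())
prefetched_wait = time_spent_fetching(prefetching(slow_batches()))
print(f"no prefetch: {synchronous_wait:.3f}s, prefetch: {prefetched_wait:.3f}s")
```

With prefetching, only the first batch has to be waited for, because later batches are produced while the step runs. If loading a batch took longer than a step, the wait would stay non-zero even with prefetching.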

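The relevant code is buried deep inside the Trainer, so following the implementation is tedious.
The call chain is as follows; the profiler is hooked in during the setup phase of `Trainer.fit`.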

<Trainer>.__init__(...):
    self.fit_loop = _FitLoop(...)

<Trainer>.fit(...):
    call._call_and_handle_interrupt(self._fit_impl, ...)

<Trainer>._fit_impl(...):
    self._run(model, ...)

<Trainer>._run(model, ...):
    results = self._run_stage()

<Trainer>._run_stage():
    self.fit_loop.run()


<_FitLoop>.__init__(...):
    self.epoch_loop = _TrainingEpochLoop(trainer)

<_FitLoop>.run():
    while not self.done:
        self.advance()

<_FitLoop>.advance():
    self.epoch_loop.run(self._data_fetcher)


<_TrainingEpochLoop>.run(data_fetcher):
    self.on_run_start(data_fetcher)
    while not self.done:
        self.advance(data_fetcher)

<_TrainingEpochLoop>.on_run_start(data_fetcher):
    data_fetcher._start_profiler = self._on_before_fetch
    data_fetcher._stop_profiler  = self._on_after_fetch
<_TrainingEpochLoop>._on_before_fetch():
    self.trainer.profiler.start(f"[{self.__class__.__name__}].train_dataloader_next")
<_TrainingEpochLoop>._on_after_fetch():
    self.trainer.profiler.stop(f"[{self.__class__.__name__}].train_dataloader_next")

<_TrainingEpochLoop>.advance(data_fetcher):
    next(data_fetcher)


<_DataFetcher>.__next__():
    self._start_profiler()
    data = next(self.iterator)
    self._stop_profiler()
    return data

<_PrefetchDataFetcher>.__next__():
    batch = self.batches.pop(0)
    self._fetch_next_batch(self.iterator)
    return batch

<_PrefetchDataFetcher>._fetch_next_batch(iterator):
    self._start_profiler()
    batch = next(iterator)
    self._stop_profiler()
    self.batches.append(batch)
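
The hook wiring traced above can be reproduced in a self-contained sketch. `TinyProfiler` and `TinyPrefetchFetcher` are hypothetical simplified stand-ins, not Lightning's classes; end-of-data handling in particular is my own simplification.

```python
import time
from collections import defaultdict


class TinyProfiler:
    """Hypothetical stand-in for a profiler: records durations per action name."""

    def __init__(self):
        self._started = {}
        self.durations = defaultdict(list)

    def start(self, action_name):
        self._started[action_name] = time.perf_counter()

    def stop(self, action_name):
        elapsed = time.perf_counter() - self._started.pop(action_name)
        self.durations[action_name].append(elapsed)


class TinyPrefetchFetcher:
    """Keeps one batch ahead; only the underlying next() call is timed."""

    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.batches = []
        self.done = False
        # No-op hooks until a loop installs real ones.
        self._start_profiler = lambda: None
        self._stop_profiler = lambda: None

    def _fetch_next_batch(self):
        self._start_profiler()
        try:
            batch = next(self.iterator)
        except StopIteration:
            self.done = True
            return
        finally:
            self._stop_profiler()  # always close the action, even at end of data
        self.batches.append(batch)

    def __iter__(self):
        return self

    def __next__(self):
        if not self.batches and not self.done:
            self._fetch_next_batch()  # initial fill
        if not self.batches:
            raise StopIteration
        batch = self.batches.pop(0)
        if not self.done:
            self._fetch_next_batch()  # prefetch the following batch
        return batch


# The loop installs the hooks on the fetcher, as on_run_start does in the trace.
profiler = TinyProfiler()
action = "[TinyLoop].train_dataloader_next"
fetcher = TinyPrefetchFetcher(range(3))
fetcher._start_profiler = lambda: profiler.start(action)
fetcher._stop_profiler = lambda: profiler.stop(action)

batches = list(fetcher)
print(batches, len(profiler.durations[action]))
```

The action is recorded once per underlying `next` call, so what it measures is how long the fetcher waited on the iterator, not how long the workers spent producing the batch.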

On the initialization of `_data_fetcher`:

<_FitLoop>.__init__(...):
    self._combined_loader = None
    self._data_fetcher = None

<_FitLoop>.run():
    self.setup_data()
    self.reset()
    self.on_run_start()
    while not self.done:
        self.on_advance_start()
        self.advance()
        self.on_advance_end()
    self.on_run_end()

<_FitLoop>.on_run_start():
    self._data_fetcher = _select_data_fetcher(trainer)

<_FitLoop>.advance():
    combined_loader = self._combined_loader
    self._data_fetcher.setup(combined_loader)
    self.epoch_loop.run(self._data_fetcher)


def _select_data_fetcher(trainer) -> _DataFetcher:
    lightning_module = trainer.lightning_module
    if trainer.testing:
        step_fx_name = "test_step"
    elif trainer.training:
        step_fx_name = "training_step"
    elif trainer.validating or trainer.sanity_checking:
        step_fx_name = "validation_step"
    elif trainer.predicting:
        step_fx_name = "predict_step"
    else:
        raise RuntimeError(f"DataFetcher is unsupported for {trainer.state.stage}")
    step_fx = getattr(lightning_module, step_fx_name)
    if is_param_in_hook_signature(step_fx, "dataloader_iter", explicit=True):
        rank_zero_warn(
            f"Found `dataloader_iter` argument in the `{step_fx_name}`. Note that the support for "
            "this signature is experimental and the behavior is subject to change."
        )
        return _DataLoaderIterDataFetcher()
    return _PrefetchDataFetcher()
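
The branch on `dataloader_iter` comes down to inspecting the step function's signature. Below is a minimal sketch of that idea with the standard `inspect` module. `has_explicit_param`, `pick_fetcher_name`, `BatchModule`, and `IterModule` are hypothetical names, and the check only approximates what `is_param_in_hook_signature(..., explicit=True)` does.

```python
import inspect


def has_explicit_param(fn, name):
    """Return True if `fn` declares `name` as a named parameter (not *args/**kwargs)."""
    param = inspect.signature(fn).parameters.get(name)
    return param is not None and param.kind not in (
        inspect.Parameter.VAR_POSITIONAL,
        inspect.Parameter.VAR_KEYWORD,
    )


class BatchModule:
    def training_step(self, batch, batch_idx):
        return batch


class IterModule:
    def training_step(self, dataloader_iter):
        return next(dataloader_iter)


def pick_fetcher_name(module):
    """Mirror the branch above, returning the fetcher class name as a string."""
    if has_explicit_param(module.training_step, "dataloader_iter"):
        return "_DataLoaderIterDataFetcher"
    return "_PrefetchDataFetcher"


print(pick_fetcher_name(BatchModule()))
print(pick_fetcher_name(IterModule()))
```

A module with the usual `training_step(self, batch, batch_idx)` signature therefore goes through the prefetching fetcher, which is the path where `train_dataloader_next` is timed as traced earlier.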