Description
🐛 Bug
The class `pytorch_lightning.profiler.PyTorchProfiler` does not work correctly with the `schedule` parameter: it ignores the `repeat` argument of `torch.profiler.schedule`, which is important for long-running training.
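For reference, `torch.profiler.schedule` returns a callable that maps a global step index to a profiler action, cycling through `wait → warmup → active` phases `repeat` times. The expected behavior can be sketched in plain Python (a simplified reimplementation for illustration only, not the actual torch code; real torch additionally distinguishes a `RECORD_AND_SAVE` action on the last active step):

```python
def schedule_action(step, wait, warmup, active, repeat):
    """Simplified sketch of torch.profiler.schedule semantics:
    each cycle is wait + warmup + active steps long, repeated
    `repeat` times; after that the profiler stays idle."""
    cycle_len = wait + warmup + active
    if repeat > 0 and step >= cycle_len * repeat:
        return "NONE"           # all requested cycles are done
    pos = step % cycle_len      # position inside the current cycle
    if pos < wait:
        return "WAIT"
    if pos < wait + warmup:
        return "WARMUP"
    return "RECORD"             # an "active" step that gets recorded

# With wait=2, warmup=1, active=3, repeat=5 the profiler should
# record 3 active steps in each of the 5 cycles: 15 steps total.
actions = [schedule_action(s, 2, 1, 3, 5) for s in range(40)]
print(actions.count("RECORD"))  # expected: 15
```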
To Reproduce
Reproduce with the BoringModel:
UPD: new link with an example of the incorrect behavior, and of the behavior after the commit changes:
https://colab.research.google.com/drive/1UbbLx5N5Th0MsXu1olQwWqo7QLGd-lRY?usp=sharing
Expected behavior
I used `pytorch_lightning.profiler.PyTorchProfiler` with `schedule=torch.profiler.schedule(wait=2, warmup=1, active=3, repeat=5)`. According to the torch docs, I expect 5 cycles, each consisting of 2 wait + 1 warmup + 3 active = 6 steps (per cycle), but in fact `PyTorchProfiler` records information about fewer cycles.
In code terms:
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import PyTorchProfiler

profiler = PyTorchProfiler(
    schedule=torch.profiler.schedule(
        wait=2,
        warmup=1,
        active=3,
        repeat=5,
    ),
)
model = BoringModel()
trainer = Trainer(
    max_epochs=1,
    profiler=profiler,
)
trainer.fit(model, train_dataloaders=train_data)
In this case the profiler should return information about 15 steps (3 active * 5 cycles); however, it returns information about fewer steps because it does not record some of the cycles.
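For all 5 cycles to complete, the training run must also be long enough: (wait + warmup + active) * repeat profiler steps in total. A quick sanity check of the arithmetic (plain Python, assuming one profiler step per training batch):

```python
wait, warmup, active, repeat = 2, 1, 3, 5

steps_per_cycle = wait + warmup + active       # 2 + 1 + 3 = 6
total_steps_needed = steps_per_cycle * repeat  # 6 * 5 = 30 batches
recorded_steps = active * repeat               # 3 * 5 = 15 recorded

print(total_steps_needed, recorded_steps)  # 30 15
```

So a run of at least 30 training batches should yield 15 recorded steps; fewer than 15 indicates dropped cycles.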
Environment
* CUDA:
- GPU:
- Tesla T4
- available: True
- version: 11.3
* Packages:
- lightning: None
- lightning_app: None
- numpy: 1.21.6
- pyTorch_debug: False
- pyTorch_version: 1.12.0+cu113
- pytorch-lightning: 1.7.0
- tqdm: 4.64.0
* System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.13
- version: #1 SMP Sun Apr 24 10:03:06 PDT 2022
Additional context
I found the same issue: #12611.
cc @carmocca @kaushikb11 @ninginthecloud @rohitgr7 @nbcsm @guotuofeng