DDP-related improvements to data module and logging #594
Conversation
Codecov Report
Patch coverage:
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #594      +/-   ##
==========================================
- Coverage   83.75%   83.24%   -0.51%
==========================================
  Files          11       11
  Lines         357      376      +19
==========================================
+ Hits          299      313      +14
- Misses         58       63       +5

☔ View full report in Codecov by Sentry.
Seems like the checks are using Python 3.8. Is there a way to make them use 3.10?
Also, I just realized I might need to take this issue into account: Lightning-AI/pytorch-lightning#12862. This affects the train script because…
I think this warning is quite important to deal with, since we wouldn't want inaccurate test metrics to be reported because of accidentally re-using the same trainer that was initialized with a DDP strategy. A possible solution is to initialize a separate trainer just for testing. The only issue I can think of with this approach is that it creates a new Logger for testing, meaning you won't have all your train, validation, and test results neatly presented in a single log (e.g., a single TensorBoard, WandB, or Aim log). Lightning doesn't save Logger objects in checkpoints, so they can't be restored from checkpoints (I'm aware of your PR for this, but it seems to be specific to the TensorBoard Logger).
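For illustration, here is a minimal sketch of that workaround, assuming a standard PyTorch Lightning setup. The trainer arguments and the idea of passing the existing logger to the test trainer are assumptions on my part, not code from this PR:

```python
import pytorch_lightning as pl


def fit_then_test(model, datamodule, logger):
    # Train with DDP across several devices.
    fit_trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", logger=logger)
    fit_trainer.fit(model, datamodule=datamodule)

    # Build a fresh single-device trainer for testing so the DDP-initialized
    # trainer is never re-used (the concern from Lightning-AI/pytorch-lightning#12862).
    # Passing the same logger object is one way to keep results in a single log.
    test_trainer = pl.Trainer(accelerator="gpu", devices=1, logger=logger)
    test_trainer.test(model, datamodule=datamodule)
```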
- I like this new logger. What's the default way for the user to limit the logging to only the master rank? Can you add something like an optional `log_master_only` arg to `get_ranked_pylogger(...)`? (A sketch of this idea is included after this list.)
- Sure, we can have this.
- Good idea. Are you sure logging to the same file from many processes doesn't need synchronisation? Won't there be any conflicts leading to logs getting lost sometimes?
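For reference, a minimal sketch of what a rank-aware logger with a `log_master_only` option could look like. The function name comes from the thread above, but the body is my assumption and not the implementation in this PR; the rank is read from the `RANK` environment variable set by torchrun/DDP launchers.

```python
import logging
import os


def get_ranked_pylogger(name: str, log_master_only: bool = False) -> logging.Logger:
    """Return a logger whose records are prefixed with the process rank."""
    rank = int(os.environ.get("RANK", 0))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    if log_master_only and rank != 0:
        # Silence non-master ranks by raising the level above CRITICAL.
        logger.setLevel(logging.CRITICAL + 1)
        return logger

    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter(f"[rank {rank}] %(asctime)s %(levelname)s %(name)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger
```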
Getting to it now! Sorry for the hold-up.
yea
…ter, avoiding weird hijacking of logging functions.
Added
LGTM
I forgot to report back about the multiprocessing logging concern. In short, in the context of distributed logging where it's OK to have interleaved logs from multiple processes (i.e., you're not expecting a guaranteed ordering of logs across processes), everything seems to be fine. Check this Stack Overflow post for more info: https://stackoverflow.com/questions/12238848/python-multiprocessinglogging-filehandler. Furthermore, there were some discussions about Hydra configuring its logging setup to support logging in a distributed setting, but they didn't result in any concrete change: facebookresearch/hydra#1148. From what I and others have experienced, nothing bad has happened to our logs when logging from multiple processes to a single file. That being said, it might be worth looking into the officially recommended way of logging to a single file from multiple processes: https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
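For completeness, a small sketch of the queue-based pattern that cookbook recipe describes, where worker processes hand records to a single listener that is the only writer to the file. This is just an illustration of the recipe, not something added in this PR; the file name `train.log` is reused here only as an example.

```python
import logging
import logging.handlers
import multiprocessing


def worker(queue, worker_id):
    # Each worker only pushes records onto the shared queue.
    logger = logging.getLogger(f"worker.{worker_id}")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.info("hello from worker %d", worker_id)


if __name__ == "__main__":
    queue = multiprocessing.Queue()
    file_handler = logging.FileHandler("train.log")
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(processName)s %(levelname)s %(message)s")
    )

    # The listener drains the queue and is the only writer to the file,
    # so the workers need no extra synchronisation.
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()

    processes = [multiprocessing.Process(target=worker, args=(queue, i)) for i in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    listener.stop()
```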
What does this PR do?
`train.log` would be created for the rank 0 process and `train_ddp_process_{rank}.log` would be created for all the other ranks, making it confusing to read through logs in a DDP setup.
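As a purely illustrative aside (an assumption about one possible approach, not the code in this PR), sending every rank to the same `train.log` can be as simple as attaching a shared file handler that tags each record with its rank; the helper name `add_shared_file_handler` is hypothetical:

```python
import logging
import os


def add_shared_file_handler(log_dir: str = ".") -> None:
    # Every rank appends to the same train.log; records carry the rank so
    # interleaved lines from different processes remain distinguishable.
    rank = int(os.environ.get("RANK", 0))
    handler = logging.FileHandler(os.path.join(log_dir, "train.log"))
    handler.setFormatter(
        logging.Formatter(f"[rank {rank}] %(asctime)s %(levelname)s %(name)s - %(message)s")
    )
    logging.getLogger().addHandler(handler)
```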
Before submitting
- Did you run the `pytest` command?
- Did you run the `pre-commit run -a` command?

Did you have fun?
y
Make sure you had fun coding 🙃