Various updates to graphium #449
Conversation
Codecov Report
@@ Coverage Diff @@
## main #449 +/- ##
==========================================
+ Coverage 64.60% 64.62% +0.01%
==========================================
Files 93 93
Lines 8417 8423 +6
==========================================
+ Hits 5438 5443 +5
- Misses 2979 2980 +1
Flags with carried forward coverage won't be shown.
@maciej-sypetkowski, do you think that moving the val/test metric computations to CPU would significantly hurt parallelization? We are currently constrained because we don't use torchmetrics' ability to update stats incrementally, due to its inability to handle NaNs.
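To make the trade-off concrete, here is a hedged illustration (not graphium code; the tensor names are invented): moving each batch's predictions off-device frees GPU memory early, but every transfer forces a device synchronization, which is where the parallelization cost would come from.

import torch

# Stand-in for one batch of on-device predictions (names are made up).
preds = torch.randn(32, 10)

# Eagerly moving results to CPU frees accelerator memory sooner, but when the
# source tensor lives on a GPU, .cpu() blocks until the producing kernels
# finish, serializing work that could otherwise overlap across batches.
preds_cpu = preds.detach().cpu()
print(preds_cpu.device)  # cpu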
All those yaml files seem like duplicates of the files in the neurips2023 folder. What do you think? Is there a reason to have them separately?
I curated the configs to make the LargeMix configs easy to use within our hydra logic. In particular, the modularity of the added files will facilitate finetuning models pretrained on LargeMix. As far as I can see, the neurips2023 configs can only be used for training on LargeMix, and one needs to write new configs for, e.g., finetuning or preparing data.
graphium/config/_loader.py (Outdated)
if "model_checkpoint" in cfg_trainer.keys():
    dirpath = cfg_trainer["model_checkpoint"]["dirpath"]
    path = fs.join(dirpath, path)
    dirpath = cfg_trainer["model_checkpoint"]["dirpath"] + str(cfg_trainer["seed"]) + "/"
Can you create a unique identifier that you can associate with a specific run? Perhaps take the identifier from Wandb?
Unless you want to add an ID that is only programmatically accessible, I would leave it to the config and not hard-code the seed in the path, since where you want to save the models is pretty subjective. Note that setting the dirpath to include the seed can currently also be achieved by changing the config, either directly or through the CLI.
We have the issue that we want identifiable checkpoints, and non-overwriting checkpoints. How do you suggest we move forward?
I understand the use-case!
As I said before, I suggest using the config. I would set a sensible default, but then leave it up to the user.
E.g. the line that is hard-coded here can already be configured with hydra's override syntax, using OmegaConf's interpolation:
graphium-train ... "trainer.model_checkpoint.dirpath='/path/to/checkpoints/${constants.seed}'"
In case of a sweep, you might want to use something more specific, e.g.:
graphium-train -m ... "trainer.model_checkpoint.dirpath='/path/to/checkpoints/${now:%Y-%m-%d_%H-%M-%S}'"
To prevent most users from having to set a value themselves, you could include a sensible default in the config files. Current defaults (e.g. models_checkpoints/neurips2023-small-gcn/) are too specific and not consistent. I would e.g. save to the current working dir by default:
trainer:
  model_checkpoint:
    dirpath: ${hydra.run.dir}
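As a side note, a minimal sketch (independent of graphium; the values are invented) of how OmegaConf resolves such an interpolation at access time:

from omegaconf import OmegaConf

# Minimal config mirroring the keys used above; the seed value is made up.
cfg = OmegaConf.create(
    {
        "constants": {"seed": 42},
        "trainer": {
            "model_checkpoint": {"dirpath": "/path/to/checkpoints/${constants.seed}"}
        },
    }
)

# The ${constants.seed} interpolation is resolved when the value is accessed.
print(cfg.trainer.model_checkpoint.dirpath)  # /path/to/checkpoints/42

Note that the ${now:...} form relies on a resolver that hydra registers, so it only resolves inside a hydra run.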
Thanks for the suggestion. I'm still getting used to hydra and didn't know you could put the date-time directly in the YAML! What do you think of this commit, making it the default in all the hydra configs?
I agree that we have to leave the checkpoint path configuration somewhat up to the user.
However, the seed is an important culprit leading to overwriting (or unclear identity) of model checkpoints. I appended it to the dirpath because we also used to have the following in load_trainer:
if "model_checkpoint" in cfg_trainer.keys():
    cfg_trainer["model_checkpoint"]["dirpath"] += str(cfg_trainer["seed"]) + "/"
    callbacks.append(ModelCheckpoint(**cfg_trainer["model_checkpoint"]))
I am using graphium-train ... "trainer.model_checkpoint.dirpath='/path/to/checkpoints/${constants.seed}'" as proposed by Cas to reasonably structure checkpoints.
Why not use the full date and time? It is much better than the random seed: it avoids collisions when doing sweeps with pre-selected seeds, and it is also interpretable.
Yes, I agree. It was necessary at the time the branch was created because the seed was hard-coded into the dirpath back then... I think it is good as of now using the date :)
I changed the function discussed here to its current state in the main branch. The function was outdated in this branch because we added it to #441, where it was modified to accommodate further changes :)
@@ -55,8 +55,7 @@ def test_ogb_datamodule(self):
         rm(TEMP_CACHE_DATA_PATH, recursive=True)

         # Reset the datamodule
-        ds._data_is_prepared = False
-        ds._data_is_cached = False
+        ds = GraphOGBDataModule(task_specific_args, **dm_args)
What's the purpose of this??
This is properly resetting the datamodule. The old code didn't reset self.train_ds, etc., which could lead to wrong testing since, e.g., setup checks if self.train_ds is None.
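To illustrate the failure mode (a toy class, not graphium's actual datamodule): if setup() short-circuits on a non-None train_ds, flipping the cache flags alone leaves stale data in place, whereas re-instantiating resets everything.

class ToyDataModule:
    """Toy stand-in that mimics the setup() guard described above."""

    def __init__(self):
        self.train_ds = None
        self._data_is_prepared = False

    def setup(self):
        # Guard equivalent to `if self.train_ds is None`: with a stale
        # train_ds this is a no-op, so resetting the flags is not enough.
        if self.train_ds is None:
            self.train_ds = ["fresh data"]


dm = ToyDataModule()
dm.setup()
dm._data_is_prepared = False  # old "reset": train_ds stays populated
dm = ToyDataModule()          # new "reset": every attribute starts clean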
I think it should be fine; however, the predictions are still kept on GPUs if I see correctly (the casting is only in
It can be done in torchmetrics; you just need to implement a wrapper for ignoring NaNs. We'll probably want to do this when we add better multi-device support.
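For reference, a minimal sketch of such a wrapper (the class name and design are my own, not an existing torchmetrics or graphium API): it masks out NaN targets before delegating to the wrapped metric, so the usual update()/compute() accumulation keeps working.

import torch
from torchmetrics import Metric


class NaNIgnoringMetric(Metric):
    """Hypothetical wrapper that drops (pred, target) pairs with NaN targets."""

    def __init__(self, metric: Metric):
        super().__init__()
        self.metric = metric  # registered as a submodule, states included

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # Keep only entries whose target is not NaN, then delegate.
        mask = ~torch.isnan(target)
        self.metric.update(preds[mask], target[mask])

    def compute(self) -> torch.Tensor:
        return self.metric.compute()

    def reset(self) -> None:
        self.metric.reset()
        super().reset()

Something like NaNIgnoringMetric(torchmetrics.MeanSquaredError()) could then be updated per batch on-device and computed once per epoch.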
We already have the
@DomInvivo I think we can merge. I suggest opening a new branch for the optimizations discussed with @maciej-sypetkowski
Changelogs
- ... datamodule._make_multitask_dataset() ... once)
- ... hydra ...
- ... hydra ...
- ... cpu ... to save gpu memory

Remarks
Regarding (3): Single dataset runs are specific to the NeurIPS experiment, where we use the same model as for LargeMix, i.e., run them as follows: ...
Regarding (5): Needs to be properly changed to typer following the comment by @cwognum (incl. changing the documentation) once the branch has been updated with main.

Checklist:
[ ] Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
[ ] ... feature, fix or test (or ask a maintainer to do it for you).