
Various updates to graphium #449

Merged
31 commits merged into main on Sep 1, 2023

Conversation

@WenkelF (Collaborator) commented Aug 28, 2023

Changelogs

  1. Further improvement of the caching logic (only calling datamodule._make_multitask_dataset() once)
  2. Integrated LargeMix into hydra
  3. Integrated single datasets (LargeMix) into hydra
  4. Computing val/test metrics on CPU to save GPU memory (see the sketch below this list)
  5. Implemented testing using a model checkpoint
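Regarding (4), a minimal sketch of the general pattern, with illustrative names rather than the actual graphium internals: detach the predictions and targets and move them to CPU before the epoch-level metrics are computed, so metric computation does not hold on to GPU memory.

# Minimal sketch only -- compute_epoch_metrics and the metrics dict are
# illustrative names, not the actual graphium API.
import torch

def compute_epoch_metrics(preds: torch.Tensor, targets: torch.Tensor, metrics: dict) -> dict:
    # Move tensors off the GPU so metric computation does not consume GPU memory.
    preds = preds.detach().cpu()
    targets = targets.detach().cpu()
    return {name: metric_fn(preds, targets) for name, metric_fn in metrics.items()}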

Remarks

Regarding (3): Single-dataset runs are specific to the NeurIPS experiment, where we use the same model as for LargeMix, i.e., run them as follows:

graphium-train \
    architecture=largemix \
    tasks=l1000_mcf7 \
    training=largemix

Regarding (5): This needs to be properly changed to typer, following the comment by @cwognum (incl. updating the documentation), once the branch has been updated with main.
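For reference, a rough sketch of what a typer-based test entry point could look like; the command shape and option names below are illustrative assumptions, not the actual graphium CLI, and the loading/testing logic itself is omitted.

# Hypothetical sketch only -- the option names and overall shape are assumptions.
import typer

app = typer.Typer()

@app.command()
def test(
    checkpoint_path: str = typer.Option(..., help="Path to the model checkpoint (.ckpt)"),
    config_name: str = typer.Option("main", help="Name of the hydra config to load"),
) -> None:
    """Run the test loop from a saved checkpoint instead of training from scratch."""
    # Load the hydra config, restore the model from checkpoint_path, and call
    # trainer.test(...) -- omitted here, this only sketches the CLI surface.
    ...

if __name__ == "__main__":
    app()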


Checklist:

  • [ ] Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

@WenkelF added the fix (PR that fixes an issue) and feature labels on Aug 28, 2023
codecov bot commented Aug 28, 2023

Codecov Report

Merging #449 (3cf2fb5) into main (3aa098b) will increase coverage by 0.01%.
Report is 3 commits behind head on main.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##             main     #449      +/-   ##
==========================================
+ Coverage   64.60%   64.62%   +0.01%     
==========================================
  Files          93       93              
  Lines        8417     8423       +6     
==========================================
+ Hits         5438     5443       +5     
- Misses       2979     2980       +1     
Flag        Coverage Δ
unittests   64.62% <66.66%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Components  Coverage Δ
ipu         49.14% <ø> (ø)

@DomInvivo (Collaborator) commented:

@maciej-sypetkowski, do you think that moving the val/test metrics computations to CPU would significantly hurt parallelization? We are currently constrained as we don't use the torchmetrics ability to update stats due to their inability to handle NaNs.

@DomInvivo linked an issue on Aug 29, 2023 that may be closed by this pull request.
Collaborator:

All those yaml files seem like duplicate from the files in the neurips2023 folder. What do you think? Is there a reason to have them separately?

@WenkelF (Collaborator, Author) replied on Aug 31, 2023:

I curated the configs to make the LargeMix configs easy to use within our hydra logic. In particular, the modularity of the added files will facilitate finetuning of pretrained models on LargeMix. As far as I see, the neurips2023 configs can only be used for training on LargeMix, and one would need to write new configs for, e.g., finetuning or preparing data.
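To illustrate the kind of composition this enables, a hypothetical top-level config could pull the LargeMix pieces in through a hydra defaults list; the group and file names below are assumptions inferred from the CLI overrides above, not the actual file layout:

# Hypothetical top-level config -- group/file names are assumptions.
defaults:
  - architecture: largemix
  - tasks: l1000_mcf7
  - training: largemix
  - _self_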

if "model_checkpoint" in cfg_trainer.keys():
dirpath = cfg_trainer["model_checkpoint"]["dirpath"]
path = fs.join(dirpath, path)
dirpath = cfg_trainer["model_checkpoint"]["dirpath"] + str(cfg_trainer["seed"]) + "/"
Collaborator:

Can you create a unique identifier that you can associate with a specific run? Perhaps take the identifier from Wandb?

Collaborator:

Unless you want to add an ID that is only programmatically accessible, I would leave it to the config and not hard-code the seed in the path since it is pretty subjective where you want to save the models. Note that setting the dirpath to include the seed can currently also be achieved by changing the config - either directly or through the CLI.

Collaborator:

We have the issue that we want identifiable checkpoints, and non-overwriting checkpoints. How do you suggest we move forward?

Collaborator:

I understand the use-case!

As I said before, I suggest using the config. I would set a sensible default, but then leave it up to the user.

E.g. the line that is hard-coded here can already be configured with hydra's override syntax, using OmegaConf's interpolation:

graphium-train ... "trainer.model_checkpoint.dirpath='/path/to/checkpoints/${constants.seed}'"

In case of a sweep, you might want to use something more specific, e.g.:

graphium-train -m ... "trainer.model_checkpoint.dirpath='/path/to/checkpoints/${now:%Y-%m-%d_%H-%M-%S}'"

To prevent most users from having to set a value themselves, you could include a sensible default in the config files. Current defaults (e.g. models_checkpoints/neurips2023-small-gcn/) are too specific and not consistent. I would e.g. save to the current working dir by default:

trainer:
  model_checkpoint:
    dirpath: ${hydra.run.dir}

Collaborator:

Thanks for the suggestion. I'm still getting used to hydra and didn't know you could put the date-time directly in the YAML! What do you think of this commit, making it the default in all the hydra configs?

@WenkelF (Collaborator, Author):

I agree that we have to leave the checkpoint path configuration somewhat up to the user.

However, the seed is an important culprit leading to overwriting (or unclear identity) of model checkpoints. I appended it to the dirpath because we also used to have the following in load_trainer:

if "model_checkpoint" in cfg_trainer.keys():
    cfg_trainer["model_checkpoint"]["dirpath"] += str(cfg_trainer["seed"]) + "/"
    callbacks.append(ModelCheckpoint(**cfg_trainer["model_checkpoint"]))

@WenkelF (Collaborator, Author):

I am using graphium-train ... "trainer.model_checkpoint.dirpath='/path/to/checkpoints/${constants.seed}'" as proposed by Cas to reasonably structure checkpoints.

Collaborator:

Why not use the full date and time? It is much better than the random seed, avoids collisions when doing sweeps with pre-selected seeds, and is also interpretable.

@WenkelF (Collaborator, Author):

Yes, I agree. It was necessary at the time the branch was created because the seed was hard-coded into the dirpath back then... I think it is good now that we are using the date :)

@WenkelF (Collaborator, Author) commented on Aug 31, 2023:

Changed the function discussed here to its current state in the main branch. The function was outdated in this branch because we added it to #441, where it was modified to accommodate further changes :)

Review comments on the following diff:

@@ -55,8 +55,7 @@ def test_ogb_datamodule(self):
         rm(TEMP_CACHE_DATA_PATH, recursive=True)

         # Reset the datamodule
-        ds._data_is_prepared = False
-        ds._data_is_cached = False
+        ds = GraphOGBDataModule(task_specific_args, **dm_args)
Collaborator:

What's the purpose of this??

@WenkelF (Collaborator, Author):

This is properly resetting the datamodule. The old code didn't reset self.train_ds, etc., which could lead to incorrect testing since, e.g., setup() checks whether self.train_ds is None.
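To spell out the failure mode being described, here is an illustrative, simplified sketch (not the actual graphium datamodule code) of the kind of guard in setup() that makes re-instantiating the datamodule the safer reset:

# Illustrative sketch only (simplified, not the actual graphium code).
class DummyDataModule:
    def __init__(self):
        self.train_ds = None

    def setup(self, stage=None):
        # If train_ds was left populated by an earlier run, the datasets are not
        # rebuilt, so a test could silently reuse stale data. Re-instantiating the
        # datamodule (as in the diff above) guarantees a clean state.
        if self.train_ds is None:
            self.train_ds = ["...build datasets here..."]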

@maciej-sypetkowski (Collaborator) commented:

> @maciej-sypetkowski, do you think that moving the val/test metrics computations to CPU would significantly hurt parallelization?

I think it should be fine; however, the predictions are still kept on GPUs if I see correctly (the casting is only done in _general_epoch_end).

> We are currently constrained as we don't use the torchmetrics ability to update stats due to their inability to handle NaNs.

It can be done in torchmetrics; you just need to implement a wrapper for ignoring NaNs. We'll probably want to do this when we add better multi-device support.
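For future reference, a rough sketch of such a NaN-ignoring wrapper; this is an assumption of how it could look, not graphium's existing MetricWrapper nor a torchmetrics built-in. A production version would likely update running statistics instead of storing all predictions, but it shows the idea.

# Rough sketch only -- accumulates predictions/targets at the update level and
# drops entries with NaN targets before computing the wrapped metric.
import torch
from torchmetrics import Metric
from torchmetrics.utilities import dim_zero_cat

class NaNIgnoringMetric(Metric):
    def __init__(self, metric_fn):
        super().__init__()
        self.metric_fn = metric_fn  # e.g. a functional metric such as torchmetrics.functional.mean_squared_error
        self.add_state("preds", default=[], dist_reduce_fx="cat")
        self.add_state("target", default=[], dist_reduce_fx="cat")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        mask = ~torch.isnan(target)  # ignore entries whose target is NaN
        self.preds.append(preds[mask])
        self.target.append(target[mask])

    def compute(self) -> torch.Tensor:
        return self.metric_fn(dim_zero_cat(self.preds), dim_zero_cat(self.target))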

@DomInvivo (Collaborator) commented:

@maciej-sypetkowski

> It can be done in torchmetrics, you just need to implement a wrapper for ignoring nans. We'll probably want to do this when we add better multi-device support.

We already have the MetricWrapper, but it works on the metric level, not on the update and compute level. I guess that's a good place to start looking in the future.

@WenkelF (Collaborator, Author) commented Aug 31, 2023:

@DomInvivo I think we can merge. I suggest opening a new branch for the optimizations discussed with @maciej-sypetkowski

@DomInvivo merged commit 0d24634 into main on Sep 1, 2023.
7 checks passed
Labels: feature, fix (PR that fixes an issue)
4 participants