feat(models,training): multiple datasets (WIP) #441
base: main
…/anemoi-core into refactor/multiple-datasets
for more information, see https://pre-commit.ci
I'm very much looking forward to the features added by this PR! I've just left a suggestion for improved accessibility and a minor comment on typing.
    data_handlers:
      lowres:
        dataset: ${hardware.files.lowres_dataset}
        # processors including imputers and normalizers are applied in order of definition
        normaliser:
          _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
          config: ${data.normalizer}

      hires:
        dataset: ${hardware.files.hires_dataset}
        # processors including imputers and normalizers are applied in order of definition
        normaliser:
          _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
          config: ${data.normalizer}
This is a note carried over from the discussion we had on this work last week.
Suggested change, from:

    data_handlers:
      lowres:
        dataset: ${hardware.files.lowres_dataset}
        # processors including imputers and normalizers are applied in order of definition
        normaliser:
          _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
          config: ${data.normalizer}
      hires:
        dataset: ${hardware.files.hires_dataset}
        # processors including imputers and normalizers are applied in order of definition
        normaliser:
          _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
          config: ${data.normalizer}

to:

    data_handlers:
      - dataset_name: lowres
        dataset: ${hardware.files.lowres_dataset}
        # processors including imputers and normalizers are applied in order of definition
        normaliser:
          _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
          config: ${data.normalizer}
      - dataset_name: hires
        dataset: ${hardware.files.hires_dataset}
        # processors including imputers and normalizers are applied in order of definition
        normaliser:
          _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
          config: ${data.normalizer}

I would recommend avoiding user-defined names as keys in the config. I believe it makes defining schemas easier, can avoid name clashes, and would in general be more consistent with the definition and naming of graphs. Also, giving the key a name makes it easier to look up in the docs.
In general, I would also recommend introducing a kind of "type system" for names in the config. E.g. declaring that the string "lowres" in this example is a "dataset_name" (or similar), which could then be referenced and used as a model.encoder_input_names entry, but which is perhaps different from what "lowres" defines in training/src/anemoi/training/config/graph/downscaling.yaml. Also, separating the different name strings more clearly could help to resolve inconsistencies between referring to things by their string name and by a path like dataset: ${data.data_handlers.lowres.dataset}.
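To make the suggestion concrete, here is a minimal sketch of what a fixed schema for the list-based layout could look like, together with a check that names used elsewhere refer to declared datasets. It is only an illustration: `DatasetName`, `DataHandlerConfig`, `parse_data_handlers`, `check_references` and the dataset paths are hypothetical and not part of anemoi, and a real implementation would presumably live in the existing config schemas.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, NewType

# Hypothetical "typed" name: a plain str at runtime, but a distinct type for type checkers.
DatasetName = NewType("DatasetName", str)


@dataclass(frozen=True)
class DataHandlerConfig:
    """One entry of the list-based `data_handlers` layout (illustrative only)."""

    dataset_name: DatasetName
    dataset: str
    normaliser: dict[str, Any] | None = None


def parse_data_handlers(entries: list[dict[str, Any]]) -> dict[DatasetName, DataHandlerConfig]:
    """Validate the list form and index it by dataset name."""
    handlers: dict[DatasetName, DataHandlerConfig] = {}
    for entry in entries:
        handler = DataHandlerConfig(
            dataset_name=DatasetName(entry["dataset_name"]),
            dataset=entry["dataset"],
            normaliser=entry.get("normaliser"),
        )
        # Every key of an entry is fixed, so a name clash is caught explicitly here.
        if handler.dataset_name in handlers:
            raise ValueError(f"duplicate dataset_name: {handler.dataset_name!r}")
        handlers[handler.dataset_name] = handler
    return handlers


def check_references(names: list[str], handlers: dict[DatasetName, DataHandlerConfig]) -> None:
    """Fail early if e.g. `model.encoder_input_names` mentions an undeclared dataset."""
    unknown = [name for name in names if name not in handlers]
    if unknown:
        raise ValueError(f"unknown dataset names: {unknown}")


# Placeholder paths, standing in for the interpolated ${hardware.files.*} values.
entries = [
    {"dataset_name": "lowres", "dataset": "/path/to/lowres.zarr"},
    {"dataset_name": "hires", "dataset": "/path/to/hires.zarr"},
]
handlers = parse_data_handlers(entries)
check_references(["lowres"], handlers)  # passes silently
```

With user-defined keys, the schema has to accept arbitrary field names at the `data_handlers` level; with the list form, every field name is known in advance and only the values are free.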
Hi Marek, I agree that having some keys user-defined and others constrained may be confusing, and that this would make schemas more useful. However, I don't think the change would be consistent with the rest of the config: we currently use user-defined keys for nodes, node attributes, edge attributes, scalers, and validation metrics.
That said, it may be a good opportunity to change all of these.
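If all of these sections were changed, a small compatibility shim could accept the current mapping layout and convert it to the named-list layout during a transition period. The sketch below is an assumption about how that might look, not existing anemoi code; `mapping_to_named_list` and the `name_key` argument are hypothetical.

```python
from __future__ import annotations

from typing import Any


def mapping_to_named_list(section: dict[str, dict[str, Any]], name_key: str) -> list[dict[str, Any]]:
    """Turn {"lowres": {...}} into [{name_key: "lowres", ...}], preserving order."""
    named = []
    for name, body in section.items():
        # Refuse to silently overwrite a field that already uses the chosen key.
        if name_key in body:
            raise ValueError(f"{name!r} already defines {name_key!r}")
        named.append({name_key: name, **body})
    return named


# Current layout: user-defined names are the keys.
old_layout = {
    "lowres": {"dataset": "${hardware.files.lowres_dataset}"},
    "hires": {"dataset": "${hardware.files.hires_dataset}"},
}
new_layout = mapping_to_named_list(old_layout, name_key="dataset_name")
```

The same helper would apply to nodes, scalers, or validation metrics by passing a different `name_key`.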
training/src/anemoi/training/train/tasks/forecaster_multiple_datasets.py
Co-authored-by: Marek Jacob <MeraX@users.noreply.github.com>
…wf/anemoi-core into refactor/multiple-datasets-4
Description
Initial implementation of support for multiple datasets and downscaling functionality.
This is a draft PR to facilitate early feedback and testing. The core functionality is in place, but several aspects remain unsupported or unoptimized. All 3 use cases have been tested using a single GPU.
NOTE: To test this branch, you will need to check out the feature/missing-features-for-observations branch in the anemoi-datasets repository.
Use cases:
Run training via:
Not supported
Configuration
data/data_handlers: Specify dataset sources, along with associated normalisers (optional) and imputers (optional).
model/sample: Define which dataset is used as input/target, and configure:
Big thanks to @floriankrb and @VeraChristina for their work on this PR, and to the many others who provided valuable input and feedback.
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.