
feat(models,training): multiple datasets (WIP) #441


Draft · wants to merge 187 commits into base: main

Conversation

@JPXKQX (Member) commented Jul 29, 2025

Description

Initial implementation of support for multiple datasets and downscaling functionality.

This is a draft PR to facilitate early feedback and testing. The core functionality is in place, but several aspects remain unsupported or unoptimized. All three use cases have been tested on a single GPU.

NOTE: To test this branch, you will need to check out the feature/missing-features-for-observations branch in the anemoi-datasets repository.

Use cases:

  • Multiple datasets support: Enables training with distinct datasets (e.g., LAM and global), each potentially having:
    • Different sets of variables
    • Different temporal resolutions or time steps
  • Downscaling: Initial downscaling logic added
  • Autoencoder: Initial autoencoder logic added

Run training via:

anemoi-training train --config-name=debug_downscaling
anemoi-training train --config-name=debug_regional
anemoi-training train --config-name=debug_autoencoder

Not supported

  • No performance (speed and memory) optimization
  • No checkpointing
  • No rollout
  • No diagnostics
  • No model sharding
  • No CRPS
  • No time interpolation

Configuration

  • data/data_handlers: Specify dataset sources, along with associated normalisers (optional) and imputers (optional).
data_handlers:
  lowres: # user-defined key
    dataset: ${hardware.files.lowres_dataset}
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
  hires: # user-defined key
    dataset: ${hardware.files.hires_dataset}
    normaliser: 
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
  • model/sample: Define which dataset is used as input/target, and configure:
    • Variables per source
    • Temporal offsets per source
sample:
  input:
    lowres:
      variables:
      - "cos_latitude"
      - "cos_longitude"
      - "sin_latitude"
      - "sin_longitude"
      - "lsm"
      - "z"
      - "10u"
      - "10v"
      - "2t"
      - "2d"
      offset: ["0h"]
    hires:
      variables:
      - "cos_latitude"
      - "cos_longitude"
      - "sin_latitude"
      - "sin_longitude"
      - "orog"
      - "lsm"
      offset: ["0h"]
  target:
    hires:
      variables:
      - "10u"
      - "10v"
      - "2t"
      - "2d"
      - "q_100"
      - "tp"
      offset: ["0h"]
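To make the configuration above concrete, here is a minimal, self-contained sketch of the two mechanisms it relies on: Hydra-style `_target_` instantiation and per-source variable/offset selection. Everything below is illustrative and not part of the anemoi codebase: `instantiate` and `select_sample` are hypothetical helpers, the toy data is made up, and the offset semantics ("0h" means zero steps relative to the sample time) is an assumption; the real implementation uses Hydra and anemoi-datasets.

```python
import importlib


def instantiate(cfg: dict):
    """Hydra-style resolution: import the class named by `_target_`
    and call it with the remaining keys as keyword arguments."""
    module_path, _, name = cfg["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), name)
    return cls(**{k: v for k, v in cfg.items() if k != "_target_"})


def select_sample(source: dict, variables: list, offsets: list, t: int) -> dict:
    """Toy version of the sample spec: pick the listed variables from one
    source at time t plus each offset (offsets like "0h" -> 0 steps)."""
    steps = [int(o.rstrip("h")) for o in offsets]
    return {v: [source[v][t + s] for s in steps] for v in variables}


# `_target_` demo with a stdlib class, so the sketch runs without anemoi.
frac = instantiate({"_target_": "fractions.Fraction",
                    "numerator": 3, "denominator": 4})
print(frac)  # 3/4

# Selection demo: a toy "lowres" source with two variables over 4 time steps.
lowres = {"2t": [280, 281, 282, 283], "lsm": [1, 1, 0, 0]}
print(select_sample(lowres, ["2t", "lsm"], ["0h"], t=2))
# {'2t': [282], 'lsm': [0]}
```

In this sketch each data handler's `normaliser` entry would be passed to `instantiate`, while the `sample.input` / `sample.target` sections map onto `select_sample` calls, one per source.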

Big thanks to @floriankrb and @VeraChristina for their work on this PR, and to the many others who provided valuable input and feedback.

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines, please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

@mchantry mchantry added the ATS Approval Needed Approval needed by ATS label Jul 30, 2025
@MeraX (Contributor) left a comment

I'm very much looking forward to the features added in this PR! I've left a suggestion for improved accessibility and a minor comment on typing.

Comment on lines +8 to +21

data_handlers:
  lowres:
    dataset: ${hardware.files.lowres_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}

  hires:
    dataset: ${hardware.files.hires_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
Contributor:

This is a note that I took over from the discussion on this work we had last week.

Suggested change

Current:
data_handlers:
  lowres:
    dataset: ${hardware.files.lowres_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
  hires:
    dataset: ${hardware.files.hires_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}

Proposed:
data_handlers:
  - dataset_name: lowres
    dataset: ${hardware.files.lowres_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
  - dataset_name: hires
    dataset: ${hardware.files.hires_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}

I would recommend avoiding user-defined names as keys in the config. I believe it makes defining schemas easier, avoids name clashes, and is generally more consistent with the definition and naming of graphs. Also, giving the key an explicit name makes it easier to look up in the docs.

In general, I would also recommend introducing a kind of "type system" for names in the config. E.g. the string "lowres" in this example is a "dataset_name" (or similar), which could then be referenced and used as model.encoder_input_names, but which is perhaps different from what "lowres" denotes in training/src/anemoi/training/config/graph/downscaling.yaml. Separating the different kinds of name strings more clearly could also help resolve inconsistencies between referencing things by their string name and by a path like dataset: ${data.data_handlers.lowres.dataset}.
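The accessibility argument can be illustrated with a small sketch. `index_by_name` is a hypothetical helper (not project code), and the `.zarr` filenames are made up: with explicit `dataset_name` fields, building a name-to-entry lookup gives one place where duplicate names are caught, whereas user-defined mapping keys can be silently overwritten on merge.

```python
def index_by_name(entries: list) -> dict:
    """Build a name -> entry lookup from a list of named entries,
    rejecting duplicate names explicitly (one benefit of an explicit
    `dataset_name` field over user-defined mapping keys)."""
    index = {}
    for entry in entries:
        name = entry["dataset_name"]
        if name in index:
            raise ValueError(f"duplicate dataset_name: {name!r}")
        index[name] = entry
    return index


# Illustrative handler list in the proposed list-of-entries shape.
handlers = [
    {"dataset_name": "lowres", "dataset": "lowres.zarr"},
    {"dataset_name": "hires", "dataset": "hires.zarr"},
]
print(index_by_name(handlers)["hires"]["dataset"])  # hires.zarr
```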

@JPXKQX (Member, Author):

Hi Marek, I agree that having some keys user-defined and others constrained may be confusing, and that this would make schemas more useful. However, I don't think it would be consistent. Currently, we have this setup for nodes, node attributes, edge attributes, scalers, and validation metrics.

That said, it may be a good opportunity to change all of these.

Status: Now In Progress
7 participants