
feat(models,training): multiple datasets (WIP) #441


Draft · wants to merge 187 commits into base: main

Conversation

@JPXKQX (Member) commented Jul 29, 2025

Description

Initial implementation of support for multiple datasets and downscaling functionality.

This is a draft PR to facilitate early feedback and testing. The core functionality is in place, but several aspects remain unsupported or unoptimized. All three use cases have been tested on a single GPU.

NOTE: To test this branch, you will need to check out the feature/missing-features-for-observations branch in the anemoi-datasets repository.

Use cases:

  • Multiple datasets support: Enables training with distinct datasets (e.g., LAM and global), each potentially having:
    • Different sets of variables
    • Different temporal resolutions or time steps
  • Downscaling: Initial downscaling logic added
  • Autoencoder: Initial autoencoder logic added

Run training via:

anemoi-training train --config-name=debug_downscaling
anemoi-training train --config-name=debug_regional
anemoi-training train --config-name=debug_autoencoder

Not supported

  • No performance (speed and memory) optimization
  • No checkpointing
  • No rollout
  • No diagnostics
  • No model sharding
  • No CRPS
  • No time interpolation

Configuration

  • data/data_handlers: Specify dataset sources, along with associated normalisers (optional) and imputers (optional).
data_handlers:
  lowres: # user-defined key
    dataset: ${hardware.files.lowres_dataset}
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
  hires: # user-defined key
    dataset: ${hardware.files.hires_dataset}
    normaliser: 
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
  • model/sample: Define which dataset is used as input/target, and configure:
    • Variables per source
    • Temporal offsets per source
sample:
  input:
    lowres:
      variables:
      - "cos_latitude"
      - "cos_longitude"
      - "sin_latitude"
      - "sin_longitude"
      - "lsm"
      - "z"
      - "10u"
      - "10v"
      - "2t"
      - "2d"
      offset: ["0h"]
    hires:
      variables:
      - "cos_latitude"
      - "cos_longitude"
      - "sin_latitude"
      - "sin_longitude"
      - "orog"
      - "lsm"
      offset: ["0h"]
  target:
    hires:
      variables:
      - "10u"
      - "10v"
      - "2t"
      - "2d"
      - "q_100"
      - "tp"
      offset: ["0h"]
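To make the configuration above concrete, here is a minimal, self-contained sketch of the two mechanisms it relies on: Hydra-style `_target_` instantiation and per-source variable/offset selection. Everything below is illustrative and not part of the anemoi codebase: `instantiate` and `select_sample` are hypothetical helpers, the toy data is made up, and the offset semantics ("0h" means zero steps relative to the sample time) is an assumption; the real implementation uses Hydra and anemoi-datasets.

```python
import importlib


def instantiate(cfg: dict):
    """Hydra-style resolution: import the class named by `_target_`
    and call it with the remaining keys as keyword arguments."""
    module_path, _, name = cfg["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), name)
    return cls(**{k: v for k, v in cfg.items() if k != "_target_"})


def select_sample(source: dict, variables: list, offsets: list, t: int) -> dict:
    """Toy version of the sample spec: pick the listed variables from one
    source at time t plus each offset (offsets like "0h" -> 0 steps)."""
    steps = [int(o.rstrip("h")) for o in offsets]
    return {v: [source[v][t + s] for s in steps] for v in variables}


# `_target_` demo with a stdlib class, so the sketch runs without anemoi.
frac = instantiate({"_target_": "fractions.Fraction",
                    "numerator": 3, "denominator": 4})
print(frac)  # 3/4

# Selection demo: a toy "lowres" source with two variables over 4 time steps.
lowres = {"2t": [280, 281, 282, 283], "lsm": [1, 1, 0, 0]}
print(select_sample(lowres, ["2t", "lsm"], ["0h"], t=2))
# {'2t': [282], 'lsm': [0]}
```

In this sketch each data handler's `normaliser` entry would be passed to `instantiate`, while the `sample.input` / `sample.target` sections map onto `select_sample` calls, one per source.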

Big thanks to @floriankrb and @VeraChristina for their work on this PR, and to the many others who provided valuable input and feedback.

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines, please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

@mchantry mchantry added the ATS Approval Needed Approval needed by ATS label Jul 30, 2025
@MeraX (Contributor) left a comment

I'm very much looking forward to the features added in this PR! I've left a suggestion for improved accessibility and a minor comment on typing.

Comment on lines +8 to +21

data_handlers:
  lowres:
    dataset: ${hardware.files.lowres_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}

  hires:
    dataset: ${hardware.files.hires_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
Contributor:

This is a note that I took over from the discussion on this work we had last week.

Suggested change

Current:
data_handlers:
  lowres:
    dataset: ${hardware.files.lowres_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
  hires:
    dataset: ${hardware.files.hires_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}

Proposed:
data_handlers:
  - dataset_name: lowres
    dataset: ${hardware.files.lowres_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}
  - dataset_name: hires
    dataset: ${hardware.files.hires_dataset}
    # processors including imputers and normalizers are applied in order of definition
    normaliser:
      _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
      config: ${data.normalizer}

I would recommend avoiding user-defined names as keys in the config. I believe it makes defining schemas easier, avoids name clashes, and is generally more consistent with the definition and naming of graphs. Also, giving the key an explicit name makes it easier to look up in the docs.

In general, I would also recommend introducing a kind of "type system" for names in the config. E.g. the string "lowres" in this example is a "dataset_name" (or similar), which could then be referenced and used as model.encoder_input_names, but which is perhaps different from what "lowres" denotes in training/src/anemoi/training/config/graph/downscaling.yaml. Separating the different kinds of name strings more clearly could also help resolve inconsistencies between referencing things by their string name and by a path like dataset: ${data.data_handlers.lowres.dataset}.
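The accessibility argument can be illustrated with a small sketch. `index_by_name` is a hypothetical helper (not project code), and the `.zarr` filenames are made up: with explicit `dataset_name` fields, building a name-to-entry lookup gives one place where duplicate names are caught, whereas user-defined mapping keys can be silently overwritten on merge.

```python
def index_by_name(entries: list) -> dict:
    """Build a name -> entry lookup from a list of named entries,
    rejecting duplicate names explicitly (one benefit of an explicit
    `dataset_name` field over user-defined mapping keys)."""
    index = {}
    for entry in entries:
        name = entry["dataset_name"]
        if name in index:
            raise ValueError(f"duplicate dataset_name: {name!r}")
        index[name] = entry
    return index


# Illustrative handler list in the proposed list-of-entries shape.
handlers = [
    {"dataset_name": "lowres", "dataset": "lowres.zarr"},
    {"dataset_name": "hires", "dataset": "hires.zarr"},
]
print(index_by_name(handlers)["hires"]["dataset"])  # hires.zarr
```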

@JPXKQX (Member, Author):

Hi Marek, I agree that having some keys user-defined and others constrained may be confusing, and that this would make schemas more useful. However, I don't think it would be consistent. Currently, we have this setup for nodes, node attributes, edge attributes, scalers, and validation metrics.

That said, it may be a good opportunity to change all of these.

Status: Now In Progress
7 participants