Clessig/develop/fix config 1534 #1541

clessig · 2025-12-30T22:28:54Z

Description

Revise the config to a nested dict, and simplify the code where possible in places that had to change anyway.

This PR also enables a more flexible combination of different loss terms, e.g. a physical-space loss and a latent loss, as demonstrated in the default config. It also decouples training, validation, and test as much as possible, so that each can have a different objective.
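
A minimal sketch of how a nested config could combine weighted loss terms separately for training and validation. The key names below (`losses`, `weight`, `terms`, `LossLatent`) and the helper `combine_losses` are hypothetical illustrations, not the actual schema or code of this PR; only `training_config`, `validation_config`, and `LossPhysical` appear in the PR itself.

```python
# Hypothetical nested config: key names are illustrative assumptions,
# not the project's actual schema.
config = {
    "training_config": {
        "batch_size": 2,
        "losses": {
            "LossPhysical": {"weight": 1.0, "terms": ["mse"]},
            "LossLatent": {"weight": 0.5, "terms": ["mse"]},
        },
    },
    "validation_config": {
        "batch_size": 1,
        "losses": {
            # Validation uses a different objective than training.
            "LossPhysical": {"weight": 1.0, "terms": ["mse", "mae"]},
        },
    },
}


def combine_losses(mode_cfg: dict, loss_values: dict) -> float:
    """Weighted sum over the loss terms enabled for one mode (hypothetical helper)."""
    total = 0.0
    for name, spec in mode_cfg["losses"].items():
        total += spec["weight"] * loss_values[name]
    return total


# Made-up per-loss values, used only to exercise the sketch.
loss_values = {"LossPhysical": 0.2, "LossLatent": 0.1}
train_total = combine_losses(config["training_config"], loss_values)
val_total = combine_losses(config["validation_config"], loss_values)
```

Because each mode carries its own `losses` sub-dict, the training objective can include the latent term while validation reports only the physical one.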

Issue Number

Closes #1534
Closes #1535

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

clessig · 2025-12-31T07:30:04Z

forecast, ERA5 + NPP-ATMS + SYNOP

num_samples/batch_size = 1: 520709 : g479xfk6

0: LossPhysical.ERA5.mse.avg : 9.4735E-02
0: LossPhysical.ERA5.mae.avg : 1.9303E-01
0: LossPhysical.NPPATMS.mse.avg : 1.1044E-02
0: LossPhysical.NPPATMS.mae.avg : 2.9273E-02
0: LossPhysical.SurfaceCombined.mse.avg : 2.6745E-01
0: LossPhysical.SurfaceCombined.mae.avg : 2.8849E-01
0: LossPhysical.loss_avg : 1.8611E-01

num_samples/batch_size = 2: 520711 : ypix0nr7

0: LossPhysical.ERA5.mse.avg : 1.1328E-01
0: LossPhysical.ERA5.mae.avg : 2.1473E-01
0: LossPhysical.NPPATMS.mse.avg : 1.8830E-02
0: LossPhysical.NPPATMS.mae.avg : 4.5836E-02
0: LossPhysical.SurfaceCombined.mse.avg : 2.9414E-01
0: LossPhysical.SurfaceCombined.mae.avg : 3.0358E-01
0: LossPhysical.loss_avg : 2.0496E-01

clessig added 21 commits December 30, 2025 23:14

Partially revised config; model is still missing but proper setup of …

b98d074

…training_config and validation_config

Changes necessary due to changed position of time keys and of run_id

bb12d5a

Handling of multiple loss terms / target_aux_calculators and non-Loss…

d510956

…Physical ones.

Changed position of run_id in config

793578e

Add function to extract batch size from mode_cfg

6eabd27

Changed position of run_id in config

d350b38

Changes due to revised config. Also proper handling of target_aux_cal…

78b17af

…culator and various other details cleaned up

Revised config structure, in particular for losses, and related changes

d8a1291

Add missing copyright and minor changes to to_device()

0d4e471

Moved sanity checking from trainer here. Also learning_rate sub_part …

b99b5c9

…of config is passed to LRScheduler, which leads to major simplifications

Minor cleanups

9ef940b

Changes due to changed structure of losses in config

868e595

Changes due to changed structure of losses in config

f005ef0

Minor changes due to changed position of run_id in config

7b1d189

Minor changes to accommodate new config, in particular target_aux_calc…

53eb0d0

…ulator config

Support batch_size > 1. Clean up of various smaller parts

cdbb696

Clean up and implementation for batch_size > 1.

0b99f3e

Fix to sharding problem with FSDP2

7d1226f

Removed scatter offset computation which now happens on the fly in th…

0ca381d

…e model

Changes for revised config, simplify overall where possible

4d67ad2

Fix issues with source-target sample generation and matching. Work in…

66c83a2

… progress

github-project-automation bot added this to WeatherGen-dev Dec 30, 2025

github-actions bot added the infra (Issues related to infrastructure) and model (Related to model training or definition, not generic infra) labels Dec 30, 2025

clessig added 6 commits December 30, 2025 23:29

Linting

fff5749

Linting

f28874b

Linting

192930a

Linting

32243f3

Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into cles…

0d11f87

…sig/develop/fix_config_1534

Type hint

0b900d3

Linting

132a2be

clessig added 3 commits December 31, 2025 11:23

Linting

070e859

Linting

39d02ef

Renamed loss keys for consistency

6413c0f

clessig moved this to In Progress in WeatherGen-dev Dec 31, 2025

This was referenced Jan 1, 2026

small refactoring for EmbeddingEngine #1542

Open

New config #1342

Closed
