
Conversation

@kctezcan
Contributor

Description

This branch (mk/develop/fe_experiments) has been used to run all experiments for improving forecast skill. At this stage it mostly includes i) changes to the default config, ii) the addition of layer norm, and iii) plotting of weight/gradient norms.
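The gradient-norm plotting mentioned in iii) boils down to collecting per-parameter L2 norms after the backward pass. A minimal sketch of that idea, assuming PyTorch and an illustrative helper name (this is not the branch's actual implementation):

import torch

def collect_grad_norms(model: torch.nn.Module) -> dict[str, float]:
    # L2 norm of each parameter's gradient, collected after loss.backward().
    norms = {
        name: param.grad.detach().norm(2).item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }
    # Single global norm, e.g. for one curve per training step.
    norms["total"] = sum(v * v for v in norms.values()) ** 0.5
    return norms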

Issue Number

Closes #1470

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

sophie-xhonneux and others added 30 commits August 6, 2025 12:24
Committer: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>

On branch mk/develop/fe_experiments
Your branch is ahead of 'origin/mk/develop/fe_experiments' by 57 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
  modified:   config/streams/era5_1deg/era5.yml
On branch mk/develop/fe_experiments
Your branch is ahead of 'origin/mk/develop/fe_experiments' by 58 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
	modified:   config/default_config.yml

is_model_sharded = self.cf.with_ddp and self.cf.with_fsdp
if is_model_sharded:
    params = self.model.rename_old_state_dict(params=params)  # For backward compatibility
Collaborator

Can we remove this?
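For context, such a helper usually just maps old checkpoint parameter names to the current ones. A purely hypothetical sketch of the idea (the actual rename_old_state_dict and its key mapping live in the model code and may differ):

def rename_old_state_dict(params: dict) -> dict:
    # Hypothetical mapping: assume old checkpoints used an "encoder." prefix
    # that was later renamed to "ae_local." (illustration only).
    renames = {"encoder.": "ae_local."}
    renamed = {}
    for key, value in params.items():
        for old, new in renames.items():
            if key.startswith(old):
                key = new + key[len(old):]
        renamed[key] = value
    return renamed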


ae_local_dim_embed: 1024
ae_local_num_blocks: 2
ae_local_num_blocks: 0
Collaborator

Can we keep the default_config functionally identical and only add the new options, but turned off?

Contributor

@kctezcan just suggested to have a separate dev_forecast_config.yml. This sounds clean and I recall you also mentioned this, @clessig.

@@ -0,0 +1,525 @@
import json
Collaborator

Why does this file appear here? Didn't we already merge the grad_norm PR into develop?

Contributor

@MatKbauer left a comment

Added some comments on things to modify before we can merge. Thanks @kctezcan!

Contributor

Not sure whether we need this in develop. I'd leave it out.


ae_local_dim_embed: 1024
ae_local_num_blocks: 2
ae_local_num_blocks: 0
Contributor

@kctezcan just suggested to have a separate dev_forecast_config.yml. This sounds clean and I recall you also mentioned this, @clessig.

Contributor

Should also be removed.

fe_dropout_rate: 0.1
fe_with_qk_lnorm: True
fe_layer_norm_after_blocks: [] # Index starts at 0. Thus, [3] adds a LayerNorm after the fourth layer
impute_latent_noise_std: 0.0 # 1e-4
Contributor

Let's eventually rename impute_latent_noise_std to fe_impute_latent_noise_std. This has been on my list for a while.
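To make the fe_layer_norm_after_blocks semantics from the hunk above concrete: the entries are 0-based block indices, and a LayerNorm is applied to the output of each listed block. A minimal, hypothetical sketch (the real forecast-engine blocks are transformer blocks, not the Linear placeholders used here):

import torch
from torch import nn

class ForecastEngineSketch(nn.Module):
    def __init__(self, dim: int, num_blocks: int, layer_norm_after_blocks: list[int]):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))
        # fe_layer_norm_after_blocks=[3] would add a LayerNorm after the fourth block.
        self.norms = nn.ModuleDict({str(i): nn.LayerNorm(dim) for i in layer_norm_after_blocks})

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            x = block(x)
            if str(i) in self.norms:
                x = self.norms[str(i)](x)
        return x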

impute_latent_noise_std: 0.0 # 1e-4

healpix_level: 5
healpix_level: 4
Contributor

I think healpix_level: 5 is what we should have in the forecast config. It changed back and forth recently, but 5 turns out to be better.
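For reference, a HEALPix grid at level L has 12 * 4**L cells, so level 5 quadruples the number of latent mesh cells relative to level 4:

for level in (4, 5):
    print(f"healpix_level {level}: {12 * 4**level} cells")  # 3072 vs. 12288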

log_grad_norms: False

start_date: 197901010000
end_date: 202012310000
Contributor

Change to end_date: 202212310000; I thought I had modified this but maybe forgot to push.
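The start_date/end_date values look like 12-digit YYYYMMDDHHMM stamps (hence the "added 00 as minutes" commit in the list below); a quick sanity check, assuming that format:

from datetime import datetime

end_date = datetime.strptime("202212310000", "%Y%m%d%H%M")
print(end_date)  # 2022-12-31 00:00:00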

@clessig
Collaborator

clessig commented Dec 17, 2025

Yes, let's have a second config for forecasting. Thanks for the reminder.


# TODO: remove backwards compatibility to "epoch" in Feb. 2026
self.mini_epoch = getattr(eval_cfg, "mini_epoch", getattr(eval_cfg, "epoch", -1))
self.mini_epoch = getattr(eval_cfg, "mini_epoch", eval_cfg["epoch"])
Collaborator

.get please
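That is, assuming eval_cfg behaves like a dict (or an OmegaConf config) exposing .get(), the backward-compatible lookup would read roughly:

# TODO: remove backwards compatibility to "epoch" in Feb. 2026
self.mini_epoch = eval_cfg.get("mini_epoch", eval_cfg.get("epoch", -1))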


# Unweighted loss, real weighted loss, std for losses that need it
self.loss_unweighted_hist, self.loss_model_hist, self.stdev_unweighted_hist = [], [], []
self.last_grad_norm = 0.0
Collaborator

is this even used?

Collaborator

@clessig left a comment

Thanks!

@clessig merged commit 6b83e21 into ecmwf:develop Dec 17, 2025
5 checks passed
TillHae pushed a commit to TillHae/WeatherGenerator that referenced this pull request Dec 25, 2025
* Log gradient norms

* Prototype for recording grad norms

* Address review changes + hide behind feature flag

* Final fixes including backward compatibility

* Ruff

* More ruff stuff

* Update to develop, prepare for new experiment series

* forecast config with small decoder

* fixed uv.lock

* test gradient logging on multi gpus

* Setting o48 as default in era5 config

Committer: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>

On branch mk/develop/fe_experiments
Your branch is ahead of 'origin/mk/develop/fe_experiments' by 57 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
  modified:   config/streams/era5_1deg/era5.yml

* Updated default config to 256 dim latent size

On branch mk/develop/fe_experiments
Your branch is ahead of 'origin/mk/develop/fe_experiments' by 58 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
	modified:   config/default_config.yml

* Update branch to latest develop

* Change epochs from 64 to 32

* LayerNorm replication and analysis tools

* Rename fe_layer_norm_at_layers to fe_layer_norm_after_blocks

* Increase epochs from 32 to 64 and resolve minor bug

* Update default_config back to d2048 on the O96 grid

* Update ERA5 stream to O96 grid

* Resolving bug after merging with develop and updating default_config

* Enable loading old model checkpoints after recent merges

* Update WeatherGenReader with mini-epoch notation

* Minor modifications to latent histogram plotting

* Resolve bug in histogram plotting

* Replace getattr by cf.get

* Change target read-out engine from 1 to 2 layers

* Set aux-info for fe-blocks to none

* fix a plotting bug (ecmwf#1453)

* Update train/val dates, HL=5, fsteps=2, lat-weighting

* removed plotting latent histograms

* modified configs

* removed the eval and train plot configs

* added 00 as minutes

* lint

* added fc config + renamed to fe_impute_latent_noise_std

* lint

* removed parameter renaming for backward compatibility

* removed weight_progression and plot_grad files

* corrected end_date

* using .get()

---------

Co-authored-by: sophiex <24638638+sophie-xhonneux@users.noreply.github.com>
Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>
Co-authored-by: Jubeku <julian.kuehnert@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.b.kuehnert@gmail.com>
Co-authored-by: Matthias Karlbauer <mkarlbau@santis-ln002.cscs.ch>
Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>