
Conversation

@kctezcan
Contributor

Description

This branch (mk/develop/fe_experiments) has been used to run all experiments for improving forecast skill. At this stage it mostly includes i) changes to the default config, ii) the addition of layer norm, and iii) plotting of weight/gradient norms.
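The gradient-norm plotting mentioned in iii) boils down to collecting per-parameter L2 norms after the backward pass. A minimal sketch of that idea, assuming PyTorch and an illustrative helper name (this is not the branch's actual implementation):

import torch

def collect_grad_norms(model: torch.nn.Module) -> dict[str, float]:
    # L2 norm of each parameter's gradient, collected after loss.backward().
    norms = {
        name: param.grad.detach().norm(2).item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }
    # Single global norm, e.g. for one curve per training step.
    norms["total"] = sum(v * v for v in norms.values()) ** 0.5
    return norms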

Issue Number

Closes #1470

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

sophie-xhonneux and others added 30 commits August 6, 2025 12:24
Committer: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>

On branch mk/develop/fe_experiments
Your branch is ahead of 'origin/mk/develop/fe_experiments' by 57 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
  modified:   config/streams/era5_1deg/era5.yml
On branch mk/develop/fe_experiments
Your branch is ahead of 'origin/mk/develop/fe_experiments' by 58 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
	modified:   config/default_config.yml

is_model_sharded = self.cf.with_ddp and self.cf.with_fsdp
if is_model_sharded:
    params = self.model.rename_old_state_dict(params=params)  # For backward compatibility
Collaborator

Can we remove this?
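For context, such a helper usually just maps old checkpoint parameter names to the current ones. A purely hypothetical sketch of the idea (the actual rename_old_state_dict and its key mapping live in the model code and may differ):

def rename_old_state_dict(params: dict) -> dict:
    # Hypothetical mapping: assume old checkpoints used an "encoder." prefix
    # that was later renamed to "ae_local." (illustration only).
    renames = {"encoder.": "ae_local."}
    renamed = {}
    for key, value in params.items():
        for old, new in renames.items():
            if key.startswith(old):
                key = new + key[len(old):]
        renamed[key] = value
    return renamed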


ae_local_dim_embed: 1024
ae_local_num_blocks: 2
ae_local_num_blocks: 0
Collaborator

Can we keep the default_config functionally identical and only add the new options, but turned off?

Contributor

@kctezcan just suggested to have a separate dev_forecast_config.yml. This sounds clean and I recall you also mentioned this, @clessig.

@@ -0,0 +1,525 @@
import json
Collaborator

Why does this file appear here? Didn't we already merge the grad_norm PR into develop?

Contributor

@MatKbauer left a comment

Added some comments on things to modify before we can merge. Thanks @kctezcan!

Contributor

Not sure whether we need this in develop. I'd leave it out.


ae_local_dim_embed: 1024
ae_local_num_blocks: 2
ae_local_num_blocks: 0
Contributor

@kctezcan just suggested to have a separate dev_forecast_config.yml. This sounds clean and I recall you also mentioned this, @clessig.

Contributor

Should also be removed.

fe_dropout_rate: 0.1
fe_with_qk_lnorm: True
fe_layer_norm_after_blocks: [] # Index starts at 0. Thus, [3] adds a LayerNorm after the fourth layer
impute_latent_noise_std: 0.0 # 1e-4
Contributor

Let's eventually rename impute_latent_noise_std to fe_impute_latent_noise_std. This has been on my list for a while.
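To make the fe_layer_norm_after_blocks semantics from the hunk above concrete: the entries are 0-based block indices, and a LayerNorm is applied to the output of each listed block. A minimal, hypothetical sketch (the real forecast-engine blocks are transformer blocks, not the Linear placeholders used here):

import torch
from torch import nn

class ForecastEngineSketch(nn.Module):
    def __init__(self, dim: int, num_blocks: int, layer_norm_after_blocks: list[int]):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))
        # fe_layer_norm_after_blocks=[3] would add a LayerNorm after the fourth block.
        self.norms = nn.ModuleDict({str(i): nn.LayerNorm(dim) for i in layer_norm_after_blocks})

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            x = block(x)
            if str(i) in self.norms:
                x = self.norms[str(i)](x)
        return x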

impute_latent_noise_std: 0.0 # 1e-4

healpix_level: 5
healpix_level: 4
Contributor

I think healpix_level: 5 is what we should have in the forecast config. It changed back and forth recently, but 5 turns out to be better.
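For reference, a HEALPix grid at level L has 12 * 4**L cells, so level 5 quadruples the number of latent mesh cells relative to level 4:

for level in (4, 5):
    print(f"healpix_level {level}: {12 * 4**level} cells")  # 3072 vs. 12288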

log_grad_norms: False

start_date: 197901010000
end_date: 202012310000
Contributor

Change to end_date: 202212310000; I thought I had modified this but maybe forgot to push.
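The start_date/end_date values look like 12-digit YYYYMMDDHHMM stamps (hence the "added 00 as minutes" commit in the list below); a quick sanity check, assuming that format:

from datetime import datetime

end_date = datetime.strptime("202212310000", "%Y%m%d%H%M")
print(end_date)  # 2022-12-31 00:00:00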

@clessig
Collaborator

clessig commented Dec 17, 2025

Yes, let's have a second config for forecasting. Thanks for the reminder.


# TODO: remove backwards compatibility to "epoch" in Feb. 2026
self.mini_epoch = getattr(eval_cfg, "mini_epoch", getattr(eval_cfg, "epoch", -1))
self.mini_epoch = getattr(eval_cfg, "mini_epoch", eval_cfg["epoch"])
Collaborator

.get please
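That is, assuming eval_cfg behaves like a dict (or an OmegaConf config) exposing .get(), the backward-compatible lookup would read roughly:

# TODO: remove backwards compatibility to "epoch" in Feb. 2026
self.mini_epoch = eval_cfg.get("mini_epoch", eval_cfg.get("epoch", -1))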


# Unweighted loss, real weighted loss, std for losses that need it
self.loss_unweighted_hist, self.loss_model_hist, self.stdev_unweighted_hist = [], [], []
self.last_grad_norm = 0.0
Collaborator

is this even used?

Collaborator

@clessig left a comment

Thanks!

@clessig merged commit 6b83e21 into ecmwf:develop Dec 17, 2025
5 checks passed
TillHae pushed a commit to TillHae/WeatherGenerator that referenced this pull request Dec 25, 2025
* Log gradient norms

* Prototype for recording grad norms

* Address review changes + hide behind feature flag

* Final fixes including backward compatibility

* Ruff

* More ruff stuff

* Update to develop, prepare for new experiment series

* forecast config with small decoder

* fixed uv.lock

* test gradient logging on multi gpus

* Setting o48 as default in era5 config

Committer: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>

On branch mk/develop/fe_experiments
Your branch is ahead of 'origin/mk/develop/fe_experiments' by 57 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
  modified:   config/streams/era5_1deg/era5.yml

* Updated default config to 256 dim latent size

On branch mk/develop/fe_experiments
Your branch is ahead of 'origin/mk/develop/fe_experiments' by 58 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
	modified:   config/default_config.yml

* Update branch to latest develop

* Change epochs from 64 to 32

* LayerNorm replication and analysis tools

* Rename fe_layer_norm_at_layers to fe_layer_norm_after_blocks

* Increase epochs from 32 to 64 and resolve minor bug

* Update default_config back to d2048 on the O96 grid

* Update ERA5 stream to O96 grid

* Resolving bug after merging with develop and updating default_config

* Enable loading old model checkpoints after recent merges

* Update WeatherGenReader with mini-epoch notation

* Minor modifications to latent histogram plotting

* Resolve bug in histogram plotting

* Replace getattr by cf.get

* Change target read-out engine from 1 to 2 layers

* Set aux-info for fe-blocks to none

* fix a plotting bug (ecmwf#1453)

* Update train/val dates, HL=5, fsteps=2, lat-weighting

* removed plotting latent histograms

* modified configs

* removed the eval and train plot configs

* added 00 as minutes

* lint

* added fc config + renamed to fe_impute_latent_noise_std

* lint

* removed parameter renaming for backward compatibility

* removed weight_progression and plot_grad files

* corrected end_date

* using .get()

---------

Co-authored-by: sophiex <24638638+sophie-xhonneux@users.noreply.github.com>
Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>
Co-authored-by: Jubeku <julian.kuehnert@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.b.kuehnert@gmail.com>
Co-authored-by: Matthias Karlbauer <mkarlbau@santis-ln002.cscs.ch>
Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>