Merging mk/fe_experiments into develop #1480
Conversation
src/weathergen/train/trainer.py
Outdated
is_model_sharded = self.cf.with_ddp and self.cf.with_fsdp
if is_model_sharded:
    params = self.model.rename_old_state_dict(params=params)  # For backward compatibility
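For context, a backward-compatibility shim like rename_old_state_dict typically maps old checkpoint key names onto the new ones before loading. A minimal, framework-free sketch (the mapping and key names here are illustrative, not the project's actual ones):

```python
# Sketch of a state-dict key rename for backward compatibility.
# RENAME_MAP and the key names are illustrative examples only.
RENAME_MAP = {"encoder.old_block": "encoder.ae_local_block"}

def rename_old_state_dict(params):
    """Return a copy of params with old key prefixes mapped to new ones."""
    renamed = {}
    for key, value in params.items():
        for old, new in RENAME_MAP.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        renamed[key] = value
    return renamed

old = {"encoder.old_block.weight": 1, "head.bias": 2}
print(rename_old_state_dict(old))
# -> {'encoder.ae_local_block.weight': 1, 'head.bias': 2}
```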
Can we remove this?
config/default_config.yml
Outdated
ae_local_dim_embed: 1024
- ae_local_num_blocks: 2
+ ae_local_num_blocks: 0
Can we keep the default_config functionally identical and only add the new options, but turned off?
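One way to honor that, sketched below with option names taken from this PR's diffs (values illustrative): keep the previous defaults and introduce the new knobs in their disabled state.

```yaml
# Sketch: previous defaults unchanged, new options present but inert
ae_local_num_blocks: 2              # previous default kept
fe_layer_norm_after_blocks: []      # new option, empty list = off
impute_latent_noise_std: 0.0        # new option, 0.0 = off
```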
@@ -0,0 +1,525 @@
import json
Why does it appear here? Didn't we already merge the grad_norm PR onto develop?
MatKbauer left a comment
Added some comments to be modified until we can merge. Thanks @kctezcan!
scripts/model_weight_progression.py
Outdated
Not sure whether we need this in develop. I'd leave it out.
config/default_config.yml
Outdated
ae_local_dim_embed: 1024
- ae_local_num_blocks: 2
+ ae_local_num_blocks: 0
Should also be removed.
config/default_config.yml
Outdated
fe_dropout_rate: 0.1
fe_with_qk_lnorm: True
fe_layer_norm_after_blocks: []  # Index starts at 0. Thus, [3] adds a LayerNorm after the fourth layer
impute_latent_noise_std: 0.0  # 1e-4
Let's eventually rename impute_latent_noise_std to fe_impute_latent_noise_std. I have had this on my list for a while.
config/default_config.yml
Outdated
impute_latent_noise_std: 0.0  # 1e-4
- healpix_level: 5
+ healpix_level: 4
I think healpix_level: 5 is what we should have in the forecast config. It changed back and forth recently, but 5 turns out to be better.
config/default_config.yml
Outdated
log_grad_norms: False
start_date: 197901010000
end_date: 202012310000
Change to end_date: 202212310000; I thought I had modified this but maybe forgot to push.
Yes, let's have a second config for the forecasting. Thanks for the reminder.
# TODO: remove backwards compatibility to "epoch" in Feb. 2026
- self.mini_epoch = getattr(eval_cfg, "mini_epoch", getattr(eval_cfg, "epoch", -1))
+ self.mini_epoch = getattr(eval_cfg, "mini_epoch", eval_cfg["epoch"])
.get please
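Sketched below with a plain dict standing in for eval_cfg (illustrative values): the fallback chain the reviewer is asking for, expressed via .get instead of nested getattr.

```python
# Sketch: with a mapping-style config, .get expresses the
# "mini_epoch, else epoch, else -1" fallback chain directly,
# whereas getattr only reads attributes, not dict keys.
eval_cfg = {"epoch": 7}  # old-style config without "mini_epoch"

mini_epoch = eval_cfg.get("mini_epoch", eval_cfg.get("epoch", -1))
print(mini_epoch)  # -> 7
```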
src/weathergen/train/trainer.py
Outdated
# Unweighted loss, real weighted loss, std for losses that need it
self.loss_unweighted_hist, self.loss_model_hist, self.stdev_unweighted_hist = [], [], []
self.last_grad_norm = 0.0
Is this even used?
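For context, the gradient-norm logging this PR adds typically reduces all per-parameter gradients to one global L2 norm per step. A framework-free sketch of that reduction (function name and data are illustrative):

```python
import math

# Sketch: global L2 gradient norm across parameter tensors, the
# quantity a grad-norm logger would record each optimizer step.
def global_grad_norm(grads):
    """grads: iterable of per-parameter gradient value lists."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))

grads = [[3.0, 4.0], [0.0]]  # two "parameters" with toy gradients
print(global_grad_norm(grads))  # -> 5.0
```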
clessig left a comment
Thanks!
* Log gradient norms
* Prototype for recording grad norms
* Address review changes + hide behind feature flag
* Final fixes including backward compatibility
* Ruff
* More ruff stuff
* Update to develop, prepare for new experiment series
* forecast config with small decoder
* fixed uv.lock
* test gradient logging on multi GPUs
* Setting o48 as default in era5 config
* Updated default config to 256 dim latent size
* Update branch to latest develop
* Change epochs from 64 to 32
* LayerNorm replication and analysis tools
* Rename fe_layer_norm_at_layers to fe_layer_norm_after_blocks
* Increase epochs from 32 to 64 and resolve minor bug
* Update default_config back to d2048 on the O96 grid
* Update ERA5 stream to O96 grid
* Resolving bug after merging with develop and updating default_config
* Enable loading old model checkpoints after recent merges
* Update WeatherGenReader with mini-epoch notation
* Minor modifications to latent histogram plotting
* Resolve bug in histogram plotting
* Replace getattr by cf.get
* Change target read-out engine from 1 to 2 layers
* Set aux-info for fe-blocks to none
* fix a plotting bug (ecmwf#1453)
* Update train/val dates, HL=5, fsteps=2, lat-weighting
* removed plotting latent histograms
* modified configs
* removed the eval and train plot configs
* added 00 as minutes
* lint
* added fc config + renamed to fe_impute_latent_noise_std
* lint
* removed parameter renaming for backward compatibility
* removed weight_progression and plot_grad files
* corrected end_date
* using .get()

Co-authored-by: sophiex <24638638+sophie-xhonneux@users.noreply.github.com>
Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>
Co-authored-by: Jubeku <julian.kuehnert@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.b.kuehnert@gmail.com>
Co-authored-by: Matthias Karlbauer <mkarlbau@santis-ln002.cscs.ch>
Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>
Description
This branch (mk/develop/fe_experiments) has been used for running all experiments aimed at improving forecast skill. At the current stage it mostly includes i) changes to the default config, ii) the addition of the layer norm, and iii) plotting of weight/gradient norms.
Issue Number
Closes #1470
Checklist before asking for review
- ./scripts/actions.sh lint
- ./scripts/actions.sh unit-test
- ./scripts/actions.sh integration-test
- launch-slurm.py --time 60