Conversation

@tjhunter tjhunter commented Jul 2, 2025

This reverts commit 989ab6e.

Description

Reverts #283

Closes #433

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

@tjhunter tjhunter requested a review from sophie-xhonneux July 2, 2025 14:02

@sophie-xhonneux sophie-xhonneux left a comment


Looks good to me

@sophie-xhonneux sophie-xhonneux merged commit 90cc144 into develop Jul 2, 2025
5 of 6 checks passed
enssow added a commit to enssow/WeatherGenerator that referenced this pull request Jul 3, 2025
Revert "Implement per-channel logging (ecmwf#283)" (ecmwf#434)
tjhunter added a commit that referenced this pull request Nov 18, 2025
* Revert "Implement per-channel logging (#283)" (#434)

This reverts commit 989ab6e1d6e8c0f69594414c7733adf30acd1c54.

* Fix FESOM datareader and int overflow  (#417)

* Fix indexing in DataReaderFesom

* Enforce using only int64 in data loading

* ruff

* ruff2

* Review

* Change int64 back to int32
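The int-overflow part of this fix is easiest to see with a small sketch: flat indices computed during data loading can exceed the 32-bit range and silently wrap. Here to_wrapped_int32 is a hypothetical stand-in for what a 32-bit dtype does to such a value (the actual fix toggled the integer dtype used in the data loader):

```python
def to_wrapped_int32(n: int) -> int:
    """Simulate storing a Python int in a 32-bit signed integer (wraparound)."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

# A flat index such as time_step * points_per_step easily leaves the int32 range:
flat_index = 40_000 * 100_000            # 4_000_000_000 > 2**31 - 1
assert to_wrapped_int32(flat_index) < 0  # silently becomes a negative index in int32
assert to_wrapped_int32(42) == 42        # small values are unaffected
```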

* changes (#462)

* Fix incorrect handling of empty window (which triggered problem in IO writing code). (#447)

* Update default_config.yml (#446)

analysis_streams_output is missing, which leads to error with val_initial=True and log_validation > 0.

* Re-enabled option to run plot_training as script and fixed -rf argument (#444)

* Re-enabled option to run plot_training as script and removed relative path as default from mutually-exclusive argument -rf.

* Ruffed code.

* Ruff check fix.

* Rename flags for parsing configuration and fixed default handling for standard config YAML-file.

* fix era5 config (#473)

Adding z back in

* [251] Merge new IO class (#469)

* Implement mock IO (#336)

* Adapt score class (#339)

* Implement mock IO

* Adapt score class

* Removing unused file (#349)

* remove database folder (#355)

* Small change - CI - pinning the version of formatting (#361)

* changes

* changes

* Update INSTALL.md

* Update INSTALL.md

* Fixed Exxx lint issues (#284)

* Rebased to the latest changes and linted new changes

* addressed review comments

* addressed review comments

* Linted the latest changes.

* corrected the formatting

* corrected the formatting

* configured ruff to use LF line endings in pyproject.toml

* [357] Sub-package for evaluation (#359)

* working

* changes

* removing deps from non-core project

* changes

* fixes

* comments

* Iluise quick fix stac (#374)

* remove database folder

* fix database

* Simplifying workflow for plot_training (#368)

* Simplifying workflow for plot_training

* Ruffed

* Working on implementing exclude_source

* Remove unused code

* Fixed ruff issue

* Fixing bug in lat handling (377) (#378)

* Fixing bug in lat handling

* Added comment

---------

Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* recover num_ranks from previous run to calculate epoch_base (#317)

* recover num_ranks from previous run to calculate epoch_base

* set email settings for commits

* addressing Tim's comment

* make ruff happy

* improve style

* changes (#385)

Linter rule so np.ndarray is not used as type

* changed the script name from evaluate to inference as it simply gener… (#376)

* changed the script name from evaluate to inference as it simply generate infer samples

* changed evaluate to inference in the main scripts and corresponding calls in the config

* update the main function for the inference script

* changed evaluate to inference also in docstring, unit test scripts, and integration test scripts

---------

Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>

* Introduce tuples instead of strings to avoid TypeError (#392)

* Exclude channels from src / target (#363)

* Exclude channels from src / target

* Simplified code and added comment that pattern matching is used

* Adding new stream config

* Fixing bug that led to error when accessing self.ds when dataset is empty

* Working on exclude_source

* work in progress

* Fixing incorrect formatting for logger (#388)

* Ruffed

* Refactored and cleaned up channel selection. Also added check that channels are not empty

* Cleaned channel parsing and selection

* Adjustments

* Removing asserts incompatible with empty dataset

---------

Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int>

* add embed_dropout_rate to config v1 (#358)

* [402] adds checks to the pull request (#403)

* changes

* mistake

* mistake

* mistake

* changes

* doc

* Introduce masking class and incorporate in TokenizerMasking (#383)

* creating masking class and adapting tokenizer_masking to use this class

* minor changes to masking.py and tokenizer_masking

* removed old tokenizer_masking

* include masking_strategy in default_config

* change ValueError to assert

* linting formatting changes files

* further linting of docstrings

* create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements

* linted masking, tokenizer_masking

* modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class

* remove check if all masked, not masked

* remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source

* update tokenizer utils with description of idx_ord_lens in comment

* remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked)

* adding masking_strategy: to config

* remove unused mentions of masking_combination

* removed comment about streams

* changed assert to check self perm_sel is not None

* ruff masking, tokenizer_masking

* Ruffed

* Added warning to capture corner case, likely due to incorrect user settings.

* Fixed incorrect call twice

* Fixed missing conditional for logger statement

* Required changes for better handling of rngs

* Improved handling of rngs

* Improved handling of rng

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Implement per-channel logging (#283)

* Fix bug with seed being divided by 0 for worker ID=0
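The seed bug noted above (a division by zero for worker ID 0) can be illustrated with a minimal sketch; worker_seed is a hypothetical helper, not the project's actual derivation:

```python
def worker_seed(base_seed: int, worker_id: int) -> int:
    """Derive a per-worker RNG seed.

    A derivation like `base_seed % worker_id` raises ZeroDivisionError for
    worker_id == 0; mixing with worker_id + 1 instead is well-defined for
    every worker and keeps the result in the 32-bit seed range.
    """
    return (base_seed * (worker_id + 1)) % 2**32

# Well-defined even for the first worker:
assert worker_seed(1234, 0) == 1234
```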

* Fix bug causing crash when secrets aren't in private config

* Implement logging losses per channel

* Fix issue with empty targets

* Rework loss logging

* ruff

* Remove computing max_channels

* Change variables names

* ruffed

* Remove redundant enumerations

* Use stages for logging

* Add type hints

* Apply the review

* ruff

* fix

* Fix type hints

* ruff

---------

Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>

* [346] Passing options through the slurm script (#400)

* changes

* fixes

* refactor `validation_io.write_validation` to make it more readable

* remove legacy code `validation_io.read_validation`

* encapsulate artifact path logic in config module

* remove redundant attribute `Trainer.path_run`

* use config to look up base_path in `write_validation`

* remove unused `write_validation` args: `base_path`, `rank`

* ensure correct type for paths

* remove streams initialization from `Trainer`

* remove path logic from `Trainer.save_model`

* simplify conditional

* rename mock io module

* update uv to include dask

* Implement io module to support reading/writing model output

* implement new validation_io routine

* use new write_validation routine

* remove unused code

* rename output routine to `write_output`

* ruffed and added comments

* fixed annotation

* use simple __init__ method for `OutputItem` instead of dataclasses magic

* address reviewers' comments

* rename method

* add simple docstrings

* ruffed

* typehint fixes

* refactor names

* update comments and typehints, don't import PyTorch

* remove `__post_init__` methods, cache properties

* fixes and integration test

* final fixes :)

* changes

* changes

* changes

* changes

* changes

* more work

* changes

* changes

* changes

* ruffed

* ruffed

* improve logging and comments

* Update to score-class according to internal discussions and feedback in PR.

* Add license header.

* Ruffed code.

* Update to score-class according to internal discussions and feedback in PR.

* Add license header.

* Ruffed code.

* Add doc-string to call-method and provide example usage for efficient graph-construction.

* Some fixes to score-class.

* Some fixes to handling aggregation dimension.

* Add missing import of MockIO.

* changes

* changes

* removing the scores

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

---------

Co-authored-by: Kacper Nowak <kacper.nowak@awi.de>
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: Sindhu-Vasireddy <98752594+Sindhu-Vasireddy@users.noreply.github.com>
Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>
Co-authored-by: Julian Kuehnert <Jubeku@users.noreply.github.com>
Co-authored-by: ankitpatnala <ankitpatnala@gmail.com>
Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>
Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>
Co-authored-by: Christian Lessig <christian.lessig@ecwmf.int>
Co-authored-by: Till Hauer <till@web-hauer.de>
Co-authored-by: Simon Grasse <s.grasse@fz-juelich.de>
Co-authored-by: Michael <m.langguth@fz-juelich.de>

* [459] Attempt to fix ruff differences (#463)

* changes

* debug

* changes

* changes

* Update pyproject.toml (#457)

* Continue training through slurm script (#395)

* train_continue via slurm

* using __main__ as entry point for slurm script

* reverting config files to match base branch

* reverting config files to match base branch

* removing param_sum control logging before and after loading of model weights

* run ruff

* check whether from_run_id is in arguments

* trigger PR check

* remove block to set reuse_run_id=True

---------

Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>

* added the .python_version file set to python 3.12 (#482)

Co-authored-by: Kerem Can Tezcan <ktezcan0@login07.leonardo.local>

* script (#489)

* Remove print statements for logging (#421) (#439)

* first change

* removed all prints

* changed model.py back

* adding comments and fixes

* added ruff fixes

* reverting files for PR

* ruff fixes

* removing run_id.py

* formatting changes

* changing comments in check_gh_issue script

---------

Co-authored-by: owens1 <owens1@jwlogin09.juwels>
Co-authored-by: Timothy Hunter <tim.hunter@ecmwf.int>

* Rename batchsize to batchsize_per_gpu (#475)

* Rename batchsize to batchsize_per_gpu

* Fix ruff stuff

* fix (#490)

* add polar orbiters and abi-goes to the stac database (#426)

* testing adding metopa and metopb as placeholder drafts to stac database

* added the actual json files because I think we have to

* updated metopa metopb jsons and ets

* add fy3 and update metops

* updated names of metops

* updated metopb untarred size inodes and end date

* update names to instrument, satellite

* add untarred data size and inodes for metopa

* updated to oscar naming, with format platform, instrument, and added fengyun satellites

* update size and inodes of fy3c mwhs

* add fengyun jsons, missing before, and update unique ids of metopa and b

* add processing_level field to metopa as a test

* adding processing level field

* fix up processing level

* updated jsons and jsonnets for provenance

* actually include provenance

* updated to include processor and provider, remove provenance

* add abi-goes

* fix abi goes geometry

* fix latitude and longitude

* fix typo

* hopefully this time lat is right..

* update catalogue json for develop

* check catalogue on this branch

* jsonneted for develop

---------

Co-authored-by: iluise <luise.ilaria@gmail.com>

* Added naming convention checks to lint (#501)

* Added naming convention checks to lint

* Implemented python naming conventions and corrected code accordingly

---------

Co-authored-by: Matthias Karlbauer <ecm1575@ac6-102.bullx>

* Correct the in-code-names for rotation matrices (#516)

* Added naming convention checks to lint

* Implemented python naming conventions and corrected code accordingly

* Corrected renaming of rotation matrices from R to rot instead of to r

---------

Co-authored-by: Matthias Karlbauer <ecm1575@ac6-102.bullx>

* extend format string and timedelta to days (#499)

* extend format string and timedelta to days

* replace with pd.to_timedelta

* import pandas

* ruff

* enforce "HH:MM:SS" format

* ruff
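The commits above switch the hand-rolled "HH:MM:SS" format string to pd.to_timedelta so that day-scale lead times parse correctly. A stdlib sketch of the same idea (parse_timedelta is a hypothetical helper, shown here to make the parsing problem concrete):

```python
import re
from datetime import timedelta

def parse_timedelta(s: str) -> timedelta:
    """Parse 'HH:MM:SS', optionally prefixed with 'D day(s) ' (hypothetical helper)."""
    m = re.fullmatch(r"(?:(\d+) days?,? )?(\d{2}):(\d{2}):(\d{2})", s)
    if m is None:
        raise ValueError(f"expected 'HH:MM:SS' (optionally with days): {s!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

# A plain format string breaks once lead times exceed 24 hours:
assert parse_timedelta("2 days 03:30:00") == timedelta(days=2, hours=3, minutes=30)
```

pd.to_timedelta accepts the same kind of day-prefixed strings out of the box, which is presumably why the commit replaces the custom parsing.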

* Mlangguth/develop/issue 251 (#495)

* Add score-class to evaluate-package.

* Add score-class to evaluate-package.

* Linted and ruffed code.

* Add fix to io.py and update dependencies in common.

* Several small fixes to score-class and fast evaluation.

* Add utils for evaluate.

* Moved to_list to utils and improved doc-strings.

* Improve several doc-strings, avoid formatting of logger and other changes from PR review.

* Add xhistogram and xskillscore to dependencies of evaluate.

* Ruffed code.

* Linted code.

* Fix incorrect retrieval of validation batch size in validation IO.

* Final minor changes to argument-names

* changes (#471)

* Updated to camel case. (#445)

* Updated to camel case.

* Fixed formatting.

* Revert "Updated to camel case. (#445)" (#530)

This reverts commit 4a8bd49067d86c8c9dd2930544d52cb9db8577af.

* [327] Script to create the links to output directories (results, ...) (#528)

* changes

* fixes

* slash

* slash

* checks

* checks

* Update config parameters lr and grad_clip (#545)

* updated lr and grad_clip in config

* modify lr to 1e-4

* Fixed randomization problem with masking  (#510)

* Fixed randomization problem with masking (needs to be verified)

* Making sure the seed is ok

* Fixed problem with seed init.

* More improvements. But problem still seems to be there.

* Clean up of rng handling. Re-initialization is passed through to masker, which was the issue.

* - Fixed prime numbers
- Cleaned up unnecessary rng init and added further comments.

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Sophiex/dev/upper bound targets (#526)

* recovering my stash

* Fix bug

* Clean up pull request

* Clessig/develop/fix forecasting 448 (#449)

* Removed (second) residual connection for forecasting

* Added init to forecasting engine to small values

* Default values for forecasting experiments

* Updated settings

* Setting local engine to empty

* Fix z settings.

* Revised defaults with larger net

* Revised defaults with larger config

* Restoring default config

* Restoring

* Restoring default

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Restore self.size_time_embedding in tokenizer_forecast.py (#548)

* Restore self.size_time_embedding in tokenizer_forecast.py

Fixes #547

* Remove empty line for ruff

Remove line for ruff?

* Replace cf.rank==0 with utils.distributed.is_root (#535)

Co-authored-by: wang85 <wang85@jwlogin22.juwels>
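This change centralizes the scattered `cf.rank == 0` checks behind a single predicate. A minimal sketch of what such a helper can look like, assuming the launcher exports a RANK environment variable as torchrun and typical SLURM wrappers do (the real utils.distributed.is_root presumably consults torch.distributed when a process group is initialized):

```python
import os

def is_root() -> bool:
    """Return True on the rank-0 process (hypothetical sketch of is_root).

    Falls back to the RANK environment variable, which distributed launchers
    set per process; a single process with RANK unset counts as root.
    """
    return int(os.environ.get("RANK", "0")) == 0

# Instead of repeating `if cf.rank == 0:` everywhere:
if is_root():
    pass  # root-only work, e.g. logging or checkpoint writing
```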

* Fixed handling of empty streams in plot_train (#552)

* Fixed handling of empty streams

* Fixed

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Fix train_continue (#556)

* add DocStrings to model (#268)

* added DocStrings for class ModelParams

* added DocStrings for class Model

* Docstring cleanup v1

* Docstring cleanup v2

* Docstring cleanup v3

* Docstring corrections v1

* Docstring corrections v2

* Docstring corrections v3

* ruff check v1

* ruff check v2

* ruff check v3

---------

Co-authored-by: th3002s <till.hauer@alumni.fh-aachen.de>

* Revised structure in metric JSON-file (#549)

* Update score-class to support groupby-operations for per-sample evaluation.

* Update of fast evaluation pipeline to track metrics sample-wise and dump them into the newly structured JSON-files.

* Changes according to PR review and fix for handling situations with a single sample.

* Changes according to PR review and fix to filter channels for score-calculation.

* Fixed handling of empty source/target channels (#558)

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Fix to peel_tar_channels to allow situations where no data for fstep=0 is present. (#572)

* Update era5.yml: token size 8 (#583)

* [DRAFT] CLI for scoring and plotting  (#522)

* first interface

* working version

* save json

* add omegaconf

* address comment and clean up interface

* add config

* update scoring class

* Fix to allow for channel-selection in get_data and efficiency improvement to plot_data.

* Avoid circular dependency issues with to_list function.

* Fix data selection issues.

* Enable proper handling of lists from omegaconf.

* update to mlangguth89 fork

* refactor forecast step

* ruffed

* add printing summary

* add ZarrData class

* adjust size of the plots

* attempt to solve sorting issue

* Rename model to run in config and in code.

* Fixes to Michael's review comments.

* Ruffed code.

* resync with mlangguth89 + add plot titles

* revert mixed

---------

Co-authored-by: Michael <m.langguth@fz-juelich.de>

* 'Handle list input to forecast_steps (Closes #573)' (#581)

* 'fixed bug not handling list input to forecast step #573'

* linted

* replace error with assert

* lint

* roll-back accidental lint

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* remove plot config  (#597)

* first interface

* working version

* save json

* add omegaconf

* address comment and clean up interface

* add config

* update scoring class

* Fix to allow for channel-selection in get_data and efficiency improvement to plot_data.

* Avoid circular dependency issues with to_list function.

* Fix data selection issues.

* Enable proper handling of lists from omegaconf.

* update to mlangguth89 fork

* refactor forecast step

* ruffed

* add printing summary

* add ZarrData class

* adjust size of the plots

* attempt to solve sorting issue

* Rename model to run in config and in code.

* Fixes to Michael's review comments.

* Ruffed code.

* resync with mlangguth89 + add plot titles

* revert mixed

* remove plot config + style addition to evaluation package

* ruffed

---------

Co-authored-by: Michael <m.langguth@fz-juelich.de>

* integrate IFS scores from Quaver into FastEvaluation (#600)

* first interface

* working version

* save json

* add omegaconf

* address comment and clean up interface

* add config

* update scoring class

* Fix to allow for channel-selection in get_data and efficiency improvement to plot_data.

* Avoid circular dependency issues with to_list function.

* Fix data selection issues.

* Enable proper handling of lists from omegaconf.

* update to mlangguth89 fork

* refactor forecast step

* ruffed

* add printing summary

* add ZarrData class

* adjust size of the plots

* attempt to solve sorting issue

* Rename model to run in config and in code.

* Fixes to Michael's review comments.

* Ruffed code.

* resync with mlangguth89 + add plot titles

* revert mixed

* remove plot config + style addition to evaluation package

* ruffed

* add option to comment out plotting

* resync utils to develop

---------

Co-authored-by: Michael <m.langguth@fz-juelich.de>

* [569] Load eagerly the stream content in order (#585)

* changes

* change

* changes

* Remove loading of streams also from inference.

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* [DRAFT][590] Rename metrics file (#601)

* Implemented backward-compatible function to read and write `{RUN-ID}_train_metrics.json` (new) or `metrics.json` (old)
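The backward-compatible lookup described above can be sketched in a few lines; metrics_path is a hypothetical helper illustrating the prefer-new, fall-back-to-old pattern, not the project's actual function:

```python
from pathlib import Path

def metrics_path(run_dir: Path, run_id: str) -> Path:
    """Resolve the metrics file, preferring the new naming scheme.

    Try '{RUN_ID}_train_metrics.json' (new) first; if it does not exist,
    fall back to the legacy 'metrics.json' so old runs remain readable.
    """
    new_path = run_dir / f"{run_id}_train_metrics.json"
    return new_path if new_path.exists() else run_dir / "metrics.json"
```

Writes would go to the new name only, so the legacy path survives purely as a read fallback.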

* Quick fix for #553 NaT from encode_times_target, move offset to before trigs (#589)

* quick fix for 553 NaT from encode_times_target, move offset

* change offset to 10 minutes...

* ruffed

* apply hotfix to deltas_sec

* ruffed

* fix: associate output stream names with correct index (#519)

* fix: associate output stream names with correct index

* ruffed

* fix: iteration over output items

* address comments

* fix: correctly index channels

* fix stream indexing logic, add asserts

* fix: extraction of data/coordinates for sources

* fix assert

* Clessig/develop/channel logging 282 (#615)

* Fix bug with seed being divided by 0 for worker ID=0

* Fix bug causing crash when secrets aren't in private config

* Implement logging losses per channel

* Fix issue with empty targets

* Rework loss logging

* ruff

* Remove computing max_channels

* Change variables names

* ruffed

* Remove redundant enumerations

* Use stages for logging

* Add type hints

* Apply the review

* ruff

* fix

* Fix type hints

* ruff

* Implement sending tensors of different shapes

* ruff

* Fix merge

* Fix docstring

* rerun workflow

* Review

* Change default column names

* Fix merge

* - Added ddp_average_nan that is robust to NaN/0 entries when computing mean
- Switched from all_gather to this function in trainer to robustly average
- Some code cleanup
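The NaN-robust averaging described above reduces to a simple idea: drop invalid entries before taking the mean. A plain-Python sketch of that behavior (the real ddp_average_nan averages across DDP ranks via collective communication; a list of per-rank values stands in here):

```python
import math

def average_nan_robust(values: list[float]) -> float:
    """Average a list of values, ignoring NaN entries.

    Returns NaN if no valid entries remain, mirroring how a robust
    cross-rank average must tolerate ranks that contribute nothing.
    """
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return math.nan
    return sum(valid) / len(valid)

assert average_nan_robust([1.0, float("nan"), 3.0]) == 2.0
```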

* use all_to_all communication

* Fixing problem with single-worker (non-DDP) training

* Ruffed

* Re-enabled validation loss output in terminal

* Simplified handling of dist initialized

---------

Co-authored-by: Kacper Nowak <kacper.nowak@awi.de>
Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>
Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Fix bug in corner case of data reading (#621)

* Changed logging level for some messages.

* Fix bug in data reading and add assert to better detect these problems.

* Loss class refactoring (#533)

* Fix bug with seed being divided by 0 for worker ID=0

* Fix bug causing crash when secrets aren't in private config

* Implement logging losses per channel

* Fix issue with empty targets

* Rework loss logging

* ruff

* Remove computing max_channels

* Change variables names

* ruffed

* Remove redundant enumerations

* Use stages for logging

* Add type hints

* Apply the review

* ruff

* fix

* Fix type hints

* ruff

* Implement sending tensors of different shapes

* ruff

* Fix merge

* Fix docstring

* rerun workflow

* creating loss class

* Adapted varnames in new compute_loss function to match LossModule

* comments and loss_fcts refactoring

* Suggested a separation of mask creation and loss computation

* first working version of LossModule; added unit test

* Modifications and TODOs after meeting with Christian and Julian

* Added Christian's comments and updated code partially

* Julian & Matze further advances to understand shapes

* New mask_t computations. Not yet correct, thus commented

* Resolved reshaping of tensors for loss computation

* small changes in _prepare_logging

* J&M first refactoring version finished, 2 tests ok

* First round of resolving PR comments

* add ModelLoss dataclass, rearrange mask and loss computation

* Integrating new LossCalculator into trainer.py and adding docstrings

* J&M resolved temp.item() error

* Second round of PR comments integrated

* - Fixed loss accumulation
- Cleaned up variable names

* Renamed weight

* Removed unused vars

* Inspected loss normalization for logging

* Minor clean-up

* Removing unused code.

* More refactoring: breaking code down in smaller pieces

* Fix

* Adding missing copyright

* Adding missing copyright

* Fixing incorrect indent

* Fix

---------

Co-authored-by: Kacper Nowak <kacper.nowak@awi.de>
Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>
Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Update momentum (#633)

* Update momentum

* Remove final GELU in MLP

* Adding assert to catch inconsistent config params (#630)

* Update default_config.yml (#641)

Fix incorrect stream

* Backward compatibility of 'loss_avg_mean' metric name (#637)

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Iluise/develop/plotting issues (#635)

* fix plotted timestamp

* fix crashing when a run is plot only

* ruffed

* implement comments

* Mlangguth/develop/issue 586 (#625)

* Add options to configure the marker size, the marker type and enable marker-scaling with latitude for map-plots

* Update doc-strings to follow standard format.

* Ruffed code.

* Changes due to review comments.

* Less verbose logging and improved handling of setting to plot histograms.

* Corrected error-message in plot_data.

* [DRAFT]: Prediction head architecture clean-up (#481)

* - Avoid time encoding being 0
- eps in layer norms to 10^-3
- bf16

* Make the attention dtype and norm eps configurable

* Fix gitignore and add config files

* Shuffle config files into sensible folders

* Implement first attempt at new prediction heads

* Fix some bugs

* Fix trainer compile + fsdp

* Fix trainer and better defaults

* Choose AdaLN

* Correlate predictions per cell

Previously this PR treated them as independent

* Make things more parameter efficient

* Revert "Make things more parameter efficient"

It made things way worse

This reverts commit 0f31bf11c82ee9f951810ac6782a4b31b83b8757.

* Improve the prediction heads at small sizes

* Improve the stability of training

Two main changes: better beta1 and beta2 values in AdamW, and removing
GELU

* Adding some more regularisation

In particular to prevent training divergences and overfitting

* Forgot the dropout in MLPs

* Tune the learning rate

* Add the original prediction heads

CAREFUL: Untested!!!

* Fix bugs and ruff

* Restore old version last part

* Start fixing the defaults

* Deleting hpc specific configs

* Deleting hpc specific configs

* Defaults and documentation

* Apply ruff

* Clean up code

* Add one more comment

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Fix bug in logging buffer reset (#651)

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* use config dropout_rate in EmbeddingEngine (#646)

* Make numpy argsort version resilient (#645)
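NumPy's default argsort algorithm is not guaranteed to be stable, so the relative order of tied entries can differ between NumPy versions; requesting a stable sort pins that order down. A small sketch of the fix this title suggests (hedged: the actual change in the PR may differ):

```python
import numpy as np

# With ties present, the default kind may order equal elements differently
# across NumPy versions; kind="stable" keeps tied entries in input order.
a = np.array([2, 1, 2, 1])
order = np.argsort(a, kind="stable")
assert order.tolist() == [1, 3, 0, 2]  # both 1s before both 2s, input order kept
```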

* Fix backward compatibility (#655)

* Implement global and per-cell channel masking (#496)

* creating masking class and adapting tokenizer_masking to use this class

* minor changes to masking.py and tokenizer_masking

* removed old tokenizer_masking

* include masking_strategy in default_config

* change ValueError to assert

* linting formatting changes files

* further linting of docstrings

* create mask_source and mask_target in Masker, and update tokenizer_masking to use these, then style improvements

* linted masking, tokenizer_masking

* modify masker, rng and perm_sel now part of class, remove extra masking_rate, update comments, remove archived class

* remove check if all masked, not masked

* remove self.masking_rate from MultiStreamDS class, and masking args from batchify_source

* update tokenizer utils with description of idx_ord_lens in comment

* remove masking args from batchify_, perm_sel removed now internal to Masker class, remove handling special cases of masking (all masked)

* working implementation of healpix level masking in Masker, with too many prints and hardcoded hl_mask and hl_data

* adding masking_strategy: to config

* remove unused mentions of masking_combination

* removed comment about streams

* changed assert to check self perm_sel is not None

* ruff masking, tokenizer_masking

* implementation of healpix masking code with lots of printing

* removed print statements from masking.py

* minor line change

* remove default for strategy_kwargs

* add strategy_kwargs to config, and pass through masker to pass masking strategy specific args

* vectorise child indices calcs, implement masking_rate_sampling, minorly updated docs

* remove print statements

* cf.strategy_kwargs passed to Masker in multi_stream_data_sampler

* masking_strategy random and strategy kwargs passed to config

* ruffed

* pass cf.get(strategy_kwargs or {}) to the Masker and update masking to reflect this

* update config so it does not include strategy_kwargs, no longer needed

* move asserts for healpix to constructor, rename to masking_strategy_config, update config with example of healpix

* test working version, understanding what is happening

* revert breaking develop merge and conflict in config

* default config put channel masking

* reverting the accidental revert...

* small change to config

* implemented global and per-cell per channel masking in masking, change to config

* remove print statements from multistream

* updated config for compatibility to run immediately

* cleaned code, assert to fail for different number of source and target streams

* updated default config to match latest

* fixed _generate_channel_mask to handle empty cells of data

* fixed docstring of masker

* ruffed linted

* rename l in token_lens

* lint ruff, remove prints

* add assert for source and target channels must be the same

* fix config to develop, new assert, remove assert

* revert assert statement for readability

* clip the values in masking_rate_sampling to 0.01 and 0.99
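The clipping mentioned above guards against degenerate batches where everything or nothing is masked. A hypothetical sketch of what masking_rate_sampling with that clip can look like (sample_masking_rate and the Gaussian perturbation are assumptions for illustration, not the project's exact code):

```python
import random

def sample_masking_rate(mean_rate: float, rng: random.Random) -> float:
    """Sample a per-batch masking rate around mean_rate, clipped to [0.01, 0.99].

    The clip ensures the sampled rate never masks all tokens or none,
    which would otherwise require special-case handling downstream.
    """
    rate = rng.gauss(mean_rate, 0.1)
    return min(max(rate, 0.01), 0.99)

rng = random.Random(0)
rate = sample_masking_rate(0.6, rng)
assert 0.01 <= rate <= 0.99
```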

* revert cell name to tl

* remove empty lines from model

* remove empty line from embeddings

* remove empty line tokenizer_masking

* ruff masking, tokenizer_masking

* update config again to develop version

* update config comment for masking strategies

* update channel masking to handle non-data channels for new loss

* ruffed

* Implemented check that for channel masking source and target channel have to be identical

* Minor code improvements

* Fixed incorrect return type for special case

* Ruffed + and reduced magic constants

* Minor fixes to _generate_healpix_mask

* Cleaned up and optimized mask generation for channel masking

* changed to use mode global or per_cell, improved docstring for masking strategies

* added documented valid examples for masking_strategy_config to default_config

* ruffed

* update example masking_strategy_config in default

* Minor adjustments to default settings

* remove mention of hl_data in masking_strat_config

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Removed that checkpoint is saved at the first batch (#663)

* Clessig/develop/fix data reading anemoi missing date 671 (#672)

* Changed logging level for some messages.

* Fixed unhandled exception with missing dates.

* Fixed debug message

* Make compare_run_config.py usable again (#661)

* Update compare_run_config.py to use existing functions from current repo.

* Ruffed code.

* [595] Changes for running a notebook script  (#598)

* Changes

* Changes

* work

* change

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* reverse old changes

* linter

* Implement regional evaluation  (#652)

* Add RegionBoundingBox data class to score-utils to handle evaluation for different regions.

* Implement region-specific evaluation in plot_inference.py.

* Adapted utils.

* Introduction of clean RegionLibrary in score_utils.py.

* Ruffed code.

* Updates following reviewer comments.

* Ruffed code.

* Clessig/develop/fix loss 678 (#679)

* Changed logging level for some messages.

* Fixing bug with incorrect counting

* using config results path instead of fixed path (#631)

* using config results path instead of fixed path

* ruff

* Add forgotten LayerNorm (#687)

* Add forgotten LayerNorm

* Apply ruff

---------

Co-authored-by: Sophie Xhonneux <sxhonneux@clariden-ln001.cscs.ch>

* Fix performance degradation in loss computation (#690)

* Changed logging level for some messages.

* Refactored loss computation to improve performance.

* Working around ruff issue

* - Refactored code to improve structure and readability
- Fixed problem with incomplete normalization over loss functions
- Solved problem with mse_weighted as loss function when mse is specified

* Fixed problems with multi-worker training

* Fixed indentation bug and bug in assert

* [DRAFT] Rename plot_inference.py and entrypoint for evaluation (#683)

* Rename plot_inference.py.

* Rename of main-method and move parsing of arguments for entrypoint.

* Introduce entrypoints to fast evaluation.

* Fix to call of main in run_evaluation.py.

* Rename entrypoint and add dependency to weathergen-evaluate.

* Add missing comma in pyproject.toml.

* Option for non-linear output layer in prediction head (#673)

* Add score-class to evaluate-package.

* Add score-class to evaluate-package.

* Lintered and ruffed code.

* Add fix to io.py and update dependencies in common.

* Several small fixes to score-class and fast evaluation.

* Add utils for evaluate.

* Moved to_list to utils and improved doc-strings.

* Improve several doc-strings, avoid formatting of logger and other changes from PR review.

* Add xhistogram and xskillscore to dependencies of evaluate.

* Ruffed code.

* Lintered code.

* Add helper function to get custom last activation.

* Add option to control stream-specific non-linear output layer.

* Controlling print-statement to model.py.

* Corrected handling of config for prediction head.

* Add support for stream-specific, optional non-linear output activation function.

* Provision of ActivationFactory.

* Ruffed.

* Changes following review comments.

* Fix in parsing final_activation-argument.

* Clessig/develop/fix empty 647 (#675)

* Changed logging level for some messages.

* Removed checks that requires non-empty channels

* Adding warning

* Fixed convergence of training (#696)

* Restored old prediction head functionality. Other adjustments/reverts, in particular in attention.

* Ruff'ed

* Addressed reviewer comments and cleaned up minor details

* Fixed bug in obs data reading (#698)

* Restored old prediction head functionality. Other adjustments/reverts, in particular in attention.

* Ruff'ed

* Fixed bug in obs data reading so that data no longer violates the window

* Fix

* Update data_reader_obs.py

* Restoring to develop

* Fix

* Ruffed

* Clessig/develop/fix logging verbosity 564 (#619)

* Changed logging level for some messages.

* Added support for more fine grained output control.

* Changed logging setting for inference.

* Minor improvement to doc string

* include run_id in debug log file

* ruff

---------

Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>

* Refactor path-setting for 'model' and 'results' to be dynamic (no relative paths) (Closes #591) (#677)

* temp commit wip

* change model_path and run_path setting to dynamic (independent of HPC) (untested)

* removed unnecessary set_paths references

* linted

* remove commented code

* removed commented lines

* Enable plot_train with dynamic paths

* lint

---------

Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>

* Fix (#715)

* modified evaluation api, callable as python function (#713)

* Fixed bug for degenerate streams (#723)

NaN-robust min/max computation.
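
A NaN-robust min/max typically means skipping NaN entries rather than letting them poison the reduction. A minimal sketch of the idea in plain Python (names and the all-NaN fallback are assumptions, not the repository's code):

```python
import math


def nan_robust_min_max(values: list[float]) -> tuple[float, float]:
    """Return (min, max) over the finite entries, ignoring NaNs.

    If every entry is NaN (a fully degenerate stream), return
    (nan, nan) instead of raising on an empty sequence.
    """
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return (float("nan"), float("nan"))
    return (min(finite), max(finite))
```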

* Fixed (#725)

Resolves config loading error when passing a `model_dir`

* Fix on loading model config (#726)

* Small fix on loading model config

* minor change

* Detect if channels for plotting differ from JSON and recompute if necessary (Closes #701) (#718)

* new branch

* detecting changes in channel spec

* style changes

* style changes

* Delete config/plot_config.yml

* incorporated PR feedback

* added run_evaluation (again)

* Clessig/develop/fix logging 719 (#720)

* Cleaned up to use proper logger

* Cleaned up to use proper logger

* Fix logging: needs to be registered per output stream and not per logging level

* Set logging level consistently with debug to file

* Fixes

* Added FSDP-sharding after loading model for train continue (#729)

* Added FSDP-sharding after loading model for train continue

* Improved consistency

* Fixed resetting FSDP after checkpoint saving

* Update handling of `run_path` and `model_path` (Closes #716) (#732)

* proposed solution, untested

* assert instead of error

* lint

* incorporating PR feedback

* lint

* added explicit argument passing

* lint

* Make cartopy map resources a shared asset to prevent downloading from… (#731)

* Make cartopy map resources a shared asset to prevent downloading from the internet, which is not always possible

* Replaced print by logger statement

---------

Co-authored-by: xhonneux2 <xhonneux2@jwlogin22.juwels>
Co-authored-by: karlbauer1 <karlbauer1@jwlogin21.juwels>

* Clessig/develop/fixes hackathon (#736)

* Fixed some comments that generated warnings

* Added to create path for log files if it doesn't exist

---------

Co-authored-by: Christian Lessig <christian.lessig@ovgu.de>

* Revised path defaults and output directory structure for fast evaluation (#681)

* First changes to path-handling.

* Consistent path for maps and histograms.

* Update of evaluation scipts for proper path defaults and directory structures.

* Make root-path to repo available via common-package.

* Introduce proper defaults to plot_inference.py and set-up desired directory structure for evaluation output.

* Rename of results_dir-parameter to results_base_dir

* Ruffed code.

* Allow for run-specific results-paths and use config to get defaults.

* Several fixes and consistency improvements.

* Remove manual default usage in plotter.py

* Ruffed code.

* Update __init__.py

Remove _REPO_ROOT.

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Mk/develop/fix plot train 727 (#738)

* Load model_path from private config if not provided

* Use existing function to get private model path

* Incorporated PR comments

* Fix problems with rel paths in logging files (#742)

* Fixed relative path handling for logging files.

* Adding default argument to _load_private_conf()

* Implement first function for latitude weighting (#705)

* Changed logging level for some messages.

* Refactored loss computation to improve performance.

* Working around ruff issue

* - Refactored code to improve structure and readability
- Fixed problem with incomplete normalization over loss functions
- Solved problem with mse_weighted as loss function when mse is specified

* Fixed problems with multi-worker training

* add location weights, first commit

* assertion on mask and len(location_weights)

* restructuring of location weights and fixes in mse_channel_location_weighted function

* fix coords_raw dependency on offset and fstep

* ruff

* addressing review commits and fixing bug

* rm location_weight from default stream config

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>
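
Latitude weighting in earth-system losses commonly scales each grid point by cos(latitude), since points near the poles cover less area on a lat/lon grid. A minimal sketch of such location weights (function name and normalization are assumptions, not the repository's API):

```python
import math


def latitude_weights(lats_deg: list[float]) -> list[float]:
    """Per-point weights proportional to cos(latitude), normalized to sum to 1.

    High latitudes get smaller weight, approximating the shrinking
    cell area of a regular lat/lon grid toward the poles.
    """
    raw = [math.cos(math.radians(lat)) for lat in lats_deg]
    total = sum(raw)
    return [w / total for w in raw]
```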

* Fix failure for notebooks. (#750)

* add proper error message for source_include not equal to target_include (#767)

* Implemented fractional target selection  (#751)

* implemented fractional target selection

* ruffed

* fix up configs and <= to accept target_fraction 0.0

* revert to simple implementation of per stream sampling_rate_target

* restore configs

* Corrected formula for L2-error in score-class. (#721)

* Corrected formula for L2-error in score-class.

* Introduced option to get the original or the squared L2-norm.

* Added doc-string for L2-norm.
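
The squared-vs-original distinction above can be sketched as a flag on the error function (the signature is an illustrative assumption, not the score-class API):

```python
def l2_error(pred: list[float], target: list[float], squared: bool = False) -> float:
    """L2-error between two vectors; optionally return the squared norm.

    squared=False gives the usual Euclidean distance, squared=True the
    sum of squared differences (no square root).
    """
    sq = sum((p - t) ** 2 for p, t in zip(pred, target))
    return sq if squared else sq ** 0.5

# e.g. l2_error([3.0, 4.0], [0.0, 0.0]) is 5.0; with squared=True it is 25.0
```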

* Fix sampling rate (#773)

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Update default_config.yml (#776)

* Adding the animations feature, fixing stable colorbars (not per stream) (#692)

* Adding the animations feature
* keep only the animations and max-min functions.

* Sophiex/dev/name modules (#754)

* Add names to modules as prep for freezing

* Add functionality to freeze modules based on added names

* Ruff

* Clean up

* Wrong import path

* Ruff

* Fix animations bug with paths (#781)

* Fix another bug in animations (#783)

* Work around to allow for model freezing (#785)

* Work around to allow for model freezing.

* Ruff

* fix to avoid the whole-model element of named_modules, which would freeze the whole model

---------

Co-authored-by: clessig <christian.lessig@ecwmf.int>
Co-authored-by: Sebastian Hickman <seb.hickman@ecmwf.int>
Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* Fast evaluation for integration tests (#770)

* rename module level constant

* split inference into own method

* use proper fast evaluation pipeline for `evaluate_results`

* ruffed

* remove assert => different bug

* adjust tests for new plot template

* Update checking the value of plot_histograms and plot_animations (#788)

* pass StreamData instances to io.py (#779)

* Rename anemoi directories and built backward compatibility (Closes #709) (#771)

* renamed anemoi dirs and built backward compatibility

* ruff

* removed stream directories and updated logging

* renamed all streams

* ruff

* seviri file name change

* cerra_seviri folder update

* cerra path update

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* Fix IO when targets/preds are empty. (#760)

* Modify DataReaderObs to get base_yyyy... from stream config (#794)

* modify DataReaderObs to get base_yyyy... from stream config, and set it in the ctor, with default of 19700101. Use it in _setup_sample_index. Remove loading obs_id attr. Add igra.yml with example usage.

* add license to igra config

* update to ISO base_datetime, parse to read idx from zarr

* fix integration tests (#796)

* Fixed bug for empty source (#800)

Co-authored-by: clessig <christian.lessig@ecwmf.int>

* Train continue function with arguments (#803)

* add train_continue_from_args to call with arguments


---------

Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>

* remove module common/mock_io (#809)

* Update data_reader_obs.py removing asserts (#817)

* Sgrasse/develop/issue 616 (#648)

* encapsulate extraction of source data

* bundle offseting of key attributes

* consolidate calculation of datapoints indices into method

* encapsulate extraction of coordinate axis in function.

* replace attribute `channels` by `target_channels` and `source_channels`

* ruffed

* ruffed

* fixes

* address michas comments

* reactivate assert

* fix typo / renaming

* small fix

* uncomment source_n_empty and target_n_empty unused variables

* fix unit tests (#814)

* Plot substeps (#789)

* Create subplots with grouping by valid_time.

* Create histograms at substeps with grouping by valid_time.

* Make use of inference run config to distinguish between situations where all datapoints of a sample should be plotted or where sub-stepping is required.

* Add helper function to get values of keys from stream configs.

* Corrected loading of model config.

* Ruffed code and turning message-level on cartopy path to debug.

* Revisions following reviewer comments.

* fix histograms

* ruffed

---------

Co-authored-by: Ilaria Luise <iluise00@login05.leonardo.local>
Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* Add the possibility of common ranges in plots per variable and stream (#801)

* Create subplots with grouping by valid_time.

* Create histograms at substeps with grouping by valid_time.

* Make use of inference run config to distinguish between situations where all datapoints of a sample should be plotted or where sub-stepping is required.

* Add helper function to get values of keys from stream configs.

* Corrected loading of model config.

* Ruffed code and turning message-level on cartopy path to debug.

* Add the possibility of common ranges in plots per variable and stream

* Revisions following reviewer comments.

* fix histograms

* ruffed

* update utils

---------

Co-authored-by: Michael <m.langguth@fz-juelich.de>
Co-authored-by: Ilaria Luise <iluise00@login05.leonardo.local>
Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* Fix to io problems. (#820)

* Enable histograms for data with some NaNs (#823)

* Fix to filter NaNs before histogram creation.

* Removed unused code lines and correct for bug in marker scaling in plotter.py.

* Clessig/develop/fix empty io 819 2 (#822)

* Fix to io problems.

* Fix issues in input

* Iluise/fix empty io 819 plotting (#826)

* Fix to io problems.

* Fix issues in input

* fix plotting

* ruffed

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* fix plotting for partially filled first forecast steps (#828)

Co-authored-by: luise1 <luise1@jrc0288.jureca>

* Fix calculation of scores per fstep (#853)

* fix calculation of scores per fstep

* simplified syntax

---------

Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>
Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* Fix Issue 835 (#841)

Enable freezing the target coord embedding when it is just a simple
layer

* Improve r3tos2 (#744)

* vectorized r3tos2

* revise comment

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
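
r3tos2 presumably maps 3D unit vectors to spherical (lat/lon) coordinates. A scalar sketch of that conversion (the vectorized repository version operates on whole arrays; names and radian convention are assumptions):

```python
import math


def r3_to_s2(x: float, y: float, z: float) -> tuple[float, float]:
    """Map a point on the unit sphere in R^3 to (lat, lon) in radians."""
    # Clamp z against floating-point drift before asin.
    lat = math.asin(max(-1.0, min(1.0, z)))
    # atan2 handles all quadrants and the x == 0 case.
    lon = math.atan2(y, x)
    return lat, lon

# e.g. (1, 0, 0) maps to (0.0, 0.0); (0, 0, 1) maps to the north pole, lat = pi/2
```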

* Sophiex/dev/latent noise (#594)

* - Avoid time encoding is 0
- eps in layer norms to 10^-3
- bf16

* Make the attention dtype and norm eps configurable

* Fix gitignore and add config files

* Shuffle config files into sensible folders

* Implement first attempt at new prediction heads

* Fix some bugs

* Fix trainer compile + fsdp

* Fix trainer and better defaults

* Choose AdaLN

* Correlate predictions per cell

Previously this PR treated them as independent

* Make things more parameter efficient

* Revert "Make things more parameter efficient"

It made things way worse

This reverts commit 0f31bf11c82ee9f951810ac6782a4b31b83b8757.

* Improve the prediction heads at small sizes

* Improve the stability of training

Two main changes: better beta1 and beta2 values in AdamW, and removing GELU

* Adding some more regularisation

In particular to prevent training divergences and overfitting

* Create classes for latent noise

* Add the latent noise after the local engine

* Add the KL loss

* Formatting

* Clean up

* Use the same for loop as before

* Prepare branch for merge

* Remove superfluous configs

* Restore default configs

* Mistake in the merge fixed

* Final beauty changes

* Final clean up

* Ruff

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* [Hotfix] Fix crash when using list of forecasting steps (#824)

* Fix crash when using list of forecasting steps

* Ruff

* Grammar fix

* Fix grammar

* Add checking forecast steps list

* Review

* Allow 0 as forecast step

* Add list length check

* Assert non-negative forecast step integer, added assertion messages

* Ruff

* ruff

* Move check to config

* what the ruff

---------

Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>
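
The checks described above (non-empty list, non-negative integers, 0 allowed) can be sketched as a small validator (function name and error messages are assumptions, not the repository's config code):

```python
def validate_forecast_steps(steps: list[int]) -> list[int]:
    """Check that forecast steps form a non-empty list of non-negative ints."""
    if not isinstance(steps, list) or not steps:
        raise ValueError("forecast steps must be a non-empty list")
    for s in steps:
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(s, int) or isinstance(s, bool) or s < 0:
            raise ValueError(f"forecast step must be a non-negative integer, got {s!r}")
    return steps
```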

* add tokenizer base class (#815)

* add tokenizer base class

* ruffed

* ruffed v2

* move calculation of centroids to base_class

* move size_time_embedding initialization

* remove ABC from tokenizer base_class

* renaming

* ruffed

* ruffed v2

* add return value to compute_source_centroids

---------

Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* vectorize s2tor3 (#745)

* vectorize s2tor3

* ruff code

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Remove cleaning stream name when logging loss (#763)

* Combine masking strategies during training, with appropriate masking_… (#756)

* combine masking strategies during training, with appropriate masking_strategy_config

* restore config samples per validation

* restore configs, and add to masking_strategy_config

* clarify pass to per batch per stream

* updated combination masking to support same masking strategy for all streams in the batch. Strategy resampled for every batch.

* rename so we have masking_strategy and masking_strategy_per_batch

* ruffed

* clean, default to different strategy per batch for combination

* ruff

* remove unused variable

* updated docstrings (#875)

Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>

* Enable correct reading of channels, forecast_step, sample variables in plot config file (Closes #717) (#755)

* adjusted run_evaluation and utils code to take into account forecast_step variable from config

(cherry picked from commit 26c26a923cabc5777bc75ef911f0fc3c61397e1a)

* print statement change

* catching error when fstep not present in zarr file

* upgrades based on PR feedback

* intermediate commit

* intermediate commit

* new functions _get_channels_fsteps_samples and check_metric

* edited plotting

* inter commit

* fixed bug in  get_data

* self review

* refactor

* dummy commit

* inter commit

* feedback applied

* incorporate review feedback

* removed sorting of fsteps_final

* remove comments

---------

Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* Implement causal masking as MTM strategy (#798)

* first rough implementation of causal masking

* incorporated combine masking strategies

* include per stream sampling rate target in tokenizer_masking based on other PR

* clean up implementation of causal masking

* remove TODO

* remove old causal masking function

* add latest error message for channel

* change if to elif for causal masking

* if to elif in mask_target

* cleaned up causal masking code

* tokenizer_masking small change

* updated config

* fix up config

* restore era5 config

* ruffed

* update config and masking.py with causal masking specific masking rate, and some comments

* ruffed

* roll back causal_masking_rate changes, return to just use masking_rate

* faster version of causal masking, vectorise where possible. Need list comprehension for variable length tokens

* ruffed

* add log scale and refactor plot_summary (#865)

* add log scale and refactor plot_summary

* add plot_utils

* add grid

* ruffed

* fix marker size

* fix global plotting options

* add types

* ruffed

* Fixed stream name factoring (#534)

* Updated to camel case.

* Fixed formatting.

* to reflect upstream develop

* got rid of regex and changed formatting of str names

* pulled recent changes from upstream develop

* Removed refactoring of lf_name.

* clean_name with the new changes

* Fetched latest changes to the branch

* Fixed linting

* Fixed stream name without touching the losses dict

* fixed type annotation

* add srun to integration-test in actions script (#886)

* add srun to integration-test in actions script

* add --offline flag to integration-test in actions.sh

* Merge compare_run_configs.py with markdown table version (#699)

* initial comments to outline implementation

* Refactor config comparison script to support YAML input and enhance output formatting

* remove unused code

* Add example configuration for model run IDs and display patterns

* shorten

* Add 'tabulate' dependency to enhance table formatting capabilities

* add instructions to config

* restore option for command line run ids and model dirs

* ruff

* fix arg parsing

* add option to show specific or all parameters in config comparison

* ruff

* Remove 'tabulate' from dependencies

Removed 'tabulate' dependency from project requirements.

* logging, imports and dependency in compare_run_configs.py

* fix logging and dependencies

* ruff

* set default

* fix arg order and checks

* improve model directory handling and add exception when there is not latest model

* refactor error handling in main function to omit exception details in logs

* ruff

* ruff

* add weathergen dependency

* make file executable

* add private home argument

* revert to default model path argument assuming symlink to shared folder is set

* ruff

* Implement splitting zarr and regex filenames (#524)

* Implement splitting zarr and regex filenames

* Optimize dask reading operations

* Ruff

* Review

* Ruff

* Remove stream name cleaning when logging loss

* Add tolerance to setting std to 1.0

* Implement input column reordering and channel exclusion

* Update stream config

* Ruff

* Add config file

* Implement variable state persistence

* ruffffff

* ruffa

* ruf ruf

* Add select channels method

* updating .gitignore file to include all development directories without / (#900)

* [804] vectorize tokenize_window_space (with test) (#893)

* vectorize tokenize_window_space

* import pad_sequence from torch

* change vari names, remove device, add comments, ruff code

* changes

* changes

* changes

* simplify

* small change

* unit tests

* unit tests

* unit tests

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* [812] efficient tcs computation (with tests) (#894)

* efficient tcs computation

* revise vectorize tcs_optimized, ruff code

* add typing

* changes

* changes

* merge

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* [811] Improve perf locs to cell coords ctrs (#895)

* optimize locs_to_cell_coords_ctrs

* revise get_target_coords_local_ffast for new optimized locs_to_cell_coords_ctrs

* changes

* changes

* tests

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>
Co-authored-by: Sophie X <24638638+sophie-xhonneux@users.noreply.github.com>

* [810] optimize locs_to_ctr_coords (with tests) (#896)

* optimize locs_to_ctr_coords

* changes

* changes

* changes

* changes

* merge

* changes

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Migrated Config to common (#607)

* Updated to camel case.

* Fixed formatting.

* to reflect upstream develop

* got rid of regex and changed formatting of str names

* pulled recent changes from upstream develop

* migrated config to common

* fixed lint issues

* Corrected all the changes

* syntax err fixed

* Fixed import

* Latest upstream changes pulled

* Fixed Linting errors

* Lint fix

* Pulled latest

* fixed other occurences

* Fix compare_config after config went to common (#903)

* fix after config went to common

* Change argument type for --show option from int to str in main function

* Update default config path in main function to compare_config_list.yml

* Iluise/develop/add io reader (#891)

* first implementation of reader class for evaluation package

* add io reader

* move check_availability to reader

* update to develop

* fix retrieve results

* address comments

* Fix minor bug in modules (#909)

---------

Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>

* [908] Harmonize the linter check between the CI and our CLI (#910)

* changes

* changes

* [554] Updates the PR template (#912)

* changes

* changes

* changes

* comments

* [906] Bug fix in tokenizer (#907)

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* cleanups

* changes

* comments

---------

Co-authored-by: Seb Hickman <56727418+shmh40@users.noreply.github.com>

* Add levels (#916)

* changes to include discrete levels in colormap if needed

* Slightly change the position of the feature

* Lint

* changes (#920)

* [926][evaluation] weatherGen reader for evaluation package (#927)

* weatherGen reader for evaluation package

* ruffed

* [939] Fix CI (#940)

* changes

* changes

* Implement forecast activity metrics (#892)

* Add forecast activity calculations and update fstep handling in utils

* Add forecast rate of change metrics (froct, troct) to score calculations

* update description

* add next data to verified data

* move cases for kwargs to score

* refactor froct and troct to use calc_change_rate

* remove metric specific kwargs in calc_scores_per_stream

* calc_change_rate now gives NaN array when next step is None

* fix nans

---------

Co-authored-by: Julian Kuehnert <julian.b.kuehnert@gmail.com>
Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com>

* added IFS-FESOM streams and updated all stac files using jsonnet (#934)

* added IFS-FESOM streams and updated all stac files using jsonnet

* changes according to comments by Ilaria

* resolved by using providers from common.jsonnet file

* changed reference to ecmwf and develop branch

---------

Co-authored-by: Patnala,Ankit <a.patnala@fz-juelich.de>

* [939] Catches failures of labeling CI job (#950)

* changes

* more permissions

* more permissions

* [880] Informative type checks in the CI (#915)

* attempt

* fixing pyrefly

* changes

* changes

* changes

* Sets dropout rate to 0 in eval mode for flash_attn (#923)

* added check for train/eval for setting dropout_p value

* ruff

* rm ceil, conj, floor, and matmul from annotations.json (#951)

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Add the calculation of 10ff for ERA5 and CERRA  (#914)

* Add the calculation of 10ff

* Caring for cases where 10ff cannot be calculated

* Create a new script for derived_channels, minor changes to reader_io

* Remove stream specific settings add regex

* Add more datetime formats (#962)

* fix error when global_plotting_opt does not exist (#964)

* fix error when global_plotting_opt does not exist

* fix linter

* changes (#1009)

* Revert "changes (#1009)" (#1012)

This reverts commit 2af1c09a11e6dd027d247b670737bbac0cd1a766.

* sorcha/dev/500 (#1001)

* lint reformatting + fixing get_channels

* debug messages

* [datasets] move to new cerra and new era5 (#995)

* move to new cerra and new era5

* fix cerra

* Removed the method freeze_weights_forecast and all forecast_freeze_model flag occurences (#924)

* Add Coordinate System Conversion to DataReaderFesom (#1024)

* Add coordinates conversion

* Ruff

* Add check for longitude

* Sophiex/dev/fsdp2 fix (#959)

* Save current state

* Save current state

* Barebone FSDP2 prototype TODO save checkpoints

* First version of saving model

* Fix save_model

* Log everything and log to files

* Remove redundant path creation

* Allow for both slurm and torchrun + fewer log files

* Cleaning up init_ddp

* Ruff

* Attempt to avoid duplicate logging

* FSDP2 with mixed precision policy

* Ruff

* Clean up and logging

* Try to get loggers to behave as we want

* Makes ruff unhappy but works

* Fixed ruff issue

* Fixed problems with multi-node training.

* Fix for interactive/non-DDP runs

* No idea why, but this seems to work so far

Committing simply so it is saved, obviously needs cleanup

* Still works! So which is it memory or the grad scaler?

* Also still works, I now strongly suspect the amp.gradscaler

* This still works, I have no clue anymore why but whatever it works
now....

* Enable loading model from absolute paths

* Enable loading for 1 GPU only

* Fix 1 GPU train continue

* Appease ruff

* Fix saving the model more regularly and perf logging

* Fixed problem when training with 2 nodes.

* Fix data loader seed

* Appease ruff

* Shouldn't overwrite with_fsdp like this

* Potential fix for FSDP2 issue

with different ranks using different model parts

* Fix loss scaling and logging of dummy data loss

* Clean up

* Appease ruff

* Fixed problem when source channels are empty (i.e. with diagnostic trainings).

* Update io.py

* FSDP2 suggestions from Tim (#1015)

* comments

* sophie's comments

* removed logger suggestions

* Clean up deadcode etc

* Removing unused imports that the linter didn't like

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>

* Ensure sample coordinate is repeated along ipoint for single sample cases in WeatherGenReader (#1026)

Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com>

* Fix change rate calculation by aligning s1 with s0 (#1007)

* Fix change rate calculation by aligning s1 with s0

* Refactor score calculation to remove unnecessary alignment and add sorting function for coordinates

* use .values option

* Optimize pos enc harmonic (#1033)

* add device & dtype

* ruffed

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* [evaluation] fix score computation with empty cerra samples (#1039)

* fix samples

* ruffed

* answer comments

* ruffed

* Add score cards plotting feature (#1041)

* add the feature of score_cards

* Refactor, fix error when sample are different for each run, linting

* fix bug, fix sizes when skill difference is huge

* linting

* changes on comments

* linting

* Clessig/develop/fix inference 1049 (#1053)

* add device & dtype

* ruffed

* Fix inference

---------

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Channel weighting in loss computation (#753)

* introducing channel weights

* tested channel weighting

* adding target_channel_weights to data_reader_base

* uncomment target channel parsing in anemoi dataset

* remove channel weights from default stream config

* Adds default config for run_evaluation (#1028)

* adding default config + changing yml path locations

* linter checks

* linter checks

* revert .yml file

* updates

---------

Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>

* Fix CERRA eval breaking with coord sorting in `froct` (#1057)

* Move coord sorting inside score function to be metric-specific

* Linting: Removed unused import

* Return nan data array to prevent crash

* fix nans shape in calc_change_rate

---------

Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* [1059][eval] Fix eval crash of inference models from other HPCs (#1060)

* Read model path from private repo instead of inference config

* Linting: Organized imports

* Interface improvements

* [1022] Getting WG to work on santis (#1023)

* working pytorch

* changes

* Fix for code to work on Alps-Santis

* changes

* cleanups

* changes

* reverting change

* having issues with the latest branch on santis

* changes

* changes

* changes

* override with cpu

* working for cpu

* flash-attn moved to gpu

* remove constraint

* simplifying

* trying

* working on atos

* changes

* macos

* changes

* cleanups

* actions

* actions

* actions

* actions

* changes

---------

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

* fix crash in case of missing streams  (#1058)

* fix issue with empty region

* fix non existing stream

* fix channel order in evaluation (#1066)

* New templates for issues (#1017)

* changes

* changes

* Revert "New templates for issues (#1017)" (#1071)

This reverts commit 3a6e7b826b7b29a6df4af27b6771567474302fb3.

* [1002] Template for issues try 2 (#1072)

* changes

* changes

* issue with template

* updates

* changes

* issue

* issue

* [1073][model] Adds latent noise imputation (#1074)

* Add latent noise imputation to model.py with backwards compatibility

* Linted

* Resetting default_config, except for new flag

* Resetting default_config 2nd try

* add modules to annotations.json (#1035)

Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de>

* Jk/develop/gamma decay (#998)

* Update to develop, prepare for new experiment series

* gamma decay over fsteps first commit

* add gamma decay factor to config

* working gamma decay weighting

* rm breakpoint

* rm eval and plot configs

* reverting default config

---------

Co-authored-by: Matthias Karlbauer <matthias.karlbauer@ecmwf.int>
Co-authored-by: Julian Kuehnert <julian.kuehnert@ecwmf.int>
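A minimal sketch of gamma decay over forecast steps as described above (function name and normalization are assumptions; the real code weights the per-step losses inside the trainer, with the decay factor exposed in the config):

```python
def fstep_weights(num_fsteps, gamma):
    # Exponentially decaying weights gamma**k over forecast steps,
    # normalized to sum to 1 so the overall loss scale is unchanged.
    w = [gamma ** k for k in range(num_fsteps)]
    s = sum(w)
    return [x / s for x in w]

# Later forecast steps contribute progressively less to the loss.
weights = fstep_weights(4, 0.5)
```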

* Add materialisation of new modules before loading checkpoint (#1030)

* Add materialisation of new modules before loading checkpoint

* Initialize new modules in load_model

* Fix adding new embedding networks
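The idea behind materialising new modules before loading can be sketched as a state-dict merge (illustrative only; the real code initializes the new modules and then loads the checkpoint non-strictly):

```python
def merge_state_dicts(model_state, ckpt_state):
    # Start from the freshly initialized model state and overwrite
    # every entry the checkpoint knows about; newly added modules
    # (e.g. new embedding networks) keep their fresh initialization.
    merged = dict(model_state)
    for key, value in ckpt_state.items():
        if key in merged:
            merged[key] = value
    return merged

fresh = {"encoder.w": 0.0, "new_embed.w": 0.5}
ckpt = {"encoder.w": 1.0}
state = merge_state_dicts(fresh, ckpt)
```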

* Clessig/develop/fix kcrps 1077 (#1078)

* Improved robustness for loss fcts where ch loss does not make sense

* Re-enabled kernel CRPS and added weighting options

* Fixes

* Improved tensor reordering
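For reference, the kernel (energy) form of CRPS that these commits re-enable can be sketched for a scalar observation (a simplified stand-in; the real loss works on batched tensors and supports the weighting options mentioned above):

```python
def kernel_crps(ens, y, fair=True):
    # CRPS = E|X - y| - 0.5 * E|X - X'| over ensemble members X.
    # fair=True uses the unbiased 1/(m*(m-1)) spread estimator.
    m = len(ens)
    skill = sum(abs(x - y) for x in ens) / m
    spread = sum(abs(a - b) for a in ens for b in ens)
    spread /= m * (m - 1) if fair else m * m
    return skill - 0.5 * spread
```

A perfect deterministic ensemble (all members equal to the observation) scores exactly zero.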

* Sgrasse/develop/issue 898 checkpoint freq conf (#905)

* add new/changed parameters in default_config

* implement backward compatibility

* remove `train_log.log_interval` from default config

* use new configuration arguments in Trainer

* fix: wrong variable name

* ruffed

* Rework method structure

* fix bug

* rename `log_intevals` to `train_log_freq`

* fix integration tests

* fix forgot renaming

* fix rebasing artifact
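The backward compatibility implemented above roughly amounts to a fallback read of the legacy key (a sketch with assumed dict-style config access and an assumed default; the key names `train_log_freq` and `train_log.log_interval` are taken from the commit messages):

```python
def resolve_train_log_freq(cfg, default=20):
    # Prefer the new top-level `train_log_freq`; fall back to the
    # legacy `train_log.log_interval` so older configs keep working.
    if "train_log_freq" in cfg:
        return cfg["train_log_freq"]
    return cfg.get("train_log", {}).get("log_interval", default)
```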

* Sorcha/dev/571 (#957)

* debug for netcdf pipeline

* zarr_netcdf first draft

* fixing pipeline

* linter checks

* removing debug prints from io.py

* refactoring, found issue with forecast_ref_time

* deleting unnecessary lines

* proper docstrings

* moving filepaths

* linting

* multithread processing added

* debug info

* debugging

* refactoring

* linting

* fstep as argument

* change assert

---------

Co-authored-by: owens1 <owens1@jrlogin09.jureca>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: ilaria luise <luise.ilaria@gmail.com>

* pyproject.toml checks (#1042)

* adding checks for toml files into actions

* lint fixes

* lint checks

* lint fixes

* changes

* disabling ruff check

* change to path info instead

* adding E501 and E721 to be ignored for now

---------

Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int>

* Implement EMA of the model (#1005)

* Save current state

* Save current state

* Barebone FSDP2 prototype TODO save checkpoints

* First version of saving model

* Fix save_model

* Log everything and log to files

* Remove redundant path creation

* Allow for both slurm and torchrun + fewer log files

* Cleaning up init_ddp

* Ruff

* Attempt to avoid duplicate logging

* FSDP2 with mixed precision policy

* Ruff

* Clean up and logging

* Try to get loggers to behave as we want

* Makes ruff unhappy but works

* Fixed ruff issue

* Fixed problems with multi-node training.

* Fix for interactive/non-DDP runs

* No idea why, but this seems to work so far

Committing simply so it is saved, obviously needs cleanup

* Still works! So which is it memory or the grad scaler?

* Also still works, I now strongly suspect the amp.gradscaler

* This still works, I have no clue anymore why but whatever it works
now....

* Enable loading model from absolute paths

* Enable loading for 1 GPU only

* Fix 1 GPU train continue

* Appease ruff

* Fix saving the model more regularly and perf logging

* Fixed problem when training with 2 nodes.

* Fix data loader seed

* Appease ruff

* Shouldn't overwrite with_fsdp like this

* Potential fix for FSDP2 issue

with different ranks using different model parts

* Fix loss scaling and logging of dummy data loss

* Clean up

* Appease ruff

* Start implementing EMA, works for 1 GPU

* Make EMA model multi-gpu compatible

…
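The core EMA update implemented in this series can be sketched as follows (a plain-Python sketch over a weight dict; the actual implementation tracks model parameters and has to be FSDP2/multi-GPU aware, per the commits above):

```python
def ema_update(ema_state, model_state, decay=0.999):
    # Exponential moving average of model weights, per parameter:
    #   ema <- decay * ema + (1 - decay) * current
    for key, value in model_state.items():
        ema_state[key] = decay * ema_state[key] + (1 - decay) * value
    return ema_state

ema = {"w": 0.0}
ema_update(ema, {"w": 1.0}, decay=0.9)
```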
Successfully merging this pull request may close these issues.

NCCL timeout issue
