Abstract loss calculator #1210
Conversation
config/default_config.yml
Outdated
```yaml
loss_fcts:
-
-
```
Let's have a discussion about how the config should be structured for this.
config/default_config.yml
Outdated
```yaml
- "mse"
- 1.0
# -
# - "latent:mse"
```
@sophie-xhonneux: the latent loss functions are largely determined by the SSL strategies (with some flexibility, e.g. whether MAE or MSE for JEPA) and they are also . The latents returned by the Teacher are a dict with entries like 'DINO': torch.Tensor and 'iBOT': torch.Tensor. The loss functions should somehow come from the SSLTargetProcessors, no?
That would be an option, definitely. My plan was simply for the loss calculator to have an SSLLossCalculator that has a loss function for each of DINO, iBOT, JEPA-L1, and JEPA-L2. Because at the end of the day we have to specify it somewhere, and there is some tensor reshaping and related plumbing to do.
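For illustration only, a minimal sketch of how such an SSLLossCalculator could look; the class name comes from the comment above, while the dispatch over the teacher's latent dict and the placeholder per-objective loss choices are assumptions, not the PR's implementation.

```python
import torch
import torch.nn.functional as F


class SSLLossCalculator:
    """Hypothetical sketch: one loss function per SSL objective."""

    def __init__(self):
        # JEPA-L1/L2 map naturally to L1/MSE; the soft cross-entropy for DINO/iBOT
        # is only a stand-in for the real objectives.
        self.loss_fcts = {
            "DINO": lambda student, teacher: F.cross_entropy(student, teacher.softmax(dim=-1)),
            "iBOT": lambda student, teacher: F.cross_entropy(student, teacher.softmax(dim=-1)),
            "JEPA-L1": F.l1_loss,
            "JEPA-L2": F.mse_loss,
        }

    def compute_loss(self, student_latents: dict, teacher_latents: dict) -> torch.Tensor:
        # The teacher returns a dict such as {"DINO": tensor, "iBOT": tensor};
        # accumulate the loss of every objective present in it.
        losses = [
            fct(student_latents[name], teacher_latents[name])
            for name, fct in self.loss_fcts.items()
            if name in teacher_latents
        ]
        return torch.stack(losses).sum()
```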
```python
return loss, loss_chs


def mae(
```
This is a toy function which should be removed.
Let's preserve this nice geometric sketch though =)
The sketch is from mse. We could potentially generalize mse to implement any L_p norm; then we could avoid the code duplication.
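As a sketch of that idea, a generic L_p loss could look roughly like this (the name lp_loss and its signature are illustrative, not the repo's API):

```python
import torch


def lp_loss(pred: torch.Tensor, target: torch.Tensor, p: float = 2.0) -> torch.Tensor:
    """Mean L_p error over all elements; p=2 recovers MSE, p=1 recovers MAE."""
    return (pred - target).abs().pow(p).mean()
```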
I will entirely remove the MAE function now as I only used it for testing.
There is some work to be done on the logging and how to carry all the terms for logging (maybe in a separate PR). Example log file currently:
MatKbauer left a comment:
Great progress, I have added a couple of comments. We can still decide whether we want to modularize some repetitive code into functions and implement a latent KL loss already now, or postpone that to later.
config/default_config.yml
Outdated
```yaml
samples_per_mini_epoch: 4096
samples_per_validation: 512
samples_per_mini_epoch: 32
samples_per_validation: 8
```
Let's revert those back to the original settings
```python
latents = {}
latents["posteriors"] = posteriors


return ModelOutput(physical=preds_all, latent=latents)
```
To satisfy the dict definition of physical in the ModelOutput dataclass, we can do something like physical = {"predictions": preds_all} and pass this dict to the ModelOutput class, i.e., ModelOutput(physical=physical, latent=latents).
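For concreteness, a self-contained sketch of the suggestion; the ModelOutput field types and the dummy tensors are assumptions made for illustration only:

```python
import dataclasses

import torch


@dataclasses.dataclass
class ModelOutput:
    physical: dict   # dict of physical-space predictions
    latent: dict     # dict of latent-space tensors / posteriors


# Dummy tensors standing in for the model's actual outputs.
preds_all = torch.zeros(2, 3)
posteriors = [torch.zeros(2, 3)]

latents = {"posteriors": posteriors}
physical = {"predictions": preds_all}   # wrap the predictions to satisfy the dict-typed field
output = ModelOutput(physical=physical, latent=latents)
```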
Isn't the model output always predictions? Also, latents["posteriors"] sounds duplicative, although I can see that we can have latents at different stages in the model and want to compute the loss over these; most of them will just be posteriors in some sense. But no big thing, we can adjust this later.
Agree, I think we can fix it in the diffusion PR.
```python
loss_val = loss_fct(target=target, ens=None, mu=pred)


return loss_val
```
I stumbled here; can we rename loss_val to just loss or loss_value to prevent confusion with validation?
```python
Computes loss given predictions and targets and returns values of LossValues dataclass.
"""


raise NotImplementedError()
```
With the super().compute_loss() call in the LossLatent.compute_loss() method below (here), should we implement this function here in the base class?
Each loss module needs to implement its own compute_loss function, which overrides the one of the base class. If it doesn't, the base-class function will raise this NotImplementedError(). So this is on purpose.
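A minimal illustration of that override pattern (class names follow the discussion; bodies and signatures are placeholders, not the PR's code):

```python
class LossModuleBase:
    def compute_loss(self, preds, targets):
        # The base class deliberately raises: every concrete loss module must override this.
        raise NotImplementedError()


class LossLatent(LossModuleBase):
    def compute_loss(self, preds, targets):
        # A concrete module overrides the base-class method with its own implementation.
        return ((preds - targets) ** 2).mean()
```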
```python
self.loss_fcts = [
    [getattr(losses, name if name != "mse" else "mse_channel_location_weighted"), w, name]
    for name, w in loss_fcts
]
```
Is this always the same and can we move it to the base class?
The exception with "mse_channel_location_weighted" is specific to the physical loss, so I would keep it here for the moment.
```python
self.loss_unweighted_hist[loss_name].append(losses_all)
for loss_name, stddev_all in loss_terms.stddev_all.items():
    self.stdev_unweighted_hist[loss_name].append(stddev_all)
self.loss_model_hist += [loss.item()]
```
This is the same as ~100 lines above for training, isn't it? If so, let's move it into a function.
Yes. Will work on this in a separate logging PR.
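One possible shape for such a shared helper, sketched with the hypothetical name _append_loss_history; the attribute names follow the snippet above, and the losses_all loop is an assumption:

```python
def _append_loss_history(self, loss, loss_terms):
    # Identical bookkeeping for the training and validation paths.
    for loss_name, losses_all in loss_terms.losses_all.items():
        self.loss_unweighted_hist[loss_name].append(losses_all)
    for loss_name, stddev_all in loss_terms.stddev_all.items():
        self.stdev_unweighted_hist[loss_name].append(stddev_all)
    self.loss_model_hist += [loss.item()]
```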
```python
"{}".format(st["name"])
+ f" : {losses_all[st['name']].nanmean():0.4E} \t",
f"{loss_name}" + f" : {loss_values.nanmean():0.4E} \t",
)
```
Nice, this is much clearer now!
```python
log_vals += [loss_values[:, :].nanmean().item()]
for loss_name, stddev_values in stddev_all.items():
    metrics[f"loss.{loss_name}.stddev_avg"] = stddev_values.nanmean().item()
    log_vals += [stddev_values.nanmean().item()]
```
Can we put this into a function too? Looks like the same is done for training ~50 lines above.
Yes, also to be done in the logging PR.
clessig left a comment:
Revert default config back
```python
"""
A dataclass to encapsulate the loss components returned by each loss module.
This provides a structured way to return the primary loss used for optimization,
```
I think the documentation is outdated. We do not return the opt-loss any longer.
In the LossValues dataclass we do return the opt loss. Here I think the dataclass makes sense because the base class predefines that any loss module implemented in the future has to return a LossValues object, which returns the opt loss as well as losses_all and stddev_all.
These are collected by the loss calculator, which then returns the derived opt loss to the trainer separately (i.e. not within a dataclass).
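To make that contract concrete, a rough sketch; the field types and the aggregation helper are assumptions, only the field names come from the diff:

```python
import dataclasses

import torch


@dataclasses.dataclass
class LossValues:
    loss: torch.Tensor                    # opt loss contributed by this module
    losses_all: dict[str, torch.Tensor]   # unweighted per-term losses, for logging
    stddev_all: dict[str, torch.Tensor]   # per-term standard deviations, for logging


def aggregate_opt_loss(values: list[LossValues]) -> torch.Tensor:
    # The loss calculator collects the per-module results and hands the summed
    # opt loss to the trainer separately, i.e. not wrapped in a dataclass.
    return torch.stack([v.loss for v in values]).sum()
```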
```python
class LossModuleBase:
    def __init__(self):
        """
        Base class for loss calculators.
```
This is the base class for LossModules, which correspond to loss terms? The loss calculator is something else.
```python
# Dynamically load loss functions based on configuration and stage
self.loss_fcts = [
    [getattr(losses, name if name != "mse" else "mse_channel_location_weighted"), w]
```
mse_channel_location_weighted doesn't make sense for latent loss
```python
return LossValues(loss=loss, losses_all=losses_all, stddev_all=stddev_all)


class LossPhysicalTwo(LossModuleBase):
```
Remove before merging.
Also, can't we just specify LossPhysical twice in the config?
```
@@ -0,0 +1,38 @@
# ruff: noqa: T201
```
I don't think we should merge this with the PR.
Removing this placeholder file.
```python
@dataclasses.dataclass
class LossValues:
class LossTerms:
```
LossTerms is ambiguous. For me this is what the LossModules are.
I am still wondering if it should be a dataclass at all, as we return the loss separately anyway, i.e. return loss, LossTerms(loss_terms=loss_terms).
In the end, LossTerms is only needed for logging. I would keep it this way for now and update it in the PR which resolves the train logging.
Ok, but let's put it on the todo list for this PR please so that we don't forget.
```python
kl = torch.cat([posterior.kl() for posterior in posteriors])
loss_values.loss += cf.latent_noise_kl_weight * kl.mean()
kl = torch.cat([posterior.kl() for posterior in output.latent])
loss += cf.latent_noise_kl_weight * kl.mean()
```
Either we write a LossModule for this, or we leave it as is and push one soon after.
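If the LossModule route is taken, a hedged sketch of what it could look like; the class name LossLatentKL is hypothetical, while the posterior interface and the weight follow the snippet above:

```python
import torch


class LossLatentKL(LossModuleBase):  # LossModuleBase as in the loss modules above
    def __init__(self, weight: float):
        super().__init__()
        self.weight = weight  # e.g. cf.latent_noise_kl_weight

    def compute_loss(self, output, targets=None):
        # output.latent is assumed to be an iterable of posteriors exposing .kl(),
        # as in the diff above; returning just the weighted KL term for brevity
        # (in the PR's structure this would be wrapped in the logging dataclass).
        kl = torch.cat([posterior.kl() for posterior in output.latent])
        return self.weight * kl.mean()
```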
…evelop/loss_calc_base
Description
Enables generic loss calculation for a given set of prediction-target pairs, which can be in latent and/or physical space and part of student-teacher training or diffusion models.
Proposed structure:
Issue Number
Closes #1178
Is this PR a draft? Mark it as draft.
Checklist before asking for review
- ./scripts/actions.sh lint
- ./scripts/actions.sh unit-test
- ./scripts/actions.sh integration-test
- launch-slurm.py --time 60