
Conversation

@moritzhauschulz
Contributor

@moritzhauschulz moritzhauschulz commented Nov 30, 2025

Description

This PR enables the code to process a single sample, so that we can overfit the diffusion model to one sample.

We introduce a --repeat_data flag. When it is active and the dataset contains fewer samples than the mini-epoch size, the data is tiled to fill the mini epoch. A check ensures that in this case the number of samples evenly divides the mini-epoch size, so that the mini epoch stays balanced.
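A minimal sketch of the tiling logic described above (the helper name `tile_indices` is hypothetical; only `samples_per_mini_epoch` follows the PR's naming, and the actual implementation may order the repeats differently):

```python
def tile_indices(num_samples: int, samples_per_mini_epoch: int) -> list[int]:
    """Tile dataset indices to fill a mini epoch when the dataset is smaller.

    Requires that num_samples evenly divides samples_per_mini_epoch so the
    mini epoch stays balanced (each sample is repeated equally often).
    """
    if num_samples >= samples_per_mini_epoch:
        # Dataset is large enough; no tiling needed.
        return list(range(samples_per_mini_epoch))
    assert samples_per_mini_epoch % num_samples == 0, (
        "number of samples must evenly divide the mini-epoch size"
    )
    repeats = samples_per_mini_epoch // num_samples
    return list(range(num_samples)) * repeats
```

For example, `tile_indices(4, 8)` yields `[0, 1, 2, 3, 0, 1, 2, 3]`, i.e. the 4 samples tiled twice.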

Some additional checks are introduced to avoid misconfigurations, see code.

Some checks may need to be removed after the overfitting experiment for the diffusion model.

Not currently addressing #1370.

Issue Number

Closes #1379

Checklist before asking for review

Currently getting ModuleNotFoundError: No module named 'flash_attn' when running uv run train. It worked before merging, though...

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@moritzhauschulz moritzhauschulz marked this pull request as draft November 30, 2025 19:13
@MatKbauer MatKbauer added this to the latent diffusion model milestone Dec 2, 2025
else:
    assert samples_per_mini_epoch, "must specify samples_per_mini_epoch if repeat_data"
    self.len = samples_per_mini_epoch

Contributor

Not sure if I understand correctly, but do we not want to run an epoch with e.g. samples_per_mini_epoch: 4096 while always repeating the same e.g. 4 samples? In this case we would need to introduce another config parameter, e.g. repeat_num_samples: 4 or repeat_num_idxs: [1,2,3,4] (to define specific indices; not sure if we would need this).
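As a sketch of what such an option could look like (the parameter names `repeat_num_idxs` and `samples_per_mini_epoch` follow the reviewer's proposal; the helper itself is hypothetical, not merged code):

```python
def repeated_subset(samples_per_mini_epoch: int, repeat_idxs: list[int]) -> list[int]:
    """Fill a mini epoch by cycling over a fixed set of sample indices,
    e.g. samples_per_mini_epoch: 4096 while always repeating the same 4 samples."""
    return [repeat_idxs[i % len(repeat_idxs)] for i in range(samples_per_mini_epoch)]
```

With `repeated_subset(8, [1, 2, 3, 4])` this gives `[1, 2, 3, 4, 1, 2, 3, 4]`; the same pattern would scale to a mini epoch of 4096.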

Contributor Author

I think this is addressed in line 278, but may not cover some edge cases.

Contributor

Maybe the underlying question is whether we need to adjust start_date and end_date in the config to shorten the dataset or if we can also repeat samples randomly from the entire dataset. We can discuss in 5 minutes :)

@moritzhauschulz moritzhauschulz changed the title [DRAFT] Enable single sample processing Enable single sample processing Dec 3, 2025
@moritzhauschulz
Contributor Author

moritzhauschulz commented Dec 3, 2025

Somehow I am getting a ModuleNotFoundError that I did not get before merging. Could someone else please try this? Otherwise it is ready to review.

@moritzhauschulz moritzhauschulz marked this pull request as ready for review December 3, 2025 19:14
Contributor

@MatKbauer MatKbauer left a comment

Looks good. I have added some comments to retain (fsm + 1). Overall, I'd suggest we use this branch for our single sample experiments.

end_date_val: 202201010000
end_date: 201401011200
start_date_val: 201401010000
end_date_val: 201401011200
Contributor

Let's set the end_dates for train and val to 201401011800, such that we can keep (fsm + 1) below.

forecast_len = (self.len_hrs * (fsm + 1)) // self.step_hrs
forecast_len = (
    self.len_hrs * (fsm)
) // self.step_hrs  # TODO: check if it should be fsm + 1
Contributor

With the adjusted end_dates in the config, we can revert this to forecast_len = (self.len_hrs * (fsm + 1)) // self.step_hrs, i.e., using (fsm + 1) instead of fsm.

Contributor

Let's remove this comment

@moritzhauschulz
Contributor Author

Added some comments and removed one piece of edge-case handling that is no longer necessary due to the extended data range.

forecast_len = (self.len_hrs * (fsm + 1)) // self.step_hrs
print(f"idx range end is {idx_end}")
print(f"len hrs is {self.len_hrs}, step hrs is {self.step_hrs}, fsm is {fsm}")
forecast_len = (self.len_hrs * (fsm + 1)) // self.step_hrs # NOTE: why is it fsm +1?
Contributor

fsm := the maximum number of forecast steps to predict. The +1 accounts for the potential forecast_offset of 0 or 1. This guarantees that the targets, too, are always within the range defined by start_date and end_date.
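Under that explanation, a toy computation of `forecast_len` (the values of `len_hrs`, `step_hrs`, and `fsm` below are illustrative, not taken from the config):

```python
len_hrs, step_hrs = 6, 6   # illustrative window length and step size in hours
fsm = 2                    # maximum number of forecast steps to predict

# The +1 leaves room for a forecast_offset of 0 or 1, which keeps the
# targets inside the range defined by start_date and end_date.
forecast_len = (len_hrs * (fsm + 1)) // step_hrs
```

Here `forecast_len` comes out to 3 rather than 2, i.e. one extra step of data range is reserved beyond the fsm forecast steps.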

Contributor Author

Thanks for explaining!

@grassesi
Contributor

grassesi commented Dec 9, 2025

@moritzhauschulz @MatKbauer If this PR is ready, let's try to get it merged.
