
Conversation

@moritzhauschulz
Contributor

@moritzhauschulz moritzhauschulz commented Nov 30, 2025

Description

This PR enables the code to process a single sample, so that we can overfit the diffusion model to one sample.

We introduce a --repeat_data flag. When it is active and the dataset contains fewer samples than the mini-epoch size, the data is tiled to fill the mini epoch. A check ensures that in this case the number of samples evenly divides the mini-epoch size, so that the mini epoch stays balanced.
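A minimal sketch of the tiling logic described above (the helper name `tile_indices` is hypothetical; only `samples_per_mini_epoch` follows the PR's naming, and the actual implementation may order the repeats differently):

```python
def tile_indices(num_samples: int, samples_per_mini_epoch: int) -> list[int]:
    """Tile dataset indices to fill a mini epoch when the dataset is smaller.

    Requires that num_samples evenly divides samples_per_mini_epoch so the
    mini epoch stays balanced (each sample is repeated equally often).
    """
    if num_samples >= samples_per_mini_epoch:
        # Dataset is large enough; no tiling needed.
        return list(range(samples_per_mini_epoch))
    assert samples_per_mini_epoch % num_samples == 0, (
        "number of samples must evenly divide the mini-epoch size"
    )
    repeats = samples_per_mini_epoch // num_samples
    return list(range(num_samples)) * repeats
```

For example, `tile_indices(4, 8)` yields `[0, 1, 2, 3, 0, 1, 2, 3]`, i.e. the 4 samples tiled twice.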

Some additional checks are introduced to avoid misconfigurations, see code.

Some checks may need to be removed after the overfitting experiment for the diffusion model.

Not currently addressing #1370.

Issue Number

Closes #1379

Checklist before asking for review

Currently getting ModuleNotFoundError: No module named 'flash_attn' when running uv run train. It worked before merging, though...

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@moritzhauschulz moritzhauschulz marked this pull request as draft November 30, 2025 19:13
@MatKbauer MatKbauer added this to the latent diffusion model milestone Dec 2, 2025
else:
    assert samples_per_mini_epoch, "must specify samples_per_mini_epoch if repeat_data"
    self.len = samples_per_mini_epoch

Contributor

Not sure if I understand correctly, but do we not want to run an epoch with e.g. samples_per_mini_epoch: 4096 while always repeating the same e.g. 4 samples? In this case we would need to introduce another config parameter, e.g. repeat_num_samples: 4 or repeat_num_idxs: [1,2,3,4] (to define specific indices; not sure if we would need this).
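As a sketch of what such an option could look like (the parameter names `repeat_num_idxs` and `samples_per_mini_epoch` follow the reviewer's proposal; the helper itself is hypothetical, not merged code):

```python
def repeated_subset(samples_per_mini_epoch: int, repeat_idxs: list[int]) -> list[int]:
    """Fill a mini epoch by cycling over a fixed set of sample indices,
    e.g. samples_per_mini_epoch: 4096 while always repeating the same 4 samples."""
    return [repeat_idxs[i % len(repeat_idxs)] for i in range(samples_per_mini_epoch)]
```

With `repeated_subset(8, [1, 2, 3, 4])` this gives `[1, 2, 3, 4, 1, 2, 3, 4]`; the same pattern would scale to a mini epoch of 4096.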

Contributor Author

I think this is addressed in line 278, but may not cover some edge cases.

Contributor

Maybe the underlying question is whether we need to adjust start_date and end_date in the config to shorten the dataset or if we can also repeat samples randomly from the entire dataset. We can discuss in 5 minutes :)

@moritzhauschulz moritzhauschulz changed the title [DRAFT] Enable single sample processing Enable single sample processing Dec 3, 2025
@moritzhauschulz
Contributor Author

moritzhauschulz commented Dec 3, 2025

Somehow I am getting a ModuleNotFoundError that I did not get before merging. Could someone else please try this? Otherwise it is ready to review.

@moritzhauschulz moritzhauschulz marked this pull request as ready for review December 3, 2025 19:14
Contributor

@MatKbauer MatKbauer left a comment

Looks good. I have added some comments to retain (fsm + 1). Overall, I'd suggest we use this branch for our single sample experiments.

end_date_val: 202201010000
end_date: 201401011200
start_date_val: 201401010000
end_date_val: 201401011200
Contributor

Let's set the end_dates for train and val to 201401011800, such that we can keep (fsm + 1) below.

forecast_len = (self.len_hrs * (fsm + 1)) // self.step_hrs
forecast_len = (
    self.len_hrs * (fsm)
) // self.step_hrs  # TODO: check if it should be fsm + 1
Contributor

With the adjusted end_dates in the config, we can revert this to forecast_len = (self.len_hrs * (fsm + 1)) // self.step_hrs, i.e., using (fsm + 1) instead of fsm.

Contributor

Let's remove this comment

@moritzhauschulz
Contributor Author

Added some comments and removed one piece of edge-case handling that is no longer necessary due to the extended data range.

forecast_len = (self.len_hrs * (fsm + 1)) // self.step_hrs
print(f"idx range end is {idx_end}")
print(f"len hrs is {self.len_hrs}, step hrs is {self.step_hrs}, fsm is {fsm}")
forecast_len = (self.len_hrs * (fsm + 1)) // self.step_hrs # NOTE: why is it fsm +1?
Contributor

fsm := the maximum number of forecast steps to predict. The +1 accounts for the potential forecast_offset of 0 or 1. This guarantees that the targets, too, are always within the range defined by start_date and end_date.
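Under that explanation, a toy computation of `forecast_len` (the values of `len_hrs`, `step_hrs`, and `fsm` below are illustrative, not taken from the config):

```python
len_hrs, step_hrs = 6, 6   # illustrative window length and step size in hours
fsm = 2                    # maximum number of forecast steps to predict

# The +1 leaves room for a forecast_offset of 0 or 1, which keeps the
# targets inside the range defined by start_date and end_date.
forecast_len = (len_hrs * (fsm + 1)) // step_hrs
```

Here `forecast_len` comes out to 3 rather than 2, i.e. one extra step of data range is reserved beyond the fsm forecast steps.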

Contributor Author

Thanks for explaining!

@grassesi
Contributor

grassesi commented Dec 9, 2025

@moritzhauschulz @MatKbauer If this PR is ready, let's try to get it merged.
