Description
Merging datasets for training and/or inference works according to a set of rules:
- overlapping keys are resolved by order of datasets listed
- time index must align, ie all sample start times and lengths in time must be identical
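To make the two rules concrete, here is a minimal sketch of how they might behave. It is not the actual implementation: `merge_sample` and `check_time_alignment` are hypothetical names, and it assumes "resolved by order" means the first-listed dataset wins on an overlapping key.

```python
from typing import Dict, List, Sequence


def check_time_alignment(time_indices: Sequence[Sequence[str]]) -> None:
    """Raise if the datasets do not all share an identical time index."""
    first = list(time_indices[0])
    for i, times in enumerate(time_indices[1:], start=1):
        if list(times) != first:
            raise ValueError(f"dataset {i} time index does not match dataset 0")


def merge_sample(samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-dataset samples in listed order.

    Assumption: the earliest-listed dataset wins when a key overlaps.
    """
    merged: Dict[str, float] = {}
    for sample in samples:
        for key, value in sample.items():
            # setdefault keeps the value from the earlier dataset
            merged.setdefault(key, value)
    return merged
```

With this sketch, merging `{"a": 1, "b": 2}` and `{"b": 3, "c": 4}` keeps `b` from the first dataset, which is exactly the behavior that makes per-key selection impossible today.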
Additionally, inference datasets can't be subset in time, I think because we need a contiguous time coordinate in inference.
The above rules block the following in the training + inline inference workflow:
- specifying keys from particular datasets in a merge, e.g., both datasets 1 and 2 have keys `a` and `b`; we want `a` from 1 and `b` from 2
- merging datasets that don't have identical time indices, e.g., we can't subset 2 datasets to a common overlapping time period because inference doesn't allow it
It'd be nice to relax these restrictions; for example, they made it harder to use the original ACE ERA5 + secondary pressure-level ERA5 datasets together in training/fine-tuning.
The first could eg be achieved with an optional list of priority keys attached to a particular dataset config that would cause those keys to be grabbed from that dataset regardless of order.
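A rough sketch of what the priority-key idea could look like. Everything here is hypothetical: `DatasetEntry`, `priority_keys`, and `resolve_key_sources` are illustrative names, not existing config fields or functions, and the fallback again assumes the first-listed dataset wins.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    """Hypothetical stand-in for a per-dataset config in a merge."""

    name: str
    keys: List[str]
    priority_keys: List[str] = field(default_factory=list)


def resolve_key_sources(entries: List[DatasetEntry]) -> Dict[str, str]:
    """Map each key to the name of the dataset it should be read from."""
    sources: Dict[str, str] = {}
    # First pass: priority keys are claimed regardless of dataset order.
    for entry in entries:
        for key in entry.priority_keys:
            if key not in entry.keys:
                raise KeyError(f"{key!r} is not available in dataset {entry.name!r}")
            if key in sources:
                raise ValueError(f"{key!r} is a priority key in more than one dataset")
            sources[key] = entry.name
    # Second pass: remaining keys fall back to order-based resolution
    # (assumption: the first-listed dataset wins).
    for entry in entries:
        for key in entry.keys:
            sources.setdefault(key, entry.name)
    return sources
```

For the example above, listing dataset 1 first and giving dataset 2 `priority_keys=["b"]` would take `a` from 1 and `b` from 2. Raising on conflicting priority claims is one possible design choice; falling back to listed order would be another.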
The second could be achieved by allowing limited inference dataset subsetting, i.e., specifying start/stop (but not step) in inference datasets. Could also check that the resulting dataset is contiguous in time (and maybe even allow dataset concat that meets this too). This task is easier now that https://github.com/ai2cm/full-model/pull/2325 is merged and there is a better defined base class for training+inference datasets.
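As an illustration of the start/stop idea, a minimal sketch follows. `subset_contiguous` is a hypothetical helper, not part of the existing dataset classes; it assumes the time coordinate is sorted, treats `stop` as inclusive, and takes "contiguous" to mean a uniform timestep within the selected window.

```python
from datetime import datetime
from typing import Optional, Sequence


def subset_contiguous(
    times: Sequence[datetime],
    start: Optional[datetime] = None,
    stop: Optional[datetime] = None,
) -> slice:
    """Return a step-free slice covering [start, stop], checking contiguity.

    Assumes `times` is sorted in increasing order.
    """
    selected = [
        i
        for i, t in enumerate(times)
        if (start is None or t >= start) and (stop is None or t <= stop)
    ]
    if not selected:
        raise ValueError("no times fall within the requested start/stop range")
    lo, hi = selected[0], selected[-1]
    window = times[lo : hi + 1]
    if len(window) >= 2:
        step = window[1] - window[0]
        # Every consecutive pair must be separated by the same timestep.
        for earlier, later in zip(window, window[1:]):
            if later - earlier != step:
                raise ValueError(f"time gap between {earlier} and {later}")
    return slice(lo, hi + 1)
```

Applying the same `start`/`stop` to two datasets would select their common overlapping period, after which the existing identical-time-index check for merging could run on the subsets.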