Merging datasets is tricky for training+inline inference workflow #698

Merging datasets for training and/or inference works according to a set of rules:

- overlapping keys are [resolved](https://github.com/ai2cm/full-model/blob/822a0a56f5da346d46005a7e241da599dd14ede8/fme/core/dataset/merged.py#L223-L240) by the order in which the datasets are listed
- time indices must align, i.e., all sample start times and lengths in time [must be identical](https://github.com/ai2cm/full-model/blob/822a0a56f5da346d46005a7e241da599dd14ede8/fme/core/dataset/merged.py#L30-L42)
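To make the two rules concrete, here is a minimal sketch using plain dicts rather than the real dataset classes. `merge_by_order` and the `{"times": ..., "data": ...}` layout are hypothetical, and the sketch assumes the earliest-listed dataset wins on an overlapping key; the linked code is the authority on the actual direction.

```python
from datetime import datetime, timedelta


def merge_by_order(datasets):
    """Merge hypothetical datasets given as {"times": [...], "data": {...}} dicts."""
    # Rule 2: every dataset must share exactly the same time index.
    reference_times = datasets[0]["times"]
    for position, dataset in enumerate(datasets[1:], start=1):
        if dataset["times"] != reference_times:
            raise ValueError(
                f"dataset {position} does not share the time index of dataset 0"
            )
    # Rule 1: overlapping keys are resolved by list order.
    # ASSUMPTION: the first dataset that provides a key keeps it.
    merged = {}
    for dataset in datasets:
        for key, values in dataset["data"].items():
            merged.setdefault(key, values)
    return merged


times = [datetime(2020, 1, 1) + timedelta(hours=6 * i) for i in range(4)]
first = {"times": times, "data": {"a": [1, 1, 1, 1], "b": [2, 2, 2, 2]}}
second = {"times": times, "data": {"b": [9, 9, 9, 9], "c": [3, 3, 3, 3]}}
merged = merge_by_order([first, second])
```

Under this assumption `b` comes from `first`, and there is no way to ask for `b` from `second` without reordering the list, which would also change where `a` comes from if both datasets had it.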

Additionally, [inference datasets can't be subset in time](https://github.com/ai2cm/full-model/blob/822a0a56f5da346d46005a7e241da599dd14ede8/fme/ace/data_loading/inference.py#L126-L133); I think this is because inference needs a contiguous time coordinate.

Together, these rules block the following in the training+inline inference workflow:

- specifying keys from particular datasets in a merge, e.g., both datasets 1 and 2 have keys `a` and `b`; we want `a` from 1 and `b` from 2
- merging datasets that don't have identical time indices, e.g., we can't subset 2 datasets to a common overlapping time period because inference doesn't allow it

It'd be nice to relax these restrictions; for example, they made it harder to use the original ACE ERA5 + secondary pressure-level ERA5 datasets together in training/fine-tuning.

The first could be achieved with, e.g., an optional list of priority keys attached to a particular dataset config, which would cause those keys to be taken from that dataset regardless of order.
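A rough sketch of how priority keys might behave, again with plain dicts standing in for dataset configs. The `priority_keys` field and `merge_with_priority` are hypothetical names for illustration, not existing options, and the order-based fallback assumes the earliest-listed dataset wins.

```python
def merge_with_priority(datasets):
    """Merge {"data": {...}, "priority_keys": [...]} dicts (hypothetical layout)."""
    merged = {}
    claimed_by = {}
    # First pass: priority keys are taken from their dataset regardless of order.
    for position, dataset in enumerate(datasets):
        for key in dataset.get("priority_keys", []):
            if key not in dataset["data"]:
                raise KeyError(f"dataset {position} has no key {key!r}")
            if key in claimed_by:
                raise ValueError(
                    f"key {key!r} is prioritized by datasets "
                    f"{claimed_by[key]} and {position}"
                )
            claimed_by[key] = position
            merged[key] = dataset["data"][key]
    # Second pass: remaining keys fall back to order-based resolution.
    # ASSUMPTION: the first dataset that provides a key keeps it.
    for dataset in datasets:
        for key, values in dataset["data"].items():
            merged.setdefault(key, values)
    return merged


dataset_1 = {"data": {"a": "a from 1", "b": "b from 1"}}
dataset_2 = {"data": {"a": "a from 2", "b": "b from 2"}, "priority_keys": ["b"]}
result = merge_with_priority([dataset_1, dataset_2])
```

Here `result["a"]` comes from dataset 1 by order and `result["b"]` comes from dataset 2 because of its priority list. Raising when two datasets prioritize the same key is one possible design choice; it keeps the outcome unambiguous.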

The second could be achieved by allowing limited inference dataset subsetting, i.e., specifying start/stop (but not step) in inference datasets. We could also check that the resulting dataset is contiguous in time (and maybe even allow a dataset `concat` that meets this requirement too). This task is easier now that https://github.com/ai2cm/full-model/pull/2325 is merged and there is a better-defined base class for training+inference datasets.
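As an illustration of the start/stop idea, the sketch below subsets two time indices to their common period and checks contiguity. `subset_start_stop` and `is_contiguous` are hypothetical helpers, the inclusive `stop` is an assumption, and "contiguous" is taken here to mean evenly spaced samples.

```python
from datetime import datetime, timedelta


def is_contiguous(times):
    """True if consecutive samples are all separated by the same step."""
    steps = {later - earlier for earlier, later in zip(times, times[1:])}
    return len(steps) <= 1


def subset_start_stop(times, start=None, stop=None):
    """Select times in [start, stop] with no step argument (stop is inclusive here)."""
    selected = [
        t for t in times
        if (start is None or t >= start) and (stop is None or t <= stop)
    ]
    if not is_contiguous(selected):
        raise ValueError("subset is not contiguous in time")
    return selected


step = timedelta(hours=6)
surface_times = [datetime(2000, 1, 1) + i * step for i in range(12)]
pressure_times = [datetime(2000, 1, 2) + i * step for i in range(12)]

# Restrict both datasets to the period they have in common.
start = max(surface_times[0], pressure_times[0])
stop = min(surface_times[-1], pressure_times[-1])
common_surface = subset_start_stop(surface_times, start, stop)
common_pressure = subset_start_stop(pressure_times, start, stop)
```

After subsetting, the two time indices are identical, so the existing alignment check on merge would pass. Because there is no step argument, a subset can only drop samples from the ends, which is what keeps the time coordinate contiguous.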
