
Merging datasets is tricky for training+inline inference workflow #698

@brianhenn


Merging datasets for training and/or inference works according to a set of rules:

  • overlapping keys are resolved by the order in which the datasets are listed
  • time indices must align, i.e., all sample start times and lengths in time must be identical
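As a rough illustration of these two rules, here is a minimal sketch in plain Python. `ToyDataset` and `merge_by_order` are hypothetical stand-ins, not the real dataset classes, and the sketch assumes the earlier-listed dataset wins on overlapping keys (the direction of precedence is an assumption here).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset; not the real class."""

    name: str
    # One (start time, number of time steps) entry per sample.
    sample_index: List[Tuple[str, int]]
    # Key -> payload; a string label stands in for the actual data.
    variables: Dict[str, str]


def merge_by_order(datasets: List[ToyDataset]) -> Dict[str, str]:
    """Merge variables, requiring identical time indices across datasets."""
    if not datasets:
        raise ValueError("need at least one dataset")
    reference = datasets[0]
    for ds in datasets[1:]:
        # Rule 2: sample start times and lengths must be identical.
        if ds.sample_index != reference.sample_index:
            raise ValueError(
                f"time index of {ds.name!r} does not match {reference.name!r}"
            )
    merged: Dict[str, str] = {}
    for ds in datasets:
        for key, value in ds.variables.items():
            # Rule 1 (assumed direction): the earlier-listed dataset wins.
            merged.setdefault(key, value)
    return merged
```

With datasets `one` (keys `a`, `b`) and `two` (keys `b`, `c`) on the same time index, `b` always comes from `one`; there is no way to ask for `b` from `two` without reordering, which would then also change the source of any other shared key.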

Additionally, inference datasets can't be subset in time, I think because we need a contiguous time coordinate in inference.

These rules block the following in the training + inline inference workflow:

  • specifying keys from particular datasets in a merge, e.g., both datasets 1 and 2 have keys a and b, and we want a from 1 and b from 2
  • merging datasets that don't have identical time indices, e.g., we can't subset 2 datasets to a common overlapping time period because inference doesn't allow subsetting

It'd be nice to relax these restrictions; for example, they made it harder to use the original ACE ERA5 + secondary pressure-level ERA5 datasets together in training/fine-tuning.

The first could be achieved, e.g., with an optional list of priority keys attached to a particular dataset config, which would cause those keys to be taken from that dataset regardless of order.
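A minimal sketch of how such priority keys might behave, assuming a hypothetical `priority_keys` field on a toy config class. None of these names exist in the codebase; they are for illustration only. Keys claimed as priority are resolved first, and everything else falls back to listing order (assumed here to mean the earlier dataset wins).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDatasetConfig:
    """Hypothetical config; `priority_keys` is the proposed optional field."""

    name: str
    # Key -> payload; a string label stands in for the actual data.
    variables: Dict[str, str]
    priority_keys: List[str] = field(default_factory=list)


def merge_with_priority(datasets: List[ToyDatasetConfig]) -> Dict[str, str]:
    """Take priority keys from their dataset first, then merge by order."""
    merged: Dict[str, str] = {}
    owner: Dict[str, str] = {}
    for ds in datasets:
        for key in ds.priority_keys:
            if key not in ds.variables:
                raise KeyError(
                    f"{ds.name!r} lists priority key {key!r} but lacks it"
                )
            if key in owner:
                # Two datasets claiming the same key is ambiguous.
                raise ValueError(
                    f"{key!r} is a priority key in {owner[key]!r} and {ds.name!r}"
                )
            owner[key] = ds.name
            merged[key] = ds.variables[key]
    for ds in datasets:
        for key, value in ds.variables.items():
            # Remaining keys: earlier-listed dataset wins (assumed direction).
            merged.setdefault(key, value)
    return merged


# The example from this issue: a from dataset 1, b from dataset 2.
ds1 = ToyDatasetConfig("1", {"a": "a from 1", "b": "b from 1"})
ds2 = ToyDatasetConfig("2", {"a": "a from 2", "b": "b from 2"}, priority_keys=["b"])
result = merge_with_priority([ds1, ds2])
```

Raising when two datasets claim the same priority key is one possible design choice; falling back to listing order in that case would also work, but seems more likely to hide config mistakes.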

The second could be achieved by allowing limited subsetting of inference datasets, i.e., specifying start/stop (but not step). We could also check that the resulting dataset is contiguous in time (and maybe even allow dataset concatenation, as long as the result passes the same check). This task is easier now that https://github.com/ai2cm/full-model/pull/2325 is merged and there is a better-defined base class for training+inference datasets.
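A rough sketch of the start/stop subsetting plus contiguity check, using plain `datetime` lists in place of real datasets. The function names, the inclusive `stop`, and the fixed `step` argument are all assumptions made for illustration, not existing API.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def subset_start_stop(
    times: List[datetime],
    start: Optional[datetime] = None,
    stop: Optional[datetime] = None,
) -> List[datetime]:
    """Keep times in [start, stop] (both inclusive, by assumption); no step."""
    return [
        t
        for t in times
        if (start is None or t >= start) and (stop is None or t <= stop)
    ]


def check_contiguous(times: List[datetime], step: timedelta) -> None:
    """Raise if consecutive times are not exactly one step apart."""
    for earlier, later in zip(times, times[1:]):
        if later - earlier != step:
            raise ValueError(f"gap between {earlier} and {later}")


def common_period(
    a: List[datetime], b: List[datetime]
) -> Tuple[datetime, datetime]:
    """Overlapping (start, stop) of two non-empty, sorted time axes."""
    start, stop = max(a[0], b[0]), min(a[-1], b[-1])
    if start > stop:
        raise ValueError("time axes do not overlap")
    return start, stop


# Two 6-hourly time axes that overlap for one day.
step = timedelta(hours=6)
times_a = [datetime(2000, 1, 1) + i * step for i in range(8)]
times_b = [datetime(2000, 1, 2) + i * step for i in range(8)]

start, stop = common_period(times_a, times_b)
sub_a = subset_start_stop(times_a, start, stop)
sub_b = subset_start_stop(times_b, start, stop)
check_contiguous(sub_a, step)  # passes: the subset is still contiguous
```

The same `check_contiguous` call could guard concatenation: joining two segments is accepted only if the combined time axis has no gap, which is what inference needs.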
