Description
Problem
DataObject encompasses all kinds of records from different parts of the data model (raw data, processed data, workflow parameters). data_object_type and data_category are, respectively, the narrow and broad ways of identifying the nature/purpose of a given record.
Neither of those slots is required, which means that some records (especially from older projects) may have NA for these slots. As a result, any query that searches for a type of data (e.g. "give me all of the raw proteomics data") has the potential to miss records.
The most important potential impact is on the bulk download on the data portal. In other contexts, it just makes queries/filters more complicated than necessary (see NOM notebooks).
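To illustrate the failure mode, here is a minimal Python sketch. The record ids and the slot values (`instrument_data`, `raw_type_a`) are placeholders made up for illustration, not actual records or confirmed enum values, and `find_by_category` is a hypothetical helper rather than anything in the portal code.

```python
# Illustrative records only: ids and slot values are placeholders.
records = [
    {"id": "dobj-1", "data_category": "instrument_data", "data_object_type": "raw_type_a"},
    {"id": "dobj-2", "data_category": "processed_data", "data_object_type": "processed_type_b"},
    # An older raw-data record where neither slot was ever populated.
    {"id": "dobj-3"},
]


def find_by_category(records, category):
    """Return ids of records whose data_category equals the requested value."""
    return [r["id"] for r in records if r.get("data_category") == category]


# dobj-3 holds raw data too, but an equality filter on data_category cannot see it.
raw_ids = find_by_category(records, "instrument_data")
print(raw_ids)
```

Only `dobj-1` comes back, even though `dobj-3` is also raw data. Any filter built on these slots behaves the same way until they are populated.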
Actions
- Identify records that will need these slots backfilled
- Backfill records with a migrator or changesheets, depending on how many records are affected (?)
- Make data_object_type required in the schema
- Make data_category required in the schema
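The first two actions could start from something like the sketch below. It assumes the DataObject records have already been exported as a list of dicts, and that a missing value shows up as an absent key, `None`, an empty string, or the literal string "NA". Both are assumptions to check against the real data. The helper names are hypothetical.

```python
from collections import Counter

# The two slots this issue proposes to make required.
SLOTS_TO_REQUIRE = ("data_object_type", "data_category")


def is_missing(record, slot):
    """Treat an absent key, None, empty string, or literal 'NA' as missing (assumption)."""
    value = record.get(slot)
    return value is None or value in ("", "NA")


def find_backfill_candidates(records):
    """Map each record id to the list of slots that still need a value."""
    candidates = {}
    for record in records:
        missing = [slot for slot in SLOTS_TO_REQUIRE if is_missing(record, slot)]
        if missing:
            candidates[record["id"]] = missing
    return candidates


def count_missing_by_slot(candidates):
    """Count how many records lack each slot, to help size the backfill."""
    return Counter(slot for missing in candidates.values() for slot in missing)
```

The per-slot counts are what would inform the migrator versus changesheets decision: a handful of records may be manageable by hand, while a large count suggests a scripted migration.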
For reference
https://nmdc-group.slack.com/archives/CFVH4DYGH/p1739989643444869
data_object_type is a slot on DataObject with a range of FileTypeEnum
data_category is also on DataObject and has a range of DataCategoryEnum