Fix SDRF annotation files for sdrf_pipelines validator v1.0.0 compliance #753
This PR systematically addresses validation errors across 150+ SDRF files to ensure compliance with the upcoming sdrf_pipelines validator v1.0.0 release. The fixes were based on the comprehensive error analysis from sdrf_errors_updated_parquet.csv.
Key Issues Resolved
1. Missing Required Columns
- Added technology type columns to 40+ files
- Value used: "proteomic profiling by mass spectrometry"
2. Age Format Validation
Fixed age values to match the required pattern ^(?:(?:\d+[yY])(?:\d+[mM])?(?:\d+[dD])?|...)
Before:
After:
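To illustrate the kind of check this pattern performs, here is a minimal Python sketch. It uses only the first alternative visible in the pattern quoted above. The alternatives elided by "..." are not reproduced, so this check is stricter than the real validator and may reject values the full pattern accepts. The function name and the sample values are hypothetical and are not taken from the changed files.

```python
import re

# Only the first alternative of the quoted pattern; the part elided by "..."
# in the PR description is intentionally NOT reconstructed here.
AGE_FIRST_ALTERNATIVE = re.compile(r"(?:\d+[yY])(?:\d+[mM])?(?:\d+[dD])?")


def matches_visible_age_pattern(value: str) -> bool:
    """Return True if the whole value matches the visible years/months/days form."""
    return AGE_FIRST_ALTERNATIVE.fullmatch(value) is not None


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to exercise the check.
    for candidate in ["30Y", "30Y6M", "30y6m12d", "30 weeks", "30"]:
        print(candidate, matches_visible_age_pattern(candidate))
```

Under this partial pattern, values such as "30 weeks" or a bare number do not match, which is consistent with the week-format and numeric-age issues listed for the sample files.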
3. Sex Ontology Terms
Converted abbreviated sex values to proper EFO ontology terms:
Before:
After:
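A conversion like this could be scripted roughly as follows. This is a sketch under stated assumptions: the abbreviation table ("M", "F") and the target terms ("male", "female") are assumed for illustration, since the PR description does not list the exact values that were changed, and the function name is hypothetical.

```python
# Hypothetical abbreviation table; the actual values fixed in the PR are not
# listed in the description, so these entries are assumptions.
SEX_ABBREVIATIONS = {
    "m": "male",
    "f": "female",
}


def normalize_sex(value: str) -> str:
    """Expand a known abbreviation; return any other value unchanged."""
    key = value.strip().lower()
    return SEX_ABBREVIATIONS.get(key, value)


if __name__ == "__main__":
    for raw in ["M", "F", "female", "not available"]:
        print(repr(raw), "->", repr(normalize_sex(raw)))
```

Unrecognized values are passed through untouched, so anything outside the table is left for manual curation rather than guessed.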
4. Restricted Column Values
Removed invalid "not applicable" and "not available" values from columns that don't allow them:
- source name
- characteristics[biological replicate]
- technology type
- comment[technical replicate]
- comment[fraction identifier]
- comment[label]
- comment[data file]
- comment[instrument]
5. Format and Syntax Issues
- "not aplicable" → "not applicable"
Validation Results
100% success rate on structural validation issues. All tested files now either:
Sample of Files Now Passing:
- PXD019291.sdrf.tsv (was missing technology type)
- PXD000070.sdrf.tsv (was missing technology type)
- PXD002137.sdrf.tsv (had numeric age format issues)
- PMID21183079.sdrf.tsv (had week format + typos)
- PXD009909.sdrf.tsv (had "30 weeks" format)
Files Modified
annotated-projects/sdrf-specification-examples/
Remaining Items
The only remaining validation warnings are ontology term lookups (e.g., "HeLa cells" not found in CLO/BTO, disease terms not in MONDO/EFO). These are expected and require domain expert curation rather than automated fixes.
Fixes #748.