Copilot AI commented Sep 6, 2025

This PR systematically addresses validation errors across 150 SDRF files to ensure compliance with the upcoming sdrf_pipelines validator v1.0.0 release. The fixes are based on the error analysis in sdrf_errors_updated_parquet.csv.

Key Issues Resolved

1. Missing Required Columns

  • Added missing technology type columns to 40+ files
  • Default value: "proteomic profiling by mass spectrometry"
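
A minimal sketch of how such a column could be added, assuming a tab-separated SDRF file read as text. The function name `add_technology_type` is hypothetical, and appending the column at the end is an assumption made for simplicity; the PR does not state where the column was placed.

```python
import csv
import io

# Default value taken from the PR description.
DEFAULT_TECHNOLOGY_TYPE = "proteomic profiling by mass spectrometry"


def add_technology_type(sdrf_text: str) -> str:
    """Append a 'technology type' column when the header lacks one.

    Hypothetical helper: the column position (last) is an assumption.
    """
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    if not rows:
        return sdrf_text
    header = [name.strip().lower() for name in rows[0]]
    if "technology type" in header:
        return sdrf_text  # column already present, leave the file untouched
    rows[0].append("technology type")
    for row in rows[1:]:
        row.append(DEFAULT_TECHNOLOGY_TYPE)
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Running the function a second time on its own output returns the text unchanged, so it is safe to apply across a whole directory.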

2. Age Format Validation

Fixed age values to match the required pattern ^(?:(?:\d+[yY])(?:\d+[mM])?(?:\d+[dD])?|...)

Before:

characteristics[age]
41
528
3w
8w-12w
30 weeks

After:

characteristics[age]
41y
528y
21d
56d-84d
210d
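
The conversions in the before/after example can be sketched as follows. This is an illustration inferred from the five examples only, not the script used in this PR: it assumes a bare number means years and that weeks are converted to days at 7 days per week. The names `normalize_age` and `_normalize_single` are hypothetical.

```python
import re

# Matches "3w", "30 weeks", "1 week" (case-insensitive).
_WEEKS = re.compile(r"^(\d+)\s*(?:w|weeks?)$", re.IGNORECASE)


def _normalize_single(value: str) -> str:
    """Normalize one age token; unknown formats are returned unchanged."""
    value = value.strip()
    if re.fullmatch(r"\d+", value):
        return f"{value}y"  # assumption: a bare number is an age in years
    match = _WEEKS.match(value)
    if match:
        return f"{int(match.group(1)) * 7}d"  # weeks -> days
    return value


def normalize_age(value: str) -> str:
    """Normalize a single age or a 'low-high' range."""
    return "-".join(_normalize_single(part) for part in value.split("-"))
```

Values that already carry a unit, such as `41y`, pass through untouched, so anything the sketch does not recognize is left for manual review.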

3. Sex Ontology Terms

Converted abbreviated or misspelled sex values to proper EFO ontology terms:

Before:

characteristics[sex]
F
M
famale

After:

characteristics[sex]
female
male
female
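
A lookup table is enough to express this mapping. The sketch below covers only the variants shown in the example; the name `normalize_sex` and the table itself are hypothetical, and any value not in the table is returned unchanged.

```python
# Variants taken from the before/after example above; extend as needed.
SEX_TERMS = {
    "f": "female",
    "m": "male",
    "famale": "female",  # misspelling seen in the example
    "female": "female",
    "male": "male",
}


def normalize_sex(value: str) -> str:
    """Map an abbreviated or misspelled sex value to its full term."""
    key = value.strip().lower()
    return SEX_TERMS.get(key, value)
```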

4. Restricted Column Values

Removed invalid "not applicable" and "not available" values from columns that don't allow them:

  • source name
  • characteristics[biological replicate]
  • technology type
  • comment[technical replicate]
  • comment[fraction identifier]
  • comment[label]
  • comment[data file]
  • comment[instrument]
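
A sketch of how the offending cells could be located, assuming tab-separated input. It only reports matches, because the PR description does not say what the removed values were replaced with. The function name `find_restricted_values` is hypothetical.

```python
import csv
import io

# Columns listed in the PR description as not allowing these values.
RESTRICTED_COLUMNS = {
    "source name",
    "characteristics[biological replicate]",
    "technology type",
    "comment[technical replicate]",
    "comment[fraction identifier]",
    "comment[label]",
    "comment[data file]",
    "comment[instrument]",
}
DISALLOWED = {"not applicable", "not available"}


def find_restricted_values(sdrf_text: str) -> list[tuple[int, str, str]]:
    """Return (line number, column name, value) for each disallowed cell."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = next(reader, [])
    hits = []
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        for name, cell in zip(header, row):
            restricted = name.strip().lower() in RESTRICTED_COLUMNS
            if restricted and cell.strip().lower() in DISALLOWED:
                hits.append((line_number, name, cell))
    return hits
```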

5. Format and Syntax Issues

  • Fixed trailing whitespace in multiple files
  • Corrected typos: "not aplicable" → "not applicable"
  • Normalized case variations of standard values
  • Fixed basic ontology term formatting
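
The whitespace, typo, and case fixes can be combined in one per-cell cleanup step. This is a rough sketch under stated assumptions: the typo table holds only the one typo named above, the set of "standard values" is limited to the two mentioned in this PR, and leading whitespace is stripped along with trailing whitespace. The name `clean_cell` is hypothetical.

```python
# Only the typo named in the PR description; extend as needed.
TYPO_FIXES = {"not aplicable": "not applicable"}
# Assumption: these are the standard values whose case is normalized.
STANDARD_VALUES = {"not applicable", "not available"}


def clean_cell(cell: str) -> str:
    """Strip surrounding whitespace, fix known typos, lowercase standard values."""
    cleaned = cell.strip()
    lowered = TYPO_FIXES.get(cleaned.lower(), cleaned.lower())
    if lowered in STANDARD_VALUES:
        return lowered
    return cleaned  # ordinary values keep their original case
```

Ordinary values such as cell line names keep their case; only the standard values are lowercased.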

Validation Results

100% success rate on structural validation issues. All tested files now either:

  • ✅ Pass validation completely
  • ⚠️ Only have ontology term lookup warnings (expected, require manual curation)

Sample of Files Now Passing:

  • PXD019291.sdrf.tsv (was missing technology type)
  • PXD000070.sdrf.tsv (was missing technology type)
  • PXD002137.sdrf.tsv (had numeric age format issues)
  • PMID21183079.sdrf.tsv (had week format + typos)
  • PXD009909.sdrf.tsv (had "30 weeks" format)

Files Modified

  • 147 files in annotated-projects/
  • 3 files in sdrf-specification-examples/
  • Total: 150 files with systematic validation fixes

Remaining Items

The only remaining validation warnings are ontology term lookups (e.g., "HeLa cells" not found in CLO/BTO, disease terms not in MONDO/EFO). These are expected and require domain expert curation rather than automated fixes.

Fixes #748.



@ypriverol
Member

@copilot can you do the PR against dev branch?

@ypriverol ypriverol closed this Sep 6, 2025
Copilot AI changed the title from "[WIP] Update specification project annotation files in the repos to comply with the upcoming update of sdrf_pipelines validator" to "Fix SDRF annotation files for sdrf_pipelines validator v1.0.0 compliance" Sep 6, 2025
Copilot AI requested a review from ypriverol September 6, 2025 13:36