Is there a single canonical validator, or multiple implementations? #18

Open
cmungall opened this issue Feb 23, 2019 · 15 comments

@cmungall

cmungall commented Feb 23, 2019

The spec points here: https://github.com/modENCODE-DCC/validator/blob/master/new_gff_validator.pl

This is 7-year-old Perl code.

The SO wiki has:
http://www.sequenceontology.org/so_wiki/index.php/GFF3_Validation_Tools

which has GFFO (not in use?), FALDO (not really a validator) and the modENCODE validator. The modENCODE validator link doesn't work. But it seems to be this code:
https://github.com/genometools/genometools
which is in C

Reciprocal ticket: genometools/genometools#910

A question here:
https://www.biostars.org/p/177319/
points to another validator, this one in Python: http://www.raetschlab.org/suppl/gff-tools

Which of these is supported? Is the behavior identical? What expectations does each have on the SO obo file?

I don't think the spec should link to specific validators. However, the spec should indicate the expected behavior of a validator. This could be modularized into different checks, and we could group checks into profiles. E.g. some validators may only validate a basic syntactic profile. Others could validate a SOFA profile, where we check that the type column maps to a SO ID.
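As a rough illustration of the "checks grouped into profiles" idea, here is a minimal Python sketch. The check functions, the `PROFILES` table and the four-term `SOFA_IDS` mapping are all hypothetical stand-ins (a real validator would load terms from the SO OBO file); nothing here is taken from an existing validator.

```python
from typing import Callable, Dict, List

# A check receives the tab-separated columns of one feature line and
# returns a list of problems (empty when the line passes).
Check = Callable[[List[str]], List[str]]

# Hypothetical stand-in for a name -> ID mapping loaded from the SO OBO file.
SOFA_IDS = {
    "gene": "SO:0000704",
    "mRNA": "SO:0000234",
    "exon": "SO:0000147",
    "CDS": "SO:0000316",
}


def check_column_count(cols: List[str]) -> List[str]:
    return [] if len(cols) == 9 else [f"expected 9 columns, found {len(cols)}"]


def check_coordinates(cols: List[str]) -> List[str]:
    if len(cols) < 5:
        return []  # already reported by check_column_count
    start, end = cols[3], cols[4]
    if not (start.isdigit() and end.isdigit()) or int(start) < 1:
        return ["start and end must be positive integers"]
    if int(start) > int(end):
        return ["start is greater than end"]
    return []


def check_type_in_sofa(cols: List[str]) -> List[str]:
    if len(cols) < 3:
        return []
    feature_type = cols[2]
    if feature_type in SOFA_IDS or feature_type in SOFA_IDS.values():
        return []
    return [f"type {feature_type!r} does not map to a SO ID"]


# Each profile is just a named list of checks; "sofa" extends "syntax".
PROFILES: Dict[str, List[Check]] = {
    "syntax": [check_column_count, check_coordinates],
    "sofa": [check_column_count, check_coordinates, check_type_in_sofa],
}


def validate_line(line: str, profile: str) -> List[str]:
    cols = line.rstrip("\n").split("\t")
    problems: List[str] = []
    for check in PROFILES[profile]:
        problems.extend(check(cols))
    return problems
```

With this layout a line whose type is not a SO term passes the `syntax` profile but fails `sofa`, which is the kind of difference a conformance statement could spell out.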

Understanding how validators use relationships is important for maintenance of SO:
The-Sequence-Ontology/SO-Ontologies#465

There could be a validator registry separate from the spec, and defined conformance tests for the validators

@cmungall
Author

@barrymoore
Contributor

@cmungall at one point the RefSeq group was using the GAL based GFF3 validator for their production GFF3 validation. I'm not sure if there are others using it or if RefSeq is still using it.

@cmungall
Author

cmungall commented Mar 22, 2019 via email

@barrymoore
Contributor

barrymoore commented Mar 22, 2019 via email

@murphyte

I haven't found any of the GFF3 validators I know of to be definitive or complete. I haven't kept track of which validators have which problems, but I've observed:

  1. missing SO terms
  2. imposing requirements that do commonly show up in code consuming GFF3, but aren't actually part of the spec. In particular, single features spanning multiple rows with the same ID, other than CDS, are allowed by the spec but some validators report them as an error.
  3. Not testing as much as we'd like. Part of this is from an overly flexible spec which makes it hard to define what is 'valid'
  4. unacceptable performance (the old modENCODE validator in particular couldn't handle anything larger than a bacterial genome's worth of annotation)

For using GFF3 for annotation submission, we primarily rely on converting it to ASN.1 as we see fit (including allowing deviations from the spec, like CDS rows with no or different IDs) and running our standard ASN.1 validation code on the result, which is much more extensive than possible with GFF3 alone (in part because it can analyze a feature vs. its sequence, which no GFF3 validator I know of can do). We do point users to a couple of validators to do a preliminary check:
https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

For GFF3 that we output, I've occasionally done some bulk analyses with different validators, but we don't routinely run any of the validators on everything that we produce.

Defining a set of tests expected for any validator beyond the most rudimentary would likely re-expose some of the issues with the spec that were never resolved (trans-splicing, anyone?). We'd need to define a better approach for those issues before we could make much progress.

@cmungall
Author

Thanks Terrence, this is really useful. Looks like some more work in this area could be beneficial, in refining the spec and creating a fully complete scalable reference validator. This could perhaps be abstracted above format specifics of GFF3 and include the NCBI ASN.1 representation as well as FALDO GFF (cc @JervenBolleman). But not sure who has resources to work on this!

In the interim it leaves SO in a bit of an odd state, hard to evolve it without knowing if changes will result in false positive or false negative changes in any given validator (some kind of containerized workflow / integration test with a large bank of sample GFF3s would be super-useful here)
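To make the integration-test idea a little more concrete, here is a minimal Python sketch of a conformance harness. Everything in it is an assumption for illustration: the two toy validators are plain functions rather than wrappers around real tools, and `SAMPLES` is a two-entry stand-in for a large bank of GFF3 files with agreed verdicts.

```python
from typing import Callable, Dict, List, Tuple

# A validator is modelled as a function: True means "accepts this GFF3 text".
Validator = Callable[[str], bool]


def _rows(text: str) -> List[str]:
    """Feature lines only: skip blank lines, directives and comments."""
    return [line for line in text.splitlines() if line and not line.startswith("#")]


def nine_columns(text: str) -> bool:
    """Toy validator: every feature line has exactly nine columns."""
    return all(len(row.split("\t")) == 9 for row in _rows(text))


def nine_columns_unique_ids(text: str) -> bool:
    """Toy validator that additionally insists on unique IDs."""
    if not nine_columns(text):
        return False
    ids = [
        pair[3:]
        for row in _rows(text)
        for pair in row.split("\t")[8].split(";")
        if pair.startswith("ID=")
    ]
    return len(ids) == len(set(ids))


# Hypothetical sample bank: (name, GFF3 text, verdict the spec implies).
SAMPLES: List[Tuple[str, str, bool]] = [
    (
        "split_cds",
        "##gff-version 3\n"
        "chr1\t.\tCDS\t1\t9\t.\t+\t0\tID=cds1\n"
        "chr1\t.\tCDS\t20\t28\t.\t+\t0\tID=cds1\n",
        True,
    ),
    ("short_row", "##gff-version 3\nchr1\t.\tgene\t1\t9\n", False),
]


def conformance_report(
    validators: Dict[str, Validator], samples: List[Tuple[str, str, bool]]
) -> Dict[str, List[str]]:
    """For each validator, list the samples where it disagrees with the expected verdict."""
    return {
        name: [sample for sample, text, expected in samples if validate(text) != expected]
        for name, validate in validators.items()
    }
```

Running `conformance_report` over both toy validators flags `split_cds` for the unique-ID one, i.e. a false positive on a multi-row CDS. Rerunning the same report after an SO change would show which validators' verdicts changed.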

@cjfields

cjfields commented Apr 15, 2019

@cmungall there was a GFF3 working group started at one point (apart from me, @barrymoore and @murphyte were also on this I believe?), but it has been several years. Maybe this or something similar is needed again?

I do think it would be very useful to have a repo with example common cases from the spec as well as more problematic 'edge' cases, then build out tests from that as Terrance mentioned. This should help point to problem areas in the current specification, maybe home in on an 'official' validation tool, and lead to improvements. Similar approaches seem to have worked with other formats with specifications, e.g. SAM/BAM, VCF, CWL, etc.

@nathandunn

@cmungall Apollo does some basic "validation" when GFF3 is uploaded, by trying to import the structure into a reasonable internal model (which is SO compliant), rather than checking adherence against the SO explicitly. I think that is what most validation (e.g., Tripal) does as well, though I'm not familiar with how Chado enforces structure, but I'm going to guess it doesn't do validation either. That being said, I like the idea, but it definitely works better for an RDF structure.

Other validator / parsers:

@nathandunn

To clarify what I was saying: currently, groups run exported GFF3 through several validation and merging scripts.

I've been asked to write validators within Apollo, but those validate against the model (which we do a lot of already), which in turn should validate the GFF3 on export (we have a test that re-imports to validate round-trip, as well).

The issue is that a lot of validation steps are very group-dependent (checking status fields, export tags, etc.), though there are some general ones we could add.

Reference to the NCBI valid GFF3 is here: GMOD/Apollo#565

@murphyte

Some of the issues that we've seen are:

  1. CDS features incompatible with exon features of the same parent mRNA. A validation check would need to allow for ribosomal slippage
  2. child features outside the range of their parent. This isn't invalid per se by the GFF3 specs, and there are definitely examples within INSDC annotations where it doesn't hold true. But I think it should hold true for certain types of features (e.g. a CDS should fall within the range of its parent mRNA, and same for mRNA within its parent gene).
  3. IDs with discontinuous features. Features spanning multiple rows with the same ID (like how multi-exon CDSes are shown in the current specs) are technically allowed for any feature type. But some code (e.g. Cufflinks) expects IDs to be unique. IIRC, one of the existing validators checks for uniqueness, and I think another one does allow the same ID on multiple rows, but I believe (a) they must have the same feature type (which I think should be part of the spec), and (b) I think it checks that they're on the same seqid. The latter also isn't formally part of the specs, although I suspect it's expected by most code. There are some old notes at:
    http://www.sequenceontology.org/so_wiki/index.php/Discontinuous_features
  4. the GFF3 spec doesn't explicitly say how to specify a NULL value. The only sensible way within the spec is using <attribute>=;. But I wouldn't be surprised if some of the existing validators object to that.
  5. files with ;;. This is more of a test for reader implementations to be sure they tolerate it, although I could see having it as a warning in a validator.
  6. validating usage of commas in attributes. We've seen cases where commas in attribute values aren't properly encoded, raising the question of whether they're delimiting multiple values or just an encoding error. A validator could report the set of attributes where commas are observed in the values as a sanity check.
  7. Dbxrefs values. These can be validated to the expected DBTAG:ID format (which allows some checking for unexpected usage of commas). I don't know if all the validators check for that.
  8. Are SO terms in column 3 case sensitive? This is another area where the GFF3 spec and SO are ambiguous.
  9. CDS IDs. As discussed elsewhere, there are two styles of CDS IDs in GFF3: a) those like in the spec, where multiple CDS rows have the same ID, and b) those where multiple CDS rows have distinct IDs (somewhat like exon). We've resorted to taking an approach where CDS rows with the same mRNA parent are all considered to be part of the same CDS in order to handle (b), which does prevent truly annotating multiple CDS features on the same mRNA, but that's incredibly rare and not compatible with lots of code so it's best to discourage anyway. This could be reported as a warning.
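Items 2 and 3 in the list above can be sketched as warning-level checks. This is a minimal Python illustration, not how any existing validator works: the function names are hypothetical, `CONTAINED_PAIRS` encodes only the two parent/child pairs suggested in item 2, and malformed rows (wrong column count, non-numeric coordinates) are assumed to be caught elsewhere.

```python
from typing import Dict, List

# Assumption: only these (child type, parent type) pairs get the range check.
CONTAINED_PAIRS = {("CDS", "mRNA"), ("mRNA", "gene")}


def parse_features(text: str) -> List[dict]:
    """Parse feature lines into dicts; rows without nine columns are skipped."""
    features = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # column-count errors are out of scope for this sketch
        attrs = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attrs[key] = value
        features.append(
            {"seqid": cols[0], "type": cols[2], "start": int(cols[3]),
             "end": int(cols[4]), "attrs": attrs}
        )
    return features


def structural_warnings(features: List[dict]) -> List[str]:
    warnings: List[str] = []
    by_id: Dict[str, List[dict]] = {}
    for feat in features:
        fid = feat["attrs"].get("ID")
        if fid:
            by_id.setdefault(fid, []).append(feat)

    # Item 3: rows sharing an ID should agree on type and seqid.
    for fid, rows in by_id.items():
        if len({row["type"] for row in rows}) > 1:
            warnings.append(f"{fid}: rows sharing this ID differ in type")
        if len({row["seqid"] for row in rows}) > 1:
            warnings.append(f"{fid}: rows sharing this ID differ in seqid")

    # Item 2: selected child types should fall within their parent's span.
    for feat in features:
        label = feat["attrs"].get("ID", feat["type"])
        for pid in feat["attrs"].get("Parent", "").split(","):
            parents = by_id.get(pid, [])
            if not parents or (feat["type"], parents[0]["type"]) not in CONTAINED_PAIRS:
                continue
            lo = min(p["start"] for p in parents)
            hi = max(p["end"] for p in parents)
            if feat["start"] < lo or feat["end"] > hi:
                warnings.append(f"{label}: extends outside parent {pid}")
    return warnings
```

Both checks return warnings rather than errors, since neither rule is formally part of the spec.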

Bottom line: the flexibility of the GFF3 format means there aren't many absolute validation rules. But there are a set of best practices and other issues that can be reported as warnings.
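The attribute-level points above (empty values, `;;`, stray commas, Dbxref format) lend themselves to the same warning-only treatment. The following Python sketch is illustrative only: `attribute_warnings` is a hypothetical helper, the `DBXREF` pattern is a deliberately loose guess at DBTAG:ID, and `MULTI_VALUE_KEYS` lists the attributes the GFF3 spec names as allowing multiple comma-separated values.

```python
import re
from typing import List

# Assumption: a loose DBTAG:ID pattern; real database tags may need stricter rules.
DBXREF = re.compile(r"^[^:,\s]+:[^,\s]+$")

# Attributes that the GFF3 spec allows to carry multiple comma-separated values.
MULTI_VALUE_KEYS = {"Parent", "Alias", "Note", "Dbxref", "Ontology_term"}


def attribute_warnings(column9: str) -> List[str]:
    """Return best-practice warnings for one attributes column (column 9)."""
    warnings: List[str] = []
    if ";;" in column9:
        warnings.append("empty attribute between consecutive semicolons")
    for pair in column9.split(";"):
        if not pair:
            continue  # trailing ';' or the ';;' case reported above
        key, sep, value = pair.partition("=")
        if not sep:
            warnings.append(f"{pair}: missing '='")
            continue
        if value == "":
            warnings.append(f"{key}: empty value")
            continue
        if "," in value and key not in MULTI_VALUE_KEYS:
            warnings.append(f"{key}: comma in value (multiple values or unencoded comma?)")
        if key == "Dbxref":
            for ref in value.split(","):
                if not DBXREF.match(ref):
                    warnings.append(f"Dbxref: {ref!r} is not in DBTAG:ID form")
    return warnings
```

For example, `attribute_warnings("ID=g1;product=")` reports the empty value, which a group could then decide to treat as a legitimate NULL or not.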

@sierra-moxon

sierra-moxon commented Apr 1, 2021

As part of the AgBioData consortium, we did a bit of a survey of GFF parsing and validation available. Taking comments on this ticket into account, this tool https://github.com/NAL-i5K/gff3toolkit does create warnings, etc on this flexible format's best practices/specification.

https://gff3-py.readthedocs.io/en/latest/readme.html#features

  • Language: Python, on pypi
  • Supported: Works with Python 2.7 (with that version of python, it was pretty easy to use, might need a bit of development to get it up to Python 3)

http://genometools.org/cgi-bin/gff3validator.cgi

  • Language: C
  • Supported: yes, package managed through an FTP site, with github repo behind it.

https://github.com/NAL-i5K/gff3toolkit - AgBioData parser/validator.

  • Language: Python, on pypi
  • Supported: yes, easy to use and install, python3
  • Best overall

modENCODE validator

  • Language: Perl
  • Supported: yes? Sort of? Distributed with the GFF3 specification

https://sourceforge.net/p/gmod/svn/HEAD/tree/gff-validator/trunk/validate_gff3.pl

  • Language: Perl
  • Supported? No? Is suggested by GMOD, WB, SO but the website doesn’t allow download. Code is in sourceforge

http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread

  • Language: C++
  • Supported: yes. Primary focus seems to be on converting between file types, doesn’t spit out errors in gff

BioFSharp

  • Language: F#
  • Supported: long documentation

https://github.com/daler/gffutils

  • Language: Python
  • Supported: currently suggested by BioPython, last updated Dec 2019. Mostly for parsing? Would have to write the validation on top.

https://pypi.org/project/bcbio-gff/

  • Language: Python
  • Supported: most of the code has been folded into gffutils according to the doc. Mostly used for parsing GFF3, not validating.

@adf-ncgr

One more for possible consideration is this:
https://github.com/genometools/genometools/wiki/speck-User-manual

it's different than the genometools gff3-validator @sierra-moxon has in the list above, and kind of interesting in that it allows extensibility via a DSL. I've only explored it lightly, but their examples seem to work as advertised. Might be a nice approach if different "dialects" need to be supported.

@dtdoering

I know this thread has trailed off, but I wanted to add to the discussion because I think there's still a need for some sort of "official" GFF validator, test suite, etc. -- at least something where if I'm developing a new bioinformatic tool that can output as GFF, I can have something to check my tool's outputs against, especially if they are structural annotations/gene models as opposed to some arbitrary feature/functional annotation. Hopefully some of these links help push the discussion forward!

Some collections of "real-world" cases, useful for building a testing suite

https://github.com/BioJulia/BioFmtSpecimens - Collection of real-world bioinformatics file format specimens to test against

  • includes GFF3

https://github.com/cmdcolin/oddgenes - Collection of "odd genes" and edge cases

  • maintained by one of the main developers of JBrowse2

Some additional GFF parsers / validators that haven't been mentioned yet

https://gfacs.readthedocs.io/en/latest/index.html - gFACs: Gene Filtering, Analysis, and Conversion

  • Language: Perl
  • Latest release: 2020-07-17 (version 1.1.2)
  • Aims to unify annotations across a number of annotation tools, with a "format script" for each one (e.g. Braker, Maker, Prokka, Gmap, GenomeThreader, Stringtie, Gffread, Exonerate, Evidence modeler, CoGe, and NCBI)

https://agat.readthedocs.io/en/latest/agat_how_does_it_work.html - AGAT: Another GTF/GFF Analysis Toolkit

  • Language: Perl
  • Latest release: 2024-04-05 (version 1.4.0)
  • An extensive suite of tools for parsing, validating, and fixing GTF/GFF data

https://easy-import.readme.io/docs/repairing-gff - easy-import

  • Repo at https://github.com/genomehubs/easy-import
  • Language: Perl
  • Latest commit: 2019-10-16 (easy-import repo), 2024-03-14 (genomehubs repo, which integrates easy-import)
  • Tools to fix many common problems with GFFs, used as a submodule in Genomehubs

https://github.com/TAMU-CPT/CPT_GffParser - a BioPython-compatible library for parsing and fixing GFF data

  • Language: Python (>= 3.6)
  • Latest release: 2022-03-08 (version 1.2)

Some format conversion-specific tools

https://bioconvert.readthedocs.io/en/main/# - 'BioConvert': a collaborative project to facilitate the interconversion of life science data formats

  • Repo at https://github.com/bioconvert/bioconvert
  • Language: Python (3.7)
  • Latest release: 2023-07-18 (version 1.1.1)
  • Doesn't seem to do very extensive parsing, rather focuses on collecting methods to interconvert formats

https://github.com/jorvis/biocode - 'biocode': a collection of bioinformatics code libraries and scripts (see gff subdirectory)

  • Language: Python (3), Perl
  • Latest commit: 2024-01-03
  • Latest release: 2023-11-16 (v0.11.0)
  • Similar in scope/objective to BioConvert

@cmungall
Author

cmungall commented May 9, 2024

Great to see there is still a lot of interest in this

I am planning to create a LinkML schema for GFF3. This would have a lot of advantages:

  • clean specification of the core columns plus col9 fields
  • declarative specification of ontology enumerations
  • ability to weave in logic and ontology (e.g. codons must be of length 3)
  • ability to validate at scale using duckdb (coming soon)
  • ability to specify profiles, e.g. euk vs prok

This could serve as a reference against which different validators could indicate conformance, and also directly as a validator
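To give a feel for what a declarative specification buys, here is a plain-Python sketch. It is not LinkML and does not use any LinkML API. The `Slot` and `Rule` classes are hypothetical, and the single rule is a simplified reading of "codons must be of length 3" that ignores codons split across exons.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Slot:
    """One of the nine GFF3 columns, with optional constraints."""
    name: str
    integer: bool = False
    permitted: Optional[Set[str]] = None  # enumeration of allowed values, if any


@dataclass
class Rule:
    """A constraint that applies to rows of one feature type."""
    description: str
    applies_to: str
    holds: Callable[[Dict[str, str]], bool]


SLOTS = [
    Slot("seqid"), Slot("source"), Slot("type"),
    Slot("start", integer=True), Slot("end", integer=True),
    Slot("score"),
    Slot("strand", permitted={"+", "-", ".", "?"}),
    Slot("phase", permitted={"0", "1", "2", "."}),
    Slot("attributes"),
]

# Illustrative rule only; a real schema would also handle split codons.
RULES = [
    Rule(
        "start_codon features span exactly 3 bases",
        "start_codon",
        lambda row: int(row["end"]) - int(row["start"]) + 1 == 3,
    ),
]


def validate_row(line: str) -> List[str]:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(SLOTS):
        return [f"expected {len(SLOTS)} columns, found {len(cols)}"]
    row = {slot.name: value for slot, value in zip(SLOTS, cols)}
    problems = []
    for slot in SLOTS:
        value = row[slot.name]
        if slot.integer and not value.isdigit():
            problems.append(f"{slot.name}: {value!r} is not an integer")
        if slot.permitted is not None and value not in slot.permitted:
            problems.append(f"{slot.name}: {value!r} is not a permitted value")
    if problems:
        return problems  # rules assume well-typed columns
    for rule in RULES:
        if row["type"] == rule.applies_to and not rule.holds(row):
            problems.append(rule.description)
    return problems
```

The validation loop never changes. Only the `SLOTS` and `RULES` tables do, which is what would let different validators state conformance against one shared declaration.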

@dtdoering

dtdoering commented Jun 10, 2024

That sounds nice! Glad to see that it is YAML-based and not something complex or particularly domain-specific.

However, can you clarify the concept of "profiles"? To my ear, it sounds like it would involve different rule-sets for GFFs from different organisms. If that's the case, (IMO) that seems like something that would be a minor step in the wrong direction (though that discussion may be for another thread).


8 participants