Is there a single canonical validator, or multiple implementations? #18
Comments
@barrymoore has a validator in https://github.com/The-Sequence-Ontology/GAL/ - is this used in production? Code for traversing SO graph: |
@cmungall at one point the RefSeq group was using the GAL based GFF3 validator for their production GFF3 validation. I'm not sure if there are others using it or if RefSeq is still using it. |
Thanks! Do you have a contact?
|
Hi Chris,
Terrence Murphy (murphyte@ncbi.nlm.nih.gov) was the person at NCBI that I interacted with, but that was probably 5+ years ago. He provided several suggestions for updates that were incorporated into the validator. He’s still at NCBI as far as I know, but I’m not sure if he’s still involved with RefSeq GFF3 validation.
Barry
|
I haven't found any of the GFF3 validators I know of to be definitive or complete. I haven't kept track of which validators have which problems, but I've observed:
For using GFF3 for annotation submission, we primarily rely on converting it to ASN.1 as we see fit (including allowing deviations from the spec, like CDS rows with no or different IDs) and running our standard ASN.1 validation code on the result, which is much more extensive than possible with GFF3 alone (in part because it can analyze a feature vs. its sequence, which no GFF3 validator I know of can do). We do point users to a couple of validators to do a preliminary check:
For GFF3 that we output, I've occasionally done some bulk analyses with different validators, but we don't routinely run any of the validators on everything that we produce.
Defining a set of tests expected for any validator beyond the most rudimentary would likely re-expose some of the issues with the spec that were never resolved (trans-splicing, anyone?). We'd need to define a better approach for those issues before we could make much progress. |
Thanks Terrence, this is really useful. Looks like some more work in this area could be beneficial, in refining the spec and creating a fully complete scalable reference validator. This could perhaps be abstracted above format specifics of GFF3 and include the NCBI ASN.1 representation as well as FALDO GFF (cc @JervenBolleman). But not sure who has resources to work on this! In the interim it leaves SO in a bit of an odd state, hard to evolve it without knowing if changes will result in false positive or false negative changes in any given validator (some kind of containerized workflow / integration test with a large bank of sample GFF3s would be super-useful here) |
@cmungall there was a GFF3 working group started at one point (apart from me, @barrymoore and @murphyte were also on this I believe?), but it has been several years. Maybe this or something similar is needed again? I do think it would be very useful to have a repo with example common cases from the spec as well as more problematic 'edge' cases, then build out tests from that as Terrence mentioned. This should help point to problem areas in the current specification, maybe home in on an 'official' validation tool, and lead to improvements. Similar approaches seem to have worked with other formats with specifications, e.g. SAM/BAM, VCF, CWL, etc. |
@cmungall Apollo does some basic "validation" when it tries to upload GFF3 by trying to import the structure into a reasonable internal model (which is SO compliant) versus one that is using the SO
Other validators / parsers:
|
To clarify what I was saying: currently, groups run exported GFF3 through several validation and merging scripts. I've been asked to write validators within Apollo, but those validate against the model (which we do a lot of already), which in turn should validate the GFF3 on export (we have a test that re-imports to validate round-trip, as well). The issue is that a lot of validation steps are very group-dependent (checking status fields, export tags, etc.), though there are some general ones we could add. Reference to the NCBI valid GFF3 is here: GMOD/Apollo#565 |
Some of the issues that we've seen are:
Bottom line: the flexibility of the GFF3 format means there aren't many absolute validation rules. But there is a set of best practices and other issues that can be reported as warnings. |
As part of the AgBioData consortium, we did a bit of a survey of the GFF parsing and validation tools available. Taking comments on this ticket into account, this tool https://github.com/NAL-i5K/gff3toolkit does create warnings, etc. on this flexible format's best practices/specification.
https://gff3-py.readthedocs.io/en/latest/readme.html#features
http://genometools.org/cgi-bin/gff3validator.cgi
https://github.com/NAL-i5K/gff3toolkit - AgBioData parser/validator.
modENCODE validator
https://sourceforge.net/p/gmod/svn/HEAD/tree/gff-validator/trunk/validate_gff3.pl
http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread
BioFSharp
https://github.com/daler/gffutils
https://pypi.org/project/bcbio-gff/
|
One more for possible consideration is this: it's different than the genometools gff3-validator @sierra-moxon has in the list above, and kind of interesting in that it allows extensibility via a DSL. I've only explored it lightly, but their examples seem to work as advertised. Might be a nice approach if different "dialects" need to be supported. |
I know this thread has trailed off, but I wanted to add to the discussion because I think there's still a need for some sort of "official" GFF validator, test suite, etc. -- at least something where if I'm developing a new bioinformatic tool that can output as GFF, I can have something to check my tool's outputs against, especially if they are structural annotations/gene models as opposed to some arbitrary feature/functional annotation. Hopefully some of these links help push the discussion forward!
Some collections of "real-world" cases, useful for building a testing suite:
https://github.com/BioJulia/BioFmtSpecimens - Collection of real-world bioinformatics file format specimens to test against
https://github.com/cmdcolin/oddgenes - Collection of "odd genes" and edge cases
Some additional GFF parsers / validators that haven't been mentioned yet:
https://gfacs.readthedocs.io/en/latest/index.html - gFACs: Gene Filtering, Analysis, and Conversion
https://agat.readthedocs.io/en/latest/agat_how_does_it_work.html - AGAT: Another GTF/GFF Analysis Toolkit
https://easy-import.readme.io/docs/repairing-gff - easy-import
https://github.com/TAMU-CPT/CPT_GffParser - a BioPython-compatible library for parsing and fixing GFF data
Some format conversion-specific tools:
https://bioconvert.readthedocs.io/en/main/# - 'BioConvert': a collaborative project to facilitate the interconversion of life science data formats
https://github.com/jorvis/biocode - 'biocode': a collection of bioinformatics code libraries and scripts (see
|
Great to see there is still a lot of interest in this. I am planning to create a LinkML schema for GFF3. This would have a lot of advantages:
This could serve as a reference against which different validators could indicate conformance, and also directly as a validator |
That sounds nice! Glad to see that it is YAML-based and not something complex or particularly domain-specific. However, can you clarify the concept of "profiles"? To my ear, it sounds like it would involve different rule-sets for GFFs from different organisms. If that's the case, (IMO) that seems like something that would be a minor step in the wrong direction (though that discussion may be for another thread). |
The spec points here: https://github.com/modENCODE-DCC/validator/blob/master/new_gff_validator.pl
This is 7-year-old Perl code
The SO wiki has:
http://www.sequenceontology.org/so_wiki/index.php/GFF3_Validation_Tools
which has GFFO (not in use?), FALDO (not really a validator) and the modENCODE validator. The modENCODE validator link doesn't work. But it seems to be this code:
https://github.com/genometools/genometools
which is in C
Reciprocal ticket: genometools/genometools#910
There is a question here:
https://www.biostars.org/p/177319/
which points to another validator, this one in Python: http://www.raetschlab.org/suppl/gff-tools
Which of these is supported? Is the behavior identical? What expectations does each have on the SO obo file?
I don't think the spec should link to specific validators. However, the spec should indicate the expected behavior of the validator. This could be modularized into different checks, and we could group checks into profiles. E.g. some validators may only validate a basic syntactic profile. Others could validate a sofa profile, where we check that the type column maps to a SO ID.
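As a rough illustration of the profile idea, here is a minimal sketch (not code from any of the validators mentioned in this thread). It assumes each check is a small function over the nine GFF3 columns and that a profile is just a named list of checks. The names `SO_TERMS`, `PROFILES`, `validate` and the individual check functions are hypothetical, and the tiny term table stands in for whatever a real validator would load from the SO obo file.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for terms loaded from the SO obo file (name -> ID).
SO_TERMS: Dict[str, str] = {
    "gene": "SO:0000704",
    "mRNA": "SO:0000234",
    "exon": "SO:0000147",
    "CDS": "SO:0000316",
}

# A check receives the tab-split columns and returns a message or None.
Check = Callable[[List[str]], Optional[str]]


def check_nine_columns(cols: List[str]) -> Optional[str]:
    if len(cols) != 9:
        return f"expected 9 tab-separated columns, found {len(cols)}"
    return None


def check_coordinates(cols: List[str]) -> Optional[str]:
    if len(cols) < 5:
        return None  # the column-count check reports this case
    try:
        start, end = int(cols[3]), int(cols[4])
    except ValueError:
        return "start and end must be integers"
    if start < 1 or start > end:
        return "coordinates must satisfy 1 <= start <= end"
    return None


def check_type_in_so(cols: List[str]) -> Optional[str]:
    if len(cols) < 3:
        return None
    feature_type = cols[2]
    if feature_type in SO_TERMS or feature_type in SO_TERMS.values():
        return None
    return f"type {feature_type!r} is not a known SO term name or ID"


# Profiles group checks; the "sofa" profile adds the SO type check.
PROFILES: Dict[str, List[Check]] = {
    "syntax": [check_nine_columns, check_coordinates],
    "sofa": [check_nine_columns, check_coordinates, check_type_in_so],
}


def validate(text: str, profile: str) -> List[Tuple[int, str]]:
    """Return (line number, message) pairs for every failed check."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        for check in PROFILES[profile]:
            message = check(cols)
            if message:
                problems.append((lineno, message))
    return problems
```

With this layout, a file whose type column holds an unknown term passes the `syntax` profile but is reported under `sofa`, which is the kind of per-profile conformance statement a validator could make.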
Understanding how validators use relationships is important for maintenance of SO:
The-Sequence-Ontology/SO-Ontologies#465
There could be a validator registry separate from the spec, and defined conformance tests for the validators
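One way such a registry and conformance suite could fit together is sketched below. This is only an assumption about the shape, not an existing tool: `Case`, `REGISTRY`, `run_conformance` and the two toy validators are hypothetical names, and in practice each registry entry would wrap a real validator rather than the inline functions used here to keep the example self-contained.

```python
from typing import Callable, Dict, List, NamedTuple


class Case(NamedTuple):
    name: str
    gff3: str
    expect_valid: bool


# A tiny, made-up bank of conformance cases.
CASES = [
    Case("minimal gene",
         "##gff-version 3\nchr1\t.\tgene\t1\t100\t.\t+\t.\tID=g1\n", True),
    Case("eight columns",
         "##gff-version 3\nchr1\t.\tgene\t1\t100\t.\t+\t.\n", False),
    Case("start after end",
         "##gff-version 3\nchr1\t.\tgene\t100\t1\t.\t+\t.\tID=g1\n", False),
]


def _feature_lines(text: str) -> List[str]:
    return [l for l in text.splitlines() if l and not l.startswith("#")]


def strict_validator(text: str) -> bool:
    """Toy validator: checks column count and coordinate order."""
    for line in _feature_lines(text):
        cols = line.split("\t")
        if len(cols) != 9:
            return False
        if not (cols[3].isdigit() and cols[4].isdigit()):
            return False
        if int(cols[3]) > int(cols[4]):
            return False
    return True


def lenient_validator(text: str) -> bool:
    """Toy validator: checks column count only."""
    return all(len(l.split("\t")) == 9 for l in _feature_lines(text))


# The registry maps a validator name to a callable returning True if valid.
REGISTRY: Dict[str, Callable[[str], bool]] = {
    "strict": strict_validator,
    "lenient": lenient_validator,
}


def run_conformance(registry, cases) -> Dict[str, List[str]]:
    """For each validator, list the cases where it disagrees with the suite."""
    return {
        name: [c.name for c in cases if validator(c.gff3) != c.expect_valid]
        for name, validator in registry.items()
    }
```

Running `run_conformance(REGISTRY, CASES)` shows which validators disagree with the expected outcome on which cases, which is the information needed to judge whether a change to SO or the spec would shift any validator's behavior.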