-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update phylo workflow #3
Conversation
* `output.insertions` will be a TSV file now * `--reference` is now spelled `--input-ref` * `--genemap` is now spelled `--input-annotation` * `--retry-reverse-complement` is no longer supported * `--output-insertions` is now spelled `--output-tsv` Note: dropping `--retry-reverse-complement` is the one that I am most unsure about, but this version completes this step.
Initially, the workflow failed with the following error: ``` Error: 0: When reading genome annotation 1: When reading file: "config/hku1/genemap.gff" 2: Attempted to parse the genome annotation as JSON and as GFF, but both attempts failed: JSON error: invalid type: string "NC_006577.2\tfeature\tsource\t1\t29926\t.\t+\t.\tgene=nuc NC_006577.2\tfeature\tgene\t206\t13600 \t.\t+\t.\tgene=ORF1a NC_006577.2\tfeature\tgene\t13600\t21753\t.\t+\t.\tgene=ORF1b NC_006577.2\tfeature\tgene\t21773\t22933\t.\t+\t.\tg ene=HE NC_006577.2\tfeature\tgene\t22942\t27012\t.\t+\t.\tgene=Spike NC_006577.2\tfeature\tgene\t22978\t25221\t.\t+\t.\tgene=S1 NC_00657 7.2\tfeature\tgene\t27051\t27380\t.\t+\t.\tgene=S2 NC_006577.2\tfeature\tgene\t27051\t27380\t.\t+\t.\tgene=ORF4 NC_006577.2\tfeature\tge ne\t27373\t27621\t.\t+\t.\tgene=E NC_006577.2\tfeature\tgene\t27633\t28304\t.\t+\t.\tgene=M NC_006577.2\tfeature\tgene\t28320\t29645\t.\ t+\t.\tgene=N NC_006577.2\tfeature\tgene\t28342\t28959\t.\t+\t.\tgene=N2", expected struct GeneMap at line 2 column 1 GFF3 error: When processing gene, 'N': When processing feature group 'N' ('N') of type 'gene': genes must consist of exactly one f eature: Expected exactly one element, but found: 2 2: Location: /workdir/packages/nextclade/src/gene/gene_map.rs:56 ``` While looking at the referenced file, and comparing it to the other `genemap.gff` files in the config, I noticed that all the others used `gene_name` for everything after the first `gene` line. I changed this file to match, and the workflow got past the point where it was previously erroring out. I have no idea why this worked; hopefully somebody will explain in the code review.
Currently some (but I think not all) of the
I'm guessing that it is probably not a coincidence that this is the GFF file that I modified... |
Ope, perhaps I guessed wrong — reverting the |
These changes plus the forced BioPython version upgrade discussed above were sufficient to allow the workflow to complete, so I'm going to take this out of draft. I think the dependency management question is orthogonal enough to the original ticket that these changes (once reviewed and I'm sure modified) could get merged. |
@genehack I will agree with your comment in 4eeffd8 that the format of the gene map file is a little confusing. In the official Nextclade datasets, I see older style formatting with The official Nextclade gene map docs suggest we should use What you've done here looks reasonable, since most of our older workflows use |
I made a detailed writeup. This might go to the docs somewhere, though it is only useful for upgrading older datasets by the team internally, and is not super useful for most users. I think, for most users, saying "follow GFF3 spec" is more concise and less confusing. Here is the full story: The GFF3 specThe official GFF3 spec is here: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md The spec is complex, has many redundancies and degrees of freedom. A particular annotation can be achieved in many different ways. In that sense, it's less of a "format", as it does not specify the exact form or shape, but more of a "guide", providing a set of best practices, recommendations and constraints for a popular way of expressing annotations in the bioinformatics community, and aiming to unify the conflicting ad-hoc formats and forks which appeared over the years. The TSV-like nature of the format should not mislead you - while some columns do define some basics, the most of the actually useful information is stuffed into the "attributes" column (column 9), which is a container for key-value pairs, some constrained, some arbitrary. Crazy things can (and do) happen there in the wild. And column 9 is the source of most breakages and of the complexity in the parsers. In Nextclade v3, the closer you follow the GFF3 spec the higher the chances that things will work correctly. Please report any deviations as bugs by opening issues in Nextclade repo. Historical contextIn older versions of Nextclade we used non-spec-compliant NowIn Nextclade v3 we are trying to be more (or, hopefully, mostly) spec-compliant and we use For backwards compatibility, reducing breakage when transitioning to v3, and in order to support various non-compliant GFF3 files in the wild (people are not always disciplined and upload to databases files that are not compliant) I allowed myself to keep reading I believe that as a part of this PR, changing Another important backwards-compat and non-compliant behavior Nextclade has:
This is important in practice, because there are many pathogens where CDS-to-gene relationship is 1:1 and there is only genes or only CDS in GFF3 and people don't bother duplicating the info (one line for every gene and one line for every CDS, which are the same as genes but with a different feature type), just to produce a compliant annotation. Debugging
|
I believe it's a bug in the docs. The key should be |
@genehack After seeing the error message in I assembled a dataset from the old ref and annotation files from here https://github.com/nextstrain/seasonal-cov/tree/8548b683c3afb953ebc067087d8fe80e553e7c96/phylogenetic/config/hku1 (previous commit if you click the "History" button in the GFF3 file on GitHub). Tried to visualize the annotation: nextclade read-annotation cov-hku-1/genome_annotation.gff3 Tried to run with individual inputs (input fasta is just the reference fasta, because I cannot find any other example sequences): nextclade run --input-ref cov-hku-1/reference.fasta --input-annotation cov-hku-1/genome_annotation.gff3 -O tmp/cov-hku-1 cov-hku-1/reference.fast and tried to run as a dataset (it is possible because I added a minimal pathogen.json file there): nextclade run --input-dataset cov-hku-1/ -O tmp/cov-hku-1 cov-hku-1/reference.fasta Everything seems to be working smoothly, so I cannot reproduce the error. Could there be a problem with Nextclade CLI version? Perhaps older version had some bugs. I am not sure how Nextclade CLI is installed in this repo and what the version is. I used Nextclade CLI 3.5.0. What the error in your commit message means is Nextclade perceives the entry for feature "N" (gene/CDS "N") as duplicated and it does not know which one to pick. Though in the GFF3 file I don't see this happening - there seems to be only one entry for "N". So this is very puzzling. Full working-as-intended repro in docker, using the cov-hku-1.zip.txt file from above (click to expand)docker run -it --rm nextstrain/nextclade:3.5.0 bash -c "set -euxo pipefail \
&& apt-get update -yqq && apt-get -yqq install curl unzip \
&& curl -fsSL -o cov-hku-1.zip https://github.com/nextstrain/seasonal-cov/files/15342689/cov-hku-1.zip.txt \
&& mkdir -p cov-hku-1; cd cov-hku-1; unzip ../cov-hku-1.zip; cd .. \
&& nextclade read-annotation cov-hku-1/genome_annotation.gff3 \
&& nextclade run --verbose --input-ref cov-hku-1/reference.fasta --input-annotation cov-hku-1/genome_annotation.gff3 -O tmp/cov-hku-1/individual cov-hku-1/reference.fasta \
&& ls -al tmp/cov-hku-1/individual/ \
&& cat tmp/cov-hku-1/individual/nextclade.tsv \
&& cat tmp/cov-hku-1/individual/nextclade.cds_translation.S1.fasta \
&& nextclade run --verbose --input-dataset cov-hku-1/ -O tmp/cov-hku-1/dataset cov-hku-1/reference.fasta \
&& ls -al tmp/cov-hku-1/dataset/ \
&& cat tmp/cov-hku-1/dataset/nextclade.tsv \
&& cat tmp/cov-hku-1/dataset/nextclade.cds_translation.S1.fasta \
" |
@ivan-aksamentov thanks for looking into this. I checked out that same SHA (8548b683c) and ran the My earlier errors happened with an ambient runtime, which I ended up abandoning due to some weird dependency issues (probably triggered by installing some additional tooling on top of the initial Nextstrain build). With the Nextstrain Conda runtime I'm using now, I was able to run both workflows without issue. |
Description of proposed changes
Update the
phylogenetic
workflow to work with Nextclade v3.Note that this is still a work in progress and the workflow still errors out; pushing my initial changes for an initial review and to get advice on next steps.
Related issue(s)
#2