Update phylo workflow #3

genehack · 2024-05-15T00:01:31Z

Description of proposed changes

Update the phylogenetic workflow to work with Nextclade v3.

Note that this is still a work in progress and the workflow still errors out; pushing my initial changes for an initial review and to get advice on next steps.

Related issue(s)

#2

* `output.insertions` will be a TSV file now * `--reference` is now spelled `--input-ref` * `--genemap` is now spelled `--input-annotation` * `--retry-reverse-complement` is no longer supported * `--output-insertions` is now spelled `--output-tsv` Note: dropping `--retry-reverse-complement` is the one that I am most unsure about, but this version completes this step.

Initially, the workflow failed with the following error: ``` Error: 0: When reading genome annotation 1: When reading file: "config/hku1/genemap.gff" 2: Attempted to parse the genome annotation as JSON and as GFF, but both attempts failed: JSON error: invalid type: string "NC_006577.2\tfeature\tsource\t1\t29926\t.\t+\t.\tgene=nuc NC_006577.2\tfeature\tgene\t206\t13600 \t.\t+\t.\tgene=ORF1a NC_006577.2\tfeature\tgene\t13600\t21753\t.\t+\t.\tgene=ORF1b NC_006577.2\tfeature\tgene\t21773\t22933\t.\t+\t.\tg ene=HE NC_006577.2\tfeature\tgene\t22942\t27012\t.\t+\t.\tgene=Spike NC_006577.2\tfeature\tgene\t22978\t25221\t.\t+\t.\tgene=S1 NC_00657 7.2\tfeature\tgene\t27051\t27380\t.\t+\t.\tgene=S2 NC_006577.2\tfeature\tgene\t27051\t27380\t.\t+\t.\tgene=ORF4 NC_006577.2\tfeature\tge ne\t27373\t27621\t.\t+\t.\tgene=E NC_006577.2\tfeature\tgene\t27633\t28304\t.\t+\t.\tgene=M NC_006577.2\tfeature\tgene\t28320\t29645\t.\ t+\t.\tgene=N NC_006577.2\tfeature\tgene\t28342\t28959\t.\t+\t.\tgene=N2", expected struct GeneMap at line 2 column 1 GFF3 error: When processing gene, 'N': When processing feature group 'N' ('N') of type 'gene': genes must consist of exactly one f eature: Expected exactly one element, but found: 2 2: Location: /workdir/packages/nextclade/src/gene/gene_map.rs:56 ``` While looking at the referenced file, and comparing it to the other `genemap.gff` files in the config, I noticed that all the others used `gene_name` for everything after the first `gene` line. I changed this file to match, and the workflow got past the point where it was previously erroring out. I have no idea why this worked; hopefully somebody will explain in the code review.

genehack · 2024-05-15T00:04:35Z

Currently some (but I think not all) of the translate jobs are failing with the following stacktrace:

        augur translate             --tree results/hku1/tree.nwk             --ancestral-sequences results/hku1/nt_mu16:56:57 [55/29803]
 --reference-sequence config/hku1/genemap.gff             --output results/hku1/aa_muts.json
Traceback (most recent call last):
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/augur/__init__.py", line 66, in run
    return args.__command__.run(args)
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/augur/translate.py", line 418, in run
    features = load_features(args.reference_sequence, genes)
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/augur/utils.py", line 197, in load_features
    return _read_gff(reference, feature_names)
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/augur/utils.py", line 299, in _read_gff
    gff_entries = list(GFF.parse(in_handle, limit_info={'gff_type': valid_types}))
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 791, in parse
    for rec in parser.parse_in_parts(gff_files, base_dict, limit_info,
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 334, in parse_in_parts
    cur_dict = self._results_to_features(cur_dict, results)
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 372, in _results_to_features
    (_, base) = self._add_toplevel_feature(base, feature)
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 579, in _add_toplevel_feature
    new_feature = self._get_feature(feature_dict)
  File "/opt/miniconda3/envs/nextstrain-dev/lib/python3.8/site-packages/BCBio/GFF/GFFParser.py", line 589, in _get_feature
    new_feature = SeqFeature.SeqFeature(SeqFeature.SimpleLocation(start=rstart, end=rend, strand=feature_dict['strand']), feature_dict['
type'],
AttributeError: module 'Bio.SeqFeature' has no attribute 'SimpleLocation'


An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
    <https://github.com/nextstrain/augur/issues/new/choose>

[Tue May 14 16:56:56 2024]
Error in rule translate:
    jobid: 28
    input: results/hku1/tree.nwk, results/hku1/nt_muts.json, config/hku1/genemap.gff
    output: results/hku1/aa_muts.json
    shell:

        augur translate             --tree results/hku1/tree.nwk             --ancestral-sequences results/hku1/nt_muts.json
 --reference-sequence config/hku1/genemap.gff             --output results/hku1/aa_muts.json
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I'm guessing that it is probably not a coincidence that this is the GFF file that I modified...

phylogenetic/config/hku1/genemap.gff

genehack · 2024-05-15T00:10:25Z

I'm guessing that it is probably not a coincidence that this is the GFF file that I modified...

Ope, perhaps I guessed wrong — reverting the s/gene=/gene_name=/g change I made to hku1/genemap.gff and manually running the augur command fails with the same error.

genehack · 2024-05-15T00:58:26Z

These changes plus the forced BioPython version upgrade discussed above were sufficient to allow the workflow to complete, so I'm going to take this out of draft. I think the dependency management question is orthogonal enough to the original ticket that these changes (once reviewed and I'm sure modified) could get merged.

huddlej · 2024-05-16T16:27:10Z

@genehack I will agree with your comment in 4eeffd8 that the format of the gene map file is a little confusing. In the official Nextclade datasets, I see older style formatting with gene features and gene_name ids and the newer style of mostly CDS features with gene_name ids and then a version with CDS features and gene ids.

The official Nextclade gene map docs suggest we should use CDS features with gene ids.

What you've done here looks reasonable, since most of our older workflows use gene features and gene_name ids, but maybe @ivan-aksamentov, @corneliusroemer, or @rneher have suggestions.

ivan-aksamentov · 2024-05-16T23:46:05Z

@genehack @huddlej

I made a detailed writeup. This might go to the docs somewhere, though it is only useful for upgrading older datasets by the team internally, and is not super useful for most users. I think, for most users, saying "follow GFF3 spec" is more concise and less confusing.

Here is the full story:

The GFF3 spec

The official GFF3 spec is here: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

The spec is complex, has many redundancies and degrees of freedom. A particular annotation can be achieved in many different ways. In that sense, it's less of a "format", as it does not specify the exact form or shape, but more of a "guide", providing a set of best practices, recommendations and constraints for a popular way of expressing annotations in the bioinformatics community, and aiming to unify the conflicting ad-hoc formats and forks which appeared over the years.

The TSV-like nature of the format should not mislead you - while some columns do define some basics, the most of the actually useful information is stuffed into the "attributes" column (column 9), which is a container for key-value pairs, some constrained, some arbitrary. Crazy things can (and do) happen there in the wild. And column 9 is the source of most breakages and of the complexity in the parsers.

In Nextclade v3, the closer you follow the GFF3 spec the higher the chances that things will work correctly. Please report any deviations as bugs by opening issues in Nextclade repo.

Historical context

In older versions of Nextclade we used non-spec-compliant gene and gene_name keys in the attributes column (column 9). More over, older versions of Nextclade supported only flat genome annotations, meaning the only supported feature type was "gene", and the list was flat, i.e. there were no nesting.

Now

In Nextclade v3 we are trying to be more (or, hopefully, mostly) spec-compliant and we use ID and Name as the primary attributes. ID is for identification and Name for friendly display. The ID and Parent attributes are also used to deduce a nested hierarchy of features in order to support multiple CDS per gene, multiple CDS segments per CDS (for things like splicing and slippage) and multiple proteins per CDS (proteins are currently parsed but not yet used). The translation unit is a CDS and this is the most important feature type in Nextclade. Currently supported feature types are: gene, CDS, protein.

For backwards compatibility, reducing breakage when transitioning to v3, and in order to support various non-compliant GFF3 files in the wild (people are not always disciplined and upload to databases files that are not compliant) I allowed myself to keep reading gene and gene_name attributes as aliases for Name. Which means some datasets out there might have non-compliant genome annotations and work just fine.

I believe that as a part of this PR, changing gene to gene_name or vice-versa has no effect. As they are both just aliases for Name (src). But let me know if it's not working as intended.

Another important backwards-compat and non-compliant behavior Nextclade has:

if there's no gene for a CDS, it generates a gene with the same parameters
if there's no CDS for a gene, it generates a CDS with the same parameters

This is important in practice, because there are many pathogens where CDS-to-gene relationship is 1:1 and there is only genes or only CDS in GFF3 and people don't bother duplicating the info (one line for every gene and one line for every CDS, which are the same as genes but with a different feature type), just to produce a compliant annotation.

Debugging

You can use read-annotation CLI command to view a GFF3 file from the perspective of Nextclade. This will display a table with the hierarchy of features as understood by Nextclade. Only gene, CDS and protein features will be displayed, because these are the only ones Nextclade understands (proteins are not used currently).
```
nextclade read-annotation genome_annotation.gff3
```
If you add --json flag, it will print the same information in JSON format:
```
nextclade read-annotation --json genome_annotation.gff3 > genome_annotation.json
```
The format is a dump of internal structs and comes without any kind of stability guarantees. It is not recommended for use anywhere outside of debugging and can change any time. This format is used internally to figure out everything related to translation and amino acids. This is what you see in Nextclade Web on results page at the bottom of the table and in the gene selection dropdown in the header of the last column of the table.
You can also visualize ALL features from a GFF3 file by adding --feature-tree flag (not very intuitively named).
```
nextclade read-annotation genome_annotation.gff3 --feature-tree
```
All features are displayed, even the ones Nextclade does not understand. This is a precursor of the format we saw just above - it comes earlier in the parsing process, before unused/unknown feature types are filtered out and before any preprocessing and additional calculations. This aims to show the annotation as is, but in more readable way, so this command might be useful for visualizing GFF3 files more generally, even outside of context of Nextclade.
The --json flag also applies with --feature-tree. This is a different internal format which comes earlier in the parsing process. The format is also an unstable internal format, purely for debugging purposes.

ivan-aksamentov · 2024-05-16T23:56:13Z

The official Nextclade gene map docs suggest we should use CDS features with gene ids.

I believe it's a bug in the docs. The key should be Name, not gene

ivan-aksamentov · 2024-05-17T01:51:47Z

@genehack After seeing the error message in 4eeffd8 (#3), I was looking into what could be wrong with Nextclade.

I assembled a dataset from the old ref and annotation files from here https://github.com/nextstrain/seasonal-cov/tree/8548b683c3afb953ebc067087d8fe80e553e7c96/phylogenetic/config/hku1 (previous commit if you click the "History" button in the GFF3 file on GitHub).
Here is the result. It's a .zip, but github does not allow uploading zips, so remove the .txt extension and unzip: cov-hku-1.zip.txt

Tried to visualize the annotation:

nextclade read-annotation cov-hku-1/genome_annotation.gff3

Tried to run with individual inputs (input fasta is just the reference fasta, because I cannot find any other example sequences):

nextclade run --input-ref cov-hku-1/reference.fasta --input-annotation cov-hku-1/genome_annotation.gff3 -O tmp/cov-hku-1 cov-hku-1/reference.fast

and tried to run as a dataset (it is possible because I added a minimal pathogen.json file there):

nextclade run --input-dataset cov-hku-1/ -O tmp/cov-hku-1 cov-hku-1/reference.fasta

Everything seems to be working smoothly, so I cannot reproduce the error. Could there be a problem with Nextclade CLI version? Perhaps older version had some bugs. I am not sure how Nextclade CLI is installed in this repo and what the version is. I used Nextclade CLI 3.5.0.

What the error in your commit message means is Nextclade perceives the entry for feature "N" (gene/CDS "N") as duplicated and it does not know which one to pick. Though in the GFF3 file I don't see this happening - there seems to be only one entry for "N". So this is very puzzling.

Full working-as-intended repro in docker, using the cov-hku-1.zip.txt file from above (click to expand)

docker run -it --rm nextstrain/nextclade:3.5.0 bash -c "set -euxo pipefail \
&& apt-get update -yqq && apt-get -yqq install curl unzip \
&& curl -fsSL -o cov-hku-1.zip https://github.com/nextstrain/seasonal-cov/files/15342689/cov-hku-1.zip.txt \
&& mkdir -p cov-hku-1; cd cov-hku-1; unzip ../cov-hku-1.zip; cd .. \
&& nextclade read-annotation cov-hku-1/genome_annotation.gff3 \
&& nextclade run --verbose --input-ref cov-hku-1/reference.fasta --input-annotation cov-hku-1/genome_annotation.gff3 -O tmp/cov-hku-1/individual cov-hku-1/reference.fasta \
&& ls -al tmp/cov-hku-1/individual/ \
&& cat tmp/cov-hku-1/individual/nextclade.tsv \
&& cat tmp/cov-hku-1/individual/nextclade.cds_translation.S1.fasta \
&& nextclade run --verbose --input-dataset cov-hku-1/ -O tmp/cov-hku-1/dataset cov-hku-1/reference.fasta \
&& ls -al tmp/cov-hku-1/dataset/ \
&& cat tmp/cov-hku-1/dataset/nextclade.tsv \
&& cat tmp/cov-hku-1/dataset/nextclade.cds_translation.S1.fasta \
"

genehack · 2024-05-17T21:16:45Z

@ivan-aksamentov thanks for looking into this.

I checked out that same SHA (8548b683c) and ran the ingest and phylogenetic workflows, and I no longer see the error either.

My earlier errors happened with an ambient runtime, which I ended up abandoning due to some weird dependency issues (probably triggered by installing some additional tooling on top of the initial Nextstrain build). With the Nextstrain Conda runtime I'm using now, I was able to run both workflows without issue.

genehack added 3 commits May 14, 2024 16:59

snakefmt applied to phylogenetic/rules/prepare_sequences.smk #2

2909a03

genehack requested review from kistlerk and a team May 15, 2024 00:08

victorlin reviewed May 15, 2024

View reviewed changes

phylogenetic/config/hku1/genemap.gff Show resolved Hide resolved

genehack marked this pull request as ready for review May 15, 2024 00:57

genehack changed the title ~~Update phylo workflow 2~~ Update phylo workflow May 15, 2024

victorlin mentioned this pull request May 15, 2024

pip/conda dependency version constraints not guaranteed in all environments nextstrain/augur#1463

Open

genehack merged commit 3313a90 into main May 16, 2024

genehack deleted the update-phylo-workflow-2 branch May 16, 2024 20:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update phylo workflow #3

Update phylo workflow #3

genehack commented May 15, 2024 •

edited

Loading

genehack commented May 15, 2024

genehack commented May 15, 2024

genehack commented May 15, 2024

huddlej commented May 16, 2024

ivan-aksamentov commented May 16, 2024 •

edited

Loading

ivan-aksamentov commented May 16, 2024

ivan-aksamentov commented May 17, 2024 •

edited

Loading

genehack commented May 17, 2024

Update phylo workflow #3

Update phylo workflow #3

Conversation

genehack commented May 15, 2024 • edited Loading

Description of proposed changes

Related issue(s)

genehack commented May 15, 2024

genehack commented May 15, 2024

genehack commented May 15, 2024

huddlej commented May 16, 2024

ivan-aksamentov commented May 16, 2024 • edited Loading

The GFF3 spec

Historical context

Now

Debugging

ivan-aksamentov commented May 16, 2024

ivan-aksamentov commented May 17, 2024 • edited Loading

genehack commented May 17, 2024

genehack commented May 15, 2024 •

edited

Loading

ivan-aksamentov commented May 16, 2024 •

edited

Loading

ivan-aksamentov commented May 17, 2024 •

edited

Loading