Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to Nextclade v3 #281

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

Migrate to Nextclade v3 #281

wants to merge 8 commits into from

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Oct 18, 2024

Description of proposed changes

Migrates the ingest workflow to Nextclade v3, which simplifies the workflow since we no longer need to use two Nextclade datasets. Includes additional clean ups to use augur curate rename and augur merge, see commits for details.

Related issue(s)

Resolves #280

Checklist

Since the v3 Nextclade datasets for MPXV and hMPXV both use NC_063383.1
as the reference, we can just run Nextclade once using the MPXV dataset.
Move the downloaded Nextclade dataset to the `data/` directory so that
it's automatically ignored by git.

Move Nextclade outputs to `results` as they are part of the final
outputs of the ingest workflow.
Modified from <nextstrain/measles@faebd64>.

I switched the order of `augur curate rename` and `tsv-select` because
the Nextclade data includes fields that we don't use that hit the
csv field size limit.

This commit also includes automated reformatting by `snakefmt`.
Update the `files_to_upload` to match the output files from
Nextclade v3.
Remove reference to nextalign and update available files to match
the output files from Nextclade v3.
@joverlee521
Copy link
Contributor Author

Trial run successfully completed and uploaded results to s3://nextstrain-data/files/workflows/mpox/branch/ingest-nextclade-v3/

There are many changes across the metadata.tsv compared to the production metadata.tsv because of changes in the Nextclade columns. A majority of these changes come from the divergence column, which is expected because the v3 MPXV Nextclade dataset uses NC_063383.1 as the reference instead of a reconstructed ancestor.

I took a quick look at clade changes:

  • A majority of clade changes are a result of the new Ia & Ib clade distinctions
  • 42 sequences that used to fail now get assigned clades
  • 12 sequences that were labelled as outgroup now fail
Detailed clade change counts
prod_clade v3_clade count
I 6
II 4
IIa 5
IIb 10
Ia 2
outgroup 15
I II 1
I IIb 1
I Ia 518
I Ib 133
I outgroup 22
II IIa 1
II IIb 14
II outgroup 7
IIa II 2
IIa IIb 24
IIa outgroup 1
IIb outgroup 2
outgroup 12
outgroup I 1
outgroup II 4
outgroup IIa 4
outgroup IIb 28

These changes seem reasonable to me, but would like a quick review from @corneliusroemer

ingest/rules/nextclade.smk Show resolved Hide resolved
Adding log/benchmark to all rules in nextclade.smk
according to the Nextstrain Snakemake style guide
<https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>
Chatter on Slack regarding workflow runtimes made me realize that since
we only run a single Nextclade job now, we no longer need to put a hard
limit on the threads.

<https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1729629899871479?thread_ts=1729629155.721859&cid=C01LCTT7JNN>
@@ -30,7 +30,7 @@ rule run_nextclade:
# The lambda is used to deactivate automatic wildcard expansion.
# https://github.com/snakemake/snakemake/blob/384d0066c512b0429719085f2cf886fdb97fd80a/snakemake/rules.py#L997-L1000
translations=lambda w: "results/translations/{cds}.fasta",
threads: 4
threads: workflow.cores
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shaved off 8 mins for the Nextclade job, but only 3 mins off the total workflow.

Using all of the cores prevents Snakemake from scheduling other concurrent jobs such as upload_to_s3 jobs. The upload jobs can be slow as well, so we could adjust the threads to something like workflow.cores * 0.75.

It's only a couple minute difference so I'm not going to worry about it for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate to Nextclade v3
2 participants