Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New ingest one snakefile #10

Closed
wants to merge 31 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
9be3cc9
Ingest: Copy ingest from monkeypox repo
Nov 17, 2022
d91180e
Ingest: Ignore ingest cache directories
j23414 Mar 28, 2023
0700db2
Ingest: Remove Nextclade
Nov 17, 2022
6ce79cb
Remove bin/scripts duplication
Nov 17, 2022
af25286
fix: Use curl for downloading files
j23414 Mar 28, 2023
1400d2f
Ingest: Replace monkeypox text and parameters with dengue
Nov 17, 2022
9ada261
Dengue-specific-ingest: Add dengue serotype wildcards
Nov 17, 2022
8c9763f
Dengue-specific-ingest: Add download filters
Nov 17, 2022
15f335d
Add post processing script
Nov 17, 2022
4f57aed
Replace post processing R script with python
Nov 30, 2022
938feaf
[ingest] Simplify finding strain name
j23414 Apr 24, 2023
f7bc33b
zstd compress output files
Nov 17, 2022
3fccb81
fix: makes the compress rule more generic
j23414 Mar 28, 2023
7ecc29a
Build: Index by genbank accession instead of duplicate strain names
Feb 23, 2023
d6eadd0
fix: remove entries where accession is not found
j23414 Mar 28, 2023
6c21a39
Ingest: Compromise by duplicating scripts
Mar 25, 2023
360b383
Ingest: Replace monkeypox text and parameters with dengue in scripts
j23414 Jun 6, 2023
b724b96
Ingest: Compromise by allowing redundant data pull by serotype
Mar 25, 2023
b8f3b8d
[wip] attempt at limiting concurrent deploys
j23414 Mar 29, 2023
755d287
Build: parameterize threads in align rule
j23414 Mar 28, 2023
a78ac90
docs: Add documentation on running ingest
j23414 Apr 18, 2023
4e28f33
fix: wildcards paired with optional.yaml
j23414 Jun 6, 2023
87c80a9
refactor: move post_process_metadata to rule transform
j23414 Jun 9, 2023
4823f72
cleanup some unused metadata columns
j23414 Jun 9, 2023
247b2fd
mark temp intermediate files
j23414 Jun 9, 2023
d008d5a
fix: annotations is a file, not a param
j23414 Jun 9, 2023
77d1a07
refactor: parameterize data s3 source
j23414 Jun 9, 2023
df024cb
Ingest: Using Snakemake modules for ingesting data into the Nextstrai…
j23414 Jun 9, 2023
8d3ef7d
ingest: add zstd support
joverlee521 Jun 9, 2023
351b3ec
Snakefile: allow configfile overrides
joverlee521 Jun 10, 2023
dafdd12
Update workaround for ingest `shell`
joverlee521 Jun 10, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@ environment*

# Snakemake state dir
/.snakemake
ingest/.snakemake
ingest/logs

# Local config overrides
/config_local.yaml
Expand Down
9 changes: 8 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,14 @@ This build starts by pulling preprocessed sequence and metadata files from:
* https://data.nextstrain.org/files/dengue/sequences_denv4.fasta.zst
* https://data.nextstrain.org/files/dengue/metadata_denv4.tsv.zst

The above datasets have been preprocessed and cleaned from GenBank and are updated at regular intervals.
The above datasets have been preprocessed and cleaned from GenBank and are updated at regular intervals from the ingest folder.

```
nextstrain build ingest

# Upload final dataset and trigger slack notifications
nextstrain build ingest --configfiles config/config.yaml config/optional.yaml
```

### Using example data

Expand Down
116 changes: 97 additions & 19 deletions Snakefile
Original file line number Diff line number Diff line change
@@ -1,9 +1,34 @@
from snakemake.utils import min_version
min_version("6.0")

# Use default pathogen build config if no configs are provided
if not config:
configfile: "config/config_dengue.yaml"
# Use default ingest config if no `transform` config is provided
if not config.get("transform"):
configfile: "ingest/config/config.yaml"

# Add the hard-coded ingest basedir to the workflow config so that we can
# pass it to the module ingest workflow. This will allow shell scripts to
# use the proper paths for local script invocation since we cannot set the
# workdir separately for module workflows.
# This work around is based on https://stackoverflow.com/a/66890412
config["ingest_basedir"] = f"{workflow.current_basedir}/ingest"

serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4']

rule all:
input:
auspice_json = expand("auspice/dengue_{serotype}.json", serotype=serotypes)

module ingest_workflow:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't used modules myself, so I'm interested to know how you and @joverlee521 found them to work with - especially related to debugging / developing?

As a counterexample, I've been playing with an HBV build which uses our convention of separating out ingest data and code into ./ingest/, but still runs everything from a top-level ./Snakemake (which calls include: "ingest/ingest.smk").

The advantage of modules (as used in this context) is that they continue to allow ./ingest/ to function in a completely stand-alone way, which has obvious advantages. Perhaps using modules also enforces the separation of concerns between ingest + phylo in a desirable way? The cost is increased complexity. After seeing how complex the ncov snakemake workflow became over the years I'm trying to gauge whether the trade-offs here are worth it.

snakefile:
"ingest/Snakefile"
config: config
prefix: "ingest"

use rule * from ingest_workflow as ingest_*

rule files:
params:
dropped_strains = "config/dropped_strains.txt",
Expand Down Expand Up @@ -52,35 +77,65 @@ def clade_defs(w):
}
return defs[w.serotype]

rule download:
"""Downloading sequences and metadata from data.nextstrain.org"""
output:
sequences = "data/sequences_{serotype}.fasta.zst",
metadata = "data/metadata_{serotype}.tsv.zst"

params:
sequences_url = "https://data.nextstrain.org/files/dengue/sequences_{serotype}.fasta.zst",
metadata_url = "https://data.nextstrain.org/files/dengue/metadata_{serotype}.tsv.zst"
rule mv_ingest_data:
input:
sequences="ingest/data/sequences_{serotype}.fasta.zst",
metadata="ingest/data/metadata_{serotype}.tsv.zst",
output:
sequences="data/sequences_{serotype}.fasta.zst",
metadata="data/metadata_{serotype}.tsv.zst",
shell:
"""
curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences}
curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata}
mv {input.sequences} {output.sequences}
mv {input.metadata} {output.metadata}
"""

if config.get("s3_src"):
ruleorder: download > mv_ingest_data

rule download:
"""Downloading sequences and metadata from data.nextstrain.org"""
output:
sequences = "data/sequences_{serotype}.fasta.zst",
metadata = "data/metadata_{serotype}.tsv.zst"

params:
sequences_url = f"{config.get('s3_src')}/sequences_{{serotype}}.fasta.zst",
metadata_url = f"{config.get('s3_src')}/metadata_{{serotype}}.tsv.zst"
shell:
"""
curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences}
curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata}
"""

rule decompress:
"""Parsing fasta into sequences and metadata"""
input:
sequences = "data/sequences_{serotype}.fasta.zst",
metadata = "data/metadata_{serotype}.tsv.zst"
output:
sequences = "results/sequences_{serotype}.fasta",
metadata = "results/metadata_{serotype}.tsv"
sequences = "data/sequences_{serotype}.fasta",
metadata = "data/metadata_{serotype}.tsv"
shell:
"""
zstd -d -c {input.sequences} > {output.sequences}
zstd -d -c {input.metadata} > {output.metadata}
"""

rule wrangle_metadata:
input:
metadata="data/metadata_{serotype}.tsv",
output:
metadata="results/wrangled_metadata_{serotype}.tsv",
params:
strain_id=config.get("strain_id_field", "strain"), #accession
shell:
"""
csvtk -t rename -f strain -n strain_original {input.metadata} \
| csvtk -t mutate -f {params.strain_id} -n strain > {output.metadata}
"""

rule filter:
"""
Filtering to
Expand All @@ -90,8 +145,8 @@ rule filter:
- excluding strains with missing region, country or date metadata
"""
input:
sequences = "results/sequences_{serotype}.fasta",
metadata = "results/metadata_{serotype}.tsv",
sequences = "data/sequences_{serotype}.fasta",
metadata = "results/wrangled_metadata_{serotype}.tsv",
exclude = files.dropped_strains
output:
sequences = "results/filtered_{serotype}.fasta"
Expand Down Expand Up @@ -122,6 +177,8 @@ rule align:
reference = files.reference
output:
alignment = "results/aligned_{serotype}.fasta"
params:
threads = 1
shell:
"""
augur align \
Expand All @@ -130,7 +187,7 @@ rule align:
--output {output.alignment} \
--fill-gaps \
--remove-reference \
--nthreads 1
--nthreads {params.threads}
"""

rule tree:
Expand Down Expand Up @@ -158,7 +215,7 @@ rule refine:
input:
tree = "results/tree-raw_{serotype}.nwk",
alignment = "results/aligned_{serotype}.fasta",
metadata = "results/metadata_{serotype}.tsv"
metadata = "results/wrangled_metadata_{serotype}.tsv"
output:
tree = "results/tree_{serotype}.nwk",
node_data = "results/branch-lengths_{serotype}.json"
Expand Down Expand Up @@ -223,7 +280,7 @@ rule traits:
"""
input:
tree = "results/tree_{serotype}.nwk",
metadata = "results/metadata_{serotype}.tsv"
metadata = "results/wrangled_metadata_{serotype}.tsv"
output:
node_data = "results/traits_{serotype}.json",
params:
Expand Down Expand Up @@ -262,15 +319,16 @@ rule export:
"""Exporting data files for for auspice"""
input:
tree = "results/tree_{serotype}.nwk",
metadata = "results/metadata_{serotype}.tsv",
metadata = "results/wrangled_metadata_{serotype}.tsv",
branch_lengths = "results/branch-lengths_{serotype}.json",
traits = "results/traits_{serotype}.json",
clades = "results/clades_{serotype}.json",
nt_muts = "results/nt-muts_{serotype}.json",
aa_muts = "results/aa-muts_{serotype}.json",
auspice_config = files.auspice_config
output:
auspice_json = "auspice/dengue_{serotype}.json"
auspice_json = "results/raw_dengue_{serotype}.json",
root_sequence = "results/raw_dengue_{serotype}_root-sequence.json",
shell:
"""
augur export v2 \
Expand All @@ -282,6 +340,26 @@ rule export:
--output {output.auspice_json}
"""

rule final_strain_name:
input:
auspice_json="results/raw_dengue_{serotype}.json",
metadata="results/wrangled_metadata_{serotype}.tsv",
root_sequence="results/raw_dengue_{serotype}_root-sequence.json",
output:
auspice_json="auspice/dengue_{serotype}.json",
root_sequence="auspice/dengue_{serotype}_root-sequence.json",
params:
display_strain_field=config.get("display_strain_field", "strain"),
shell:
"""
python3 bin/set_final_strain_name.py \
--metadata {input.metadata} \
--input-auspice-json {input.auspice_json} \
--display-strain-name {params.display_strain_field} \
--output {output.auspice_json}
cp {input.root_sequence} {output.root_sequence}
"""

rule clean:
"""Removing directories: {params}"""
params:
Expand Down
36 changes: 36 additions & 0 deletions bin/set_final_strain_name.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
import pandas as pd
import json, argparse

def replace_name_recursive(node, lookup):
if node["name"] in lookup:
node["name"] = lookup[node["name"]]

if "children" in node:
for child in node["children"]:
replace_name_recursive(child, lookup)

if __name__=="__main__":
parser = argparse.ArgumentParser(
description="Swaps out the strain names in the Auspice JSON with the final strain name",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)

parser.add_argument('--input-auspice-json', type=str, required=True, help="input auspice_json")
parser.add_argument('--metadata', type=str, required=True, help="input data")
parser.add_argument('--display-strain-name', type=str, required=True, help="field to use as strain name in auspice")
parser.add_argument('--output', type=str, metavar="JSON", required=True, help="output Auspice JSON")
args = parser.parse_args()

metadata = pd.read_csv(args.metadata, sep='\t')
name_lookup = {}
for ri, row in metadata.iterrows():
strain_id = row['strain']
name_lookup[strain_id] = args.display_strain_name if pd.isna(row[args.display_strain_name]) else row[args.display_strain_name]

with open(args.input_auspice_json, 'r') as fh:
data = json.load(fh)

replace_name_recursive(data['tree'], name_lookup)

with open(args.output, 'w') as fh:
json.dump(data, fh)
3 changes: 3 additions & 0 deletions config/config_dengue.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
strain_id_field: "accession"
display_strain_field: "strain_original"
# s3_src: 'https://data.nextstrain.org/files/dengue'
110 changes: 41 additions & 69 deletions config/dropped_strains.txt
Original file line number Diff line number Diff line change
@@ -1,69 +1,41 @@
DENV/SPAIN/EEB17/2009
DENV1/FRANCE/00475/2008
DENV1/MALAYSIA/P1244/1972
DENV1/VIETNAM/BIDV3990/2008
DENV1/VIETNAM/BIDV992/2006
DENV2/AUSTRALIA/QML22/2015
DENV2/BURKINA_FASO/DAKAR2039/1980
DENV2/BURKINA_FASO/DAKARA2022/1980
DENV2/COTE_D_IVOIRE/DAKAR510/1980
DENV2/COTE_D_IVOIRE/DAKAR578/1980
DENV2/COTE_D_IVOIRE/DAKARA1247/1980
DENV2/GUINEA/PM33974/1981
DENV2/HAITI/DENGUEVIRUS2HOMOSAPIENS1/2016
DENV2/MALAYSIA/DKD811/2008
DENV2/MALAYSIA/P81407/1970
DENV2/MALAYSIA/SAB/2015
DENV2/NIGERIA/IBH11208/1966
DENV2/NIGERIA/IBH11234/1966
DENV2/NIGERIA/IBH11664/1966
DENV2/SENEGAL/0674/1970
DENV2/SENEGAL/DAKAR0761/1974
DENV2/SENEGAL/DAKAR141069/1999
DENV2/SENEGAL/DAKAR141070/1999
DENV2/SENEGAL/DAKARD75505/1999
DENV2/TRINIDAD_AND_TOBAGO/NA/1953
DENV4/MALAYSIA/P215/1975
DENV4/MALAYSIA/P514/1975
DENV4/MALAYSIA/P731120/1973
D2Sab2015 # miscategorized
QML22 # miscategorized
DAK_Ar_A1247 # sylvatic
Dak_Ar_2039 # sylvatic
Dak_Ar_578 # sylvatic
DAK_Ar_510 # sylvatic
PM33974 # sylvatic
Dak_Ar_A2022 # sylvatic
Dak_Ar_141069 # sylvatic
Dak_Ar_141070 # sylvatic
Dak_Ar_D75505 # sylvatic
Dak_HD_10674 # sylvatic
Dak_Ar_D20761 # sylvatic
IBH11664 # sylvatic
IBH11208 # sylvatic
IBH11234 # sylvatic
P8_1407 # sylvatic
P75_514 # sylvatic
P73_1120 # sylvatic
P75_215 # sylvatic
DKD811 # sylvatic
ZS01/01 # metadata issue
Vero # cell line
MS13002673 # too divergent
MS11011405 # too divergent
V43257 # too divergent
KDC0574A2_06/02/2011 # too divergent
00178/03 # too divergent
00759/12 # too divergent
00988/11 # too divergent
01113/10 # too divergent
01224/04 # too divergent
01231/10 # too divergent
01488/09 # too divergent
01542/04 # too divergent
dev1 # too divergent
DKE_121 # too divergent
SENDAK_HD_10674 # sylvatic
DENV2_1_DAK_HD_76395 # sylvatic
DENV3/PUERTORICO/1963/PRS_228762_AC27 # too divergent
PR_6 # too divergent
KY923048 # D2Sab2015 # miscategorized
KX274130 # QML22 # miscategorized
EF105383 # DAK_Ar_A1247 # sylvatic
EF105382 # Dak_Ar_2039 # sylvatic
EF105380 # Dak_Ar_578 # sylvatic
EF105381 # DAK_Ar_510 # sylvatic
EF105378 # PM33974 # sylvatic
EF105386 # Dak_Ar_A2022 # sylvatic
EF105389 # Dak_Ar_141069 # sylvatic
EF105390 # Dak_Ar_141070 # sylvatic
EF457904 # Dak_Ar_D75505 # sylvatic
EF105384 # Dak_HD_10674 # sylvatic
EF105385 # Dak_Ar_D20761 # sylvatic
EF105388 # IBH11664 # sylvatic
EF105387 # IBH11208 # sylvatic
EU003591 # IBH11234 # sylvatic
EF105379 # P8_1407 # sylvatic
JF262779 # P75_514 # sylvatic
JF262780 # P73_1120 # sylvatic
EF457906 # P75_215 # sylvatic
FJ467493 # DKD811 # sylvatic
EF051521 # ZS01/01 # metadata issue
MT929160 # Vero # cell line
MH048676 # MS13002673 # too divergent
MH048674 # MS11011405 # too divergent
MT597439 # V43257 # too divergent
MN448607 # KDC0574A2_06/02/2011 # too divergent
ON046268 # 00178/03 # too divergent
ON046278 # 00759/12 # too divergent
ON046276 # 00988/11 # too divergent
ON046273 # 01113/10 # too divergent
ON046270 # 01224/04 # too divergent
ON046274 # 01231/10 # too divergent
ON046272 # 01488/09 # too divergent
ON046271 # 01542/04 # too divergent
MZ284953 # dev1 # too divergent
MZ215848 # DKE_121 # too divergent
MW946564 # SENDAK_HD_10674 # sylvatic
OK605757 # DENV2_1_DAK_HD_76395 # sylvatic
MW945427 # DENV3/PUERTORICO/1963/PRS_228762_AC27 # too divergent
OM258630 # PR_6 # too divergent
Binary file modified example_data/sequences_all.fasta.zst
Binary file not shown.
Binary file modified example_data/sequences_denv1.fasta.zst
Binary file not shown.
Binary file modified example_data/sequences_denv2.fasta.zst
Binary file not shown.
Binary file modified example_data/sequences_denv3.fasta.zst
Binary file not shown.
Binary file modified example_data/sequences_denv4.fasta.zst
Binary file not shown.
Loading