ingest: Move Nextclade configs to separate YAML #34

Merged on Mar 7, 2024 (1 commit)
20 changes: 20 additions & 0 deletions ingest/README.md
@@ -44,6 +44,26 @@ inputs/outputs should be relative to the ingest directory.
Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

### Nextclade

Nextstrain is working to standardize ingest workflows so that they run Nextclade and include Nextclade outputs in our publicly
hosted data. However, creating a Nextclade dataset where one does not already exist requires curated data as input, so the
Nextclade steps are optional here.

If Nextclade config values are included, the Nextclade rules will create the final metadata TSV by joining the Nextclade
output with the metadata. If Nextclade configs are not included, we rename the subset metadata TSV to the final metadata TSV.
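
The two paths described above can be illustrated with a minimal Python sketch. This is not the workflow's actual implementation: the function names, the `accession` and `country` metadata columns, the `clade` column, and the choice to keep every metadata row (a left join) are all assumptions made for illustration. Only `seqName` comes from the `id_field` default in the config.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_final_metadata(metadata, nextclade=None,
                         metadata_id="accession", nextclade_id="seqName"):
    """Return the final metadata rows.

    Without Nextclade output, the subset metadata is passed through unchanged
    (the "rename" path). With Nextclade output, its columns are appended to
    each metadata row that shares an ID (assumed here to be a left join).
    """
    if nextclade is None:
        return [dict(row) for row in metadata]

    # Columns contributed by Nextclade, excluding its ID column.
    extra_columns = [c for c in nextclade[0] if c != nextclade_id] if nextclade else []
    by_id = {row[nextclade_id]: row for row in nextclade}

    final = []
    for row in metadata:
        merged = dict(row)
        match = by_id.get(row[metadata_id], {})
        for column in extra_columns:
            # Records without a Nextclade result get an empty value.
            merged[column] = match.get(column, "")
        final.append(merged)
    return final


# Hypothetical example data.
metadata = read_tsv("accession\tcountry\nA1\tKenya\nA2\tPeru\n")
nextclade = read_tsv("seqName\tclade\nA1\tI\n")

with_nextclade = build_final_metadata(metadata, nextclade)
without_nextclade = build_final_metadata(metadata)
```

In this sketch `without_nextclade` is identical to the input metadata, while `with_nextclade` gains a `clade` column that is empty for records Nextclade did not report on.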

To run Nextclade rules, include the `defaults/nextclade_config.yaml` config file with:

```
nextstrain build ingest --configfile defaults/nextclade_config.yaml
```

> [!TIP]
> If the Nextclade dataset is stable and you always want to run the Nextclade rules as part of ingest, we recommend
> moving the Nextclade-related config parameters from the `defaults/nextclade_config.yaml` file to the default config file
> `defaults/config.yaml`.

## Build configs

The build-configs directory contains custom configs and rules that override and/or
2 changes: 2 additions & 0 deletions ingest/Snakefile
@@ -39,6 +39,8 @@ include: "rules/curate.smk"
# final metadata TSV by joining the Nextclade output with the metadata.
# If Nextclade configs are not included, we rename the subset metadata TSV
# to the final metadata TSV.
# To run nextclade.smk rules, include the `defaults/nextclade_config.yaml`
# config file with `nextstrain build ingest --configfile defaults/nextclade_config.yaml`.
if "nextclade" in config:

    include: "rules/nextclade.smk"
15 changes: 0 additions & 15 deletions ingest/defaults/config.yaml
@@ -115,18 +115,3 @@ curate:
"abbr_authors",
"institution",
]


# Nextclade parameters to include if you are running Nextclade as a part of your ingest workflow
# Note that this requires a Nextclade dataset to already exist for your pathogen.
# Remove the following parameters if you do not plan to run Nextclade.
nextclade:
  # The name of the Nextclade dataset to use for running nextclade.
  # Run `nextclade dataset list` to get a full list of available Nextclade datasets
  dataset_name: ""
  # Path to the mapping for renaming Nextclade output columns
  # The path should be relative to the ingest directory
  field_map: "config/nextclade_field_map.tsv"
  # This is the ID field you would use to match the Nextclade output with the record metadata.
  # This should be the new name that you have defined in your field map.
  id_field: "seqName"
12 changes: 12 additions & 0 deletions ingest/defaults/nextclade_config.yaml
@@ -0,0 +1,12 @@
# Nextclade parameters to include if you are running Nextclade as a part of your ingest workflow
# Note that this requires a Nextclade dataset to already exist for your pathogen.
nextclade:
  # The name of the Nextclade dataset to use for running nextclade.
  # Run `nextclade dataset list` to get a full list of available Nextclade datasets
  dataset_name: ""
  # Path to the mapping for renaming Nextclade output columns
  # The path should be relative to the ingest directory
  field_map: "config/nextclade_field_map.tsv"
  # This is the ID field you would use to match the Nextclade output with the record metadata.
  # This should be the new name that you have defined in your field map.
  id_field: "seqName"
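
Moving the `nextclade` block into its own file means the `if "nextclade" in config:` check in the Snakefile only passes when that file is supplied via `--configfile`. The Python sketch below illustrates the idea with plain dictionaries. The `merge_configs` helper is a simplified stand-in for how config files are layered, not Snakemake's own code, and the contents of the `curate` section are elided here.

```python
def merge_configs(base, override):
    """Recursively layer `override` on top of `base` without mutating either.

    Hypothetical helper approximating how an extra config file is merged
    into the default config.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged


def nextclade_enabled(config):
    """Mirror the Snakefile check that gates the Nextclade rules."""
    return "nextclade" in config


# defaults/config.yaml after this change: no `nextclade` key.
default_config = {"curate": {}}

# defaults/nextclade_config.yaml as added in this change.
nextclade_config = {
    "nextclade": {
        "dataset_name": "",
        "field_map": "config/nextclade_field_map.tsv",
        "id_field": "seqName",
    }
}

# Equivalent in spirit to passing --configfile defaults/nextclade_config.yaml.
config_with_nextclade = merge_configs(default_config, nextclade_config)

default_runs_nextclade = nextclade_enabled(default_config)
opt_in_runs_nextclade = nextclade_enabled(config_with_nextclade)
```

With only the default config the check is false and the Nextclade rules are skipped. Once the extra config file is layered on, the key is present and the rules are included.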