Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snakemake simplifications #72

Merged
merged 15 commits into from
Jul 25, 2024
Merged

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Jul 15, 2024

See commit messages for details

Still to do

@jameshadfield jameshadfield changed the base branch from master to james/cattle-flu-segments July 15, 2024 04:16
@jameshadfield jameshadfield force-pushed the james/snakemake-simplifications branch 2 times, most recently from a8a639d to 736b114 Compare July 16, 2024 02:22
@jameshadfield jameshadfield marked this pull request as ready for review July 16, 2024 02:31
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice overall simplification of the Snakemake workflow (at least from my POV).

I left comments for consideration, the main one being the ramifications of updating the CI build-args.

rules/cattle-flu.smk Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
@@ -11,4 +11,4 @@ jobs:
# conform since this is a collaborative repo with an external group.
uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@v0
with:
build-args: test_target
build-args: --configfile config/gisaid.yaml -pf test_target
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, a similar update will need to be applied to docker-base, conda-base, and augur CI workflows.


I am considering if we want to use config/gisiad.yaml as a default config. Then we wouldn't need to update the build-args in CI and we can keep the simple nextstrain build invocation without needing to provide a config file.

This can be done in two ways:

  1. Defining it as a default config in the Snakefile (as suggested by Nextstrain Snakemake style guide).
configfile: "config/gisaid.yaml"

The downside to this method is the config dictionaries merge. This becomes an issue when adding --configfile config/h5n1-cattle-outbreak.yaml , as the top level builds would merge and run all subtypes.

  1. Only define the default config if no configs are provided (similar to mpox).
if not config:
    configfile: "config/gisaid.yaml"

The config dictionary merge issue from [1] still exists, but at least this way adding --configfile config/h5n1-cattle-outbreak.yaml will completely switch over to the h5n1-cattle-outbreak config params. The downside to this method is users need to remember to provide the configfile if they want to override specific configs with --config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Urgh. Thanks for noticing - this would have been a pain to solve post-merge!

I like the explicitness of providing a --configfile for each run. But do I like it enough to make changes across 3 extra repos? Probably. Will think about it some more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Only noticed because we recently went through this in #17).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRs made to update CI in

config/gisaid.yaml Outdated Show resolved Hide resolved
@@ -22,7 +22,10 @@ segments:
- ns

#### Parameters which define the input source ####
s3_src: s3://nextstrain-data-private/files/workflows/avian-flu
s3_src:
name: gisaid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider consolidating s3_src.name with the the top level ingest_source param? It's not entirely clear to me why this is a separate config param (except using gisaid vs fauna)

Copy link
Member Author

@jameshadfield jameshadfield Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone a slightly different direction - I've removed ingest_source so we now have two isolated and mutually exclusive¹ config keys: s3_src and local_ingest. I preferred this as the local ingest name (e.g. "andersen-lab") doesn't necessarily overlap with the S3 "name".

¹ As per added comment in the Snakefile, we could relax the exclusivity of S3_SRC and LOCAL_INGEST if we want to use --config local_ingest=gisaid overrides in the future.

Snakefile Outdated Show resolved Hide resolved
Snakefile Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
@jameshadfield jameshadfield force-pushed the james/snakemake-simplifications branch from 736b114 to 6b9bc7c Compare July 16, 2024 23:24
@jameshadfield jameshadfield force-pushed the james/snakemake-simplifications branch from 6b9bc7c to 6316021 Compare July 17, 2024 02:35
jameshadfield added a commit to nextstrain/augur that referenced this pull request Jul 18, 2024
Mirrors the change in avian-flu's CI.

<nextstrain/avian-flu#72>
jameshadfield added a commit to nextstrain/docker-base that referenced this pull request Jul 18, 2024
Mirrors the change in avian-flu's CI.

<nextstrain/avian-flu#72>
jameshadfield added a commit to nextstrain/conda-base that referenced this pull request Jul 18, 2024
Mirrors the change in avian-flu's CI.

<nextstrain/avian-flu#72>
This work is in preparation for the next commit which will add
functionality which needs to conditionally change the return value based
on wildcards.
Individual datasets for all 8 segments are added using NCBI data
(consistent with the whole-genome view) and using the strains included
in the genome tree to choose the appropriate monophyletic clade in each
segment tree to include.

There is an added Auspice colouring "genome_tree" which indicates
whether the strain exists in the whole genome tree.

See top-level README for instructions on how to run.

There are a number of important caveats with the current implementation,
all of which are fixable but I think better done in future work.

* The input source must be NCBI for this build to work as expected and
thus the `s3_src` config argument is essential. Snakemake will not reuse
existing files in `./data` if they originated from a different source
(e.g. GISAID), despite their filenames being identical as the params of
the originating rule are different. I've added a note about this in the
README.

* The filenames produced include "_all-time" as the current Snakemake
workflow requires a "time" wildcard, however we want to remove this
prior to uploading. E.g.
`auspice/avian-flu_h5n1-cattle-outbreak_ha_all-time.json` should be
uploaded to /avian-flu/h5n1-cattle-outbreak/ha to match the URL
structure for the genome build. We can use the following bash one-liner
prior to uploading while this remains unfixed:
```
for i in auspice/avian-flu_h5n1-cattle-outbreak_*_all-time.json; do mv $i ${i%_all-time.json}.json; done
```

* The segment builds may include basal strains which aren't part of the
  cattle-flu outbreak but are included because there aren't mutations
  which distinguish them from others which should be. The "genome_tree"
  colouring is helpful here. (See the note in
  `restrict-via-common-ancestor.py` -- we may want to include more basal
  strains here).
Part of simplifications described in
<#70>

This approach avoids using an empty wildcard for time but introduces the
need to have a separate rule to map the target Auspice dataset JSON
filename (e.g. `auspice/avian-flu_h5n1-cattle-outbreak_pb2.json`) to a
results file with a time wildcard (e.g.
`results/avian-flu_h5n1-cattle-outbreak_pb2_default.json`). The rest of
the pipeline is unchanged.
In preparation for combining the genome snakemake workflow into this one
The myriad usages of lambda functions should go away once we switch to
using config YAMLs.

Now that we have everything in a single pipeline the
h5n1-cattle-outbreak segment builds no longer need to download the
genome tree (which they depend on), they can use it as a normal
dependency and snakemake will build it as needed.
This abstracts out the configuration into two separate YAML files. As a
result the snakemake complexity is reduced and (hopefully) the main
interaction point can now be the YAMLs themselves.

There are a few changes to the behaviour of the h5n1-cattle-outbreak
builds:
  * For individual segment builds we now use the h5n1-cattle-outbreak
    dropped list (previously we used the H5N1 drop list)
  * For individual segment builds we now use the H5NX input data
    (previously we used the H5N1 input data)
  * We no longer remove sequences via a clock filter

There are no changes to the behaviour of the GISAID builds.

The config is generally straightforward except where parameters differ
for a genome build vs the corresponding segment builds. To avoid having
to list the same parameters out 8 times, I implemented (e.g.)
`config.traits.genome_columns` and `config.traits.columns`. This is only
observed in the h5n1-cattle-outbreak config.
The rules in `common.smk` were separated out to reduce rule duplication
between the main Snakefile and Snakefile.genome. The latter has since
been integrated into the main snakefile, and so we do the same with
these "common" rules.
This results in disjoint sets of filenames for the GISAID builds
(config/gisaid.yaml) and the NCBI builds
(config/h5n1-cattle-outbreak.yaml), which therefore allows you to run
each set of builds locally without one interfering with the other.

In addition, the way local-ingest data can be used is streamlined so that
you can achieve the same outcome with local data. The config keys are
similarly streamlined.

Note that if you run (e.g.) GISAID builds using local data then run them
with S3 data all the intermediate files will be regenerated. In other
words you cannot maintain parallel "versions" of these simultaneously.
Makes listing / looking at the results files a more pleasant experience

There should be no changes to behaviour with this commit.
This (mass-renaming) is a companion to the parent commit ("Organise
results into directories"). I chose to keep the filenames unchanged as
it's a common occurrence to be editing files of the same type (e.g.
include lists) for multiple subtypes and thus having the subtype remain
in the filename is helpful.
The pipeline already adds this field to the metadata TSV in-use, but it
won't be exported without this addition to the auspice-config JSON

Note that the clade definitions haven't been regenerated for NCBI data
so there's actually no clades defined at the moment, and thus nothing
is exported.
to reflect the changes made in the previous few commits.

The addition of "genome" to the h5n1-cattle-outbreak config YAML is
needed to make it an explicit output of the `all` rule, and this output
is what's used by the `deploy_all` rule
The `pathogen-repo-build` reusable action adds an extremely helpful
summary describing the AWS run. This adds some similarly helpful info
which should make it much simpler to check the results of a phylo run.
The build has recently expanded in scope with the addition of
segment-level builds resulting in runtime increase from ~10min to
~80min. Parallelisation should speed this up a lot.
@jameshadfield jameshadfield force-pushed the james/snakemake-simplifications branch from 6316021 to e6b9158 Compare July 25, 2024 03:10
@jameshadfield jameshadfield changed the base branch from james/cattle-flu-segments to master July 25, 2024 03:40
jameshadfield added a commit to nextstrain/nextstrain.org that referenced this pull request Jul 25, 2024
See <nextstrain/avian-flu#69> and
<nextstrain/avian-flu#72> for details of the new
datasets themselves.

Segments ordered following the other avian-flu builds in the manifest,
although note that this order differs from the canonical order of
"pb2", "pb1", "pa", "ha","np", "na", "mp", "ns" which is used to order
segments in the avian-flu/h5n1-cattle-outbreak/genome build.
@jameshadfield
Copy link
Member Author

Rebased onto latest master. Merging now, which will result in #69 closing as those commits are in this PR.

@jameshadfield jameshadfield merged commit 74b95ff into master Jul 25, 2024
3 checks passed
@jameshadfield jameshadfield deleted the james/snakemake-simplifications branch July 25, 2024 05:20
jameshadfield added a commit to nextstrain/docker-base that referenced this pull request Jul 25, 2024
Mirrors the change in avian-flu's CI.

<nextstrain/avian-flu#72>
Comment on lines +94 to +106
__clock_std_dev: &clock_std_dev 0.00211 # YAML anchor so we can reference this value below

clock_rates:
FALLBACK:
# The rates for the 8 segments are taken from the GISAID H5N1/2y config
pb2: [0.00287, *clock_std_dev]
pb1: [0.00264, *clock_std_dev]
pa: [0.00248, *clock_std_dev]
ha: [0.00455, *clock_std_dev]
np: [0.00252, *clock_std_dev]
na: [0.00349, *clock_std_dev]
mp: [0.00191, *clock_std_dev]
ns: [0.00249, *clock_std_dev]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameshadfield BTW, you can do this instead to avoid the bogus extra field:

  clock_rates:
    FALLBACK:
      # The rates for the 8 segments are taken from the GISAID H5N1/2y config
      pb2: [0.00287, &clock_std_dev 0.00211]
      pb1: [0.00264, *clock_std_dev]
      pa: [0.00248, *clock_std_dev]
      ha: [0.00455, *clock_std_dev]
      np: [0.00252, *clock_std_dev]
      na: [0.00349, *clock_std_dev]
      mp: [0.00191, *clock_std_dev]
      ns: [0.00249, *clock_std_dev]

Avoiding the bogus field is nice because you don't have to worry about it colliding with anything and if a config schema is used (like we do in ncov), it won't cause validation errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants