Snakemake simplifications #72

jameshadfield · 2024-07-15T04:15:49Z

See commit messages for details

Still to do

Change filenames in results to better partition intermediate files into directories. E.g. results/traits_h5nx_pb2_all-time.json → results/h5nx/pb2/all-time/traits.json
Update GitHub actions to use new invocation syntax (CI now passes)
Trial builds (trial name: snakemake-simplifications)
- ~~NCBI (including ingest)~~ Messed this up, you can't use this for trial builds so I inadvertantly asked for the "snakemake-simplifications" docker image and it failed
- GISAID (phylo only)
  - Builds at https://next.nextstrain.org/staging/avian-flu/trials/snakemake-simplifications/avian-flu/h5n1/ha/2y, https://nextstrain.org/staging/avian-flu/trials/snakemake-simplifications/avian-flu/h7n9/pb2/all-time (etc)
- NCBI (phylo only)

joverlee521

Nice overall simplification of the Snakemake workflow (at least from my POV).

I left comments for consideration, the main one being the ramifications of updating the CI build-args.

rules/cattle-flu.smk

Snakefile

joverlee521 · 2024-07-16T19:37:35Z

.github/workflows/ci.yaml

@@ -11,4 +11,4 @@ jobs:
    # conform since this is a collaborative repo with an external group.
    uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@v0
    with:
-      build-args: test_target
+      build-args: --configfile config/gisaid.yaml -pf test_target


FYI, a similar update will need to be applied to docker-base, conda-base, and augur CI workflows.

I am considering if we want to use config/gisiad.yaml as a default config. Then we wouldn't need to update the build-args in CI and we can keep the simple nextstrain build invocation without needing to provide a config file.

This can be done in two ways:

Defining it as a default config in the Snakefile (as suggested by Nextstrain Snakemake style guide).

configfile: "config/gisaid.yaml"

The downside to this method is the config dictionaries merge. This becomes an issue when adding --configfile config/h5n1-cattle-outbreak.yaml , as the top level builds would merge and run all subtypes.

Only define the default config if no configs are provided (similar to mpox).

if not config: configfile: "config/gisaid.yaml"

The config dictionary merge issue from [1] still exists, but at least this way adding --configfile config/h5n1-cattle-outbreak.yaml will completely switch over to the h5n1-cattle-outbreak config params. The downside to this method is users need to remember to provide the configfile if they want to override specific configs with --config.

Urgh. Thanks for noticing - this would have been a pain to solve post-merge!

I like the explicitness of providing a --configfile for each run. But do I like it enough to make changes across 3 extra repos? Probably. Will think about it some more.

(Only noticed because we recently went through this in #17).

PRs made to update CI in

augur

conda-base

docker-base

config/gisaid.yaml

joverlee521 · 2024-07-16T20:22:03Z

config/gisaid.yaml

@@ -22,7 +22,10 @@ segments:
  - ns

 #### Parameters which define the input source ####
-s3_src: s3://nextstrain-data-private/files/workflows/avian-flu
+s3_src:
+  name: gisaid


Consider consolidating s3_src.name with the the top level ingest_source param? It's not entirely clear to me why this is a separate config param (except using gisaid vs fauna)

I've gone a slightly different direction - I've removed ingest_source so we now have two isolated and mutually exclusive¹ config keys: s3_src and local_ingest. I preferred this as the local ingest name (e.g. "andersen-lab") doesn't necessarily overlap with the S3 "name".

¹ As per added comment in the Snakefile, we could relax the exclusivity of S3_SRC and LOCAL_INGEST if we want to use --config local_ingest=gisaid overrides in the future.

Snakefile

Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>

This work is in preparation for the next commit which will add functionality which needs to conditionally change the return value based on wildcards.

Individual datasets for all 8 segments are added using NCBI data (consistent with the whole-genome view) and using the strains included in the genome tree to choose the appropriate monophyletic clade in each segment tree to include. There is an added Auspice colouring "genome_tree" which indicates whether the strain exists in the whole genome tree. See top-level README for instructions on how to run. There are a number of important caveats with the current implementation, all of which are fixable but I think better done in future work. * The input source must be NCBI for this build to work as expected and thus the `s3_src` config argument is essential. Snakemake will not reuse existing files in `./data` if they originated from a different source (e.g. GISAID), despite their filenames being identical as the params of the originating rule are different. I've added a note about this in the README. * The filenames produced include "_all-time" as the current Snakemake workflow requires a "time" wildcard, however we want to remove this prior to uploading. E.g. `auspice/avian-flu_h5n1-cattle-outbreak_ha_all-time.json` should be uploaded to /avian-flu/h5n1-cattle-outbreak/ha to match the URL structure for the genome build. We can use the following bash one-liner prior to uploading while this remains unfixed: ``` for i in auspice/avian-flu_h5n1-cattle-outbreak_*_all-time.json; do mv $i ${i%_all-time.json}.json; done ``` * The segment builds may include basal strains which aren't part of the cattle-flu outbreak but are included because there aren't mutations which distinguish them from others which should be. The "genome_tree" colouring is helpful here. (See the note in `restrict-via-common-ancestor.py` -- we may want to include more basal strains here).

Part of simplifications described in <#70> This approach avoids using an empty wildcard for time but introduces the need to have a separate rule to map the target Auspice dataset JSON filename (e.g. `auspice/avian-flu_h5n1-cattle-outbreak_pb2.json`) to a results file with a time wildcard (e.g. `results/avian-flu_h5n1-cattle-outbreak_pb2_default.json`). The rest of the pipeline is unchanged.

In preparation for combining the genome snakemake workflow into this one

The myriad usages of lambda functions should go away once we switch to using config YAMLs. Now that we have everything in a single pipeline the h5n1-cattle-outbreak segment builds no longer need to download the genome tree (which they depend on), they can use it as a normal dependency and snakemake will build it as needed.

This abstracts out the configuration into two separate YAML files. As a result the snakemake complexity is reduced and (hopefully) the main interaction point can now be the YAMLs themselves. There are a few changes to the behaviour of the h5n1-cattle-outbreak builds: * For individual segment builds we now use the h5n1-cattle-outbreak dropped list (previously we used the H5N1 drop list) * For individual segment builds we now use the H5NX input data (previously we used the H5N1 input data) * We no longer remove sequences via a clock filter There are no changes to the behaviour of the GISAID builds. The config is generally straightforward except where parameters differ for a genome build vs the corresponding segment builds. To avoid having to list the same parameters out 8 times, I implemented (e.g.) `config.traits.genome_columns` and `config.traits.columns`. This is only observed in the h5n1-cattle-outbreak config.

The rules in `common.smk` were separated out to reduce rule duplication between the main Snakefile and Snakefile.genome. The latter has since been integrated into the main snakefile, and so we do the same with these "common" rules.

This results in disjoint sets of filenames for the GISAID builds (config/gisaid.yaml) and the NCBI builds (config/h5n1-cattle-outbreak.yaml), which therefore allows you to run each set of builds locally without one interfering with the other. In addition, the way local-ingest data can be used is streamlined so that you can achieve the same outcome with local data. The config keys are similarly streamlined. Note that if you run (e.g.) GISAID builds using local data then run them with S3 data all the intermediate files will be regenerated. In other words you cannot maintain parallel "versions" of these simultaneously.

Makes listing / looking at the results files a more pleasant experience There should be no changes to behaviour with this commit.

This (mass-renaming) is a companion to the parent commit ("Organise results into directories"). I chose to keep the filenames unchanged as it's a common occurrence to be editing files of the same type (e.g. include lists) for multiple subtypes and thus having the subtype remain in the filename is helpful.

The pipeline already adds this field to the metadata TSV in-use, but it won't be exported without this addition to the auspice-config JSON Note that the clade definitions haven't been regenerated for NCBI data so there's actually no clades defined at the moment, and thus nothing is exported.

to reflect the changes made in the previous few commits. The addition of "genome" to the h5n1-cattle-outbreak config YAML is needed to make it an explicit output of the `all` rule, and this output is what's used by the `deploy_all` rule

The `pathogen-repo-build` reusable action adds an extremely helpful summary describing the AWS run. This adds some similarly helpful info which should make it much simpler to check the results of a phylo run.

The build has recently expanded in scope with the addition of segment-level builds resulting in runtime increase from ~10min to ~80min. Parallelisation should speed this up a lot.

See <nextstrain/avian-flu#69> and <nextstrain/avian-flu#72> for details of the new datasets themselves. Segments ordered following the other avian-flu builds in the manifest, although note that this order differs from the canonical order of "pb2", "pb1", "pa", "ha","np", "na", "mp", "ns" which is used to order segments in the avian-flu/h5n1-cattle-outbreak/genome build.

jameshadfield · 2024-07-25T05:20:43Z

Rebased onto latest master. Merging now, which will result in #69 closing as those commits are in this PR.

Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>

tsibley · 2024-10-16T05:01:08Z

config/h5n1-cattle-outbreak.yaml

+  __clock_std_dev:  &clock_std_dev  0.00211 # YAML anchor so we can reference this value below
+
+  clock_rates:
+    FALLBACK:
+      # The rates for the 8 segments are taken from the GISAID H5N1/2y config
+      pb2: [0.00287, *clock_std_dev]
+      pb1: [0.00264, *clock_std_dev]
+      pa: [0.00248, *clock_std_dev]
+      ha: [0.00455, *clock_std_dev]
+      np: [0.00252, *clock_std_dev]
+      na: [0.00349, *clock_std_dev]
+      mp: [0.00191, *clock_std_dev]
+      ns: [0.00249, *clock_std_dev]


@jameshadfield BTW, you can do this instead to avoid the bogus extra field:

clock_rates: FALLBACK: # The rates for the 8 segments are taken from the GISAID H5N1/2y config pb2: [0.00287, &clock_std_dev 0.00211] pb1: [0.00264, *clock_std_dev] pa: [0.00248, *clock_std_dev] ha: [0.00455, *clock_std_dev] np: [0.00252, *clock_std_dev] na: [0.00349, *clock_std_dev] mp: [0.00191, *clock_std_dev] ns: [0.00249, *clock_std_dev]

Avoiding the bogus field is nice because you don't have to worry about it colliding with anything and if a config schema is used (like we do in ncov), it won't cause validation errors.

jameshadfield changed the base branch from master to james/cattle-flu-segments July 15, 2024 04:16

jameshadfield force-pushed the james/snakemake-simplifications branch 2 times, most recently from a8a639d to 736b114 Compare July 16, 2024 02:22

jameshadfield marked this pull request as ready for review July 16, 2024 02:31

jameshadfield requested a review from joverlee521 July 16, 2024 02:31

joverlee521 reviewed Jul 16, 2024

View reviewed changes

jameshadfield force-pushed the james/snakemake-simplifications branch from 736b114 to 6b9bc7c Compare July 16, 2024 23:24

jameshadfield mentioned this pull request Jul 16, 2024

[phylo github actions] add summary message #75

Closed

jameshadfield force-pushed the james/snakemake-simplifications branch from 6b9bc7c to 6316021 Compare July 17, 2024 02:35

jameshadfield added a commit to nextstrain/augur that referenced this pull request Jul 18, 2024

Update avian-flu CI

0bf8e5a

Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>

jameshadfield mentioned this pull request Jul 18, 2024

Update avian-flu CI nextstrain/augur#1552

Merged

jameshadfield added a commit to nextstrain/docker-base that referenced this pull request Jul 18, 2024

Update avian-flu CI

e806b13

Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>

jameshadfield mentioned this pull request Jul 18, 2024

Update avian flu ci nextstrain/docker-base#224

Merged

jameshadfield added a commit to nextstrain/conda-base that referenced this pull request Jul 18, 2024

Update avian-flu CI

595e589

Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>

jameshadfield mentioned this pull request Jul 18, 2024

Update avian-flu CI nextstrain/conda-base#86

Merged

jameshadfield added 15 commits July 25, 2024 15:09

multi-threaded alignment & tree building

a8bf1c3

refactor some inputs to functions

eb7c363

This work is in preparation for the next commit which will add functionality which needs to conditionally change the return value based on wildcards.

Refactor clock-rate params

d60bfca

In preparation for combining the genome snakemake workflow into this one

Integrate common.smk into Snakefile

b5f84e2

The rules in `common.smk` were separated out to reduce rule duplication between the main Snakefile and Snakefile.genome. The latter has since been integrated into the main snakefile, and so we do the same with these "common" rules.

Organise results into directories

ff3b2ab

Makes listing / looking at the results files a more pleasant experience There should be no changes to behaviour with this commit.

Update GitHub Actions invocations

cfefd2d

to reflect the changes made in the previous few commits. The addition of "genome" to the h5n1-cattle-outbreak config YAML is needed to make it an explicit output of the `all` rule, and this output is what's used by the `deploy_all` rule

[phylo github actions] add summary message

ab3ef2e

The `pathogen-repo-build` reusable action adds an extremely helpful summary describing the AWS run. This adds some similarly helpful info which should make it much simpler to check the results of a phylo run.

Run h5n1-cattle-outbreak builds on AWS Batch

e6b9158

The build has recently expanded in scope with the addition of segment-level builds resulting in runtime increase from ~10min to ~80min. Parallelisation should speed this up a lot.

jameshadfield force-pushed the james/snakemake-simplifications branch from 6316021 to e6b9158 Compare July 25, 2024 03:10

jameshadfield changed the base branch from james/cattle-flu-segments to master July 25, 2024 03:40

jameshadfield mentioned this pull request Jul 25, 2024

Update manifest with h5n1-cattle-outbreak segments nextstrain/nextstrain.org#957

Merged

3 tasks

jameshadfield merged commit 74b95ff into master Jul 25, 2024
3 checks passed

jameshadfield deleted the james/snakemake-simplifications branch July 25, 2024 05:20

jameshadfield added a commit to nextstrain/docker-base that referenced this pull request Jul 25, 2024

Update avian-flu CI

80b7af1

Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>

jameshadfield mentioned this pull request Jul 25, 2024

Segment builds for H5N1 cattle flu outbreak #69

Closed

1 task

joverlee521 mentioned this pull request Aug 9, 2024

Snakefile: Conditionally include --no-sign-request option #49

Merged

1 task

tsibley reviewed Oct 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Snakemake simplifications #72

Snakemake simplifications #72

jameshadfield commented Jul 15, 2024 •

edited

Loading

joverlee521 left a comment •

edited

Loading

joverlee521 Jul 16, 2024

jameshadfield Jul 16, 2024

joverlee521 Jul 16, 2024

jameshadfield Jul 18, 2024

joverlee521 Jul 16, 2024

jameshadfield Jul 16, 2024 •

edited

Loading

jameshadfield commented Jul 25, 2024

tsibley Oct 16, 2024

Snakemake simplifications #72

Snakemake simplifications #72

Conversation

jameshadfield commented Jul 15, 2024 • edited Loading

Still to do

joverlee521 left a comment • edited Loading

Choose a reason for hiding this comment

joverlee521 Jul 16, 2024

Choose a reason for hiding this comment

jameshadfield Jul 16, 2024

Choose a reason for hiding this comment

joverlee521 Jul 16, 2024

Choose a reason for hiding this comment

jameshadfield Jul 18, 2024

Choose a reason for hiding this comment

joverlee521 Jul 16, 2024

Choose a reason for hiding this comment

jameshadfield Jul 16, 2024 • edited Loading

Choose a reason for hiding this comment

jameshadfield commented Jul 25, 2024

tsibley Oct 16, 2024

Choose a reason for hiding this comment

jameshadfield commented Jul 15, 2024 •

edited

Loading

joverlee521 left a comment •

edited

Loading

jameshadfield Jul 16, 2024 •

edited

Loading