-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snakemake simplifications #72
Conversation
a8a639d
to
736b114
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice overall simplification of the Snakemake workflow (at least from my POV).
I left comments for consideration, the main one being the ramifications of updating the CI build-args.
@@ -11,4 +11,4 @@ jobs: | |||
# conform since this is a collaborative repo with an external group. | |||
uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@v0 | |||
with: | |||
build-args: test_target | |||
build-args: --configfile config/gisaid.yaml -pf test_target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, a similar update will need to be applied to docker-base, conda-base, and augur CI workflows.
I am considering if we want to use config/gisiad.yaml
as a default config. Then we wouldn't need to update the build-args in CI and we can keep the simple nextstrain build
invocation without needing to provide a config file.
This can be done in two ways:
- Defining it as a default config in the Snakefile (as suggested by Nextstrain Snakemake style guide).
configfile: "config/gisaid.yaml"
The downside to this method is the config dictionaries merge. This becomes an issue when adding --configfile config/h5n1-cattle-outbreak.yaml
, as the top level builds
would merge and run all subtypes.
- Only define the default config if no configs are provided (similar to mpox).
if not config:
configfile: "config/gisaid.yaml"
The config dictionary merge issue from [1] still exists, but at least this way adding --configfile config/h5n1-cattle-outbreak.yaml
will completely switch over to the h5n1-cattle-outbreak config params. The downside to this method is users need to remember to provide the configfile if they want to override specific configs with --config
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Urgh. Thanks for noticing - this would have been a pain to solve post-merge!
I like the explicitness of providing a --configfile
for each run. But do I like it enough to make changes across 3 extra repos? Probably. Will think about it some more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Only noticed because we recently went through this in #17).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PRs made to update CI in
@@ -22,7 +22,10 @@ segments: | |||
- ns | |||
|
|||
#### Parameters which define the input source #### | |||
s3_src: s3://nextstrain-data-private/files/workflows/avian-flu | |||
s3_src: | |||
name: gisaid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider consolidating s3_src.name
with the the top level ingest_source
param? It's not entirely clear to me why this is a separate config param (except using gisaid
vs fauna
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone a slightly different direction - I've removed ingest_source
so we now have two isolated and mutually exclusive¹ config keys: s3_src
and local_ingest
. I preferred this as the local ingest name (e.g. "andersen-lab") doesn't necessarily overlap with the S3 "name".
¹ As per added comment in the Snakefile, we could relax the exclusivity of S3_SRC
and LOCAL_INGEST
if we want to use --config local_ingest=gisaid
overrides in the future.
736b114
to
6b9bc7c
Compare
6b9bc7c
to
6316021
Compare
Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>
Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>
Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>
This work is in preparation for the next commit which will add functionality which needs to conditionally change the return value based on wildcards.
Individual datasets for all 8 segments are added using NCBI data (consistent with the whole-genome view) and using the strains included in the genome tree to choose the appropriate monophyletic clade in each segment tree to include. There is an added Auspice colouring "genome_tree" which indicates whether the strain exists in the whole genome tree. See top-level README for instructions on how to run. There are a number of important caveats with the current implementation, all of which are fixable but I think better done in future work. * The input source must be NCBI for this build to work as expected and thus the `s3_src` config argument is essential. Snakemake will not reuse existing files in `./data` if they originated from a different source (e.g. GISAID), despite their filenames being identical as the params of the originating rule are different. I've added a note about this in the README. * The filenames produced include "_all-time" as the current Snakemake workflow requires a "time" wildcard, however we want to remove this prior to uploading. E.g. `auspice/avian-flu_h5n1-cattle-outbreak_ha_all-time.json` should be uploaded to /avian-flu/h5n1-cattle-outbreak/ha to match the URL structure for the genome build. We can use the following bash one-liner prior to uploading while this remains unfixed: ``` for i in auspice/avian-flu_h5n1-cattle-outbreak_*_all-time.json; do mv $i ${i%_all-time.json}.json; done ``` * The segment builds may include basal strains which aren't part of the cattle-flu outbreak but are included because there aren't mutations which distinguish them from others which should be. The "genome_tree" colouring is helpful here. (See the note in `restrict-via-common-ancestor.py` -- we may want to include more basal strains here).
Part of simplifications described in <#70> This approach avoids using an empty wildcard for time but introduces the need to have a separate rule to map the target Auspice dataset JSON filename (e.g. `auspice/avian-flu_h5n1-cattle-outbreak_pb2.json`) to a results file with a time wildcard (e.g. `results/avian-flu_h5n1-cattle-outbreak_pb2_default.json`). The rest of the pipeline is unchanged.
In preparation for combining the genome snakemake workflow into this one
The myriad usages of lambda functions should go away once we switch to using config YAMLs. Now that we have everything in a single pipeline the h5n1-cattle-outbreak segment builds no longer need to download the genome tree (which they depend on), they can use it as a normal dependency and snakemake will build it as needed.
This abstracts out the configuration into two separate YAML files. As a result the snakemake complexity is reduced and (hopefully) the main interaction point can now be the YAMLs themselves. There are a few changes to the behaviour of the h5n1-cattle-outbreak builds: * For individual segment builds we now use the h5n1-cattle-outbreak dropped list (previously we used the H5N1 drop list) * For individual segment builds we now use the H5NX input data (previously we used the H5N1 input data) * We no longer remove sequences via a clock filter There are no changes to the behaviour of the GISAID builds. The config is generally straightforward except where parameters differ for a genome build vs the corresponding segment builds. To avoid having to list the same parameters out 8 times, I implemented (e.g.) `config.traits.genome_columns` and `config.traits.columns`. This is only observed in the h5n1-cattle-outbreak config.
The rules in `common.smk` were separated out to reduce rule duplication between the main Snakefile and Snakefile.genome. The latter has since been integrated into the main snakefile, and so we do the same with these "common" rules.
This results in disjoint sets of filenames for the GISAID builds (config/gisaid.yaml) and the NCBI builds (config/h5n1-cattle-outbreak.yaml), which therefore allows you to run each set of builds locally without one interfering with the other. In addition, the way local-ingest data can be used is streamlined so that you can achieve the same outcome with local data. The config keys are similarly streamlined. Note that if you run (e.g.) GISAID builds using local data then run them with S3 data all the intermediate files will be regenerated. In other words you cannot maintain parallel "versions" of these simultaneously.
Makes listing / looking at the results files a more pleasant experience There should be no changes to behaviour with this commit.
This (mass-renaming) is a companion to the parent commit ("Organise results into directories"). I chose to keep the filenames unchanged as it's a common occurrence to be editing files of the same type (e.g. include lists) for multiple subtypes and thus having the subtype remain in the filename is helpful.
The pipeline already adds this field to the metadata TSV in-use, but it won't be exported without this addition to the auspice-config JSON Note that the clade definitions haven't been regenerated for NCBI data so there's actually no clades defined at the moment, and thus nothing is exported.
to reflect the changes made in the previous few commits. The addition of "genome" to the h5n1-cattle-outbreak config YAML is needed to make it an explicit output of the `all` rule, and this output is what's used by the `deploy_all` rule
The `pathogen-repo-build` reusable action adds an extremely helpful summary describing the AWS run. This adds some similarly helpful info which should make it much simpler to check the results of a phylo run.
The build has recently expanded in scope with the addition of segment-level builds resulting in runtime increase from ~10min to ~80min. Parallelisation should speed this up a lot.
6316021
to
e6b9158
Compare
See <nextstrain/avian-flu#69> and <nextstrain/avian-flu#72> for details of the new datasets themselves. Segments ordered following the other avian-flu builds in the manifest, although note that this order differs from the canonical order of "pb2", "pb1", "pa", "ha","np", "na", "mp", "ns" which is used to order segments in the avian-flu/h5n1-cattle-outbreak/genome build.
Rebased onto latest master. Merging now, which will result in #69 closing as those commits are in this PR. |
Mirrors the change in avian-flu's CI. <nextstrain/avian-flu#72>
__clock_std_dev: &clock_std_dev 0.00211 # YAML anchor so we can reference this value below | ||
|
||
clock_rates: | ||
FALLBACK: | ||
# The rates for the 8 segments are taken from the GISAID H5N1/2y config | ||
pb2: [0.00287, *clock_std_dev] | ||
pb1: [0.00264, *clock_std_dev] | ||
pa: [0.00248, *clock_std_dev] | ||
ha: [0.00455, *clock_std_dev] | ||
np: [0.00252, *clock_std_dev] | ||
na: [0.00349, *clock_std_dev] | ||
mp: [0.00191, *clock_std_dev] | ||
ns: [0.00249, *clock_std_dev] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jameshadfield BTW, you can do this instead to avoid the bogus extra field:
clock_rates:
FALLBACK:
# The rates for the 8 segments are taken from the GISAID H5N1/2y config
pb2: [0.00287, &clock_std_dev 0.00211]
pb1: [0.00264, *clock_std_dev]
pa: [0.00248, *clock_std_dev]
ha: [0.00455, *clock_std_dev]
np: [0.00252, *clock_std_dev]
na: [0.00349, *clock_std_dev]
mp: [0.00191, *clock_std_dev]
ns: [0.00249, *clock_std_dev]
Avoiding the bogus field is nice because you don't have to worry about it colliding with anything and if a config schema is used (like we do in ncov), it won't cause validation errors.
See commit messages for details
Still to do
results
to better partition intermediate files into directories. E.g.results/traits_h5nx_pb2_all-time.json
→results/h5nx/pb2/all-time/traits.json
snakemake-simplifications
)NCBI (including ingest)Messed this up, you can't use this for trial builds so I inadvertantly asked for the "snakemake-simplifications" docker image and it failed