[phylo github actions] add summary message #75

@jameshadfield (Member) commented Jul 16, 2024

The pathogen-repo-build reusable action adds an extremely helpful summary describing the AWS run. This adds some similarly helpful info which should make it much simpler to check the results of a phylo run.

Trial run(s):

  • NCBI. Note that the 2nd URL is missing from this run's summary; this has now been fixed.
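
For readers unfamiliar with job summaries: GitHub Actions renders any Markdown appended to the file named by the `GITHUB_STEP_SUMMARY` environment variable. The sketch below shows one way a step could add a list of result URLs. It is not the code from this PR; the helper names, the heading text and the example URLs are all assumptions made for illustration.

```python
import os


def build_summary(title, urls):
    """Return Markdown lines: a heading followed by one bullet per URL."""
    lines = [f"### {title}", ""]
    lines += [f"- <{url}>" for url in urls]
    return lines


def append_step_summary(lines, summary_path=None):
    """Append Markdown lines to the job summary file.

    Falls back to the GITHUB_STEP_SUMMARY environment variable, which
    GitHub Actions sets for each step. Outside of Actions (variable
    unset) nothing is written and the text is only returned.
    """
    text = "\n".join(lines) + "\n"
    path = summary_path or os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(text)
    return text


if __name__ == "__main__":
    # Placeholder URLs, not real build outputs.
    append_step_summary(
        build_summary("Phylo run results", ["https://example.org/build-a"])
    )
```

Appending rather than overwriting matters here, since any other step that writes to the same summary file would otherwise lose its output.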

This abstracts out the configuration into two separate YAML files. As a
result the snakemake complexity is reduced and (hopefully) the main
interaction point can now be the YAMLs themselves.

There are a few changes to the behaviour of the h5n1-cattle-outbreak
builds:
  * For individual segment builds we now use the h5n1-cattle-outbreak
    dropped list (previously we used the H5N1 drop list)
  * For individual segment builds we now use the H5NX input data
    (previously we used the H5N1 input data)
  * We no longer remove sequences via a clock filter

There are no changes to the behaviour of the GISAID builds.

The config is generally straightforward except where parameters differ
for a genome build vs the corresponding segment builds. To avoid having
to list the same parameters out 8 times, I implemented (e.g.)
`config.traits.genome_columns` and `config.traits.columns`. This is only
observed in the h5n1-cattle-outbreak config.
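
A minimal sketch of how such a genome-specific override could be resolved, assuming the workflow looks for a `genome_`-prefixed key first and falls back to the plain key otherwise. The function name `resolve_param` and the example column values are hypothetical and are not taken from the repository.

```python
def resolve_param(config, rule, key, segment):
    """Pick the parameter for a build, preferring a genome-specific override.

    For the "genome" build, config[rule]["genome_<key>"] is used when it
    exists; every other segment (and a genome build with no override)
    falls back to config[rule][key].
    """
    rule_config = config[rule]
    genome_key = f"genome_{key}"
    if segment == "genome" and genome_key in rule_config:
        return rule_config[genome_key]
    return rule_config[key]


# Illustrative values only; the real YAML may list different columns.
example_config = {
    "traits": {
        "columns": "region",
        "genome_columns": "division",
    }
}

genome_columns = resolve_param(example_config, "traits", "columns", "genome")
segment_columns = resolve_param(example_config, "traits", "columns", "ha")
```

With this layout the eight segment builds share a single `columns` entry and only the genome build needs its own key.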
The rules in `common.smk` were separated out to reduce rule duplication
between the main Snakefile and Snakefile.genome. The latter has since
been integrated into the main Snakefile, and so we do the same with
these "common" rules.

This results in disjoint sets of filenames for the GISAID builds
(config/gisaid.yaml) and the NCBI builds
(config/h5n1-cattle-outbreak.yaml), which therefore allows you to run
each set of builds locally without one interfering with the other.

In addition, the way local-ingest data can be used is streamlined so that
you can achieve the same outcome with local data.

Note that if you run (e.g.) GISAID builds using local data and then run
them with S3 data, all the intermediate files will be regenerated. In
other words, you cannot maintain parallel "versions" of these simultaneously.

Makes listing / looking at the results files a more pleasant experience.

There should be no changes to behaviour with this commit.

The pipeline already adds this field to the metadata TSV in use, but it
won't be exported without this addition to the auspice-config JSON.

Note that the clade definitions haven't been regenerated for NCBI data,
so there are actually no clades defined at the moment, and thus nothing
is exported.

to reflect the changes made in the previous few commits.

The addition of "genome" to the h5n1-cattle-outbreak config YAML is
needed to make it an explicit output of the `all` rule, and this output
is what's used by the `deploy_all` rule.
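
To illustrate why the config entry matters, here is a rough sketch that assumes the `all` rule's targets are expanded from the segments listed in the config. The function name and the `auspice/<build>_<segment>.json` filename pattern are hypothetical stand-ins, not the repository's actual paths.

```python
def all_targets(build_name, segments):
    """Expand configured segments into the files an `all`-style rule requests.

    The filename pattern is a placeholder chosen for this example.
    """
    return [f"auspice/{build_name}_{segment}.json" for segment in segments]


# Without "genome" in the list, the genome JSON is never an explicit target,
# so a rule that consumes these outputs (such as a deploy step) would not see it.
without_genome = all_targets("h5n1-cattle-outbreak", ["ha", "na"])
with_genome = all_targets("h5n1-cattle-outbreak", ["genome", "ha", "na"])
```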
@jameshadfield jameshadfield force-pushed the james/snakemake-simplifications branch from 736b114 to 6b9bc7c Compare July 16, 2024 23:24
@jameshadfield
Member Author

Cherry-picked into #72 - thanks for taking a look @joverlee521!

@jameshadfield jameshadfield deleted the james/improved-gha-summaries branch July 17, 2024 02:36