Add ingest #10

Merged
merged 17 commits into from
Feb 14, 2024
ingest/README.md: remove Profiles, change config to defaults
kimandrews committed Feb 10, 2024
commit 6976bd4f9cdc6ab5503ac26828a7d94157e5cb9c
14 changes: 3 additions & 11 deletions ingest/README.md
@@ -19,11 +19,11 @@ This produces a `results` directory with the following outputs:
- sequences.fasta
- metadata.tsv
Member

It also includes `all_metadata.tsv`.

In phylo workflows we have {data,results,auspice}, and it's understood that the files in `results` can be numerous and change frequently with workflow updates. For ingest we only have {data,results}. My understanding is that more complex ingest workflows will populate `results` with many files, so maybe we could change the wording here to indicate that, of the files in `results`, these two are the ones that should be used for downstream analysis.

Contributor

Ah, `all_metadata.tsv` looks like an intermediate data file, not the final `metadata.tsv` result.

@kimandrews, you can change the path from `results/all_metadata.tsv` to `data/all_metadata.tsv` so there's less confusion about what the final output files should be (e.g. `results/sequences.fasta` and `results/metadata.tsv`).

rule subset_metadata:
    input:
        metadata="results/all_metadata.tsv",
    output:
        subset_metadata="results/metadata.tsv",
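
A minimal sketch of how the rule above might look after the suggested path change. It assumes only the input path moves from `results/` to `data/`. The rule that writes `all_metadata.tsv` is not shown on this page and would need its output path updated to match; any other directives of `subset_metadata` are omitted here and assumed unchanged.

```snakemake
rule subset_metadata:
    input:
        # Assumed new location: the intermediate file lives under data/
        metadata="data/all_metadata.tsv",
    output:
        # The final output stays under results/
        subset_metadata="results/metadata.tsv",
    # Remaining directives of the rule are not shown in the review
    # and would stay as they are.
```

With this layout, `results/` would hold only the files intended for downstream analysis, which is the distinction the comments above are asking for.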

Contributor Author

Done in f3fe0c1


-## Config
+## Defaults

-The config directory contains all of the default configurations for the ingest workflow.
+The defaults directory contains all of the default configurations for the ingest workflow.

-[config/defaults.yaml](config/defaults.yaml) contains all of the default configuration parameters
+[defaults/config.yaml](defaults/config.yaml) contains all of the default configuration parameters
used for the ingest workflow. Use Snakemake's `--configfile`/`--config`
options to override these default values.

@@ -34,14 +34,6 @@ The modules of the workflow are in separate files to keep the main ingest [Snake
Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

-## Profiles
-
-The profiles directory contains custom configs and rules that override and/or
-extend the default workflow.
-
-- [nextstrain_automation](profiles/nextstrain_automation/) - profile for the internal automated Nextstrain builds.
-
-
## Vendored

This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo)