This workflow is intended to formally replace ad hoc whole genome sequencing validation workflows that rely on manual or cumbersome invocations of hap.py or equivalent tooling. Separate but analogous support is provided for SNVs and SVs. Because SV validation is not standardized across the field, several implementations of SV validation heuristics are supported, mimicking the behavior of various prominent publications.
New global targets should be added in `workflow/Snakefile`. Content in `workflow/Snakefile` and the snakefiles in `workflow/rules` should consist specifically of rules; python infrastructure should be composed as subroutines under `lib/` and constructed in such a manner as to be testable with pytest. Rules can call embedded scripts (in python or R/Rmd) from `workflow/scripts`; again, these should be constructed to be testable with pytest or testthat.
- Lightning Auriga (@lightning-auriga)
- Clone this repository to your local system, into the place where you want to perform the data analysis.
```bash
git clone git@github.com:UCI-GREGoR/wgs-validation.git
```
Note that this requires local git ssh key configuration; see here for instructions as required.
Configure the workflow according to your needs by editing the files in the `config/` folder. Adjust `config.yaml` to configure the workflow execution, and `manifest.tsv` to specify your sample setup.

The following settings are recognized in `config/config.yaml`.
| Configuration Setting | Description |
| --- | --- |
| `experiment-manifest` | relative path to manifest of experimental vcfs |
| `reference-manifest` | relative path to manifest of reference (i.e. "gold standard") vcfs |
| `comparisons-manifest` | relative path to manifest of desired experimental/reference comparisons |
| `happy-bedfiles-per-stratification` | how many stratification region sets should be dispatched to a single hap.py job. hap.py is a resource hog, and assigning even a relatively small number of stratification sets to the same run can cause its resource usage to explode. a setting of no more than 6 has worked in the past, though that was in a different environment |
| `sv-settings` | configuration settings for SV comparisons and SV-specific tools, containing the following sub-keys |
| `sv-settings/merge-experimental-before-comparison` | whether to use SVDB to combine variants within a single experimental sample vcf before comparison |
| `sv-settings/merge-reference-before-comparison` | whether to use SVDB to combine variants within a single reference sample vcf before comparison |
| `sv-settings/svanalyzer` | settings specific to `svanalyzer`; see the svanalyzer project for parameter documentation |
| `sv-settings/svdb` | settings specific to `svdb`; see the svdb project for parameter documentation |
| `sv-settings/sveval` | settings specific to `sveval`; see the sveval project for parameter documentation |
| `genome-build` | desired genome reference build for the comparisons, referenced by aliases specified in the `genomes` block |
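For orientation, a minimal top-level `config/config.yaml` might look like the following sketch. The manifest paths match the defaults referenced below; the remaining values are illustrative only, and the tool-specific sub-keys under `svanalyzer`, `svdb`, and `sveval` are omitted, as they follow the respective tools' parameter documentation.

```yaml
# minimal sketch of config/config.yaml; values are illustrative
experiment-manifest: "config/manifest_experiment.tsv"
reference-manifest: "config/manifest_reference.tsv"
comparisons-manifest: "config/manifest_comparisons.tsv"
happy-bedfiles-per-stratification: 4   # keep small; large values inflate hap.py resource usage
sv-settings:
  merge-experimental-before-comparison: true
  merge-reference-before-comparison: false
genome-build: "grch38"   # alias defined under the genomes block, described below
```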
The `genomes` block of the configuration file contains an arbitrary set of genome specifications. The intention is that the blocks specified under this tag are assigned keys such as `grch38`, `grch37`, etc. The following settings are specified under each block:
| Configuration Setting | Description |
| --- | --- |
| `fasta` | path to the genome fasta corresponding to this build. can be a path to a local file, an http/ftp link, or an s3 path |
| `confident-regions` | arbitrarily many bedfiles describing a confident background on which comparison should be evaluated. the names under `confident-regions` are unique identifiers labeling the region type, and each contains the following key/value pairs |
| `confident-regions/{name}/bed` | bed regions in which to compute comparison metrics. high-confidence GIAB background bedfiles can be specified here |
| `confident-regions/{name}/inclusion` | (optional) a regex to match against the experimental `replicate` entry in the manifest (see below). only reports containing samples matching this pattern will be run against these confident regions. if not included, this region is used for all reports |
| `stratification-regions` | intended to be the GIAB stratification regions, as described here. the remote directory will be mirrored locally. these entries are specified as follows |
| `stratification-regions/ftp` | the ftp hosting site |
| `stratification-regions/dir` | the subdirectory of the ftp hosting site, through the genome build directory |
| `stratification-regions/region-labels` | a manifest of GA4GH-style stratification regions, with the short name of each region and a corresponding human-legible description of its content. see below for how these should be selected in the workflow |
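By way of illustration, a hypothetical `genomes` block might look like the following. All paths, region names, and ftp coordinates below are placeholders rather than verified values; consult the GIAB stratification hosting site for the actual `ftp` and `dir` settings.

```yaml
# hypothetical genomes block; all paths and ftp coordinates are placeholders
genomes:
  grch38:
    fasta: "s3://example-bucket/GRCh38.fasta"   # local path, http/ftp link, or s3 path
    confident-regions:
      giab-high-confidence:                     # arbitrary unique identifier for the region type
        bed: "resources/HG001_GRCh38_highconf.bed"
        inclusion: ".*NA12878.*"                # optional: only applied to matching replicates
      whole-genome:
        bed: "resources/grch38_whole_genome.bed"  # no inclusion key: used for all reports
    stratification-regions:
      ftp: "ftp.example.org"                       # the ftp hosting site
      dir: "genome-stratifications/v3.1/GRCh38"    # subdirectory through the genome build
      region-labels: "config/stratification_region_labels.tsv"
      # region-inclusions may also be specified here; see the discussion below
```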
hap.py and similar tools are designed to compute both overall metrics and metrics within specific subdivisions of the genome. These divisions can be anything, but are most commonly framed as genome stratifications containing different types of variation that can be challenging for WGS calling tools.
The GA4GH stratification regions referenced above are numerous, and many of those regions aren't particularly interesting or interpretable. Furthermore, hap.py is resource-intensive, and computing superfluous comparisons that won't be analyzed is undesirable. As such, this workflow requires the user to select individual stratification sets for inclusion in the output reports. The inclusions are specified as key-value pairs in `config/config.yaml` under the configuration key `genomes/{genome-build}/stratification-regions/region-inclusions`. The keys are names of regions as specified in the corresponding `region-labels` manifest; the values are regular expressions describing the sample identifiers that should be evaluated within those regions. To select all samples, use the regular expression `"^.*$"`.
As an example, the following would evaluate the whole genome background, annotated as `*` in the GA4GH v3.1 region set, against all samples, while the `alldifficultregions` set would only be evaluated in samples labeled (in part) NA12878:
```yaml
genomes:
  grch38:
    stratification-regions:
      region-inclusions:
        "*": ".*"
        "alldifficultregions": ".*NA12878.*"
```
The following columns are expected in the experiment manifest, by default at `config/manifest_experiment.tsv`:
| Manifest Entry | Description |
| --- | --- |
| `experimental_dataset` | arbitrary identifier for this dataset |
| `replicate` | identifier linking experimental subjects representing the same underlying sample and conditions. this identifier will be used to collapse multiple subjects into single mean/SE estimates in the downstream report, if multiple subjects with the same identifier are included in the same report |
| `vcf` | path to experimental dataset vcf |
Note that for the experimental manifest, multiple rows with the same `experimental_dataset` value can be included with different corresponding vcfs. If this is the case, the multiple vcfs will be concatenated and sorted into a single vcf before validation. The intended use case of this functionality is on-the-fly concatenation of single chromosome vcfs per sample.
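As a concrete sketch (dataset names and vcf paths are invented; columns are tab-delimited), the first two rows below share an `experimental_dataset` value, so their per-chromosome vcfs would be concatenated and sorted into a single vcf before validation:

```
experimental_dataset	replicate	vcf
caller-A-v1.0	NA12878-run1	/data/caller-A/NA12878.chr1.vcf.gz
caller-A-v1.0	NA12878-run1	/data/caller-A/NA12878.chr2.vcf.gz
caller-B-v2.1	NA12878-run1	/data/caller-B/NA12878.vcf.gz
```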
The following columns are expected in the reference manifest, by default at `config/manifest_reference.tsv`:
| Manifest Entry | Description |
| --- | --- |
| `reference_dataset` | arbitrary, unique alias for this reference dataset |
| `vcf` | path to reference dataset vcf |
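A correspondingly minimal reference manifest sketch, with an invented alias and path:

```
reference_dataset	vcf
giab-NA12878-benchmark	/data/references/NA12878_GRCh38_benchmark.vcf.gz
```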
The following columns are expected in the comparisons manifest, by default at `config/manifest_comparisons.tsv`:
| Manifest Entry | Description |
| --- | --- |
| `experimental_dataset` | experimental dataset for this comparison, referenced by its unique alias |
| `reference_dataset` | reference dataset for this comparison, referenced by its unique alias |
| `report` | unique identifier labeling which report this comparison should be included in. multiple reports can be specified in a comma-delimited list |
Note that the entries in individual columns of the comparisons manifest are not intended to be unique, so multiple comparisons involving the same file are expected.
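Continuing the invented aliases from the sketches above, the following comparisons manifest illustrates both the reuse of a reference dataset across rows and a comma-delimited `report` list:

```
experimental_dataset	reference_dataset	report
caller-A-v1.0	giab-NA12878-benchmark	main-report
caller-B-v2.1	giab-NA12878-benchmark	main-report,caller-B-only
```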
Install Snakemake using conda:
```bash
conda create -c bioconda -c conda-forge -n snakemake snakemake
```
For installation details, see the instructions in the Snakemake documentation.
Activate the conda environment:
```bash
conda activate snakemake
```
Test your configuration by performing a dry-run via
```bash
snakemake --use-conda -n
```
Execute the workflow locally via

```bash
snakemake --use-conda --cores $N
```

using `$N` cores, or run it in a cluster environment via

```bash
snakemake --use-conda --profile sge-profile --cluster-config config/cluster.yaml --jobs 100
```
See the Snakemake documentation for further details.
Snakemake interfaces with job schedulers via cluster profiles. For running jobs on SGE, you can use the cookiecutter template here.
After the completion of a run, control validation reports, in html format, will be available at `results/reports`. One report will exist per unique identifier in the `report` column of `config/manifest_comparisons.tsv`.
The contents of the report are:
- tabular summaries of requested stratification regions from the configuration
- plots of requested stratification regions from the configuration
- summary information about the R execution environment used to create the report
Other information will be included in future versions.
Whenever you change something, don't forget to commit the changes back to your github copy of the repository:
```bash
git commit -a
git push
```
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
- Once, register the upstream repository in your local copy: `git remote add -f upstream git@github.com:UCI-GREGoR/wgs-validation.git`, or `git remote add -f upstream https://github.com/UCI-GREGoR/wgs-validation.git` if you do not have ssh keys set up.
- Update the upstream version: `git fetch upstream`.
- Create a diff with the current version: `git diff HEAD upstream/default workflow > upstream-changes.diff`.
- Investigate the changes: `vim upstream-changes.diff`.
- Apply the modified diff via: `git apply upstream-changes.diff`.
- Carefully check whether you need to update the config files: `git diff HEAD upstream/default config`. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
In case you have also changed or added steps, please consider contributing them back to the original repository. This project follows git flow; feature branches off of dev are welcome.
- Clone the fork to your local system, to a different place than where you ran your analysis.
- Check out a branch off of dev:

```bash
git fetch
git checkout dev
git checkout -b your-new-branch
```
- Make whatever changes you like on your feature branch.
- Commit and push your changes to your branch.
- Create a pull request against dev.
Testing infrastructure for embedded python and R scripts is installed under `lib/` and `workflow/scripts/`. Additional testing coverage for the Snakemake infrastructure itself should be added once the workflow is more mature (see here).

The testing under `lib/` is currently functional. Partial testing exists for the builtin scripts under `workflow/scripts`: the new utilities for this implementation are tested, but some code inherited from the legacy pipeline(s) is not yet covered. To run the tests, do the following (from the top level):
```bash
mamba install pytest-cov
pytest --cov=lib --cov=workflow/scripts lib workflow/scripts
```
The R testing under `workflow/scripts` is currently functional. The tests can be run with the utility script `run_tests.R`:

```bash
Rscript ./run_tests.R
```
To execute the above command, the environment must have an instance of R with appropriate libraries:
```bash
mamba install -c bioconda -c conda-forge "r-base>=4" r-testthat r-covr r-r.utils r-desctools
```