Authors:
Qian Liu ^[Roswell Park Comprehensive Cancer Center],
Another Author^[Roswell Park Comprehensive Cancer Center].
Last modified: July 27, 2023.
- Basic familiarity with DNA-seq data variant calling
- Interest in using a workflow language
The workshop format is a 45-minute session consisting of hands-on demos, exercises, and Q&A.
- ReUseData
- RcwlPipelines
- Rcwl
For somatic variant calling, we will need to prepare the following:
- Experiment data
  - In the format of `.bam`, `.bam.bai` files
- Reusable genomic data
  - Reference sequence file (`b37` or `hg38`)
  - Panel of Normals (PON)
- Software tool
  - Here we use `Mutect2` to call somatic SNVs and indels via local assembly of haplotypes.
We also want the data analysis workflow to be reproducible:
- Software tools properly tracked by version, Docker image, etc.
- Data provenance properly tracked for public data resources, for:
  - workflow reproducibility
  - later reuse in other similar projects
The first can be addressed by workflow languages (e.g., CWL, WDL, Snakemake). There are no similar tools for the second task.
In this workshop, I will demonstrate two Bioconductor packages: `Rcwl`, an R interface for CWL, and `RcwlPipelines`, which provides >200 pre-built bioinformatics tools and best practice pipelines in R that are easily usable and highly customizable. I will also introduce an R/Bioconductor package, `ReUseData`, for the management of reusable genomic data.
With these tools, we should be able to conduct reproducible data analysis using commonly used bioinformatics tools (including command-line based tools and R/Bioconductor packages) and validated, best practice workflows (based on workflow languages such as CWL) within a unified R programming environment.
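To give a feel for what "an R interface for CWL" means in practice, here is a minimal sketch that wraps the shell command `echo` as a CWL tool with `Rcwl`. It is only an illustration, not part of the somatic variant calling workflow itself; the input id `sth` and the message are arbitrary names chosen for the example.

```r
library(Rcwl)

# Declare one input parameter; the id "sth" is an arbitrary, illustrative name.
input1 <- InputParam(id = "sth")

# Wrap the shell command `echo` as a CWL tool with that single input.
echo <- cwlProcess(baseCommand = "echo",
                   inputs = InputParamList(input1))

# Assign a value to the input, as one would before running the tool.
echo$sth <- "Hello World!"

# Printing the object shows the tool definition that Rcwl has built.
echo
```

Actually executing the tool, for example with `runCWL(echo, outdir = tempdir())`, additionally assumes that a CWL runner such as `cwltool` is installed on the system.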