
molgenis-c5-TumorNormal

The main goal is to detect variants in tumor-normal sample combinations.

Build status

The build status currently only reflects whether molgenis-compute is able to generate the scripts; a complete install/run of the workflow is not tested.

Branches

  • master
  • devel

Install

Check out the .travis.yml and the test/*.sh files.

Prerequisites:

  • Lmod
  • EasyBuild
  • Molgenis-Compute
  • All the other software: installs using EasyBuild (see the sketch after this list).
  • Slurm scheduler, or a machine large enough to run locally (not tested); roughly ~30 GB memory, ~1 TB storage and 8 (Intel) CPUs.
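
A minimal sketch of the EasyBuild/Lmod-based install, assuming easyconfig names patterned on the Versions table below (the exact easyconfig names are an assumption):

    # Hedged example: install a dependency and its toolchain with EasyBuild,
    # then load it through Lmod. The easyconfig name is an assumption.
    eb SAMtools-1.9-GCCcore-7.3.0.eb --robot
    module load SAMtools/1.9-GCCcore-7.3.0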

Running

Prepare a samplesheet similar to Samplesheet.csv, update the settings to match your dataset and run it using:


 Use: bash GenerateScripts.sh [none|exome|rna|nugene|nugrna|iont|withpoly|lexo] samplesheet projectname targetsList <nugeneProbebed/iontAmpliconBed>
     [none|exome|rna|nugene|nugrna|iont|withpoly|lexo]
                    Application to use for sequencing.
     samplesheet    CSV file describing the fastq files and samples to be analysed.
     projectname    Name to tag the project: this decides the header in your batch files and the location where the job files are generated.
     targetsList    Interval_list-formatted file describing the target regions; mutations are reported in and near these regions.
     <nugeneProbebed/iontAmpliconBed>
                    Specific files for the special workflows.
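
For example, a hypothetical exome run could look like this (the samplesheet, project name and interval_list path are placeholders):

    # Hypothetical exome invocation; all file paths are placeholders.
    bash GenerateScripts.sh exome Samplesheet.csv myproject \
        /path/to/targets.interval_list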

Methods

Variant Calling

The variant calling pipeline is an adaptation of the GATK workflow (DePristo et al. 2011, Van der Auwera et al. 2013), with molgenis-compute (Byelas et al. 2013) as workflow management software.

Alignment of reads was done using BWA (Li, H. 2013) and the Genome Analysis Toolkit (abbr. GATK) (McKenna et al. 2010), using the human genome reference build GRCh37 with decoys from the GATK bundle. Picard Tools was used for format conversion and marking duplicates. As variant caller this pipeline uses the HaplotypeCaller to call all the samples of the same patient/cohort together. Annotation of the variants was done using SnpEff (Cingolani P, Platts A, Wang le L et al. 2012) / SnpSift (Cingolani P, Patel VM, Coon M et al. 2012) with the Ensembl release 75 (Flicek, P et al. 2014) gene annotations and the dbNSFP2.7 database (Liu, X et al. 2011, Liu, X et al. 2013); the GATK was used to add variant annotations from the dbSNP 138 (Sherry, ST et al. 2001), Cosmic v72 (Forbes et al. 2014), 1000 Genomes phase 3 (1000 Genomes Project Consortium et al. 2012) and ExAC 0.3 (Lek et al. 2016) databases. The data was filtered on quality metrics similar to the GATK recommendations (described below) and with custom filters for population frequency and variant effect. For code and exact versioning see: https://github.com/mmterpstra/molgenis-c5-TumorNormal

CNV Analysis

Copy number variation data were generated using SAMtools and VarScan (Li et al. 2009, Koboldt et al. 2012). The GC-content normalised log2 fold change data was generated using data of the tumor sample and the matched normal sample, with a minimal segment size of 2000 bp, a maximal segment size of 5000 bp, a minimal base-wise coverage of 1, and only alignments with a mapping quality greater than 40 used to calculate the fold changes; otherwise the default settings are assumed. The segment calling was done with the DNAcopy algorithm (Olshen et al. 2004). In addition to the default settings, the circular binary segmentation (CBS) calls were merged with adjacent segments if they had an SD of < 0.5, and weights were added based on weight = mean(normal/sample) + delta(normal, sample), where normal = mean coverage per base in the normal control and sample = mean coverage per base in the described sample. The results were plotted using R after using Perl for format conversion.
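
A minimal sketch of how the VarScan copynumber step described above could look, piping a samtools mpileup of the matched pair into VarScan; file names are placeholders and the exact flag set is an assumption based on the settings in the text:

    # Hedged CNV sketch: samtools -q filters on mapping quality, the VarScan
    # flags set the minimal coverage and segment sizes from the text.
    samtools mpileup -q 40 -f GRCh37.fa normal.bam tumor.bam | \
        java -jar VarScan.jar copynumber ${sample}.cnv --mpileup 1 \
            --min-coverage 1 --min-segment-size 2000 --max-segment-size 5000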

Alternate Workflows

These are the alternate workflows and their changes compared to the original.

  • Nugene
    • In the nugene analysis the BQSR step is omitted because BQSR is only meant for enrichment capturing a whole exome/genome. Using it on a small capture kit might result in missed variant calls because of a lack of observations: BQSR needs approx. 100*10^6 basepairs for recalibration.
    • Also, landing probe removal by alignment location was performed with a custom script from the pipeline-util repo, and quality trimming (Q>20) was performed with bbduk (see the sketch after this list).
  • NugeneInc
    • This is not the official nugene pipeline but modeled to look like it: trimming of the landing probe sequence and quality > 20 trimming using bbduk from the BBMap package.
  • RNAseq
    • This has HISAT2 as aligner, uses the GATK tool SplitNCigarReads and uses different HaplotypeCaller settings.
  • Exome
    • This is considered the default.
  • NugeneRNA
    • Combination of the Nugene and the RNAseq protocol changes.
  • none
    • WGS protocol (future work).
  • Iontorrent
    • Ion Torrent workflow, similar to nugene; work in progress.
  • With non-polymorphic variants
    • Does not filter out non-polymorphic variants.
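
A minimal sketch of the bbduk quality trimming mentioned above, assuming paired-end fastq input; the file names and the linker/landing-probe reference are placeholders, not files from the pipeline:

    # Hedged bbduk sketch: quality trimming at Q20 on both read ends plus
    # right-end kmer trimming against an assumed linker reference
    # (linker.fa is a placeholder).
    bbduk.sh in=sample_R1.fq.gz in2=sample_R2.fq.gz \
        out=sample_trimmed_R1.fq.gz out2=sample_trimmed_R2.fq.gz \
        qtrim=rl trimq=20 ref=linker.fa ktrim=r k=19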

References

Name    Project website    Article
Fastqc bioinformatics.babraham.ac.uk
BWA github preprint
Picard-tools sourceforge instructions below faq
GATK toolkit project home instructions here
VarScan github pubmed link
SnpEff sourceforge pubmed
SnpSift sourceforge pubmed
SAMtools project home pubmed
Compute project home citeseerx
DNAcopy bioconductor.org bioconductor.org
Cosmic project home pubmed
1000 genomes project home pubmed
dbSNP project home pubmed
dbNSFP project home site pubmed, pubmed and drive link
ensembl build 75 archive site doi.org link
Exac project home instructions here
FusionCatcher github instructions here
manta github preprint
HTSeq project home pubmed
BBMap sourceforge na
BEDTools2 github pubmed
Hisat2 project home github nature
CoNVaDING github pubmed

Versions

Version management is done with Lmod, a Lua-based implementation of environment modules. Installation/deployment management is done with EasyBuild. Here are the tools and software versions (for the current versions look at the *.siteconfig.csv):

Software    Version
FastQC 0.11.9-Java-11
BWA (mem) 0.7.17-GCCcore-11.3.0
picard 2.26.10-Java-8-LTS
R 4.2.1-foss-2022a
GATK 3.8-1-0-Java-1.8.0_144
snpEff 4.3t-Java-11-LTS
VarScan 2.4.3-Java-1.7.0_80
SAMtools 1.9-GCCcore-7.3.0
HTSlib 1.17-GCCcore-11.3.0
BCFtools 1.17-GCCcore-11.3.0
VCFtools 0.1.16-foss-2018b-Perl-5.30.0
open-cravat 1.8.0-foss-2018b-Python-3.7.4
pipeline-util 0.8.20-5-ga0a29bb-foss-2022a
parallel 20220722-GCCcore-11.3.0
TableToXlsx 0.3.0-GCCcore-11.3.0-Perl-5.34.1
BBMap 35.69-foss-2018b
BEDTools 2.27.1-foss-2018b
HTSeq 0.11.0-GCCcore-7.3.0-Python-3.7.4
R (rmarkdown) 4.2.1-foss-2022a
manta 1.6.0-GCCcore-11.3.0
freebayes 1.3.2-GCCcore-7.3.0
LoFreq 2.1.5-foss-2018b-Python-2.7.16
multiqc 1.12-GCCcore-11.3.0
R-bundle-ichorCNA 0.3.2-20191219-foss-2022a-R-4.2.1
fgbio 1.3.0
STAR 2.7.3a-foss-2018b
TrimGalore 0.6.7-GCCcore-11.3.0
cutadapt 4.2-GCCcore-11.3.0

Resource links

The resource links for installing/preparing the references:

Name    Site
GATK Bundle    human reference
Ensembl    reference/gtf download
UCSC Tools    format conversion/additional tools

Workflow

The workflow is a multistep protocol including the following main steps.

  • Start with input reads in fastq format (from Illumina sequencing).
  • (RNA/NugeneRNA) Perform FusionCatcher on PE reads, or else skip this analysis.
  • (Nugene/NugeneRNA) Add the N6 barcode info into the read and trim by linker sequence and quality. Steps detailed below:
    • Add the N6 barcode into the read.
    • Trim on linker sequence and on base quality < 20.
  • Align reads versus the 1000g b37 reference genome using BWA or HISAT2 for DNA or RNA respectively (see the sketch after this list).
  • (Nugene/NugeneRNA) Hard trim the alignment on overlap with the landing probes.
  • Use Picard tools AddOrReplaceReadGroups to convert the SAM into properly sorted and indexed bam format.
  • (Nugene/NugeneRNA) Expand readgroups based on the N6 barcodes: for each unique N6 barcode create a new readgroup.
  • Use Picard tools MergeSamFiles to merge the .bam files into a single .bam file for each sample.
  • Use Picard tools MarkDuplicates to mark the duplicates in each sample.
  • Use GATK indel realignment with known sites.
  • Call variants using GATK HaplotypeCaller.
  • Annotate variants using SnpEff (functional effects), SnpSift (dbNSFP) and GATK VariantAnnotator (ExAC / 1000g / Cosmic).
  • Use GATK SelectVariants and VariantFiltration to split by snv and indel & mnp, then filter depending on the dataset.
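
A minimal sketch of the DNA alignment and SAM-to-sorted-BAM conversion steps above, assuming BWA-MEM and Picard; the read group fields and file names are placeholders, not the pipeline's exact values:

    # Hedged sketch of alignment plus read group/format conversion;
    # all paths and RG* values are placeholders.
    bwa mem -t 8 GRCh37.fa sample_R1.fq.gz sample_R2.fq.gz > ${sample}.sam
    java -jar picard.jar AddOrReplaceReadGroups \
        I=${sample}.sam O=${sample}.bam SORT_ORDER=coordinate \
        RGID=run1.lane1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=${sample} \
        CREATE_INDEX=true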

List of definitions

Definition Meaning
Readgroup A subset of data with sample/library/machine/lane/etc. info attached, for use in duplicate removal, BQSR, variant calling and other analysis steps to improve them.
N6 barcode The random barcode of nugene attached to the normal sample barcode.

See the pictures below for the different sub-workflows:

Generic workflow

(workflow diagram)

Nugene workflow

(workflow diagram)

Nugene RNA workflow

(workflow diagram)

Output files

The output contains 3 different types:

Project folder

  • All the (intermediate) files of the analysis.
  • ${project}_nobam.zip: all the relevant files excluding the bam files (use this to clean up).
  • Subfolders
    • CovariantAnalysis/
      • The base quality score recalibration results. This shows the effect of the analysis on the quality scores.
    • fastqc/
      • The complete results of the fastqc tool
    • HsMetrics/
      • The complete results of the picard CalculateHsMetrics tool. ${sample}.hsmetrics.log: calculated metrics like the fraction of bases covered 20x and the mean coverage. ${sample}.hsmetrics.pertargetcov.tsv: coverage for each target region.
    • md/
      • Collected metrics of the more important quality metrics in html/markdown format.
    • multipleMetrics/
      • The complete results of the picard CollectMultipleMetrics tool. Many metrics are included.
    • variantfiltering/
      • The (soft) filtered vcf files, to make IGV(integrative genomics viewer)/bioinformatics people happy.
    • varscan.${controlSampleName}/
      • Our CNV analysis; see the *.pdf and *.seg files in this folder for results. This shows the copy number ratios versus the normal samples.
    • xlsx/
      • The table dumps of the VCF files, made for Excel spreadsheet processing. Four types are present: description, raw, min and filtered, for viewing the description of the headers, the raw variant calls (without filtering), the variant calls with a selected amount of columns, and the variant calls filtered for population frequency and function. This for the different types of calls: single nucleotide variants (snv), small indels and multi-nucleotide polymorphisms (indelmnp) and structural variants called with manta (sv / work in progress).

Variant calling

This is done with the HaplotypeCaller using -stand_call_conf 10.0 and -stand_emit_conf 20.0 (call bases with QUAL > 10), instead of the default calling with QUAL >= 30 (see also the 'Filtering of variant calls' paragraph).
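
A minimal sketch of such a call in GATK 3 syntax; the reference and file names are placeholders (flag availability varies across GATK 3 minor versions):

    # Hedged GATK 3 HaplotypeCaller sketch with the thresholds described above;
    # input/output paths are placeholders.
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R GRCh37.fa -I tumor.bam -I normal.bam \
        -stand_call_conf 10.0 -stand_emit_conf 20.0 \
        -o ${project}.raw.vcf.gz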

Filtering of variant calls

The recommended filtering procedure depends on the sample size, sequencing method, target region size and depth; the filters below are general recommendations. The filtering might benefit from less stringent filtering of the variant statistics to increase sensitivity. Refiltering is best done on the raw data, because it still contains the removed variance. Also consider adding exac.ac/exac.an <> 0.02 filters and adding the population-specific (European/Asian/American/African) filters for exac, esp6500 and 1000g. If you document your filtering I might add it to the filtering pipeline.

The table below shows the filtering steps that are automatically performed. Note the different filtering strategies for the following subtypes.

  • snv
  • indel & mnp
Name Expression Applied on Description
"LowQual" "QUAL < 20" both (snv and indel & mnp) Variant statistic. Filter for the possibility (> 1/1000 or <30 pred scaled or > 1/100 or <20 pred scaled) that the variant call is wrong using a bayesian model.
"QDlt2" "QD < 2.0" both Variant statistic. Filter for the pred scaled possibility that the variant call is wrong divided by the depth < 2.0
"MQlt40" "MQ < 40.0" snv Variant statistic. Filter snv for the pred scaled possibility that a mapping is wrong, capped 60, calculated with secondary hits using the base quality scores at the different positions to call it 0 or higher. Filter for unique mappings.
"MQRankSumlt-12_5" "MQRankSum < -12.5" snv Variant statistic. Filter snv for mutations in which the mutation or the reference has difficulties mapping depending on one another.
"MQRankSumlt-20" "MQRankSum < -20" indel & mnp Variant statistic. Filter indel for mutations in which the mutation or the reference has difficulties mapping depending on one another.
"ReadPosRankSumlt-20" "ReadPosRankSum < -20.0" snv Variant statistic. Fisher strand bias of the covering reads. High values == more bias.
"FSgt60" "FS > 60.0" snv Variant statistic. Fisher strand bias of the covering reads. High values == more bias.
"FSgt200" "FS > 200.0" indel & mnp Variant statistic. Tandem repeat annotation of the region. The more repeats the less likely the indel call is true. Sequencing has difficulties with repeats (with about >=8 repeating units).
"RPAgt8" "vc.getAttribute('RPA').0 > 8 vc.getAttribute('RPA').1 > 8
"TeMeermanAlleleBiasgt5" TeMeermanAlleleBias > 5.0 both Variant statistic. Never filters anything -> legency artifact
"NotPolymorfic" all GT equal both Genotype annotation. All the genotypes are equal. If analysed with a proper control this variant is very unlikely to be harmful.
"NotPutatativeHarmfulVariant" see descr. both Functional annotation. everything that is not fuctional according the snpeff documentation under 'Effect prediction details' aka jexl expression :" "!((vc.hasAttribute('SNPEFFANN_ANNOTATION_IMPACT') && vc.getAttribute('SNPEFFANN_ANNOTATION_IMPACT').contains('HIGH'))
"1000gMAFgt0.02" see descr. both Population annotation. entire 1000g phase 1 alelle fequency of 2% aka "(vc.hasAttribute('1000gPhase1Snps.AF') &&(vc.getAttribute('1000gPhase1Snps.AF') > 0.02&&vc.getAttribute('1000gPhase1Snps.AF') < 0.98))" --filterName "1000gMAFgt0.02"
"1000gEURMAFgt0.02" see descr. both Population annotation. european 1000g phase 1 alelle fequency of 2% aka "(vc.hasAttribute('1000gPhase1Snps.EUR_AF') && (vc.getAttribute('1000gPhase1Snps.EUR_AF') > 0.02&&vc.getAttribute('1000gPhase1Snps.EUR_AF') < 0.98))" --filterName "1000gEURMAFgt0.02"
"QDlt2andQdbyAflt8" "QD < 2.0 && QD / AF < 8.0 both Variant statistic. Having both QD < 2 and QD/ AF < 8 in tekst:the qual divided by depth less then 2 and the qual divided by depth divided by allele frequency less than 8. The second parameter(QDAF is to avoid removing low AF (read like: rare) variant calls from the data.). Also note the QD is already calculated by using samples containing the variant.

Testing

Simple syntax/generation checking:

make test

Test data from the Broad Institute, from here:

bash GenerateScripts.sh exomehuman38 samplesheet_tutorial_801.csv testgrch38 /data/umcg-mterpstra/apps/data/ftp.broadinstitute.org/bundle/bundle17jan2020/hg38/wgs_calling_regions.hg38.interval_list && bash .RunWorkFlowGeneration.sh && bash /scratch/umcg-mterpstra/projects/testgrch38/jobs/submit.sh