The main goal is to detect variants in tumor-normal sample combinations.
The build status at the moment reflects just test whether or not molgenis-compute is able to generate the scripts. A complete install/run of the workflow is not tested.
branches
check out my .travis.yml
and the test/*.sh
files.
Prerequirements:
- Lmod
- EasyBuild
- Molgenis-Compute
- All the other software: installs using Easybuild.
- Slurm sheduler or large enough machine to run locally although not tested starting at about ~30g mem, ~1T storage and 8 (intel) cpu's
Prepare a samplesheet similar to Samplesheet.csv
update the setting to match your dataset and run it using:
Use:bash GenerateScripts.sh [none|exome|rna|nugene|nugrna|iont|withpoly|lexo] samplesheet projectname targetsList <nugeneProbebed/iontAmpliconBed>
[none|exome|rna|nugene|nugrna|iont|withpoly|lexo]
Application to use for sequencing.
samplesheet Csv file describing the fastq and samples to be analysed.
projectname Name to tag the project: this decides the header in your batchfiles and location to generate the jobs files.
targetsList Interval_list formatted file describing the target regions to report mutations in and near these regions.
<nugeneProbebed/iontAmpliconBed>
Specific files for special workflow.
The variant calling pipeline is an adaptation of the GATK workflow (DePristo et al. 2011, Van der Auwera et al. 2013) and molgenis compute (Byelas, H et al. 2013) as workflow management software.
Alignment of reads was done using BWA (Li, H. 2013) and the Genome Analysis Toolkit (abbr. GATK) (McKenna et al. 2010 ). Using the human genome reference build GRCH37 with decoys from the GATK bundle. Picard Tools was used for format conversion and Marking duplicates. As variant caller this pipeline uses the HaplotypeCaller to call all the samples of the same patient/cohort. Annotation of the variants was done using SnpEff (Cingolani P, Platts A, Wang le L et al. 2012, ) / SnpSift (Cingolani P, Patel VM, Coon M et al 2012) with the ensembl release 75 (Flicek, P et al 2014) gene annotations and the dbNSFP2.7 database (Liu, X et al. 2011, Liu, X et al. 2013), the GATK was used with variant annotations of dbsnp 138 (Sherry, ST et al. 2001), Cosmic v72 (Forbes et al. 2014), 1000 genomes phase 3 (1000 Genomes Project Consortium et al. 2012) and Exac 0.3 databases (Lek et al. 2016). The data was filtered for quality metrics similar to GATK recommendations (described below) and custom filters for population frequency and variant effect. For code and exact versioning see: https://github.com/mmterpstra/molgenis-c5-TumorNormal
Copy number variation data were generated using Samtools and Varscan (Li et al., 2009, Koboldt et al., 2012) (pubmed, pubmed link).
The GC-content normalised log2 fold change data was generated using using data of the tumor sample and the matched normal sample. Using a minimal segment size of 2000 bp, a maximal segment size
of 5000 bp, minimal base wise coverage of 1 and the alignments with a mapping quality greater than 40 to calculate the fold changes, otherwise the default settings are assumed. The segment calling was done with DNAcopy algoritmn
(Olshen et al., 2004). In addition to the default settings the circulair binary segmentation (CBS) calls were merged if they had an SD of < 0.5 with adjacent segments and weigths were added based on
weigth = mean(normal/sample)+delta(normal,sample)
. Where normal
= mean coverage per base in normal control and sample
= mean coverage per base in described sample.
The results were plotted using R after using perl for format conversion.
These are the atlternate workflows and the change compared to the original.
- Nugene
- In the nugene analysis the BQSR step is omitted because this is only meant for enritchment capturing a whole exome/genome. Using this on a small capturing kit might result in missing variant calls because of lack of observations. This happens because for BSQR you'll need approx 100*10^6 basepairs for recalibration.
- Also Landing probe removal by aligment location was performed with custom script from pipeline-util repo and quality trimming (Q>20) with bbduk was performed
- NugeneInc
- This is not the official nugene pipeline but modeled to look like it: trimming for landing probe sequence and quality >20 trim using bbduk from the bbmap package.
- RNAseq
- This has Hisat2 as aligner, used the GATK tool SplitNReads and uses different haplotypecaller settings.
- Exome
- This is considered the default.
- NugeneRNA
- combination of the Nugene and the RNAseq protocol changes.
- none
- WGS protocol (future work).
- Iontorrent
- Ion torrent workflow.... similar to nugene although work in progress.
- With non-polymorfic variants
- Does not filter for non polymorfic variants.
Name | project website | Article |
---|---|---|
Fastqc | bioinformatics.babraham.ac.uk | |
BWA | github | preprint |
Picard-tools | sourceforge | instructions below faq |
GATK toolkit | project home | instructions here |
VarScan | github | pubmed link |
SnpEff | sourceforge | pubmed |
SnpSift | sourceforge | pubmed |
SAMtools | project home | pubmed |
Compute | project home | citeseerx |
DNAcopy | bioconductor.org | bioconductor.org |
Cosmic | project home | pubmed |
1000 genomes | project home | pubmed |
dbSNP | project home | pubmed |
dbNSFP | project home | site pubmed, pubmed and drive link |
ensembl build 75 | archive site | doi.org link |
Exac | project home | instructions here |
FusionCatcher | github | instructions here |
manta | github | preprint |
HTSeq | project home | pubmed |
BBMap | sourceforge | na |
BEDTools2 | github | pubmed |
Hisat2 | project home github | nature |
CoNVaDING | github | pubmed |
Version managment done with lmod lua based implementation of environment modules
. Installation/deployment management done with Easybuild.
Here are the tools and software versions (for the current versions look at the *.siteconfig.csv):
software | version |
---|---|
FastQC | 0.11.9-Java-11 |
BWA (mem) | 0.7.17-GCCcore-11.3.0 |
picard | 2.26.10-Java-8-LTS |
R | 4.2.1-foss-2022a |
GATK | 3.8-1-0-Java-1.8.0_144 |
snpEff | 4.3t-Java-11-LTS |
VarScan | 2.4.3-Java-1.7.0_80 |
SAMtools | 1.9-GCCcore-7.3.0 |
HTSlib | 1.17-GCCcore-11.3.0 |
BCFtools | 1.17-GCCcore-11.3.0 |
VCFtools | 0.1.16-foss-2018b-Perl-5.30.0 |
open-cravat | 1.8.0-foss-2018b-Python-3.7.4 |
pipeline-util | 0.8.20-5-ga0a29bb-foss-2022a |
parallel | 20220722-GCCcore-11.3.0 |
TableToXlsx | 0.3.0-GCCcore-11.3.0-Perl-5.34.1 |
BBMap | 35.69-foss-2018b |
BEDTools | 2.27.1-foss-2018b |
HTSeq | 0.11.0-GCCcore-7.3.0-Python-3.7.4 |
R (rmarkdown) | 4.2.1-foss-2022a |
manta | 1.6.0-GCCcore-11.3.0 |
freebayes | 1.3.2-GCCcore-7.3.0 |
LoFreq | 2.1.5-foss-2018b-Python-2.7.16 |
multiqc | 1.12-GCCcore-11.3.0 |
R-bundle-ichorCNA | 0.3.2-20191219-foss-2022a-R-4.2.1 |
fgbio | 1.3.0 |
STAR | 2.7.3a-foss-2018b |
TrimGalore | 0.6.7-GCCcore-11.3.0 |
cutadapt | 4.2-GCCcore-11.3.0 |
The resource links for installing references/preparing them
name | site |
---|---|
GATK Bundle | human reference |
Ensembl | reference/gtf dowload |
UCSC Tools | format conversion/additional tools |
The workflow is a multistep protocol including the following main steps.
- Start with input reads in fastq format (from illumina sequencing)
- (
RNA
/NugeneRNA
) Perform FusionCather on PE reads or else skip this analysis. - (
Nugene
/NugeneRNA
) Add N6 barcode info into the read and trim by linker sequence and quality. steps detailed below.- Add N6 barcode into read.
- Trim on Linker sequence and on base qual < 20
- Align reads versus 1000g b37 reference genome using BWA or hisat for DNA or RNA respectively.
- (
Nugene
/NugeneRNA
) hard trim alignment on overlap with the landing probes. - Use Picard tools AddOrReplaceGroups to covert the sam into properly sorted and indexed bam format.
- (
Nugene
/NugeneRNA
) Expand readgroups based on the N6 barcodes: for each unique N6 barcode create a new readgroup. - Use Picard tools MergeBamFiles merge the .bam into a single .bam file for each sample.
- Use Picard tools MarkDuplicates to mark the duplicates in each sample.
- Use GATK indel realignment to know sites.
- Call variants using GATK HaplotypeCaller
- Annotate Variants using SNPEFF (funtional effects), SNPSIFT (dbsnfp) and GATK VariantAnnotator (exac / 1000g /cosmic).
- Using the GATK SelectVariants and FilterVariants split by
snv
andindel & mnp
then filter depending on the dataset.
list of definitions
Definition | Meaning |
---|---|
Readgroup | A subset of data with attached sample/library/machine/lane/etc.. info attached for use in Duplicate removal, BQSR, variant calling and other analysis steps to improve them. |
N6 barcode | The random barcode of nugene attached to the normal sample barcode. |
See pictures below for the different sub workflows:
Generic workflow
Nugene workflow
Nugene RNA workflow
The output contains 3 different types:
Project folder
- All the (intermediate) files of the analysis
- Use: to clean up ${project}_nobam.zip
- All the relevant files excluding the bam files.
- Subfolders
-
CovariantAnalysis/
- The base quality score recalibration results. This shows the effect of the analysis on the quality scores.
-
fastqc/
- The complete results of the fastqc tool
-
HsMetrics/
- The complete results of the picard CalculateHsMetrics tool. ${sample}.hsmetrics.log: calulated metrics like fraction of bases 20x covered and mean coverage. ${sample}.hsmetrics.pertargetcov.tsv: coverage for each target region
-
md/
- Collected metrics of the more important quality metrics in html/markdown format.
-
multipleMetrics/
- The complete results of the picard multipleMetrics tool. Many metrics included...
-
variantfiltering/
- The (soft) filtered vcf files, to make IGV(integrative genomics viewer)/bioinformatics people happy.
-
varscan.${controlSampleName}/
- Our CNV analysis see the varscan.normals/.pdf varscan.normals/.seg files for results. This shows the copynumber ratios versus the normal samples.
-
xlsx/
- The table dumps of the VCF files made for excel spreadsheet processing. Four types are present description,raw,min and filtered for viewing description of the headers,the raw variant calls (without filtering), the variant calls with a selected amount of columns and the variant calls filtered for population frequency and function. This for different type of calls: single nucletide variants (snv), small indels and multinucleotide polymorfisms (indelmnp) and structural variants called with manta (sv / work in progress).
-
This is done with the haplotype caller using -stand_call_conf 10.0
and -stand_emit_conf 20.0
(call bases with QUAL > 10).
Instead of the default calling with a QUAL >= 30 (see also the 'Filtering of variant calls' paragraph).
The recommended filtering procedure depends on the sample size, sequencing method, target region size and depth these are general recommendations. The filtering might benefit from less stringent filtering of the variant statistics to increase sensitivity. The refiltering is best done on the raw data because these still contain the removed variance. Also consider adding exac.ac/exac.an <> 0.02 and adding the population(european/asian/american/african) specific filters for exac,esp6500 and 1000g. If you document your filtering I might add it to the filtering pipeline.
The table below shows the filtering steps that are automatically performed. Note the different filtering strategies for the following subtypes.
- snv
- indel & mnp
Name | Expression | Applied on | Description |
---|---|---|---|
"LowQual" | "QUAL < 20" | both (snv and indel & mnp) | Variant statistic. Filter for the possibility (> 1/1000 or <30 pred scaled or > 1/100 or <20 pred scaled) that the variant call is wrong using a bayesian model. |
"QDlt2" | "QD < 2.0" | both | Variant statistic. Filter for the pred scaled possibility that the variant call is wrong divided by the depth < 2.0 |
"MQlt40" | "MQ < 40.0" | snv | Variant statistic. Filter snv for the pred scaled possibility that a mapping is wrong, capped 60, calculated with secondary hits using the base quality scores at the different positions to call it 0 or higher. Filter for unique mappings. |
"MQRankSumlt-12_5" | "MQRankSum < -12.5" | snv | Variant statistic. Filter snv for mutations in which the mutation or the reference has difficulties mapping depending on one another. |
"MQRankSumlt-20" | "MQRankSum < -20" | indel & mnp | Variant statistic. Filter indel for mutations in which the mutation or the reference has difficulties mapping depending on one another. |
"ReadPosRankSumlt-20" | "ReadPosRankSum < -20.0" | snv | Variant statistic. Fisher strand bias of the covering reads. High values == more bias. |
"FSgt60" | "FS > 60.0" | snv | Variant statistic. Fisher strand bias of the covering reads. High values == more bias. |
"FSgt200" | "FS > 200.0" | indel & mnp | Variant statistic. Tandem repeat annotation of the region. The more repeats the less likely the indel call is true. Sequencing has difficulties with repeats (with about >=8 repeating units). |
"RPAgt8" | "vc.getAttribute('RPA').0 > 8 | vc.getAttribute('RPA').1 > 8 | |
"TeMeermanAlleleBiasgt5" | TeMeermanAlleleBias > 5.0 | both | Variant statistic. Never filters anything -> legency artifact |
"NotPolymorfic" | all GT equal | both | Genotype annotation. All the genotypes are equal. If analysed with a proper control this variant is very unlikely to be harmful. |
"NotPutatativeHarmfulVariant" | see descr. | both | Functional annotation. everything that is not fuctional according the snpeff documentation under 'Effect prediction details' aka jexl expression :" "!((vc.hasAttribute('SNPEFFANN_ANNOTATION_IMPACT') && vc.getAttribute('SNPEFFANN_ANNOTATION_IMPACT').contains('HIGH')) |
"1000gMAFgt0.02" | see descr. | both | Population annotation. entire 1000g phase 1 alelle fequency of 2% aka "(vc.hasAttribute('1000gPhase1Snps.AF') &&(vc.getAttribute('1000gPhase1Snps.AF') > 0.02&&vc.getAttribute('1000gPhase1Snps.AF') < 0.98))" --filterName "1000gMAFgt0.02" |
"1000gEURMAFgt0.02" | see descr. | both | Population annotation. european 1000g phase 1 alelle fequency of 2% aka "(vc.hasAttribute('1000gPhase1Snps.EUR_AF') && (vc.getAttribute('1000gPhase1Snps.EUR_AF') > 0.02&&vc.getAttribute('1000gPhase1Snps.EUR_AF') < 0.98))" --filterName "1000gEURMAFgt0.02" |
"QDlt2andQdbyAflt8" | "QD < 2.0 && QD / AF < 8.0 | both | Variant statistic. Having both QD < 2 and QD/ AF < 8 in tekst:the qual divided by depth less then 2 and the qual divided by depth divided by allele frequency less than 8. The second parameter(QDAF is to avoid removing low AF (read like: rare) variant calls from the data.). Also note the QD is already calculated by using samples containing the variant. |
simple syntax /generation checking
make test
testdata from the broadinstitute from here
bash GenerateScripts.sh exomehuman38 samplesheet_tutorial_801.csv testgrch38 /data/umcg-mterpstra/apps/data/ftp.broadinstitute.org/bundle/bundle17jan2020/hg38/wgs_calling_regions.hg38.interval_list && bash .RunWorkFlowGeneration.sh && bash /scratch/umcg-mterpstra/projects/testgrch38/jobs/submit.sh)