phac-nml/viralassembly: Output

Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from either the MultiQC report or the custom report, which both summarise the results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data in the following stages, each described in its own section below: Preprocessing, Variant Calling, Consensus Generation, and QC and Reporting.

Additionally, the Pipeline information section covers the reports and metrics generated during workflow execution.

Preprocessing

Initial processing steps and statistics gathering. The reference statistics are output to their own final folder, while the other statistics are passed to the final MultiQC report.

Reference Stats

Output files
  • reference/
    • genome.bed: Genomic information in BED format containing the coordinates of the reference genome, needed for nanopolish
    • refstats.txt: Genomic information in a format needed for clair3
    • *.fai: Samtools faidx fai file for reference genome

The reference files are generated with both awk and samtools and are needed as different inputs for downstream tools.
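As a rough illustration of how a BED file can be derived from a samtools faidx index, the sketch below converts `.fai` lines into whole-contig intervals. It assumes a plain three-column BED (name, start, end) and uses a hypothetical contig called `ref1`; the exact layout of the pipeline's `genome.bed` and `refstats.txt` is not shown in this document, and the pipeline itself uses awk and samtools rather than Python.

```python
def fai_to_bed(fai_text: str) -> list[tuple[str, int, int]]:
    """Turn samtools faidx index lines into whole-contig BED intervals."""
    intervals = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        # .fai columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        name, length = line.split("\t")[:2]
        # BED is 0-based and half-open, so a whole contig is [0, length)
        intervals.append((name, 0, int(length)))
    return intervals


# Hypothetical index line for a 1000 bp contig named "ref1"
example_fai = "ref1\t1000\t6\t60\t61\n"
for row in fai_to_bed(example_fai):
    print("\t".join(str(field) for field in row))
```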

Artic Guppyplex

Select reads by size and generate size selected fastq files.

Chopper

Chopper filters and trims fastq reads by quality and length.
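The idea behind length and quality filtering can be sketched as follows. This is an illustration only, not Chopper's or Guppyplex's implementation: the thresholds are hypothetical, Phred+33 encoding is assumed, and the mean quality is computed by averaging error probabilities, which is one common convention.

```python
import math


def mean_phred(quality_string: str, offset: int = 33) -> float:
    """Mean read quality, averaged over error probabilities (Phred+33 assumed)."""
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def keep_read(sequence: str, quality_string: str,
              min_len: int, max_len: int, min_quality: float) -> bool:
    """Keep a read only if its length and mean quality are within bounds."""
    if not min_len <= len(sequence) <= max_len:
        return False
    return mean_phred(quality_string) >= min_quality


# Hypothetical thresholds: 200-700 bp reads with mean quality of at least 12
print(keep_read("ACGT" * 100, "I" * 400, 200, 700, 12))  # long enough, high quality
print(keep_read("ACGT", "IIII", 200, 700, 12))           # too short
```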

Nanostat

Nanostat generates plots and statistics on trimmed fastq files for the final MultiQC reports.

nanostats_mqc


Variant Calling

Read mapping and variant calling. Note that only one of clair3, medaka, and nanopolish is used. In the end, final normalized passing and failing variants are output along with the BAM files to their respective folders.

Minimap2

Output files
  • bam/
    • *.sorted.bam: Sorted bam file from minimap2 and samtools

The sorted BAM file from minimap2 and samtools.

Artic Align_Trim

Amplicon only

Output files
  • bam/
    • *.trimmed.rg.sorted.bam: Artic align_trim output which normalises coverage and assigns reads to amplicons
    • *.primertrimmed.rg.sorted.bam: Artic align_trim output which normalises coverage and assigns reads to amplicons along with softmasking the primer sequences
      • The primertrimmed file is used for subsequent variant calling

See the artic core pipeline for more info on how align_trim trims the BAM files.

Clair3

Run clair3 variant caller on BAM files to create initial variant calls in VCF format.

Medaka

Run medaka variant caller on BAM files to create initial variant calls in VCF format.

Nanopolish

Run nanopolish variant caller on BAM files, fast5 files, and the sequencing summary file to create initial variant calls in VCF format.

Longshot

Output files
  • vcf/
    • *.longshot.merged.vcf: Longshot phased VCF file

Genotype and phase the variants from the initial medaka VCF variant file with Longshot.

Variant Filter

Output files
  • vcf/
    • *.pass.vcf.gz: VCF file containing variants passing quality filters
    • *.pass.vcf.gz.tbi: VCF index file containing variants passing quality filters
    • *.fail.vcf: VCF file containing variants failing quality filters

Pass/Fail variants based on quality for the final consensus sequence generation.


Consensus Generation

Final consensus sequence generation based on passing/failing variants and sequencing depth.

Artic Mask

Mask low depth and failing variants to create a preconsensus sequence for BCFtools consensus.
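The masking idea can be sketched in a few lines. This is a simplified illustration, not the artic mask implementation: it assumes per-base depths are already available, uses a hypothetical depth threshold, treats failing variants as single 0-based positions, and ignores indels.

```python
def mask_sequence(reference: str, depths: list[int], min_depth: int,
                  failed_positions: frozenset[int] = frozenset()) -> str:
    """Replace low-depth bases and failing-variant positions (0-based) with N."""
    if len(reference) != len(depths):
        raise ValueError("reference and depths must have the same length")
    masked = []
    for index, (base, depth) in enumerate(zip(reference, depths)):
        if depth < min_depth or index in failed_positions:
            masked.append("N")
        else:
            masked.append(base)
    return "".join(masked)


# Hypothetical example: minimum depth of 20, failing variant at index 4
print(mask_sequence("ACGTAC", [30, 5, 30, 0, 25, 30], 20, frozenset({4})))
```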

BCFtools Norm

Output files
  • vcf/
    • *.pass.norm.vcf.gz: VCF file containing variants passing quality filters that have their indels normalized and reference positions fixed
      • Reference positions may need to be fixed if there are overlapping variants

BCFtools norm is used to fix locations where two variants overlap, which previously crashed the pipeline during BCFtools consensus.
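To make "overlapping variants" concrete, the sketch below flags pairs of variants whose reference spans intersect. It only illustrates the condition being fixed and does not reproduce what BCFtools norm does; it assumes 1-based VCF positions, a single contig, and hypothetical example records.

```python
def find_overlaps(variants: list[tuple[int, str, str]]):
    """Return pairs of (pos, ref, alt) variants whose reference spans overlap.

    Positions are 1-based as in VCF; a variant covers pos .. pos + len(ref) - 1.
    """
    overlaps = []
    furthest = None      # variant that reaches furthest to the right so far
    furthest_end = 0
    for variant in sorted(variants):
        pos, ref, _alt = variant
        if furthest is not None and pos <= furthest_end:
            overlaps.append((furthest, variant))
        end = pos + len(ref) - 1
        if end > furthest_end:
            furthest, furthest_end = variant, end
    return overlaps


# Hypothetical records: a 2 bp deletion at 100 overlapping a SNP at 101
example = [(100, "ATG", "A"), (101, "T", "C"), (200, "G", "A")]
print(find_overlaps(example))
```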

BCFtools Consensus

Output files
  • consensus/
    • *.consensus.fasta: Fasta file containing the final output consensus sequence with applied variants and masked sites

Final output consensus sequence for the sample, with variants applied and low-coverage sites and failing variants masked with N's.


QC and Reporting

All QC and reporting is currently only done on non-segmented viruses.

SnpEff

Output files
  • snpeff/
    • *.ann.vcf: VCF file with variant annotations
    • *.csv: Variant annotation csv file

SnpEff is a genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).

snpeff_mqc

Qualimap BAMQC

Qualimap BAMQC is a platform-independent application written in Java and R that provides a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts. The output is used in the final MultiQC reports.

qualimap_mqc

Samtools Flagstat

Samtools flagstat counts the number of alignments for each FLAG type. The output is used in the final MultiQC reports.
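For readers unfamiliar with SAM FLAG values, the sketch below tallies a few flag bits from a list of integer FLAGs. The bit values come from the SAM specification; the tally is a simplified illustration and does not reproduce the exact categories or output format of samtools flagstat.

```python
from collections import Counter

# Selected FLAG bits as defined in the SAM specification
FLAG_BITS = {
    "paired": 0x1,
    "properly_paired": 0x2,
    "unmapped": 0x4,
    "secondary": 0x100,
    "duplicate": 0x400,
    "supplementary": 0x800,
}


def tally_flags(flags: list[int]) -> Counter:
    """Count how many alignment records have each FLAG bit set."""
    counts = Counter(total=0, mapped=0)
    for flag in flags:
        counts["total"] += 1
        if not flag & FLAG_BITS["unmapped"]:
            counts["mapped"] += 1
        for name, bit in FLAG_BITS.items():
            if flag & bit:
                counts[name] += 1
    return counts


# Hypothetical FLAG values: forward, reverse, unmapped, secondary, supplementary
print(tally_flags([0, 16, 4, 256, 2048]))
```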

samtools_mqc

BCFtools Stats

BCFtools stats produces machine-readable variant quality metrics and statistics. The output is used in the final MultiQC reports.

bcftools_mqc

Variation CSV

Output files
  • variation_csvs/
    • *_variation.csv: CSV file displaying positions where there is >= 15% variation from the reference base call

Custom python script using pysam to find positions in the pileup which have >= 15% variation from the reference sequence. This gives information on any mixed sites and helps identify spots in the genome where there may be sequencing artifacts or issues. The CSV file can be viewed directly, or a coloured table can be found in each sample's MultiQC report or custom report.
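The 15% rule can be illustrated with a small sketch. Unlike the pipeline's script, which reads the pileup with pysam, this version assumes the per-position base counts have already been tallied, and the output columns are hypothetical.

```python
def variable_positions(reference: str, pileup_counts: list[dict[str, int]],
                       min_fraction: float = 0.15):
    """Report positions where the non-reference fraction reaches min_fraction.

    pileup_counts[i] maps each observed base to its read count at position i.
    """
    rows = []
    for index, counts in enumerate(pileup_counts):
        depth = sum(counts.values())
        if depth == 0:
            continue  # no coverage, nothing to compare
        ref_base = reference[index]
        non_ref_fraction = (depth - counts.get(ref_base, 0)) / depth
        if non_ref_fraction >= min_fraction:
            # Report 1-based positions, as is usual for genome coordinates
            rows.append((index + 1, ref_base, depth, round(non_ref_fraction, 3)))
    return rows


# Hypothetical counts: position 2 is a 80/20 mixed site, position 3 is 90/10
counts = [{"A": 100}, {"C": 80, "T": 20}, {"G": 90, "A": 10}]
print(variable_positions("ACG", counts))
```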

variation_mqc

Amplicon Completeness

Amplicon completeness is calculated using a custom python script along with an amplicon bed file and the final consensus sequence. It reports how many bases were called in each amplicon and gives a final completeness value from 0 to 1.00.
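A minimal sketch of the calculation, under stated assumptions: amplicon coordinates are 0-based half-open BED intervals, consensus coordinates match reference coordinates (no indels), and only A, C, G and T count as called bases. The pipeline's own script may treat ambiguity codes and indels differently, and the amplicon names here are hypothetical.

```python
def amplicon_completeness(consensus: str,
                          amplicons: list[tuple[str, int, int]]) -> dict[str, float]:
    """Fraction of called (A/C/G/T) bases per amplicon, rounded to 2 decimals."""
    results = {}
    for name, start, end in amplicons:
        window = consensus[start:end].upper()
        called = sum(1 for base in window if base in "ACGT")
        results[name] = round(called / (end - start), 2)
    return results


# Hypothetical 20 bp consensus with a masked stretch in the middle
consensus = "ACGTACGTNNNNNNNNACGT"
print(amplicon_completeness(consensus, [("amp1", 0, 10), ("amp2", 8, 20)]))
```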

completeness_mqc

QC Compilation

Output files
  • sample_csvs/
    • *.qc.csv: Individual sample CSV files containing sample stats
  • overall.qc.csv: Overall sample and run CSV file containing all sample stats

Final CSV file(s) for both individual samples and the overall run. They combine and check a variety of metrics, giving a final QC value for each sample.
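Combining per-sample CSVs into one run-level table can be sketched as below. The column names (`sample`, `qc_pass`) are hypothetical placeholders, and the sketch only concatenates rows; it does not reproduce the QC checks the pipeline performs.

```python
import csv
import io


def combine_sample_csvs(csv_texts: list[str]) -> str:
    """Concatenate per-sample CSV texts into one CSV, taking the union of columns."""
    rows = []
    fieldnames = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        rows.extend(reader)
    buffer = io.StringIO()
    # Missing values are written as "NA" when a sample lacks a column
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="NA",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical per-sample files held in memory
sample_one = "sample,qc_pass\ns1,PASS\n"
sample_two = "sample,qc_pass\ns2,FAIL\n"
print(combine_sample_csvs([sample_one, sample_two]))
```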

qc_mqc

MultiQC

Output files
  • sample_mqc/
    • *.report.html: Sample specific MultiQC HTML report containing visuals and tables
  • Overall-Run-MultiQC.report.html: Final overall MultiQC report containing visuals and tables for all samples combined

Final output reports generated by MultiQC based on the overall config and the sample config files, which collate all of the outputs of the pipeline.

sample_mqc_mqc

Custom Report

Output files
  • reportDashboard.html: Custom report dashboard displaying run metrics overall and for each sample

Custom RMarkdown report that contains sample and run information. Currently it can only be generated when running with conda, so it is an output that has to be explicitly specified. It also still has a few issues relating to load times.

run_summary_custom Run summary page

sample_custom Example sample page

amplicons_custom Amplicons page


Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
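When troubleshooting, the trace file can be scanned for tasks that did not finish cleanly. The sketch below assumes the trace is tab-separated with `name`, `status` and `exit` columns, which are among Nextflow's default trace fields; the task names in the example are hypothetical, and a real `execution_trace.txt` contains more columns.

```python
import csv
import io


def failed_tasks(trace_text: str) -> list[tuple[str, str]]:
    """Return (task name, exit code) for tasks marked FAILED or ABORTED."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        (row["name"], row["exit"])
        for row in reader
        if row["status"] in {"FAILED", "ABORTED"}
    ]


# Hypothetical trace excerpt with only the columns used here
example_trace = (
    "task_id\tname\tstatus\texit\n"
    "1\tSTEP_A (sample1)\tCOMPLETED\t0\n"
    "2\tSTEP_B (sample1)\tFAILED\t1\n"
)
print(failed_tasks(example_trace))
```

To use it on a real run, read the file contents first, for example with `open("pipeline_info/execution_trace.txt").read()`, adjusting the path to your results directory.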