plant-food-research-open/assemblyqc is a Nextflow pipeline that evaluates assembly quality with multiple QC tools and presents the results in a unified HTML report. The tools are shown in the Pipeline Flowchart and their references are listed in CITATIONS.md.
%%{init: {
'theme': 'base',
'themeVariables': {
'fontSize': '52px',
'primaryColor': '#9A6421',
'primaryTextColor': '#ffffff',
'primaryBorderColor': '#9A6421',
'lineColor': '#B180A8',
'secondaryColor': '#455C58',
'tertiaryColor': '#ffffff'
}
}}%%
flowchart LR
forEachTag(Assembly) ==> VALIDATE_FORMAT[VALIDATE FORMAT]
VALIDATE_FORMAT ==> ncbiFCS[<span style="white-space: nowrap;">NCBI FCS ADAPTOR</span>]
ncbiFCS ==> Check{Check}
VALIDATE_FORMAT ==> ncbiGX[<span style="white-space: nowrap;">NCBI FCS GX</span>]
ncbiGX ==> Check
Check ==> |Clean|Run(Run)
Check ==> |Contamination|Skip(Skip All)
Skip ==> REPORT
VALIDATE_FORMAT ==> GFF_STATS[<span style="white-space: nowrap;">GENOMETOOLS GT STAT</span>]
Run ==> ASS_STATS[<span style="white-space: nowrap;">ASSEMBLATHON STATS</span>]
Run ==> BUSCO
Run ==> TIDK
Run ==> LAI
Run ==> KRAKEN2
Run ==> HIC_CONTACT_MAP[<span style="white-space: nowrap;">HIC CONTACT MAP</span>]
Run ==> MUMMER
Run ==> MINIMAP2
Run ==> MERQURY
MUMMER ==> CIRCOS
MUMMER ==> DOTPLOT
MINIMAP2 ==> PLOTSR
ASS_STATS ==> REPORT
GFF_STATS ==> REPORT
BUSCO ==> REPORT
TIDK ==> REPORT
LAI ==> REPORT
KRAKEN2 ==> REPORT
HIC_CONTACT_MAP ==> REPORT
CIRCOS ==> REPORT
DOTPLOT ==> REPORT
PLOTSR ==> REPORT
MERQURY ==> REPORT
- FASTA VALIDATOR + SEQKIT RMDUP: FASTA validation
- GENOMETOOLS GT GFF3VALIDATOR: GFF3 validation
- ASSEMBLATHON STATS: Assembly statistics
- GENOMETOOLS GT STAT: Annotation statistics
- NCBI FCS ADAPTOR: Adaptor contamination pass/fail
- NCBI FCS GX: Foreign organism contamination pass/fail
- BUSCO: Gene-space completeness estimation
- TIDK: Telomere repeat identification
- LAI: Continuity of repetitive sequences
- KRAKEN2: Taxonomy classification
- HIC CONTACT MAP: Alignment and visualisation of HiC data
- MUMMER → CIRCOS + DOTPLOT & MINIMAP2 → PLOTSR: Synteny analysis
- MERQURY: K-mer completeness, consensus quality and phasing assessment
Refer to usage, parameters and output documents for details.
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
Prepare an assemblysheet.csv file with the following columns representing target assemblies and associated metadata.
- tag: A unique tag which represents the target assembly throughout the pipeline and in the final report
- fasta: FASTA file
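As a minimal sketch of such a sheet, the snippet below writes a two-row assemblysheet.csv with only the tag and fasta columns described above, then checks that every tag is unique. The tags (sampleA, sampleB) and the FASTA paths are placeholders chosen for illustration, and the uniqueness check is an optional helper, not part of the pipeline. The usage documentation remains the reference for the full column list.

```shell
# Write a two-row assemblysheet.csv; tags and FASTA paths are placeholders.
cat > assemblysheet.csv <<'EOF'
tag,fasta
sampleA,/path/to/sampleA.fasta
sampleB,/path/to/sampleB.fasta
EOF

# Tags must be unique: collect any tag that occurs more than once.
duplicate_tags=$(tail -n +2 assemblysheet.csv | cut -d, -f1 | sort | uniq -d)

if [ -n "$duplicate_tags" ]; then
    echo "Duplicate tags found: $duplicate_tags" >&2
else
    echo "All tags are unique"
fi
```

Replace the placeholder paths with the locations of your own assemblies before passing the file to --input.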
Now, you can run the pipeline using:
nextflow run plant-food-research-open/assemblyqc \
-profile <docker/singularity/.../institute> \
--input assemblysheet.csv \
--outdir <OUTDIR>
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
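To illustrate the -params-file route, the sketch below writes a small JSON parameters file. The keys mirror the --input and --outdir options from the run command above. The file name params.json and the ./results output directory are assumptions made for this example.

```shell
# Write pipeline parameters to a JSON file.
# Keys mirror the --input and --outdir CLI options; values are placeholders.
cat > params.json <<'EOF'
{
    "input": "assemblysheet.csv",
    "outdir": "./results"
}
EOF

# Show the file so the values can be reviewed before launching the pipeline.
cat params.json
```

The file can then be supplied as `-params-file params.json` in place of the individual `--input` and `--outdir` options on the `nextflow run` command line.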
Download the pipeline to your /workspace/$USER folder. Change the parameters defined in the pfr/params.json file. Submit the pipeline to SLURM for execution.
sbatch ./pfr_assemblyqc
plant-food-research-open/assemblyqc was originally written by Usman Rashid (@gallvp) and Ken Smith (@hzlnutspread).
Ross Crowhurst (@rosscrowhurst), Chen Wu (@christinawu2008) and Marcus Davy (@mdavy86) generously contributed their QC scripts.
Mahesh Binzer-Panchal (@mahesh-panchal) and Simon Pearce (@SPPearce) helped port the pipeline modules and sub-workflows to nf-core schema.
We thank the following people for their extensive assistance in the development of this pipeline:
The pipeline uses nf-core modules contributed by the following authors:
If you would like to contribute to this pipeline, please see the contributing guidelines.
If you use plant-food-research-open/assemblyqc for your analysis, please cite it as:
AssemblyQC: A Nextflow pipeline for reproducible reporting of assembly quality.
Usman Rashid, Chen Wu, Jason Shiller, Ken Smith, Ross Crowhurst, Marcus Davy, Ting-Hsuan Chen, Ignacio Carvajal, Sarah Bailey, Susan Thomson & Cecilia H Deng.
Bioinformatics. 2024 July 30. doi: 10.1093/bioinformatics/btae477.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.