LCWGS_pipeline is a low-coverage whole-genome sequencing (lcWGS) pipeline that processes low- and high-coverage paired-end Illumina short-read sequencing data as well as DNA microarray data.
- Low-coverage (lc):
  - Preprocessing: adapter trimming using `trimmomatic` and duplicate removal using `fastuniq`
  - Fastqc
  - Subsampling using `seqtk`
  - Breadth of coverage (see `mosdepth`)
  - Depth of coverage
  - Jellyfish k-mer analysis of error rate
  - Duplication rate
  - `QUILT` imputation with a user-supplied reference panel
- High-coverage (hc):
  - Preprocessing
  - Chunking up large fastqs for separate alignment using `seqkit`, followed by merging
  - Variant calling with `GATK`
- DNA microarray (chip):
  - Conversion to vcf file format and basic QC
  - Preparation for imputation on the servers
- Others:
  - Various plotting, filtering and calculation utilities (though still under-developed)
- A `pipelines/config.json` file to specify sample names, etc. You should modify this file, which sits under the `pipelines` directory.
- A `data/sample_tsvs/samples_lc.tsv`, a `data/sample_tsvs/chip.tsv` and a `data/sample_tsvs/samples_hc.tsv` file that store sample name information for lc, chip and hc samples, respectively. Note that each part of the analysis can be run separately, and three corresponding snakemake master files are in place for separate data processing.
- A set of `fastq` files for lc and hc samples.
- A set of chip files for data processing. Our data requires a genotype file, a sample file and an annotation file.
- An indexed `fa` file, or several `fa` files if `concatenate` is set to `True` to join them up.
- If k-mer analysis is to be performed, high-coverage jellyfish `jf` files should be provided for `classify-kmers` to find errors.
- If imputation is to be performed, a reference panel (as `vcf` files) and allele frequency files (a bash script under `scripts/extract_gnomAD_MAF.sh` is provided) should be supplied. This part is facilitated by Prof. Robert Davies' QUILT package (https://github.com/rwdavies/QUILT) and the QUILT-wrap code (https://github.com/rwdavies/QUILT-wrap).
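As an aside, the sample TSVs above can be very simple — for example, a minimal `data/sample_tsvs/samples_lc.tsv` might just list one sample name per line (the exact column layout here is an assumption; adapt it to what the pipeline rules actually parse):

```
lc_sample_1
lc_sample_2
lc_sample_3
```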
- General:
  - `ref38`: Ready-to-use reference file
  - `concatenate`: `True` if several reference files are to be concatenated
  - `sample_linker`: A linker file that gives the correspondence between samples if they have multiple names
  - `clean_fastq`: `True` if preprocessing of fastq files is required — drop duplicates, trim adapters, etc.
  - `adapter`: File for performing adapter trimming
  - `reheader`: `True` if the user has self-defined headers to substitute in after alignment; these header files should be put in `data/bam_headers/`
- Lc:
  - `samples_lc`: All low-coverage (lc) sample names
  - `subsample_depth_1x`: `10666667`, the number of reads of length 150 that corresponds to 1x coverage
  - `subsample_depth`: Depth to which subsampling is performed
  - `num_coverage`: Length of the cumulative-sum coverage array (skew plot)
  - `rmdup`: `True` if duplicates are to be removed before lc imputation
  - `rename_samples`: Used to be in rule concat in `imputation.smk`, but currently deprecated
  - `rename_samples_file`: Used to be in rule concat in `imputation.smk`, but currently deprecated
  - `rm_bed_regions`: `True` if regions of the genome are to be removed in coverage analysis. Specific to the plot_sequencing_skew utility (likely to be optimised later)
  - `bed_regions`: Regions to be removed
  - `access_bed`: Accessible regions from the 1KG accessibility track. This is considered more generally in coverage analysis
  - `panels`: List of reference panels to be used for imputation
  - `QUILT_HOME`: Path to where the `QUILT` software resides
  - `ANALYSIS_DIR`: Path to the analysis folder. Should be `results/imputation/`
  - `RECOMB_POP`: Three-letter 1KG abbreviation for the population to be used
  - `NGEN`: Number of generations
  - `WINDOWSIZE`: Size of imputation chunks
  - `BUFFER`: Imputation buffer size
  - `BAMLIST`: Path to the `bamlist.txt` file which stores all bam files to be imputed
  - `PANEL_NAME`: Name of the current imputation panel
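The `subsample_depth_1x` value can be sanity-checked with a quick back-of-the-envelope calculation: 1x mean coverage of a roughly 3.2 Gb human genome with 150 bp paired-end reads (300 bp per read pair) needs about 3.2e9 / 300 ≈ 10,666,667 read pairs. A minimal sketch (the 3.2 Gb genome size is an assumption):

```python
import math

# Back-of-the-envelope check for the subsample_depth_1x config value.
GENOME_SIZE = 3_200_000_000  # approximate human genome length in bp (assumption)
READ_LENGTH = 150            # bp per read
READS_PER_PAIR = 2           # paired-end sequencing

def reads_for_coverage(coverage: float) -> int:
    """Number of read pairs needed to reach the given mean coverage."""
    bases_needed = coverage * GENOME_SIZE
    return math.ceil(bases_needed / (READ_LENGTH * READS_PER_PAIR))

print(reads_for_coverage(1.0))  # 10666667, matching subsample_depth_1x
```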
- Hc:
  - `samples_hc`: All high-coverage (hc) sample names
  - `make_chunk`: `True` if the hc files need to be chunked
  - `fastq_chunk_size`: Chunk size
  - `hc_panel`: Name of the reference panel to be used by GATK HaplotypeCaller to call variants. Only sites in these vcfs will be called, to ease comparison
  - `bqsr_known_sites`: Known sites for GATK BQSR, used before variant calling
- Chip:
  - `chip_genotypes`: Chip genotype file
  - `chip_samples`: Chip sample file
  - `chip_annotation`: Chip annotation file
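Putting the keys above together, a trimmed-down `pipelines/config.json` might look like the following (all values here are placeholders and the exact structure is an assumption; the real file in the repository is authoritative):

```json
{
    "ref38": "data/references/GRCh38.fa",
    "concatenate": "False",
    "clean_fastq": "True",
    "samples_lc": "data/sample_tsvs/samples_lc.tsv",
    "subsample_depth_1x": 10666667,
    "subsample_depth": 1,
    "panels": ["oneKG"],
    "QUILT_HOME": "software/QUILT/",
    "ANALYSIS_DIR": "results/imputation/",
    "RECOMB_POP": "CEU",
    "NGEN": 100,
    "WINDOWSIZE": 5000000,
    "BUFFER": 1000000,
    "samples_hc": "data/sample_tsvs/samples_hc.tsv",
    "chip_genotypes": "data/chip/genotypes.txt"
}
```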
- For now, the whole pipeline is separated into different snakemake files, each grouping a set of jobs together. Three master files, for lc, hc and chip analysis respectively, are in place. To run a specific rule in a specific file, use, for example, `snakemake -s pipelines/master_lc.smk -c 1 alignment_all` (needless to say, don't forget to dry-run first with `-n`).
- Alternatively, a `submit_snakemake.sh` submission script is provided specifically for cluster users. You should first modify the files in the `slurm/` folder to enable correct communication between snakemake and your job management system. After that, you can run this script with 0, 3 or 4 parameters:
  - If no parameter is passed in, it generates the DAG for all rules.
  - If multiple parameters are passed in, the first should be the name of the master file to run (lc, hc or chip), the second the number of cores required, and the third the name of the snakemake rule file to be run (e.g. alignment, chip_qc, etc.). The fourth is for other options passed on to snakemake, like the dry-run option `-nr`. For example, you can submit with `./submit_snakemake.sh lc 8 alignment_all`, which will run all rules up to the end of `alignment_all`. It will also run preprocessing rules, as alignment requires pre-cleaning of the data. However, running multiple snakemake files at once is discouraged, as between some rules the user might need to manually input something or run some scripts. It is always best to run the pipeline at as granular a level (per snakemake rule file) as possible.
Online documentation is currently UNAVAILABLE, as it is low on the priority list :)
However, feel free to reach out, submit bugs, and/or make suggestions for improving this library. All are welcome!
The pipeline requires a variety of utilities and packages, including but not limited to:
- A Python 3.8+ environment with the following packages installed: numpy, pandas, matplotlib, seaborn, statsmodels, scipy and lcwgsus.
- Other required software includes QUILT, QCTOOL_v2, Iorek, GATK4, bwa, samtools, R, sqlite3, seqkit, bedtools, fastqc, multiqc, jellyfish, fastuniq, trimmomatic, seqtk, bcftools, etc.
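If you manage dependencies with conda, most of the command-line tools above are available from the bioconda and conda-forge channels. A sketch of an `environment.yml` follows (package names are assumed from bioconda; some tools such as QUILT, QCTOOL_v2, Iorek and lcwgsus may need to be installed manually from their own repositories):

```yaml
name: lcwgs_pipeline
channels:
  - conda-forge
  - bioconda
dependencies:
  - python>=3.8
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - statsmodels
  - scipy
  - bwa
  - samtools
  - bcftools
  - gatk4
  - seqkit
  - bedtools
  - fastqc
  - multiqc
  - jellyfish
  - fastuniq
  - trimmomatic
  - seqtk
  - mosdepth
```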
LCWGS_pipeline is not a published work, so citing the GitHub repo suffices.
See the main site of LCWGS_pipeline: https://github.com/Suuuuuuuus/LCWGS_pipeline.
Bugs should be submitted to the issue tracker. Please kindly provide a REPRODUCIBLE example demonstrating the problem, otherwise debugging is foolhardy :)