conda install nextflow
git clone https://github.com/RasmussenLab/DoBSeqWF.git .

# Docker
nextflow run main.nf -profile test -stub -with-docker
# Apptainer
nextflow run main.nf -profile test -stub -with-apptainer
# Conda
nextflow run main.nf -profile test -stub -with-conda

# Docker
nextflow run main.nf -profile test -with-docker
# Apptainer
nextflow run main.nf -profile test -with-apptainer
# Conda
nextflow run main.nf -profile test -with-conda

The pipeline has multiple optional configurations, found in nextflow.config.
Configurations can be supplied in a configuration file (see config.json) and run with nextflow run main.nf -params-file config.json, or parameters can be added directly on the command line:
nextflow run main.nf \
(-with-docker/-with-apptainer/-with-conda) \
--pooltable <path to pool fastq file table> \
(--decodetable <path to pool decode tsv> \)
--reference_genome <path to indexed reference genome> \
--bedfile <path to bedfile with target regions> \
--ploidy <integer>

The pooltable.tsv should connect (user-assigned) pool IDs and their row/column arrangement to input FASTQ files; one tab-separated line for each pool.
pool_1 row path/to/sample1_R1.fq.gz path/to/sample1_R2.fq.gz
pool_2 row path/to/sample2_R1.fq.gz path/to/sample2_R2.fq.gz
pool_3 column path/to/sample3_R1.fq.gz path/to/sample3_R2.fq.gz
pool_4 column path/to/sample4_R1.fq.gz path/to/sample4_R2.fq.gz

The optional decodetable.tsv should map (user-assigned) individual IDs in the matrix to the corresponding row and column pool IDs; one entry for each element in the matrix.
individual1 pool_1 pool_3
individual2 pool_2 pool_3
individual3 pool_1 pool_4
individual4 pool_2 pool_4

The workflow will output a results folder containing multiple config-dependent output files:
results
├── pinpointables.vcf # Merged VCF file containing all assigned variants
├── cram/ # CRAM files for each pool
├── context/ # TSV and JSON files with matrix information and pool-individual linkage.
├── logs/ # Log files for each process
├── variants/ # VCF files for each pool
├── variant_tables/ # TSV files converted from pool VCFs
├── variant_compilation/ # TSV files with aggregated variants, annotations and rescue probabilities
└── pinpoint_variants/
├── all_pins/ # All pinpointables for each sample in individual vcfs (*note)
├── unique_pins/ # All unique pinpointables for each sample in individual vcfs (*note)
├── *_merged.vcf.gz # All pinpointables for all samples in a single vcf without sample information
├── summary.tsv # Variant counts for each sample
└── lookup.tsv # Variant to sample lookup table

A central file is pinpointables.vcf, which contains all individually assigned variants. Since each variant carries information from two pools, these are presented as two sample columns: ROW and COLUMN.
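As a sketch of that layout, here is a minimal mock record in the ROW/COLUMN shape and an awk one-liner pulling both sample columns. The file contents are illustrative only, not actual pipeline output:

```shell
# Minimal mock of a pinpointables.vcf record (illustrative only).
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tROW\tCOLUMN\n' > pinpoint_demo.vcf
printf 'chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/1\n' >> pinpoint_demo.vcf

# Print the position plus the genotype observed in the row pool and the column pool.
awk -F'\t' '!/^#/ {print $1":"$2, "row="$10, "col="$11}' pinpoint_demo.vcf
```

This prints `chr1:12345 row=0/1 col=0/1`; on real output, bcftools or similar tooling would be used instead of awk.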
Two annotation workflows are currently available: SnpEff for the VCF output files and VEP for the tabular compiled output. They can be enabled by adding the following configurations, with absolute paths, to the JSON params file.
config.json (comments are illustrative; strict JSON parsers require them to be removed):
{
    "annotate": true,                                     # Annotate output VCFs with SnpEff and ClinVar
    "snpeff_db": "GRCh38.99",
    "snpeff_config": "snpEff.config",
    "snpeff_cache": "cache/",
    "clinvar_db": "clinvar_20230903.vcf.gz",              # (*.tbi in same folder)
    "annotate_vep": true,                                 # Annotate tabular output with VEP
    "vep_cache": "cache/",                                # (Current version 111.x)
    # Optional VEP input:
    "danmac_db": "danmac.vcf.gz",                         # (*.tbi in same folder)
    "blacklist_bed": "hg38-blacklist.v2.sorted.bed.gz",   # (*.tbi in same folder)
    "repeatmasker_bed": "repeatmasker.sorted.bed.gz",     # (*.tbi in same folder)
    "gnomad_vcf": "gnomad.vcf.bgz",                       # (*.tbi in same folder)
    "utr_file": "uORF_5UTR_GRCh38_PUBLIC.txt",
    "alphamissense_tsv": "AlphaMissense_hg38.tsv.gz",
    "loftee_gerp_bw": "gerp_conservation_scores.homo_sapiens.GRCh38.bw",
    "loftee_human_ancestor": "human_ancestor.fa.gz",
    "loftee_sqlite": "loftee.sql"
}

DoBSeqWF
├── LICENSE
├── VERSION
├── README.md
├── assets
│ ├── data
│ │ ├── reference_genomes
│ │ │ └── small
│ │ │ └── small_reference.*
│ │ └── test_data # Test data
│ │ ├── coordtable.tsv
│ │ ├── decodetable.tsv
│ │ ├── pools
│ │ │ └── *.fq.gz
│ │ ├── pooltable.tsv
│ │ ├── snvlist.tsv
│ │ └── target_calling.bed
│ ├── filter/ # Filter model modules and parameters
│ └── helper_scripts
│ └── simulator.py # Script for simulating minimal pipeline data
├── bin # Executable pipeline scripts
│ └── <script>.*
├── conf
│ ├── container.config # Container registry addresses for computational tools.
│ ├── ngc.config # Configuration profile for the NGC-HPC compute environment.
│ └── profiles.config # Configuration profile for default environment.
├── envs
│ └── <name>/
│ └── environment.yaml # Conda environment definitions
├── lib/ # Pipeline groovy utility functions
├── main.nf # Main workflow
├── modules/
│ └── <module>.nf # Module scripts
├── subworkflows/
│ └── <subworkflow>.nf # Subworkflow scripts
├── next.pbs # Helper script for running on NGC-HPC
└── nextflow.config # Workflow parameters

Create a wrapper script for qsub so that you don't have to keep track of the working directory, group, etc. every time.
First, run mkdir ~/bin. Then save the following script as ~/bin/myqsub and make it executable with chmod +x ~/bin/myqsub.
#!/bin/bash
qsub -W group_list=icope_staging_r -A icope_staging_r -d $(pwd) "$@"
Add ~/bin to your path. You can have this done on log-in by appending the following line to your ~/.bashrc:
export PATH="$PATH:$HOME/bin"
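Put together, the wrapper setup above can be run as a single block (group and account names are copied from the example; adjust them for your project):

```shell
# Create the wrapper directory and write the myqsub script.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/myqsub" <<'EOF'
#!/bin/bash
qsub -W group_list=icope_staging_r -A icope_staging_r -d $(pwd) "$@"
EOF

# Make it executable and add it to PATH for the current session.
chmod +x "$HOME/bin/myqsub"
export PATH="$PATH:$HOME/bin"
```

The export line only affects the current shell; the ~/.bashrc entry above makes it permanent.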
git clone /ngc/projects/icope_staging_r/git/predisposed/.git .
bash next.pbs -params-file test_config.json -stub
bash next.pbs -params-file test_config.json
myqsub next.pbs -F "-params-file test_config.json"

While the pipeline is still under development, it makes sense to create a new clone for each pipeline run, to keep track of any changes made while it is running. I propose this folder structure:
predisposed
├── git/DoBSeqWF # Temporary local workflow repository
├── resources # Reference genome and target files.
├── data/ # Raw data for each batch
│ ├── <batch_id_I>/
│ │ └── *.fq.gz
│ ├── <batch_id_II>/
│ │ └── *.fq.gz
│ └── <batch_id_III>/
│ └── *.fq.gz
│ └── ...
└── processed_data/ # Processed data for each batch
├── <batch_id_I>/
│ ├── DoBSeqWF/ # Clone repository here
│ │ ├── config.json # Configuration file
│ │ ├── pooltable.tsv # Pool table
│ │ └── decodetable.tsv # Decode table
│ └── results
│ ├── cram/ # CRAM files for each pool
│ ├── logs/ # Log files for each process
│ ├── variants/ # VCF files for each pool
│ ├── variant_tables/ # TSV files converted from pool VCFs
│ └── pinpoint_variants/
│ ├── all_pins/ # All pinpointables for each sample in individual vcfs (*note)
│ ├── unique_pins/ # All unique pinpointables for each sample in individual vcfs (*note)
│ ├── *_merged.vcf.gz # All pinpointables for all samples in a single vcf without sample information
│ ├── summary.tsv # Variant counts for each sample
│ └── lookup.tsv # Variant to sample lookup table
├── <batch_id_II>/
│ ├── DoBSeqWF/
│ └── results/
├── <batch_id_III>/
│ ├── DoBSeqWF/
└── results/
└── ...

(*note) Each pinpointable variant can be represented by the horizontal or the vertical pool. In order not to lose any information, there are, for now, six VCF files for each sample: four with representations from either dimension, named {sample}_{pool}_{unique/all}_pins.vcf.gz, and two with all pins merged, named {sample}_{unique/all}.vcf.gz.
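To illustrate the naming patterns, brace expansion can enumerate the six files for a hypothetical sample `ind1` decoded from `pool_1` (row) and `pool_3` (column); the sample and pool IDs here are made up:

```shell
# Expand the per-sample file-name patterns for a hypothetical sample "ind1".
printf '%s\n' ind1_{pool_1,pool_3}_{unique,all}_pins.vcf.gz ind1_{unique,all}.vcf.gz
```

This lists ind1_pool_1_unique_pins.vcf.gz, ind1_pool_1_all_pins.vcf.gz, ind1_pool_3_unique_pins.vcf.gz, ind1_pool_3_all_pins.vcf.gz, ind1_unique.vcf.gz, and ind1_all.vcf.gz.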
cd /ngc/projects2/dp_00005/data/predisposed/
mkdir -p data/<batch_id> processed_data/<batch_id>
mv /ssi/fastq/data /ngc/projects2/dp_00005/data/predisposed/data/<batch_id>/
cd processed_data/<batch_id>
git clone /ngc/projects2/dp_00005/data/predisposed/git/DoBSeqWF
cd DoBSeqWF
bash assets/helper_scripts/create_pooltable.sh ../../../data/<batch_id>/

Fill out config.json with the correct paths and parameters. The decode table is not needed for mapping-only runs. Look in nextflow.config for the parameters that can be set in config.json.
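A minimal config.json for a mapping-only run might look like the following sketch. The parameter names are taken from the command-line example above; the paths are placeholders, and nextflow.config remains the authoritative list of parameters and defaults:

```shell
# Write a hypothetical mapping-only config.json (placeholder paths).
cat > config.json <<'EOF'
{
  "pooltable": "pooltable.tsv",
  "reference_genome": "/path/to/indexed/reference.fa",
  "bedfile": "/path/to/target_regions.bed",
  "ploidy": 2
}
EOF
```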
myqsub next.pbs -F "-params-file config.json"
tail nextflow.log

If the pipeline fails, it is likely due to resource constraints. Adjust as needed in conf/profiles.config under the NGC profile, and rerun the PBS script. Be aware that any direct edits to the workflow scripts, i.e. modules and subworkflows, can trigger a complete re-run of the pipeline.