Nextflow handles job submissions on SLURM or other environments, and supervises running the jobs. Thus the Nextflow process must run until the pipeline is finished. We recommend that you put the process running in the background through screen
/ tmux
or similar tool. Alternatively you can run nextflow within a cluster job submitted your job scheduler.
It is recommended to limit the Nextflow Java virtual machines memory. We recommend adding the following line to your environment (typically in ~/.bashrc
or ~./bash_profile
):
NXF_OPTS='-Xms1g -Xmx4g'
The typical command for running the pipeline is as follows:
nextflow run dolphinnext/chipseq -profile docker --DOWNDIR /path/to/save/genome-data --reads '*_R{1,2}.fastq.gz' --mate 'pair' --genome_build mouse_mm10_refseq
If you're running for the first time, you need to enable --run_checkAndBuild
paramater as follows:
nextflow run dolphinnext/chipseq -profile docker --DOWNDIR /path/to/save/genome-data --reads '*_R{1,2}.fastq.gz' --mate 'pair' --genome_build mouse_mm10_refseq --run_checkAndBuild 'yes'
This will launch the pipeline with the docker
configuration profile. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
work # Directory containing the nextflow working files
results # Finished results
.nextflow_log # Log file from Nextflow
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version. In order to download latest version of the pipeline you need to run following command:
nextflow pull dolpinnext/chipseq
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Note that multiple profiles can be loaded, for example: -profile docker, test
- the order of arguments is important!
If -profile
is not specified at all the pipeline will be run locally and expects all software to be installed and available on the PATH
.
docker
- A generic configuration profile to be used with Docker
- Pulls software from Dockerhub:
dolphinnext/chipseq
singularity
- A generic configuration profile to be used with Singularity
- Pulls software from DockerHub
test
- A profile with a complete configuration for automated testing
## First, download sample fastq files into `inputs` folder with the following command: mkdir -p inputs && cd inputs && wget https://galaxyweb.umassmed.edu/pub/dnext_data/test/reads/control_rep1_sm.fastq.gz https://galaxyweb.umassmed.edu/pub/dnext_data/test/reads/exper_rep1_sm.fastq.gz && cd .. ## Start testing pipeline: nextflow run dolphinnext/chipseq -profile docker,test ## In the test profile, --reads parameter assinged as: 'inputs/*.fastq.gz'
Use this to specify the location of your input FastQ files. For example:
--reads 'path/to/data/sample_*_{1,2}.fastq' --mate 'pair'
--reads 'path/to/data/sample_*.fastq' --mate 'single'
Please note the following requirements:
- The path must be enclosed in quotes
- The path must have at least one
*
wildcard character - When using the pipeline with paired end data, the path must use
{1,2}
notation to specify read pairs.
Two options (single or pair) available for --mate
parameter. If you have single-end data, you need to specify as 'single' and for paired-end data, you need to specify as 'pair'. For example:
--reads 'path/to/data/sample_*_{1,2}.fastq' --mate 'pair'
--reads 'path/to/data/sample_*.fastq' --mate 'single'
It is not possible to run a mixture of single-end and paired-end files in one run.
There are 5 different species supported in the UMMS-Biocore references. To run the pipeline, you must specify which to use with the --genome_build
flag.
List of genomes that are supported are:
- Human
--genome_build human_hg19_refseq
--genome_build human_hg38_gencode_v28
- Mouse
--genome_build mouse_mm10_refseq
- Rat
--genome_build rat_rn6_refseq
--genome_build rat_rn6_ensembl_v86
- Zebrafish
--genome_build zebrafish_GRCz11_ensembl_v95
--genome_build zebrafish_GRCz11_refseq
- C. elegans
--genome_build c_elegans_ce11_ensembl_ws245
Note: For new genome requests, please send e-mail to UMMS-Biocore(biocore@umassmed.edu).
If your indexes are not build before, you can enable --run_checkAndBuild
by assinging it's value to 'yes' which will check genome files in --DOWNDIR
and download into that directory. Afterwards it will start building indexes based on the selected parameters in the pipeline.
If you prefer, you can specify the full path to your reference genome disable --run_checkAndBuild
option.
--genome '[path to Fasta reference]' \
--genome_sizes '[path to genome_sizes file]' \
--gtf '[path to GTF file]' \
--bed '[path to bed12 file]' \
--commondb '[path to commondb directory when Bowtie/Bowtie2 indexes found for common reads (eg. ercc, rmsk etc.)] \
--star_index '[path to STAR index]' \
--bowtie_index '[path to Bowtie index]' \
--bowtie2_index '[path to Bowtie index]' \
By default, the pipeline uses Bowtie2 to align the raw FastQ reads to the reference genome.
By default, peaks are detected by tool called MACS2. MACS2 can be easily used for ChIP-Seq data alone, or with control sample with the increase of specificity. You need to enter sample names in array format as shown at below.
##Sample Definitions for example files: exper_rep1.fq, exper_rep2.fq, exper_rep3.fq, control_rep1.fq, control_rep2.fq, control_rep3.fq
--ChIP_Module_ChIP_Prep.output_prefix [array]
# Output files/tables will be created by using this output_prefix. eg.["experiment","control"]
--ChIP_Module_ChIP_Prep.sample_prefix [array]
# Use prefix of the sample to match files. You can use comma separated format to enter multiples files. eg.["exper_rep1,exper_rep2", "exper_rep3"]
--ChIP_Module_ChIP_Prep.input_prefix [array]
# Use prefix of the input (control) to match files. You can use comma separated format to enter multiples files. eg.["control_rep1,control_rep2", "control_rep3"]
--ChIP_Module_ChIP_Prep.macs2_callpeak_parameters [string @default=""]
# MACS2 callpeak parameters that found in their [documentation](https://github.com/taoliu/MACS)
--ChIP_Module_ChIP_Prep.peak_calling_type [string @options:"narrow","broad" @default:"narrow"]
# MACS2 peak calling type
--ChIP_Module_ChIP_Prep.band_width [integer @default:300]
# Band width for picking regions to compute fragment size.
--ChIP_Module_ChIP_Prep.compare_Custom_Bed [string @default:""]
# Enter custom bed file <full path> for comparison
Optionally, you can enable scripture and trim the peaks to increase the statistical power of the subsequent analysis.
--run_Scripture [string @options:"yes","no" @default:"no"]
--ChIP_Module_Scripture_peakrescore.window [integer @default:200]
--ChIP_Module_Scripture_peakrescore.trimFraction [fraction @default:0.15]
You can adjust bedtools coverage parameters for final count table.
--ChIP_Module_ChIP_Prep.bedtoolsCoverage_Parameters [string @default:"-sorted -nobuf -hist"]
Optinally, detected peaks analyzed by HOMER.
--Homer_Find_Motif_Module_homer_find_Motifs_Genome.size [integer @default:200]
# Size of the region for motif finding findMotifsGenome.pl of [HOMER](http://homer.ucsd.edu/homer/motif/). If you wish to find motifs using your peaks using their exact sizes, use the option "200"). However, for Transcription Factor peaks, most of the motifs are found +/- 50-75 bp from the peak center, making it better to use a fixed size rather than depend on your peak size.
If specific Adapter Removal is required, you can enable trimmomatic and enter the adapter sequence.
To enable adapter_removal:
--run_Adapter_Removal "yes"
--Adapter_Trimmer_Quality_Module_Adapter_Removal.phred = [@options:33,64 @default:33]
# Specifies the fastq quality encoding. Default is 33 which is now almost universally used, and 64 which is used in some older Illumina data
--Adapter_Trimmer_Quality_Module_Adapter_Removal.Adapter_Sequence [string]
# You can enter a single sequence or multiple sequences in different lines. Reverse sequences will not be removed.
--Adapter_Trimmer_Quality_Module_Adapter_Removal.min_length [int @default:10]
# Specifies the minimum length of reads to be kept
--Adapter_Trimmer_Quality_Module_Adapter_Removal.seed_mismatches [int @default:1]
# Specifies the maximum mismatch count which will still allow a full match to be performed
--Adapter_Trimmer_Quality_Module_Adapter_Removal.palindrome_clip_threshold [int @default:30]
# Specifies how accurate the match between the two -adapter ligated- reads must be for PE palindrome read alignment
--Adapter_Trimmer_Quality_Module_Adapter_Removal.simple_clip_threshold [int @default:5]
# Specifies how accurate the match between any adapter etc. sequence must be against a read.
--Adapter_Trimmer_Quality_Module_Adapter_Removal.discard_non_clipped [@options:"yes","no" @default:"yes"]
# Discard_non_clipped sequences (keep only sequences which contained the adapter)
Optianally, you can trim your reads by defining trimming lenghts as shown at below:
--run_Trimmer [@options:"yes","no" @default:"no"]
# Enables Trimmer by setting this parameter as "yes"
--Adapter_Trimmer_Quality_Module_Trimmer.phred = [@options:33,64 @default:33]
# Specifies the fastq quality encoding. Default is 33 which is now almost universally used, and 64 which is used in some older Illumina data
For Single End Reads :
--Adapter_Trimmer_Quality_Module_Trimmer.single_or_paired_end_reads "single"
--Adapter_Trimmer_Quality_Module_Trimmer.trim_length_5prime [int]
--Adapter_Trimmer_Quality_Module_Trimmer.trim_length_3prime [int]
For Paired End Reads :
--Adapter_Trimmer_Quality_Module_Trimmer.single_or_paired_end_reads "pair"
--Adapter_Trimmer_Quality_Module_Trimmer.trim_length_5prime_R1 [int]
--Adapter_Trimmer_Quality_Module_Trimmer.trim_length_3prime_R1 [int]
--Adapter_Trimmer_Quality_Module_Trimmer.trim_length_5prime_R2 [int]
--Adapter_Trimmer_Quality_Module_Trimmer.trim_length_3prime_R2 [int]
Optianally, you can trim your reads based on their quality. Trimmomatic works on both paired-end and single ended data. Alternatively fastx option (fastx_toolkit) could be used for single reads.
To use Trimmomatic :
--run_Quality_Filtering "yes"
--Adapter_Trimmer_Quality_Module_Quality_Filtering.phred = [@options:33,64 @default:33]
# Specifies the fastq quality encoding. Default is 33 which is now almost universally used, and 64 which is used in some older Illumina data
--Adapter_Trimmer_Quality_Module_Quality_Filtering.tool "trimmomatic"
--Adapter_Trimmer_Quality_Module_Quality_Filtering.window_size [int @default:10]
# Performs a sliding window trimming approach. It starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold (=required_quality).
--Adapter_Trimmer_Quality_Module_Quality_Filtering.required_quality_for_window_trimming [int @default:15]
# Specifies the average quality required for window trimming approach
--Adapter_Trimmer_Quality_Module_Quality_Filtering.leading [int @default:5]
# Cut bases off the start of a read, if below a threshold quality
--Adapter_Trimmer_Quality_Module_Quality_Filtering.trailing [int @default:5]
# Cut bases off the end of a read, if below a threshold quality
--Adapter_Trimmer_Quality_Module_Quality_Filtering.minlen [int @default:36]
# Specifies the minimum length of reads to be kept
To use fastx_toolkit :
--run_Quality_Filtering "yes"
--Adapter_Trimmer_Quality_Module_Quality_Filtering.tool "fastx"
--Adapter_Trimmer_Quality_Module_Quality_Filtering.minQuality [int @default:20]
# Minimum quality score to keep reads
--Adapter_Trimmer_Quality_Module_Quality_Filtering.minPercent [int @default:100]
# Minimum percent of bases that must have entered minQuality
Optianally,Bowtie2/Bowtie/STAR is used to count or filter out common reads (eg. rmsk, ercc etc.). You need to specify mapping set by entering following paramters in array format.
--run_Sequential_Mapping "yes"
--Sequential_Mapping_Module_Sequential_Mapping.remove_duplicates [@options:"yes","no" @default:"no"]
# Duplicates (both PCR and optical) will be removed from alignment file (bam) and separate count table will be created for comparison
--Sequential_Mapping_Module_Sequential_Mapping._select_sequence [array @options:"ercc","rmsk", "custom"]
# Sequence Set for Mapping. eg. ["ercc", "rmsk", "custom"]
--Sequential_Mapping_Module_Sequential_Mapping.index_directory [array]
# If custom sequence is defined please enter index directory of custom sequence(full path), otherwise you need to enter empty string. The index directory must include the full path and the name of the index file must only be the prefix of the fasta or index file. Index files and fasta files also need to have the same prefix.For STAR alignment, gtf file which has the same prefix, must be found in same directory. eg. ["", "", "/share/custom_seq_dir"]
--Sequential_Mapping_Module_Sequential_Mapping.name_of_the_index_file [array]
# If custom sequence is defined please enter name of the index or fasta file (prefix), otherwise you need to enter selected sequence as string. eg. ["ercc", "rmsk", "custom_seq_prefix"]
--Sequential_Mapping_Module_Sequential_Mapping._aligner = [array @options:"bowtie","bowtie2" @default:"bowtie2"]
# Aligner set for mapping: eg. ["bowtie", "bowtie2", "bowtie2"]
--Sequential_Mapping_Module_Sequential_Mapping.aligner_Parameters [array]
# Aligner parameters." eg. ["--threads 1","-N 1","-N 1"]
--params.Sequential_Mapping_Module_Sequential_Mapping.description [array]
# Description of index file (please don't use comma or quotes in this field). eg. ["ercc", "rmsk", "custom_seq_explanation"]
--Sequential_Mapping_Module_Sequential_Mapping.filter_Out = "[array @options:"Yes","No" @default:"Yes"]
# Select whether or not you want the reads mapped to this index filtered out of your total reads.
Optionally, you can convert bam files to TDF for IGV Genome Browser visualization by using IGVtools.
--run_IGV_TDF_Conversion "yes"
## For RSEM BAM output
--BAM_Analysis_RSEM_IGV_BAM2TDF_converter.igv_extention_factor [int @default:0]
# The read or feature is extended by the specified distance in bp prior to counting. This option is useful for chip-seq and rna-seq applications. The value is generally set to the average fragment length of the library minus the average read length.
--BAM_Analysis_RSEM_IGV_BAM2TDF_converter.igv_window_size [int @default:5]
# The window size over which coverage is averaged.
## For Tophat2 BAM output
--BAM_Analysis_Tophat2_IGV_BAM2TDF_converter.igv_extention_factor [int @default:0]
--BAM_Analysis_Tophat2_IGV_BAM2TDF_converter.igv_window_size [int @default:5]
## For STAR BAM output
--BAM_Analysis_STAR_IGV_BAM2TDF_converter.igv_extention_factor [int @default:0]
--BAM_Analysis_STAR_IGV_BAM2TDF_converter.igv_window_size [int @default:5]
## For HISAT2 BAM output
--BAM_Analysis_Hisat2_IGV_BAM2TDF_converter.igv_extention_factor [int @default:0]
--BAM_Analysis_Hisat2_IGV_BAM2TDF_converter.igv_window_size [int @default:5]
Optionally, you can convert bam files to bigwig files for UCSC Genome Browser visualization.
--run_BigWig_Conversion "yes"
Optionally, you can convert bam files to bigwig files for UCSC Genome Browser visualization.
--run_BigWig_Conversion "yes"
Optionally, you can enable RSeQC to calculate how mapped reads were distributed over genome feature (like CDS exon, 5’UTR exon, 3’ UTR exon, Intron, Intergenic regions).
--run_RSeQC "yes"
The output directory where the results will be saved.
Specify this when restarting a pipeline. Nextflow will used cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
You can also supply a run name to resume a specific run: -resume [run-name]
. Use the nextflow log
command to show previous run names.
Specify the path to a specific config file (this is a core NextFlow command).
Note - you can use this to override pipeline defaults.
The dolphinnext/tools
repository contains some scripts used by the pipeline which may also be run manually:
gtf2bed
- Script used to generate the BED12 reference files used by RSeQC. Takes a
.gtf
file as input
- Script used to generate the BED12 reference files used by RSeQC. Takes a