# nf-ATAC

An integrated pipeline for ATAC-seq data, written in Nextflow :heart:

Author: Stephen Zhang (stephen.zhang@monash.edu)
Date: 5 Feb 2018

## Introduction

nf-ATAC is a pipeline for processing ATAC-seq data, written as a [Nextflow](https://www.nextflow.io/) script.

The pipeline is currently in its early stages, so this README will be updated regularly (check often!).

Have a problem? Please log an issue on [GitHub](https://github.com/zsteve/atac-seq-pipeline).
## Dependencies

Please make sure these tools are installed before running the pipeline:

* MACS2
* FastQC
* cutadapt
* bowtie2
* picard/2.8.2
* samtools
* homer (please add to `$PATH`)
* jvarkit (only `samjs` is needed)
* snakeyaml (please add to `$CLASSPATH`)
* sambamba (please add to `$PATH`)

For QC, we require the following:

* ATACseqQC (along with the `BSgenome` and `TxDb` annotation packages for your reference genome; see Configuration below)

You can check that most dependencies are installed by running `checkdep.sh`.

*At the current time, please manually confirm that `snakeyaml` is installed!*
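The bundled `checkdep.sh` performs this check for you; its actual logic may differ, but a minimal sketch of such a dependency check looks like this (the tool list below is illustrative, not exhaustive):

```shell
#!/bin/sh
# Hypothetical sketch of a dependency check; the shipped checkdep.sh may differ.
# Reports whether each required tool can be found on $PATH.
for tool in macs2 fastqc cutadapt bowtie2 samtools sambamba; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool"
  else
    echo "MISSING: $tool"
  fi
done
```

Note that Java libraries such as `snakeyaml` live on `$CLASSPATH` rather than `$PATH`, which is why they cannot be detected this way and must be confirmed manually.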
## Installing Nextflow

Nextflow can be downloaded using the following command:

```
curl -s https://get.nextflow.io | bash
```

This will create a `nextflow` binary in the working directory. You can add this binary to your `PATH` for ease of use:

```
export PATH=$PATH:[your path here]
```

The pipeline is executed by running `nextflow`, specifying the script and the relevant command-line arguments:

```
nextflow <script>.nf <command line arguments>
```
## Running the pipeline - single sample

### Data preparation

Paired-end read sample data in `.fastq.gz` format should be located in a directory with the desired sample name. Read pairs should be distinguishable in the format `*_R{1,2}*.fastq.gz`.
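For example, a sample named `mySample` might be laid out as follows (the file names are hypothetical; any pair matching `*_R{1,2}*.fastq.gz` works):

```
mySample/
├── mySample_R1_001.fastq.gz
└── mySample_R2_001.fastq.gz
```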
### Configuration

A few parameters must be specified correctly in `config.yaml` before running the pipeline - things will not work without them:

* `macs2`: `--gsize` must be specified for `macs2` to correctly call peaks.
* `qc_report`: `bsgenome` and `txdb` must be specified for QC report generation using `ATACseqQC` to work.
  * `bsgenome` must specify the `BSgenome` Biostrings package corresponding to the reference genome.
  * `txdb` must specify the `GenomicFeatures` (TxDb) package containing transcript annotations for the reference genome.
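As a sketch, a `config.yaml` for a human (hg38) run might look like the following. The key names `macs2`, `gsize`, `qc_report`, `bsgenome`, and `txdb` come from the description above; the exact nesting and the example values are assumptions, so check them against the `config.yaml` shipped with the pipeline:

```yaml
# Hypothetical config.yaml sketch for an hg38 run.
macs2:
  gsize: hs                                # effective genome size passed to macs2 --gsize
qc_report:
  bsgenome: BSgenome.Hsapiens.UCSC.hg38    # BSgenome package for the reference genome
  txdb: TxDb.Hsapiens.UCSC.hg38.knownGene  # TxDb package with transcript annotations
```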
### Command

Nextflow will create a `work` directory (containing pipeline data) in its working directory (i.e. `.`). Final pipeline output files will be written to a directory of your choice; however, these will generally be symlinks to the actual copies of the files within `work/**/your_file_here`. It is very important that `work` does not get deleted - otherwise your symlinks will be broken!

```
nextflow atac_pipeline.nf --num-cpus $NUM_CPUS \
    --jvarkit-path $JVARKIT_PATH \
    --input-dir $INPUT_DIR \
    --output-dir $OUTPUT_DIR \
    --config-file $CONFIG_FILE \
    --ref-genome-name $GENOME_NAME \
    --ref-genome-index $GENOME_INDEX \
    --ref-genome-fasta $GENOME_FASTA
```
* `NUM_CPUS` - maximum number of CPUs to use for the entire pipeline
* `INPUT_DIR` - path of the directory containing R1,R2 data
* `OUTPUT_DIR` - path of the directory to write outputs to (will be created if it doesn't already exist). This can be the same as `INPUT_DIR`.
* `CONFIG_FILE` (OPTIONAL) - path to `config.yaml` (in case one wants custom parameters for pipeline components)
* `GENOME_NAME` - name of the reference genome (e.g. `danRer10`, `hg18`)
* `GENOME_INDEX` - path to `bowtie2` indexes for the reference genome
* `GENOME_FASTA` - path to the `FASTA` sequence of the reference genome
* `JVARKIT_PATH` - path to the installation of `jvarkit`

Nextflow will write its output to your directory of choice.
## Running the pipeline - multiple samples

### Data preparation

For each sample, create a folder `SAMPLE_ID/` containing the paired-end read data in `.fastq.gz` format. Then create a sample table as a text file, where each line corresponds to one sample with the following fields:

```
[Sample_ID] [path to sample input directory] [path to sample output directory]
```
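For instance, a table for two samples might look like this (sample names and paths are purely illustrative):

```
sampleA /data/raw/sampleA /data/results/sampleA
sampleB /data/raw/sampleB /data/results/sampleB
```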
### Command

The pipeline will read samples from the sample table `.txt` file and attempt to process them in parallel.

```
nextflow atac_pipeline.nf --num-cpus $NUM_CPUS \
    --jvarkit-path $JVARKIT_PATH \
    --config-file $CONFIG_FILE \
    --multi-sample \
    --sample-table $SAMPLE_TABLE \
    --ref-genome-name $GENOME_NAME \
    --ref-genome-index $GENOME_INDEX \
    --ref-genome-fasta $GENOME_FASTA
```