An integrated pipeline for ATAC-seq data written in Nextflow with :heart:

Author: Stephen Zhang

Date: 5 Feb 2018

Introduction

nf-ATAC is a pipeline for processing ATAC-seq data, written in Nextflow script. It is currently in its early stages, and this README will be updated regularly (check often!).

Have a problem? Please log an issue on GitHub.

Dependencies

Please make sure these tools are installed before running the pipeline:

For QC, we require the following:

One can check that most dependencies are installed by running

* At the current time, please manually confirm that snakeyaml is installed!

Installing Nextflow

Nextflow can be downloaded by using the following command:

curl -s | bash

This will create a binary nextflow in the working directory. You can add this binary to your PATH for ease of use:

export PATH=$PATH:[your path here]

The pipeline can be executed by running nextflow, specifying the script and the relevant command-line arguments.

nextflow <script>.nf <command line arguments>

Running the pipeline - single sample

Data preparation

Paired-end read sample data in .fastq.gz format should be located in a directory with the desired sample name. Read pairs should be distinguishable in the format *_R{1,2}*.fastq.gz.
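The naming convention above can be checked before launching the pipeline. The following is a minimal sketch, not part of nf-ATAC: check_read_pairs is a hypothetical helper that counts files matching *_R1*.fastq.gz and *_R2*.fastq.gz in a sample directory and fails when the counts differ or no reads are found.

```shell
# Hypothetical helper (not part of nf-ATAC): report how many R1 and R2
# files a sample directory contains, and fail if the counts do not match.
check_read_pairs() {
    sample_dir="$1"
    # Count files matching the *_R1*.fastq.gz and *_R2*.fastq.gz patterns.
    r1_count=$(find "$sample_dir" -maxdepth 1 -name '*_R1*.fastq.gz' | wc -l)
    r2_count=$(find "$sample_dir" -maxdepth 1 -name '*_R2*.fastq.gz' | wc -l)
    # Normalise the counts (some wc implementations pad with spaces).
    r1_count=$((r1_count))
    r2_count=$((r2_count))
    echo "R1=${r1_count} R2=${r2_count}"
    # Succeed only when at least one pair exists and the counts agree.
    [ "$r1_count" -gt 0 ] && [ "$r1_count" -eq "$r2_count" ]
}
```

For example, check_read_pairs path/to/SAMPLE_ID prints the two counts and returns a non-zero status if a mate file is missing. It only compares counts; it does not verify that each R1 file has a correspondingly named R2 file.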


A few parameters must be specified correctly in config.yaml before running the pipeline. The pipeline will not work without them:

  • macs2 : --gsize must be specified for macs2 to correctly call peaks.
  • qc_report : bsgenome, txdb must be specified for QC report generation using ATACseqQC to work. bsgenome must specify the BSgenome (Biostrings-based) package corresponding to the reference genome. txdb must specify the GenomicFeatures (TxDb) package containing transcript annotations for the reference genome.


Nextflow will create a work directory (containing pipeline data) in the directory it is run from (i.e. .). Final pipeline output files are written to the output directory you specify; however, these will generally be symlinks to the actual copy of the file within work/**/your_file_here. It is very important that work does not get deleted, otherwise your symlinks will be left dangling!
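Because the outputs are symlinks into work, it can be useful to check that none of them have been left dangling. The following is a minimal sketch, not part of nf-ATAC: find_broken_links is a hypothetical helper that lists every symlink under a directory whose target no longer exists.

```shell
# Hypothetical helper (not part of nf-ATAC): print each symlink under
# the given directory whose target is missing.
find_broken_links() {
    out_dir="$1"
    # -type l selects symlinks; `test -e` follows the link and fails
    # when the target does not exist, so only broken links are printed.
    find "$out_dir" -type l ! -exec test -e {} \; -print
}
```

An empty result from find_broken_links "$OUTPUT_DIR" means every output link still resolves. If work needs to be removed, copying the outputs first with cp -RL (which follows symlinks and copies the files they point to) is one way to keep real copies.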

nextflow <script>.nf --num-cpus $NUM_CPUS \
    --jvarkit-path $JVARKIT_PATH \
    --input-dir $INPUT_DIR \
    --output-dir $OUTPUT_DIR \
    --config-file $CONFIG_FILE \
    --ref-genome-name $GENOME_NAME \
    --ref-genome-index $GENOME_INDEX \
    --ref-genome-fasta $GENOME_FASTA
  • NUM_CPUS - maximum number of CPUs to use for the entire pipeline
  • INPUT_DIR - path of the directory containing R1,R2 data
  • OUTPUT_DIR - path of the directory to write outputs to (will be created if it doesn't already exist). This can be the same as INPUT_DIR.
  • CONFIG_FILE (OPTIONAL) - path to config.yaml (in case one wants custom parameters for pipeline components).
  • GENOME_NAME - name of the reference genome (e.g. danRer10, hg18)
  • GENOME_INDEX - path to bowtie2 indexes for reference genome
  • GENOME_FASTA - path to FASTA sequence of reference genome
  • JVARKIT_PATH - path to installation of jvarkit.

Nextflow will output its data to your directory of choice.
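Since the command above is built entirely from shell variables, an unset variable silently produces an empty argument. The following is a minimal sketch of a pre-flight check, not part of nf-ATAC: require_vars is a hypothetical helper that takes variable names and reports any that are unset or empty.

```shell
# Hypothetical pre-flight check (not part of nf-ATAC): confirm that
# every variable named in the arguments is set and non-empty.
require_vars() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Return 0 when all variables are set, 1 otherwise.
    return "$missing"
}
```

For example, require_vars NUM_CPUS JVARKIT_PATH INPUT_DIR OUTPUT_DIR GENOME_NAME GENOME_INDEX GENOME_FASTA can be run before the nextflow command. CONFIG_FILE is left out here because it is optional.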

Running the pipeline - multiple samples

Data preparation

For each sample, create a folder SAMPLE_ID/ containing the paired-end read data in fastq.gz format. Create a sample table as a text file:

  • Each line corresponds to one sample. Fields are as follows:
[Sample_ID] [path to sample input directory] [path to sample output directory]
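When there are many samples, the table can be generated rather than typed by hand. The following is a minimal sketch, not part of nf-ATAC: make_sample_table is a hypothetical helper that prints one line per sample folder found under a data root. It assumes fields are separated by single spaces, which the format above suggests but does not state, and that paths contain no spaces.

```shell
# Hypothetical helper (not part of nf-ATAC): print one sample-table
# line per sample folder under data_root. Assumes space-separated
# fields and paths without spaces.
make_sample_table() {
    data_root="$1"
    out_root="$2"
    for sample_dir in "$data_root"/*/; do
        # Skip the unexpanded pattern when data_root has no subfolders.
        [ -d "$sample_dir" ] || continue
        # The folder name is used as the sample ID.
        sample_id=$(basename "$sample_dir")
        # Fields: sample ID, input directory, output directory.
        printf '%s %s %s\n' "$sample_id" "${sample_dir%/}" "$out_root/$sample_id"
    done
}
```

For example, make_sample_table /path/to/data /path/to/results > samples.txt writes a table whose output directories mirror the sample IDs.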


The pipeline will read samples from the sample table (.txt file) and attempt to process them in parallel.

nextflow <script>.nf --num-cpus $NUM_CPUS \
    --jvarkit-path $JVARKIT_PATH \
    --config-file $CONFIG_FILE \
    --sample-table $SAMPLE_TABLE \
    --ref-genome-name $GENOME_NAME \
    --ref-genome-index $GENOME_INDEX \
    --ref-genome-fasta $GENOME_FASTA