Main outcome: During the first session you will build a FastQC pipeline to learn the basics of Nextflow, including:
- Parameters
- Processes (inputs, outputs & scripts)
- Channels
- Operators
- Configuration
We have Conda installed, so we will use it to install Nextflow. First initialise Conda for bash & reload the shell:
conda init bash
exec -l bash
Then create & activate a new Conda environment:
conda create -n class4
conda activate class4
And install Nextflow there with:
conda install -c bioconda nextflow
You can then test your installation of Nextflow with:
nextflow run hello
Now that we have Nextflow & Docker installed we're ready to run our first script:
- Create a file main.nf & open this in your favourite code/text editor
- In this file write the following:
// main.nf
params.fastq_list = false
println "My fastq_list: ${params.fastq_list}"
The first line initialises a new variable (params.fastq_list) & sets it to false.
The second line prints the value of this variable on execution of the pipeline.
We can now run this script & set the value of params.fastq_list to the CSV file listing our FASTQ files in the testdata folder with the following command:
nextflow run main.nf --fastq_list fastq_files_list.csv
This should print the value you passed on the command line.
Here we learnt how to define parameters & pass command-line arguments to them in Nextflow.
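A common pattern (a sketch, not code from the course files, assuming the parameter is required) is to use that false default to fail fast with a helpful message when no value has been supplied:
// main.nf (sketch)
params.fastq_list = false

// Stop straight away with a clear error if no value was passed on the command line
if (!params.fastq_list) {
    exit 1, "Please provide a CSV listing your FASTQ files with --fastq_list"
}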
Nextflow allows the execution of any command or user script by using a process definition. A process is defined by providing three main declarations: the process inputs, the process outputs and finally the command script.
In our main script we want to add the following:
- input
- output
- script
Here we created the variable fastq_files_list, which is a file object built from the command-line input.
We can then create the process fastqc (sketched below), including:
- the directive publishDir to specify which folder to copy the output files to
- the inputs, where we declare a file reads from our reads variable
- the output, which is anything ending in _fastqc.zip or _fastqc.html & which will go into a fastqc_results channel
- the script, where we run the fastqc command on our reads variable
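The code for this first version of the process is not reproduced here; a minimal sketch of what it could look like, based on the description above (the glob used to build the input channel is an assumption; the next section replaces it with a channel built from the CSV):
// main.nf (sketch of this intermediate step)
// ch_reads is a placeholder channel of FASTQ files; adjust the glob to your test data
ch_reads = Channel.fromPath("testdata/*.fastq.gz")

process fastqc {
    // copy the output files into the results folder
    publishDir "results", mode: 'copy'

    input:
    file reads from ch_reads

    output:
    file "*_fastqc.{zip,html}" into fastqc_results

    script:
    """
    fastqc $reads
    """
}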
We can then run our script with the following command:
nextflow run main.nf --fastq_list fastq_files_list.csv
Channels are the preferred method of transferring data in Nextflow & can connect two processes or operators.
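As a tiny standalone illustration (not part of the pipeline), a channel can be created & transformed with an operator, with view() printing each item that flows through it:
// Emit three numbers, double each one with the map operator & print the results
Channel
    .from(1, 2, 3)
    .map { it * 2 }
    .view()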
In our main.nf we can add the following:
//main.nf
// Re-usable component to create a channel with the paths of the files by reading the design (CSV) file
Channel
    .fromPath(params.fastq_list)
    .ifEmpty { error "No file with list of fastq files to download found at the location ${params.fastq_list}" }
    .splitCsv(sep: ',', skip: 1)
    .map { accession, fastq1, fastq2 -> [ accession, file(fastq1), file(fastq2) ] }
    .set { ch_fastq_files }
// Process that runs FastQC on each sample's pair of FASTQ files
process fastqc {
    tag "${accession}"
    echo true
    publishDir "results", mode: 'copy'

    input:
    set val(accession), file(fastq_1), file(fastq_2) from ch_fastq_files

    output:
    file "*_fastqc.{zip,html}" into ch_fastqc_results

    script:
    """
    fastqc $fastq_1 $fastq_2
    """
}
The params.fastq_list file is now read into a channel (ch_fastq_files) which contains the accession ID & the paired-end FASTQ files. Therefore, the input declaration has also changed to reflect this by declaring the value accession. This accession can be used as a tag for when the pipeline is run. Also, as we are now declaring several inputs grouped together in one tuple, the set keyword has to be used.
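For reference, fastq_files_list.csv is expected to contain a header row (skipped by skip: 1) followed by one sample per line with three comma-separated columns: accession, forward read & reverse read. The accession IDs & paths below are made-up placeholders, not the actual test data:
accession,fastq1,fastq2
SAMPLE1,testdata/SAMPLE1_R1.fastq.gz,testdata/SAMPLE1_R2.fastq.gz
SAMPLE2,testdata/SAMPLE2_R1.fastq.gz,testdata/SAMPLE2_R2.fastq.gz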
To run the pipeline:
nextflow run main.nf --fastq_list fastq_files_list.csv
Configuration, such as parameters, containers & resources (e.g. memory), can be set in config files such as nextflow.config. For example, our nextflow.config file might look like this:
// nextflow.config contents
docker.enabled = true
params.fastq_list = false

process {
    cpus = 2
    memory = 2.GB

    withName: fastqc {
        container = "lifebitai/fastqc"
    }
}
Here we have enabled Docker by default, initialised parameters, & set resources & containers. It is best practice to keep these in the config file so that they can more easily be changed or removed. The container & the params.fastq_list default can then be removed from main.nf.
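As a small example of how these settings feed back into the pipeline, the cpus value from the config is available inside a process as task.cpus, so the FastQC command could (an optional tweak, not shown in the course code) use it to set its thread count:
// Inside the fastqc process (optional tweak; FastQC's --threads flag controls parallelism)
script:
"""
fastqc --threads ${task.cpus} $fastq_1 $fastq_2
"""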
The pipeline can now be run with the following:
nextflow run main.nf --fastq_list fastq_files_list.csv