nf-mixcr is a Nextflow pipeline that runs MiXCR to build T-cell repertoires from Illumina sequencing data.
Nextflow makes your life easier by managing the input files, output files and jobs for you, without you having to install any program apart from Nextflow itself and a container runner (singularity or docker).
The pipeline runs the mixcr analyze program on each read pair listed in a samplesheet file and generates the QC and clone tables automatically.
flowchart TD
A(Samplesheet) --> B[Samplesheet Check]
B[Samplesheet Check] -->|on each sample| C[mixcr analyze]
C -->|on each sample| D[mixcr exportclones]
C -->|on all samples| E[mixcr exportQC align]
C -->|on all samples| F[mixcr exportQC chainusage]
C -->|on each sample| G[mixcr exportQC coverage]
C -->|on each sample| H[mixcr export report]
Full list of run programs:
- mixcr analyze
- mixcr exportclones
- mixcr exportQC align
- mixcr exportQC chainusage
- mixcr exportQC coverage
- mixcr export report
NB: I assume you have a minimal knowledge of the terminal and bash, and that you'll be able to run the following lines.
nf-mixcr does not require many dependencies to run.
If you plan to run it on a cluster (like Eddie), chances are you won't need to install anything.
The only dependencies are:
- Nextflow
- Docker or Singularity
- MiXCR (for activation only!)
My advice for installation is to use the conda (Miniforge) package manager.
conda create -n nf-mixcr_env
conda activate nf-mixcr_env
conda install -c milaboratories nextflow singularity mixcr
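Once the environment is set up, it can help to confirm that the tools are actually reachable from your shell. The snippet below is a minimal sketch, not part of the pipeline: the function name check_tools is made up for illustration, and it only checks that each command is on your PATH, not that it runs correctly.

```shell
# Report which of the given commands are available on PATH.
# Prints one line per tool and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (swap singularity for docker if that is your container runner):
#   check_tools nextflow singularity mixcr
```

On a cluster where the container runner is provided by the system, a "missing" line for it inside the conda environment is not necessarily a problem.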
Before going further, you will need a licence to use MiXCR.
If you don't have one, please visit this page and fill in the form.
If you are an academic, lucky you, it's free! If you're not, please check the commercial licensing page.
Once you have received your licence, please run the command mixcr activate-license and paste your licence key.
NOPE!
But first, let's check that the pipeline runs correctly. The test profile can be used to run the pipeline with toy datasets that are automatically downloaded from the repository.
You can start the test by running:
nextflow run sguizard/nf-mixcr -profile singularity,test,<Institution>
or if you use docker in place of singularity:
nextflow run sguizard/nf-mixcr -profile docker,test,<Institution>
The <Institution> placeholder must be replaced by your cluster profile. The list of available configs can be found on the nf-core website.
NB: the singularity or docker profile may be omitted if it is already defined in your institution profile.
To keep files sorted between inputs, outputs and working directories, I start by creating a directory for the analysis (TCR_project) and, inside it, a data directory where I store the reads and other input files:
TCR_project/
`-- data
    |-- imgt.202312-3.sv8.json.gz
    |-- mixcr_analyze.config
    |-- read_1.fastq.gz
    |-- read_2.fastq.gz
    `-- samplesheet.csv
A samplesheet must be provided. This file is a three-column comma-separated values table. The columns are id, read1 and read2. Each line gives the location of the fastq files associated with a unique ID.
id,read1,read2
SAMP1,./data/read_1.fastq.gz,./data/read_2.fastq.gz
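Checking the samplesheet before launching can save a failed run. The function below is a rough sketch of such a check, written independently of the pipeline's own samplesheet check: it verifies the header, the number of fields, the uniqueness of the IDs and the existence of the read files. The name validate_samplesheet is hypothetical, and the sketch assumes Unix line endings and paths relative to the directory you run it from.

```shell
# Validate a samplesheet: the header must be id,read1,read2 and each row
# needs three non-empty fields, a unique id and read files that exist.
# Prints one message per problem; returns non-zero if any was found.
validate_samplesheet() {
    sheet=$1
    errors=0
    line_no=0
    seen_ids=","
    # "|| [ -n "$id" ]" keeps a last line that has no trailing newline.
    while IFS=, read -r id read1 read2 extra || [ -n "$id" ]; do
        line_no=$((line_no + 1))
        if [ "$line_no" -eq 1 ]; then
            if [ "$id,$read1,$read2" != "id,read1,read2" ] || [ -n "$extra" ]; then
                echo "line 1: header must be id,read1,read2"
                errors=$((errors + 1))
            fi
            continue
        fi
        if [ -z "$id" ] || [ -z "$read1" ] || [ -z "$read2" ] || [ -n "$extra" ]; then
            echo "line $line_no: expected exactly 3 non-empty fields"
            errors=$((errors + 1))
            continue
        fi
        case "$seen_ids" in
            *",$id,"*)
                echo "line $line_no: duplicate id $id"
                errors=$((errors + 1))
                ;;
        esac
        seen_ids="$seen_ids$id,"
        for f in "$read1" "$read2"; do
            if [ ! -f "$f" ]; then
                echo "line $line_no: file not found: $f"
                errors=$((errors + 1))
            fi
        done
    done < "$sheet"
    [ "$errors" -eq 0 ]
}

# Example, run from the project directory:
#   validate_samplesheet data/samplesheet.csv && echo "samplesheet OK"
```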
If the species studied is not human (hsa) or mouse (mmu), you'll need to provide a library of reference V, D, J and C genes. IMGT provides libraries for a large panel of species, which can be used with MiXCR. The data can be downloaded here. Please don't decompress the file, and keep the '.json.gz' extension.
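A quick sanity check on the downloaded library can catch a renamed or truncated file early. The following is only a sketch with a made-up function name (check_library): it tests the extension and the gzip integrity, and does not look at the JSON content at all.

```shell
# Check that a reference library keeps its .json.gz extension
# and is a readable gzip archive.
check_library() {
    lib=$1
    case "$lib" in
        *.json.gz) ;;
        *)
            echo "library must keep the .json.gz extension: $lib"
            return 1
            ;;
    esac
    # gzip -t tests the archive integrity without extracting it.
    if ! gzip -t "$lib" 2>/dev/null; then
        echo "library is not a valid gzip file: $lib"
        return 1
    fi
    echo "library looks fine: $lib"
}

# Example:
#   check_library data/imgt.202312-3.sv8.json.gz
```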
MiXCR gathers multiple tools, each of them highly configurable. Implementing all MiXCR options in the pipeline would be very time-consuming. As a trade-off, I decided to use a configuration file to set the mixcr analyze parameters. You can find a template configuration file here; modify it to fit your needs. You can also run the pipeline with the --get_ma_conf option to get a copy.
Each line between the central square brackets is a mixcr analyze option. If needed, you can add options by inserting a new line at the end of the list, writing your option between single quotes and ending the line with a comma.
process {
    withName: MIXCR_ANALYZE {
        cpus = 8
        ext.args = {
            [
                '--species cat',
                '--rna',
                '--tag-pattern "^N{4:6}GCTCACCTTTTTCAGGTCCTC(R1:*)\\^N{4:6}GCAGTGGTATCAACGCAGAGT(UMI:TN{4}TN{4}TN{4}TCTTGGGG)(R2:*)"',
                '--rigid-left-alignment-boundary',
                '--floating-right-alignment-boundary J',
                '--ADDITIONAL-OPTION and_its_value',
            ].join(' ').trim()
        }
    }
}
A typical command line to run the pipeline looks like this:
nextflow run sguizard/nf-mixcr \
    -profile <Institution> \
    -c data/mixcr_analyze.config \
    --samplesheet data/samplesheet.csv \
    --preset generic-amplicon-with-umi \
    --study My_project
You will set two kinds of options:
- Nextflow options, with a single dash (e.g. -profile)
- Pipeline options, with a double dash (e.g. --samplesheet)
The Nextflow options that need to be used are:
- -profile: selects the appropriate container technology (docker or singularity) and the profile of your cluster (e.g. eddie). Profiles are separated by commas (e.g. docker,eddie).
- -c: defines additional configuration. Please add the mandatory mixcr_analyze.config file here.
The pipeline options are:
- --samplesheet: the path to the samplesheet listing the samples, as described above
- --preset: the mixcr analyze preset to use (e.g. generic-amplicon-with-umi)
- --library: the V, D, J, C reference genes library
- --study: an ID that will be used as a prefix for global report files (Default: TCR)
- --outdir: the name of the directory where the results will be published (Default: results)
- --get_ma_conf: download a copy of the template mixcr_analysis.config and stop
- --get_sing_fix: download a copy of fix_singularity-mount_home.config and stop
Some options must be defined for each run and can't be omitted. The compulsory options are:
- -profile
- -c (mixcr_analyze.config)
- --samplesheet
- --preset
The results of the pipeline are stored in the directory defined by the --outdir option. For each process/program, one directory is created to store the results. An additional directory, pipeline_info, gathers reports about the pipeline execution.
<outdir name>/
|-- 01_mixcr_analysis
|-- 02_mixcr_exportClones
|-- 03_mixcr_exportQc_align
|-- 03_mixcr_exportQc_chainusage
|-- 03_mixcr_exportQc_coverage
|-- 04_mixcr_exportReports
`-- pipeline_info
01_mixcr_analysis
|-- SAMP1.align.report.json
|-- SAMP1.align.report.txt
|-- SAMP1.assemble.report.json
|-- SAMP1.assemble.report.txt
|-- SAMP1.clns
|-- SAMP1.clones_TRB.tsv
|-- SAMP1.log
|-- SAMP1_non_refined.vdjca
|-- SAMP1.qc.json
|-- SAMP1.qc.txt
|-- SAMP1.refined.vdjca
|-- SAMP1.refine.report.json
`-- SAMP1.refine.report.txt
This directory gathers the results of the programs launched by MiXCR. With the preset generic-amplicon-with-umi, mixcr analyze align, mixcr analyze refineTagsAndSort, mixcr analyze assemble and mixcr analyze export are run.
02_mixcr_exportClones
`-- SAMP1_exportClones_<TRB/IGL>.tsv
mixcr exportClones generates a tab-separated values file listing the detected clones.
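For a quick overview of the clone tables, you can count the rows of each file. The sketch below assumes that every .tsv file in the directory has a single header line followed by one clone per row; count_clones is a name chosen for illustration only.

```shell
# Print "<file name><TAB><number of clones>" for each TSV in a directory,
# assuming one header line followed by one clone per row.
count_clones() {
    dir=$1
    for tsv in "$dir"/*.tsv; do
        # Skip the unexpanded pattern when the directory has no TSV file.
        [ -f "$tsv" ] || continue
        # Subtract the header line from the total line count.
        n=$(($(wc -l < "$tsv") - 1))
        printf '%s\t%s\n' "$(basename "$tsv")" "$n"
    done
}

# Example, with the default output directory:
#   count_clones results/02_mixcr_exportClones
```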
03_mixcr_exportQc_align
|-- TCR_exportQC_align.pdf
`-- TCR_exportQC_align.png
mixcr exportQc align uses the results of each analyzed sample to generate an alignment report.
It describes the status of the reads (correctly/incorrectly aligned).
03_mixcr_exportQc_chainusage
|-- TCR_exportQC_chainUsage.pdf
`-- TCR_exportQC_chainUsage.png
Exports a chain usage summary for each sample.
03_mixcr_exportQc_coverage
|-- SAMP1_exportQC_coverage.pdf
|-- SAMP1_exportQC_coverage_R0.png
|-- SAMP1_exportQC_coverage_R1.png
`-- SAMP1_exportQC_coverage_R2.png
Exports anchor points coverage by the library. It separately plots coverage for R1, R2 and overlapping reads.
04_mixcr_exportReports
|-- SAMP1.report.json
`-- SAMP1.report.txt
These files contain the report of each tool launched by mixcr analyze.
pipeline_info
|-- <timestamp>_execution_report.html
|-- <timestamp>_execution_timeline.html
`-- <timestamp>_execution_trace.txt
These are the reports generated by Nextflow about the pipeline run.
The execution report contains information about the jobs, their running time, the resources used and the commands run, alongside the pipeline version used.
The execution timeline displays the running time of the jobs and the order in which they were launched.
The execution trace report gathers the raw data about job execution (including each job's running directory inside the work directory).
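The trace file is plain text, so it can be filtered from the command line. The sketch below lists the tasks that did not finish normally; it assumes the trace is tab-separated with a header line containing the name and status columns (Nextflow's default layout, as far as I know), and failed_tasks is a hypothetical helper name.

```shell
# List tasks from a Nextflow trace file whose status is neither
# COMPLETED nor CACHED, as "<task name><TAB><status>".
failed_tasks() {
    awk -F '\t' '
        NR == 1 {
            # Map column names to positions so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("name" in col) || !("status" in col)) {
                print "trace file has no name/status columns" > "/dev/stderr"
                exit 2
            }
            n = col["name"]; s = col["status"]
            next
        }
        $s != "COMPLETED" && $s != "CACHED" { print $n "\t" $s }
    ' "$1"
}

# Example (replace <timestamp> with the one of your run):
#   failed_tasks results/pipeline_info/<timestamp>_execution_trace.txt
```

An empty output means every task listed in the trace completed or was restored from the cache.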
Dear Roslin Eddie users,
Using Nextflow is not as straightforward as it should be.
Most of the time, it's necessary to add a custom configuration file to fix some issues.
That's why profiles exist. The Roslin bioinformatics group has created the Roslin profile to make Nextflow execution as smooth as possible.
To use it, please specify the -profile roslin option in your command line.
To be sure that MiXCR can correctly access your licence, your home directory must be mounted in the container.
This can be fixed by adding a configuration file.
No worries, there is no need to write anything.
Run the pipeline with the --get_sing_fix option and it will download a configuration file that fixes this issue.
Then, in the command line, you can add the -c fix_singularity-mount_home.config option.
nextflow run sguizard/nf-mixcr \
    -profile roslin \
    -c data/mixcr_analyze.config \
    -c data/fix_singularity-mount_home.config \
    --samplesheet data/samplesheet.csv \
    --preset generic-amplicon-with-umi \
    --library data/imgt.202312-3.sv8.json.gz \
    --study TCR_cat_project
Contributions are welcome! Just try to follow the code formatting as best you can.
Please cite my work if you use it in your own research, thanks!
Sébastien Guizard. (2024). sguizard/nf-mixcr: nf-mixcr v1.0.1 (v1.0.1). Zenodo. https://doi.org/10.5281/zenodo.10678867
This pipeline is heavily inspired by the nf-core templates and even borrows a few parts from them, notably the institution configs.
Please also check the nf-core website! It gathers great, easy-to-use pipelines and is maintained by wonderful people!