Tosca is presented and described further in our preprint:
Anob M. Chakrabarti, Ira A. Iosub, Flora C. Y. Lee, Jernej Ule, Nicholas M. Luscombe. bioRxiv (2022).
- Introduction
- Pipeline summary
- Quick start (testing)
- Quick start (running)
- Pipeline parameters
- Pipeline outputs
Tosca is a Nextflow pipeline for the analysis of hiCLIP or proximity ligation (e.g. PARIS, SPLASH, COMRADES) sequencing data. It is containerised using Docker to ensure ease of installation. It is optimised for use on high-performance computing (HPC) clusters, but can also run locally depending on the size of the data set.
- Adapter and quality trimming (Cutadapt)
- Premapping to remove spliced reads (STAR)
- Hybrid identification (pblat and toscatools)
- UMI-based deduplication (toscatools and modified UMI-tools)
- Hybrid clustering (toscatools)
- Annotation (toscatools)
- Duplex and structure analysis and binding energy characterisation (toscatools)
- Visualisation (toscatools)
  - BAM
  - BED
  - Arc plots
  - Contact matrices
- QC (MultiQC)
- Ensure Nextflow and Docker or Singularity are installed on your system
- Pull the main version of the pipeline from the GitHub repository:
nextflow pull amchakra/tosca -r main
- Run the provided test dataset:
nextflow run amchakra/tosca -r main -profile test,docker
or
nextflow run amchakra/tosca -r main -profile test,singularity
- Review the results
- Ensure Nextflow and Docker or Singularity are installed on your system
- Pull the main version of the pipeline from the GitHub repository:
nextflow pull amchakra/tosca -r main
- Download and unpack pre-generated reference files. We have generated these for human and mouse (they are ~25GB each).
wget -q reference.tar.gz
tar -xzvf reference.tar.gz
- Prepare a samplesheet.csv with your sample names and paths to your FASTQ files, following the template:
sample,fastq
sample1,/path/to/file1.fastq.gz
sample2,/path/to/file2.fastq.gz
sample3,/path/to/file3.fastq.gz
- Run the pipeline (the minimum parameters have been specified):
nextflow run amchakra/tosca -r main \
-profile singularity \
--input samplesheet.csv \
--genomesdir /path/to/reference \
--org human
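For many samples, the sample sheet can be generated rather than typed by hand. The following is a minimal sketch, assuming one FASTQ file per sample named `<sample>.fastq.gz`; `make_samplesheet` is a hypothetical helper and not part of Tosca.

```shell
# Sketch: build samplesheet.csv from a directory of FASTQ files.
# Assumption: each sample is a single file named <sample>.fastq.gz.
# make_samplesheet is a hypothetical helper, not part of Tosca.
make_samplesheet() (
    fastq_dir="$1"   # directory holding the FASTQ files
    out_csv="$2"     # sample sheet to write

    # Resolve to an absolute path so the sheet works from any directory.
    abs_dir="$(cd "$fastq_dir" && pwd)" || exit 1

    # Header row, as in the template.
    printf 'sample,fastq\n' > "$out_csv"

    for fastq in "$abs_dir"/*.fastq.gz; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$fastq" ] || continue
        sample="$(basename "$fastq" .fastq.gz)"
        printf '%s,%s\n' "$sample" "$fastq" >> "$out_csv"
    done
)
```

Calling `make_samplesheet /path/to/fastqs samplesheet.csv` writes one row per matching file, with the sample name taken from the file name.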
- -profile can be used to specify test, docker, singularity and crick depending on the system being used and resources available. Others can be found at nf-core.
- --input specifies the input sample sheet
- --outdir specifies the output results directory
  - default: ./results
- --tracedir specifies the pipeline run trace directory
  - default: ./results/pipeline_info
Either --genomesdir and --org, or all of the other reference files, need to be specified.
- --genomesdir specifies the genome reference directory
- --org specifies the organism (options are currently: human, mouse)
- --genome_fai specifies the genome FASTA index
- --star_genome specifies the genome STAR index
- --regions_gtf specifies the genome gene/region/biotype annotation GTF (generated by iCount-Mini)
- --transcript_fa specifies the pseudo-transcriptome FASTA
- --transcript_fai specifies the pseudo-transcriptome FASTA index
- --transcript_gtf specifies the pseudo-transcriptome annotation GTF
- --adapter specifies the adapter sequence for Cutadapt
  - default: AGATCGGAAGAGC
- --min_quality specifies the minimum quality score for Cutadapt
  - default: 10
- --min_readlength specifies the minimum read length after trimming for Cutadapt
  - default: 16
- --split_size specifies the number of reads per FASTQ file when splitting for parallelised alignment
  - default: 100000
- --star_args specifies optional additional STAR alignment parameters
- --step_size specifies the pblat step size
  - default: 5
- --tile_size specifies the pblat tile size
  - default: 11
- --min_score specifies the pblat minimum score
  - default: 15
- --evalue specifies the pblat e-value threshold
  - default: 0.001
- --maxhits specifies the maximum number of pblat alignments per read
  - default: 100
- --dedup_method specifies the UMI deduplication method (options are: none, unique, percentile, cluster, adjacency, directional)
  - default: directional
- --umi_separator specifies the UMI separator in the FASTQ read name
  - default: _
- --chunk_number specifies the number of chunks into which to split the hybrid files for parallelised processing
  - default: 100
- --percent_overlap specifies the minimum percentage that one of the two hybrid arms needs to overlap to be counted as overlapping
  - default: 0.75
- --sample_size specifies the number of hybrid reads per gene to subsample prior to clustering
  - default: -1 (i.e. no subsampling)
- --analyse_structure specifies whether to analyse the duplex structure for each hybrid read
  - default: false
- --shuffled_mfe specifies whether to generate a control shuffled mean minimum free energy for each hybrid read
  - default: false
- --clusters_only specifies whether to analyse the structure only for hybrid reads that are in a cluster
  - default: true
- --atlas specifies whether to generate an atlas of duplexes by combining hybrids from all the samples
  - default: true
- --goi specifies a plain text file with one gene of interest per line to be visualised
- --bin_size specifies the size of each bin when generating the contact map matrices
  - default: 100
- --breaks specifies the breaks for grouping the arcs by colour
  - default: 0,0.3,0.8,1
- --skip_premap skips premapping to the genome and filtering of spliced reads
- --skip_atlas skips generation of an atlas by combining all the samples
- --skip_qc skips generation of QC plots and the MultiQC report
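The reference rule above (either --genomesdir and --org, or all six individual reference files) can be checked before a run is launched. The following is a minimal sketch: `check_reference_args` and `print_tosca_command` are hypothetical helpers that are not part of Tosca, and the command is only printed for review, never executed. It assumes each flag appears at most once and that boolean parameters are passed in the usual Nextflow way (e.g. `--analyse_structure true`).

```shell
# Sketch only: check_reference_args and print_tosca_command are hypothetical
# helpers, not part of Tosca. Nothing here launches Nextflow.

# Succeeds if the arguments contain --genomesdir and --org, or all six
# individual reference flags (each flag is assumed to appear at most once).
check_reference_args() (
    has_dir=0
    has_org=0
    explicit=0
    for arg in "$@"; do
        case "$arg" in
            --genomesdir) has_dir=1 ;;
            --org) has_org=1 ;;
            --genome_fai|--star_genome|--regions_gtf|--transcript_fa|--transcript_fai|--transcript_gtf)
                explicit=$((explicit + 1)) ;;
        esac
    done
    if [ "$has_dir" -eq 1 ] && [ "$has_org" -eq 1 ]; then
        exit 0
    fi
    if [ "$explicit" -eq 6 ]; then
        exit 0
    fi
    echo "error: give --genomesdir and --org, or all six reference files" >&2
    exit 1
)

# Prints the full command for review instead of running it.
print_tosca_command() (
    check_reference_args "$@" || exit 1
    set -- nextflow run amchakra/tosca -r main -profile singularity "$@"
    printf '%s\n' "$*"
)
```

For example, `print_tosca_command --input samplesheet.csv --genomesdir /path/to/reference --org human --analyse_structure true --goi goi.txt` prints the assembled command, while omitting `--org` returns a non-zero status with an error message.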
Tosca outputs results in a number of subfolders:
.
├── mapped
├── hybrids
├── clusters
├── igv
├── maps
├── nonhybrids
└── pipeline_info
- mapped contains all the partial read alignments used for calculating valid hybrids:
  - *.blast8.gz
- hybrids contains files that have the identified hybrids as TSV files:
  - *.hybrids.tsv.gz contains all the hybrids
  - *.hybrids.dedup.tsv.gz contains the deduplicated hybrids
  - *.hybrids.clustered.tsv.gz contains the deduplicated hybrids with clusters calculated that identify the unique duplexes/RNA structure they represent
  - *.hybrids.gc.tsv.gz contains the deduplicated hybrids with genomic coordinates calculated
  - *.hybrids.gc.annotated.tsv.gz contains the deduplicated hybrids with genomic coordinates, gene, region and biotypes calculated
- clusters contains files that have the identified clusters as TSV files:
  - *.clusters.tsv.gz contains all the collapsed clusters
  - *.clusters.gc.tsv.gz contains the collapsed clusters with genomic coordinates calculated
  - *.clusters.gc.annotated.tsv.gz contains the collapsed clusters with genomic coordinates, gene, region and biotypes calculated
- igv contains files that can be used to visualise the results in IGV:
  - *.bam contains all the hybrids in BAM format. Optional flags can be used to colour/group by experiment, hybrid cluster, read orientation, and hybridisation energy
  - *.bed contains the clusters (i.e. unique duplexes) in BED format
  - *.bp contains arc representations of the clusters coloured by number
- maps contains contact map files (if genes of interest have been specified):
  - *.mat.rds is an R matrix with the raw contact map matrix
  - *.{bin_size}_binned.map.tsv.gz is the matrix in long format binned using {bin_size}
- nonhybrids contains those sequencing reads that did not contain a hybrid:
  - *.nonhybrid.fastq.gz
- pipeline_info contains the execution reports, traces and timelines generated by Nextflow:
  - execution_report.html
  - execution_timeline.html
  - execution_trace.txt
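A quick check after a run is to compare the number of hybrids before and after deduplication. The following is a minimal sketch built on the file suffixes listed above. It assumes that each TSV starts with a single header line and that the part of the file name before `.hybrids` is the sample name; neither is confirmed here, and `count_rows` and `summarise_hybrids` are hypothetical helpers that are not part of Tosca.

```shell
# Sketch: tabulate hybrid counts before and after deduplication.
# Assumptions: each TSV has exactly one header line, and files are named
# <sample>.hybrids.tsv.gz and <sample>.hybrids.dedup.tsv.gz.

# Number of data rows in a gzipped TSV, excluding the assumed header line.
count_rows() {
    gzip -dc "$1" | tail -n +2 | wc -l | tr -d ' '
}

# Prints a tab-separated table: sample, all hybrids, deduplicated hybrids.
summarise_hybrids() (
    hybrids_dir="$1"
    printf 'sample\tall\tdedup\n'
    for all_file in "$hybrids_dir"/*.hybrids.tsv.gz; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$all_file" ] || continue
        sample="$(basename "$all_file" .hybrids.tsv.gz)"
        dedup_file="$hybrids_dir/$sample.hybrids.dedup.tsv.gz"
        # Skip samples without a deduplicated file.
        [ -e "$dedup_file" ] || continue
        printf '%s\t%s\t%s\n' "$sample" \
            "$(count_rows "$all_file")" "$(count_rows "$dedup_file")"
    done
)
```

Running `summarise_hybrids ./results/hybrids` prints one row per sample that has both files, which makes a large drop after deduplication easy to spot.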