Contents
This pipeline assumes an existing conda installation and is written as a Snakemake workflow. To install Snakemake with conda, run
conda env create -f envs/snakemake.yml
conda activate snakemake
to create and activate a conda environment named snakemake
. Once all the input files are ready, run the pipeline on a SLURM server environment with
./run_pipeline.sh
Terms
- split-pool tag: the individaul sequences that are added during a single round of split-pool barcoding (DPM, EVEN, ODD, TERM)
- split-pool barcode: the permutation of split-pool tags that uniquely identifes a cluster
- Antibody-ID: a 9 nt sequence within the antibody oligo that uniquely identifies a type of antibody
The pipeline relies on scripts written in Java, Bash, Python. This pipeline has been validated using Java version 8.0.322 (openjdk version "1.8.0_322"
) and Bash version 4.2.46. Versions of Python are specified in conda environments described in envs/
, along with other third-party programs and packages that this pipeline depends on.
Workflow
- Define samples and paths to FASTQ files (
fastq2json.py
or manually generatesamples.json
) - Split FASTQ files into chunks for parallel processing
- Adaptor trimming (Trim Galore!)
- Barcode identification
- Split DPM (DNA) and BPM (antibody oligo) reads into separate files
- DPM read workflow:
- DPM Trimming (cutadapt)
- Alignment (bowtie2)
- Chromosome relabeling (add "chr") and filtering (removing non-canonical chromosomes)
- Masking (based on ENCODE blacklists)
- BPM read workflow:
- BPM Trimming (cutadapt)
- FASTQ to BAM conversion
- Cluster generation
- Cluster assignment and antibody specific BAM file creation
- Summary plots
- DPM and BPM cluster size distributions
- Maximum representation oligo ECDFs
- Summary statistics
- MultiQC (trimming, alignments)
- Ligation efficiency
- Cluster statistics
- Read assignment statistics
Notes
- How are antibody oligo PCR duplicates removed? In step 6.ii above (
fastq_to_bam.py
), oligo UMIs are converted to integers and used as the reference position (column 4) in the BAM file. When clusters are generated in step 7 (during both generation of the split clusters and merged clusters), reads mapped to the same position are deduplicated. - How are DNA sequence PCR duplicates removed? We assume that DNA molecules from a batch of cells stochastically fragmented in the crosslinking ChIP assay are unlikely to have the exact same start and end positions. Therefore, we consider identical DNA fragments (same start and end positions) in the same cluster to be PCR duplicates. As for the antibody oligo PCR duplicates, when clusters are generated in step 7 (during both generation of the split clusters and merged clusters), reads mapped to the same position are deduplicated.
See also visualizations of the pipeline generated by Snakemake (commands below assume that Graphviz is installed):
- ./dag.pdf:
snakemake --dag | dot -Tpdf > dag.pdf
- ./filegraph.pdf:
snakemake --filegraph | dot -Tpdf > filegraph.pdf
- ./rulegraph.pdf:
snakemake --rulegraph | dot -Tpdf > rulegraph.pdf
Directory structure (input = user-configured files, output = files generated by the pipeline, pipeline = scripts and files provided with the pipeline by this repository)
- (pipeline)
assets/
- (pipeline)
envs/
- (pipeline)
scripts/
- (output)
workup/
alignments/
alignments_parts/
clusters/
clusters_parts/
fastqs/
logs/
qc/
splitbams/
splitfq/
trimmed/
ligation_efficiency.txt
- (input)
cluster.yaml
- (input)
config.yaml
- (input)
config.txt
/example_config.txt
- (input)
format.txt
/example_format.txt
- (pipeline)
fastq2json.py
- (pipeline)
run_pipeline.sh
- (input)
samples.json
/example_samples.json
- (pipeline)
Snakefile
All paths are relative to the project directory.
config.yaml
: YAML file containing the processing settings and paths of required input files.
-
samples.json
: JSON file with the location of FASTQ files (read1, read2) to process.-
config.yaml
key to specify the path to this file:samples
-
This can be prepared using
fastq2json.py --fastq_dir <path_to_directory_of_FASTQs>
or manually formatted as follows:{ "sample1": { "R1": ["<path_to_data>/sample1_R1.fastq.gz"], "R2": ["<path_to_data>/sample1_R2.fastq.gz"] }, "sample2": { "R1": ["<path_to_data>/sample2_R1.fastq.gz"], "R2": ["<path_to_data>/sample2_R2.fastq.gz"] }, ... }
-
The pipeline (in particular, the script
scripts/bash/split_fastq.sh
) currently only supports one read 1 (R1) and one read 2 (R2) FASTQ file per sample.- If there are multiple FASTQ files per read orientation per sample (for example, if the same sample was sequenced multiple times, or it was split across multiple lanes during sequencing), the FASTQ files will first need to be concatenated together, and the paths to the concatenated FASTQ files should be supplied in the JSON file.
-
Each sample is processed independently, generating independent cluster and BAM files. Statistics used for quality assessment (ligation efficiency, cluster statistics, MultiQC report, cluster size distributions, splitbam statistics) are computed independently for each sample but reported together in aggregate files to enable quick quality comparison across samples.
-
-
assets/bpm.fasta
: FASTA file containing the sequences of antibody oligo barcodesconfig.yaml
key to specify the path to this file:cutadapt_oligos
- Used by:
cutadapt
(Snakefilerule cutadapt_oligo
) - Each sequence should be preceeded by
^
to anchor the sequence during cutadapt trimming (see Snakefilerule cutadapt_oligo
).
-
assets/dpm96.fasta
: FASTA file containing the sequences of DPM tagsconfig.yaml
key to specify the path to this file:cutadapt_dpm
- Used by:
cutadapt
(Snakefilerule cutadapt_dpm
) - Each of these sequences are 10 nt long, consisting of a unique 9 nt DPM_Bottom sequences as originally designed for SPRITE (technically, only the first 8 nt are unique, and the 9th sequence is always a
T
), plus aT
that is complementary to a 3'A
added to a chromatin DNA sequence via dA-tailing.
-
config.txt
: Text file containing the sequences of split-pool tags and the split-pool barcoding setup.config.yaml
key to specify the path to this file:bID
(for "barcode ID")- Used by:
scripts/java/BarcodeIdentification_v1.2.0.jar
(Snakefilerule barcode_id
) andscripts/python/fastq_to_bam.py
(Snakefilerule fastq_to_bam
) - Format: SPRITE configuration file (see our SPRITE GitHub Wiki or Nature Protocols paper for details).
-
Blank lines and lines starting with
#
are ignored. -
An example barcoding configuration file is annotated below:
# Barcoding layout for read 1 and read 2 # - Y represents a terminal tag # - ODD, EVEN, and DPM indicate their respective tags # - SPACER accounts for the 7-nt sticky ends that allow ligation between tags READ1 = DPM READ2 = Y|SPACER|ODD|SPACER|EVEN|SPACER|ODD|SPACER|EVEN|SPACER|ODD # DPM tag sequences formatted as tab-delimited lines # 1. Tag category: DPM # 2. Tag name: must contain "DPM", such as "DPM<xxx>" # 3. Tag sequence (see assets/dpm96.fasta) # 4. Tag error tolerance: acceptable Hamming distance between # expected tag sequence (column 3) and tag sequence in the read DPM DPM6B1 TCGAGTCT 0 DPM DPM6B10 TGATGCAT 0 ... # Antibody oligo barcode sequences formatted as tab-delimited lines # - Identical format as for DPM tag sequences, except that the tag name (column 2) # must contain "BEAD", such as "BEAD_<name of antibody>" DPM BEAD_AB1 GGAACAGTT 0 DPM BEAD_AB2 CGCCGAATT 0 ... # Split-pool tag sequences: same 4-column tab-delimited format as the # "DPM and antibody oligo barcode sequences" section above, except that # Tag category (column 1) is now ODD, EVEN, or Y EVEN A1 ATACTGCGGCTGACG 2 EVEN A2 GTAGGTTCTGGAATC 2 ... ODD A1 TTCGTGGAATCTAGC 2 ODD A2 CCTGTGCGTTAGAGT 2 ... Y A1 TATTATGGT 0 Y A2 GAGATGGAT 0 ...
-
-
format.txt
: Tab-delimited text file indicating which split-pool barcode tags are valid in which round of split-pool barcoding (i.e., at which positions in the barcoding string).config.yaml
key to specify the path to this file:format
- Used by:
scripts/python/split_dpm_bpm_fq.py
(Snakefilerule split_bpm_dpm
) - Column 1 indicates the zero-indexed position of the barcode string where a tag can be found.
- Term barcode tags (Y) are position
0
; the second to last round of barcoding tags are position1
; etc. A value of-1
in the position column indicates that the barcode tag was not used in the experiment.
- Term barcode tags (Y) are position
- Column 2 indicates the name of the tag.
- Column 3 is the tag sequence.
- Column 4 is the edit acceptable edit distance for this sequence.
-
assets/blacklist_hg38.bed
,assets/blacklist_mm10.bed
: blacklisted genomic regions for ChIP-seq data-
For human genome release hg38, we use ENCFF356LFX from ENCODE. For mouse genome release mm10, we use mm10-blacklist.v2.bed.gz.
-
Reference paper: Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9(1):9354. doi:10.1038/s41598-019-45839-z
-
Example code used to download them into the
assets/
directory:wget -O - https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz | zcat | sort -V -k1,3 > "assets/blacklist_hg38.bed" wget -O - https://github.com/Boyle-Lab/Blacklist/raw/master/lists/mm10-blacklist.v2.bed.gz | zcat | sort -V -k1,3 > "assets/blacklist_mm10.bed"
-
-
assets/index_mm10/*.bt2
,assets/index_hg38/*.bt2
: Bowtie 2 genome index-
config.yaml
key to specify the path to the index:bowtie2_index: {'mm10': <mm10_index_prefix>, 'hg38': <hg38_index_prefix>}
-
If you do not have an existing Bowtie 2 index, you can download pre-built indices from the Bowtie 2 developers:
# for human primary assembly hg38 mkdir -p assets/index_hg38 wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip unzip -j -d assets/index_hg38 GRCh38_noalt_as.zip \*.bt2 # for mouse primary assembly mm10 mkdir -p assets/index_mm10 wget https://genome-idx.s3.amazonaws.com/bt/mm10.zip unzip -j -d assets/index_mm10 mm10.zip \*.bt2
This will create a set of files under
assets/index_hg38
orassets/index_mm10
. If we want to use themm10
genome assembly, for example, the code above will populateassets/index_mm10
with the following files:mm10.1.bt2
,mm10.2.bt2
,mm10.3.bt2
,mm10.4.bt2
,mm10.rev.1.bt2
,mm10.rev.2.bt2
. The path prefix to this index (as accepted by thebowtie2 -x <bt2-idx>
argument) is thereforeassets/index_mm10/mm10
, which is set in the configuration file,config.yaml
.Note that the pre-built indices linked above use UCSC chromosome names (
chr1
,chr2
, ...,chrX
,chrY
,chrM
). If your alignment indices use Ensembl chromosome names (1
,2
, ...,X
,Y
,MT
), this pipeline includes a step to convert chromosome names in BAM files to UCSC chromosome names.
-
-
Barcode Identification Efficiency (
workup/ligation_efficiency.txt
)- A statistical summary of how many barcode tags were found per read and the proportion of reads with a matching barcode at each barcode position.
-
Cluster file (
workup/clusters/<sample>.clusters
)- Each line in a cluster file represents a single cluster. The first column is the cluster barcode. The remainder of the line is a tab deliminated list of reads. DNA reads are formated as
DPM[strand]_chr:start-end
and Antibody ID oligo reads are formated asBPM[]_<AntibodyID>:<UMI>-0
.
- Each line in a cluster file represents a single cluster. The first column is the cluster barcode. The remainder of the line is a tab deliminated list of reads. DNA reads are formated as
-
Cluster statistics (
workup/clusters/cluster_statistics.txt
)- The number of clusters and BPM or DPM reads per library.
-
Cluster size distribtion (
workup/clusters/[BPM,DPM]_cluster_distribution.pdf
)- The distribution showing the proportion of clusters that belong to each size category.
-
Cluster size read distribution (
workup/clusters/[BPM,DPM]_read_distribution.pdf
)- The distribution showing the proportion of reads that belong to clusters of each size category. This can be more useful than the number of clusters since relatively few large clusters can contain many sequencing reads (i.e., a large fraction of the library) while many small clusters will contain few sequencing reads (i.e., a much smaller fraction of the library).
-
Maximum Representation Oligo ECDF (
workup/clusters/Max_representation_ecdf.pdf
)- A plot showing the distribution of proportion of BPM reads in each cluster that belong to the maximum represented Antibody ID in that cluster. A successful experiment should have an ECDF close to a right angle. Deviations from this indicate that beads contain mixtures of antibody ID oligos. Understanding the uniqueness of Antibody ID reads per cluster is important for choosing the thresholding parameters (
min_oligo
,proportion
) for cluster assignment.
- A plot showing the distribution of proportion of BPM reads in each cluster that belong to the maximum represented Antibody ID in that cluster. A successful experiment should have an ECDF close to a right angle. Deviations from this indicate that beads contain mixtures of antibody ID oligos. Understanding the uniqueness of Antibody ID reads per cluster is important for choosing the thresholding parameters (
-
Maximum Representation Oligo Counts ECDF (
workup/clusters/Max_representation_ecdf.pdf
)- A plot showing the distribution of number of BPM reads in each cluster that belong to the maximum represented Antibody ID in that cluster. If clusters are nearly unique in Antibody ID composition, this plot is a surrogate for BPM size distribtuion. Understanding the number of Antibody ID reads per cluster is important for choosing the thresholding parameters (
min_oligo
,proportion
) for cluster assignment.
- A plot showing the distribution of number of BPM reads in each cluster that belong to the maximum represented Antibody ID in that cluster. If clusters are nearly unique in Antibody ID composition, this plot is a surrogate for BPM size distribtuion. Understanding the number of Antibody ID reads per cluster is important for choosing the thresholding parameters (
-
BAM Files for Individual Antibodies (
workup/splitbams/*.bam
)- Thresholding criteria (
min_oligos
,proportion
,max_size
) for assigning individual clusters to individual antibodies are set inconfig.yaml
. - The "none" BAM file (
<sample>.DNA.merged.labeled_none.bam
) contains DNA reads from clusters without Antibody ID reads. - THe "ambigious" BAM file (
<sample>.DNA.merged.labeled_ambiguous.bam
) contains DNA reads from clusters that failed theproportion
thresholding criteria. - The "uncertain" BAM file (
<sample>.DNA.merged.labeled_uncertain.bam
) contains DNA reads from clusters that failed themin_oligo
thresholding criteria. - The "filtered" BAM file (
<sample>.DNA.merged.labeled_filtered.bam
) contains DNA reads from clusters that failed the max_size thresholding criteria.
- Thresholding criteria (
-
Read Count Summary for Individual Antibodies (
workup/splitbams/splitbam_statistics.txt
)- The number of read counts contained within each individual BAM file assigned to individual antibodies.
Adapted from the SPRITE and RNA-DNA SPRITE pipelines by Isabel Goronzy (@igoronzy).
Other contributors
- Benjamin Yeh (@bentyeh)
- Andrew Perez (@HeyDrew64)
- Mario Blanco (@mrblanco)
- Mitchell Guttman (@mitchguttman)