This pipeline assumes an existing conda installation and is written as a Snakemake workflow. To install Snakemake with conda, run

```bash
conda env create -f envs/snakemake.yaml
conda activate snakemake
```

to create and activate a conda environment named `snakemake`. Once all the input files are ready, run the pipeline on a SLURM server environment with

```bash
./run_pipeline.sh
```

After the pipeline finishes, explore cluster properties in depth using the cluster quality control Jupyter notebook (`chipdip_qc.ipynb`). The conda environment specified by `envs/chipdip_qc.yaml` includes all dependencies used by the Jupyter notebook, excluding Jupyter itself. If you already have a Jupyter setup, you can add this environment as a new kernel. Otherwise, include the dependency `jupyterlab=4.0.4` to create a standalone Jupyter setup.
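For example, one way to make the QC environment available to an existing Jupyter setup is to register it as a kernel (a sketch; assumes the YAML file names the environment `chipdip_qc` and that `ipykernel` is available in it):

```bash
conda env create -f envs/chipdip_qc.yaml
conda activate chipdip_qc
conda install ipykernel  # only needed if ipykernel is not already in the environment
python -m ipykernel install --user --name chipdip_qc --display-name "ChIP-DIP QC"
```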
Other common usage notes
- To run the pipeline on a local computer (e.g., laptop), comment out or remove the `--cluster-config cluster.yaml` and `--cluster "sbatch ..."` arguments within `./run_pipeline.sh`, and set the number of jobs `-j <#>` to the number of local processors available. `run_pipeline.sh` passes any additional arguments to Snakemake.
- To perform a dry-run: `./run_pipeline.sh --dry-run`
- To force (re)execution of all rules regardless of past output: `./run_pipeline.sh --forceall`
- To extend the wait time (in seconds) for Snakemake to check for files after rules are completed: `./run_pipeline.sh --latency-wait 30`
  - In distributed computing environments (e.g., on a high performance computing (HPC) cluster), there may be significant latency for files to appear across the shared filesystem, so a high wait time (e.g., 30+ seconds) may be required.
- (in development) To check for common mistakes and formatting issues in `config.yaml`, `config.txt`, `format.txt`, and `chrom_map.txt`: `./run_pipeline.sh validate`
- To remove intermediate files generated by the pipeline: `./run_pipeline.sh clean`
- To create a reusable ChIP-DIP pipeline conda environment, run `conda env create -f envs/chipdip.yaml` and modify the `conda_env` key in `config.yaml` to `"chipdip"`.
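For example, a typical local session might preview the jobs and then launch a 4-core run (a sketch; assumes the cluster-submission arguments in `./run_pipeline.sh` have been commented out as described above):

```bash
./run_pipeline.sh --dry-run               # preview which rules would run
./run_pipeline.sh -j 4 --latency-wait 30  # run with 4 local processors
```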
To test that the pipeline runs correctly in your compute environment, you can verify that you obtain correct cluster statistics and demultiplexed BAM files from the provided example data. Note that this assumes that you are using the default configuration settings and have the `chipdip` conda environment installed:

```bash
conda activate chipdip
tests/verify_merged_splitbams_from_example_data.sh
```

Expected output upon successful verification:

```
AB1-A1 matches reference.
AB2-A2 matches reference.
MD5 checksum of cluster_statistics.txt matches reference.
```
Hardware: This pipeline was developed to be run on an HPC cluster, but it can also be run locally on a personal computer.
- See benchmarks below for a sense of how hardware requirements scale with dataset size.
- Note: the `chipdip` conda environment itself takes ~2.2 GB of disk space, and the unzipped mm10 and hg38 Bowtie 2 indices occupy 3.6 GB and 3.9 GB, respectively. The disk space numbers in the benchmark table below do not account for the size of the conda environment, the Bowtie 2 indices, or temporary disk usage during the pipeline run.
- Recommended hardware: 4+ CPU cores, 24+ GB memory, 60 GB free disk space
Operating system: Linux, macOS
- Lack of Windows support is due to our use of the Bioconda channel for creating the conda environments described in `envs/`. Bioconda currently does not support Windows. However, we have successfully run this pipeline on Windows Subsystem for Linux (WSL).
- While Bioconda supports macOS and has built an extensive repository of packages for x86_64-based Macs (i.e., with Intel processors; conda platform `osx-64`), only a limited number of packages (or versions of packages) have been built for newer ARM-based Macs (i.e., with M-series Apple Silicon processors; conda platform `osx-arm64`). Consequently, the versions specified by the environment YAML files described in `envs/` may not be available on Bioconda for the `osx-arm64` platform.
  - To run this pipeline on an Apple Silicon Mac, we recommend using conda to create an environment targeting the `osx-64` platform and executing the pipeline programs under Rosetta translation:
    - Create a reusable environment using `conda env create -f envs/chipdip.yaml --platform osx-64`, and modify the `conda_env` key in `config.yaml` to `"chipdip"`.
    - While macOS should automatically prompt to install Rosetta if needed, it can also be manually installed using the terminal command `/usr/sbin/softwareupdate --install-rosetta`.
Code interpreters and runtimes: The pipeline relies on scripts written in Java, Bash, and Python, and has been validated using the following versions:
- Java: 8.0.322 through 11.0.22 (the `envs/chipdip.yaml` conda environment file currently uses version 8.0.412)
- Bash: 4.2 through 5.1
- Python: 3.9+ (the `envs/chipdip.yaml` conda environment file currently uses version 3.10)
Packages: Additional required third-party programs and packages are specified in conda environments described in `envs/`.
- Note: Other recent versions of the same software programs will likely work, but we have not tested all of them. Some specific requirements are discussed below.
- The version of `deeptools` used (3.5.2) requires a `matplotlib` version between 3.1.0 and 3.7.5, since later versions of `matplotlib` deprecate some functions used by `deeptools` version 3.5.2. Newer versions of `deeptools` have been updated to support the newer `matplotlib` APIs.
- NumPy version 2.x is not currently supported.
| Dataset | CPU cores and architecture | Operating system | Pipeline wall time (HH:MM:SS) | Core-hours utilized (HH:MM:SS) | Maximum RAM utilization | Disk space of output files | Benchmark times include creating `chipdip` conda environment? | Notes |
|---|---|---|---|---|---|---|---|---|
| example dataset provided in `data/`: 4,532 paired reads (≤ 130 bp read 1, ≤ 150 bp read 2) | 2 cores of AMD EPYC 7763 | Ubuntu 20.04.6 LTS | 00:13:01 | 00:20:28 | 2.91 GB | 20 MB | No | run "locally" on a single node on GitHub Codespaces |
| example dataset provided in `data/` | Intel Core i5-4300U | Ubuntu 22.04.4 LTS via WSL | 00:12:23 | 00:41:31 | 2.76 GB | 14 MB | No | Hyperthreading enabled; snakemake run with `--jobs 4` |
| example dataset provided in `data/` | Apple M2 (4 performance + 4 efficiency cores) | macOS Sequoia 15.1.1 | 00:04:45 | 00:14:06 | 2.33 GB | 16 MB | No | conda environment targeting `osx-64` platform; snakemake run with `--jobs 4` |
| example dataset provided in `data/` | 2 cores of Intel Xeon Gold 6130 | RHEL 9.3 | 00:21:20 | 00:26:35 | 827.49 MB | 20 MB | Yes | run "locally" on a single node on Caltech's HPC |
| 49,222,185 paired reads (89 bp read 1, 209 bp read 2) | 8 cores of Intel Xeon Gold 6130 | RHEL 9.3 | 02:43:38 | 09:49:34 | 10.65 GB | 45 GB | Yes | run "locally" on a single node on Caltech's HPC |
Terms
- split-pool tag: the individual sequences that are added during a single round of split-pool barcoding (DPM, EVEN, ODD, TERM)
- split-pool barcode: the permutation of split-pool tags that uniquely identifies a cluster
- Antibody ID: a 9 nt sequence within the antibody oligo that uniquely identifies a type of antibody. In the pipeline, these are identified as a "BPM" tag.
- cluster: a set of reads sharing the same split-pool barcode. Antibody oligo reads in the cluster are used to assign genomic DNA reads to specific antibodies.
Workflow
- Define samples and paths to FASTQ files (`fastq2json.py`, or manually generate `samples.json`)
- Split FASTQ files into chunks for parallel processing
- Adaptor trimming (Trim Galore!)
- Barcode identification
- Split genomic DNA (DPM) and antibody oligo (BPM) reads into separate files
- Genomic DNA read workflow:
  - DPM trimming (cutadapt)
  - Alignment (bowtie2)
  - Chromosome renaming (e.g., to UCSC chromosome names) and filtering (e.g., removing non-canonical chromosomes)
  - Masking (based on ENCODE blacklists)
- Antibody oligo read workflow:
  - BPM trimming (cutadapt)
  - FASTQ to BAM conversion
- Cluster generation
- Cluster assignment and antibody-specific BAM file creation
- Antibody-specific bigWig file creation
- Summary plots
  - Genomic DNA and antibody oligo cluster size distributions
  - Maximum representation oligo ECDFs
- Summary statistics
  - MultiQC (trimming, alignments)
  - Barcode identification efficiency
  - Cluster statistics
  - Read assignment statistics
Notes / FAQ
- How are antibody oligo PCR duplicates removed? In the FASTQ-to-BAM conversion step of the antibody oligo read workflow above (`fastq_to_bam.py`), oligo UMIs are converted to integers and used as the reference position (column 4) in the BAM file. During cluster generation (for both the split clusters and the merged clusters), reads mapped to the same position are deduplicated.
- How are DNA sequence PCR duplicates removed? We assume that DNA molecules from a batch of cells stochastically fragmented in the crosslinking ChIP assay are unlikely to have the exact same start and end positions. Therefore, we consider identical DNA fragments (same start and end positions) in the same cluster to be PCR duplicates. As with antibody oligo PCR duplicates, during cluster generation (for both the split clusters and the merged clusters), reads mapped to the same position are deduplicated.
- The pipeline is non-deterministic, but the output should be functionally equivalent between runs. Specifically, the non-determinism affects the order in which reads are sorted in BAM files (even after running `samtools sort`) and which representative read from a set of PCR duplicates is kept, none of which should affect downstream analyses (e.g., bigWig generation, peak calling, etc.). Non-determinism arises from the following causes of reads being processed in different orders:
  - If `num_chunks` is set to greater than 1 in `config.yaml`, then the order in which chunks of reads are processed can change between pipeline runs.
  - If more than 1 processor is available, then Bowtie 2 will align multiple reads in parallel (`-p` option), and the output order is non-deterministic. We do not use the `--reorder` flag to fix the output order, for performance reasons.
  - If `samples.json` specifies multiple samples and `merge_samples` is set to `true` in `config.yaml`, then the @RG records in the merged BAM file headers will have unique suffixes appended to their IDs. See the `samtools merge` documentation.
See also visualizations of the pipeline generated by Snakemake (commands below assume that Graphviz is installed):
- `./dag.pdf`: `snakemake --dag | dot -Tpdf > dag.pdf`
- `./filegraph.pdf`: `snakemake --filegraph | dot -Tpdf > filegraph.pdf`
- `./rulegraph.pdf`: `snakemake --rulegraph | dot -Tpdf > rulegraph.pdf`
We will refer to 4 directories:

- Working directory: We follow the Snakemake documentation in using the term "working directory" to refer to either the directory in which you have invoked Snakemake or whatever was specified for `--directory` or the `workdir:` directive.
  - This is also where Snakemake creates a `.snakemake` directory, within which it installs conda environments and keeps track of metadata regarding the pipeline.
- Pipeline directory: where the software (including the `Snakefile` and scripts) resides
  - `envs/`
  - `scripts/`
  - `fastq2json.py`
  - `Snakefile`
- Input directory: where configuration and data files reside
  - `assets/`
  - `data/`
  - `cluster.yaml`: paths are specified relative to the working directory
  - `config.yaml`: paths are specified relative to the working directory
  - `config.txt` / `example_config.txt`
  - `format.txt` / `example_format.txt`
  - `samples.json` / `example_samples.json`: paths are specified relative to the working directory
  - `run_pipeline.sh`: the paths in the arguments `--snakefile <path to Snakefile>`, `--cluster-config <path to cluster.yaml>`, and `--configfile <path to config.yaml>` are relative to where you run `run_pipeline.sh`
- Output directory: where to place this directory can be changed in `config.yaml`
  - `alignments/`
  - `alignments_parts/`
  - `bigwigs/`
  - `clusters/`: cluster file and cluster statistics
  - `clusters_parts/`
  - `fastqs/`: adapter-trimmed reads with identified barcodes appended to the read name
    - `<sample>_R1.part_<###>.barcoded_bpm.fastq.gz`: read 1 of fully-barcoded templates with a BPM tag, corresponding to antibody oligos
    - `<sample>_R1.part_<###>.barcoded_dpm.fastq.gz`: read 1 of fully-barcoded templates with a DPM tag, corresponding to genomic DNA
    - `<sample>_R1.part_<###>.barcoded_short.fastq.gz`: read 1 of templates where one or more expected tags could not be identified
    - `<sample>_R1.part_<###>.barcoded_other.fastq.gz`: read 1 of templates where one or more tags were identified in the wrong position in the barcode
  - `logs/`
  - `qc/`
  - `splitbams/`
  - `split_fastq/`: unprocessed reads split into chunks for parallel processing
  - `trimmed/`: adapter- and tag-trimmed reads
    - `<sample>_R<1|2>.part_<###>_val_<1|2>.fq.gz`: adapter-trimmed reads
    - `<sample>_R1.part_<###>.barcoded_bpm.RDtrim.fastq.gz`: read 1 of fully-barcoded templates with a BPM tag (corresponding to antibody oligos) where the BPM tag has been trimmed
    - `<sample>_R1.part_<###>.barcoded_dpm.RDtrim.fastq.gz`: read 1 of fully-barcoded templates with a DPM tag (corresponding to genomic DNA) where the DPM tag has been trimmed
  - `barcode_identification_efficiency.txt`
  - `pipeline_counts.txt`

For reproducibility, we recommend keeping the pipeline, input, and output directories together. In other words, the complete directory should look like this GitHub repository with an extra output subdirectory (by default, `workup/`) created upon running this pipeline.
However, the pipeline directory can also be kept separate and used repeatedly on different datasets.
- The working directory should stay with the input directory, so that the `.snakemake` folder containing the Snakemake pipeline metadata (which keeps track of which steps of the pipeline have completed) is paired with the configuration files.
  - Assuming that the above directory structure is followed, most of the paths in `config.yaml` can remain relative paths to configuration files and asset files in the input directory. The only paths in `config.yaml` that need to be modified are `scripts_dir` and `output_dir`.
  - Modify the `--snakefile <path to Snakefile>` argument in `run_pipeline.sh` to point to the Snakefile in the pipeline directory.
  - Run `run_pipeline.sh` from the input directory.
- To reuse the `chipdip` conda environment, create a discoverable `chipdip` conda environment (e.g., `conda env create -f envs/chipdip.yaml`) and set the `use_existing_conda_env` option in `config.yaml` to `true`.
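Concretely, a separated layout might be used as follows (a sketch; both paths are hypothetical):

```bash
cd ~/experiments/chipdip_run1                # input (and working) directory with config.yaml, samples.json, etc.
~/software/chipdip-pipeline/run_pipeline.sh  # pipeline directory kept elsewhere; --snakefile updated accordingly
```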
- `config.yaml`: Pipeline configuration - YAML file containing the processing settings and paths of required input files. As noted above, paths are specified relative to the working directory.
  - Required? Yes. Must be provided in the working directory, or specified via `--configfile <path_to_config.yaml>` when invoking Snakemake.
  - Required keys
    - `scripts_dir`: path to scripts folder in the pipeline directory
    - `samples`: path to `samples.json` file
    - `barcode_config`: path to barcode config file (e.g., `config.txt`)
    - `bowtie2_index`: path to Bowtie 2 genome index
    - `cutadapt_dpm`: path to DPM sequences
    - `cutadapt_oligos`: path to Antibody ID sequences
    - `bead_umi_length`: integer length of bead oligo UMIs
  - Optional keys: If these keys are omitted from `config.yaml` or set to `null`, then they will take on the default values indicated. For keys whose default values are `null`, setting them to `null` will produce behaviors as described below. A minimal example configuration is sketched after this section.
    - `output_dir` (default = `"workup"`): path to create the output directory within which all intermediate and output files are placed.
    - `temp_dir` (default = `"/central/scratch"`): path to a temporary directory, such as used by the `-T` option of GNU sort
    - `barcode_format` (default = `null`): path to barcode format file (e.g., `format.txt`). If `null`, no barcode validation is performed.
    - `conda_env` (default = `"envs/chipdip.yaml"`): either a path to a conda environment YAML file ("*.yml" or "*.yaml") or the name of an existing conda environment. If a path to a conda environment YAML file is given, Snakemake will create a new conda environment within the `.snakemake` folder of the working directory.
    - `mask` (default = `null`): path to BED file of genomic regions to ignore, such as ENCODE blacklist regions; reads mapping to these regions are discarded. If `null`, no masking is performed.
    - `path_chrom_map` (default = `null`): path to chromosome name map file. If `null`, chromosome renaming and filtering are skipped, and the final BAM and/or bigWig files will use all chromosome names as-is from the Bowtie 2 index.
    - `num_chunks` (default = `2`): integer giving the number of chunks to split FASTQ files from each sample into for parallel processing
    - `generate_splitbams` (default = `false`): boolean value indicating whether to generate separate BAM files for each antibody target
    - `min_oligos` (default = `2`): integer giving the minimum count of deduplicated antibody oligo reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `proportion` and `max_size` criteria
    - `proportion` (default = `0.8`): float giving the minimum proportion of deduplicated antibody oligo reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `min_oligos` and `max_size` criteria
    - `max_size` (default = `10000`): integer giving the maximum count of deduplicated genomic DNA reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `min_oligos` and `proportion` criteria
    - `merge_samples` (default = `false`): boolean indicating whether to merge cluster files and target-specific BAM and bigWig files across samples
    - `binsize` (default = `false`): integer specifying the bigWig bin size; set to `false` to skip bigWig generation. Only relevant if `generate_splitbams` is `true`.
    - `bigwig_normalization` (default = `"None"`): normalization strategy for calculating coverage from reads; passed to the `--normalizeUsing` argument of the `bamCoverage` command from the deepTools suite. As of version 3.5.2, deepTools `bamCoverage` currently supports `RPKM`, `CPM`, `BPM`, `RPGC`, or `None`. Only relevant if bigWig generation is requested (i.e., `generate_splitbams` is `true` and `binsize` is not `false`).
    - `effective_genome_size` (default = `null`): integer specifying the effective genome size (see the deepTools documentation for a definition). If `null`, the effective genome size is computed as the number of unmasked positions in the Bowtie 2 index, after selecting for chromosomes specified in the chromosome name map file and excluding regions specified by the mask file. Only relevant if bigWig generation is requested using normalization strategy `RPGC` (i.e., `generate_splitbams` is `true`, `binsize` is not `false`, and `bigwig_normalization` is `RPGC`).
    - `email` (default = `null`): email address to send error notifications to if errors are encountered during the pipeline. If `null`, no emails are sent.
  - Additional notes
    - `null` values can be specified explicitly (e.g., `barcode_format: null`) or implicitly (e.g., `barcode_format:`).
    - For keys `barcode_format`, `mask`, `path_chrom_map`, and `email`, an empty string `""` is treated identically to a value of `null`.
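To illustrate, a minimal configuration covering the required keys plus a few common optional keys might look like the following (a sketch written as a shell heredoc; the file name and all values shown are illustrative, not prescriptive):

```bash
cat > my_config.yaml << 'EOF'
scripts_dir: "../chipdip-pipeline/scripts"  # scripts/ in the pipeline directory
samples: "samples.json"
barcode_config: "config.txt"
bowtie2_index: "assets/index_mm10/mm10"
cutadapt_dpm: "assets/dpm96.fasta"
cutadapt_oligos: "assets/bpm.fasta"
bead_umi_length: 8       # illustrative; use the UMI length from your bead oligo design
output_dir: "workup"
generate_splitbams: true
binsize: false           # set to an integer bin size to also generate bigWigs
EOF
```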
- `samples.json`: Samples file - JSON file with the paths of FASTQ files (read 1, read 2) to process.
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `samples`
  - This can be prepared using `fastq2json.py --fastq_dir <path_to_directory_of_FASTQs>` or manually formatted as follows:

    ```json
    {
      "sample1": {
        "R1": ["<path_to_data>/sample1_R1.fastq.gz"],
        "R2": ["<path_to_data>/sample1_R2.fastq.gz"]
      },
      "sample2": {
        "R1": ["<path_to_data>/sample2_R1.fastq.gz"],
        "R2": ["<path_to_data>/sample2_R2.fastq.gz"]
      },
      ...
    }
    ```

  - Data assumptions:
    - FASTQ files are gzip-compressed.
    - Read names do not contain two consecutive colons (`::`). This is required because the pipeline adds `::` to the end of read names before adding barcode information; the string `::` is used as a delimiter in the pipeline to separate the original read name from the identified barcode.
  - The pipeline (in particular, the script `scripts/bash/split_fastq.sh`) currently only supports one read 1 (R1) and one read 2 (R2) FASTQ file per sample.
    - If there are multiple FASTQ files per read orientation per sample (for example, if the same sample was sequenced multiple times, or it was split across multiple lanes during sequencing), the FASTQ files will first need to be concatenated together, and the paths to the concatenated FASTQ files should be supplied in the JSON file (see the example after this list).
  - Each sample is processed independently, generating independent cluster and BAM files. Statistics used for quality assessment (barcode identification efficiency, cluster statistics, MultiQC report, cluster size distributions, splitbam statistics) are computed independently for each sample but reported together in aggregate files to enable quick quality comparison across samples.
  - The provided sample read files under the `data/` folder were simulated via a Google Colab notebook. The genomic DNA reads correspond to ChIP-seq peaks on chromosome 19 (mm10) for transcription factors MYC (simulated as corresponding to Antibody ID `BEAD_AB1-A1`) and TCF12 (simulated as corresponding to Antibody ID `BEAD_AB2-A2`).
  - Sample names (the keys of the samples JSON file) cannot contain any periods (`.`). This is enforced to simplify wildcard pattern matching in the Snakefile and to simplify implementation of `scripts/python/threshold_tag_and_split.py:label_bam_file()`.
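For instance, because gzip-compressed files can be concatenated directly, combining per-lane FASTQ files might look like this (a sketch with hypothetical file names):

```bash
cat sample1_L001_R1.fastq.gz sample1_L002_R1.fastq.gz > sample1_R1.fastq.gz
cat sample1_L001_R2.fastq.gz sample1_L002_R2.fastq.gz > sample1_R2.fastq.gz
```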
- `assets/bpm.fasta`: FASTA file containing the sequences of Antibody IDs
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `cutadapt_oligos`
  - Used by: `cutadapt` (Snakefile `rule cutadapt_oligo`)
  - Each sequence should be preceded by `^` to anchor the sequence during cutadapt trimming (see Snakefile `rule cutadapt_oligo`).
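For reference, the expected format looks like the following sketch, using the two Antibody ID sequences from the annotated barcode config example below (header naming in the actual `assets/bpm.fasta` may differ):

```
>BEAD_AB1-A1
^GGAACAGTT
>BEAD_AB2-A2
^CGCCGAATT
```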
- `assets/dpm96.fasta`: FASTA file containing the sequences of DPM tags
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `cutadapt_dpm`
  - Used by: `cutadapt` (Snakefile `rule cutadapt_dpm`)
  - Each of these sequences is 10 nt long, consisting of a unique 9 nt DPM_Bottom sequence as originally designed for SPRITE (technically, only the first 8 nt are unique, and the 9th nucleotide is always a `T`), plus a `T` that is complementary to the 3' `A` added to a genomic DNA sequence via dA-tailing.
- `config.txt`: Barcode config file - text file containing the sequences of split-pool tags and the expected split-pool barcode structure.
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `barcode_config`
  - Used by: `scripts/java/BarcodeIdentification_v1.2.0.jar` (Snakefile `rule barcode_id`), `scripts/python/fastq_to_bam.py` (Snakefile `rule fastq_to_bam`), and `scripts/python/barcode_identification_efficiency.py` (Snakefile `rule barcode_identification_efficiency`).
    - This file is also parsed in the Snakefile itself to determine the length of the barcode (i.e., the number of rounds of barcoding) and, if `generate_splitbams` is set to `true` in `config.yaml`, the set of antibody targets for which to generate individual demultiplexed BAM files (and bigWig files too, if requested).
  - Format: SPRITE configuration file (see our SPRITE GitHub Wiki or Nature Protocols paper for details).
    - Blank lines and lines starting with `#` are ignored.
    - An example barcoding configuration file is annotated below:

      ```
      # Barcoding layout for read 1 and read 2
      # - Y represents a terminal tag
      # - ODD, EVEN, and DPM indicate their respective tags
      # - SPACER accounts for the 7-nt sticky ends that allow ligation between tags
      READ1 = DPM
      READ2 = Y|SPACER|ODD|SPACER|EVEN|SPACER|ODD|SPACER|EVEN|SPACER|ODD

      # DPM tag sequences formatted as tab-delimited lines
      # 1. Tag category: DPM
      # 2. Tag name: must contain "DPM", such as "DPM<xxx>"; must NOT contain "BEAD"
      #    - Can only contain alphanumeric characters, underscores, and hyphens,
      #      i.e., must match the regular expression "[a-zA-Z0-9_\-]+"
      # 3. Tag sequence (see assets/dpm96.fasta)
      # 4. Tag error tolerance: acceptable Hamming distance between
      #    expected tag sequence (column 3) and tag sequence in the read
      DPM	DPMBot6_1-A1	TGGGTGTT	0
      DPM	DPMBot6_2-A2	TGACATGT	0
      ...

      # Antibody ID sequences formatted as tab-delimited lines
      # - Identical format as for DPM tag sequences, except that Tag name (column 2)
      #   must start with "BEAD_".
      # - Tag sequences must match assets/bpm.fasta
      DPM	BEAD_AB1-A1	GGAACAGTT	0
      DPM	BEAD_AB2-A2	CGCCGAATT	0
      ...

      # Split-pool tag sequences: same 4-column tab-delimited format as the
      # DPM and Antibody ID sections above, except that
      # Tag category (column 1) is now ODD, EVEN, or Y.
      # Tag name must NOT contain "BEAD" or "DPM".
      EVEN	EvenBot_1-A1	ATACTGCGGCTGACG	2
      EVEN	EvenBot_2-A2	GTAGGTTCTGGAATC	2
      ...
      ODD	OddBot_1-A1	TTCGTGGAATCTAGC	2
      ODD	OddBot_2-A2	CCTGTGCGTTAGAGT	2
      ...
      Y	NYStgBot_1-A1	TATTATGGT	0
      Y	NYStgBot_2-A2	TAGCTACCTT	0
      ...
      ```

  - Notes regarding the entries in `example_config.txt`
    - Names: Each name ends with `#-Well` (for example, `4-A4`), where the `#` gives the row-major index of the tag in a 96-well plate, and `Well` denotes the corresponding row and column.
    - Because only the first 2 antibody IDs are included in the example dataset, the other antibody ID rows are commented out. This prevents generation of empty (0 byte) placeholder files for the other 94 antibody IDs.
    - Sequences
      - The design of DPM tags allows for 9 bp of unique sequence, but only 8 bp are used in the published SPRITE tag set (in bottom tags, the 9th bp is currently a constant `T`). `example_config.txt` therefore only includes the unique 8 bp sequences.
      - The design of EVEN and ODD tags allows for 17 bp of unique sequence, but only 16 bp are used in the published SPRITE tag set (in bottom tags, the 17th bp is currently a constant `T`). `example_config.txt` further trims the 1st unique bp from the bottom tag, leaving only the middle 15 bp unique bottom sequence.
      - The design of Y (terminal) tags allows for 9-12 bp of unique sequence.
- `format.txt`: Barcode format file - tab-delimited text file indicating which split-pool barcode tags are valid in which round of split-pool barcoding (i.e., at which positions in the barcoding string).
  - Required? No, but highly recommended.
  - `config.yaml` key to specify the path to this file: `barcode_format`
  - Used by: `scripts/python/split_bpm_dpm.py` (Snakefile `rule split_bpm_dpm`)
  - Column 1 indicates the zero-indexed position of the barcode string where a tag can be found.
    - Terminal barcode tags (Y) are position `0`; the second-to-last round of barcoding tags are position `1`; etc. A value of `-1` in the position column indicates that the barcode tag was not used in the experiment.
  - Column 2 indicates the name of the tag. This must be the same as the name of the tag in the barcode config file. If the same tag is used in multiple barcoding rounds, then it should appear multiple times in column 2, with different values in column 1 indicating which rounds it is used in.
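A hypothetical excerpt (tab-delimited), using tag names from `example_config.txt` and assuming each tag is valid only in the indicated round:

```
0	NYStgBot_1-A1
0	NYStgBot_2-A2
1	OddBot_1-A1
2	EvenBot_1-A1
```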
- `chrom_map.txt`: Chromosome names map - tab-delimited text file specifying which chromosomes from the Bowtie 2 index to keep and how to rename them (if at all).
  - Required? No, but necessary if using a blacklist mask that uses different chromosome names than the Bowtie 2 index.
  - `config.yaml` key to specify the path to this file: `path_chrom_map`
  - Used by: `scripts/python/rename_and_filter_chr.py` (Snakefile `rule rename_and_filter_chr`, `rule merge_mask`, and `rule effective_genome_size`)
  - Column 1 specifies chromosomes (following the naming convention used in the index) to keep.
    - The order of chromosomes provided here is maintained in the SAM/BAM file header, and consequently specifies the coordinate sorting order at the reference sequence level.
  - Column 2 specifies new chromosome names for the corresponding chromosomes in column 1.
  - The provided `chrom-map.txt` in this repository contains examples for retaining only canonical human or mouse chromosomes (i.e., excluding alternate loci, unlocalized sequences, and unplaced sequences) and renaming them to UCSC chromosome names (i.e., `chr1`, `chr2`, ..., `chrX`, `chrY`, `chrM`) as needed. The header of the provided file also includes more detailed documentation about the specific format requirements, such as allowed characters.
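For example, a map converting Ensembl chromosome names to UCSC names could contain lines like these (a hypothetical excerpt; see the header of the provided `chrom-map.txt` for the authoritative format requirements):

```
1	chr1
2	chr2
X	chrX
Y	chrY
MT	chrM
```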
- `assets/blacklist_hg38.bed`, `assets/blacklist_mm10.bed`: blacklisted genomic regions for ChIP-seq data
  - Required? No, but highly recommended.
  - `config.yaml` key to specify the path to this file: `mask`
  - Used by: Snakefile `rule merge_mask`, whose output is used by `rule repeat_mask` and `rule effective_genome_size`
  - For human genome release hg38, we use ENCFF356LFX from ENCODE. For mouse genome release mm10, we use mm10-blacklist.v2.bed.gz. These BED files use UCSC chromosome names (e.g., `chr1`, `chr2`, ...). The pipeline performs chromosome name remapping (if specified) before this step.
    - Reference paper: Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9(1):9354. doi:10.1038/s41598-019-45839-z
    - Example code used to download them into the `assets/` directory:

      ```bash
      wget -O - https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz | zcat | sort -V -k1,3 > "assets/blacklist_hg38.bed"
      wget -O - https://github.com/Boyle-Lab/Blacklist/raw/master/lists/mm10-blacklist.v2.bed.gz | zcat | sort -V -k1,3 > "assets/blacklist_mm10.bed"
      ```
- `assets/index_mm10/*.bt2`, `assets/index_hg38/*.bt2`: Bowtie 2 genome index
  - Required? Yes.
  - `config.yaml` key to specify the path to the index: `bowtie2_index`
  - Used by: Snakefile `rule bowtie2_align` and `rule effective_genome_size`
  - If you do not have an existing Bowtie 2 index, you can download pre-built indices from the Bowtie 2 developers:

    ```bash
    # for human primary assembly hg38
    mkdir -p assets/index_hg38
    wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
    unzip -j -d assets/index_hg38 GRCh38_noalt_as.zip \*.bt2

    # for mouse primary assembly mm10
    mkdir -p assets/index_mm10
    wget https://genome-idx.s3.amazonaws.com/bt/mm10.zip
    unzip -j -d assets/index_mm10 mm10.zip \*.bt2
    ```

    This will create a set of files under `assets/index_hg38` or `assets/index_mm10`. If we want to use the `mm10` genome assembly, for example, the code above will populate `assets/index_mm10` with the following files: `mm10.1.bt2`, `mm10.2.bt2`, `mm10.3.bt2`, `mm10.4.bt2`, `mm10.rev.1.bt2`, `mm10.rev.2.bt2`. The path prefix to this index (as accepted by the `bowtie2 -x <bt2-idx>` argument) is therefore `assets/index_mm10/mm10`, which is set in the configuration file, `config.yaml`.

    Note that the pre-built indices linked above use UCSC chromosome names (`chr1`, `chr2`, ..., `chrX`, `chrY`, `chrM`). If your alignment indices use different chromosome names (e.g., Ensembl chromosome names are `1`, `2`, ..., `X`, `Y`, `MT`), update `chrom-map.txt` such that chromosome names in BAM files are converted to UCSC chromosome names. You can check the names of the reference sequences used to build the index with the command `bowtie2-inspect -n <bt2-idx>`.
- `cluster.yaml`: Cluster config file
  - Required? Yes, if running on a compute cluster, such as a SLURM environment.
  - The path to this file is specified using the `--cluster-config` argument to the `snakemake` program, i.e., in the `run_pipeline.sh` script.
  - This file specifies the resources available to each rule. All variables/directives available to shell commands in the Snakefile are also available here (see Snakemake's documentation).
  - Note: for a SLURM system, the directories for the standard output and standard error files specified by the `output:` and `error:` directives must exist prior to execution of the corresponding rule/job (see issue #3); a sketch of pre-creating such a directory follows this list.
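For example, if `cluster.yaml` directs job logs to a `workup/logs/cluster/` directory (a hypothetical path), create it before launching the pipeline:

```bash
mkdir -p workup/logs/cluster
```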
These files are generated in the output directory, which is specified by the `output_dir` key in the pipeline configuration file.
- Barcode identification efficiency (`barcode_identification_efficiency.txt`): A statistical summary of how many tags were found per read and the proportion of reads with a matching tag at each position.
  - The first type of statistic describes how many tags were identified per read. For example, consider a dataset of 10000 adapter-trimmed reads with the expected tag structure as specified in the `example_config.txt` file: 1 DPM tag on read 1; 6 tags (Odd, Even, or Terminal) on read 2.
    - `170 (1.7%) reads found with 1 tag.`: For a small fraction of reads, only 1 tag was identified; this is to be expected, whether due to ligation errors, PCR artifacts, or sequencing errors. These reads are output to `fastqs/<sample_name>.part<###>.barcoded_short.fastq.gz` and are not used for analysis.
    - `7500 (75.0%) reads found with 7 tags.`: This is the expected result, where all 7 tags are identified in the majority of reads. In the `split_bpm_dpm` rule, these reads are split into `fastqs/<sample_name>.part<###>.barcoded_dpm.fastq.gz` or `fastqs/<sample_name>.part<###>.barcoded_bpm.fastq.gz`, depending on whether a read corresponds to genomic DNA or an antibody oligo.
  - The second type of statistic describes at which positions tags were identified in the reads.
    - `9800 (98.0%) reads found with tag in position 1 (read 1, DPM).`: As expected, a DPM-category tag is identified at the start of read 1.
    - `9700 (97.0%) reads found with tag in position 2 (read 2, Y).`: As expected, a terminal tag is identified at the start of read 2.
- Pipeline counts (`pipeline_counts.txt`): A tree-like summary of how many reads remained at each step of the pipeline, produced per aliquot and in aggregate. This can be used to quickly view the proportion of reads corresponding to antibody oligos versus genomic DNA; the proportion of properly barcoded reads; etc.
  - The 4 columns of numbers are as follows:
    1. The number of reads remaining after that step
    2. The proportion of reads remaining relative to the immediately previous step of the pipeline
    3. The proportion of reads remaining relative to the read type - antibody oligo (`bpm`) or genomic DNA (`dpm`)
    4. The proportion of reads remaining relative to the starting number of reads
  - A tabular version is saved to `qc/pipeline_counts.csv`.
- Effective genome size file (`effective_genome_size.txt`): Text file containing a single value: the computed effective genome size. Only generated if bigWig generation is requested with normalization strategy `RPGC`.
- Cluster file (`clusters/<sample>.clusters`): Tab-delimited file, where each line represents a single cluster.
  - The first column is the cluster barcode.
  - The remainder of the line is a list of reads. DNA reads are formatted as `DPM[strand]_chr:start-end`, and antibody oligo reads are formatted as `BPM[]_<AntibodyID>:<UMI>-0`. A hypothetical example line is shown after this list.
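A hypothetical cluster line under this format (tab-delimited; the barcode string, UMI, and coordinates are illustrative only):

```
sample1.NYStgBot_1-A1.OddBot_1-A1.EvenBot_1-A1	BPM[]_AB1-A1:1048576-0	DPM[+]_chr19:3078950-3079100	DPM[-]_chr19:3090200-3090350
```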
- Cluster statistics (`clusters/cluster_statistics.txt`): The number of clusters and antibody oligo (BPM) or genomic DNA (DPM) reads per library.
- Cluster size distribution (`clusters/[BPM,DPM]_cluster_distribution.pdf`): The proportion of clusters that belong to each size category.
- Cluster size read distribution (`clusters/[BPM,DPM]_read_distribution.pdf`): The proportion of reads that belong to clusters of each size category.
  - This can be more useful than the number of clusters, since relatively few large clusters can contain many sequencing reads (i.e., a large fraction of the library), while many small clusters will contain few sequencing reads (i.e., a much smaller fraction of the library).
- Maximum representation oligo ECDF (`clusters/Max_representation_ecdf.pdf`): A plot showing the distribution of the proportion of antibody oligo (BPM) reads in each cluster that belong to the maximum-represented Antibody ID in that cluster.
  - A successful experiment should have an ECDF close to a right angle. Deviations from this indicate that beads contain mixtures of Antibody IDs. Understanding the uniqueness of Antibody ID reads per cluster is important for choosing the thresholding parameter `proportion` for cluster assignment.
- Maximum representation oligo counts ECDF (`clusters/Max_representation_counts.pdf`): A plot showing the distribution of the number of antibody oligo (BPM) reads in each cluster that belong to the maximum-represented Antibody ID in that cluster.
  - If clusters are nearly unique in Antibody ID composition, then this plot is a surrogate for the BPM size distribution. Understanding the number of Antibody ID reads per cluster is important for choosing the thresholding parameter `min_oligos` for cluster assignment.
- BAM files for individual antibodies (`splitbams/*.bam`)
  - Thresholding criteria (`min_oligos`, `proportion`, `max_size`) for assigning individual clusters to individual antibodies are set in `config.yaml`.
  - The "none" BAM file (`<sample>.DNA.merged.labeled_none.bam`) contains DNA reads from clusters without antibody oligo reads.
  - The "ambiguous" BAM file (`<sample>.DNA.merged.labeled_ambiguous.bam`) contains DNA reads from clusters that failed the `proportion` thresholding criterion.
  - The "uncertain" BAM file (`<sample>.DNA.merged.labeled_uncertain.bam`) contains DNA reads from clusters that failed the `min_oligos` thresholding criterion.
  - The "filtered" BAM file (`<sample>.DNA.merged.labeled_filtered.bam`) contains DNA reads from clusters that failed the `max_size` thresholding criterion.
- Read count summary for individual antibodies (`splitbams/splitbam_statistics.txt`)
  - The number of reads contained within each individual BAM file assigned to individual antibodies.
- BigWig files for individual antibodies (`bigwigs/*.bw`)
  - BigWigs are generated using deepTools `bamCoverage` with the bin size set in `config.yaml`. Normalization is performed using the effective genome size (`--normalizeUsing RPGC`), which is calculated as the size of the chromosomes selected via `chrom_map.txt` minus the size of masked regions on those chromosomes.
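The underlying command is roughly equivalent to the following sketch (hypothetical file names; the effective genome size shown is illustrative, as the pipeline computes its own value as described above):

```bash
bamCoverage \
    --bam workup/splitbams/sample1.DNA.merged.labeled_AB1-A1.bam \
    --outFileName workup/bigwigs/sample1.AB1-A1.bw \
    --binSize 1000 \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2652783500  # illustrative value; computed by the pipeline in practice
```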
Calculator for uniqueness of bead barcodes: Google Colab notebook (also available as a GitHub Gist). This calculator can be used to calculate one of two values:
- Given the number of beads and the number of possible unique barcodes, calculate the proportion of beads expected to be uniquely barcoded.
- Given the number of beads and a desired proportion of beads to be uniquely barcoded, calculate the number of barcodes required.
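For intuition on the first calculation (a back-of-the-envelope approximation; the notebook's exact model may differ): with B beads and N equally likely barcode combinations, the expected proportion of uniquely barcoded beads is (1 − 1/N)^(B−1) ≈ e^(−B/N). For example, B = 10^6 beads and N = 10^8 possible barcodes give e^(−0.01) ≈ 99% of beads uniquely barcoded; conversely, achieving a desired proportion p requires roughly N ≈ B / ln(1/p) barcodes.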
Publications
- Perez AA, Goronzy IN, Blanco MR, Yeh BT, Guo JK, Lopes CS, Ettlin O, Burr A, Guttman M. ChIP-DIP maps binding of hundreds of proteins to DNA simultaneously and identifies diverse gene regulatory elements. Nat Genet. 2024 Dec;56(12):2827–2841. doi:10.1038/s41588-024-02000-5. PMID: 39587360.
- Perez AA, Goronzy IN, Blanco MR, Guo JK, Guttman M. ChIP-DIP: A multiplexed method for mapping hundreds of proteins to DNA uncovers diverse regulatory elements controlling gene expression. bioRxiv; 2023. doi:10.1101/2023.12.14.571730. PMID: 38187704; PMCID: PMC10769186.
Developers
- Isabel Goronzy (@igoronzy): Adapted this pipeline from the SPRITE and RNA-DNA SPRITE pipelines.
- Benjamin Yeh (@bentyeh): Streamlined and documented the pipeline. Added more extensive QC outputs. Created simulated data for testing the pipeline. Created the Jupyter notebook for in-depth cluster analysis.
Other contributors
- Andrew Perez (@HeyDrew64)
- Mario Blanco (@mrblanco)
- Mitchell Guttman (@mitchguttman)