This pipeline assumes an existing conda installation and is written as a Snakemake workflow. To install Snakemake with conda, run

```bash
conda env create -f envs/snakemake.yaml
conda activate snakemake
```

to create and activate a conda environment named `snakemake`. Once all the input files are ready, run the pipeline on a SLURM server environment with

```bash
./run_pipeline.sh
```

After the pipeline finishes, explore cluster properties in depth using the cluster quality control Jupyter notebook (`chipdip_qc.ipynb`). The conda environment specified by `envs/chipdip_qc.yaml` includes all dependencies used by the Jupyter notebook, excluding Jupyter itself. If you already have a Jupyter setup, you can add this environment as a new kernel. Otherwise, include the dependency `jupyterlab=4.0.4` to create a standalone Jupyter setup.
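For example, one way to make the QC environment available to an existing Jupyter setup is to register it as a kernel (a sketch; assumes the YAML file names the environment `chipdip_qc` and that `ipykernel` is available in it):

```bash
conda env create -f envs/chipdip_qc.yaml
conda activate chipdip_qc
conda install ipykernel  # only needed if ipykernel is not already in the environment
python -m ipykernel install --user --name chipdip_qc --display-name "ChIP-DIP QC"
```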
Other common usage notes
- To run the pipeline on a local computer (e.g., laptop), comment out or remove the `--cluster-config cluster.yaml` and `--cluster "sbatch ..."` arguments within `./run_pipeline.sh`, and set the number of jobs `-j <#>` to the number of local processors available. `run_pipeline.sh` passes any additional arguments to Snakemake.
- To perform a dry-run: `./run_pipeline.sh --dry-run`
- To force (re)execution of all rules regardless of past output: `./run_pipeline.sh --forceall`
- To extend the wait time (in seconds) for Snakemake to check for files after rules are completed: `./run_pipeline.sh --latency-wait 30`
  - In distributed computing environments (e.g., on a high performance computing (HPC) cluster), there may be significant latency for files to appear across the shared filesystem, so a high wait time (e.g., 30+ seconds) may be required.
- (in development) To check for common mistakes and formatting issues in `config.yaml`, `config.txt`, `format.txt`, and `chrom_map.txt`: `./run_pipeline.sh validate`
- To remove intermediate files generated by the pipeline: `./run_pipeline.sh clean`
- To create a reusable ChIP-DIP pipeline conda environment, run `conda env create -f envs/chipdip.yaml` and modify the `conda_env` key in `config.yaml` to `"chipdip"`.
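For example, a typical local session might preview the jobs and then launch a 4-core run (a sketch; assumes the cluster-submission arguments in `./run_pipeline.sh` have been commented out as described above):

```bash
./run_pipeline.sh --dry-run               # preview which rules would run
./run_pipeline.sh -j 4 --latency-wait 30  # run with 4 local processors
```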
To test that the pipeline runs correctly in your compute environment, you can verify that you obtain correct cluster statistics and demultiplexed BAM files from the provided example data. Note that this assumes that you are using the default configuration settings and have the `chipdip` conda environment installed:

```bash
conda activate chipdip
tests/verify_merged_splitbams_from_example_data.sh
```

Expected output upon successful verification:

```
AB1-A1 matches reference.
AB2-A2 matches reference.
MD5 checksum of cluster_statistics.txt matches reference.
```
Hardware: This pipeline was developed to be run on an HPC cluster, but it can also be run locally on a personal computer.
- See benchmarks below for a sense of how hardware requirements scale with dataset size.
- Note: the `chipdip` conda environment itself takes ~2.2 GB of disk space, and the unzipped mm10 and hg38 Bowtie 2 indices occupy 3.6 GB and 3.9 GB, respectively. The disk space numbers in the benchmark table below do not account for the size of the conda environment, the Bowtie 2 indices, or temporary disk usage during the pipeline run.
- Recommended hardware: 4+ CPU cores, 24+ GB memory, 60 GB free disk space
Operating system: Linux, macOS
- Lack of Windows support is due to our use of the Bioconda channel for creating the conda environments described in `envs/`. Bioconda currently does not support Windows. However, we have successfully run this pipeline on Windows Subsystem for Linux (WSL).
- While Bioconda supports macOS and has built an extensive repository of packages for x86_64-based Macs (i.e., with Intel processors; conda platform `osx-64`), only a limited number of packages (or versions of packages) have been built for newer ARM-based Macs (i.e., with M-series Apple Silicon processors; conda platform `osx-arm64`). Consequently, the versions specified by the environment YAML files described in `envs/` may not be available on Bioconda for the `osx-arm64` platform.
  - To run this pipeline on an Apple Silicon Mac, we recommend using conda to create an environment targeting the `osx-64` platform and executing the pipeline programs under Rosetta translation:
    - Create a reusable environment using `conda env create -f envs/chipdip.yaml --platform osx-64`, and modify the `conda_env` key in `config.yaml` to `"chipdip"`.
    - While macOS should automatically prompt to install Rosetta if needed, it can also be manually installed using the terminal command `/usr/sbin/softwareupdate --install-rosetta`.
Code interpreters and runtimes: The pipeline relies on scripts written in Java, Bash, and Python, and has been validated using the following versions:
- Java: 8.0.322 through 11.0.22 (the `envs/chipdip.yaml` conda environment file currently uses version 8.0.412)
- Bash: 4.2 through 5.1
- Python: 3.9+ (the `envs/chipdip.yaml` conda environment file currently uses version 3.10)
Packages: Additional required third-party programs and packages are specified in conda environments described in `envs/`.
- Note: Other recent versions of the same software programs will likely work, but we have not tested all of them. Some specific requirements are discussed below.
- The version of `deeptools` used (3.5.2) requires a `matplotlib` version between 3.1.0 and 3.7.5, since later versions of `matplotlib` deprecate some functions used by `deeptools` version 3.5.2. Newer versions of `deeptools` have been updated to support the newer `matplotlib` APIs.
- NumPy version 2.x is not currently supported.
| Dataset | CPU cores and architecture | Operating system | Pipeline wall time (HH:MM:SS) | Core-hours utilized (HH:MM:SS) | Maximum RAM utilization | Disk space of output files | Benchmark times include creating `chipdip` conda environment? | Notes |
|---|---|---|---|---|---|---|---|---|
| example dataset provided in `data/`: 4,532 paired reads (≤ 130 bp read 1, ≤ 150 bp read 2) | 2 cores of AMD EPYC 7763 | Ubuntu 20.04.6 LTS | 00:13:01 | 00:20:28 | 2.91 GB | 20 MB | No | run "locally" on a single node on GitHub Codespaces |
| example dataset provided in `data/` | Intel Core i5-4300U | Ubuntu 22.04.4 LTS via WSL | 00:12:23 | 00:41:31 | 2.76 GB | 14 MB | No | Hyperthreading enabled; snakemake run with `--jobs 4` |
| example dataset provided in `data/` | Apple M2 (4 performance + 4 efficiency cores) | macOS Sequoia 15.1.1 | 00:04:45 | 00:14:06 | 2.33 GB | 16 MB | No | conda environment targeting `osx-64` platform; snakemake run with `--jobs 4` |
| example dataset provided in `data/` | 2 cores of Intel Xeon Gold 6130 | RHEL 9.3 | 00:21:20 | 00:26:35 | 827.49 MB | 20 MB | Yes | run "locally" on a single node on Caltech's HPC |
| 49,222,185 paired reads (89 bp read 1, 209 bp read 2) | 8 cores of Intel Xeon Gold 6130 | RHEL 9.3 | 02:43:38 | 09:49:34 | 10.65 GB | 45 GB | Yes | run "locally" on a single node on Caltech's HPC |
Terms
- split-pool tag: the individual sequences that are added during a single round of split-pool barcoding (DPM, EVEN, ODD, TERM)
- split-pool barcode: the permutation of split-pool tags that uniquely identifies a cluster
- Antibody ID: a 9 nt sequence within the antibody oligo that uniquely identifies a type of antibody. In the pipeline, these are identified as a "BPM" tag.
- cluster: a set of reads sharing the same split-pool barcode. Antibody oligo reads in the cluster are used to assign genomic DNA reads to specific antibodies.
Workflow
- Define samples and paths to FASTQ files (`fastq2json.py`, or manually generate `samples.json`)
- Split FASTQ files into chunks for parallel processing
- Adaptor trimming (Trim Galore!)
- Barcode identification
- Split genomic DNA (DPM) and antibody oligo (BPM) reads into separate files
- Genomic DNA read workflow:
  - DPM trimming (cutadapt)
  - Alignment (bowtie2)
  - Chromosome renaming (e.g., to UCSC chromosome names) and filtering (e.g., removing non-canonical chromosomes)
  - Masking (based on ENCODE blacklists)
- Antibody oligo read workflow:
  - BPM trimming (cutadapt)
  - FASTQ to BAM conversion
- Cluster generation
- Cluster assignment and antibody-specific BAM file creation
- Antibody-specific bigWig file creation
- Summary plots
  - Genomic DNA and antibody oligo cluster size distributions
  - Maximum representation oligo ECDFs
- Summary statistics
  - MultiQC (trimming, alignments)
  - Barcode identification efficiency
  - Cluster statistics
  - Read assignment statistics
Notes / FAQ
- How are antibody oligo PCR duplicates removed? In the FASTQ-to-BAM conversion step of the antibody oligo read workflow above (`fastq_to_bam.py`), oligo UMIs are converted to integers and used as the reference position (column 4) in the BAM file. During cluster generation (for both the split clusters and the merged clusters), reads mapped to the same position are deduplicated.
- How are DNA sequence PCR duplicates removed? We assume that DNA molecules from a batch of cells stochastically fragmented in the crosslinking ChIP assay are unlikely to have the exact same start and end positions. Therefore, we consider identical DNA fragments (same start and end positions) in the same cluster to be PCR duplicates. As with antibody oligo PCR duplicates, during cluster generation (for both the split clusters and the merged clusters), reads mapped to the same position are deduplicated.
- The pipeline is non-deterministic, but the output should be functionally equivalent between runs. Specifically, the non-determinism affects the order in which reads are sorted in BAM files (even after running `samtools sort`) and which representative read from a set of PCR duplicates is kept, none of which should affect downstream analyses (e.g., bigWig generation, peak calling, etc.). Non-determinism arises from the following causes of reads being processed in different orders:
  - If `num_chunks` is set to greater than 1 in `config.yaml`, then the order in which chunks of reads are processed can change between pipeline runs.
  - If more than 1 processor is available, then Bowtie 2 will align multiple reads in parallel (`-p` option), and the output order is non-deterministic. We do not use the `--reorder` flag to fix the output order, for performance reasons.
  - If `samples.json` specifies multiple samples and `merge_samples` is set to `true` in `config.yaml`, then the @RG records in the merged BAM file headers will have unique suffixes appended to their IDs. See the `samtools merge` documentation.
See also visualizations of the pipeline generated by Snakemake (commands below assume that Graphviz is installed):
- `./dag.pdf`: `snakemake --dag | dot -Tpdf > dag.pdf`
- `./filegraph.pdf`: `snakemake --filegraph | dot -Tpdf > filegraph.pdf`
- `./rulegraph.pdf`: `snakemake --rulegraph | dot -Tpdf > rulegraph.pdf`
We will refer to 4 directories:

- Working directory: We follow the Snakemake documentation in using the term "working directory" to refer to either the directory in which you have invoked Snakemake or whatever was specified for `--directory` or the `workdir:` directive.
  - This is also where Snakemake creates a `.snakemake` directory, within which it installs conda environments and keeps track of metadata regarding the pipeline.
- Pipeline directory: where the software (including the `Snakefile` and scripts) resides
  - `envs/`
  - `scripts/`
  - `fastq2json.py`
  - `Snakefile`
- Input directory: where configuration and data files reside
  - `assets/`
  - `data/`
  - `cluster.yaml`: paths are specified relative to the working directory
  - `config.yaml`: paths are specified relative to the working directory
  - `config.txt` / `example_config.txt`
  - `format.txt` / `example_format.txt`
  - `samples.json` / `example_samples.json`: paths are specified relative to the working directory
  - `run_pipeline.sh`: the paths in the arguments `--snakefile <path to Snakefile>`, `--cluster-config <path to cluster.yaml>`, and `--configfile <path to config.yaml>` are relative to where you run `run_pipeline.sh`
- Output directory: where to place this directory can be changed in `config.yaml`
  - `alignments/`
  - `alignments_parts/`
  - `bigwigs/`
  - `clusters/`: cluster file and cluster statistics
  - `clusters_parts/`
  - `fastqs/`: adapter-trimmed reads with identified barcodes appended to the read name
    - `<sample>_R1.part_<###>.barcoded_bpm.fastq.gz`: read 1 of fully-barcoded templates with a BPM tag, corresponding to antibody oligos
    - `<sample>_R1.part_<###>.barcoded_dpm.fastq.gz`: read 1 of fully-barcoded templates with a DPM tag, corresponding to genomic DNA
    - `<sample>_R1.part_<###>.barcoded_short.fastq.gz`: read 1 of templates where one or more expected tags could not be identified
    - `<sample>_R1.part_<###>.barcoded_other.fastq.gz`: read 1 of templates where one or more tags were identified in the wrong position in the barcode
  - `logs/`
  - `qc/`
  - `splitbams/`
  - `split_fastq/`: unprocessed reads split into chunks for parallel processing
  - `trimmed/`: adapter- and tag-trimmed reads
    - `<sample>_R<1|2>.part_<###>_val_<1|2>.fq.gz`: adapter-trimmed reads
    - `<sample>_R1.part_<###>.barcoded_bpm.RDtrim.fastq.gz`: read 1 of fully-barcoded templates with a BPM tag (corresponding to antibody oligos) where the BPM tag has been trimmed
    - `<sample>_R1.part_<###>.barcoded_dpm.RDtrim.fastq.gz`: read 1 of fully-barcoded templates with a DPM tag (corresponding to genomic DNA) where the DPM tag has been trimmed
  - `barcode_identification_efficiency.txt`
  - `pipeline_counts.txt`

For reproducibility, we recommend keeping the pipeline, input, and output directories together. In other words, the complete directory should look like this GitHub repository with an extra output subdirectory (by default, `workup/`) created upon running this pipeline.
However, the pipeline directory can also be kept separate and used repeatedly on different datasets.
- The working directory should stay with the input directory, so that the `.snakemake` folder containing the Snakemake pipeline metadata (which keeps track of which steps of the pipeline have completed) is paired with the configuration files.
  - Assuming that the above directory structure is followed, most of the paths in `config.yaml` can remain relative paths to configuration files and asset files in the input directory. The only paths in `config.yaml` that need to be modified are `scripts_dir` and `output_dir`.
  - Modify the `--snakefile <path to Snakefile>` argument in `run_pipeline.sh` to point to the Snakefile in the pipeline directory.
  - Run `run_pipeline.sh` from the input directory.
- To reuse the `chipdip` conda environment, create a discoverable `chipdip` conda environment (e.g., `conda env create -f envs/chipdip.yaml`) and set the `use_existing_conda_env` option in `config.yaml` to `true`.
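Concretely, a separated layout might be used as follows (a sketch; both paths are hypothetical):

```bash
cd ~/experiments/chipdip_run1                # input (and working) directory with config.yaml, samples.json, etc.
~/software/chipdip-pipeline/run_pipeline.sh  # pipeline directory kept elsewhere; --snakefile updated accordingly
```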
- `config.yaml`: Pipeline configuration - YAML file containing the processing settings and paths of required input files. As noted above, paths are specified relative to the working directory.
  - Required? Yes. Must be provided in the working directory, or specified via `--configfile <path_to_config.yaml>` when invoking Snakemake.
  - Required keys
    - `scripts_dir`: path to scripts folder in the pipeline directory
    - `samples`: path to `samples.json` file
    - `barcode_config`: path to barcode config file (e.g., `config.txt`)
    - `bowtie2_index`: path to Bowtie 2 genome index
    - `cutadapt_dpm`: path to DPM sequences
    - `cutadapt_oligos`: path to Antibody ID sequences
    - `bead_umi_length`: integer length of bead oligo UMIs
  - Optional keys: If these keys are omitted from `config.yaml` or set to `null`, then they will take on the default values indicated. For keys whose default values are `null`, setting them to `null` will produce behaviors as described below. A minimal example configuration is sketched after this section.
    - `output_dir` (default = `"workup"`): path to create the output directory within which all intermediate and output files are placed.
    - `temp_dir` (default = `"/central/scratch"`): path to a temporary directory, such as used by the `-T` option of GNU sort
    - `barcode_format` (default = `null`): path to barcode format file (e.g., `format.txt`). If `null`, no barcode validation is performed.
    - `conda_env` (default = `"envs/chipdip.yaml"`): either a path to a conda environment YAML file ("*.yml" or "*.yaml") or the name of an existing conda environment. If a path to a conda environment YAML file is given, Snakemake will create a new conda environment within the `.snakemake` folder of the working directory.
    - `mask` (default = `null`): path to BED file of genomic regions to ignore, such as ENCODE blacklist regions; reads mapping to these regions are discarded. If `null`, no masking is performed.
    - `path_chrom_map` (default = `null`): path to chromosome name map file. If `null`, chromosome renaming and filtering are skipped, and the final BAM and/or bigWig files will use all chromosome names as-is from the Bowtie 2 index.
    - `num_chunks` (default = `2`): integer giving the number of chunks to split FASTQ files from each sample into for parallel processing
    - `generate_splitbams` (default = `false`): boolean value indicating whether to generate separate BAM files for each antibody target
    - `min_oligos` (default = `2`): integer giving the minimum count of deduplicated antibody oligo reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `proportion` and `max_size` criteria
    - `proportion` (default = `0.8`): float giving the minimum proportion of deduplicated antibody oligo reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `min_oligos` and `max_size` criteria
    - `max_size` (default = `10000`): integer giving the maximum count of deduplicated genomic DNA reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `min_oligos` and `proportion` criteria
    - `merge_samples` (default = `false`): boolean indicating whether to merge cluster files and target-specific BAM and bigWig files across samples
    - `binsize` (default = `false`): integer specifying the bigWig bin size; set to `false` to skip bigWig generation. Only relevant if `generate_splitbams` is `true`.
    - `bigwig_normalization` (default = `"None"`): normalization strategy for calculating coverage from reads; passed to the `--normalizeUsing` argument of the `bamCoverage` command from the deepTools suite. As of version 3.5.2, deepTools `bamCoverage` currently supports `RPKM`, `CPM`, `BPM`, `RPGC`, or `None`. Only relevant if bigWig generation is requested (i.e., `generate_splitbams` is `true` and `binsize` is not `false`).
    - `effective_genome_size` (default = `null`): integer specifying the effective genome size (see the deepTools documentation for a definition). If `null`, the effective genome size is computed as the number of unmasked positions in the Bowtie 2 index, after selecting for chromosomes specified in the chromosome name map file and excluding regions specified by the mask file. Only relevant if bigWig generation is requested using normalization strategy `RPGC` (i.e., `generate_splitbams` is `true`, `binsize` is not `false`, and `bigwig_normalization` is `RPGC`).
    - `email` (default = `null`): email address to send error notifications to if errors are encountered during the pipeline. If `null`, no emails are sent.
  - Additional notes
    - `null` values can be specified explicitly (e.g., `barcode_format: null`) or implicitly (e.g., `barcode_format:`).
    - For keys `barcode_format`, `mask`, `path_chrom_map`, and `email`, an empty string `""` is treated identically to a value of `null`.
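To illustrate, a minimal configuration covering the required keys plus a few common optional keys might look like the following (a sketch written as a shell heredoc; the file name and all values shown are illustrative, not prescriptive):

```bash
cat > my_config.yaml << 'EOF'
scripts_dir: "../chipdip-pipeline/scripts"  # scripts/ in the pipeline directory
samples: "samples.json"
barcode_config: "config.txt"
bowtie2_index: "assets/index_mm10/mm10"
cutadapt_dpm: "assets/dpm96.fasta"
cutadapt_oligos: "assets/bpm.fasta"
bead_umi_length: 8       # illustrative; use the UMI length from your bead oligo design
output_dir: "workup"
generate_splitbams: true
binsize: false           # set to an integer bin size to also generate bigWigs
EOF
```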
- `samples.json`: Samples file - JSON file with the paths of FASTQ files (read 1, read 2) to process.
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `samples`
  - This can be prepared using `fastq2json.py --fastq_dir <path_to_directory_of_FASTQs>` or manually formatted as follows:

    ```json
    {
      "sample1": {
        "R1": ["<path_to_data>/sample1_R1.fastq.gz"],
        "R2": ["<path_to_data>/sample1_R2.fastq.gz"]
      },
      "sample2": {
        "R1": ["<path_to_data>/sample2_R1.fastq.gz"],
        "R2": ["<path_to_data>/sample2_R2.fastq.gz"]
      },
      ...
    }
    ```

  - Data assumptions:
    - FASTQ files are gzip-compressed.
    - Read names do not contain two consecutive colons (`::`). This is required because the pipeline adds `::` to the end of read names before adding barcode information; the string `::` is used as a delimiter in the pipeline to separate the original read name from the identified barcode.
  - The pipeline (in particular, the script `scripts/bash/split_fastq.sh`) currently only supports one read 1 (R1) and one read 2 (R2) FASTQ file per sample.
    - If there are multiple FASTQ files per read orientation per sample (for example, if the same sample was sequenced multiple times, or it was split across multiple lanes during sequencing), the FASTQ files will first need to be concatenated together, and the paths to the concatenated FASTQ files should be supplied in the JSON file (see the example after this list).
  - Each sample is processed independently, generating independent cluster and BAM files. Statistics used for quality assessment (barcode identification efficiency, cluster statistics, MultiQC report, cluster size distributions, splitbam statistics) are computed independently for each sample but reported together in aggregate files to enable quick quality comparison across samples.
  - The provided sample read files under the `data/` folder were simulated via a Google Colab notebook. The genomic DNA reads correspond to ChIP-seq peaks on chromosome 19 (mm10) for transcription factors MYC (simulated as corresponding to Antibody ID `BEAD_AB1-A1`) and TCF12 (simulated as corresponding to Antibody ID `BEAD_AB2-A2`).
  - Sample names (the keys of the samples JSON file) cannot contain any periods (`.`). This is enforced to simplify wildcard pattern matching in the Snakefile and to simplify implementation of `scripts/python/threshold_tag_and_split.py:label_bam_file()`.
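For instance, because gzip-compressed files can be concatenated directly, combining per-lane FASTQ files might look like this (a sketch with hypothetical file names):

```bash
cat sample1_L001_R1.fastq.gz sample1_L002_R1.fastq.gz > sample1_R1.fastq.gz
cat sample1_L001_R2.fastq.gz sample1_L002_R2.fastq.gz > sample1_R2.fastq.gz
```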
- `assets/bpm.fasta`: FASTA file containing the sequences of Antibody IDs
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `cutadapt_oligos`
  - Used by: `cutadapt` (Snakefile `rule cutadapt_oligo`)
  - Each sequence should be preceded by `^` to anchor the sequence during cutadapt trimming (see Snakefile `rule cutadapt_oligo`).
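For reference, the expected format looks like the following sketch, using the two Antibody ID sequences from the annotated barcode config example below (header naming in the actual `assets/bpm.fasta` may differ):

```
>BEAD_AB1-A1
^GGAACAGTT
>BEAD_AB2-A2
^CGCCGAATT
```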
- `assets/dpm96.fasta`: FASTA file containing the sequences of DPM tags
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `cutadapt_dpm`
  - Used by: `cutadapt` (Snakefile `rule cutadapt_dpm`)
  - Each of these sequences is 10 nt long, consisting of a unique 9 nt DPM_Bottom sequence as originally designed for SPRITE (technically, only the first 8 nt are unique, and the 9th nucleotide is always a `T`), plus a `T` that is complementary to the 3' `A` added to a genomic DNA sequence via dA-tailing.
- `config.txt`: Barcode config file - text file containing the sequences of split-pool tags and the expected split-pool barcode structure.
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `barcode_config`
  - Used by: `scripts/java/BarcodeIdentification_v1.2.0.jar` (Snakefile `rule barcode_id`), `scripts/python/fastq_to_bam.py` (Snakefile `rule fastq_to_bam`), and `scripts/python/barcode_identification_efficiency.py` (Snakefile `rule barcode_identification_efficiency`).
    - This file is also parsed in the Snakefile itself to determine the length of the barcode (i.e., the number of rounds of barcoding) and, if `generate_splitbams` is set to `true` in `config.yaml`, the set of antibody targets for which to generate individual demultiplexed BAM files (and bigWig files too, if requested).
  - Format: SPRITE configuration file (see our SPRITE GitHub Wiki or Nature Protocols paper for details).
    - Blank lines and lines starting with `#` are ignored.
    - An example barcoding configuration file is annotated below:

      ```
      # Barcoding layout for read 1 and read 2
      # - Y represents a terminal tag
      # - ODD, EVEN, and DPM indicate their respective tags
      # - SPACER accounts for the 7-nt sticky ends that allow ligation between tags
      READ1 = DPM
      READ2 = Y|SPACER|ODD|SPACER|EVEN|SPACER|ODD|SPACER|EVEN|SPACER|ODD

      # DPM tag sequences formatted as tab-delimited lines
      # 1. Tag category: DPM
      # 2. Tag name: must contain "DPM", such as "DPM<xxx>"; must NOT contain "BEAD"
      #    - Can only contain alphanumeric characters, underscores, and hyphens,
      #      i.e., must match the regular expression "[a-zA-Z0-9_\-]+"
      # 3. Tag sequence (see assets/dpm96.fasta)
      # 4. Tag error tolerance: acceptable Hamming distance between
      #    expected tag sequence (column 3) and tag sequence in the read
      DPM	DPMBot6_1-A1	TGGGTGTT	0
      DPM	DPMBot6_2-A2	TGACATGT	0
      ...

      # Antibody ID sequences formatted as tab-delimited lines
      # - Identical format as for DPM tag sequences, except that Tag name (column 2)
      #   must start with "BEAD_".
      # - Tag sequences must match assets/bpm.fasta
      DPM	BEAD_AB1-A1	GGAACAGTT	0
      DPM	BEAD_AB2-A2	CGCCGAATT	0
      ...

      # Split-pool tag sequences: same 4-column tab-delimited format as the
      # DPM and Antibody ID sections above, except that
      # Tag category (column 1) is now ODD, EVEN, or Y.
      # Tag name must NOT contain "BEAD" or "DPM".
      EVEN	EvenBot_1-A1	ATACTGCGGCTGACG	2
      EVEN	EvenBot_2-A2	GTAGGTTCTGGAATC	2
      ...
      ODD	OddBot_1-A1	TTCGTGGAATCTAGC	2
      ODD	OddBot_2-A2	CCTGTGCGTTAGAGT	2
      ...
      Y	NYStgBot_1-A1	TATTATGGT	0
      Y	NYStgBot_2-A2	TAGCTACCTT	0
      ...
      ```

  - Notes regarding the entries in `example_config.txt`
    - Names: Each name ends with `#-Well` (for example, `4-A4`), where the `#` gives the row-major index of the tag in a 96-well plate, and `Well` denotes the corresponding row and column.
    - Because only the first 2 antibody IDs are included in the example dataset, the other antibody ID rows are commented out. This prevents generation of empty (0 byte) placeholder files for the other 94 antibody IDs.
    - Sequences
      - The design of DPM tags allows for 9 bp of unique sequence, but only 8 bp are used in the published SPRITE tag set (in bottom tags, the 9th bp is currently a constant `T`). `example_config.txt` therefore only includes the unique 8 bp sequences.
      - The design of EVEN and ODD tags allows for 17 bp of unique sequence, but only 16 bp are used in the published SPRITE tag set (in bottom tags, the 17th bp is currently a constant `T`). `example_config.txt` further trims the 1st unique bp from the bottom tag, leaving only the middle 15 bp unique bottom sequence.
      - The design of Y (terminal) tags allows for 9-12 bp of unique sequence.
- `format.txt`: Barcode format file - tab-delimited text file indicating which split-pool barcode tags are valid in which round of split-pool barcoding (i.e., at which positions in the barcoding string).
  - Required? No, but highly recommended.
  - `config.yaml` key to specify the path to this file: `barcode_format`
  - Used by: `scripts/python/split_bpm_dpm.py` (Snakefile `rule split_bpm_dpm`)
  - Column 1 indicates the zero-indexed position of the barcode string where a tag can be found.
    - Terminal barcode tags (Y) are position `0`; the second-to-last round of barcoding tags are position `1`; etc. A value of `-1` in the position column indicates that the barcode tag was not used in the experiment.
  - Column 2 indicates the name of the tag. This must be the same as the name of the tag in the barcode config file. If the same tag is used in multiple barcoding rounds, then it should appear multiple times in column 2, with different values in column 1 indicating which rounds it is used in.
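A hypothetical excerpt (tab-delimited), using tag names from `example_config.txt` and assuming each tag is valid only in the indicated round:

```
0	NYStgBot_1-A1
0	NYStgBot_2-A2
1	OddBot_1-A1
2	EvenBot_1-A1
```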
- `chrom_map.txt`: Chromosome names map - tab-delimited text file specifying which chromosomes from the Bowtie 2 index to keep and how to rename them (if at all).
  - Required? No, but necessary if using a blacklist mask that uses different chromosome names than the Bowtie 2 index.
  - `config.yaml` key to specify the path to this file: `path_chrom_map`
  - Used by: `scripts/python/rename_and_filter_chr.py` (Snakefile `rule rename_and_filter_chr`, `rule merge_mask`, and `rule effective_genome_size`)
  - Column 1 specifies chromosomes (following the naming convention used in the index) to keep.
    - The order of chromosomes provided here is maintained in the SAM/BAM file header, and consequently specifies the coordinate sorting order at the reference sequence level.
  - Column 2 specifies new chromosome names for the corresponding chromosomes in column 1.
  - The provided `chrom-map.txt` in this repository contains examples for retaining only canonical human or mouse chromosomes (i.e., excluding alternate loci, unlocalized sequences, and unplaced sequences) and renaming them to UCSC chromosome names (i.e., `chr1`, `chr2`, ..., `chrX`, `chrY`, `chrM`) as needed. The header of the provided file also includes more detailed documentation about the specific format requirements, such as allowed characters.
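For example, a map converting Ensembl chromosome names to UCSC names could contain lines like these (a hypothetical excerpt; see the header of the provided `chrom-map.txt` for the authoritative format requirements):

```
1	chr1
2	chr2
X	chrX
Y	chrY
MT	chrM
```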
- `assets/blacklist_hg38.bed`, `assets/blacklist_mm10.bed`: blacklisted genomic regions for ChIP-seq data
  - Required? No, but highly recommended.
  - `config.yaml` key to specify the path to this file: `mask`
  - Used by: Snakefile `rule merge_mask`, whose output is used by `rule repeat_mask` and `rule effective_genome_size`
  - For human genome release hg38, we use ENCFF356LFX from ENCODE. For mouse genome release mm10, we use mm10-blacklist.v2.bed.gz. These BED files use UCSC chromosome names (e.g., `chr1`, `chr2`, ...). The pipeline performs chromosome name remapping (if specified) before this step.
    - Reference paper: Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9(1):9354. doi:10.1038/s41598-019-45839-z
    - Example code used to download them into the `assets/` directory:

      ```bash
      wget -O - https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz | zcat | sort -V -k1,3 > "assets/blacklist_hg38.bed"
      wget -O - https://github.com/Boyle-Lab/Blacklist/raw/master/lists/mm10-blacklist.v2.bed.gz | zcat | sort -V -k1,3 > "assets/blacklist_mm10.bed"
      ```
- `assets/index_mm10/*.bt2`, `assets/index_hg38/*.bt2`: Bowtie 2 genome index
  - Required? Yes.
  - `config.yaml` key to specify the path to the index: `bowtie2_index`
  - Used by: Snakefile `rule bowtie2_align` and `rule effective_genome_size`
  - If you do not have an existing Bowtie 2 index, you can download pre-built indices from the Bowtie 2 developers:

    ```bash
    # for human primary assembly hg38
    mkdir -p assets/index_hg38
    wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
    unzip -j -d assets/index_hg38 GRCh38_noalt_as.zip \*.bt2

    # for mouse primary assembly mm10
    mkdir -p assets/index_mm10
    wget https://genome-idx.s3.amazonaws.com/bt/mm10.zip
    unzip -j -d assets/index_mm10 mm10.zip \*.bt2
    ```

    This will create a set of files under `assets/index_hg38` or `assets/index_mm10`. If we want to use the `mm10` genome assembly, for example, the code above will populate `assets/index_mm10` with the following files: `mm10.1.bt2`, `mm10.2.bt2`, `mm10.3.bt2`, `mm10.4.bt2`, `mm10.rev.1.bt2`, `mm10.rev.2.bt2`. The path prefix to this index (as accepted by the `bowtie2 -x <bt2-idx>` argument) is therefore `assets/index_mm10/mm10`, which is set in the configuration file, `config.yaml`.

    Note that the pre-built indices linked above use UCSC chromosome names (`chr1`, `chr2`, ..., `chrX`, `chrY`, `chrM`). If your alignment indices use different chromosome names (e.g., Ensembl chromosome names are `1`, `2`, ..., `X`, `Y`, `MT`), update `chrom-map.txt` such that chromosome names in BAM files are converted to UCSC chromosome names. You can check the names of the reference sequences used to build the index with the command `bowtie2-inspect -n <bt2-idx>`.
- `cluster.yaml`: Cluster config file
  - Required? Yes, if running on a compute cluster, such as a SLURM environment.
  - The path to this file is specified using the `--cluster-config` argument to the `snakemake` program, i.e., in the `run_pipeline.sh` script.
  - This file specifies the resources available to each rule. All variables/directives available to shell commands in the Snakefile are also available here (see Snakemake's documentation).
  - Note: for a SLURM system, the directories for the standard output and standard error files specified by the `output:` and `error:` directives must exist prior to execution of the corresponding rule/job (see issue #3); a sketch of pre-creating such a directory follows this list.
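For example, if `cluster.yaml` directs job logs to a `workup/logs/cluster/` directory (a hypothetical path), create it before launching the pipeline:

```bash
mkdir -p workup/logs/cluster
```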
These files are generated in the output directory, which is specified by the `output_dir` key in the pipeline configuration file.
- Barcode identification efficiency (`barcode_identification_efficiency.txt`): A statistical summary of how many tags were found per read and the proportion of reads with a matching tag at each position.
  - The first type of statistic describes how many tags were identified per read. For example, consider a dataset of 10000 adapter-trimmed reads with the expected tag structure as specified in the `example_config.txt` file: 1 DPM tag on read 1; 6 tags (Odd, Even, or Terminal) on read 2.
    - `170 (1.7%) reads found with 1 tag.`: For a small fraction of reads, only 1 tag was identified; this is to be expected, whether due to ligation errors, PCR artifacts, or sequencing errors. These reads are output to `fastqs/<sample_name>.part<###>.barcoded_short.fastq.gz` and are not used for analysis.
    - `7500 (75.0%) reads found with 7 tags.`: This is the expected result, where all 7 tags are identified in the majority of reads. In the `split_bpm_dpm` rule, these reads are split into `fastqs/<sample_name>.part<###>.barcoded_dpm.fastq.gz` or `fastqs/<sample_name>.part<###>.barcoded_bpm.fastq.gz`, depending on whether a read corresponds to genomic DNA or an antibody oligo.
  - The second type of statistic describes at which positions tags were identified in the reads.
    - `9800 (98.0%) reads found with tag in position 1 (read 1, DPM).`: As expected, a DPM-category tag is identified at the start of read 1.
    - `9700 (97.0%) reads found with tag in position 2 (read 2, Y).`: As expected, a terminal tag is identified at the start of read 2.
- Pipeline counts (`pipeline_counts.txt`): A tree-like summary of how many reads remained at each step of the pipeline, produced per aliquot and in aggregate. This can be used to quickly view the proportion of reads corresponding to antibody oligos versus genomic DNA; the proportion of properly barcoded reads; etc.
  - The 4 columns of numbers are as follows:
    1. The number of reads remaining after that step
    2. The proportion of reads remaining relative to the immediately previous step of the pipeline
    3. The proportion of reads remaining relative to the read type - antibody oligo (`bpm`) or genomic DNA (`dpm`)
    4. The proportion of reads remaining relative to the starting number of reads
  - A tabular version is saved to `qc/pipeline_counts.csv`.
- Effective genome size file (`effective_genome_size.txt`): Text file containing a single value: the computed effective genome size. Only generated if bigWig generation is requested with normalization strategy `RPGC`.
- Cluster file (`clusters/<sample>.clusters`): Tab-delimited file, where each line represents a single cluster.
  - The first column is the cluster barcode.
  - The remainder of the line is a list of reads. DNA reads are formatted as `DPM[strand]_chr:start-end`, and antibody oligo reads are formatted as `BPM[]_<AntibodyID>:<UMI>-0`. A hypothetical example line is shown after this list.
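A hypothetical cluster line under this format (tab-delimited; the barcode string, UMI, and coordinates are illustrative only):

```
sample1.NYStgBot_1-A1.OddBot_1-A1.EvenBot_1-A1	BPM[]_AB1-A1:1048576-0	DPM[+]_chr19:3078950-3079100	DPM[-]_chr19:3090200-3090350
```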
- Cluster statistics (`clusters/cluster_statistics.txt`): The number of clusters and antibody oligo (BPM) or genomic DNA (DPM) reads per library.
- Cluster size distribution (`clusters/[BPM,DPM]_cluster_distribution.pdf`): The proportion of clusters that belong to each size category.
- Cluster size read distribution (`clusters/[BPM,DPM]_read_distribution.pdf`): The proportion of reads that belong to clusters of each size category.
  - This can be more useful than the number of clusters, since relatively few large clusters can contain many sequencing reads (i.e., a large fraction of the library), while many small clusters will contain few sequencing reads (i.e., a much smaller fraction of the library).
- Maximum representation oligo ECDF (`clusters/Max_representation_ecdf.pdf`): A plot showing the distribution of the proportion of antibody oligo (BPM) reads in each cluster that belong to the maximum-represented Antibody ID in that cluster.
  - A successful experiment should have an ECDF close to a right angle. Deviations from this indicate that beads contain mixtures of Antibody IDs. Understanding the uniqueness of Antibody ID reads per cluster is important for choosing the thresholding parameter `proportion` for cluster assignment.
- Maximum representation oligo counts ECDF (`clusters/Max_representation_counts.pdf`): A plot showing the distribution of the number of antibody oligo (BPM) reads in each cluster that belong to the maximum-represented Antibody ID in that cluster.
  - If clusters are nearly unique in Antibody ID composition, then this plot is a surrogate for the BPM size distribution. Understanding the number of Antibody ID reads per cluster is important for choosing the thresholding parameter `min_oligos` for cluster assignment.
- BAM files for individual antibodies (`splitbams/*.bam`)
  - Thresholding criteria (`min_oligos`, `proportion`, `max_size`) for assigning individual clusters to individual antibodies are set in `config.yaml`.
  - The "none" BAM file (`<sample>.DNA.merged.labeled_none.bam`) contains DNA reads from clusters without antibody oligo reads.
  - The "ambiguous" BAM file (`<sample>.DNA.merged.labeled_ambiguous.bam`) contains DNA reads from clusters that failed the `proportion` thresholding criterion.
  - The "uncertain" BAM file (`<sample>.DNA.merged.labeled_uncertain.bam`) contains DNA reads from clusters that failed the `min_oligos` thresholding criterion.
  - The "filtered" BAM file (`<sample>.DNA.merged.labeled_filtered.bam`) contains DNA reads from clusters that failed the `max_size` thresholding criterion.
- Read count summary for individual antibodies (`splitbams/splitbam_statistics.txt`)
  - The number of reads contained within each individual BAM file assigned to individual antibodies.
- BigWig files for individual antibodies (`bigwigs/*.bw`)
  - BigWigs are generated using deepTools `bamCoverage` with the bin size set in `config.yaml`. Normalization is performed using the effective genome size (`--normalizeUsing RPGC`), which is calculated as the size of the chromosomes selected via `chrom_map.txt` minus the size of masked regions on those chromosomes.
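The underlying command is roughly equivalent to the following sketch (hypothetical file names; the effective genome size shown is illustrative, as the pipeline computes its own value as described above):

```bash
bamCoverage \
    --bam workup/splitbams/sample1.DNA.merged.labeled_AB1-A1.bam \
    --outFileName workup/bigwigs/sample1.AB1-A1.bw \
    --binSize 1000 \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2652783500  # illustrative value; computed by the pipeline in practice
```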
Calculator for uniqueness of bead barcodes: Google Colab notebook (also available as a GitHub Gist). This calculator can be used to calculate one of two values:
- Given the number of beads and the number of possible unique barcodes, calculate the proportion of beads expected to be uniquely barcoded.
- Given the number of beads and a desired proportion of beads to be uniquely barcoded, calculate the number of barcodes required.
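For intuition on the first calculation (a back-of-the-envelope approximation; the notebook's exact model may differ): with B beads and N equally likely barcode combinations, the expected proportion of uniquely barcoded beads is (1 − 1/N)^(B−1) ≈ e^(−B/N). For example, B = 10^6 beads and N = 10^8 possible barcodes give e^(−0.01) ≈ 99% of beads uniquely barcoded; conversely, achieving a desired proportion p requires roughly N ≈ B / ln(1/p) barcodes.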
Publications
- Perez AA, Goronzy IN, Blanco MR, Yeh BT, Guo JK, Lopes CS, Ettlin O, Burr A, Guttman M. ChIP-DIP maps binding of hundreds of proteins to DNA simultaneously and identifies diverse gene regulatory elements. Nat Genet. 2024 Dec;56(12):2827–2841. doi:10.1038/s41588-024-02000-5. PMID: 39587360.
- Perez AA, Goronzy IN, Blanco MR, Guo JK, Guttman M. ChIP-DIP: A multiplexed method for mapping hundreds of proteins to DNA uncovers diverse regulatory elements controlling gene expression. bioRxiv; 2023. doi:10.1101/2023.12.14.571730. PMID: 38187704; PMCID: PMC10769186.
Developers
- Isabel Goronzy (@igoronzy): Adapted this pipeline from the SPRITE and RNA-DNA SPRITE pipelines.
- Benjamin Yeh (@bentyeh): Streamlined and documented the pipeline. Added more extensive QC outputs. Created simulated data for testing the pipeline. Created the Jupyter notebook for in-depth cluster analysis.
Other contributors
- Andrew Perez (@HeyDrew64)
- Mario Blanco (@mrblanco)
- Mitchell Guttman (@mitchguttman)