Skip to content

Commit

Permalink
Fix bug where original read name could affect demultiplexing
Browse files Browse the repository at this point in the history
Bug description
- The bug was unlikely to have affected real-world usage, but it did affect the output produced from the toy dataset included in the repository
- During barcode identification, the pipeline adds identified tags to read names such that the read names acquire the following structure:

  @<original read name>::[tag1][tag2]...[tagN]

  Most subsequent scripts use the double colon delimiter to separate the original read name from the tags. However, 2 scripts `barcode_identification_efficiency.py` and `fastq_to_bam.py` did not use the `::` delimiter and instead only relied on matching the bracketed tag name structure `[tag]`. Consequently, if the read name itself had a bracketed structure (as do the reads in the toy dataset: e.g., `@[BEAD_AB1-A1][OddBot_5-A5][EvenBot_10-A10][OddBot_46-D10][EvenBot_45-D9][OddBot_67-F7][NYStgBot_83-G11]_CAATGATG`), then the "tags" in the original read name (rather than those identified during the pipeline barcode identification step) were used.
- Fix: The two scripts have been updated to identify tags only after the `::` delimiter in read names.

Other changes
- Improve documentation about pipeline assumptions.
- Add target wildcard constraints to Snakefile to prevent rule conflicts - i.e., to ensure that each desired output file can only be generated by 1 rule. This should also dramatically speed up DAG generation by Snakemake.
- Improve pipeline verification script
  - Enforce locale (export LC_ALL=C) to fix sorting order
  - Use natural chromosome, start position, end position sorting for the test BED files.
  - Update MD5 checksums accordingly

TODO
- Incorporate assumptions into validation rule
  • Loading branch information
bentyeh committed Feb 7, 2025
1 parent 2703961 commit 08f7306
Show file tree
Hide file tree
Showing 7 changed files with 239 additions and 222 deletions.
5 changes: 4 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -263,11 +263,14 @@ However, the pipeline directory can also be kept separate and used repeatedly on
}
```
- Data assumptions:
- FASTQ files are gzip-compressed.
- Read names do not contain two consecutive colons (`::`). This is required because the pipeline adds `::` to the end of read names before adding barcode information; the string `::` is used as a delimiter in the pipeline to separate the original read name from the identified barcode.
- The pipeline (in particular, the script `scripts/bash/split_fastq.sh`) currently only supports one read 1 (R1) and one read 2 (R2) FASTQ file per sample.
- If there are multiple FASTQ files per read orientation per sample (for example, if the same sample was sequenced multiple times, or it was split across multiple lanes during sequencing), the FASTQ files will first need to be concatenated together, and the paths to the concatenated FASTQ files should be supplied in the JSON file.
- Each sample is processed independently, generating independent cluster and BAM files. Statistics used for quality assessment (barcode identification efficiency, cluster statistics, MultiQC report, cluster size distributions, splitbam statistics) are computed independently for each sample but reported together in aggregate files to enable quick quality comparison across samples.
- The provided sample read files under the `data/` folder were simulated via a [Google Colab notebook](https://colab.research.google.com/drive/1CyjY0fJSiBl4vCz6FGFuT3IZEQR5XYlI). The genomic DNA reads correspond to ChIP-seq peaks on chromosome 19 (mm10) for transcription factors MYC (simulated as corresponding to Antibody ID `BEAD_AB1-A1`) and TCF12 (simulated as corresponding to Antibody ID `BEAD_AB2-A2`).
- Sample names (the keys of the samples JSON file) must be unique prior to any periods in the name, due to a current implementation quirk of `scripts/python/threshold_tag_and_split.py:label_bam_file()`
- Sample names (the keys of the samples JSON file) cannot contain any periods (`.`). This is enforced to simplify wildcard pattern matching in the Snakefile and to simplify implementation of `scripts/python/threshold_tag_and_split.py:label_bam_file()`.
3. <a name="bpm-fasta">`assets/bpm.fasta`</a>: FASTA file containing the sequences of Antibody IDs
- Required? Yes.
Expand Down
4 changes: 3 additions & 1 deletion Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ Aim: A Snakemake workflow to process CHIP-DIP data

import json
import os
import re
import sys
import datetime
import pandas as pd
Expand Down Expand Up @@ -470,7 +471,8 @@ onerror:
shell('mail -s "an error occurred" ' + email + ' < {log}')

wildcard_constraints:
sample = "[^\.]+"
sample = "[^\.]+",
target = "|".join([re.escape(x) for x in TARGETS])

# remove all output, leaving just the following in the workup folder:
# - bigwigs/
Expand Down
10 changes: 6 additions & 4 deletions scripts/python/barcode_identification_efficiency.py
Original file line number Diff line number Diff line change
Expand Up @@ -82,14 +82,16 @@ def count_tags_in_fastq_file(self, fastqfile):
self._total += 1

def count_tags_in_name(self, name):
tags = self._pattern.findall(name)
name_split = name.split('::', 1)
if len(name_split) == 1:
tags = self._pattern.findall(name)
else:
tags = self._pattern.findall(name_split[1])
num_found = 0
pos = 0
for tag in tags:
for pos, tag in enumerate(tags):
if tag != "NOT_FOUND":
num_found += 1
self._position_count[pos] += 1
pos += 1
self._aggregate_count[num_found] += 1

def print_to_stdout(self):
Expand Down
2 changes: 1 addition & 1 deletion scripts/python/fastq_to_bam.py
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ def convert_reads(path_in, path_out, header, UMI_length):
counter += 1
if counter % 100000 == 0:
print(counter, file=sys.stderr)
match = PATTERN.search(qname)
match = PATTERN.search(qname.split('::')[1])
target_name = list(match.groups())[0]
aligned_segment = initialize_alignment(header, qname, target_name, seq, UMI_length)
output_bam.write(aligned_segment)
Expand Down
Loading

0 comments on commit 08f7306

Please sign in to comment.