Fix bug where original read name could affect demultiplexing

Bug description - The bug was unlikely to have affected real-world usage, but it did affect the output produced from the toy dataset included in the repository - During barcode identification, the pipeline adds identified tags to read names such that the read names acquire the following structure: @<original read name>::[tag1][tag2]...[tagN] Most subsequent scripts use the double colon delimiter to separate the original read name from the tags. However, 2 scripts `barcode_identification_efficiency.py` and `fastq_to_bam.py` did not use the `::` delimiter and instead only relied on matching the bracketed tag name structure `[tag]`. Consequently, if the read name itself had a bracketed structure (as do the reads in the toy dataset: e.g., `@[BEAD_AB1-A1][OddBot_5-A5][EvenBot_10-A10][OddBot_46-D10][EvenBot_45-D9][OddBot_67-F7][NYStgBot_83-G11]_CAATGATG`), then the "tags" in the original read name (rather than those identified during the pipeline barcode identification step) were used. - Fix: The two scripts have been updated to identify tags only after the `::` delimiter in read names. Other changes - Improve documentation about pipeline assumptions. - Add target wildcard constraints to Snakefile to prevent rule conflicts - i.e., to ensure that each desired output file can only be generated by 1 rule. This should also dramatically speed up DAG generation by Snakemake. - Improve pipeline verification script - Enforce locale (export LC_ALL=C) to fix sorting order - Use natural chromosome, start position, end position sorting for the test BED files. - Update MD5 checksums accordingly TODO - Incorporate assumptions into validation rule
GuttmanLab · Feb 7, 2025 · 08f7306 · 08f7306
1 parent 2703961
commit 08f7306
Show file tree

Hide file tree

Showing 7 changed files with 239 additions and 222 deletions.
diff --git a/README.md b/README.md
@@ -263,11 +263,14 @@ However, the pipeline directory can also be kept separate and used repeatedly on
      }
      ```
 
+   - Data assumptions:
+     - FASTQ files are gzip-compressed.
+     - Read names do not contain two consecutive colons (`::`). This is required because the pipeline adds `::` to the end of read names before adding barcode information; the string `::` is used as a delimiter in the pipeline to separate the original read name from the identified barcode.
    - The pipeline (in particular, the script `scripts/bash/split_fastq.sh`) currently only supports one read 1 (R1) and one read 2 (R2) FASTQ file per sample.
      - If there are multiple FASTQ files per read orientation per sample (for example, if the same sample was sequenced multiple times, or it was split across multiple lanes during sequencing), the FASTQ files will first need to be concatenated together, and the paths to the concatenated FASTQ files should be supplied in the JSON file.
    - Each sample is processed independently, generating independent cluster and BAM files. Statistics used for quality assessment (barcode identification efficiency, cluster statistics, MultiQC report, cluster size distributions, splitbam statistics) are computed independently for each sample but reported together in aggregate files to enable quick quality comparison across samples.
    - The provided sample read files under the `data/` folder were simulated via a [Google Colab notebook](https://colab.research.google.com/drive/1CyjY0fJSiBl4vCz6FGFuT3IZEQR5XYlI). The genomic DNA reads correspond to ChIP-seq peaks on chromosome 19 (mm10) for transcription factors MYC (simulated as corresponding to Antibody ID `BEAD_AB1-A1`) and TCF12 (simulated as corresponding to Antibody ID `BEAD_AB2-A2`).
-   - Sample names (the keys of the samples JSON file) must be unique prior to any periods in the name, due to a current implementation quirk of `scripts/python/threshold_tag_and_split.py:label_bam_file()`
+   - Sample names (the keys of the samples JSON file) cannot contain any periods (`.`). This is enforced to simplify wildcard pattern matching in the Snakefile and to simplify implementation of `scripts/python/threshold_tag_and_split.py:label_bam_file()`.
 
 3. <a name="bpm-fasta">`assets/bpm.fasta`</a>: FASTA file containing the sequences of Antibody IDs
    - Required? Yes.

diff --git a/Snakefile b/Snakefile
@@ -4,6 +4,7 @@ Aim: A Snakemake workflow to process CHIP-DIP data
 
 import json
 import os
+import re
 import sys
 import datetime
 import pandas as pd
@@ -470,7 +471,8 @@ onerror:
         shell('mail -s "an error occurred" ' + email + ' < {log}')
 
 wildcard_constraints:
-    sample = "[^\.]+"
+    sample = "[^\.]+",
+    target = "|".join([re.escape(x) for x in TARGETS])
 
 # remove all output, leaving just the following in the workup folder:
 # - bigwigs/

diff --git a/scripts/python/barcode_identification_efficiency.py b/scripts/python/barcode_identification_efficiency.py
@@ -82,14 +82,16 @@ def count_tags_in_fastq_file(self, fastqfile):
                 self._total += 1
 
     def count_tags_in_name(self, name):
-        tags = self._pattern.findall(name)
+        name_split = name.split('::', 1)
+        if len(name_split) == 1:
+            tags = self._pattern.findall(name)
+        else:
+            tags = self._pattern.findall(name_split[1])
         num_found = 0
-        pos = 0
-        for tag in tags:
+        for pos, tag in enumerate(tags):
             if tag != "NOT_FOUND":
                 num_found += 1
                 self._position_count[pos] += 1
-            pos += 1
         self._aggregate_count[num_found] += 1
 
     def print_to_stdout(self):

diff --git a/scripts/python/fastq_to_bam.py b/scripts/python/fastq_to_bam.py
@@ -69,7 +69,7 @@ def convert_reads(path_in, path_out, header, UMI_length):
             counter += 1
             if counter % 100000 == 0:
                 print(counter, file=sys.stderr)
-            match = PATTERN.search(qname)
+            match = PATTERN.search(qname.split('::')[1])
             target_name = list(match.groups())[0]
             aligned_segment = initialize_alignment(header, qname, target_name, seq, UMI_length)
             output_bam.write(aligned_segment)