From 08f730670dbd9b1e38f879886745672d16fa2c44 Mon Sep 17 00:00:00 2001
From: Benjamin Yeh
Date: Fri, 7 Feb 2025 12:20:33 -0800
Subject: [PATCH] Fix bug where original read name could affect demultiplexing

Bug description
- The bug was unlikely to have affected real-world usage, but it did affect the output produced from the toy dataset included in the repository.
- During barcode identification, the pipeline appends identified tags to read names, so that read names acquire the following structure:
    @<original read name>::[tag1][tag2]...[tagN]
  Most subsequent scripts use the double-colon delimiter to separate the original read name from the tags. However, two scripts, `barcode_identification_efficiency.py` and `fastq_to_bam.py`, did not use the `::` delimiter and instead relied only on matching the bracketed tag structure `[tag]`. Consequently, if the read name itself had a bracketed structure (as do the reads in the toy dataset: e.g., `@[BEAD_AB1-A1][OddBot_5-A5][EvenBot_10-A10][OddBot_46-D10][EvenBot_45-D9][OddBot_67-F7][NYStgBot_83-G11]_CAATGATG`), then the "tags" in the original read name (rather than those identified during the pipeline barcode identification step) were used.
- Fix: The two scripts have been updated to identify tags only after the `::` delimiter in read names.

Other changes
- Improve documentation about pipeline assumptions.
- Add target wildcard constraints to Snakefile to prevent rule conflicts
  - i.e., to ensure that each desired output file can only be generated by 1 rule. This should also dramatically speed up DAG generation by Snakemake.
- Improve pipeline verification script
  - Enforce locale (export LC_ALL=C) to fix sorting order
  - Use natural chromosome, start position, end position sorting for the test BED files.
  - Update MD5 checksums accordingly

TODO
- Incorporate assumptions into validation rule
---
 README.md                                     |   5 +-
 Snakefile                                     |   4 +-
 .../barcode_identification_efficiency.py      |  10 +-
 scripts/python/fastq_to_bam.py                |   2 +-
 tests/assets/AB1-A1.bed                       | 218 +++++++++---------
 tests/assets/AB2-A2.bed                       | 203 ++++++++--------
 ...rify_merged_splitbams_from_example_data.sh |  19 +-
 7 files changed, 239 insertions(+), 222 deletions(-)

diff --git a/README.md b/README.md
index f8adf25..f0d0a09 100644
--- a/README.md
+++ b/README.md
@@ -263,11 +263,14 @@ However, the pipeline directory can also be kept separate and used repeatedly on
      }
      ```

+   - Data assumptions:
+     - FASTQ files are gzip-compressed.
+     - Read names do not contain two consecutive colons (`::`). This is required because the pipeline adds `::` to the end of read names before adding barcode information; the string `::` is used as a delimiter in the pipeline to separate the original read name from the identified barcode.
    - The pipeline (in particular, the script `scripts/bash/split_fastq.sh`) currently only supports one read 1 (R1) and one read 2 (R2) FASTQ file per sample.
      - If there are multiple FASTQ files per read orientation per sample (for example, if the same sample was sequenced multiple times, or it was split across multiple lanes during sequencing), the FASTQ files will first need to be concatenated together, and the paths to the concatenated FASTQ files should be supplied in the JSON file.
    - Each sample is processed independently, generating independent cluster and BAM files. Statistics used for quality assessment (barcode identification efficiency, cluster statistics, MultiQC report, cluster size distributions, splitbam statistics) are computed independently for each sample but reported together in aggregate files to enable quick quality comparison across samples.
    - The provided sample read files under the `data/` folder were simulated via a [Google Colab notebook](https://colab.research.google.com/drive/1CyjY0fJSiBl4vCz6FGFuT3IZEQR5XYlI). The genomic DNA reads correspond to ChIP-seq peaks on chromosome 19 (mm10) for transcription factors MYC (simulated as corresponding to Antibody ID `BEAD_AB1-A1`) and TCF12 (simulated as corresponding to Antibody ID `BEAD_AB2-A2`).
-   - Sample names (the keys of the samples JSON file) must be unique prior to any periods in the name, due to a current implementation quirk of `scripts/python/threshold_tag_and_split.py:label_bam_file()`
+   - Sample names (the keys of the samples JSON file) cannot contain any periods (`.`). This is enforced to simplify wildcard pattern matching in the Snakefile and to simplify implementation of `scripts/python/threshold_tag_and_split.py:label_bam_file()`.

 3. `assets/bpm.fasta`: FASTA file containing the sequences of Antibody IDs
    - Required? Yes.
diff --git a/Snakefile b/Snakefile
index 8525d90..d6a50bf 100755
--- a/Snakefile
+++ b/Snakefile
@@ -4,6 +4,7 @@ Aim: A Snakemake workflow to process CHIP-DIP data

 import json
 import os
+import re
 import sys
 import datetime
 import pandas as pd
@@ -470,7 +471,8 @@ onerror:
     shell('mail -s "an error occurred" ' + email + ' < {log}')

 wildcard_constraints:
-    sample = "[^\.]+"
+    sample = "[^\.]+",
+    target = "|".join([re.escape(x) for x in TARGETS])

 # remove all output, leaving just the following in the workup folder:
 # - bigwigs/
diff --git a/scripts/python/barcode_identification_efficiency.py b/scripts/python/barcode_identification_efficiency.py
index 08b7f93..567e179 100755
--- a/scripts/python/barcode_identification_efficiency.py
+++ b/scripts/python/barcode_identification_efficiency.py
@@ -82,14 +82,16 @@ def count_tags_in_fastq_file(self, fastqfile):
                 self._total += 1

     def count_tags_in_name(self, name):
-        tags = self._pattern.findall(name)
+        name_split = name.split('::', 1)
+        if len(name_split) == 1:
+            tags = self._pattern.findall(name)
+        else:
+            tags = self._pattern.findall(name_split[1])
         num_found = 0
-        pos = 0
-        for tag in tags:
+        for pos, tag in enumerate(tags):
             if tag != "NOT_FOUND":
                 num_found += 1
                 self._position_count[pos] += 1
-            pos += 1
         self._aggregate_count[num_found] += 1

     def print_to_stdout(self):
diff --git a/scripts/python/fastq_to_bam.py b/scripts/python/fastq_to_bam.py
index c701c4b..d64529d 100755
--- a/scripts/python/fastq_to_bam.py
+++ b/scripts/python/fastq_to_bam.py
@@ -69,7 +69,7 @@ def convert_reads(path_in, path_out, header, UMI_length):
             counter += 1
             if counter % 100000 == 0:
                 print(counter, file=sys.stderr)
-            match = PATTERN.search(qname)
+            match = PATTERN.search(qname.split('::')[1])
             target_name = list(match.groups())[0]
             aligned_segment = initialize_alignment(header, qname, target_name, seq, UMI_length)
             output_bam.write(aligned_segment)
diff --git a/tests/assets/AB1-A1.bed b/tests/assets/AB1-A1.bed
index ee5d04c..ecb4873 100644
--- a/tests/assets/AB1-A1.bed
+++ b/tests/assets/AB1-A1.bed
@@ -1,4 +1,113 @@
 chr12 51853672 51853736
+chr19 3321759 3321824
+chr19 3360168 3360231
+chr19 3389055 3389118
+chr19 3576206 3576270
+chr19 3719163 3719228
+chr19 3719163 3719228
+chr19 3724738 3724803
+chr19 4000592 4000657
+chr19 4019149 4019213
+chr19 4037717 4037781
+chr19 4125334 4125399
+chr19 4148106 4148170
+chr19 4163160 4163225
+chr19 4163160 4163225
+chr19 4169455 4169520
+chr19 4255777 4255841
+chr19 4294329 4294393
+chr19 4294329 4294393
+chr19 4573223 4573288
+chr19 4625575 4625640
+chr19 4625575 4625640
+chr19 4812506 4812571
+chr19 4812506 4812571
+chr19 4838943 4839008
+chr19 4927788 4927853
+chr19 4962259 4962324
+chr19 5366634 5366698
+chr19 5428048 5428113
+chr19 5512666 5512730
+chr19 5529393 5529457
+chr19 5529592 5529657
+chr19 5568445 5568510
+chr19 5601588 5601653
+chr19 5609920 5609985
+chr19 5688457 5688522
+chr19 5724407 5724472
+chr19 5731429 5731493
+chr19 5771586 5771650
+chr19 5786308 5786373
+chr19 5786308 5786373
+chr19 5793009
5793074 +chr19 5798627 5798692 +chr19 5799717 5799782 +chr19 5847924 5847989 +chr19 5850107 5850172 +chr19 5877794 5877858 +chr19 5883916 5883980 +chr19 5923789 5923854 +chr19 5923997 5924062 +chr19 5923997 5924062 +chr19 6018124 6018188 +chr19 6040209 6040273 +chr19 6040209 6040273 +chr19 6061161 6061226 +chr19 6061161 6061226 +chr19 6077066 6077131 +chr19 6099617 6099667 +chr19 6242218 6242280 +chr19 6276290 6276355 +chr19 6276290 6276355 +chr19 6331662 6331727 +chr19 6363241 6363305 +chr19 6364443 6364508 +chr19 6392955 6393020 +chr19 6400330 6400395 +chr19 6660174 6660239 +chr19 6679319 6679381 +chr19 6819016 6819081 +chr19 6840729 6840794 +chr19 6951093 6951156 +chr19 6974314 6974379 +chr19 7018411 7018474 +chr19 7019228 7019293 +chr19 7040930 7040994 +chr19 7056442 7056506 +chr19 7064564 7064629 +chr19 7064564 7064629 +chr19 7206344 7206409 +chr19 7206344 7206409 +chr19 7217382 7217447 +chr19 7264057 7264122 +chr19 7341920 7341985 +chr19 7483343 7483408 +chr19 7495167 7495232 +chr19 7516435 7516499 +chr19 8712956 8713021 +chr19 8736557 8736621 +chr19 8741194 8741258 +chr19 8758861 8758926 +chr19 8798356 8798421 +chr19 8798356 8798421 +chr19 8798356 8798421 +chr19 8818915 8818980 +chr19 8818915 8818980 +chr19 8886874 8886939 +chr19 8888003 8888067 +chr19 8888003 8888068 +chr19 8888550 8888612 +chr19 8928151 8928216 +chr19 8942636 8942701 +chr19 8944705 8944769 +chr19 9055894 9055959 +chr19 9055894 9055959 +chr19 9065915 9065979 +chr19 9911393 9911458 +chr19 9954017 9954082 +chr19 9980230 9980295 +chr19 9982121 9982185 +chr19 9983054 9983118 chr19 10017841 10017906 chr19 10059794 10059858 chr19 10104899 10104962 @@ -100,20 +209,13 @@ chr19 32978320 32978384 chr19 33008777 33008842 chr19 33008777 33008842 chr19 33101957 33102021 -chr19 3321759 3321824 -chr19 3360168 3360231 -chr19 3389055 3389118 chr19 34848153 34848218 chr19 34858812 34858877 chr19 34877804 34877867 chr19 34879228 34879293 chr19 34921494 34921559 chr19 35303193 35303258 -chr19 3576206 3576270 
chr19 36926467 36926532 -chr19 3719163 3719228 -chr19 3719163 3719228 -chr19 3724738 3724803 chr19 37252097 37252162 chr19 37375428 37375493 chr19 37376111 37376176 @@ -128,9 +230,6 @@ chr19 38054020 38054085 chr19 38054287 38054352 chr19 38054287 38054352 chr19 38819088 38819153 -chr19 4000592 4000657 -chr19 4019149 4019213 -chr19 4037717 4037781 chr19 40587524 40587587 chr19 40587524 40587587 chr19 40589284 40589348 @@ -138,14 +237,9 @@ chr19 40633128 40633191 chr19 40769417 40769481 chr19 40769417 40769481 chr19 40818627 40818692 -chr19 4125334 4125399 chr19 41258700 41258765 chr19 41263079 41263144 chr19 41310994 41311059 -chr19 4148106 4148170 -chr19 4163160 4163225 -chr19 4163160 4163225 -chr19 4169455 4169520 chr19 41808196 41808261 chr19 41831323 41831388 chr19 41831323 41831388 @@ -162,13 +256,12 @@ chr19 42129244 42129308 chr19 42166138 42166202 chr19 42170100 42170165 chr19 42176832 42176897 -chr19 4255777 4255841 chr19 42744227 42744292 -chr19 4294329 4294393 chr19 43523441 43523506 chr19 43523441 43523506 chr19 43530441 43530504 chr19 43627347 43627411 +chr19 43675794 43675859 chr19 43695653 43695718 chr19 43745647 43745712 chr19 43816052 43816117 @@ -193,7 +286,6 @@ chr19 45445269 45445334 chr19 45544807 45544870 chr19 45544807 45544870 chr19 45560313 45560375 -chr19 4573223 4573288 chr19 45783808 45783873 chr19 45998438 45998503 chr19 45998438 45998503 @@ -206,8 +298,6 @@ chr19 46131908 46131973 chr19 46135251 46135316 chr19 46137454 46137519 chr19 46227315 46227380 -chr19 4625575 4625640 -chr19 4625575 4625640 chr19 46303855 46303920 chr19 46315103 46315168 chr19 46344683 46344747 @@ -241,14 +331,11 @@ chr19 47508720 47508785 chr19 47532517 47532580 chr19 47854815 47854880 chr19 47854815 47854880 -chr19 4812506 4812571 -chr19 4812506 4812571 -chr19 4838943 4839008 -chr19 4927788 4927853 chr19 53032714 53032779 chr19 53033182 53033247 chr19 53033182 53033247 chr19 53094984 53095048 +chr19 53142314 53142379 chr19 53153141 53153205 chr19 53153370 
53153434 chr19 53185471 53185536 @@ -263,7 +350,6 @@ chr19 53404793 53404858 chr19 53464692 53464757 chr19 53519559 53519624 chr19 53528783 53528847 -chr19 5366634 5366698 chr19 53753016 53753081 chr19 53753239 53753304 chr19 53801888 53801953 @@ -284,10 +370,8 @@ chr19 53945370 53945435 chr19 53964960 53965024 chr19 54081560 54081625 chr19 54107382 54107447 -chr19 5428048 5428113 chr19 55084688 55084753 chr19 55085024 55085088 -chr19 5512666 5512730 chr19 55206709 55206774 chr19 55218429 55218494 chr19 55251051 55251116 @@ -295,15 +379,10 @@ chr19 55251359 55251424 chr19 55283989 55284053 chr19 55289035 55289100 chr19 55289035 55289100 -chr19 5529393 5529457 -chr19 5529592 5529657 chr19 55565648 55565713 -chr19 5568445 5568510 chr19 55859923 55859987 chr19 55867197 55867262 chr19 55926369 55926434 -chr19 5601588 5601653 -chr19 5609920 5609985 chr19 56438944 56439009 chr19 56438944 56439009 chr19 56462865 56462930 @@ -319,9 +398,6 @@ chr19 56800965 56801029 chr19 56801654 56801719 chr19 56820901 56820966 chr19 56821841 56821906 -chr19 5688457 5688522 -chr19 5724407 5724472 -chr19 5731429 5731493 chr19 57328882 57328947 chr19 57328882 57328947 chr19 57339307 57339372 @@ -331,39 +407,21 @@ chr19 57413287 57413350 chr19 57457115 57457180 chr19 57605987 57606052 chr19 57605987 57606052 -chr19 5771586 5771650 -chr19 5786308 5786373 -chr19 5786308 5786373 -chr19 5793009 5793074 -chr19 5798627 5798692 -chr19 5799717 5799782 -chr19 5847924 5847989 -chr19 5850107 5850172 -chr19 5877794 5877858 -chr19 5883916 5883980 chr19 58979835 58979900 -chr19 5923997 5924062 -chr19 5923997 5924062 chr19 59389388 59389453 chr19 59404070 59404135 chr19 59404070 59404135 chr19 59942753 59942818 chr19 60129770 60129835 chr19 60160900 60160964 -chr19 6018124 6018188 chr19 60192743 60192807 chr19 60226375 60226440 chr19 60226375 60226440 -chr19 6040209 6040273 -chr19 6040209 6040273 -chr19 6061161 6061226 -chr19 6061161 6061226 chr19 60665829 60665894 chr19 60757363 60757428 chr19 60757562 
60757626 chr19 60760076 60760140 chr19 60760076 60760140 -chr19 6077066 6077131 chr19 60780316 60780381 chr19 60787188 60787252 chr19 60792249 60792314 @@ -372,58 +430,4 @@ chr19 60818620 60818684 chr19 60861423 60861488 chr19 60861423 60861488 chr19 60939846 60939910 -chr19 6099617 6099667 chr19 61140854 61140919 -chr19 6242218 6242280 -chr19 6276290 6276355 -chr19 6331662 6331727 -chr19 6363241 6363305 -chr19 6364443 6364508 -chr19 6392955 6393020 -chr19 6400330 6400395 -chr19 6660174 6660239 -chr19 6679319 6679381 -chr19 6819016 6819081 -chr19 6840729 6840794 -chr19 6951093 6951156 -chr19 6974314 6974379 -chr19 7018411 7018474 -chr19 7019228 7019293 -chr19 7040930 7040994 -chr19 7056442 7056506 -chr19 7064564 7064629 -chr19 7064564 7064629 -chr19 7206344 7206409 -chr19 7206344 7206409 -chr19 7217382 7217447 -chr19 7264057 7264122 -chr19 7341920 7341985 -chr19 7483343 7483408 -chr19 7495167 7495232 -chr19 7516435 7516499 -chr19 8712956 8713021 -chr19 8736557 8736621 -chr19 8741194 8741258 -chr19 8758861 8758926 -chr19 8774337 8774402 -chr19 8798356 8798421 -chr19 8798356 8798421 -chr19 8798356 8798421 -chr19 8818915 8818980 -chr19 8818915 8818980 -chr19 8886874 8886939 -chr19 8888003 8888067 -chr19 8888003 8888068 -chr19 8888550 8888612 -chr19 8928151 8928216 -chr19 8929715 8929780 -chr19 8942636 8942701 -chr19 8944705 8944769 -chr19 9055894 9055959 -chr19 9055894 9055959 -chr19 9065915 9065979 -chr19 9911393 9911458 -chr19 9954017 9954082 -chr19 9980230 9980295 -chr19 9982121 9982185 -chr19 9983054 9983118 diff --git a/tests/assets/AB2-A2.bed b/tests/assets/AB2-A2.bed index 0a1f44c..25e5d79 100644 --- a/tests/assets/AB2-A2.bed +++ b/tests/assets/AB2-A2.bed @@ -1,3 +1,104 @@ +chr19 3324344 3324408 +chr19 3575580 3575644 +chr19 3590055 3590120 +chr19 3590055 3590120 +chr19 3610868 3610933 +chr19 3832902 3832967 +chr19 3851699 3851764 +chr19 3935070 3935134 +chr19 4044148 4044212 +chr19 4154959 4155024 +chr19 4191742 4191807 +chr19 4231511 4231576 +chr19 4231511 
4231576 +chr19 4257590 4257654 +chr19 4294403 4294467 +chr19 4294403 4294468 +chr19 4315218 4315283 +chr19 4408768 4408830 +chr19 4510384 4510449 +chr19 4558632 4558697 +chr19 4558632 4558697 +chr19 4756482 4756547 +chr19 4928319 4928384 +chr19 5023791 5023856 +chr19 5269421 5269484 +chr19 5366590 5366655 +chr19 5367490 5367555 +chr19 5433599 5433664 +chr19 5488154 5488219 +chr19 5489114 5489179 +chr19 5489114 5489179 +chr19 5610103 5610167 +chr19 5637269 5637334 +chr19 5724537 5724602 +chr19 5804683 5804748 +chr19 5845370 5845435 +chr19 5846635 5846700 +chr19 5875054 5875118 +chr19 5877610 5877673 +chr19 5912604 5912669 +chr19 5912604 5912669 +chr19 5922122 5922187 +chr19 5922660 5922724 +chr19 5924639 5924704 +chr19 5988122 5988187 +chr19 6046929 6046994 +chr19 6061151 6061216 +chr19 6104554 6104619 +chr19 6104554 6104619 +chr19 6104554 6104619 +chr19 6117996 6118061 +chr19 6130433 6130497 +chr19 6235748 6235813 +chr19 6241737 6241802 +chr19 6285639 6285704 +chr19 6302230 6302295 +chr19 6334622 6334686 +chr19 6334622 6334686 +chr19 6401977 6402041 +chr19 6401977 6402041 +chr19 6448837 6448901 +chr19 6553290 6553355 +chr19 6577206 6577270 +chr19 6660334 6660399 +chr19 6660334 6660399 +chr19 6858072 6858137 +chr19 6869401 6869462 +chr19 6921152 6921217 +chr19 6980540 6980561 +chr19 6986396 6986461 +chr19 7019296 7019361 +chr19 7056596 7056659 +chr19 7064630 7064695 +chr19 7066595 7066660 +chr19 7066595 7066660 +chr19 7067351 7067416 +chr19 7118003 7118068 +chr19 7206119 7206184 +chr19 7295436 7295501 +chr19 7315611 7315674 +chr19 7342135 7342200 +chr19 7342135 7342200 +chr19 7494381 7494446 +chr19 7495627 7495692 +chr19 7495627 7495692 +chr19 7552048 7552113 +chr19 7637485 7637550 +chr19 8018595 8018660 +chr19 8572581 8572646 +chr19 8572581 8572646 +chr19 8605257 8605321 +chr19 8786068 8786133 +chr19 8880759 8880823 +chr19 8892811 8892875 +chr19 8897942 8898007 +chr19 8929715 8929780 +chr19 8967013 8967078 +chr19 8992921 8992986 +chr19 9041106 9041171 +chr19 
9062919 9062984 +chr19 9628605 9628670 chr19 10045739 10045804 chr19 10045739 10045804 chr19 10065741 10065806 @@ -120,7 +221,6 @@ chr19 33078662 33078727 chr19 33091387 33091452 chr19 33102031 33102096 chr19 33102031 33102096 -chr19 3324344 3324408 chr19 34270335 34270400 chr19 34526312 34526377 chr19 34562669 34562734 @@ -130,10 +230,6 @@ chr19 34905436 34905500 chr19 34905436 34905500 chr19 35149529 35149594 chr19 35320018 35320083 -chr19 3575580 3575644 -chr19 3590055 3590120 -chr19 3590055 3590120 -chr19 3610868 3610933 chr19 36409342 36409406 chr19 36409342 36409406 chr19 36451085 36451150 @@ -159,18 +255,14 @@ chr19 37910031 37910096 chr19 37910031 37910096 chr19 38054895 38054960 chr19 38224000 38224065 -chr19 3832902 3832967 chr19 38354335 38354400 chr19 38365270 38365335 chr19 38410404 38410469 -chr19 3851699 3851764 chr19 38819123 38819187 chr19 38819123 38819187 chr19 38862559 38862624 chr19 38862559 38862624 -chr19 3935070 3935134 chr19 40238220 40238285 -chr19 4044148 4044212 chr19 40512067 40512132 chr19 40512067 40512132 chr19 40542874 40542937 @@ -184,7 +276,6 @@ chr19 41310923 41310988 chr19 41310923 41310988 chr19 41310923 41310988 chr19 41495724 41495789 -chr19 4154959 4155024 chr19 41694646 41694711 chr19 41801858 41801923 chr19 41808134 41808199 @@ -196,7 +287,6 @@ chr19 41833537 41833602 chr19 41846446 41846511 chr19 41850371 41850436 chr19 41906436 41906501 -chr19 4191742 4191807 chr19 41980954 41981019 chr19 42079019 42079083 chr19 42128854 42128919 @@ -205,18 +295,12 @@ chr19 42137216 42137281 chr19 42170425 42170490 chr19 42170425 42170490 chr19 42230191 42230255 -chr19 4231511 4231576 -chr19 4231511 4231576 chr19 42575409 42575474 -chr19 4257590 4257654 chr19 42737832 42737897 chr19 42744152 42744217 chr19 42745174 42745239 chr19 42752409 42752473 -chr19 4294403 4294467 -chr19 4294403 4294468 chr19 43064175 43064240 -chr19 4315218 4315283 chr19 43479004 43479069 chr19 43494903 43494967 chr19 43524838 43524902 @@ -231,7 +315,6 @@ chr19 
43940176 43940241 chr19 43973765 43973828 chr19 43981990 43982055 chr19 44077721 44077785 -chr19 4408768 4408830 chr19 44107301 44107366 chr19 44144346 44144411 chr19 44144346 44144411 @@ -249,12 +332,9 @@ chr19 44396854 44396918 chr19 44555249 44555312 chr19 44562683 44562747 chr19 44930961 44931026 -chr19 4510384 4510449 chr19 45229280 45229345 chr19 45458862 45458926 chr19 45575621 45575685 -chr19 4558632 4558697 -chr19 4558632 4558697 chr19 45602729 45602794 chr19 45747550 45747615 chr19 45747550 45747615 @@ -281,6 +361,7 @@ chr19 46532244 46532309 chr19 46553700 46553765 chr19 46553700 46553765 chr19 46573266 46573331 +chr19 46573266 46573331 chr19 46599042 46599107 chr19 46609589 46609654 chr19 46669818 46669883 @@ -294,15 +375,11 @@ chr19 46905478 46905541 chr19 46911637 46911700 chr19 46911637 46911700 chr19 46961708 46961773 -chr19 4756482 4756547 chr19 47579274 47579339 chr19 47689833 47689896 chr19 47689833 47689896 chr19 47719112 47719176 chr19 47898540 47898604 -chr19 4928319 4928384 -chr19 5023791 5023856 -chr19 5269421 5269484 chr19 53119667 53119732 chr19 53187682 53187745 chr19 53192123 53192188 @@ -321,8 +398,6 @@ chr19 53530254 53530319 chr19 53533912 53533976 chr19 53533912 53533976 chr19 53657428 53657492 -chr19 5366590 5366655 -chr19 5367490 5367555 chr19 53789039 53789103 chr19 53800059 53800124 chr19 53802003 53802068 @@ -344,12 +419,8 @@ chr19 54081886 54081951 chr19 54083052 54083115 chr19 54083052 54083115 chr19 54212183 54212247 -chr19 5433599 5433664 chr19 54685245 54685310 chr19 54685245 54685310 -chr19 5488154 5488219 -chr19 5489114 5489179 -chr19 5489114 5489179 chr19 55099364 55099429 chr19 55248243 55248308 chr19 55248243 55248308 @@ -366,9 +437,7 @@ chr19 55861719 55861784 chr19 55861719 55861784 chr19 55938233 55938297 chr19 55939759 55939822 -chr19 5610103 5610167 chr19 56189275 56189339 -chr19 5637269 5637334 chr19 56378036 56378101 chr19 56378041 56378106 chr19 56393072 56393136 @@ -380,7 +449,6 @@ chr19 56588304 56588368 
chr19 56588304 56588368 chr19 56735784 56735849 chr19 56822063 56822128 -chr19 5724537 5724602 chr19 57339288 57339353 chr19 57341170 57341235 chr19 57358447 57358512 @@ -390,23 +458,13 @@ chr19 57477428 57477493 chr19 57538959 57539024 chr19 57541995 57542060 chr19 57610928 57610993 -chr19 5804683 5804748 -chr19 5845370 5845435 -chr19 5846635 5846700 -chr19 5875054 5875118 -chr19 5877610 5877673 chr19 58946012 58946077 -chr19 5912604 5912669 -chr19 5912604 5912669 -chr19 5922122 5922187 -chr19 5922660 5922724 -chr19 5924639 5924704 chr19 59248551 59248616 chr19 59248551 59248616 +chr19 59260615 59260680 chr19 59330887 59330951 chr19 59337934 59337999 chr19 59404264 59404328 -chr19 5988122 5988187 chr19 59899970 59900035 chr19 59905764 59905828 chr19 60134549 60134612 @@ -415,9 +473,7 @@ chr19 60149436 60149500 chr19 60160943 60161008 chr19 60192910 60192974 chr19 60302435 60302500 -chr19 6046929 6046994 chr19 60581067 60581132 -chr19 6061151 6061216 chr19 60743981 60744044 chr19 60872308 60872373 chr19 60887707 60887771 @@ -425,60 +481,7 @@ chr19 60889560 60889625 chr19 60994348 60994413 chr19 60994348 60994413 chr19 61031618 61031682 -chr19 6104554 6104619 -chr19 6104554 6104619 -chr19 6104554 6104619 chr19 61078712 61078775 chr19 61085308 61085370 chr19 61085308 61085370 chr19 61140681 61140745 -chr19 6117996 6118061 -chr19 6130433 6130497 -chr19 6235748 6235813 -chr19 6241737 6241802 -chr19 6285639 6285704 -chr19 6302230 6302295 -chr19 6334622 6334686 -chr19 6334622 6334686 -chr19 6401977 6402041 -chr19 6401977 6402041 -chr19 6448837 6448901 -chr19 6553290 6553355 -chr19 6577206 6577270 -chr19 6660334 6660399 -chr19 6660334 6660399 -chr19 6858072 6858137 -chr19 6921152 6921217 -chr19 6980540 6980561 -chr19 6986396 6986461 -chr19 7019296 7019361 -chr19 7056596 7056659 -chr19 7064630 7064695 -chr19 7066595 7066660 -chr19 7066595 7066660 -chr19 7067351 7067416 -chr19 7118003 7118068 -chr19 7206119 7206184 -chr19 7295436 7295501 -chr19 7315611 7315674 -chr19 
7342135 7342200
-chr19 7342135 7342200
-chr19 7494381 7494446
-chr19 7495627 7495692
-chr19 7495627 7495692
-chr19 7552048 7552113
-chr19 7637485 7637550
-chr19 8018595 8018660
-chr19 8572581 8572646
-chr19 8572581 8572646
-chr19 8605257 8605321
-chr19 8786068 8786133
-chr19 8880759 8880823
-chr19 8892811 8892875
-chr19 8897942 8898007
-chr19 8929715 8929780
-chr19 8967013 8967078
-chr19 8992921 8992986
-chr19 9041106 9041171
-chr19 9062919 9062984
-chr19 9628605 9628670
diff --git a/tests/verify_merged_splitbams_from_example_data.sh b/tests/verify_merged_splitbams_from_example_data.sh
index 972a12e..874656b 100755
--- a/tests/verify_merged_splitbams_from_example_data.sh
+++ b/tests/verify_merged_splitbams_from_example_data.sh
@@ -39,9 +39,9 @@ fi
 BED_REF_AB1="$DIR_TEST_ASSETS/AB1-A1.bed"
 BED_REF_AB2="$DIR_TEST_ASSETS/AB2-A2.bed"

-# hashes for merged splitbam output, converted to 3-column BED files, sorted lexicographically
-HASH_REF_AB1="ed3ec0eb6c1bdb954921dd5c34efc3a8"
-HASH_REF_AB2="d89dbda765c35bdde188dbdde1a1e161"
+# hashes for merged splitbam output, converted to 3-column BED files, sorted by position
+HASH_REF_AB1="ec857d153c89e12127f6d4b4438053c8"
+HASH_REF_AB2="7a7adffb428fda686e0269395347093c"

 # hash for expected cluster statistics file
 HASH_REF_CLUSTERS="b2efa9824814cca021e287ba36ebb20f"
@@ -52,24 +52,27 @@ hash_ab1=$(md5sum "$BED_REF_AB1" | cut -f 1 -d ' ')
 hash_ab2=$(md5sum "$BED_REF_AB2" | cut -f 1 -d ' ')
 [ "$hash_ab2" != "$HASH_REF_AB2" ] && echo "Corrupt reference BED file $BED_REF_AB2" && exit 1

+# set locale to C for consistent sorting
+export LC_ALL=C
+
 # generate BED files from pipeline merged splitbam output
 tmpbed1=$(mktemp ./bed1.XXXXX)
 tmpbed2=$(mktemp ./bed2.XXXXX)
 bedtools bamtobed -i "$DIR_OUTPUT"/workup/splitbams/AB1-A1.bam |
     cut -f 1,2,3 |
-    sort > "$tmpbed1"
+    sort -k1,1V -k2,2n -k3,3n > "$tmpbed1"
 bedtools bamtobed -i "$DIR_OUTPUT"/workup/splitbams/AB2-A2.bam |
     cut -f 1,2,3 |
-    sort > "$tmpbed2"
+    sort -k1,1V -k2,2n -k3,3n > "$tmpbed2"

-diff "$tmpbed1" "$BED_REF_AB1"
+diff "$tmpbed1" <(sort -k1,1V -k2,2n -k3,3n "$BED_REF_AB1")
 # delete BED files if they match reference
 [ "$?" = 0 ] && echo "AB1-A1 matches reference." && rm "$tmpbed1"

-diff "$tmpbed2" "$BED_REF_AB2"
+diff "$tmpbed2" <(sort -k1,1V -k2,2n -k3,3n "$BED_REF_AB2")
 # delete BED files if they match reference
 [ "$?" = 0 ] && echo "AB2-A2 matches reference." && rm "$tmpbed2"

 # validate cluster file
-hash_cluster=$(export LC_ALL=C; sort "$DIR_OUTPUT"/workup/clusters/cluster_statistics.txt | md5sum | cut -f 1 -d ' ')
+hash_cluster=$(sort "$DIR_OUTPUT"/workup/clusters/cluster_statistics.txt | md5sum | cut -f 1 -d ' ')
 [ "$hash_cluster" = "$HASH_REF_CLUSTERS" ] && echo "MD5 checksum of cluster_statistics.txt matches reference."
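
Illustration of the read-name bug described in the commit message: the sketch below contrasts matching bracketed tags anywhere in a read name with matching them only after the first `::`. It is a minimal, standalone example rather than the pipeline's code. `TAG_PATTERN`, `extract_tags`, and the shortened read name are hypothetical, and the regular expressions actually used by `barcode_identification_efficiency.py` and `fastq_to_bam.py` may differ.

```python
import re

# Hypothetical pattern (an assumption, not taken from the pipeline):
# one bracketed tag such as [BEAD_AB1-A1] or [NOT_FOUND].
TAG_PATTERN = re.compile(r"\[([^\]]+)\]")


def extract_tags(read_name):
    """Return the tags that follow the first '::' in a read name."""
    parts = read_name.split("::", 1)
    if len(parts) == 1:
        # No delimiter present: fall back to scanning the whole name,
        # mirroring the fallback in the patched count_tags_in_name().
        return TAG_PATTERN.findall(read_name)
    return TAG_PATTERN.findall(parts[1])


# A toy-dataset-style read name that already contains bracketed text,
# followed by tags appended after the '::' delimiter.
original = "@[BEAD_AB1-A1][OddBot_5-A5]_CAATGATG"
tagged = original + "::[BEAD_AB2-A2][NOT_FOUND]"

# Matching over the whole name also picks up brackets from the original name.
print(TAG_PATTERN.findall(tagged))
# Matching only after '::' returns just the appended tags.
print(extract_tags(tagged))
```

Splitting with `maxsplit=1` keeps everything after the first `::` together, which matches the README assumption that original read names never contain `::`.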