[feat] Check that read pairs have the same name #27
Conversation
```diff
@@ -421,6 +421,20 @@ where
    let mut output_reads = vec![];
    let mut qual_counter = BaseQualCounter::new();
```
My only concern is that this may be expensive, since this will be done for every read.
I totally agree and would be hesitant to merge this as is without some benchmarking. A couple of alternative suggestions for @jiangweiyao and @omicsorama to weigh in on:
1. Would it be sufficient to check either just the first record, or the first and last record, in each RecordSet - either here or perhaps upon construction of the `PerFastqRecordSet` (where we already validate that the record sets are the same length)?
2. If (1) is not sufficient, we could also check every nth record here. E.g. if we checked every 50th or 100th record, that would catch the vast majority of issues with a lot less performance overhead.

More generally - what's the problem we are trying to solve? Catching when a user mixes up FASTQ files can be done by checking first/last etc. But if we're trying to catch broken upstream software that might occasionally emit reads in differing order, that would dictate checking every read on the way through.
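To make suggestion (2) concrete, here is a minimal std-only sketch of sampling every nth record. The constant `CHECK_EVERY` and the helpers `read_name`/`names_match` are hypothetical names for illustration; the real code would operate on the record types in src/lib/demux.rs.

```rust
// Hypothetical sampling interval - not from the PR.
const CHECK_EVERY: usize = 100;

/// The read name is the header up to the first space (same rule as the PR).
fn read_name(head: &[u8]) -> &[u8] {
    head.splitn(2, |c| *c == b' ').next().unwrap_or(b"")
}

/// Validate name agreement only on every CHECK_EVERY-th record set,
/// skipping the comparison entirely for the rest.
fn names_match(heads: &[&[u8]], index: usize) -> bool {
    if index % CHECK_EVERY != 0 {
        return true; // skip most records to limit overhead
    }
    let first = read_name(heads[0]);
    heads.iter().all(|h| read_name(h) == first)
}

fn main() {
    // Names agree once the post-space suffix is stripped.
    assert!(names_match(&[b"read1 1:N".as_slice(), b"read1 2:N".as_slice()], 0));
    // Index 7 is not a multiple of CHECK_EVERY, so the check is skipped.
    assert!(names_match(&[b"read1".as_slice(), b"read2".as_slice()], 7));
    // Index 0 is checked, so the mismatch is caught.
    assert!(!names_match(&[b"read1".as_slice(), b"read2".as_slice()], 0));
}
```

The trade-off is exactly the one discussed above: a mismatch caused by reordering is caught within CHECK_EVERY records, but a single out-of-place read between sample points can slip through.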
See #27, where I added checking just the first read in each file (this implicitly disallows empty input FASTQs).
Suggestion for a more performant (I think) approach.
But more importantly I think we need to understand the goal, and that can inform whether we need to check every read, or just enough to catch user error.
src/lib/demux.rs
Outdated
```rust
let first_read_name =
    zipped_reads[0].head().splitn(2, |c| *c == b' ').next().unwrap_or(b"");
for read in &zipped_reads {
    let cur_read_name = read.head().splitn(2, |c| *c == b' ').next().unwrap_or(b"");
    ensure!(
        first_read_name == cur_read_name,
        "Read names did not match: {:?} != {:?}",
        String::from_utf8_lossy(first_read_name),
        String::from_utf8_lossy(cur_read_name)
    );
}
```
I think you could do this more efficiently as follows:
```rust
let first_head = zipped_reads[0].head();
let end_index = first_head.find_byte(b' ').unwrap_or(first_head.len());
for read in zipped_reads.iter().dropping(1) {
    let head = read.head();
    let ok = head.len() >= end_index && first_head[0..end_index] == head[0..end_index];
    ensure!(ok, ...);
}
```
This calculates the region of the first header that is the read name, and then for subsequent reads checks that they have at least that many characters, and that they match. This avoids scanning all but the first read, and I think it does that more efficiently too. For bonus points you could also double-check that each read header stops after `end_index`, or that `head[end_index] == b' '`.
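Pulled out of the review snippet (which relies on `find_byte` and `dropping` from external crates such as bstr and itertools), the prefix-comparison idea, including the bonus check, can be sketched with the standard library alone. The function name `all_names_match` and the plain `&[&[u8]]` input are illustrative, not the PR's actual API:

```rust
/// Check that every header shares the first header's read name,
/// where the name is everything up to the first space.
fn all_names_match(heads: &[&[u8]]) -> bool {
    let first_head = heads[0];
    // End of the read name: index of the first space, or the whole
    // header length if there is no space.
    let end_index = first_head
        .iter()
        .position(|c| *c == b' ')
        .unwrap_or(first_head.len());
    heads.iter().skip(1).all(|head| {
        // Subsequent headers must be long enough, share the name prefix...
        head.len() >= end_index
            && first_head[..end_index] == head[..end_index]
            // ...and the name must actually end at end_index (the bonus
            // check: either the header stops there or a space follows).
            && (head.len() == end_index || head[end_index] == b' ')
    })
}

fn main() {
    assert!(all_names_match(&[b"r1 x".as_slice(), b"r1 y".as_slice()]));
    assert!(!all_names_match(&[b"r1 x".as_slice(), b"r2 x".as_slice()]));
    // Without the bonus check, "r10" would pass because it merely
    // shares the prefix "r1"; with it, the mismatch is caught.
    assert!(!all_names_match(&[b"r1".as_slice(), b"r10".as_slice()]));
}
```

Only the first header is ever scanned for a space; every other header costs one length check, one byte check, and one prefix comparison.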
For my future self:
```rust
let head = record.head();
let ok = head.len() == end_index || (head.len() > end_index && head[end_index] == b' ');
let ok = ok && first_head[0..end_index] == head[0..end_index];
```
Force-pushed from 41be6bb to 1c19421.
The read name check includes only the header line up to the first whitespace. The remainder of the header is ignored.
Force-pushed from 17fe73a to 8f3f7ff.
The read name check includes only the header line up to the first whitespace. The remainder of the header is ignored.
Fixes #26.