[feat] Check that read pairs have the same name #27
Conversation
```diff
@@ -421,6 +421,20 @@ where
    let mut output_reads = vec![];
    let mut qual_counter = BaseQualCounter::new();
```
My only concern is that this may be expensive, since this will be done for every read.
I totally agree and would be hesitant to merge this as is without some benchmarking. A couple of alternative suggestions for @jiangweiyao and @omicsorama to weigh in on:
1. Would it be sufficient to check either just the first record, or the first and last record, in each RecordSet - either here or perhaps upon construction of the `PerFastqRecordSet` (where we already validate that the record sets are the same length)?
2. If (1) is not sufficient, we could also check every nth record here. E.g. if we checked every 50th or 100th record, that would catch the vast majority of issues with a lot less performance overhead.

More generally - what's the problem we are trying to solve? Catching when a user mixes up FASTQ files can be done by checking first/last etc. But if we're trying to catch broken upstream software that might occasionally emit reads in differing order, that would dictate checking every read on the way through.
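To make suggestion (2) concrete, here is a minimal std-only sketch of sampling every nth record. The constant `CHECK_EVERY` and the helpers `read_name`/`names_match` are hypothetical names for illustration; the real code would operate on the record types in src/lib/demux.rs.

```rust
// Hypothetical sampling interval - not from the PR.
const CHECK_EVERY: usize = 100;

/// The read name is the header up to the first space (same rule as the PR).
fn read_name(head: &[u8]) -> &[u8] {
    head.splitn(2, |c| *c == b' ').next().unwrap_or(b"")
}

/// Validate name agreement only on every CHECK_EVERY-th record set,
/// skipping the comparison entirely for the rest.
fn names_match(heads: &[&[u8]], index: usize) -> bool {
    if index % CHECK_EVERY != 0 {
        return true; // skip most records to limit overhead
    }
    let first = read_name(heads[0]);
    heads.iter().all(|h| read_name(h) == first)
}

fn main() {
    // Names agree once the post-space suffix is stripped.
    assert!(names_match(&[b"read1 1:N".as_slice(), b"read1 2:N".as_slice()], 0));
    // Index 7 is not a multiple of CHECK_EVERY, so the check is skipped.
    assert!(names_match(&[b"read1".as_slice(), b"read2".as_slice()], 7));
    // Index 0 is checked, so the mismatch is caught.
    assert!(!names_match(&[b"read1".as_slice(), b"read2".as_slice()], 0));
}
```

The trade-off is exactly the one discussed above: a mismatch caused by reordering is caught within CHECK_EVERY records, but a single out-of-place read between sample points can slip through.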
See #27, where I added checking just the first read in each file (this implicitly disallows empty input FASTQs).
Suggestion for a more performant (I think) approach.
But more importantly I think we need to understand the goal, and that can inform whether we need to check every read, or just enough to catch user error.
src/lib/demux.rs
Outdated
```rust
let first_read_name =
    zipped_reads[0].head().splitn(2, |c| *c == b' ').next().unwrap_or(b"");
for read in &zipped_reads {
    let cur_read_name = read.head().splitn(2, |c| *c == b' ').next().unwrap_or(b"");
    ensure!(
        first_read_name == cur_read_name,
        "Read names did not match: {:?} != {:?}",
        String::from_utf8_lossy(first_read_name),
        String::from_utf8_lossy(cur_read_name)
    );
}
```
I think you could do this more efficiently as follows:
```rust
let first_head = zipped_reads[0].head();
let end_index = first_head.find_byte(b' ').unwrap_or(first_head.len());
for read in zipped_reads.iter().dropping(1) {
    let head = read.head();
    let ok = head.len() >= end_index && first_head[0..end_index] == head[0..end_index];
    ensure!(ok, ...);
}
```
This calculates the region of the first header that is the read name, and then for subsequent reads checks that they have at least that many characters, and that they match. This avoids scanning all but the first read, and I think it does that more efficiently too. For bonus points you could also double-check that each read header stops after `end_index`, or that `head[end_index] == b' '`.
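Pulled out of the review snippet (which relies on `find_byte` and `dropping` from external crates such as bstr and itertools), the prefix-comparison idea, including the bonus check, can be sketched with the standard library alone. The function name `all_names_match` and the plain `&[&[u8]]` input are illustrative, not the PR's actual API:

```rust
/// Check that every header shares the first header's read name,
/// where the name is everything up to the first space.
fn all_names_match(heads: &[&[u8]]) -> bool {
    let first_head = heads[0];
    // End of the read name: index of the first space, or the whole
    // header length if there is no space.
    let end_index = first_head
        .iter()
        .position(|c| *c == b' ')
        .unwrap_or(first_head.len());
    heads.iter().skip(1).all(|head| {
        // Subsequent headers must be long enough, share the name prefix...
        head.len() >= end_index
            && first_head[..end_index] == head[..end_index]
            // ...and the name must actually end at end_index (the bonus
            // check: either the header stops there or a space follows).
            && (head.len() == end_index || head[end_index] == b' ')
    })
}

fn main() {
    assert!(all_names_match(&[b"r1 x".as_slice(), b"r1 y".as_slice()]));
    assert!(!all_names_match(&[b"r1 x".as_slice(), b"r2 x".as_slice()]));
    // Without the bonus check, "r10" would pass because it merely
    // shares the prefix "r1"; with it, the mismatch is caught.
    assert!(!all_names_match(&[b"r1".as_slice(), b"r10".as_slice()]));
}
```

Only the first header is ever scanned for a space; every other header costs one length check, one byte check, and one prefix comparison.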
For my future self:
```rust
let head = record.head();
let ok = head.len() == end_index || (head.len() > end_index && head[end_index] == b' ');
let ok = ok && first_head[0..end_index] == head[0..end_index];
```
Force-pushed from 41be6bb to 1c19421.
The read name check includes only the header line up to the first whitespace. The remainder of the header is ignored.
Force-pushed from 17fe73a to 8f3f7ff.
The read name check includes only the header line up to the first whitespace. The remainder of the header is ignored.
Fixes #26.