Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Check that read pairs have the same name #27

Merged
merged 4 commits into from
Dec 7, 2022

Conversation

nh13
Copy link
Collaborator

@nh13 nh13 commented Nov 26, 2022

The read name checked include only the header line up to the first white-space. The remaining header is ignore.

Fixes #26.

@@ -421,6 +421,20 @@ where
let mut output_reads = vec![];
let mut qual_counter = BaseQualCounter::new();

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is that this may be expensive, since this will be done for every read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree and would be hesitant to merge this as is without some benchmarking. A couple of alternative suggestions for @jiangweiyao and @omicsorama to weigh in on:

  1. Would it be sufficient to check the either just the first record, or the first and last in each RecordSet, either here or perhaps upon construction of the PerFastqRecordSet (where we already validate that the record sets are the same length).
  2. If (1) is not sufficient, we could also e.g. check every nth record here. E.g. if we checked every 50th or 100th record that would catch the vast majority of issues with a lot less performance overhead.

More generally - what's the problem we are trying to solve? Catching when a user mixes up FASTQ files can be done by checking first/last etc. But if we're trying to catch broken upstream software that might occasionally emit reads in differing order that would dictate checking every read on the way through.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #27 where I added in checking just the first read in each file (implicitly does not allow empty input FASTQs).

Copy link
Contributor

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion for a more performant (I think) approach.
But more importantly I think we need to understand the goal, and that can inform whether we need to check every read, or just enough to catch user error.

src/lib/demux.rs Outdated
Comment on lines 426 to 443
let first_read_name =
zipped_reads[0].head().splitn(2, |c| *c == b' ').next().unwrap_or(b"");
for read in &zipped_reads {
let cur_read_name = read.head().splitn(2, |c| *c == b' ').next().unwrap_or(b"");
ensure!(
first_read_name == cur_read_name,
"Read names did not match: {:?} != {:?}",
String::from_utf8_lossy(first_read_name),
String::from_utf8_lossy(cur_read_name)
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could do this more efficiently as follows:

            let first_head = zipped_reads[0].head();
            let end_index = zipped_reads[0].head().find_byte(b' ').unwrap_or(first_head.len());
            for read in zipped_reads.iter().dropping(1) {
                let head = read.head();
                let ok = head.len() >= end_index && first_head[0..end_index] == head[0..end_index];
                ensure!(ok, ...);
            }

This calculates the region of the first header that is the read name, and then for subsequent reads checks they have at least that many characters, and that they match. This avoids scanning all but the first read, and I think it does that more efficiently too. For bonus points you could also double-check that each read header stops after end_index or that head[end_index] == b' '.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my future self:

let head = record.head();
let ok = head.len() == end_index || (head.len() > end_index && head[end_index] == b' ');
let ok = ok && first_head[0..end_index] == head[0..end_index];

@@ -421,6 +421,20 @@ where
let mut output_reads = vec![];
let mut qual_counter = BaseQualCounter::new();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree and would be hesitant to merge this as is without some benchmarking. A couple of alternative suggestions for @jiangweiyao and @omicsorama to weigh in on:

  1. Would it be sufficient to check the either just the first record, or the first and last in each RecordSet, either here or perhaps upon construction of the PerFastqRecordSet (where we already validate that the record sets are the same length).
  2. If (1) is not sufficient, we could also e.g. check every nth record here. E.g. if we checked every 50th or 100th record that would catch the vast majority of issues with a lot less performance overhead.

More generally - what's the problem we are trying to solve? Catching when a user mixes up FASTQ files can be done by checking first/last etc. But if we're trying to catch broken upstream software that might occasionally emit reads in differing order that would dictate checking every read on the way through.

@nh13 nh13 force-pushed the feature/enfoce-paired-end-read-names branch from 41be6bb to 1c19421 Compare November 28, 2022 18:44
@nh13 nh13 changed the base branch from main to feature/auto-merge-by-lane November 28, 2022 18:44
Base automatically changed from feature/auto-merge-by-lane to main December 6, 2022 04:56
The read name checked include only the header line up to the first white-space.  The
remaining header is ignore.
@nh13 nh13 force-pushed the feature/enfoce-paired-end-read-names branch from 17fe73a to 8f3f7ff Compare December 6, 2022 05:00
@nh13 nh13 merged commit 67bac88 into main Dec 7, 2022
@nh13 nh13 deleted the feature/enfoce-paired-end-read-names branch December 7, 2022 04:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check if read names match in paired end reads
3 participants