Input can be a path prefix #24

Merged
merged 11 commits into from
Nov 27, 2022
fix ordering
nh13 committed Nov 17, 2022
commit b7aeec3c8416c321fc72b5fdad75bb228f6408de
23 changes: 17 additions & 6 deletions README.md
@@ -127,21 +127,30 @@ for read in R1 R2 I1 I2; do cat L*/${read}.fastq.gz > ./${read}.fastq.gz; done

FASTQ files _must_ be BGZF compressed.

All reads containing sample barcodes/indexes _must_ be of the same length.

###### Auto-detecting FASTQs from a Path Prefix

Alternatively, the FASTQs can be auto-detected when a path prefix is given to `--fastqs <dir>/<prefix>`.
The FASTQs must be named `<dir>/<prefix>_L00<lane>_<kind><kind-number>_001.fastq.gz`, where `kind` is
one of R (read/template), I (index/sample barcode), or U (UMI/molecular barcode).

If the read structure is given on the command line or Sample Sheet, the segments are assumed to
apply to I1 R1 R2 I2 reads in that order. Otherwise, the read structure will be `+B +T +T +B`
(all index bases are used for the sample barcode, all read bases are assumed to be template).
The Read Structure must not be given on the command line or Sample Sheet. Instead, the Read
Structure will be derived from the file names (kind and kind number), with the full read length used for the given kind.
E.g. if the following FASTQs are present with path prefix `/path/to/prefix`:

```
/path/to/prefix_L001_I1_001.fastq.gz
/path/to/prefix_L001_R1_001.fastq.gz
/path/to/prefix_L001_R2_001.fastq.gz
/path/to/prefix_L001_I2_001.fastq.gz
```

then the `+B +T +T +B` read structure will be used. Since this tool requires all sample barcode
segments to have a fixed length, the first read in any index/sample-barcode FASTQ will be examined
and its length used as the expected sample barcode length.
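
The naming convention above can be sketched in Rust as follows. This is a minimal illustration, not the code from this PR: `parse_fastq_name` and `read_structure` are hypothetical helpers, and mapping `U` to the `M` operator is an assumption, since the operator list is truncated in this diff.

```rust
/// Kind of FASTQ inferred from the file name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Template,         // R
    SampleBarcode,    // I
    MolecularBarcode, // U
}

/// Hypothetical helper: parses `<prefix>_L00<lane>_<kind><kind-number>_001.fastq.gz`
/// and returns (lane, kind, kind number), or `None` if the name does not match.
fn parse_fastq_name(file_name: &str, prefix: &str) -> Option<(u32, Kind, u32)> {
    let rest = file_name.strip_prefix(prefix)?.strip_prefix("_L00")?;
    let rest = rest.strip_suffix("_001.fastq.gz")?;
    // `rest` is now `<lane>_<kind><kind-number>`, e.g. `1_R2`.
    let (lane, kind_and_number) = rest.split_once('_')?;
    let lane: u32 = lane.parse().ok()?;
    let mut chars = kind_and_number.chars();
    let kind = match chars.next()? {
        'R' => Kind::Template,
        'I' => Kind::SampleBarcode,
        'U' => Kind::MolecularBarcode,
        _ => return None,
    };
    // Whatever follows the kind letter must be the kind number.
    let kind_number: u32 = chars.as_str().parse().ok()?;
    Some((lane, kind, kind_number))
}

/// Read structure covering the full read length for the given kind.
/// Assumption: `M` is the operator for molecular barcodes.
fn read_structure(kind: Kind) -> &'static str {
    match kind {
        Kind::Template => "+T",
        Kind::SampleBarcode => "+B",
        Kind::MolecularBarcode => "+M",
    }
}
```

With the prefix `/path/to/prefix`, the file `/path/to/prefix_L001_I1_001.fastq.gz` would parse as lane 1, sample barcode, kind number 1, and map to `+B`.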

##### Read Structures

Read Structures are short strings that describe the origin and/or purpose of bases within sequencing reads. They are made up of a sequence of `<number><operator>` pairs. Four kinds of operators are recognized:
Read Structures are short strings that describe the origin and/or purpose of bases within sequencing reads. They are made up of a sequence of `<number><operator>` pairs (segments). Four kinds of operators are recognized:

1. **T** identifies template reads/bases
2. **B** identifies sample barcode reads/bases
@@ -159,6 +168,8 @@ One Read Structure must be provided for each input FASTQ file, in the same order
```
--read-structures +T +T 8B 8B
```

All sample barcode segments must have a fixed length. E.g. `8B+T` is allowed but `10S+B` is not.
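
The fixed-length rule can be illustrated with a small Rust sketch. It assumes a simplified Read Structure grammar in which each segment is a number or `+` followed by one uppercase operator letter. `Segment`, `parse_read_structure`, and `sample_barcodes_are_fixed_length` are hypothetical names, not the tool's API, and the sketch skips any further validation the real parser may perform.

```rust
/// One `<number><operator>` pair; `length` is `None` when the number is `+`.
#[derive(Debug, PartialEq, Eq)]
struct Segment {
    length: Option<usize>,
    operator: char,
}

/// Minimal parser for strings such as `8B+T` or `17B100T`.
/// Returns `None` for malformed input (e.g. an operator with no length).
fn parse_read_structure(s: &str) -> Option<Vec<Segment>> {
    let mut segments = Vec::new();
    let mut digits = String::new();
    let mut any_length = false;
    for c in s.chars() {
        if c.is_ascii_digit() && !any_length {
            digits.push(c);
        } else if c == '+' && digits.is_empty() && !any_length {
            any_length = true;
        } else if c.is_ascii_uppercase() && (any_length || !digits.is_empty()) {
            let length = if any_length {
                None
            } else {
                Some(digits.parse::<usize>().ok()?)
            };
            segments.push(Segment { length, operator: c });
            digits.clear();
            any_length = false;
        } else {
            return None;
        }
    }
    // Trailing digits or a dangling `+` without an operator are malformed.
    if digits.is_empty() && !any_length {
        Some(segments)
    } else {
        None
    }
}

/// True when every sample barcode (`B`) segment has an explicit length.
fn sample_barcodes_are_fixed_length(segments: &[Segment]) -> bool {
    segments
        .iter()
        .filter(|segment| segment.operator == 'B')
        .all(|segment| segment.length.is_some())
}
```

Under this sketch, `8B+T` passes the check because its only `B` segment has length 8, while `10S+B` fails because its `B` segment has an indefinite length.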

##### Sample Sheet

Information about the sample(s) to demultiplex is specified within a Sample Sheet.
58 changes: 40 additions & 18 deletions src/lib/run.rs
@@ -1,4 +1,4 @@
use std::{fs::File, io::BufWriter, sync::Arc, vec::Vec};
use std::{fs::File, io::BufWriter, path::PathBuf, sync::Arc, vec::Vec};

use anyhow::{bail, ensure, Context, Result};
use gzp::BUFSIZE;
@@ -35,20 +35,39 @@ pub fn run(opts: Opts) -> Result<(), anyhow::Error> {
// and so all accepted records will go into file(s) for one sample with no Undetermined
let is_no_demux = samples.len() == 1 && samples[0].barcode.len() == 0;

// If a path prefix is given as the single argument to --fastqs then auto-detect all the FASTQs
// with that path prefix. Also, if the read-structures is not given, build it from the auto-
// detected
let opts = if opts.fastqs.len() == 1 && !opts.fastqs[0].is_file() {
let input_fastqs: Vec<InputFastq> = InputFastq::slurp(opts.fastqs[0].clone());
let fastqs = input_fastqs.iter().map(|f| f.path.clone()).collect();
let read_structures: Vec<ReadStructure> = if opts.read_structures.is_empty() {
input_fastqs.iter().map(InputFastq::read_structure).collect::<Vec<ReadStructure>>()
} else {
opts.read_structures
};
// If the input FASTQs are in fact path prefixes, then slurp in the FASTQs and create the
// read structure based on the kind/kind-number inferred from the file name.
let opts = if opts.fastqs.iter().all(|f| f.is_file()) {
// do nothing
opts
} else if opts.fastqs.iter().all(|f| !f.is_file()) {
ensure!(
opts.read_structures.is_empty(),
"Read Structure must not be given when the input FASTQs are a path prefix."
);

// slurp in all the FASTQs
// Important: sort by kind then kind number so the output kind number is ordered correctly
let input_fastqs: Vec<InputFastq> = opts
.fastqs
.iter()
.flat_map(|prefix| InputFastq::slurp(prefix.clone()))
.sorted_by(|left, right| {
left.kind.cmp(&right.kind).then(left.kind_number.cmp(&right.kind_number))
})
.collect();

// build read structures, one per input FASTQ
let read_structures: Vec<ReadStructure> =
input_fastqs.iter().map(InputFastq::read_structure).collect();

// extract the list of input FASTQs
let fastqs: Vec<PathBuf> =
input_fastqs.iter().map(|input_fastq| input_fastq.path.clone()).collect();

Opts { fastqs, read_structures, ..opts }
} else {
opts
bail!("Input FASTQS (--fastqs) must either all be files or all path prefixes, not a mixture of both")
};
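
The branching and ordering in this hunk can be sketched in isolation. The following is a hedged, standard-library-only illustration and not the PR's code. `InputMode`, `classify`, `Detected`, and `sort_detected` are hypothetical names, the file-system check is replaced by a caller-supplied predicate, and `sort_by` from `std` stands in for the `sorted_by` iterator adaptor used in the diff. The variant order of `Kind`, which decides which kind sorts first, is an assumption.

```rust
/// Assumption: variants are declared in the order they should sort.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Kind {
    Template,
    SampleBarcode,
    MolecularBarcode,
}

#[derive(Debug, PartialEq, Eq)]
enum InputMode {
    Files,
    PathPrefixes,
}

/// Inputs must be all files or all path prefixes; a mixture is an error.
/// `is_file` is a stand-in for a real file-system check such as `Path::is_file`.
fn classify<T>(inputs: &[T], is_file: impl Fn(&T) -> bool) -> Result<InputMode, String> {
    if inputs.iter().all(|input| is_file(input)) {
        Ok(InputMode::Files)
    } else if inputs.iter().all(|input| !is_file(input)) {
        Ok(InputMode::PathPrefixes)
    } else {
        Err("inputs must either all be files or all be path prefixes".to_string())
    }
}

/// A FASTQ discovered under a path prefix.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Detected {
    kind: Kind,
    kind_number: u32,
    path: String,
}

/// Sort by kind first, then by kind number within each kind.
fn sort_detected(mut fastqs: Vec<Detected>) -> Vec<Detected> {
    fastqs.sort_by(|left, right| {
        left.kind.cmp(&right.kind).then(left.kind_number.cmp(&right.kind_number))
    });
    fastqs
}
```

Chaining `cmp` with `then` compares the kind number only when the kinds are equal, so each kind forms one contiguous, numerically ordered run in the output.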

// Important: this must be created **after** updating the number of read structures
@@ -91,6 +110,12 @@ pub fn run(opts: Opts) -> Result<(), anyhow::Error> {

// All sample barcode read segments should now have a fixed length, so check the sum of their
// lengths with the sum of length of the sample barcode(s) in the sample sheet.
ensure!(
opts.read_structures
.iter()
.all(|s| s.sample_barcodes().all(read_structure::ReadSegment::has_length)),
"The Read Structure must have sample barcode segments with fixed lengths"
);
ensure!(
is_no_demux
|| opts
@@ -562,7 +587,6 @@ mod test {
fn test_end_to_end_simple(
#[values(1, 2)] threads: usize,
#[values("T", "B", "TB")] output_types: String,
#[values(true, false)] use_path_prefix: bool,
) {
let dir = tempfile::tempdir().unwrap();
let read_structure = ReadStructure::from_str("17B100T").unwrap();
@@ -571,10 +595,9 @@ mod test {
create_dir(&output).unwrap();

let metadata = create_preset_sample_metadata_file(&dir.path());
let fastqs: PathBuf = if use_path_prefix { dir.path().to_path_buf() } else { input };

let opts = Opts {
fastqs: vec![fastqs],
fastqs: vec![input],
output_dir: output.clone(),
sample_metadata: metadata,
read_structures: vec![read_structure],
@@ -886,7 +909,6 @@ mod test {
fn test_demux_dual_index_paired_end_reads(
#[values(1, 2)] threads: usize,
#[values(true, false)] use_path_prefix: bool,
#[values(true, false)] empty_read_structures: bool,
) {
let dir = tempfile::tempdir().unwrap();
let fq1_path = dir.path().join(fastq_file_name("test", 1, SegmentType::SampleBarcode, 1));
@@ -918,7 +940,7 @@ mod test {
&fq4_path,
);

let read_structures = if use_path_prefix && empty_read_structures {
let read_structures = if use_path_prefix {
vec![]
} else {
vec![