Summary
Add two new entry points in illumina.py:
- illumina_metadata: Generate metadata JSONs from RunInfo.xml and SampleSheet (run once per sequencing run)
- splitcode_demux_fastqs: Perform splitcode-driven demultiplexing from paired DRAGEN FASTQs using a custom third inline barcode (run in parallel per FASTQ pair)
This separation enables efficient parallel processing by generating shared metadata once, then running multiple demux jobs simultaneously.
New Entry Point #1: illumina_metadata
Purpose
Generate metadata JSON files from Illumina run metadata files without processing reads. Run once per sequencing run to create metadata that's shared across all parallel demux jobs.
Inputs
- RunInfo.xml (required): Illumina run metadata
- SampleSheet.csv (required): Illumina/DRAGEN samplesheet
- Lane number (required): Lane to process
- Sequencing center (optional): Default "Broad"
Outputs
- run_info.json: Run metadata (flowcell, dates, read structure, instrument info)
- meta_by_sample.json: Sample metadata indexed by sample name
- meta_by_filename.json: Sample metadata indexed by filename/library ID
Implementation Notes
- Reuses the existing build_run_info_json() utility function
- Extracts duplicated metadata generation logic from illumina_demux and splitcode_demux
- No read processing - pure metadata extraction
Status: ✅ COMPLETE - All 7 tests passing
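As a rough illustration of the metadata extraction described above, the sketch below derives a few run-level fields from RunInfo.xml text using only the Python standard library. It is not the project's build_run_info_json(): the function name sketch_run_info, the output keys, and the read-structure notation are assumptions made for illustration, and the real run_info.json schema may differ.

```python
import json
import xml.etree.ElementTree as ET

# Minimal synthetic RunInfo.xml; all values are illustrative only.
EXAMPLE_RUNINFO = """<?xml version="1.0"?>
<RunInfo Version="2">
  <Run Id="230101_M00001_0001_TESTFC01" Number="1">
    <Flowcell>TESTFC01</Flowcell>
    <Instrument>M00001</Instrument>
    <Date>230101</Date>
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="151" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>
"""


def sketch_run_info(runinfo_xml, lane, sequencing_center="Broad"):
    """Hypothetical helper: pull run-level fields out of RunInfo.xml text."""
    run = ET.fromstring(runinfo_xml).find("Run")
    if run is None:
        raise ValueError("RunInfo.xml has no <Run> element")
    reads = sorted(run.iter("Read"), key=lambda r: int(r.get("Number")))
    # Encode each read as <cycles><T|B>: T for template reads, B for index reads.
    read_structure = "".join(
        "{}{}".format(r.get("NumCycles"), "B" if r.get("IsIndexedRead") == "Y" else "T")
        for r in reads
    )
    return {
        "run_id": run.get("Id"),
        "flowcell": run.findtext("Flowcell"),
        "instrument": run.findtext("Instrument"),
        "run_date": run.findtext("Date"),
        "lane": lane,
        "read_structure": read_structure,
        "sequencing_center": sequencing_center,
    }


if __name__ == "__main__":
    print(json.dumps(sketch_run_info(EXAMPLE_RUNINFO, lane=1), indent=2))
```

Because the result is a plain dictionary, it can be written once per run and read by every parallel demux job without re-parsing the XML.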
New Entry Point #2: splitcode_demux_fastqs
Purpose
Perform splitcode-based demultiplexing directly from a single paired DRAGEN FASTQ file set, using a custom third inline barcode scheme. Designed to run in parallel across multiple FASTQ pairs.
Inputs
- Paired FASTQ files (R1/R2): Exactly one pair from DRAGEN output
- Custom 3-barcode samplesheet: TSV format defining third inline barcode sequences
  - Maps composite (index1 + index2 + inline) barcode → sample name
  - May include rows with empty barcode_3 (2-barcode samples that bypass splitcode)
- Output directory: Where to write BAM files and metrics
Note: RunInfo.xml and Illumina SampleSheet.csv are NOT required - metadata JSONs are generated separately via illumina_metadata.
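To make the samplesheet handling concrete, here is a minimal sketch of loading the TSV, separating 3-barcode rows from 2-barcode rows, and building the composite barcode → sample name lookup. Only the barcode_3 column name appears in this description; the sample, barcode_1, and barcode_2 column names, the outer barcode sequences, and all function names are assumptions for illustration.

```python
import csv
import io

# Hypothetical samplesheet contents; the real column layout may differ.
EXAMPLE_TSV = (
    "sample\tbarcode_1\tbarcode_2\tbarcode_3\n"
    "TestSample1\tATCACGTT\tGGCTAATC\tAAAAAAAA\n"
    "TestSample2\tATCACGTT\tGGCTAATC\tCCCCCCCC\n"
    "TestSampleNoSplitcode\tCGATGTAA\tTTAGGCAT\t\n"
)


def load_rows(handle):
    """Read a tab-separated samplesheet into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def partition_rows(rows):
    """Split rows into (3-barcode, 2-barcode) by whether barcode_3 is set."""
    three_bc, two_bc = [], []
    for row in rows:
        has_inline = bool((row.get("barcode_3") or "").strip())
        (three_bc if has_inline else two_bc).append(row)
    return three_bc, two_bc


def composite_lookup(rows):
    """Map (index1, index2, inline) barcode tuples to sample names."""
    lookup = {}
    for row in rows:
        key = (row["barcode_1"], row["barcode_2"], row["barcode_3"])
        if key in lookup:
            raise ValueError("duplicate composite barcode: {}".format("+".join(key)))
        lookup[key] = row["sample"]
    return lookup


if __name__ == "__main__":
    three_bc, two_bc = partition_rows(load_rows(io.StringIO(EXAMPLE_TSV)))
    print(composite_lookup(three_bc))
    print([row["sample"] for row in two_bc])
```

Rejecting duplicate composite barcodes up front is a design choice of this sketch: an ambiguous key would otherwise silently assign reads to whichever row came last.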
Processing for 3-barcode samples (barcode_3 present):
- Parse FASTQ filenames to extract pool/sample metadata
- Extract outer barcodes (index1+index2) from DRAGEN FASTQ headers
- Filter samplesheet to matching outer barcodes
- Generate splitcode configuration from inline barcode definitions
- Run splitcode demultiplexing
- Convert splitcode output to per-sample unaligned BAMs
- Generate demux metrics
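The outer-barcode extraction and samplesheet filtering steps above could be sketched as follows. This assumes the standard Illumina FASTQ header layout, in which the comment field ends with the index sequences joined by "+" (for example `1:N:0:ATCACGTT+GGCTAATC`). The function names and the barcode_1/barcode_2 column names are hypothetical, and the sketch ignores any index2 orientation differences between instruments.

```python
def outer_barcodes_from_header(header):
    """Return (index1, index2) parsed from an Illumina-style FASTQ header line."""
    parts = header.strip().split(None, 1)
    if len(parts) != 2:
        raise ValueError("FASTQ header has no comment field: {!r}".format(header))
    # The last colon-separated item of the comment holds "index1+index2".
    index_field = parts[1].split(":")[-1]
    index1, _, index2 = index_field.partition("+")
    return index1.upper(), index2.upper()


def filter_rows_by_outer_barcodes(rows, index1, index2):
    """Keep only samplesheet rows whose outer barcodes match this FASTQ pair."""
    return [
        row for row in rows
        if row["barcode_1"].upper() == index1 and row["barcode_2"].upper() == index2
    ]


if __name__ == "__main__":
    example_header = "@M00001:1:TESTFC01:1:1101:1000:2000 1:N:0:ATCACGTT+GGCTAATC"
    example_rows = [
        {"sample": "TestSample1", "barcode_1": "ATCACGTT", "barcode_2": "GGCTAATC"},
        {"sample": "Other", "barcode_1": "CGATGTAA", "barcode_2": "TTAGGCAT"},
    ]
    i1, i2 = outer_barcodes_from_header(example_header)
    print(filter_rows_by_outer_barcodes(example_rows, i1, i2))
```

Reading only the first header of a FASTQ file is sufficient under the assumption that each DRAGEN FASTQ pair corresponds to a single outer barcode combination.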
Processing for 2-barcode samples (barcode_3 empty):
- Skip splitcode demultiplexing entirely
- Perform direct FASTQ → BAM conversion
- Output exactly one BAM file (the pool itself)
- Generate metrics
Outputs
- Per-sample unaligned BAMs: One BAM per resolved sample
- demux_metrics.json: Read counts per sample, unmatched reads, etc.
Note: Does NOT output:
- barcodes_common.txt (removed from spec - use illumina_demux for comprehensive barcode reporting)
- barcodes_outliers.txt (removed from spec - use illumina_demux for comprehensive barcode reporting)
- run_info.json, meta_by_sample.json, meta_by_filename.json (use illumina_metadata instead)
Status: ✅ COMPLETE - All 9 tests passing
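For orientation, the following sketch shows one way per-sample and unmatched read counts could be tallied into a structure like demux_metrics.json. The key names (total_reads, unmatched_reads, reads_per_sample) and the function name are assumptions; the actual metrics schema is not specified in this description.

```python
import json
from collections import Counter


def sketch_demux_metrics(assignments, expected_samples):
    """Tally read assignments; None marks a read with no matching inline barcode."""
    counts = Counter(assignments)
    unmatched = counts.pop(None, 0)
    # Start every expected sample at zero so empty samples still appear.
    per_sample = {sample: 0 for sample in expected_samples}
    for sample, n in counts.items():
        per_sample[sample] = per_sample.get(sample, 0) + n
    return {
        "total_reads": sum(per_sample.values()) + unmatched,
        "unmatched_reads": unmatched,
        "reads_per_sample": per_sample,
    }


if __name__ == "__main__":
    # Illustrative assignments: two samples plus some unmatched reads.
    reads = ["TestSample1"] * 3 + ["TestSample2"] * 2 + [None]
    expected = ["TestSample1", "TestSample2", "TestSampleEmpty"]
    print(json.dumps(sketch_demux_metrics(reads, expected), indent=2))
```

Seeding the tally with every expected sample means a sample with zero reads is reported explicitly rather than omitted.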
Typical Workflow
# Step 1: Generate metadata once per run
illumina_metadata \
    --runinfo RunInfo.xml \
    --samplesheet SampleSheet.csv \
    --lane 1 \
    --out_runinfo run_info.json \
    --out_meta_by_sample meta_by_sample.json \
    --out_meta_by_filename meta_by_filename.json

# Step 2: Run demux in parallel for each pool
for pool in Pool1 Pool2 Pool3 Pool4; do
    splitcode_demux_fastqs \
        --inFastq1 "${pool}_R1.fastq.gz" \
        --inFastq2 "${pool}_R2.fastq.gz" \
        --sampleSheet samples_3bc.tsv \
        --outDir "demux_out/${pool}" &
done
wait

Refactoring Benefits
- Eliminates duplication: Both illumina_demux and splitcode_demux currently duplicate run_info.json generation code (100% identical)
- Uses existing code: Leverages already-implemented build_run_info_json() utility
- Enables parallelization: Metadata generated once, then many demux jobs run simultaneously
- Simplifies interface: splitcode_demux_fastqs has fewer required inputs
- Clear separation of concerns: Metadata extraction vs read processing
Implementation Status
✅ Phase 1: Shared Utilities - COMPLETE
- parse_illumina_fastq_filename() - 15 tests passing
- build_run_info_json() - 5 tests passing
- normalize_barcode() - 11 tests passing
✅ Phase 2: Test Infrastructure - COMPLETE
- TestIlluminaMetadata test class created - 7 tests
- TestSplitcodeDemuxFastqs test class created - 9 tests
- Test data files created (RunInfo.xml, SampleSheet.csv, FASTQs)
✅ Phase 3: Implementation - COMPLETE
- illumina_metadata() implemented - 7/7 tests passing
- splitcode_demux_fastqs() implemented - 9/9 tests passing
- Refactored illumina_demux to use build_run_info_json()
- Refactored splitcode_demux to use build_run_info_json()
⬜ Phase 4: Documentation & Validation - TODO
- Update command-line documentation
- Final validation with CI
- Code review
Test Data
For illumina_metadata:
- Synthetic RunInfo.xml with flowcell TESTFC01
- Synthetic SampleSheet.csv in DRAGEN format (3 pools)
- Validates output JSON schemas match existing demux outputs
For splitcode_demux_fastqs:
- TestPool1 (3-barcode sample):
  - 100 reads: AAAAAAAA (TestSample1)
  - 75 reads: CCCCCCCC (TestSample2)
  - 50 reads: GGGGTTTT (TestSample3)
  - 0 reads: TTTTGGGG (TestSampleEmpty)
  - 25 reads: Outlier barcodes (GGAATTTT, CCCCAAAA, ATATAGAG)
- TestPool3 (2-barcode sample):
  - 80 reads: No inline barcode (TestSampleNoSplitcode)
  - Tests bypass of splitcode for 2-barcode samples