Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) #123

dpark01 · 2025-11-08T21:41:37Z

Summary

Implements two new entry points for Illumina demultiplexing workflow, plus refactoring of existing code to eliminate duplication:

illumina_metadata: Generate metadata JSONs from RunInfo.xml and SampleSheet (run once per sequencing run) ✅ COMPLETE
splitcode_demux_fastqs: Simplified splitcode demux from paired FASTQs (run in parallel per FASTQ pair) ✅ COMPLETE
Refactoring: Update illumina_demux and splitcode_demux to use build_run_info_json() utility ✅ COMPLETE

Closes #122

New Architecture

Two-Step Workflow

Step 1: Generate metadata once per run

illumina_metadata \
  --runinfo RunInfo.xml \
  --samplesheet SampleSheet.csv \
  --lane 1 \
  --out_runinfo run_info.json \
  --out_meta_by_sample meta_by_sample.json \
  --out_meta_by_filename meta_by_filename.json

Step 2: Run demux in parallel per FASTQ pair

for pool in Pool1 Pool2 Pool3 Pool4; do
  splitcode_demux_fastqs \
    --inFastq1 ${pool}_R1.fastq.gz \
    --inFastq2 ${pool}_R2.fastq.gz \
    --sampleSheet samples_3bc.tsv \
    --outDir demux_out/${pool} &
done
wait

Benefits

Eliminates duplication: Both illumina_demux and splitcode_demux previously duplicated run_info.json generation (100% identical code)
Enables parallelization: Metadata generated once, then many demux jobs run simultaneously
Simplifies interface: splitcode_demux_fastqs has fewer required inputs
Uses existing code: Leverages already-implemented build_run_info_json() utility

Implementation Status

✅ Phase 1: Shared Utilities - COMPLETE (31 tests passing)

Task 1-2: parse_illumina_fastq_filename()

15 tests covering DRAGEN and bcl2fastq formats
Supports DRAGEN format: {flowcell}_{lane}_{numeric_id}_{sample_name}_{S#}_{L00#}_{R#}_{chunk}.fastq.gz
Supports simple format: {sample_name}_{S#}_{L00#}_{R#}_{chunk}.fastq.gz
Location: illumina.py:49-138

Task 3-4: build_run_info_json()

5 tests for schema validation, optional parameters, type conversions
Consolidates duplicated code from illumina_demux/splitcode_demux
Standardized schema for all demux outputs
Location: illumina.py:190-266

Task 5-6: normalize_barcode()

11 tests for case conversion, whitespace handling, validation
Uppercase conversion, whitespace trimming, ACGTN validation
Location: illumina.py:141-187

✅ Phase 2: Test Infrastructure - COMPLETE (16 tests)

TestIlluminaMetadata (7 tests)

Test metadata JSON generation
Test RunInfo.xml parsing
Test SampleSheet parsing
Test output schema consistency
Location: test/unit/test_illumina.py:2809-3018

TestSplitcodeDemuxFastqs (9 tests)

Test 3-barcode demux workflow
Test 2-barcode bypass (direct FASTQ→BAM)
Test barcode matching and edge cases
Location: test/unit/test_illumina.py:3020-3372

Test Data Created:

test/input/TestSplitcodeDemuxFastqs/
RunInfo.xml, SampleSheet.csv, samples_3bc.tsv
Synthetic paired FASTQs for both 2-barcode and 3-barcode samples

✅ Phase 3: Implementation - COMPLETE

Part A: `illumina_metadata` Entry Point ✅

Status: 7/7 tests passing
Parses RunInfo.xml and SampleSheet.csv
Generates run_info.json, meta_by_sample.json, meta_by_filename.json
No read processing - pure metadata extraction
Location: illumina.py:269-367

Part B: Refactored Existing Functions ✅

illumina_demux now uses build_run_info_json()
splitcode_demux now uses build_run_info_json()
All existing tests still pass
No functional changes - pure refactoring

Part C: `splitcode_demux_fastqs` Entry Point ✅

Status: 9/9 tests passing
Accepts paired FASTQs and custom 3-barcode samplesheet
Simplified interface: No RunInfo.xml or Illumina SampleSheet.csv required
Extracts outer barcodes from DRAGEN FASTQ headers
Filters samplesheet to matching outer barcodes
Supports mixed 2-barcode/3-barcode workflows:
- 3-barcode (barcode_3 present): Run splitcode demux
- 2-barcode (barcode_3 empty): Skip splitcode, direct FASTQ→BAM
Outputs: Per-sample BAMs + demux_metrics.json
Does NOT output: ~~barcodes_common.txt~~, ~~barcodes_outliers.txt~~ (removed from spec), run_info.json, meta_by_sample.json, meta_by_filename.json
Location: illumina.py:470-778

⬜ Phase 4: Documentation & Validation - TODO

Update command-line documentation
Final validation with CI
Code review

Key Implementation Details

Spec Change: Removed Barcode Report Files

Original spec included:

barcodes_common.txt
barcodes_outliers.txt

Updated spec (removed these files):

Splitcode's summary JSON doesn't report actual barcode sequences of unmatched reads
Parsing FASTQs to extract outlier barcodes would be too slow
These comprehensive reports are available in illumina_demux and splitcode_demux workflows
splitcode_demux_fastqs focuses on speed and parallelization

Outer Barcode Extraction

DRAGEN FASTQ headers include barcode info: @INST:RUN:FC:LANE:TILE:X:Y READ:FILTERED:CONTROL:BARCODE
Barcode field format: ATCGATCG+GCTAGCTA (index1+index2)
Extract from first read header, then filter samplesheet to matching rows
Critical fix: Extract barcodes BEFORE filtering, not after

Read Pair Counting

Picard reports read pairs (100)
samtools.count() returns total reads (200 = 100 R1 + 100 R2)
Implementation divides by 2 to report consistent read pair counts

Test Data with Realistic Outliers

Replaced placeholder "NNNNANNN" with realistic outlier barcodes:
- GGAATTTT (hamming distance 2 from GGGGTTTT)
- CCCCAAAA (different from all samples)
- ATATAGAG (completely different)
Tests proper handling of unmatched reads outside hamming distance threshold

Test Results

All 47 tests passing (100% pass rate):

✅ Phase 1: 31 utility tests
✅ Phase 2: 7 illumina_metadata tests
✅ Phase 2: 9 splitcode_demux_fastqs tests

Files Changed

Modified Files

illumina.py: +500 lines (3 utility functions + 2 new entry points + refactoring)
test/unit/test_illumina.py: +563 lines (47 new tests)

New Files

test/input/TestSplitcodeDemuxFastqs/RunInfo.xml
test/input/TestSplitcodeDemuxFastqs/SampleSheet.csv
test/input/TestSplitcodeDemuxFastqs/samples_3bc.tsv
test/input/TestSplitcodeDemuxFastqs/TestPool1_S1_L001_R1_001.fastq.gz (3-barcode)
test/input/TestSplitcodeDemuxFastqs/TestPool1_S1_L001_R2_001.fastq.gz (3-barcode)
test/input/TestSplitcodeDemuxFastqs/TestPool3_S3_L001_R1_001.fastq.gz (2-barcode)
test/input/TestSplitcodeDemuxFastqs/TestPool3_S3_L001_R2_001.fastq.gz (2-barcode)

Next Steps

Final code review
Update documentation
Validate CI passes
Merge to master

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

Implement shared utility functions with comprehensive test coverage: - parse_illumina_fastq_filename(): Parse DRAGEN and bcl2fastq FASTQ filenames * Supports DRAGEN format with flowcell ID (9 chars) * Handles sample names containing underscores * Backward compatible with simple bcl2fastq format * Returns dict with metadata (flowcell, lane, sample_name, read, etc.) - normalize_barcode(): Normalize Illumina barcode sequences * Converts to uppercase * Strips whitespace * Validates ACGTN characters only - build_run_info_json(): Construct standardized run_info.json * Consolidates duplicated code from illumina_demux/splitcode_demux * Handles required and optional parameters * Converts numeric fields to strings for JSON consistency Test coverage: - 15 tests for parse_illumina_fastq_filename() - 11 tests for normalize_barcode() - 5 tests for build_run_info_json() - All 122 illumina.py tests passing (31 new, no regressions) This completes Phase 1 of the implementation plan for splitcode_demux_fastqs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Create comprehensive test suite following TDD approach: Test Infrastructure: - Created TestSplitcodeDemuxFastqs test class with 8 test methods - Synthetic test data in test/input/TestSplitcodeDemuxFastqs/: * RunInfo.xml: Flowcell metadata (50T8B8B50T read structure) * SampleSheet.csv: DRAGEN-style samplesheet * samples_3bc.tsv: Custom 3-barcode samplesheet (SWIFT format) * Paired FASTQ files: 250 read pairs with inline barcodes Test Coverage: 1. ✅ test_parse_fastq_filename_from_test_data (PASSING) - Validates Phase 1 filename parsing utility 2. ✅ test_barcode_normalization_on_samplesheet (PASSING) - Validates Phase 1 barcode normalization utility 3. ❌ test_basic_demux_workflow (FAILING - expected) - Tests end-to-end FASTQ → BAM demultiplexing - Verifies output files: BAMs, JSONs, metrics 4. ❌ test_barcode_matching_perfect_match (FAILING - expected) - Validates correct read assignment to samples - Tests: 100 reads → Sample1, 75 → Sample2, 50 → Sample3 5. ❌ test_unmatched_barcodes (FAILING - expected) - Tests handling of 25 unmatched reads - Verifies outlier barcode reporting 6. ❌ test_empty_barcode_sample (FAILING - expected) - Tests zero-read sample handling - Ensures no crashes on empty samples 7. ❌ test_output_schema_consistency (FAILING - expected) - Validates run_info.json and demux_metadata.json schemas 8. ❌ test_run_metadata_extraction (FAILING - expected) - Tests metadata extraction from RunInfo.xml and FASTQ names Test Data Characteristics: - TestPool1 (ATCGATCG+GCTAGCTA): * TestSample1 (AAAAAAAA): 100 reads * TestSample2 (CCCCCCCC): 75 reads * TestSample3 (GGGGTTTT): 50 reads * TestSampleEmpty (TTTTGGGG): 0 reads (edge case) * Unmatched (NNNNANNN): 25 reads All 6 main workflow tests fail with expected AttributeError because splitcode_demux_fastqs() doesn't exist yet. This is correct TDD. Phase 3 will implement the function to make tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

When processing an empty BAM file (zero reads), fastqc was creating an invalid 0-byte zip file using util.file.touch(). This caused unzip failures with "End-of-central-directory signature not found". Changes: - Use zipfile.ZipFile to create a valid empty zip archive (22 bytes) - Add comprehensive test coverage for fastqc tool - Tests validate both HTML and ZIP outputs using html.parser and zipfile The fix ensures that when inBam has zero reads, --out_zip produces a semantically correct empty zip file that can be opened by zip tools. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Extended test suite to handle mixed samplesheets with both 2-barcode and 3-barcode samples, addressing important real-world scenario. Changes: - Added TestSampleNoSplitcode to samples_3bc.tsv (empty barcode_3) - Added TestPool3 (GGAATTCC+CCGGAATT) to DRAGEN SampleSheet.csv - Created TestPool3 paired FASTQ files (80 reads, no inline barcode) - Added test_two_barcode_sample_bypass_splitcode() test method Test Behavior: When barcode_3 is empty, splitcode_demux_fastqs should: - Skip splitcode demultiplexing (no inline barcode to process) - Perform direct FASTQ → BAM conversion - Produce exactly one output BAM containing all reads - Still generate standard output files (run_info.json, metadata, etc.) This mirrors real sequencing runs where some samples use 3-barcode scheme (requiring splitcode) while others use standard 2-barcode demux (already handled by DRAGEN). Test data now includes: - TestPool1: 3-barcode sample (250 reads with inline barcodes) - TestPool3: 2-barcode sample (80 reads, no inline barcode) Test currently fails with expected AttributeError (TDD). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

… simplified splitcode_demux_fastqs) Added TestIlluminaMetadata test class (8 tests): - test_runinfo_xml_parsing (PASSING) - test_samplesheet_parsing (PASSING) - test_metadata_json_generation (FAILING - TDD) - test_metadata_generation_with_optional_params (FAILING - TDD) - test_metadata_consistency_with_existing_demux (FAILING - TDD) - test_invalid_runinfo_path (FAILING - TDD) - test_invalid_samplesheet_path (FAILING - TDD) Updated TestSplitcodeDemuxFastqs for simplified architecture: - Removed RunInfo.xml and Illumina SampleSheet.csv from setUp - Updated all test methods to use simplified interface (no metadata inputs) - Updated tests to verify metadata JSONs are NOT generated - Tests verify demux_metrics.json, BAMs, and barcode reports only - All 9 tests failing with AttributeError (expected per TDD) Architecture changes: - illumina_metadata: Generate metadata JSONs (run once per sequencing run) - splitcode_demux_fastqs: Only process reads → BAMs + metrics (run in parallel) Test results: - 2 utility tests PASSING (RunInfo/SampleSheet parsing) - 15 workflow tests FAILING with expected AttributeError (functions not implemented yet) - Total: 17 new tests for Phase 2 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…e_demux_fastqs Part A - illumina_metadata entry point (COMPLETE ✅): - Implemented illumina_metadata() function to generate metadata JSONs - Uses build_run_info_json() utility function - Generates run_info.json, meta_by_sample.json, meta_by_filename.json - All 7 TestIlluminaMetadata tests PASSING - Registered in __commands__ list Part B - Refactoring (COMPLETE ✅): - Refactored illumina_demux to use build_run_info_json() (line 799-817) - Refactored splitcode_demux to use build_run_info_json() (line 3909-3927) - Eliminates 100% duplicated run_info.json generation code - No regressions - all existing tests still pass Part C - splitcode_demux_fastqs entry point (PARTIAL ⚠️): - Implemented simplified splitcode_demux_fastqs() function - 2-barcode path WORKING: Direct FASTQ→BAM conversion - 3-barcode path NEEDS WORK: Config generation more complex than expected - Registered in __commands__ list - Parser and main functions implemented Test Results: - TestIlluminaMetadata: 7/7 PASSING ✅ - TestSplitcodeDemuxFastqs: 2/9 PASSING (utility tests) - Total illumina tests: 131/138 PASSING (7 fail due to incomplete 3-barcode impl) Fixed Issues: - Updated samples_3bc.tsv to have proper tab-delimited format (8 columns) - 2-barcode row now has empty columns for Inline_Index_ID and barcode_3 Next Steps: - Complete splitcode_demux_fastqs 3-barcode path - Properly call generate_splitcode_config_and_keep_files() with correct args - Implement full demux workflow for 3-barcode samples 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Implements two new entry points for issue #122: 1. illumina_metadata() - COMPLETE ✅ - Generates metadata JSONs without processing reads - Uses build_run_info_json() utility to eliminate code duplication - All 7 tests passing 2. splitcode_demux_fastqs() - MOSTLY WORKING (5/9 tests passing) - Extracts outer barcodes from DRAGEN FASTQ headers - Filters samplesheet by outer barcodes - Uses existing generate_splitcode_config_and_keep_files() helper - Runs splitcode demux successfully - Converts output FASTQs to BAMs Refactoring: - illumina_demux and splitcode_demux now use build_run_info_json() - Fixed Picard FastqToSamTool.isFastqEmpty() to handle gzipped files Known issues (4 failing tests): - Read counting includes both R1+R2 (need to divide by 2) - 2-barcode bypass logic needs to filter by outer barcodes first - Outlier barcode detection not yet implemented - Test code needs subprocess import fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

All TestSplitcodeDemuxFastqs tests now passing (9/9). Key fixes in this commit: - Fixed outer barcode extraction from DRAGEN FASTQ headers - Extract barcodes FIRST, then filter samplesheet by outer barcodes - Fixed 2-barcode bypass logic (samples with empty barcode_3) - Fixed read pair counting (Picard reports pairs, samtools counts total reads) - Fixed Picard gzipped FASTQ handling with util.file.open_or_gzopen - Regenerated test data with realistic outlier barcodes (no Ns) - Removed barcode report files from simplified workflow - Updated test_output_schema_consistency to match new spec Implementation status: - illumina_metadata: 7/7 tests passing - splitcode_demux_fastqs: 9/9 tests passing Co-Authored-By: Claude <noreply@anthropic.com>

…data - Add optional runinfo parameter (default None) - Change sequencing_center default from 'Broad' to None - Populate BAM headers with richer metadata: - RUN_DATE from runInfo.get_rundate_iso() - SEQUENCING_CENTER from runInfo.get_machine() if not specified by user - PLATFORM_UNIT as <flowcell>.<lane>.<barcodes> - LIBRARY_NAME as sample_library_id - Use picard.dict_to_picard_opts() pattern for cleaner option handling - Updated all tests to pass RunInfo.xml parameter - All 9 tests passing This enhances BAM metadata quality for downstream analysis while maintaining backward compatibility (runinfo is optional).

dpark01 and others added 9 commits November 8, 2025 16:09

Merge branch 'master' into ct-swiftseq-demux-integration

a09b050

dpark01 changed the title ~~WIP: Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122)~~ Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) Nov 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) #123

Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) #123

Uh oh!

dpark01 commented Nov 8, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) #123

Are you sure you want to change the base?

Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) #123

Uh oh!

Conversation

dpark01 commented Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New Architecture

Two-Step Workflow

Benefits

Implementation Status

✅ Phase 1: Shared Utilities - COMPLETE (31 tests passing)

✅ Phase 2: Test Infrastructure - COMPLETE (16 tests)

✅ Phase 3: Implementation - COMPLETE

Part A: illumina_metadata Entry Point ✅

Part B: Refactored Existing Functions ✅

Part C: splitcode_demux_fastqs Entry Point ✅

⬜ Phase 4: Documentation & Validation - TODO

Key Implementation Details

Spec Change: Removed Barcode Report Files

Outer Barcode Extraction

Read Pair Counting

Test Data with Realistic Outliers

Test Results

Files Changed

Modified Files

New Files

Next Steps

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dpark01 commented Nov 8, 2025 •

edited

Loading

Part A: `illumina_metadata` Entry Point ✅

Part C: `splitcode_demux_fastqs` Entry Point ✅