Skip to content

Conversation

@dpark01
Copy link
Member

@dpark01 dpark01 commented Nov 8, 2025

Summary

Implements two new entry points for Illumina demultiplexing workflow, plus refactoring of existing code to eliminate duplication:

  1. illumina_metadata: Generate metadata JSONs from RunInfo.xml and SampleSheet (run once per sequencing run) ✅ COMPLETE
  2. splitcode_demux_fastqs: Simplified splitcode demux from paired FASTQs (run in parallel per FASTQ pair) ✅ COMPLETE
  3. Refactoring: Update illumina_demux and splitcode_demux to use build_run_info_json() utility ✅ COMPLETE

Closes #122


New Architecture

Two-Step Workflow

Step 1: Generate metadata once per run

illumina_metadata \
  --runinfo RunInfo.xml \
  --samplesheet SampleSheet.csv \
  --lane 1 \
  --out_runinfo run_info.json \
  --out_meta_by_sample meta_by_sample.json \
  --out_meta_by_filename meta_by_filename.json

Step 2: Run demux in parallel per FASTQ pair

for pool in Pool1 Pool2 Pool3 Pool4; do
  splitcode_demux_fastqs \
    --inFastq1 ${pool}_R1.fastq.gz \
    --inFastq2 ${pool}_R2.fastq.gz \
    --sampleSheet samples_3bc.tsv \
    --outDir demux_out/${pool} &
done
wait

Benefits

  • Eliminates duplication: Both illumina_demux and splitcode_demux previously duplicated run_info.json generation (100% identical code)
  • Enables parallelization: Metadata generated once, then many demux jobs run simultaneously
  • Simplifies interface: splitcode_demux_fastqs has fewer required inputs
  • Uses existing code: Leverages already-implemented build_run_info_json() utility

Implementation Status

✅ Phase 1: Shared Utilities - COMPLETE (31 tests passing)

Task 1-2: parse_illumina_fastq_filename()

  • 15 tests covering DRAGEN and bcl2fastq formats
  • Supports DRAGEN format: {flowcell}_{lane}_{numeric_id}_{sample_name}_{S#}_{L00#}_{R#}_{chunk}.fastq.gz
  • Supports simple format: {sample_name}_{S#}_{L00#}_{R#}_{chunk}.fastq.gz
  • Location: illumina.py:49-138

Task 3-4: build_run_info_json()

  • 5 tests for schema validation, optional parameters, type conversions
  • Consolidates duplicated code from illumina_demux/splitcode_demux
  • Standardized schema for all demux outputs
  • Location: illumina.py:190-266

Task 5-6: normalize_barcode()

  • 11 tests for case conversion, whitespace handling, validation
  • Uppercase conversion, whitespace trimming, ACGTN validation
  • Location: illumina.py:141-187

✅ Phase 2: Test Infrastructure - COMPLETE (16 tests)

TestIlluminaMetadata (7 tests)

TestSplitcodeDemuxFastqs (9 tests)

Test Data Created:


✅ Phase 3: Implementation - COMPLETE

Part A: illumina_metadata Entry Point ✅

  • Status: 7/7 tests passing
  • Parses RunInfo.xml and SampleSheet.csv
  • Generates run_info.json, meta_by_sample.json, meta_by_filename.json
  • No read processing - pure metadata extraction
  • Location: illumina.py:269-367

Part B: Refactored Existing Functions ✅

  • illumina_demux now uses build_run_info_json()
  • splitcode_demux now uses build_run_info_json()
  • All existing tests still pass
  • No functional changes - pure refactoring

Part C: splitcode_demux_fastqs Entry Point ✅

  • Status: 9/9 tests passing
  • Accepts paired FASTQs and custom 3-barcode samplesheet
  • Simplified interface: No RunInfo.xml or Illumina SampleSheet.csv required
  • Extracts outer barcodes from DRAGEN FASTQ headers
  • Filters samplesheet to matching outer barcodes
  • Supports mixed 2-barcode/3-barcode workflows:
    • 3-barcode (barcode_3 present): Run splitcode demux
    • 2-barcode (barcode_3 empty): Skip splitcode, direct FASTQ→BAM
  • Outputs: Per-sample BAMs + demux_metrics.json
  • Does NOT output: barcodes_common.txt, barcodes_outliers.txt (removed from spec), run_info.json, meta_by_sample.json, meta_by_filename.json
  • Location: illumina.py:470-778

⬜ Phase 4: Documentation & Validation - TODO

  • Update command-line documentation
  • Final validation with CI
  • Code review

Key Implementation Details

Spec Change: Removed Barcode Report Files

Original spec included:

  • barcodes_common.txt
  • barcodes_outliers.txt

Updated spec (removed these files):

  • Splitcode's summary JSON doesn't report actual barcode sequences of unmatched reads
  • Parsing FASTQs to extract outlier barcodes would be too slow
  • These comprehensive reports are available in illumina_demux and splitcode_demux workflows
  • splitcode_demux_fastqs focuses on speed and parallelization

Outer Barcode Extraction

  • DRAGEN FASTQ headers include barcode info: @INST:RUN:FC:LANE:TILE:X:Y READ:FILTERED:CONTROL:BARCODE
  • Barcode field format: ATCGATCG+GCTAGCTA (index1+index2)
  • Extract from first read header, then filter samplesheet to matching rows
  • Critical fix: Extract barcodes BEFORE filtering, not after

Read Pair Counting

  • Picard reports read pairs (100)
  • samtools.count() returns total reads (200 = 100 R1 + 100 R2)
  • Implementation divides by 2 to report consistent read pair counts

Test Data with Realistic Outliers

  • Replaced placeholder "NNNNANNN" with realistic outlier barcodes:
    • GGAATTTT (hamming distance 2 from GGGGTTTT)
    • CCCCAAAA (different from all samples)
    • ATATAGAG (completely different)
  • Tests proper handling of unmatched reads outside hamming distance threshold

Test Results

All 47 tests passing (100% pass rate):

  • ✅ Phase 1: 31 utility tests
  • ✅ Phase 2: 7 illumina_metadata tests
  • ✅ Phase 2: 9 splitcode_demux_fastqs tests

Files Changed

Modified Files

  • illumina.py: +500 lines (3 utility functions + 2 new entry points + refactoring)
  • test/unit/test_illumina.py: +563 lines (47 new tests)

New Files

  • test/input/TestSplitcodeDemuxFastqs/RunInfo.xml
  • test/input/TestSplitcodeDemuxFastqs/SampleSheet.csv
  • test/input/TestSplitcodeDemuxFastqs/samples_3bc.tsv
  • test/input/TestSplitcodeDemuxFastqs/TestPool1_S1_L001_R1_001.fastq.gz (3-barcode)
  • test/input/TestSplitcodeDemuxFastqs/TestPool1_S1_L001_R2_001.fastq.gz (3-barcode)
  • test/input/TestSplitcodeDemuxFastqs/TestPool3_S3_L001_R1_001.fastq.gz (2-barcode)
  • test/input/TestSplitcodeDemuxFastqs/TestPool3_S3_L001_R2_001.fastq.gz (2-barcode)

Next Steps

  • Final code review
  • Update documentation
  • Validate CI passes
  • Merge to master

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

dpark01 and others added 9 commits November 8, 2025 16:09
Implement shared utility functions with comprehensive test coverage:

- parse_illumina_fastq_filename(): Parse DRAGEN and bcl2fastq FASTQ filenames
  * Supports DRAGEN format with flowcell ID (9 chars)
  * Handles sample names containing underscores
  * Backward compatible with simple bcl2fastq format
  * Returns dict with metadata (flowcell, lane, sample_name, read, etc.)

- normalize_barcode(): Normalize Illumina barcode sequences
  * Converts to uppercase
  * Strips whitespace
  * Validates ACGTN characters only

- build_run_info_json(): Construct standardized run_info.json
  * Consolidates duplicated code from illumina_demux/splitcode_demux
  * Handles required and optional parameters
  * Converts numeric fields to strings for JSON consistency

Test coverage:
- 15 tests for parse_illumina_fastq_filename()
- 11 tests for normalize_barcode()
- 5 tests for build_run_info_json()
- All 122 illumina.py tests passing (31 new, no regressions)

This completes Phase 1 of the implementation plan for splitcode_demux_fastqs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Create comprehensive test suite following TDD approach:

Test Infrastructure:
- Created TestSplitcodeDemuxFastqs test class with 8 test methods
- Synthetic test data in test/input/TestSplitcodeDemuxFastqs/:
  * RunInfo.xml: Flowcell metadata (50T8B8B50T read structure)
  * SampleSheet.csv: DRAGEN-style samplesheet
  * samples_3bc.tsv: Custom 3-barcode samplesheet (SWIFT format)
  * Paired FASTQ files: 250 read pairs with inline barcodes

Test Coverage:
1. ✅ test_parse_fastq_filename_from_test_data (PASSING)
   - Validates Phase 1 filename parsing utility

2. ✅ test_barcode_normalization_on_samplesheet (PASSING)
   - Validates Phase 1 barcode normalization utility

3. ❌ test_basic_demux_workflow (FAILING - expected)
   - Tests end-to-end FASTQ → BAM demultiplexing
   - Verifies output files: BAMs, JSONs, metrics

4. ❌ test_barcode_matching_perfect_match (FAILING - expected)
   - Validates correct read assignment to samples
   - Tests: 100 reads → Sample1, 75 → Sample2, 50 → Sample3

5. ❌ test_unmatched_barcodes (FAILING - expected)
   - Tests handling of 25 unmatched reads
   - Verifies outlier barcode reporting

6. ❌ test_empty_barcode_sample (FAILING - expected)
   - Tests zero-read sample handling
   - Ensures no crashes on empty samples

7. ❌ test_output_schema_consistency (FAILING - expected)
   - Validates run_info.json and demux_metadata.json schemas

8. ❌ test_run_metadata_extraction (FAILING - expected)
   - Tests metadata extraction from RunInfo.xml and FASTQ names

Test Data Characteristics:
- TestPool1 (ATCGATCG+GCTAGCTA):
  * TestSample1 (AAAAAAAA): 100 reads
  * TestSample2 (CCCCCCCC): 75 reads
  * TestSample3 (GGGGTTTT): 50 reads
  * TestSampleEmpty (TTTTGGGG): 0 reads (edge case)
  * Unmatched (NNNNANNN): 25 reads

All 6 main workflow tests fail with expected AttributeError because
splitcode_demux_fastqs() doesn't exist yet. This is correct TDD.

Phase 3 will implement the function to make tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
When processing an empty BAM file (zero reads), fastqc was creating
an invalid 0-byte zip file using util.file.touch(). This caused
unzip failures with "End-of-central-directory signature not found".

Changes:
- Use zipfile.ZipFile to create a valid empty zip archive (22 bytes)
- Add comprehensive test coverage for fastqc tool
- Tests validate both HTML and ZIP outputs using html.parser and zipfile

The fix ensures that when inBam has zero reads, --out_zip produces a
semantically correct empty zip file that can be opened by zip tools.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Extended test suite to handle mixed samplesheets with both 2-barcode
and 3-barcode samples, addressing important real-world scenario.

Changes:
- Added TestSampleNoSplitcode to samples_3bc.tsv (empty barcode_3)
- Added TestPool3 (GGAATTCC+CCGGAATT) to DRAGEN SampleSheet.csv
- Created TestPool3 paired FASTQ files (80 reads, no inline barcode)
- Added test_two_barcode_sample_bypass_splitcode() test method

Test Behavior:
When barcode_3 is empty, splitcode_demux_fastqs should:
- Skip splitcode demultiplexing (no inline barcode to process)
- Perform direct FASTQ → BAM conversion
- Produce exactly one output BAM containing all reads
- Still generate standard output files (run_info.json, metadata, etc.)

This mirrors real sequencing runs where some samples use 3-barcode
scheme (requiring splitcode) while others use standard 2-barcode
demux (already handled by DRAGEN).

Test data now includes:
- TestPool1: 3-barcode sample (250 reads with inline barcodes)
- TestPool3: 2-barcode sample (80 reads, no inline barcode)

Test currently fails with expected AttributeError (TDD).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
… simplified splitcode_demux_fastqs)

Added TestIlluminaMetadata test class (8 tests):
- test_runinfo_xml_parsing (PASSING)
- test_samplesheet_parsing (PASSING)
- test_metadata_json_generation (FAILING - TDD)
- test_metadata_generation_with_optional_params (FAILING - TDD)
- test_metadata_consistency_with_existing_demux (FAILING - TDD)
- test_invalid_runinfo_path (FAILING - TDD)
- test_invalid_samplesheet_path (FAILING - TDD)

Updated TestSplitcodeDemuxFastqs for simplified architecture:
- Removed RunInfo.xml and Illumina SampleSheet.csv from setUp
- Updated all test methods to use simplified interface (no metadata inputs)
- Updated tests to verify metadata JSONs are NOT generated
- Tests verify demux_metrics.json, BAMs, and barcode reports only
- All 9 tests failing with AttributeError (expected per TDD)

Architecture changes:
- illumina_metadata: Generate metadata JSONs (run once per sequencing run)
- splitcode_demux_fastqs: Only process reads → BAMs + metrics (run in parallel)

Test results:
- 2 utility tests PASSING (RunInfo/SampleSheet parsing)
- 15 workflow tests FAILING with expected AttributeError (functions not implemented yet)
- Total: 17 new tests for Phase 2

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…e_demux_fastqs

Part A - illumina_metadata entry point (COMPLETE ✅):
- Implemented illumina_metadata() function to generate metadata JSONs
- Uses build_run_info_json() utility function
- Generates run_info.json, meta_by_sample.json, meta_by_filename.json
- All 7 TestIlluminaMetadata tests PASSING
- Registered in __commands__ list

Part B - Refactoring (COMPLETE ✅):
- Refactored illumina_demux to use build_run_info_json() (line 799-817)
- Refactored splitcode_demux to use build_run_info_json() (line 3909-3927)
- Eliminates 100% duplicated run_info.json generation code
- No regressions - all existing tests still pass

Part C - splitcode_demux_fastqs entry point (PARTIAL ⚠️):
- Implemented simplified splitcode_demux_fastqs() function
- 2-barcode path WORKING: Direct FASTQ→BAM conversion
- 3-barcode path NEEDS WORK: Config generation more complex than expected
- Registered in __commands__ list
- Parser and main functions implemented

Test Results:
- TestIlluminaMetadata: 7/7 PASSING ✅
- TestSplitcodeDemuxFastqs: 2/9 PASSING (utility tests)
- Total illumina tests: 131/138 PASSING (7 fail due to incomplete 3-barcode impl)

Fixed Issues:
- Updated samples_3bc.tsv to have proper tab-delimited format (8 columns)
- 2-barcode row now has empty columns for Inline_Index_ID and barcode_3

Next Steps:
- Complete splitcode_demux_fastqs 3-barcode path
- Properly call generate_splitcode_config_and_keep_files() with correct args
- Implement full demux workflow for 3-barcode samples

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements two new entry points for issue #122:

1. illumina_metadata() - COMPLETE ✅
   - Generates metadata JSONs without processing reads
   - Uses build_run_info_json() utility to eliminate code duplication
   - All 7 tests passing

2. splitcode_demux_fastqs() - MOSTLY WORKING (5/9 tests passing)
   - Extracts outer barcodes from DRAGEN FASTQ headers
   - Filters samplesheet by outer barcodes
   - Uses existing generate_splitcode_config_and_keep_files() helper
   - Runs splitcode demux successfully
   - Converts output FASTQs to BAMs

Refactoring:
- illumina_demux and splitcode_demux now use build_run_info_json()
- Fixed Picard FastqToSamTool.isFastqEmpty() to handle gzipped files

Known issues (4 failing tests):
- Read counting includes both R1+R2 (need to divide by 2)
- 2-barcode bypass logic needs to filter by outer barcodes first
- Outlier barcode detection not yet implemented
- Test code needs subprocess import fix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
All TestSplitcodeDemuxFastqs tests now passing (9/9).

Key fixes in this commit:
- Fixed outer barcode extraction from DRAGEN FASTQ headers
- Extract barcodes FIRST, then filter samplesheet by outer barcodes
- Fixed 2-barcode bypass logic (samples with empty barcode_3)
- Fixed read pair counting (Picard reports pairs, samtools counts total reads)
- Fixed Picard gzipped FASTQ handling with util.file.open_or_gzopen
- Regenerated test data with realistic outlier barcodes (no Ns)
- Removed barcode report files from simplified workflow
- Updated test_output_schema_consistency to match new spec

Implementation status:
- illumina_metadata: 7/7 tests passing
- splitcode_demux_fastqs: 9/9 tests passing

Co-Authored-By: Claude <noreply@anthropic.com>
@dpark01 dpark01 changed the title WIP: Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) Nov 10, 2025
…data

- Add optional runinfo parameter (default None)
- Change sequencing_center default from 'Broad' to None
- Populate BAM headers with richer metadata:
  - RUN_DATE from runInfo.get_rundate_iso()
  - SEQUENCING_CENTER from runInfo.get_machine() if not specified by user
  - PLATFORM_UNIT as <flowcell>.<lane>.<barcodes>
  - LIBRARY_NAME as sample_library_id
- Use picard.dict_to_picard_opts() pattern for cleaner option handling
- Updated all tests to pass RunInfo.xml parameter
- All 9 tests passing

This enhances BAM metadata quality for downstream analysis while maintaining
backward compatibility (runinfo is optional).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement splitcode_demux_fastqs entry point for splitcode-based demultiplexing of Illumina DRAGEN paired fastq files

2 participants