Implement splitcode_demux_fastqs entry point for splitcode-based demultiplexing of Illumina DRAGEN paired fastq files #122

@dpark01

Summary

Add two new entry points in illumina.py:

  1. illumina_metadata: Generate metadata JSONs from RunInfo.xml and SampleSheet (run once per sequencing run)
  2. splitcode_demux_fastqs: Perform splitcode-driven demultiplexing of paired DRAGEN FASTQs using a custom third inline barcode (run in parallel, one job per FASTQ pair)

This separation enables efficient parallel processing by generating shared metadata once, then running multiple demux jobs simultaneously.


New Entry Point #1: illumina_metadata

Purpose

Generate metadata JSON files from Illumina run metadata files without processing reads. Run once per sequencing run to create metadata that's shared across all parallel demux jobs.

Inputs

  1. RunInfo.xml (required): Illumina run metadata
  2. SampleSheet.csv (required): Illumina/DRAGEN samplesheet
  3. Lane number (required): Lane to process
  4. Sequencing center (optional): Default "Broad"

Outputs

  1. run_info.json: Run metadata (flowcell, dates, read structure, instrument info)
  2. meta_by_sample.json: Sample metadata indexed by sample name
  3. meta_by_filename.json: Sample metadata indexed by filename/library ID
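
The two per-sample outputs carry sample metadata and differ in the key they are indexed by. A minimal sketch of that idea, assuming each samplesheet row has already been parsed into a dict; the field names `sample` and `library_id` and the example rows are hypothetical, since this issue does not give the real columns or JSON schema:

```python
import json

def index_rows(rows, key):
    """Index a list of row dicts by `key`, rejecting duplicate keys."""
    indexed = {}
    for row in rows:
        k = row[key]
        if k in indexed:
            raise ValueError(f"duplicate {key}: {k}")
        indexed[k] = row
    return indexed

# Hypothetical rows; field names are illustrative only.
rows = [
    {"sample": "TestSample1", "library_id": "TestSample1.l1", "barcode_1": "ACGTACGT"},
    {"sample": "TestSample2", "library_id": "TestSample2.l1", "barcode_1": "TGCATGCA"},
]

meta_by_sample = index_rows(rows, "sample")
meta_by_filename = index_rows(rows, "library_id")

# Serialize the same way for both files so diffs between runs stay stable.
meta_by_sample_json = json.dumps(meta_by_sample, indent=2, sort_keys=True)
meta_by_filename_json = json.dumps(meta_by_filename, indent=2, sort_keys=True)
```

Rejecting duplicate keys up front is a design choice of this sketch, not something the issue specifies; silently overwriting a row would hide samplesheet errors from every downstream demux job.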

Implementation Notes

  • Reuses existing build_run_info_json() utility function
  • Extracts duplicated metadata generation logic from illumina_demux and splitcode_demux
  • No read processing - pure metadata extraction

Status: ✅ COMPLETE - All 7 tests passing


New Entry Point #2: splitcode_demux_fastqs

Purpose

Perform splitcode-based demultiplexing directly from a single paired DRAGEN FASTQ file set, using a custom third inline barcode scheme. Designed to run in parallel across multiple FASTQ pairs.

Inputs

  1. Paired FASTQ files (R1/R2): Exactly one pair from DRAGEN output
  2. Custom 3-barcode samplesheet: TSV format defining third inline barcode sequences
    • Maps composite (index1 + index2 + inline) barcode → sample name
    • May include rows with empty barcode_3 (2-barcode samples that bypass splitcode)
  3. Output directory: Where to write BAM files and metrics

Note: RunInfo.xml and Illumina SampleSheet.csv are NOT required - metadata JSONs are generated separately via illumina_metadata.

Processing for 3-barcode samples (barcode_3 present):

  1. Parse FASTQ filenames to extract pool/sample metadata
  2. Extract outer barcodes (index1+index2) from DRAGEN FASTQ headers
  3. Filter samplesheet to matching outer barcodes
  4. Generate splitcode configuration from inline barcode definitions
  5. Run splitcode demultiplexing
  6. Convert splitcode output to per-sample unaligned BAMs
  7. Generate demux metrics
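
Step 2 can be sketched as follows, assuming the standard Illumina-style header layout in which the comment field ends in `:<index1>+<index2>`. The function names are hypothetical, and whether the real implementation reads only the first record or samples several is not stated in this issue:

```python
import gzip

def outer_barcodes_from_header(header):
    """Return (index1, index2) from an Illumina-style FASTQ header line.

    Assumes the comment field ends in ':<index1>+<index2>'
    (or ':<index1>' for single-index runs, giving index2 == '').
    """
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header: " + header)
    parts = header.rstrip().split(" ", 1)
    if len(parts) != 2:
        raise ValueError("no comment field in header: " + header)
    index_field = parts[1].rsplit(":", 1)[-1]
    index1, _, index2 = index_field.partition("+")
    return index1, index2

def outer_barcodes_from_fastq(path):
    """Read only the first header of a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return outer_barcodes_from_header(handle.readline())
```

The returned pair is what step 3 would compare against the outer barcode columns of the samplesheet.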

Processing for 2-barcode samples (barcode_3 empty):

  1. Skip splitcode demultiplexing entirely
  2. Perform direct FASTQ → BAM conversion
  3. Output exactly one BAM file (the pool itself)
  4. Generate metrics
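
The choice between the two processing paths depends only on whether barcode_3 is empty. A minimal sketch of that routing, assuming a tab-separated samplesheet with a `barcode_3` column; the other column names and the barcode values in the example are hypothetical:

```python
import csv
import io

def split_by_inline_barcode(tsv_text):
    """Return (three_bc_rows, two_bc_rows) from samplesheet TSV text."""
    three_bc, two_bc = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # A missing or whitespace-only barcode_3 means a 2-barcode sample.
        if (row.get("barcode_3") or "").strip():
            three_bc.append(row)
        else:
            two_bc.append(row)
    return three_bc, two_bc

# Illustrative samplesheet; barcode values are made up.
example_tsv = (
    "sample\tbarcode_1\tbarcode_2\tbarcode_3\n"
    "TestSample1\tACGTACGT\tTGCATGCA\tAAAAAAAA\n"
    "TestSampleNoSplitcode\tGGCCGGCC\tAATTAATT\t\n"
)
three_bc_rows, two_bc_rows = split_by_inline_barcode(example_tsv)
```

Rows in `three_bc_rows` would go through splitcode; rows in `two_bc_rows` would take the direct FASTQ to BAM path.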

Outputs

  1. Per-sample unaligned BAMs: One BAM per resolved sample
  2. demux_metrics.json: Read counts per sample, unmatched reads, etc.

Note: Does NOT output:

  • barcodes_common.txt (removed from spec - use illumina_demux for comprehensive barcode reporting)
  • barcodes_outliers.txt (removed from spec - use illumina_demux for comprehensive barcode reporting)
  • run_info.json, meta_by_sample.json, meta_by_filename.json (use illumina_metadata instead)

Status: ✅ COMPLETE - All 9 tests passing


Typical Workflow

# Step 1: Generate metadata once per run
illumina_metadata \
  --runinfo RunInfo.xml \
  --samplesheet SampleSheet.csv \
  --lane 1 \
  --out_runinfo run_info.json \
  --out_meta_by_sample meta_by_sample.json \
  --out_meta_by_filename meta_by_filename.json

# Step 2: Run demux in parallel for each pool
for pool in Pool1 Pool2 Pool3 Pool4; do
  # Quote expansions so pool names with spaces or glob characters stay intact
  splitcode_demux_fastqs \
    --inFastq1 "${pool}_R1.fastq.gz" \
    --inFastq2 "${pool}_R2.fastq.gz" \
    --sampleSheet samples_3bc.tsv \
    --outDir "demux_out/${pool}" &
done
wait
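
A bare `wait` in the shell loop above does not report whether any background job failed. As a hedged alternative, the same fan-out can be driven from Python so that every exit code is collected. The flags are the ones shown in the shell example; the helper names `build_demux_command` and `run_all` are hypothetical:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_demux_command(pool, samplesheet, out_root):
    """Build the argv for one splitcode_demux_fastqs job."""
    return [
        "splitcode_demux_fastqs",
        "--inFastq1", f"{pool}_R1.fastq.gz",
        "--inFastq2", f"{pool}_R2.fastq.gz",
        "--sampleSheet", samplesheet,
        "--outDir", f"{out_root}/{pool}",
    ]

def run_all(commands, max_workers=4):
    """Run commands concurrently; return exit codes in input order."""
    def run_one(cmd):
        return subprocess.run(cmd, check=False).returncode
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_one, commands))

pools = ["Pool1", "Pool2", "Pool3", "Pool4"]
commands = [build_demux_command(p, "samples_3bc.tsv", "demux_out") for p in pools]
# exit_codes = run_all(commands)  # run where splitcode_demux_fastqs is on PATH
```

Threads are enough here because each worker only blocks on a child process; the demultiplexing itself runs in the subprocesses.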

Refactoring Benefits

  1. Eliminates duplication: illumina_demux and splitcode_demux each carried an identical copy of the run_info.json generation code; both now call build_run_info_json() (see Phase 3)
  2. Uses existing code: Leverages already-implemented build_run_info_json() utility
  3. Enables parallelization: Metadata generated once, then many demux jobs run simultaneously
  4. Simplifies interface: splitcode_demux_fastqs has fewer required inputs
  5. Clear separation of concerns: Metadata extraction vs read processing

Implementation Status

✅ Phase 1: Shared Utilities - COMPLETE

  • parse_illumina_fastq_filename() - 15 tests passing
  • build_run_info_json() - 5 tests passing
  • normalize_barcode() - 11 tests passing
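
For orientation, filename parsing of this kind can be sketched with a regular expression. This is not the code of parse_illumina_fastq_filename(); it assumes the common Illumina naming pattern `<sample>_S<n>_L<lane>_<read>_<chunk>.fastq.gz`, and the function name and returned field names are hypothetical:

```python
import os
import re

_FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_(?P<chunk>\d{3})\.fastq(?:\.gz)?$"
)

def parse_fastq_name_sketch(path):
    """Split an Illumina-style FASTQ filename into its named parts."""
    name = os.path.basename(path)
    match = _FASTQ_NAME.match(name)
    if match is None:
        raise ValueError("unrecognized FASTQ filename: " + name)
    fields = match.groupdict()
    # Numeric parts are easier to compare as integers.
    fields["sample_number"] = int(fields["sample_number"])
    fields["lane"] = int(fields["lane"])
    return fields
```

For example, `parse_fastq_name_sketch("TestPool1_S1_L001_R1_001.fastq.gz")` yields sample `TestPool1`, lane `1`, and read `R1`. Names that omit the `_S<n>_L<lane>` part raise `ValueError` in this sketch.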

✅ Phase 2: Test Infrastructure - COMPLETE

  • TestIlluminaMetadata test class created - 7 tests
  • TestSplitcodeDemuxFastqs test class created - 9 tests
  • Test data files created (RunInfo.xml, SampleSheet.csv, FASTQs)

✅ Phase 3: Implementation - COMPLETE

  • illumina_metadata() implemented - 7/7 tests passing
  • splitcode_demux_fastqs() implemented - 9/9 tests passing
  • Refactored illumina_demux to use build_run_info_json()
  • Refactored splitcode_demux to use build_run_info_json()

⬜ Phase 4: Documentation & Validation - TODO

  • Update command-line documentation
  • Final validation with CI
  • Code review

Test Data

For illumina_metadata:

  • Synthetic RunInfo.xml with flowcell TESTFC01
  • Synthetic SampleSheet.csv in DRAGEN format (3 pools)
  • Validates output JSON schemas match existing demux outputs

For splitcode_demux_fastqs:

  • TestPool1 (3-barcode sample):

    • 100 reads: AAAAAAAA (TestSample1)
    • 75 reads: CCCCCCCC (TestSample2)
    • 50 reads: GGGGTTTT (TestSample3)
    • 0 reads: TTTTGGGG (TestSampleEmpty)
    • 25 reads: Outlier barcodes (GGAATTTT, CCCCAAAA, ATATAGAG)
  • TestPool3 (2-barcode sample):

    • 80 reads: No inline barcode (TestSampleNoSplitcode)
    • Tests bypass of splitcode for 2-barcode samples
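
The TestPool1 counts above imply the totals a metrics check can assert against. A small worked sketch, assuming the 25 outlier-barcode reads are reported as unmatched; the variable names are illustrative and do not reflect the demux_metrics.json schema:

```python
# Read counts per sample, taken from the TestPool1 description above.
expected_reads = {
    "TestSample1": 100,
    "TestSample2": 75,
    "TestSample3": 50,
    "TestSampleEmpty": 0,
}
unmatched_reads = 25  # outlier barcodes, assumed to count as unmatched

matched_total = sum(expected_reads.values())   # 225
pool_total = matched_total + unmatched_reads   # 250
matched_fraction = matched_total / pool_total  # 0.9
```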
