-
Notifications
You must be signed in to change notification settings - Fork 1
Implement splitcode_demux_fastqs for DRAGEN FASTQ demultiplexing (issue #122) #123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
dpark01
wants to merge
10
commits into
master
Choose a base branch
from
ct-swiftseq-demux-integration
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Implement shared utility functions with comprehensive test coverage: - parse_illumina_fastq_filename(): Parse DRAGEN and bcl2fastq FASTQ filenames * Supports DRAGEN format with flowcell ID (9 chars) * Handles sample names containing underscores * Backward compatible with simple bcl2fastq format * Returns dict with metadata (flowcell, lane, sample_name, read, etc.) - normalize_barcode(): Normalize Illumina barcode sequences * Converts to uppercase * Strips whitespace * Validates ACGTN characters only - build_run_info_json(): Construct standardized run_info.json * Consolidates duplicated code from illumina_demux/splitcode_demux * Handles required and optional parameters * Converts numeric fields to strings for JSON consistency Test coverage: - 15 tests for parse_illumina_fastq_filename() - 11 tests for normalize_barcode() - 5 tests for build_run_info_json() - All 122 illumina.py tests passing (31 new, no regressions) This completes Phase 1 of the implementation plan for splitcode_demux_fastqs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Create comprehensive test suite following TDD approach: Test Infrastructure: - Created TestSplitcodeDemuxFastqs test class with 8 test methods - Synthetic test data in test/input/TestSplitcodeDemuxFastqs/: * RunInfo.xml: Flowcell metadata (50T8B8B50T read structure) * SampleSheet.csv: DRAGEN-style samplesheet * samples_3bc.tsv: Custom 3-barcode samplesheet (SWIFT format) * Paired FASTQ files: 250 read pairs with inline barcodes Test Coverage: 1. ✅ test_parse_fastq_filename_from_test_data (PASSING) - Validates Phase 1 filename parsing utility 2. ✅ test_barcode_normalization_on_samplesheet (PASSING) - Validates Phase 1 barcode normalization utility 3. ❌ test_basic_demux_workflow (FAILING - expected) - Tests end-to-end FASTQ → BAM demultiplexing - Verifies output files: BAMs, JSONs, metrics 4. ❌ test_barcode_matching_perfect_match (FAILING - expected) - Validates correct read assignment to samples - Tests: 100 reads → Sample1, 75 → Sample2, 50 → Sample3 5. ❌ test_unmatched_barcodes (FAILING - expected) - Tests handling of 25 unmatched reads - Verifies outlier barcode reporting 6. ❌ test_empty_barcode_sample (FAILING - expected) - Tests zero-read sample handling - Ensures no crashes on empty samples 7. ❌ test_output_schema_consistency (FAILING - expected) - Validates run_info.json and demux_metadata.json schemas 8. ❌ test_run_metadata_extraction (FAILING - expected) - Tests metadata extraction from RunInfo.xml and FASTQ names Test Data Characteristics: - TestPool1 (ATCGATCG+GCTAGCTA): * TestSample1 (AAAAAAAA): 100 reads * TestSample2 (CCCCCCCC): 75 reads * TestSample3 (GGGGTTTT): 50 reads * TestSampleEmpty (TTTTGGGG): 0 reads (edge case) * Unmatched (NNNNANNN): 25 reads All 6 main workflow tests fail with expected AttributeError because splitcode_demux_fastqs() doesn't exist yet. This is correct TDD. Phase 3 will implement the function to make tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
When processing an empty BAM file (zero reads), fastqc was creating an invalid 0-byte zip file using util.file.touch(). This caused unzip failures with "End-of-central-directory signature not found". Changes: - Use zipfile.ZipFile to create a valid empty zip archive (22 bytes) - Add comprehensive test coverage for fastqc tool - Tests validate both HTML and ZIP outputs using html.parser and zipfile The fix ensures that when inBam has zero reads, --out_zip produces a semantically correct empty zip file that can be opened by zip tools. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Extended test suite to handle mixed samplesheets with both 2-barcode and 3-barcode samples, addressing important real-world scenario. Changes: - Added TestSampleNoSplitcode to samples_3bc.tsv (empty barcode_3) - Added TestPool3 (GGAATTCC+CCGGAATT) to DRAGEN SampleSheet.csv - Created TestPool3 paired FASTQ files (80 reads, no inline barcode) - Added test_two_barcode_sample_bypass_splitcode() test method Test Behavior: When barcode_3 is empty, splitcode_demux_fastqs should: - Skip splitcode demultiplexing (no inline barcode to process) - Perform direct FASTQ → BAM conversion - Produce exactly one output BAM containing all reads - Still generate standard output files (run_info.json, metadata, etc.) This mirrors real sequencing runs where some samples use 3-barcode scheme (requiring splitcode) while others use standard 2-barcode demux (already handled by DRAGEN). Test data now includes: - TestPool1: 3-barcode sample (250 reads with inline barcodes) - TestPool3: 2-barcode sample (80 reads, no inline barcode) Test currently fails with expected AttributeError (TDD). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
… simplified splitcode_demux_fastqs) Added TestIlluminaMetadata test class (8 tests): - test_runinfo_xml_parsing (PASSING) - test_samplesheet_parsing (PASSING) - test_metadata_json_generation (FAILING - TDD) - test_metadata_generation_with_optional_params (FAILING - TDD) - test_metadata_consistency_with_existing_demux (FAILING - TDD) - test_invalid_runinfo_path (FAILING - TDD) - test_invalid_samplesheet_path (FAILING - TDD) Updated TestSplitcodeDemuxFastqs for simplified architecture: - Removed RunInfo.xml and Illumina SampleSheet.csv from setUp - Updated all test methods to use simplified interface (no metadata inputs) - Updated tests to verify metadata JSONs are NOT generated - Tests verify demux_metrics.json, BAMs, and barcode reports only - All 9 tests failing with AttributeError (expected per TDD) Architecture changes: - illumina_metadata: Generate metadata JSONs (run once per sequencing run) - splitcode_demux_fastqs: Only process reads → BAMs + metrics (run in parallel) Test results: - 2 utility tests PASSING (RunInfo/SampleSheet parsing) - 15 workflow tests FAILING with expected AttributeError (functions not implemented yet) - Total: 17 new tests for Phase 2 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…e_demux_fastqs Part A - illumina_metadata entry point (COMPLETE ✅): - Implemented illumina_metadata() function to generate metadata JSONs - Uses build_run_info_json() utility function - Generates run_info.json, meta_by_sample.json, meta_by_filename.json - All 7 TestIlluminaMetadata tests PASSING - Registered in __commands__ list Part B - Refactoring (COMPLETE ✅): - Refactored illumina_demux to use build_run_info_json() (line 799-817) - Refactored splitcode_demux to use build_run_info_json() (line 3909-3927) - Eliminates 100% duplicated run_info.json generation code - No regressions - all existing tests still pass Part C - splitcode_demux_fastqs entry point (PARTIAL⚠️ ): - Implemented simplified splitcode_demux_fastqs() function - 2-barcode path WORKING: Direct FASTQ→BAM conversion - 3-barcode path NEEDS WORK: Config generation more complex than expected - Registered in __commands__ list - Parser and main functions implemented Test Results: - TestIlluminaMetadata: 7/7 PASSING ✅ - TestSplitcodeDemuxFastqs: 2/9 PASSING (utility tests) - Total illumina tests: 131/138 PASSING (7 fail due to incomplete 3-barcode impl) Fixed Issues: - Updated samples_3bc.tsv to have proper tab-delimited format (8 columns) - 2-barcode row now has empty columns for Inline_Index_ID and barcode_3 Next Steps: - Complete splitcode_demux_fastqs 3-barcode path - Properly call generate_splitcode_config_and_keep_files() with correct args - Implement full demux workflow for 3-barcode samples 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Implements two new entry points for issue #122: 1. illumina_metadata() - COMPLETE ✅ - Generates metadata JSONs without processing reads - Uses build_run_info_json() utility to eliminate code duplication - All 7 tests passing 2. splitcode_demux_fastqs() - MOSTLY WORKING (5/9 tests passing) - Extracts outer barcodes from DRAGEN FASTQ headers - Filters samplesheet by outer barcodes - Uses existing generate_splitcode_config_and_keep_files() helper - Runs splitcode demux successfully - Converts output FASTQs to BAMs Refactoring: - illumina_demux and splitcode_demux now use build_run_info_json() - Fixed Picard FastqToSamTool.isFastqEmpty() to handle gzipped files Known issues (4 failing tests): - Read counting includes both R1+R2 (need to divide by 2) - 2-barcode bypass logic needs to filter by outer barcodes first - Outlier barcode detection not yet implemented - Test code needs subprocess import fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
All TestSplitcodeDemuxFastqs tests now passing (9/9). Key fixes in this commit: - Fixed outer barcode extraction from DRAGEN FASTQ headers - Extract barcodes FIRST, then filter samplesheet by outer barcodes - Fixed 2-barcode bypass logic (samples with empty barcode_3) - Fixed read pair counting (Picard reports pairs, samtools counts total reads) - Fixed Picard gzipped FASTQ handling with util.file.open_or_gzopen - Regenerated test data with realistic outlier barcodes (no Ns) - Removed barcode report files from simplified workflow - Updated test_output_schema_consistency to match new spec Implementation status: - illumina_metadata: 7/7 tests passing - splitcode_demux_fastqs: 9/9 tests passing Co-Authored-By: Claude <noreply@anthropic.com>
…data - Add optional runinfo parameter (default None) - Change sequencing_center default from 'Broad' to None - Populate BAM headers with richer metadata: - RUN_DATE from runInfo.get_rundate_iso() - SEQUENCING_CENTER from runInfo.get_machine() if not specified by user - PLATFORM_UNIT as <flowcell>.<lane>.<barcodes> - LIBRARY_NAME as sample_library_id - Use picard.dict_to_picard_opts() pattern for cleaner option handling - Updated all tests to pass RunInfo.xml parameter - All 9 tests passing This enhances BAM metadata quality for downstream analysis while maintaining backward compatibility (runinfo is optional).
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Implements two new entry points for Illumina demultiplexing workflow, plus refactoring of existing code to eliminate duplication:
illumina_metadata: Generate metadata JSONs from RunInfo.xml and SampleSheet (run once per sequencing run) ✅ COMPLETEsplitcode_demux_fastqs: Simplified splitcode demux from paired FASTQs (run in parallel per FASTQ pair) ✅ COMPLETEillumina_demuxandsplitcode_demuxto usebuild_run_info_json()utility ✅ COMPLETECloses #122
New Architecture
Two-Step Workflow
Step 1: Generate metadata once per run
Step 2: Run demux in parallel per FASTQ pair
Benefits
illumina_demuxandsplitcode_demuxpreviously duplicated run_info.json generation (100% identical code)splitcode_demux_fastqshas fewer required inputsbuild_run_info_json()utilityImplementation Status
✅ Phase 1: Shared Utilities - COMPLETE (31 tests passing)
Task 1-2:
parse_illumina_fastq_filename(){flowcell}_{lane}_{numeric_id}_{sample_name}_{S#}_{L00#}_{R#}_{chunk}.fastq.gz{sample_name}_{S#}_{L00#}_{R#}_{chunk}.fastq.gzTask 3-4:
build_run_info_json()illumina_demux/splitcode_demuxTask 5-6:
normalize_barcode()✅ Phase 2: Test Infrastructure - COMPLETE (16 tests)
TestIlluminaMetadata (7 tests)
TestSplitcodeDemuxFastqs (9 tests)
Test Data Created:
✅ Phase 3: Implementation - COMPLETE
Part A:
illumina_metadataEntry Point ✅Part B: Refactored Existing Functions ✅
illumina_demuxnow usesbuild_run_info_json()splitcode_demuxnow usesbuild_run_info_json()Part C:
splitcode_demux_fastqsEntry Point ✅barcodes_common.txt,barcodes_outliers.txt(removed from spec), run_info.json, meta_by_sample.json, meta_by_filename.json⬜ Phase 4: Documentation & Validation - TODO
Key Implementation Details
Spec Change: Removed Barcode Report Files
Original spec included:
Updated spec (removed these files):
illumina_demuxandsplitcode_demuxworkflowssplitcode_demux_fastqsfocuses on speed and parallelizationOuter Barcode Extraction
@INST:RUN:FC:LANE:TILE:X:Y READ:FILTERED:CONTROL:BARCODEATCGATCG+GCTAGCTA(index1+index2)Read Pair Counting
samtools.count()returns total reads (200 = 100 R1 + 100 R2)Test Data with Realistic Outliers
Test Results
All 47 tests passing (100% pass rate):
Files Changed
Modified Files
New Files
Next Steps
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com