SeqRush is a prototype pangenome graph construction tool inspired by seqwish. Planned features include lock-free union-find structures and WFA2-based alignments.
SeqRush builds pangenome graphs by:
- Planned: perform all-vs-all pairwise alignments using WFA2 (Wavefront Alignment)
- Planned: use a lock-free union-find data structure to merge matching positions
- Construct a graph where sequences are embedded as paths
The design aims to leverage UFRush (lock-free union-find) for true parallel graph construction once implemented.
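As a rough illustration of the merging step, the sketch below shows a plain sequential union-find in Rust. It is not UFRush and it is not lock-free; the `UnionFind` type and its method names are hypothetical and chosen only for this example.

```rust
/// Sequential union-find over base positions.
/// Illustrative only: this is NOT UFRush and NOT lock-free.
pub struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    /// One element per base position, each initially in its own set.
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Find the representative of `x`, halving the path along the way.
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            self.parent[x] = self.parent[self.parent[x]];
            x = self.parent[x];
        }
        x
    }

    /// Merge the sets containing `a` and `b`; the smaller root index wins.
    pub fn unite(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            let (lo, hi) = if ra < rb { (ra, rb) } else { (rb, ra) };
            self.parent[hi] = lo;
        }
    }
}
```

Calling `unite(p, q)` for every pair of matched positions leaves each set of equivalent bases with one representative, which can later become a graph node. A lock-free variant would replace the `Vec<usize>` with atomics and compare-and-swap loops.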
# Create a test FASTA file
cat > test.fasta << EOF
>seq1
ATCGATCGATCGATCG
>seq2
ATCGATGGATCGATCG
>seq3
ATCGATCGATCGATGG
EOF
# Build the project with CLI support
cargo build --release --features cli
# Build the pangenome graph using flags
seqrush -s test.fasta -o test.gfa
# View the output
cat test.gfa
If you build without --features cli, use positional arguments instead:
seqrush test.fasta test.gfa
- Planned: Lock-free Parallel Processing via UFRush
- Planned: Memory Efficient alignment using WFA2's UltraLow mode
- Planned: All-vs-All Alignment of all input sequences
- Configurable Parameters: Alignment scoring, minimum match length
- Standard GFA Output: Compatible with tools like odgi and vg
- Path Integrity: Each input sequence is stored as a path and is meant to be exactly reconstructible from the graph (see the known limitation on complex indel patterns)
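The path integrity property can be checked mechanically. The sketch below assumes nodes hold a single base and paths are lists of 1-based node ids; `reconstruct` and `path_is_intact` are hypothetical helper names, not functions from the SeqRush crate.

```rust
/// Rebuild a sequence by concatenating node bases along a path of 1-based node ids.
/// Returns `None` if the path references a node that does not exist.
pub fn reconstruct(nodes: &[char], path: &[usize]) -> Option<String> {
    path.iter()
        .map(|&id| id.checked_sub(1).and_then(|i| nodes.get(i)).copied())
        .collect()
}

/// True when walking `path` spells exactly `original`.
pub fn path_is_intact(nodes: &[char], path: &[usize], original: &str) -> bool {
    reconstruct(nodes, path).as_deref() == Some(original)
}
```

Running this check for every input sequence after graph construction is a cheap way to detect the reconstruction failures mentioned under the limitations.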
- Rust via rustup (install the version pinned in rust-toolchain.toml or provide it via an offline setup)
- Git
git clone https://github.com/KristopherKubicki/seqrush.git
cd seqrush
# rustup will automatically install the toolchain defined in `rust-toolchain.toml`
cargo build --release
To use the CLI flags, build the binary with the cli feature enabled:
cargo build --release --features cli
The binary will be available at target/release/seqrush.
seqrush -s sequences.fasta -o graph.gfa
Enable the optional cli feature to use command-line flags.
# -s  Input FASTA file
# -o  Output GFA file
# -t  Number of threads (default: 1)
# -k  Minimum match length (default: 15)
# -S  Alignment scores: match,mismatch,gap1_open,gap1_extend,gap2_open,gap2_extend
# -v  Verbose output
seqrush \
  -s sequences.fasta \
  -o graph.gfa \
  -t 8 \
  -k 15 \
  -S "0,5,8,2,24,1" \
  -v
The -S/--scores parameter accepts comma-separated values:
# Two-piece affine gap model (default)
-S "0,5,8,2,24,1" # match=0, mismatch=5, gap1_open=8, gap1_extend=2, gap2_open=24, gap2_extend=1
# Single affine gap model
-S "0,5,8,2" # match=0, mismatch=5, gap_open=8, gap_extend=2
# Custom scoring for high similarity sequences
-S "0,4,6,1" # More permissive scoring
Note: Two-piece affine gap support requires a compatible version of the WFA2 library.
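To make the two accepted forms concrete, here is a minimal sketch of how such a score string could be parsed. The `Scores` struct and `parse_scores` function are hypothetical names for this example and may not match the actual SeqRush code.

```rust
/// Alignment penalties parsed from a comma-separated string (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Scores {
    pub match_score: i32,
    pub mismatch: i32,
    pub gap1_open: i32,
    pub gap1_extend: i32,
    /// Second gap piece (open, extend); `None` for the single affine model.
    pub gap2: Option<(i32, i32)>,
}

/// Parse either the four-value single affine form or the six-value two-piece form.
pub fn parse_scores(s: &str) -> Result<Scores, String> {
    let vals = s
        .split(',')
        .map(|v| v.trim().parse::<i32>().map_err(|e| format!("bad score '{}': {}", v, e)))
        .collect::<Result<Vec<i32>, String>>()?;
    match vals.as_slice() {
        [m, x, o, e] => Ok(Scores {
            match_score: *m,
            mismatch: *x,
            gap1_open: *o,
            gap1_extend: *e,
            gap2: None,
        }),
        [m, x, o1, e1, o2, e2] => Ok(Scores {
            match_score: *m,
            mismatch: *x,
            gap1_open: *o1,
            gap1_extend: *e1,
            gap2: Some((*o2, *e2)),
        }),
        _ => Err(format!("expected 4 or 6 values, got {}", vals.len())),
    }
}
```

Keeping the second gap piece optional lets one type cover both models, so downstream code can fall back to single affine scoring when `gap2` is `None`.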
# Build a pangenome graph
seqrush -s genomes.fasta -o pangenome.gfa
# Visualize with odgi
odgi build -g pangenome.gfa -o pangenome.og
odgi viz -i pangenome.og -o pangenome.png
# Check graph statistics
odgi stats -i pangenome.og -S
- Load Sequences: Read FASTA file and assign global positions to each base
- Planned: initialize a UFRush instance with one element per base
- Planned: align all sequence pairs using WFA2
- Planned: process matches ≥ min_match_length and unite positions
- Build Graph: Walk sequences to identify nodes and edges
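The first and last steps above can be sketched without any alignment code. The example assumes ASCII sequences and takes the set representative of each global position as a plain slice (`rep`), which in practice would come from the union-find; `global_offsets` and `build_graph` are hypothetical names for this illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// Start offset of each sequence in a single global coordinate space.
pub fn global_offsets(seqs: &[&str]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(seqs.len());
    let mut next = 0;
    for s in seqs {
        offsets.push(next);
        next += s.len();
    }
    offsets
}

/// Walk every sequence and turn set representatives into node ids.
/// `rep[p]` is the representative of global position `p`.
/// Returns (node bases, one path of 1-based node ids per sequence, sorted unique edges).
pub fn build_graph(
    seqs: &[&str],
    rep: &[usize],
) -> (Vec<char>, Vec<Vec<usize>>, Vec<(usize, usize)>) {
    let offsets = global_offsets(seqs);
    let mut node_of: HashMap<usize, usize> = HashMap::new();
    let mut nodes = Vec::new();
    let mut paths = Vec::new();
    let mut edges = BTreeSet::new();
    for (s, &off) in seqs.iter().zip(&offsets) {
        let mut path = Vec::with_capacity(s.len());
        for (i, base) in s.chars().enumerate() {
            // The first time a representative is seen, it becomes a new node.
            let id = *node_of.entry(rep[off + i]).or_insert_with(|| {
                nodes.push(base);
                nodes.len()
            });
            // Consecutive steps of a path define an edge.
            if let Some(&prev) = path.last() {
                edges.insert((prev, id));
            }
            path.push(id);
        }
        paths.push(path);
    }
    (nodes, paths, edges.into_iter().collect())
}
```

For the sequences `ACG` and `ATG` with their first and last bases merged, this yields four nodes (`A`, `C`, `G`, `T`), the paths `[1, 2, 3]` and `[1, 4, 3]`, and a small bubble around the `C`/`T` difference.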
- Planned: CIGAR Processing for WFA2's fine-grained output
- Planned: Match Accumulation across CIGAR operations
- Planned: Base Verification, since a CIGAR 'M' operation may also cover a mismatch
- Path Construction: Each sequence becomes a path through deduplicated nodes
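The planned CIGAR handling can be sketched as follows. This is an assumption-laden illustration, not the SeqRush implementation: the CIGAR is given as already-parsed `(length, op)` pairs, 'I' is assumed to consume only the query and 'D' only the target, and `match_runs` is a hypothetical name. Runs of verified matches accumulate across adjacent operations and are kept only when they reach the minimum length.

```rust
/// Close the current run, keeping it only if it is long enough.
fn flush(run: &mut usize, q: usize, t: usize, min_len: usize, out: &mut Vec<(usize, usize, usize)>) {
    if *run > 0 && *run >= min_len {
        out.push((q - *run, t - *run, *run));
    }
    *run = 0;
}

/// Collect exact-match runs of at least `min_len` bases from an alignment.
/// Returns (query_start, target_start, length) for each kept run.
/// Panics if the CIGAR walks past the end of either sequence.
pub fn match_runs(
    query: &[u8],
    target: &[u8],
    cigar: &[(usize, char)],
    min_len: usize,
) -> Vec<(usize, usize, usize)> {
    let (mut q, mut t) = (0usize, 0usize);
    let mut run = 0usize; // length of the current run of verified matches
    let mut runs = Vec::new();
    for &(len, op) in cigar {
        match op {
            'M' | '=' | 'X' => {
                for _ in 0..len {
                    // Verify the bases: 'M' can hide a mismatch.
                    if op != 'X' && query[q] == target[t] {
                        run += 1;
                    } else {
                        flush(&mut run, q, t, min_len, &mut runs);
                    }
                    q += 1;
                    t += 1;
                }
            }
            'I' => {
                flush(&mut run, q, t, min_len, &mut runs);
                q += len; // assumption: insertion consumes query only
            }
            'D' => {
                flush(&mut run, q, t, min_len, &mut runs);
                t += len; // assumption: deletion consumes target only
            }
            _ => {} // other operations are ignored in this sketch
        }
    }
    flush(&mut run, q, t, min_len, &mut runs);
    runs
}
```

Each returned run maps directly onto a series of unite calls: for a run `(qs, ts, n)`, position `qs + i` of the query is merged with position `ts + i` of the target for every `i` below `n`, after adding each sequence's global offset.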
- Time Complexity: O(n²×L) for n sequences of length L (pairwise alignment)
- Space Complexity: O(N) where N is total sequence length
- Parallel Scaling: Expected to be near-linear with thread count for the alignment phase, since pairwise alignments are independent of each other
SeqRush generates GFA 1.0 format with:
- H: Header with version
- S: Segments (nodes) with single-character sequences
- P: Paths representing input sequences
- L: Links between adjacent nodes in paths
Example output:
H VN:Z:1.0
S 1 A
S 2 C
S 3 G
P seq1 1+,2+,3+ *
L 1 + 2 + 0M
L 2 + 3 + 0M
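A minimal sketch of a writer that produces this layout is shown below. It assumes single-base nodes with 1-based ids, forward-only traversals, and tab-separated fields as GFA 1.0 requires; `to_gfa` is a hypothetical name and not necessarily how SeqRush serializes its graph.

```rust
use std::fmt::Write;

/// Render a graph as GFA 1.0 text: header, one S line per node, P lines, then L lines.
/// Node ids are 1-based indices into `nodes`; all traversals are forward ('+').
pub fn to_gfa(nodes: &[char], paths: &[(&str, Vec<usize>)], edges: &[(usize, usize)]) -> String {
    let mut out = String::from("H\tVN:Z:1.0\n");
    for (i, base) in nodes.iter().enumerate() {
        writeln!(out, "S\t{}\t{}", i + 1, base).unwrap();
    }
    for (name, steps) in paths {
        // A path is a comma-separated list of oriented node ids.
        let walk: Vec<String> = steps.iter().map(|id| format!("{}+", id)).collect();
        writeln!(out, "P\t{}\t{}\t*", name, walk.join(",")).unwrap();
    }
    for (from, to) in edges {
        // 0M: adjacent nodes do not overlap.
        writeln!(out, "L\t{}\t+\t{}\t+\t0M", from, to).unwrap();
    }
    out
}
```

Calling it with nodes `A`, `C`, `G`, the path `seq1` over ids 1, 2, 3, and edges (1, 2) and (2, 3) reproduces the example output above, with tabs between fields.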
# Run all tests
cargo test --features cli
# Run with verbose output
cargo test -- --nocapture
# Run a specific test
cargo test run_seqrush_writes_output
cargo doc --open
seqrush/
├── src/
│ ├── lib.rs # Library interface
│ └── main.rs # CLI binary
├── tests/
│ └── integration_tests.rs
├── Cargo.toml
└── README.md
- Input sequences must fit in memory
- Currently builds the entire graph at once (no streaming)
- Single-character nodes (no compaction)
- Limited to DNA sequences (ACGT alphabet)
- Path integrity verification may fail for some sequences with complex indel patterns. The graph structure is correct, but path reconstruction needs improvement for certain edge cases.
If you use SeqRush in your research, please cite:
SeqRush: Lock-free parallel pangenome graph construction
Kristopher Kubicki, 2025
https://github.com/KristopherKubicki/seqrush
This project is licensed under the MIT License. See LICENSE for details.
SeqRush is inspired by seqwish. Planned future work:
- Integrate WFA2 for efficient pairwise alignments
- Implement lock-free union-find (UFRush) for parallel graph construction
- Support streaming graph output for large datasets
- Extend sequence alphabet beyond DNA