Ultra-high throughput processing for 10x Genomics Flex single-cell sequencing.
cyto is a fast, memory-efficient processor for 10x Genomics Flex single-cell RNA sequencing data, designed specifically for production-scale analysis. It handles:
- Gene expression profiling from FFPE samples and fresh tissue
- Highly multiplexed experiments (16-plex Flex-V1)
- CRISPR perturbation screens (Perturb-seq) with efficient guide assignment
- Probe-based multiplexing for clinical and archived samples
cyto achieves dramatic performance improvements through algorithmic innovations optimized for Flex's fixed sequence geometry, making previously prohibitive experiments computationally feasible.
- Ultra-fast processing: Processes 320k-cell datasets in minutes rather than hours
- Memory efficient: Runs on smaller cloud instances with reduced resource requirements
- Highly accurate: 99.85% concordance with standard CellRanger outputs, identical cell clustering
- Modular architecture: Independent, composable tools for flexible workflows
- Production-ready: Built for atlas-scale projects and genome-wide screens
- BINSEQ support: Efficient binary format for highly parallel sequence parsing
- Compact IBU format: Binary Index-Barcode-UMI storage for efficient read processing
Note: This crate uses SIMD instructions for improved performance. To take advantage of them on your machine, set the following environment variable before compiling:
Install via cargo:
export RUSTFLAGS="-C target-cpu=native"
cargo install cyto
Or from source:
git clone https://github.com/arcinstitute/cyto
cd cyto
# install with cargo
export RUSTFLAGS="-C target-cpu=native"
cargo install --path crates/cyto
# or with just
just install
Process Flex gene expression data with probe demultiplexing:
cyto workflow gex \
-c gene_probes.tsv \
-w cell_barcode_whitelist.txt \
-p probe_barcodes.txt \
-o output_dir \
sample.vbq
Process Perturb-seq data with guide assignment:
cyto workflow crispr \
-c guide_library.tsv \
-w cell_barcode_whitelist.txt \
-p probe_barcodes.txt \
-o output_dir \
sample.vbq
Both workflows automatically handle:
- Read mapping to features
- Barcode correction
- UMI deduplication
- Molecule counting
- Guide assignment (CRISPR mode)
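To make the middle steps concrete, here is a toy sketch of barcode correction, UMI deduplication, and molecule counting in Python. It is not cyto's implementation: the one-mismatch correction rule, the exact-match UMI collapsing, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def correct_barcode(barcode, whitelist):
    """Return the barcode if whitelisted, else the unique whitelist entry
    exactly one substitution away, else None."""
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, base in enumerate(barcode):
        for alt in "ACGT":
            if alt != base:
                candidate = barcode[:i] + alt + barcode[i + 1:]
                if candidate in whitelist:
                    candidates.add(candidate)
    # Only correct when exactly one whitelist entry is reachable.
    return candidates.pop() if len(candidates) == 1 else None


def count_molecules(records, whitelist):
    """records: iterable of (barcode, umi, feature) tuples.
    Returns {(barcode, feature): number of distinct UMIs}."""
    seen = set()
    for barcode, umi, feature in records:
        fixed = correct_barcode(barcode, whitelist)
        if fixed is not None:
            # Adding to a set collapses reads with the same barcode/UMI/feature.
            seen.add((fixed, umi, feature))
    counts = defaultdict(int)
    for barcode, _umi, feature in seen:
        counts[(barcode, feature)] += 1
    return dict(counts)


whitelist = {"AAAA", "CCCC"}
records = [
    ("AAAA", "GT", "geneA"),
    ("AAAA", "GT", "geneA"),  # duplicate read of the same molecule
    ("AAAT", "CA", "geneA"),  # one mismatch away from AAAA
    ("CCCC", "GT", "geneB"),
    ("GGGG", "GT", "geneB"),  # no whitelist entry within one mismatch
]
counts = count_molecules(records, whitelist)
print(counts)
```

The sketch only collapses identical UMIs; it does not attempt the UMI correction that cyto performs as a separate step.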
Workflows generate organized outputs:
output_dir/
├── metadata/
│ └── features.tsv # Feature index
├── stats/
│ └── mapping.json # Mapping statistics
├── ibu/
│ ├── probe1.sort.ibu # Processed IBU files
│ └── probe2.sort.ibu # (one per probe)
└── counts/
├── probe1.counts.tsv # Count matrices
└── probe2.counts.tsv # (one per probe)
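Downstream scripts often need to pair each probe's count file with its IBU file. The sketch below walks the layout shown above using only the Python standard library; `list_probe_outputs` is a hypothetical helper, and the demo builds a throwaway directory with empty files rather than reading real cyto output.

```python
import tempfile
from pathlib import Path


def list_probe_outputs(output_dir):
    """Map probe name -> (sorted IBU path or None, counts path)."""
    root = Path(output_dir)
    outputs = {}
    for counts_path in sorted((root / "counts").glob("*.counts.tsv")):
        probe = counts_path.name[: -len(".counts.tsv")]
        ibu_path = root / "ibu" / f"{probe}.sort.ibu"
        outputs[probe] = (ibu_path if ibu_path.exists() else None, counts_path)
    return outputs


# Demo: mimic the documented directory layout with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub in ("metadata", "stats", "ibu", "counts"):
        (root / sub).mkdir()
    for probe in ("probe1", "probe2"):
        (root / "ibu" / f"{probe}.sort.ibu").touch()
        (root / "counts" / f"{probe}.counts.tsv").touch()
    found = list_probe_outputs(root)
    probes = sorted(found)
    all_have_ibu = all(ibu is not None for ibu, _counts in found.values())
    print(probes)
```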
Gene Expression (-c flag) - 3-column TSV, no header:
ENSG00000000003 TSPAN6 ACGTACGTACGTACGT
ENSG00000000005 TNMD TGCATGCATGCATGCA
Columns: Gene ID | Gene Name | Probe Sequence
CRISPR Guides (-c flag) - 3-column TSV, no header:
gene1_guide1 GGGGCCCC ACGTACGTACGTACGTACGT
gene1_guide2 GGGGCCCC TGCATGCATGCATGCATGCA
Columns: Guide Name | Anchor Sequence | Protospacer Sequence
For multiplexed experiments (-p flag) - 3-column TSV, no header:
ACGTACGT BC001 ProbeSet1
TGCATGCA BC002 ProbeSet2
Columns: True Sequence | Alias | Probe Name
Note: Probe sequences should match those provided by 10x Genomics for your specific chemistry.
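Malformed input tables are easier to catch before a long run. The following sketch checks the three layouts described above: three tab-separated fields per row, with DNA in the sequence columns. The helper name and the rule that sequences contain only A/C/G/T are assumptions, not checks that cyto itself is documented to perform.

```python
import csv
import io

DNA = set("ACGT")


def validate_three_column_tsv(handle, sequence_columns):
    """Return a list of error strings for rows that do not have exactly
    3 tab-separated fields or whose sequence columns (0-based indices)
    contain characters other than A/C/G/T."""
    errors = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if len(row) != 3:
            errors.append(f"line {line_no}: expected 3 columns, found {len(row)}")
            continue
        for col in sequence_columns:
            if not row[col] or set(row[col].upper()) - DNA:
                errors.append(f"line {line_no}: column {col + 1} is not a DNA sequence")
    return errors


gex = "ENSG00000000003\tTSPAN6\tACGTACGTACGTACGT\nENSG00000000005\tTNMD\tTGCATGCATGCATGCA\n"
guides = "gene1_guide1\tGGGGCCCC\tACGTACGTACGTACGTACGT\n"
bad_probes = "ACGTACGT\tBC001\nACGTNNGT\tBC002\tProbeSet2\n"

gex_errors = validate_three_column_tsv(io.StringIO(gex), sequence_columns=[2])
guide_errors = validate_three_column_tsv(io.StringIO(guides), sequence_columns=[1, 2])
probe_errors = validate_three_column_tsv(io.StringIO(bad_probes), sequence_columns=[0])
print(probe_errors)
```

For real files, open them with `open(path, newline="")` and pass the handle in place of the `io.StringIO` objects.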
Standard 10x barcode whitelist (-w flag):
# Example: 737K barcode list for GEM-X
-w 737K-fixed-rna-profiling.txt.gz
cyto accepts both FASTQ and BINSEQ formats:
# BINSEQ (recommended - faster parsing)
cyto workflow gex -c probes.tsv -w whitelist.txt sample.vbq
# FASTQ paired-end
cyto workflow gex -c probes.tsv -w whitelist.txt sample_R1.fastq.gz sample_R2.fastq.gz
If you have many sequence files that should be processed as a single input, you can pass them all on the command line:
# BINSEQ
cyto workflow gex -c probes.tsv -w whitelist.txt *.vbq
# FASTQ paired-end
cyto workflow gex -c probes.tsv -w whitelist.txt *.fastq.gz
Note: Currently supports Flex-V1 (16-plex). Flex-V2 (364-plex) support coming soon.
cyto has some support for alternative sequence geometries in each mode.
This is useful for custom experimental designs that differ from the original 10x sequence structure.
R1: [barcode][umi]
R2: [gex-probe][spacer][flex-probe][...]
cyto allows you to adjust the spacer length using the --spacer flag as well as the barcode (--barcode) and umi (--umi) lengths.
R1: [barcode][umi]
R2: [...][flex-probe][lookback][anchor][protospacer][...]
cyto allows you to adjust the lookback length using the --lookback flag, as well as the anchor offset using the --offset flag.
The offset is the number of bases between the start of the sequence and the start of the anchor.
The lookback is the number of bases between the start of the anchor and the end of the flex-probe.
The barcode and umi lengths can be adjusted using the --barcode and --umi flags, respectively.
Note: If you're unsure about the offset or lookback for your library, we suggest doing a quick check using bqtools grep with one of your anchor sequences and one of your flex-probe sequences:
bqtools grep <input.vbq> <anchor_sequence> <flex_probe_sequence>
This will highlight the anchor and flex-probe in your sequences on the command line, so you can count the bases between them (the lookback) and identify the start of the anchor sequence (the offset).
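The same count can be scripted once you have a few R2 reads as plain strings. This sketch applies the definitions above: offset is the index where the anchor starts, and lookback is the gap between the end of the flex-probe and the start of the anchor. `measure_geometry` is a hypothetical helper and the read is made up for illustration.

```python
def measure_geometry(read, anchor, flex_probe):
    """Return (offset, lookback) for one read, or None if the flex-probe
    is missing or no anchor follows it."""
    probe_start = read.find(flex_probe)
    if probe_start < 0:
        return None
    probe_end = probe_start + len(flex_probe)
    # Search for the anchor only downstream of the flex-probe.
    anchor_start = read.find(anchor, probe_end)
    if anchor_start < 0:
        return None
    offset = anchor_start                # bases before the anchor starts
    lookback = anchor_start - probe_end  # bases between flex-probe and anchor
    return offset, lookback


# Made-up read: 2 leading bases, flex-probe, 3-base gap, anchor, protospacer
read = "TT" + "TGCATGCA" + "CAG" + "GGGGCCCC" + "ACGTACGTACGTACGTACGT"
geometry = measure_geometry(read, anchor="GGGGCCCC", flex_probe="TGCATGCA")
missing = measure_geometry(read, anchor="GGGGCCCC", flex_probe="AAAAAAAA")
print(geometry)
```

Running this over a handful of reads and checking that the values agree is a reasonable sanity check before setting --offset and --lookback.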
For advanced users, cyto exposes individual processing steps:
# 1. Map reads to features
cyto map gex -c probes.tsv -p probe_barcodes.txt -o map_out sample.vbq
# 2. Sort IBU files
cyto ibu sort -i map_out/ibu/probe1.ibu -o probe1.sorted.ibu
# 3. Correct cell barcodes
cyto ibu barcode -i probe1.sorted.ibu -w whitelist.txt -o probe1.corrected.ibu
# 4. Correct UMIs
cyto ibu umi -i probe1.corrected.ibu -o probe1.umi.ibu
# 5. Count molecules
cyto ibu count -i probe1.umi.ibu -f map_out/metadata/features.tsv -o counts.tsv
This modular design allows:
- Custom processing pipelines
- Integration with orchestration tools (Snakemake, Nextflow)
- Independent scaling of pipeline components
- Checkpointing and resumption
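As a sketch of how the post-mapping steps could be driven from a script or an orchestration rule, the helper below builds the argument lists for one probe. It uses only the subcommands and flags shown in the steps above; the function names and the per-probe output file names are hypothetical choices for this example.

```python
import shlex
import subprocess


def build_probe_steps(probe, map_dir="map_out", whitelist="whitelist.txt"):
    """Return the sort/barcode/umi/count commands for one probe as argv lists."""
    raw = f"{map_dir}/ibu/{probe}.ibu"
    sorted_ibu = f"{probe}.sorted.ibu"
    corrected = f"{probe}.corrected.ibu"
    umi = f"{probe}.umi.ibu"
    features = f"{map_dir}/metadata/features.tsv"
    return [
        ["cyto", "ibu", "sort", "-i", raw, "-o", sorted_ibu],
        ["cyto", "ibu", "barcode", "-i", sorted_ibu, "-w", whitelist, "-o", corrected],
        ["cyto", "ibu", "umi", "-i", corrected, "-o", umi],
        ["cyto", "ibu", "count", "-i", umi, "-f", features, "-o", f"{probe}.counts.tsv"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


steps = build_probe_steps("probe1")
run_steps(steps)  # dry run: prints the commands without executing them
```

Because each step writes its own file, a failed run can be resumed by skipping the commands whose outputs already exist.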
Control parallelization with -T:
# Use all available cores
cyto workflow gex -c probes.tsv -w whitelist.txt -T0 sample.vbq
# Use specific number of threads
cyto workflow gex -c probes.tsv -w whitelist.txt -T32 sample.vbq
# Single-threaded (minimal resources)
cyto workflow gex -c probes.tsv -w whitelist.txt -T1 sample.vbq
Default: All available threads
Tab-separated sparse matrix:
barcode feature count
ACGTACGT ENSG00000000003 5
ACGTACGT ENSG00000000005 12
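A minimal way to read this long-format table without extra dependencies is sketched below. It assumes three tab-separated fields per row and treats any row whose third field is not an integer (such as a `barcode feature count` header line, if one is present) as something to skip; `read_counts` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def read_counts(handle):
    """Parse barcode/feature/count rows into {barcode: {feature: count}}."""
    matrix = defaultdict(dict)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip header or malformed rows instead of raising an error.
        if len(row) != 3 or not row[2].isdigit():
            continue
        barcode, feature, count = row
        matrix[barcode][feature] = int(count)
    return dict(matrix)


example = (
    "barcode\tfeature\tcount\n"
    "ACGTACGT\tENSG00000000003\t5\n"
    "ACGTACGT\tENSG00000000005\t12\n"
)
matrix = read_counts(io.StringIO(example))
total = sum(matrix["ACGTACGT"].values())
print(matrix, total)
```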
For downstream analysis with scanpy/Seurat:
cyto ibu count -i sample.ibu -f features.tsv -o counts_mtx --format mtx
Generates:
- matrix.mtx - Sparse count matrix
- barcodes.tsv - Cell barcodes
- features.tsv - Feature names
Use pycyto utilities for format conversion and aggregation:
# Convert MTX to h5ad
pycyto mtx-to-h5ad counts_mtx/ output.h5ad
# Aggregate cyto output into a single h5ad per sample
pycyto aggregate <config>.json <cyto_output_dir> <aggr_dir>
The CRISPR workflow includes automatic guide assignment using the geomux algorithm, which:
- Scales linearly with data sparsity (not total dimensions)
- Handles multi-guide perturbations
- Works on unfiltered cells (no pre-filtering needed)
- Performs hypergeometric testing with FDR correction
Guide assignments are included in the count matrix output.
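To illustrate what hypergeometric testing with FDR correction can look like, here is a small standard-library sketch. It is not geomux's implementation: the null model (a cell's guide UMIs drawn at random from the pooled guide UMIs of all cells), the Benjamini-Hochberg procedure, the 0.05 threshold, and every function name are assumptions made for this example.

```python
from math import comb


def hypergeom_sf(k, M, K, N):
    """P(X >= k) for X ~ Hypergeometric(population M, successes K, draws N)."""
    upper = min(K, N)
    total = comb(M, N)
    # math.comb returns 0 when more items are requested than available.
    return sum(comb(K, i) * comb(M - K, N - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def assign_guides(cell_counts, guide_totals, alpha=0.05):
    """cell_counts: {guide: UMIs in this cell};
    guide_totals: {guide: UMIs across all cells}."""
    M = sum(guide_totals.values())
    N = sum(cell_counts.values())
    guides = sorted(guide_totals)
    pvals = [hypergeom_sf(cell_counts.get(g, 0), M, guide_totals[g], N) for g in guides]
    qvals = benjamini_hochberg(pvals)
    return [g for g, q in zip(guides, qvals) if q < alpha]


guide_totals = {"g1": 100, "g2": 100, "g3": 100}
cell = {"g1": 20, "g2": 1}  # g1 dominates this cell; g2 looks like background
assigned = assign_guides(cell, guide_totals)
print(assigned)
```

In this toy cell only g1 is enriched far beyond its share of the pool, so it is the only guide that passes the threshold.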
cyto is optimized for:
- Fixed-geometry protocols: Flex libraries with predetermined sequence structures
- Multiplexed datasets: Efficient probe demultiplexing at scale
- Large-scale screens: Million-cell perturbation experiments
cyto is not designed for:
- Splice-aware alignment (use STAR, kallisto|bustools, Alevin-fry)
- Transcript discovery or quantification
- Variable read architectures
- Full-length transcript sequencing
All components are available under the MIT license:
- cyto: https://github.com/arcinstitute/cyto
- pycyto utilities: https://github.com/arcinstitute/pycyto
- geomux: https://github.com/noamteyssier/geomux
- cell-filter: https://github.com/arcinstitute/cell-filter
- IBU format: https://github.com/noamteyssier/ibu
Rust packages on crates.io | Python packages on PyPI
If you use cyto in your research, please cite our bioRxiv preprint:
Teyssier, N. and Dobin, A. (2025). cyto: ultra high-throughput processing
of 10x-flex single cell sequencing. bioRxiv.
- Issues: https://github.com/arcinstitute/cyto/issues
- Documentation: See --help for any command
- Examples: See justfile for complete workflows
Developed at Arc Institute with support for computational resources.