feat(rpbp): add 7 rpbp modules + fasta_gtf_bam_rpbp subworkflow#11695

Draft: pinin4fjords wants to merge 6 commits into master from rpbp-add-modules-and-subworkflows

pinin4fjords commented May 19, 2026

Adds 7 rpbp/* modules and the fasta_gtf_bam_rpbp subworkflow that wraps Rp-Bp's translated-ORF caller chain (Malone et al. 2017, doi:10.1093/nar/gkw1141). Ported from nf-core/riboseq#174.

Design — bypass the umbrella

Rp-Bp's prepare-rpbp-genome bundles three independent steps: bowtie2-build (rRNA index), STAR --runMode genomeGenerate (alignment index), and get_orfs (the BED prep used by all downstream tools). The 6 per-sample tools we wrap don't consume the bowtie2 or STAR indices, because upstream alignment is supplied as the BAM. rpbp/preparegenome therefore invokes get_orfs directly via a small Python wrapper, skipping the unused index builds entirely. Real runs take ~3 minutes on chr20 and avoid the multi-minute STAR build the umbrella would trigger.

The per-sample chain similarly avoids predict-translated-orfs and create-orf-profiles, both of which internally call flexbar + bowtie2 + STAR on raw FASTQs (re-doing the upstream alignment). Splitting the 6 internal steps into separate modules also gives independent caching on resume.

Modules

  • rpbp/preparegenome — chains gtf-to-bed12 → extract-bed-sequences → extract-orf-coordinates → split-bed12-blocks → label-orfs via the internal get_orfs function.
  • rpbp/extractmetageneprofiles — per-read-length metagene profiles around annotated start codons.
  • rpbp/estimatemetagenebayesfactors — Bayes factors comparing periodic vs non-periodic models per read length.
  • rpbp/selectperiodicoffsets — pick one P-site offset per high-quality read length.
  • rpbp/extractorfprofiles — applies the metagene-length filter (replicating ribo_utils.utils.get_periodic_lengths_and_offsets) and runs extract-orf-profiles. Filter thresholds via ext.args2 (4 space-separated tokens; defaults mirror rpbp.defaults.metagene_options).
  • rpbp/estimateorfbayesfactors — Bayesian translated-vs-untranslated model per ORF.
  • rpbp/selectfinalpredictionset — apply BF/length/overlap rules and emit the final predicted-ORF BED, DNA FASTA, protein FASTA.
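
As a rough illustration of the ext.args2 convention described for rpbp/extractorfprofiles, the four space-separated tokens could be parsed as sketched here. The field names and their order are assumptions made for this example (the PR only states that there are four tokens and that the defaults mirror rpbp.defaults.metagene_options); the module's actual parsing may differ.

```python
from typing import NamedTuple, Optional


class MetageneThresholds(NamedTuple):
    # Hypothetical field names and ordering, chosen for illustration only.
    min_profile_count: Optional[float]
    min_bf_mean: Optional[float]
    max_bf_var: Optional[float]
    min_bf_likelihood: Optional[float]


def parse_thresholds(args2: str) -> MetageneThresholds:
    """Parse four space-separated tokens; the literal 'None' means 'no bound'."""
    tokens = args2.split()
    if len(tokens) != 4:
        raise ValueError(f"expected 4 tokens, got {len(tokens)}: {args2!r}")
    values = [None if tok == "None" else float(tok) for tok in tokens]
    return MetageneThresholds(*values)


if __name__ == "__main__":
    # The value used by the test configs in this PR.
    print(parse_thresholds("10 1 None 0.0"))
```

Treating the literal string None as "no bound" keeps the whole setting expressible as a single ext.args2 string.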

Subworkflow

fasta_gtf_bam_rpbp — end-to-end run: rpbp/preparegenome (once per invocation) followed by the per-sample 6-step chain (extract-metagene → metagene BF → select offsets → extract-orf-profiles → orf BF → select final).
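
The shape of the subworkflow (one shared rpbp/preparegenome call, then six steps per sample) can be sketched as a plain invocation plan. This is only a counting aid: in Nextflow the steps are scheduled by dataflow and samples run in parallel, and the `plan_invocations` helper and the `step:sample` labels are hypothetical.

```python
from typing import List

# Per-sample steps in the order given in the PR description.
PER_SAMPLE_STEPS = [
    "rpbp/extractmetageneprofiles",
    "rpbp/estimatemetagenebayesfactors",
    "rpbp/selectperiodicoffsets",
    "rpbp/extractorfprofiles",
    "rpbp/estimateorfbayesfactors",
    "rpbp/selectfinalpredictionset",
]


def plan_invocations(sample_ids: List[str]) -> List[str]:
    """List process invocations: preparegenome once, then 6 steps per sample."""
    plan = ["rpbp/preparegenome"]
    for sample in sample_ids:
        plan.extend(f"{step}:{sample}" for step in PER_SAMPLE_STEPS)
    return plan


if __name__ == "__main__":
    for entry in plan_invocations(["sample_a", "sample_b"]):
        print(entry)
```

Under this sketch a single sample yields seven invocations, and each additional sample adds six more while reusing the same prepared annotation.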

Container

All rpbp/* modules share a Wave-built container co-installing bioconda::rpbp=4.0.1 and bioconda::star=2.7.11b:

  • docker: community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb
  • singularity: https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/3a/3a8aa95ce76934f6269b2d8cbdd3d57c13db029c704152975b2315e35b7a2154/data

Versions emitted via topic: versions.

Test plan

Real + stub tests pass for each of the 7 modules and the subworkflow under nf-core modules test --profile docker / nf-core subworkflows test --profile docker. Test fixtures (chr20) live in nf-core/test-datasets:modules under genomics/homo_sapiens/riboseq_expression/ — same chr20 FASTA / GTF / BAMs the other riboseq modules use.

The 3 modules whose chain extends through extract-orf-profiles (extractorfprofiles, estimateorfbayesfactors, selectfinalpredictionset) and the subworkflow itself carry a tests/nextflow.config setting ext.args2 = '10 1 None 0.0' so chr20's modest per-length read counts survive the metagene filter and produce non-empty profiles.

Source: nf-core/riboseq#174.

…_rpbp subworkflows

Adds the Rp-Bp Ribo-seq ORF caller (Malone et al. 2017, doi:10.1093/nar/gkw1141) as
8 split modules plus 2 orchestration subworkflows ported from nf-core/riboseq#174.
Splitting per-tool (rather than wrapping rpbp's `predict-translated-orfs` umbrella
command) gives independent caching on resume and lets the pipeline's own STAR
alignment run instead of being re-done inside rpbp.

Modules:
- rpbp/buildconfig: render the Rp-Bp YAML config from pipeline-supplied fasta+gtf.
- rpbp/preparegenome: build the Rp-Bp genome index (STAR index, ribosomal index,
  ORF BEDs).
- rpbp/extractmetageneprofiles: per-read-length metagene profiles around starts.
- rpbp/estimatemetagenebayesfactors: periodicity Bayes factors per read length.
- rpbp/selectperiodicoffsets: pick a single P-site offset per high-quality length.
- rpbp/extractorfprofiles: per-ORF P-site profile matrix using selected offsets.
- rpbp/estimateorfbayesfactors: Bayesian translated-vs-untranslated model.
- rpbp/selectfinalpredictionset: filter to the final predicted-ORF BED + FASTAs.

Subworkflows:
- bam_rpbp_predictorfs: per-sample 6-step chain on a Ribo-seq BAM with the
  cohort-shared annotation outputs supplied by the caller.
- fasta_gtf_bam_rpbp: top-level end-to-end run; renders config, prepares the
  index once, then runs the per-sample chain.

Containers: Wave-built `community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb`
(co-installs `bioconda::rpbp=4.0.1` + `bioconda::star=2.7.11b`) for all rpbp tools;
`quay.io/biocontainers/coreutils:9.5` for rpbp/buildconfig (config templating only).
Versions are emitted via the topic-channel pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…across the board

Bypass `prepare-rpbp-genome` umbrella; call rpbp's internal `get_orfs`
function directly via a small inline Python wrapper. Skips the umbrella
script's bowtie2-build (rRNA index) and STAR genome-generate steps
entirely - none of the downstream Rp-Bp tools we wrap consume those
indices, and upstream alignment is supplied as the BAM.

rpbp/buildconfig removed - the config dict is now built in-memory inside
rpbp/preparegenome from the input fasta + gtf paths. fasta_gtf_bam_rpbp
loses the corresponding setup step.
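
A minimal sketch of what building the config in memory might look like, assuming the dict carries the keys Rp-Bp's YAML config normally uses for the genome (`fasta`, `gtf`, `genome_base_path`, `genome_name`). The `build_rpbp_config` helper is hypothetical, and the call into `get_orfs` is deliberately omitted because its signature is not shown in this PR.

```python
from pathlib import Path
from typing import Dict


def build_rpbp_config(
    fasta: str, gtf: str, out_dir: str, genome_name: str = "genome"
) -> Dict[str, str]:
    """Assemble a config dict from input paths instead of rendering a YAML file."""
    return {
        # Absolute paths, so the result does not depend on the working directory.
        "fasta": str(Path(fasta).resolve()),
        "gtf": str(Path(gtf).resolve()),
        "genome_base_path": str(Path(out_dir).resolve()),
        "genome_name": genome_name,
    }


if __name__ == "__main__":
    config = build_rpbp_config("genome.fa", "genes.gtf", "rpbp_genome", "chr20")
    for key, value in config.items():
        print(f"{key}: {value}")
```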

extractorfprofiles replicates rpbp's metagene-length filter
(`ribo_utils.utils.get_periodic_lengths_and_offsets`) inline before
calling extract-orf-profiles, so the upstream select-periodic-offsets
output can drive --lengths/--offsets without going through rpbp's
filename-driven config plumbing. Filter thresholds are exposed via
ext.args2 (4 space-separated tokens, defaults match
rpbp.defaults.metagene_options).
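
The inline filter could look roughly like the sketch here: keep the read lengths that pass the thresholds, then turn the surviving lengths and offsets into --lengths/--offsets arguments. The row keys, the inclusive comparisons, and the space-separated argument layout are all assumptions for illustration; the real logic lives in `ribo_utils.utils.get_periodic_lengths_and_offsets` and may differ in detail.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def periodic_lengths_and_offsets(
    rows: Iterable[Dict[str, float]],
    min_profile_count: float,
    min_bf_mean: float,
    max_bf_var: Optional[float],
    min_bf_likelihood: float,
) -> Tuple[List[int], List[int]]:
    """Return (lengths, offsets) for read lengths that pass every threshold."""
    lengths: List[int] = []
    offsets: List[int] = []
    for row in rows:  # hypothetical keys, one row per read length
        if row["profile_sum"] < min_profile_count:
            continue
        if row["bf_mean"] < min_bf_mean:
            continue
        if max_bf_var is not None and row["bf_var"] > max_bf_var:
            continue
        if row["bf_likelihood"] < min_bf_likelihood:
            continue
        lengths.append(int(row["length"]))
        offsets.append(int(row["offset"]))
    return lengths, offsets


def to_cli_args(lengths: List[int], offsets: List[int]) -> List[str]:
    """Build --lengths/--offsets arguments; empty when nothing passed the filter."""
    if not lengths:
        return []
    return ["--lengths", *map(str, lengths), "--offsets", *map(str, offsets)]
```

With strict thresholds a small test BAM can leave no read lengths at all, which is the situation the lowered test setting `'10 1 None 0.0'` is meant to avoid.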

All 7 inner-module tests now run the upstream chain on real chr20 data
(no stub-only tests). Subworkflow tests cover bam_rpbp_predictorfs and
fasta_gtf_bam_rpbp end-to-end. The 3 inner modules that need the lower
threshold (extractorfprofiles / estimateorfbayesfactors /
selectfinalpredictionset) carry tests/nextflow.config setting
ext.args2 = '10 1 None 0.0' so the chr20 BAM produces non-empty profiles.

Wave container unchanged
(community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords changed the title from "feat(rpbp): add 8 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows" to "feat(rpbp): add 7 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows" on May 19, 2026
pinin4fjords and others added 3 commits May 19, 2026 17:57
The per-sample 6-step chain was only ever wrapped by fasta_gtf_bam_rpbp;
keeping it as a standalone subworkflow added a thin layer no realistic
caller needs (rpbp's BED outputs only come from preparegenome). Inlined
into fasta_gtf_bam_rpbp — now one subworkflow with 7 process invocations
end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords changed the title from "feat(rpbp): add 7 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows" to "feat(rpbp): add 7 rpbp modules + fasta_gtf_bam_rpbp subworkflow" on May 19, 2026
…nnel: column

Cleans up the file:
- Drops the multi-line header narrative recapping design choices and step ordering.
- Drops the numbered "1. ... 2. ... 3. ..." inline narrative blocks.
- Aligns the `// channel:` annotations in the emit block (12 of 13 lines
  were one column off the longest emit's reference position).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>