feat(rpbp): add 7 rpbp modules + fasta_gtf_bam_rpbp subworkflow #11695
Draft
pinin4fjords wants to merge 6 commits into
…_rpbp subworkflows

Adds the Rp-Bp Ribo-seq ORF caller (Malone et al. 2017, doi:10.1093/nar/gkw1141) as 8 split modules plus 2 orchestration subworkflows ported from nf-core/riboseq#174. Splitting per-tool (rather than wrapping rpbp's `predict-translated-orfs` umbrella command) gives independent caching on resume and lets the pipeline's own STAR alignment run instead of being re-done inside rpbp.

Modules:
- rpbp/buildconfig: render the Rp-Bp YAML config from pipeline-supplied fasta+gtf.
- rpbp/preparegenome: build the Rp-Bp genome index (STAR index, ribosomal index, ORF BEDs).
- rpbp/extractmetageneprofiles: per-read-length metagene profiles around starts.
- rpbp/estimatemetagenebayesfactors: periodicity Bayes factors per read length.
- rpbp/selectperiodicoffsets: pick a single P-site offset per high-quality length.
- rpbp/extractorfprofiles: per-ORF P-site profile matrix using selected offsets.
- rpbp/estimateorfbayesfactors: Bayesian translated-vs-untranslated model.
- rpbp/selectfinalpredictionset: filter to the final predicted-ORF BED + FASTAs.

Subworkflows:
- bam_rpbp_predictorfs: per-sample 6-step chain on a Ribo-seq BAM with the cohort-shared annotation outputs supplied by the caller.
- fasta_gtf_bam_rpbp: top-level end-to-end run; renders config, prepares the index once, then runs the per-sample chain.

Containers: Wave-built `community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb` (co-installs `bioconda::rpbp=4.0.1` + `bioconda::star=2.7.11b`) for all rpbp tools; `quay.io/biocontainers/coreutils:9.5` for rpbp/buildconfig (config templating only). Versions are emitted via the topic-channel pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…across the board

Bypass `prepare-rpbp-genome` umbrella; call rpbp's internal `get_orfs` function directly via a small inline Python wrapper. Skips the umbrella script's bowtie2-build (rRNA index) and STAR genome-generate steps entirely - none of the downstream Rp-Bp tools we wrap consume those indices, and upstream alignment is supplied as the BAM.

rpbp/buildconfig removed - the config dict is now built in-memory inside rpbp/preparegenome from the input fasta + gtf paths. fasta_gtf_bam_rpbp loses the corresponding setup step.

extractorfprofiles replicates rpbp's metagene-length filter (`ribo_utils.utils.get_periodic_lengths_and_offsets`) inline before calling extract-orf-profiles, so the upstream select-periodic-offsets output can drive --lengths/--offsets without going through rpbp's filename-driven config plumbing. Filter thresholds are exposed via ext.args2 (4 space-separated tokens, defaults match rpbp.defaults.metagene_options).

All 7 inner-module tests now run the upstream chain on real chr20 data (no stub-only tests). Subworkflow tests cover bam_rpbp_predictorfs and fasta_gtf_bam_rpbp end-to-end. The 3 inner modules that need the lower threshold (extractorfprofiles / estimateorfbayesfactors / selectfinalpredictionset) carry tests/nextflow.config setting ext.args2 = '10 1 None 0.0' so the chr20 BAM produces non-empty profiles.

Wave container unchanged (community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-sample 6-step chain was only ever wrapped by fasta_gtf_bam_rpbp; keeping it as a standalone subworkflow added a thin layer no realistic caller needs (rpbp's BED outputs only come from preparegenome). Inlined into fasta_gtf_bam_rpbp: now one subworkflow with 7 process invocations end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m/nf-core/modules into rpbp-add-modules-and-subworkflows
…nnel: column

Cleans up the file:
- Drops the multi-line header narrative recapping design choices and step ordering.
- Drops the numbered "1. ... 2. ... 3. ..." inline narrative blocks.
- Aligns the `// channel:` annotations in the emit block (12 of 13 lines were one column off the longest emit's reference position).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 7 `rpbp/*` modules and the `fasta_gtf_bam_rpbp` subworkflow that wraps Rp-Bp's translated-ORF caller chain (Malone et al. 2017, doi:10.1093/nar/gkw1141). Ported from nf-core/riboseq#174.

## Design: bypass the umbrella
Rp-Bp's `prepare-rpbp-genome` bundles three independent steps: `bowtie2-build` (rRNA index), `STAR --runMode genomeGenerate` (alignment index), and `get_orfs` (the BED prep used by all downstream tools). The 6 per-sample tools we wrap don't consume the bowtie or STAR indices; upstream alignment is supplied as the BAM. `rpbp/preparegenome` therefore invokes `get_orfs` directly via a small Python wrapper, skipping the unused index builds entirely. Real runs take ~3 minutes on chr20 instead of the multi-minute STAR build the umbrella triggers.

The per-sample chain similarly avoids `predict-translated-orfs` and `create-orf-profiles`, both of which internally call flexbar + bowtie + STAR on raw FASTQs (re-doing the upstream alignment). Splitting the 6 internal steps into separate modules also gives independent caching on resume.

## Modules
- `rpbp/preparegenome`: chains `gtf-to-bed12` → `extract-bed-sequences` → `extract-orf-coordinates` → `split-bed12-blocks` → `label-orfs` via the internal `get_orfs` function.
- `rpbp/extractmetageneprofiles`: per-read-length metagene profiles around annotated start codons.
- `rpbp/estimatemetagenebayesfactors`: Bayes factors comparing periodic vs non-periodic models per read length.
- `rpbp/selectperiodicoffsets`: pick one P-site offset per high-quality read length.
- `rpbp/extractorfprofiles`: applies the metagene-length filter (replicating `ribo_utils.utils.get_periodic_lengths_and_offsets`) and runs `extract-orf-profiles`. Filter thresholds via `ext.args2` (4 space-separated tokens; defaults mirror `rpbp.defaults.metagene_options`).
- `rpbp/estimateorfbayesfactors`: Bayesian translated-vs-untranslated model per ORF.
- `rpbp/selectfinalpredictionset`: apply BF/length/overlap rules and emit the final predicted-ORF BED, DNA FASTA, protein FASTA.

## Subworkflow
- `fasta_gtf_bam_rpbp`: end-to-end run, with `rpbp/preparegenome` (once per invocation) followed by the per-sample 6-step chain (extract-metagene → metagene BF → select offsets → extract-orf-profiles → orf BF → select final).

## Container
All `rpbp/*` modules share a Wave-built container co-installing `bioconda::rpbp=4.0.1` and `bioconda::star=2.7.11b`:

- `community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb`
- `https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/3a/3a8aa95ce76934f6269b2d8cbdd3d57c13db029c704152975b2315e35b7a2154/data`

Versions emitted via `topic: versions`.

## Test plan
Real + stub tests pass for each of the 7 modules and the subworkflow under `nf-core modules test --profile docker` / `nf-core subworkflows test --profile docker`. Test fixtures (chr20) live in `nf-core/test-datasets:modules` under `genomics/homo_sapiens/riboseq_expression/`: the same chr20 FASTA / GTF / BAMs the other riboseq modules use.

The 3 modules whose chain extends through `extract-orf-profiles` (`extractorfprofiles`, `estimateorfbayesfactors`, `selectfinalpredictionset`) and the subworkflow itself carry a `tests/nextflow.config` setting `ext.args2 = '10 1 None 0.0'` so chr20's modest per-length read counts survive the metagene filter and produce non-empty profiles.

Source: nf-core/riboseq#174.
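
For reviewers unfamiliar with the `ext.args2` convention used here, the following is a minimal sketch of how a four-token string such as `'10 1 None 0.0'` could be split into typed thresholds. It is not the module's actual code. The field names, their order, and the reading of the literal `None` token as "no upper bound" are assumptions made for illustration; the PR text only states that there are four space-separated tokens whose defaults mirror `rpbp.defaults.metagene_options`.

```python
from typing import NamedTuple, Optional


class MetageneFilter(NamedTuple):
    # Hypothetical field names and ordering, chosen for illustration only.
    min_profile_count: int
    min_bf_mean: float
    max_bf_var: Optional[float]
    min_bf_likelihood: float


def parse_filter_tokens(args2: str) -> MetageneFilter:
    """Split an ext.args2-style string into four typed thresholds."""
    tokens = args2.split()
    if len(tokens) != 4:
        raise ValueError(f"expected 4 tokens, got {len(tokens)}: {args2!r}")
    count, bf_mean, bf_var, bf_likelihood = tokens
    return MetageneFilter(
        min_profile_count=int(count),
        min_bf_mean=float(bf_mean),
        # Assumption: the literal token 'None' disables the variance bound.
        max_bf_var=None if bf_var == "None" else float(bf_var),
        min_bf_likelihood=float(bf_likelihood),
    )


# The relaxed thresholds quoted in the test plan.
relaxed = parse_filter_tokens("10 1 None 0.0")
```

Under these assumptions, passing the thresholds as a single string keeps the module interface to one `ext` key, at the cost of positional tokens that are easy to misorder; the length check above is one way to fail early on a malformed value.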