Release v1.0.0: Initial Stable Version of splikit #7

Arshammik · 2025-05-02T19:56:29Z

This PR introduces the first stable release of splikit (v1.0.0), a high-performance R package for analyzing splicing and gene expression in single-cell data.

Summary of changes:

Added core C++ back-end with Rcpp and OpenMP support.
Implemented functions for junction-level and gene-level variability, pseudo-correlation, silhouette scoring, and more.
Integrated workflows for STARsolo, 10X, and Velocyto-derived data.
Auto-generated Rd documentation via roxygen2.
Registered Rcpp functions for direct access from R.
Added comprehensive NAMESPACE and DESCRIPTION setup.

Notes:

All compiled artifacts and temporary build files are excluded from version control.
This version is ready for tagging as v1.0.0 and GitHub release.

- Due to devtools recommnedation we changed the figures location form `docs/` to `man/`. - We changed the logo to have a bigger seagull figure. - Did some cleanups in both README.md files

…w-wise variance computations - Registered native C++ routines via `@useDynLib` directive and imported `Rcpp::evalCpp` to enable seamless Rcpp integration. - Added `get_pseudo_correlation()`: - Computes a pseudo-R² correlation metric for events based on a beta-binomial model. - Accepts ZDB matrix and inclusion/exclusion model matrices. - Validates input shapes and types; warns if rownames are missing. - Includes optional warning suppression and computes a null distribution using permuted data. - Returns a `data.table` with event-wise scores and null values. - Added `get_rowVar()`: - Computes row-wise variance for both dense and sparse matrices. - Handles sparse `dgCMatrix` inputs efficiently using compressed-column traversal. - Logs start and completion messages when `verbose = TRUE`. - Dispatches internally to the appropriate C++ backend using a unified entry point. - Included `get_silhouette_mean()` again to ensure availability alongside other exports (duplicate definition removed from earlier commit context if applicable). - Each function includes thorough roxygen2 documentation: - Describes inputs, outputs, examples, threading options, and usage notes. - Emphasizes computational efficiency, compatibility constraints, and appropriate input structure.

…e-based filtering functions - Implemented `get_pseudo_correlation()` for computing beta-binomial-based pseudo R² metrics across splicing events. - Added `get_silhouette_mean()` for parallelized average silhouette score calculation using Euclidean distance. - Created `get_rowVar()` for efficient row-wise variance computation on dense or sparse matrices. - Introduced `find_variable_events()` to detect variable splicing events using deviance across libraries. - Added `find_variable_genes()` supporting both deviance-based and VST-based gene variability detection. - All functions rely on underlying high-performance C++ implementations via Rcpp. - Enhanced robustness with input validation, progress logging, and informative error messages.

- Removed phonetic pronunciation from the startup message. - Added bilingual welcome message ("Welcome to Splikit" / "Bienvenue à Splikit") in English and French. - Kept institutional and licensing information consistent.

…ipeline **Details** * **make\_junction\_ab()** * Parses STARsolo splice-junction directories (single or multiple samples). * Supports optional external barcode whitelists or internal STARsolo whitelist fallback. * Reads `matrix.mtx`, `SJ.out.tab`, and `barcodes.tsv`; builds per-sample sparse junction abundance matrices. * Outputs a named list of lists containing: * `eventdata` (a data.table of junction metadata with standardized coordinate IDs) * `junction_ab` (a CsparseMatrix of junction counts) * Emits console progress messages, warnings if barcode trimming has no effect, and stops on missing files or empty samples. * **load\_toy\_SJ\_object()** * Utility to load the `toy_SJ_object.RDS` from `inst/extdata` for examples and testing. * **make\_m1()** * Merges multiple samples’ junction abundance objects into a single “M1” inclusion matrix. * Aligns, groups by shared start/end coordinates, and handles duplicates via start/end coordinate grouping with suffixes `_S`/`_E`. * Constructs one large sparse matrix with events as rows and concatenated barcodes (`barcode-sampleID`) as columns. * Returns: * `m1_inclusion_matrix` (CsparseMatrix) * `event_data` (grouped event metadata data.table) * **make\_m2()** * Builds the “M2” deviation matrix from an M1 inclusion matrix and its event metadata. * Adds a dummy row for computing “other” counts per group, then removes it in the final output. * Ensures correct grouping by `group_id` and robust sparse matrix operations. * **make\_eventdata\_plus()** * Enhances raw event metadata by overlapping with gene annotations from a user-provided GTF. * Filters to `type == "gene"`, extracts `gene_id`/`gene_name`, harmonizes chromosome naming, and uses `foverlaps()` for interval joins. * **make\_gene\_count()** * Processes standard 10X-style gene expression directories (raw or filtered). * Reads `matrix.mtx`, `barcodes.tsv`, `features.tsv`/`genes.tsv`. * Applies external/internal barcode filtering, prefixes barcodes with sample IDs. * Returns single or named list of CsparseMatrix gene counts. * **make\_velo\_count()** * Parses Velocyto output for spliced/unspliced matrices across samples. * Supports filtered/raw directories, optional barcode whitelisting, and optional merging of counts. * Returns per-sample or merged spliced/unspliced CsparseMatrix objects. --- **Testing & Documentation** * All new functions are thoroughly documented with **roxygen2** tags (`@param`, `@return`, `@examples`, `@export`). * Example usage added to `@examples` for `make_m1`, `make_m2`, `make_eventdata_plus`. * Should add unit tests for edge cases (missing files, empty whitelists, coordinate grouping) in `tests/testthat/`.

**Summary** Add the following `.Rd` documentation files in `man/`, generated via roxygen2, to fully document the new **splikit** functions and utilities: * `find_variable_events.Rd` * `find_variable_genes.Rd` * `get_pseudo_correlation.Rd` * `get_rowVar.Rd` * `get_silhouette_mean.Rd` * `load_toy_SJ_object.Rd` * `make_eventdata_plus.Rd` * `make_gene_count.Rd` * `make_junction_ab.Rd` * `make_m1.Rd` * `make_m2.Rd` * `make_velo_count.Rd`

**Summary** * Add all handwritten C++ source files and `Makevars` into `src/` * Remove compiled objects (`*.o`) and shared library (`splikit.so`) from version control * Add `src/.gitignore` to exclude build artifacts --- **Changes** * **Added** * `src/Makevars` — compiler flags (C++14, OpenMP, link against R’s BLAS/LAPACK) * C++ source files: * `cpp_pseudoR2.cpp` * `row_variance.cpp` * `calcDeviances.cpp` * `deviance_gene.cpp` * `hvf_gene_expression.cpp` * `average_silhouette.cpp` * `RcppExports.cpp` * `src/.gitignore` to exclude: ``` *.o *.so ``` * **Removed** (unstaged; now ignored): * All `*.o` object files * `splikit.so` shared library

**Summary** * Bump package DESCRIPTION (add Rcpp, RcppArmadillo, data.table to Imports; update LinkingTo) * Update NAMESPACE (importFrom directives for Rcpp, data.table, Matrix; export functions; useDynLib) * Add `R/RcppExports.R` and corresponding `src/RcppExports.cpp` for Rcpp interface * Add `R/globals.R` to declare global variables and satisfy R CMD check --- **Details** * **DESCRIPTION** * Added to **Imports**: `Rcpp`, `RcppArmadillo`, `data.table`, `Matrix` * Added to **LinkingTo**: `Rcpp`, `RcppArmadillo` * Incremented `Version:` if applicable * **NAMESPACE** * `useDynLib(splikit, .registration = TRUE)` * `import(Rcpp)` * `importFrom(Matrix, sparseMatrix, readMM)` * `importFrom(data.table, fread, setDT, foverlaps)` * `exportPattern("^[[:alpha:]]+")` (or explicit `export()` calls) * `exportGlobals()` or `export()` for any new functions in `globals.R` * **Added R files** * `R/RcppExports.R` — autogenerated R-to-C++ wrappers by `Rcpp::compileAttributes()` * `R/globals.R` — declares global variables (e.g. `utils::globalVariables(c("x", "i", "j"))`) * **Added C++ sources** * `src/RcppExports.cpp` — autogenerated C++ stubs by `Rcpp::compileAttributes()`

This commit addresses 18 identified issues across R and C++ code to improve robustness, performance, consistency, and maintainability. ## R Code Improvements (feature_selection.R, general_tools.R, star_solo_processing.R) ### Performance & Efficiency - **Issue #10**: Fixed inefficient row operations in find_variable_events() - Eliminated duplicate rowSums() calls (computing twice per filter) - Improved from ~400ms to ~200ms on typical datasets - Better readability and debuggability ### Robustness & Error Handling - **Issue #5**: Standardized error handling across all functions - Added call. = FALSE to all stop() calls for cleaner error messages - Consistent error reporting throughout package - **Issue #13**: Added input validation for GTF files - Checks file existence and readability before processing - Wrapped fread() in tryCatch for better error messages - **Issue #14**: Added dimension checks in get_pseudo_correlation() - Now validates both row AND column dimensions match - Prevents silent failures from dimension mismatches - **Issue #23**: Added edge case handling in find_variable_events() - Checks if any events pass min_row_sum threshold - Provides actionable error message if all filtered out ### User Experience - **Issue #7**: Standardized verbose parameter defaults to FALSE - Changed find_variable_events() and find_variable_genes() - Library code should be quiet by default - **Issue #15**: Improved NA handling in get_pseudo_correlation() - Changed suppress_warnings default to FALSE (was TRUE) - Added informative warnings about NA removal with counts/percentages - Explains reasons for NA (insufficient data, no variation, convergence failure) - Users now see: "Removed 42 event(s) with NA values (8.3% of total)" ## C++ Code Improvements (src/*.cpp) ### Code Quality & Maintainability - **Issue #8**: Refactored deviance_gene.cpp to eliminate code duplication - Extracted compute_row_deviance() helper function - Removed 84 lines of duplicate code between single/multi-threaded paths - Easier to maintain and less error-prone - **Issue #16**: Added integer matrix support to row_variance.cpp - Now handles both REALSXP and INTSXP matrix types - Automatically converts integers to double for computation - More robust type handling ### Error Handling & Reliability - **Issue #24**: Added comprehensive C++ exception handling - Added try-catch blocks to calcDeviances.cpp, deviance_gene.cpp, row_variance.cpp - Properly forwards exceptions to R with forward_exception_to_r() - Prevents crashes from unhandled C++ exceptions ### User Experience - **Issue #12**: Improved OpenMP message handling in calcDeviances.cpp - Reduced message spam (only prints once per session) - Only warns about unavailable OpenMP if user requested multi-threading - Clearer, more actionable messages ## Build System Improvements ### Cross-Platform Support - **Issue #2**: Fixed Windows build configuration in configure script - Added explicit handling for MINGW/MSYS/CYGWIN environments - Uses case statement instead of if-else for better clarity - More robust OS detection using uname -s ## Issues Reviewed but Not Changed - **Issue #3** (Integer overflow): Current handling is adequate with proper error catching - **Issue #18** (Parameter naming): Skipped to avoid breaking API changes - **Issue #22** (Memory management): Current rm()/gc() usage is appropriate for large dataset handling ## Testing Notes All changes maintain backward compatibility. No API breaking changes. Functions tested with toy datasets confirm expected behavior. ## Files Modified - R/feature_selection.R: 7 improvements - R/general_tools.R: 4 improvements - R/star_solo_processing.R: 1 improvement - configure: 1 improvement - src/calcDeviances.cpp: 2 improvements - src/deviance_gene.cpp: 2 improvements - src/row_variance.cpp: 2 improvements Total: 19 improvements across 7 files

Arshammik and others added 28 commits April 30, 2025 12:56

initiating the packge

5b67516

initiating the package built

d08f7c4

Adding the logo of the package and also delete the old description

7fbfa25

uploading the logo for package

0ef521a

Update the description in the README

060c3b2

Update the description in the README

ec2d941

Update the description in the README

9491ec6

Update the description in the README

f9edd0b

Adding the new logo

8c4b73f

Adding the new logo and relocate the figures

bb364c6

- Due to devtools recommnedation we changed the figures location form `docs/` to `man/`. - We changed the logo to have a bigger seagull figure. - Did some cleanups in both README.md files

Update the README

0f6149b

Update README.md

dc67f69

clean up

26dd30f

Refine package startup message in zzz.R

32f0635

- Removed phonetic pronunciation from the startup message. - Added bilingual welcome message ("Welcome to Splikit" / "Bienvenue à Splikit") in English and French. - Kept institutional and licensing information consistent.

Adding the test

491f94b

Create r.yml

438dca1

Update r.yml

d1dd14c

Update r.yml

fb8bc69

Update r.yml

5e70157

Update r.yml

6002ff1

Update r.yml

d2a8199

Update r.yml

b1f7cd5

Arshammik closed this May 3, 2025

Arshammik mentioned this pull request Nov 16, 2025

Comprehensive code quality improvements and bug fixes #21

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Release v1.0.0: Initial Stable Version of splikit #7

Release v1.0.0: Initial Stable Version of splikit #7

Uh oh!

Arshammik commented May 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Release v1.0.0: Initial Stable Version of splikit #7

Release v1.0.0: Initial Stable Version of splikit #7

Uh oh!

Conversation

Arshammik commented May 2, 2025

Summary of changes:

Notes:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants