Add peak variable support to MsBackendMemory and MsBackendDataFrame #297

Merged
merged 19 commits into from
Sep 22, 2023
Changes from 17 commits
1 change: 1 addition & 0 deletions .github/workflows/check-bioc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -95,6 +95,7 @@ jobs:
- name: Query dependencies
run: |
install.packages('remotes')
remotes::install_github("r-lib/remotes")
saveRDS(remotes::dev_package_deps(dependencies = TRUE), ".github/depends.Rds", version = 2)
shell: Rscript {0}

Expand Down
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: Spectra
Title: Spectra Infrastructure for Mass Spectrometry Data
Version: 1.11.9
Version: 1.11.10
Description: The Spectra package defines an efficient infrastructure
for storing and handling mass spectrometry spectra and functionality to
subset, process, visualize and compare spectra data. It provides different
Expand Down
19 changes: 19 additions & 0 deletions NEWS.md
@@ -1,5 +1,24 @@
# Spectra 1.11

## Changes in 1.11.10

- `peaksData,MsBackendMemory` returns a `data.frame` if additional peak
variables (in addition to `"mz"` and `"intensity"`) are requested. For
`columns = c("mz", "intensity")` (the default) a `list` of `matrix` is
returned.
- `peaksData,Spectra` returns either a `matrix` or `data.frame` and ensures
the peak data is correctly subset based on the lazy evaluation processing
queue.
- `$,Spectra` to access peak variables ensures the lazy evaluation queue is
applied prior to extracting the values.
- `applyProcessing` correctly subsets and processes all peak variables
depending on the processing queue.
- `spectraData<-,Spectra` throws an error if processing queue is not empty and
values for peaks variables should be replaced.
- `$<-,Spectra` throws an error if processing queue is not empty and a peaks
variable is going to be replaced.
- Add full support for additional peaks variables to `MsBackendDataFrame`.

## Changes in 1.11.9

- Add `filterPrecursorPeaks` to allow filtering peaks within each spectrum
Expand Down
71 changes: 32 additions & 39 deletions R/MsBackend.R
Expand Up @@ -224,8 +224,8 @@
#' used to submit the full spectra data as a `DataFrame` to the
#' backend. This would allow the backend to be also usable for the
#' [setBackend()] function from `Spectra`. Note that eventually (for
#' *read-only* backends) also the `supportsSetBackend` method would
#' need to be implemented to return `TRUE`.
#' *read-only* backends) also the `supportsSetBackend` method would need
#' to be implemented to return `TRUE`.
#' The `backendInitialize` method has also to ensure to correctly set
#' spectra variable `dataStorage`.
#'
Expand Down Expand Up @@ -422,28 +422,30 @@
#' the number of spectra in `object`. `NA` are reported for MS1
#' spectra of if no precursor information is available.
#'
#' - `peaksData` returns a `list` with the spectras' peak data, i.e. numeric
#' `matrix` with peak values. The length of the list is equal to the number
#' of spectra in `object`. Each element of the list is a `numeric` `matrix`
#' - `peaksData` returns a `list` with the spectras' peak data, i.e. m/z and
#' intensity values or other *peak variables*. The length of the list is
#' equal to the number of spectra in `object`. Each element of the list has
#' to be a two-dimensional array (`matrix` or `data.frame`)
#' with columns depending on the provided `columns` parameter (by default
#' `"mz"` and `"intensity"`, but depends on the backend's available
#' `peaksVariables`). For an empty spectrum, a `matrix` with 0 rows and
#' columns according to `columns` is returned. The optional parameter
#' `columns`, if supported by the backend, allows to define which peak
#' variables should be returned in the `numeric` peak `matrix`. As a default
#' `c("mz", "intensity")` should be used.
#' `peaksVariables`). For an empty spectrum, a `matrix` (`data.frame`) with
#' 0 rows and columns according to `columns` is returned. The optional
#' parameter `columns`, if supported by the backend, allows to define which
#' peak variables should be returned in the `numeric` peak `matrix`. As a
#' default `c("mz", "intensity")` should be used.
#'
#' - `peaksData<-` replaces the peak data (m/z and intensity values) of the
#' backend. This method expects a `list` of `matrix` objects with columns
#' `"mz"` and `"intensity"` that has the same length as the number of
#' spectra in the backend. Note that just writeable backends support this
#' method.
#' backend. This method expects a `list` of two dimensional arrays (`matrix`
#' or `data.frame`) with columns representing the peak variables. All
#' existing peaks data is expected to be replaced with these new values. The
#' length of the `list` has to match the number of spectra of `object`.
#' Note that only writeable backends need to support this method.
#'
#' - `peaksVariables`: lists the available variables for mass peaks. Default
#' peak variables are `"mz"` and `"intensity"` (which all backends need to
#' support and provide), but some backends might provide additional variables.
#' These variables correspond to the column names of the `numeric` `matrix`
#' representing the peak data (returned by `peaksData`).
#' All these variables are expected to be returned (if requested) by the
#' `peaksData` function.
#'
#' - `reset` a backend (if supported). This method will be called on the backend
#' by the `reset,Spectra` method that is supposed to restore the data to its
Expand Down Expand Up @@ -544,10 +546,7 @@
#' way the data is organized internally, provides much faster access to the
#' full peak data (i.e. the numerical matrices of m/z and intensity values).
#' Also subsetting and access to any spectra variable (except `"mz"` and
#' `"intensity"` is fastest for the `MsBackendMemory`. Finally, the
#' `MsBackendMemory` supports also arbitrary peak annotations while the
#' `MsBackendDataFrame` does not have support for such additional peak
#' variables.
#' `"intensity"` is fastest for the `MsBackendMemory`.
Member

I think there's a missing ) after "intensity".

#'
#' Thus, for most use cases, the `MsBackendMemory` provides a higher
#' performance and flexibility than the `MsBackendDataFrame` and should thus be
Expand All @@ -556,20 +555,13 @@
#' performance comparison.
#'
#' New objects can be created with the `MsBackendMemory()` and
#' `MsBackendDataFrame()` function, respectively. The backend can be
#' `MsBackendDataFrame()` function, respectively. Both backends can be
#' subsequently initialized with the `backendInitialize` method, taking a
#' `DataFrame` (or `data.frame`) with the MS data as first parameter `data`.
#' `backendInitialize` for `MsBackendMemory` has a second parameter
#' `peaksVariables` (default `peaksVariables = c("mz", "intensity")` that
#' allows to specify which of the columns in the provided data frame should
#' be considered as a *peaks variable* (i.e. information of an individual
#' mass peak) rather than a *spectra variable* (i.e. information of an
#' individual spectrum). Note that it is important to also include `"mz"` and
#' `"intensity"` in `peaksVariables` as these would otherwise be considered
#' to be spectra variables! Also, while it is possible to change the values of
#' existing peaks variables using the `$<-` method, this method does **not**
#' allow to add new peaks variables to an existing `MsBackendMemory`. New
#' peaks variables should be added using the `backendInitialize` method.
#' `DataFrame` (or `data.frame`) with the (full) MS data as first parameter
#' `data`. The second parameter `peaksVariables` allows to define which columns
#' in `data` contain *peak variables* such as the m/z and intensity values of
#' individual peaks per spectrum. The default for this parameter is
#' `peaksVariables = c("mz", "intensity")`.
#'
#' Suggested columns of this `DataFrame` are:
#'
Expand Down Expand Up @@ -598,13 +590,12 @@
#'
#' Additional columns are allowed too.
#'
#' For the `MsBackendMemory`, any column in the provided `data.frame` which
#' contains a `list` of vectors each with length equal to the number of peaks
#' for a spectrum will be used as additional *peak variable* (see examples
#' below for details).
#' The `peaksData` function for `MsBackendMemory` and `MsBackendDataFrame`
#' returns a `list` of `numeric` `matrix` by default (with parameter
#' `columns = c("mz", "intensity")`). If other peak variables are requested,
#' a `list` of `data.frame` is returned (to ensure m/z and intensity values
Member

I'm not sure I understand the "to ensure m/z ...". Should it not read "ensuring that m/z ..."?

Member Author

yes, thanks
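
A minimal base R sketch of why a `data.frame` is returned once additional peak variables are requested. It does not use the `Spectra` API; the values and the `peak_ann` variable are made up for illustration.

```r
# Base R only; all values below are hypothetical.
mz <- c(100.1, 200.2)
intensity <- c(10, 20)
peak_ann <- c("a", "b")   # hypothetical non-numeric peak variable

# Default case: two numeric columns, so a numeric matrix works.
m <- cbind(mz = mz, intensity = intensity)

# cbind on mixed types yields a character matrix: m/z and intensity
# would no longer be numeric.
m_chr <- cbind(mz = mz, intensity = intensity, peak_ann = peak_ann)

# cbind.data.frame keeps one type per column, so m/z and intensity
# stay numeric next to the character annotation.
df <- cbind.data.frame(mz = mz, intensity = intensity, peak_ann = peak_ann)
```

This mirrors the choice between `cbind` and `cbind.data.frame` made in `peaksData,MsBackendDataFrame` in this PR.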

#' are always `numeric`).
#'
#' The `MsBackendDataFrame` ignores parameter `columns` of the `peaksData`
#' function and returns **always** m/z and intensity values.
#'
#' @section `MsBackendMzR`, on-disk MS data backend:
#'
Expand Down Expand Up @@ -650,6 +641,7 @@
#' The `MsBackendMzR` ignores parameter `columns` of the `peaksData`
#' function and returns **always** m/z and intensity values.
#'
#'
#' @section `MsBackendHdf5Peaks`, on-disk MS data backend:
#'
#' The `MsBackendHdf5Peaks` keeps, similar to the `MsBackendMzR`, peak data
Expand Down Expand Up @@ -681,6 +673,7 @@
#' The `MsBackendHdf5Peaks` ignores parameter `columns` of the `peaksData`
#' function and returns **always** m/z and intensity values.
#'
#'
#' @section Implementation notes:
#'
#' Backends extending `MsBackend` **must** implement all of its methods (listed
Expand Down
28 changes: 27 additions & 1 deletion R/MsBackendDataFrame-functions.R
Expand Up @@ -69,6 +69,26 @@ NULL
NULL
}

.peaks_variables <- function(x) {
if (.hasSlot(x, "peaksVariables")) {
x@peaksVariables
} else c("mz", "intensity")
}

.valid_peaks_variable_columns <- function(x, pvars) {
lens <- lapply(pvars, function(z) lengths(x[[z]]))
names(lens) <- pvars
lens <- lens[lengths(lens) > 0]
if (length(lens) > 1) {
for (i in 2:length(lens))
if (any(lens[[1L]] != lens[[i]]))
return(paste0("Number of values per spectra differ for peak ",
"variables \"", names(lens)[1L], "\" and \"",
names(lens)[i], "\"."))
}
NULL
}

.valid_intensity_mz_columns <- function(x) {
## Don't want to have that tested on all on-disk objects.
if (length(x$intensity) && length(x$mz))
Expand Down Expand Up @@ -198,7 +218,13 @@ MsBackendDataFrame <- function() {
return(objects[[1]])
if (!all(vapply1c(objects, class) == class(objects[[1]])))
stop("Can only merge backends of the same type: ", class(objects[[1]]))
res <- objects[[1]]
pvars <- lapply(objects, peaksVariables)
for (i in 2:length(pvars))
if (length(pvars[[i]]) != length(pvars[[1L]]) ||
any(pvars[[i]] != pvars[[1L]]))
stop("Provided backends have different peaks variables. Can only ",
"merge backends with the same set of peaks variables.")
res <- objects[[1L]]
suppressWarnings(
res@spectraData <- do.call(
rbindFill, lapply(objects, function(z) z@spectraData))
Expand Down
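
The validity check added above can be tried out on a plain `list`, without any backend. The sketch below copies the body of `.valid_peaks_variable_columns` from this diff under a different name; the input data and the `peak_ann` variable are assumptions made for illustration.

```r
# Copy of the check from the diff, renamed to mark it as a sketch.
valid_peaks_variable_columns <- function(x, pvars) {
    lens <- lapply(pvars, function(z) lengths(x[[z]]))
    names(lens) <- pvars
    lens <- lens[lengths(lens) > 0]   # drop variables that are not present
    if (length(lens) > 1) {
        for (i in 2:length(lens))
            if (any(lens[[1L]] != lens[[i]]))
                return(paste0("Number of values per spectra differ for peak ",
                              "variables \"", names(lens)[1L], "\" and \"",
                              names(lens)[i], "\"."))
    }
    NULL
}

# Two hypothetical spectra with 3 and 2 peaks; 'peak_ann' has only one
# value for the second spectrum and is therefore inconsistent.
x <- list(
    mz = list(c(1, 2, 3), c(4, 5)),
    intensity = list(c(10, 20, 30), c(40, 50)),
    peak_ann = list(c("a", "b", "c"), "d"))

ok <- valid_peaks_variable_columns(x, c("mz", "intensity"))
bad <- valid_peaks_variable_columns(x, c("mz", "intensity", "peak_ann"))
```

Here `ok` is `NULL` (no problem found) and `bad` is a message naming `"mz"` and `"peak_ann"`, since every peak variable is compared against the first one.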
75 changes: 50 additions & 25 deletions R/MsBackendDataFrame.R
Expand Up @@ -14,10 +14,12 @@ NULL

setClass("MsBackendDataFrame",
contains = "MsBackend",
slots = c(spectraData = "DataFrame"),
slots = c(spectraData = "DataFrame",
peaksVariables = "character"),
prototype = prototype(spectraData = DataFrame(),
peaksVariables = c("mz", "intensity"),
readonly = FALSE,
version = "0.1"))
version = "0.2"))

setValidity("MsBackendDataFrame", function(object) {
msg <- .valid_spectra_data_required_columns(object@spectraData)
Expand All @@ -27,7 +29,8 @@ setValidity("MsBackendDataFrame", function(object) {
.valid_column_datatype(object@spectraData, .SPECTRA_DATA_COLUMNS),
.valid_intensity_column(object@spectraData),
.valid_mz_column(object@spectraData),
.valid_intensity_mz_columns(object@spectraData))
.valid_peaks_variable_columns(object@spectraData,
.peaks_variables(object)))
if (is.null(msg)) TRUE
else msg
})
Expand All @@ -52,12 +55,16 @@ setMethod("show", "MsBackendDataFrame", function(object) {
#'
#' @rdname MsBackend
setMethod("backendInitialize", signature = "MsBackendDataFrame",
function(object, data, ...) {
function(object, data, peaksVariables = c("mz", "intensity"), ...) {
if (missing(data)) data <- DataFrame()
if (is.data.frame(data))
data <- DataFrame(data)
if (!is(data, "DataFrame"))
stop("'data' has to be a 'DataFrame'")
peaksVariables <- intersect(peaksVariables, colnames(data))
if (sum(c("mz", "intensity") %in% peaksVariables) == 1L)
Member

@lgatto, Aug 25, 2023

If I understand this check correctly, it throws an error if only one of "mz" or "intensity" is in `peaksVariables`. If neither is, the error is not triggered. Why not

sum(c("mz", "intensity") %in% peaksVariables) != 2L

Member Author

good catch! thanks, changed that.

Member Author

Actually, after checking again: we were supporting the following situations:

  • the user provides both m/z and intensity
  • the user provides neither m/z nor intensity (in which case 0-length spectra are available, but the Spectra is still considered valid).

An error is only thrown if m/z is provided without intensity, or intensity without m/z. From a data point of view that does not make sense: you can have either both or none, but not just one of the two.

Open to discussion, @lgatto, if you think that's not a good idea.

Member

OK, fine by me. Maybe just clarify with a comment?

Member Author

Yep, will add that and also check that I mention it in the documentation.

Member Author

Added as a comment and to the documentation.
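
The three situations discussed in this thread can be sketched in base R. `check_peaks_variables` is a hypothetical stand-alone helper that only reproduces the two lines under review; it is not part of the package.

```r
# Hypothetical helper wrapping the check from backendInitialize.
check_peaks_variables <- function(peaksVariables, data_columns) {
    # Keep only peak variables that are actually columns of the data.
    peaksVariables <- intersect(peaksVariables, data_columns)
    # Exactly one of "mz"/"intensity" present is the only invalid case.
    if (sum(c("mz", "intensity") %in% peaksVariables) == 1L)
        stop("Both \"mz\" and \"intensity\" peak variables ",
             "need to be provided.")
    peaksVariables
}

# Both provided: accepted.
both <- check_peaks_variables(c("mz", "intensity"),
                              c("msLevel", "mz", "intensity"))
# Neither provided: accepted, the spectra are simply empty.
none <- check_peaks_variables(c("mz", "intensity"),
                              c("msLevel", "rtime"))
# Only m/z provided: error.
only_mz <- tryCatch(
    check_peaks_variables(c("mz", "intensity"), c("msLevel", "mz")),
    error = function(e) conditionMessage(e))
```

With `!= 2L` instead of `== 1L`, the second call would fail as well, which is the behaviour the author wants to avoid.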

stop("Both \"mz\" and \"intensity\" peak variables ",
"need to be provided.")
if (nrow(data)) {
data$dataStorage <- "<memory>"
if (nrow(data) && !is(data$mz, "NumericList"))
Expand All @@ -67,7 +74,8 @@ setMethod("backendInitialize", signature = "MsBackendDataFrame",
compress = FALSE)
}
object@spectraData <- data
validObject(object)
object@peaksVariables <- peaksVariables
validObject(object) # this checks also for peaks variables.
object
})

Expand All @@ -91,14 +99,21 @@ setMethod("acquisitionNum", "MsBackendDataFrame", function(object) {

#' @rdname hidden_aliases
setMethod("peaksData", "MsBackendDataFrame",
function(object, columns = peaksVariables(object)) {
if (!all(columns %in% c("mz", "intensity")))
stop("'peaksData' for 'MsBackendDataFrame' does only support",
" columns \"mz\" and \"intensity\"", call. = FALSE)
lst <- lapply(columns, function(z) do.call(z, list(object)))
function(object, columns = c("mz", "intensity")) {
na <- columns[!columns %in% peaksVariables(object)]
if (length(na))
stop("Peaks variable \"", na, "\" not available.")
lst <- lapply(columns, function(z) {
if (z %in% c("mz", "intensity"))
do.call(z, list(object))
else object@spectraData[, z]
})
names(lst) <- columns
tmp <- do.call(mapply, c(list(FUN = cbind, SIMPLIFY = FALSE,
USE.NAMES = FALSE), lst))
if (all(columns %in% c("mz", "intensity")))
fun <- cbind
else fun <- cbind.data.frame
do.call(mapply, c(list(FUN = fun, SIMPLIFY = FALSE,
USE.NAMES = FALSE), lst))
})

#' @rdname hidden_aliases
Expand Down Expand Up @@ -337,22 +352,30 @@ setMethod("precursorMz", "MsBackendDataFrame", function(object) {

#' @rdname hidden_aliases
setReplaceMethod("peaksData", "MsBackendDataFrame", function(object, value) {
if (!(is.list(value) || inherits(value, "SimpleList")))
stop("'value' has to be a list-like object")
if (length(value) != length(object))
stop("Length of 'value' has to match length of 'object'")
vals <- lapply(value, "[", , 1L)
if (!is(vals, "NumericList"))
vals <- NumericList(vals, compress = FALSE)
object@spectraData$mz <- vals
vals <- lapply(value, "[", , 2L)
if (!is(vals, "NumericList"))
vals <- NumericList(vals, compress = FALSE)
object@spectraData$intensity <- vals
validObject(object)
if (length(object)) {
.check_peaks_data_value(value, length(object))
cns <- colnames(value[[1L]])
for (cn in cns) {
vals <- lapply(value, "[", , cn)
if (cn %in% c("mz", "intensity"))
vals <- NumericList(vals, compress = FALSE)
object@spectraData[[cn]] <- vals
}
## remove eventual old peak variables
rem <- setdiff(peaksVariables(object), cns)
for (r in rem)
object@spectraData[[r]] <- NULL
object@peaksVariables <- cns
validObject(object)
}
object
})

#' @rdname hidden_aliases
setMethod("peaksVariables", "MsBackendDataFrame", function(object) {
union(c("mz", "intensity"), .peaks_variables(object))
})

#' @rdname hidden_aliases
setMethod("rtime", "MsBackendDataFrame", function(object) {
.get_column(object@spectraData, "rtime")
Expand Down Expand Up @@ -388,6 +411,8 @@ setMethod("selectSpectraVariables", "MsBackendDataFrame",
msg <- .valid_spectra_data_required_columns(object@spectraData)
if (length(msg))
stop(msg)
object@peaksVariables <- intersect(object@peaksVariables,
colnames(object@spectraData))
validObject(object)
object
})
Expand Down
17 changes: 17 additions & 0 deletions R/MsBackendMemory-functions.R
Expand Up @@ -116,3 +116,20 @@ MsBackendMemory <- function() {
}
res
}

.check_peaks_data_value <- function(x, lo) {
if (!(is.list(x) || inherits(x, "SimpleList")))
stop("'value' has to be a list-like object")
if (length(x) != lo)
stop("Length of 'value' has to match length of 'object'")
if (!(is.matrix(x[[1L]]) | is.data.frame(x[[1L]])))
stop("'value' is expected to be a 'list' of 'matrix' ",
"or 'data.frame'")
cn <- colnames(x[[1L]])
lcn <- length(cn)
lapply(x, function(z) {
cur_cn <- colnames(z)
if (lcn != length(cur_cn) || !all(cn == cur_cn))
stop("provided matrices don't have the same column names")
})
}
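
A rough sketch of how the new input check behaves for `peaksData<-`. The function body is copied from the diff above under a different name (with `||` instead of `|`, which behaves the same for these length-one conditions); the peak data and the `peak_ann` column are made up.

```r
# Copy of .check_peaks_data_value from the diff, renamed as a sketch.
check_peaks_data_value <- function(x, lo) {
    if (!(is.list(x) || inherits(x, "SimpleList")))
        stop("'value' has to be a list-like object")
    if (length(x) != lo)
        stop("Length of 'value' has to match length of 'object'")
    if (!(is.matrix(x[[1L]]) || is.data.frame(x[[1L]])))
        stop("'value' is expected to be a 'list' of 'matrix' ",
             "or 'data.frame'")
    cn <- colnames(x[[1L]])
    lcn <- length(cn)
    lapply(x, function(z) {
        cur_cn <- colnames(z)
        if (lcn != length(cur_cn) || !all(cn == cur_cn))
            stop("provided matrices don't have the same column names")
    })
}

# Hypothetical peak data for two spectra.
good <- list(cbind(mz = c(1, 2), intensity = c(10, 20)),
             cbind(mz = 3, intensity = 30))
# Second element carries an extra column, so column names differ.
mixed <- list(good[[1L]],
              data.frame(mz = 3, intensity = 30, peak_ann = "x"))

res_good <- tryCatch({ check_peaks_data_value(good, 2L); "ok" },
                     error = function(e) conditionMessage(e))
res_len <- tryCatch({ check_peaks_data_value(good, 3L); "ok" },
                    error = function(e) conditionMessage(e))
res_cols <- tryCatch({ check_peaks_data_value(mixed, 2L); "ok" },
                     error = function(e) conditionMessage(e))
```

Only the first call passes; the other two stop with the length and column-name messages, respectively.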