Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scale function(_get_mean_var) updated for dense array, speedup upto ~4.65x #3099

Closed
wants to merge 14 commits into from
Closed
1 change: 1 addition & 0 deletions docs/release-notes/1.10.2.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,4 @@
* `pp.highly_variable_genes` with `flavor=seurat_v3` now uses a numba kernel {pr}`3017` {smaller}`S Dicks`
* Speed up {func}`~scanpy.pp.scrublet` {pr}`3044` {smaller}`S Dicks` and {pr}`3056` {smaller}`P Angerer`
* Speed up clipping of array in {func}`~scanpy.pp.scale` {pr}`3100` {smaller}`P Ashish & S Dicks`
* Speed up _get_mean_var used in {func}`~scanpy.pp.scale` {pr}`3099` {smaller}`P Ashish & S Dicks`
Zethson marked this conversation as resolved.
Show resolved Hide resolved
41 changes: 34 additions & 7 deletions src/scanpy/preprocessing/_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,14 +33,41 @@ def _(X: np.ndarray, *, axis: Literal[0, 1], dtype: DTypeLike) -> np.ndarray:
def _get_mean_var(
X: _SupportedArray, *, axis: Literal[0, 1] = 0
) -> tuple[NDArray[np.float64], NDArray[np.float64]]:
if isinstance(X, sparse.spmatrix):
mean, var = sparse_mean_variance_axis(X, axis=axis)
if isinstance(X, np.ndarray):
mean, var = _compute_mean_var(X, axis=axis, dtype=np.float64)
ashish615 marked this conversation as resolved.
Show resolved Hide resolved
else:
mean = axis_mean(X, axis=axis, dtype=np.float64)
mean_sq = axis_mean(elem_mul(X, X), axis=axis, dtype=np.float64)
var = mean_sq - mean**2
# enforce R convention (unbiased estimator) for variance
var *= X.shape[axis] / (X.shape[axis] - 1)
if isinstance(X, sparse.spmatrix):
mean, var = sparse_mean_variance_axis(X, axis=axis)
else:
mean = axis_mean(X, axis=axis, dtype=np.float64)
mean_sq = axis_mean(elem_mul(X, X), axis=axis, dtype=np.float64)
var = mean_sq - mean**2
# enforce R convention (unbiased estimator) for variance
var *= X.shape[axis] / (X.shape[axis] - 1)
Comment on lines +46 to +47
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before your change, this line ran unconditionally, now it only runs for the not isinstance(X, np.ndarray) case. Is that intentional? Then you should mention that in _compute_mean_var’s docstring.

return mean, var


@numba.njit(cache=True, parallel=True)
def _compute_mean_var(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have _get_mean_var. Maybe rename this to _get_mean_var_ndarray or _get_mean_var_dense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can rename the kernel

X: _SupportedArray, axis: Literal[0, 1] = 0, dtype: DTypeLike | None = None
ashish615 marked this conversation as resolved.
Show resolved Hide resolved
) -> tuple[NDArray[np.float64], NDArray[np.float64]]:
nthr = numba.get_num_threads()
axis_i = 1 if axis == 0 else 0
s = np.zeros((nthr, X.shape[axis_i]), dtype=dtype)
ss = np.zeros((nthr, X.shape[axis_i]), dtype=dtype)
mean = np.zeros(X.shape[axis_i], dtype=dtype)
var = np.zeros(X.shape[axis_i], dtype=dtype)
n = X.shape[axis]
for i in numba.prange(nthr):
for r in range(i, n, nthr):
for c in range(X.shape[axis_i]):
v = X[r, c] if axis == 0 else X[c, r]
s[i, c] += v
ss[i, c] += v * v
for c in numba.prange(X.shape[axis_i]):
s0 = s[:, c].sum()
mean[c] = s0 / n
var[c] = (ss[:, c].sum() - s0 * s0 / n) / (n - 1)
return mean, var


Expand Down
Loading