Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scale function(_get_mean_var) updated for dense array, speedup upto ~4.65x #3099

Closed
wants to merge 14 commits into from
Closed
1 change: 1 addition & 0 deletions docs/release-notes/1.10.2.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,4 @@
* `pp.highly_variable_genes` with `flavor=seurat_v3` now uses a numba kernel {pr}`3017` {smaller}`S Dicks`
* Speed up {func}`~scanpy.pp.scrublet` {pr}`3044` {smaller}`S Dicks` and {pr}`3056` {smaller}`P Angerer`
* Speed up clipping of array in {func}`~scanpy.pp.scale` {pr}`3100` {smaller}`P Ashish & S Dicks`
* Speed up _get_mean_var used in {func}`~scanpy.pp.scale` {pr}`3099` {smaller}`P Ashish & S Dicks`
Zethson marked this conversation as resolved.
Show resolved Hide resolved
43 changes: 36 additions & 7 deletions src/scanpy/preprocessing/_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,14 +33,43 @@
def _get_mean_var(
X: _SupportedArray, *, axis: Literal[0, 1] = 0
) -> tuple[NDArray[np.float64], NDArray[np.float64]]:
if isinstance(X, sparse.spmatrix):
mean, var = sparse_mean_variance_axis(X, axis=axis)
if isinstance(X, np.ndarray):
mean, var = _compute_mean_var(X, axis=axis, dtype=np.float64)
ashish615 marked this conversation as resolved.
Show resolved Hide resolved
else:
mean = axis_mean(X, axis=axis, dtype=np.float64)
mean_sq = axis_mean(elem_mul(X, X), axis=axis, dtype=np.float64)
var = mean_sq - mean**2
# enforce R convention (unbiased estimator) for variance
var *= X.shape[axis] / (X.shape[axis] - 1)
if isinstance(X, sparse.spmatrix):
mean, var = sparse_mean_variance_axis(X, axis=axis)
else:
mean = axis_mean(X, axis=axis, dtype=np.float64)
mean_sq = axis_mean(elem_mul(X, X), axis=axis, dtype=np.float64)
var = mean_sq - mean**2

Check warning on line 44 in src/scanpy/preprocessing/_utils.py

View check run for this annotation

Codecov / codecov/patch

src/scanpy/preprocessing/_utils.py#L42-L44

Added lines #L42 - L44 were not covered by tests
# enforce R convention (unbiased estimator) for variance
var *= X.shape[axis] / (X.shape[axis] - 1)
Comment on lines +46 to +47
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before your change, this line ran unconditionally, now it only runs for the not isinstance(X, np.ndarray) case. Is that intentional? Then you should mention that in _compute_mean_var’s docstring.

return mean, var


@numba.njit(cache=True, parallel=True)
def _compute_mean_var(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have _get_mean_var. Maybe rename this to _get_mean_var_ndarray or _get_mean_var_dense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can rename the kernel

X: _SupportedArray, axis: Literal[0, 1] = 0, dtype: DTypeLike | None = None
ashish615 marked this conversation as resolved.
Show resolved Hide resolved
) -> tuple[NDArray[np.float64], NDArray[np.float64]]:
nthr = numba.get_num_threads()
axis_i = 1 if axis == 0 else 0
s = np.zeros((nthr, X.shape[axis_i]), dtype=dtype)
ss = np.zeros((nthr, X.shape[axis_i]), dtype=dtype)
mean = np.zeros(X.shape[axis_i], dtype=dtype)
# std=np.zeros(X.shape[axis_i],dtype=dtype)
ashish615 marked this conversation as resolved.
Show resolved Hide resolved
var = np.zeros(X.shape[axis_i], dtype=dtype)
n = X.shape[axis]
for i in numba.prange(nthr):
for r in range(i, n, nthr):
for c in range(X.shape[axis_i]):
v = X[r, c] if axis == 0 else X[c, r]
s[i, c] += v
ss[i, c] += v * v
for c in numba.prange(X.shape[axis_i]):
s0 = s[:, c].sum()
mean[c] = s0 / n
var[c] = (ss[:, c].sum() - s0 * s0 / n) / (n - 1)
# std[c]=np.sqrt(var[c])
ashish615 marked this conversation as resolved.
Show resolved Hide resolved
return mean, var


Expand Down
Loading