sc.pp.normalize_total does not support dask arrays #2465

@flying-sheep

Description

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the master branch of scanpy.

Minimal code sample (that we can copy&paste without having any data)

This issue is not yet visible in any builds, since the error fixed by scverse/anndata#970 prevents the tests from running properly. Once that PR’s fix is released in a stable anndata version, we’ll start seeing this error unless we fix it first. (The minimal-deps tests against anndata dev don’t install dask, so the error doesn’t surface there either.)

$ pip install dask
$ pytest -k test_normalize_total

The error happens in this line:

f'normalization factor computation:\n{adata.var_names[~gene_subset].tolist()}'

i.e. in this expression: adata.var_names[~gene_subset], where gene_subset is a dask.array<invert, shape=(3,), dtype=bool, chunksize=(3,), chunktype=numpy.ndarray>

IndexError: too many indices for array: array is 1-dimensional, but 3 were indexed
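A minimal sketch of a possible workaround, assuming the fix is simply to materialize the lazy boolean mask before handing it to pandas: `materialize_mask` is a hypothetical helper (not part of scanpy), and a plain numpy array stands in for the dask mask so the snippet runs without dask installed.

```python
import numpy as np
import pandas as pd

def materialize_mask(mask):
    """Return a plain numpy bool array for any array-like mask.

    Lazy arrays (e.g. dask.array) expose .compute(); calling it first
    avoids pandas' Index.__getitem__ choking on the lazy object.
    """
    if hasattr(mask, "compute"):  # duck-typed check for dask-like arrays
        mask = mask.compute()
    return np.asarray(mask, dtype=bool)

var_names = pd.Index(["0", "1", "2"])  # mirrors the Index in the traceback below
gene_subset = np.array([True, False, True])  # stands in for the dask mask
excluded = var_names[~materialize_mask(gene_subset)]
# excluded -> Index(['1'], dtype='object')
```

With a real dask array, `mask.compute()` would trigger the computation and return a numpy array, which pandas indexes without issue.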

The traceback pytest shows is slightly off and points at this line instead:

' The following highly-expressed genes are not considered during '

./scanpy/tests/test_normalization.py::test_normalize_total[dask-array-int64] Failed: [undefined]IndexError: too many indices for array: array is 1-dimensional, but 3 were indexed
typ = <function from_array at 0x7f6c35f8d940>, dtype = 'int64'

    @pytest.mark.parametrize('dtype', ['float32', 'int64'])
    def test_normalize_total(typ, dtype):
        adata = AnnData(typ(X_total), dtype=dtype)
        sc.pp.normalize_total(adata, key_added='n_counts')
        assert np.allclose(np.ravel(adata.X.sum(axis=1)), [3.0, 3.0, 3.0])
        sc.pp.normalize_total(adata, target_sum=1, key_added='n_counts2')
        assert np.allclose(np.ravel(adata.X.sum(axis=1)), [1.0, 1.0, 1.0])
    
        adata = AnnData(typ(X_frac), dtype=dtype)
>       sc.pp.normalize_total(adata, exclude_highly_expressed=True, max_fraction=0.7)

scanpy/tests/test_normalization.py:45: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
scanpy/preprocessing/_normalization.py:185: in normalize_total
    ' The following highly-expressed genes are not considered during '
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Index(['0', '1', '2'], dtype='object')
key = dask.array<invert, shape=(3,), dtype=bool, chunksize=(3,), chunktype=numpy.ndarray>

    def __getitem__(self, key):
        """
        Override numpy.ndarray's __getitem__ method to work as desired.
    
        This function adds lists and Series as valid boolean indexers
        (ndarrays only supports ndarray with dtype=bool).
    
        If resulting ndim != 1, plain ndarray is returned instead of
        corresponding `Index` subclass.
    
        """
        getitem = self._data.__getitem__
    
        if is_integer(key) or is_float(key):
            # GH#44051 exclude bool, which would return a 2d ndarray
            key = com.cast_scalar_indexer(key, warn_float=True)
            return getitem(key)
    
        if isinstance(key, slice):
            # This case is separated from the conditional above to avoid
            # pessimization com.is_bool_indexer and ndim checks.
            result = getitem(key)
            # Going through simple_new for performance.
            return type(self)._simple_new(result, name=self._name)
    
        if com.is_bool_indexer(key):
            # if we have list[bools, length=1e5] then doing this check+convert
            #  takes 166 µs + 2.1 ms and cuts the ndarray.__getitem__
            #  time below from 3.8 ms to 496 µs
            # if we already have ndarray[bool], the overhead is 1.4 µs or .25%
            key = np.asarray(key, dtype=bool)
    
>       result = getitem(key)
E       IndexError: too many indices for array: array is 1-dimensional, but 3 were indexed

../../venvs/single-cell/lib/python3.8/site-packages/pandas/core/indexes/base.py:5055: IndexError

Versions

Details

anndata 0.9.0rc2.dev18+g7771f6ee
scanpy 1.10.0.dev50+g3e3427d0

PIL 9.1.1
asciitree NA
beta_ufunc NA
binom_ufunc NA
cffi 1.15.0
cloudpickle 2.2.1
cycler 0.10.0
cython_runtime NA
dask 2023.3.2
dateutil 2.8.2
defusedxml 0.7.1
entrypoints 0.4
fasteners 0.17.3
h5py 3.7.0
hypergeom_ufunc NA
igraph 0.10.4
jinja2 3.1.2
joblib 1.1.0
kiwisolver 1.4.3
leidenalg 0.9.1
llvmlite 0.38.1
markupsafe 2.1.1
matplotlib 3.5.2
mpl_toolkits NA
natsort 8.1.0
nbinom_ufunc NA
numba 0.55.2
numcodecs 0.10.2
numpy 1.22.4
packaging 21.3
pandas 1.4.3
pkg_resources NA
psutil 5.9.1
pyparsing 3.0.9
pytz 2022.1
scipy 1.8.1
session_info 1.0.0
setuptools 67.2.0
setuptools_scm NA
six 1.16.0
sklearn 1.1.1
sphinxcontrib NA
texttable 1.6.7
threadpoolctl 3.1.0
tlz 0.12.0
toolz 0.12.0
typing_extensions NA
wcwidth 0.2.5
yaml 6.0
zarr 2.12.0
zipp NA

Python 3.8.16 (default, Dec 7 2022, 12:42:00) [GCC 12.2.0]
Linux-6.2.10-zen1-1-zen-x86_64-with-glibc2.34

Session information updated at 2023-04-11 15:57
