
RFC: SciPy array types & libraries support #18286

@rgommers


This is a long read and a proposal with quite a large scope and a decent amount of backwards-compatibility impact. I hope it'll make SciPy's behavior much more consistent and predictable, as well as yield significant performance gains. I'll post this on the mailing list too. I'd suggest discussing the big picture first; I'll ask for process suggestions on the mailing list too in case we'd like to split up the discussion, open a doc/proposal PR for detailed review, or anything like that.

The basic design principle I propose aiming for in all of SciPy for array libraries is: container type in == container type out.

Python sequences (lists, tuples, etc.) will continue to be converted to NumPy arrays, as will other unknown types that are coercible with np.asarray.
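
To illustrate the principle, here is what the intended behavior would look like for an existing function. The NumPy and list cases already work today; the torch case shows the proposed future behavior, not what SciPy currently does:

```python
import numpy as np
import scipy.ndimage

x = np.random.default_rng(0).standard_normal((64, 64))
y = scipy.ndimage.gaussian_filter(x, sigma=2)  # ndarray in -> ndarray out

# Python sequences keep being coerced to numpy arrays:
y_list = scipy.ndimage.gaussian_filter([[0.0, 1.0],
                                        [2.0, 3.0]], sigma=1)  # -> ndarray out

# Proposed (not yet implemented): the same call with a CPU torch.Tensor would
# return a torch.Tensor, and raise TypeError where GPU execution isn't supported.
# import torch
# y_t = scipy.ndimage.gaussian_filter(torch.as_tensor(x), sigma=2)  # -> torch.Tensor
```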

The scope of this proposal is: how to treat array inputs. This includes different kinds of NumPy arrays and ndarrays with particular dtypes, as well as array/tensor objects coming from other libraries.

Out of scope for this proposal are (a) dataframe libraries, and (b) implementation of a dispatch mechanism for when non-numpy array types hit C/C++/Cython/Fortran code inside SciPy. Both of these topics are touched upon in the Appendix section.

I'll dive straight into the design here; for context on why we'd want/need this design or what has been done and discussed before, see the Context and Problems & opportunities sections further down.

array/tensor types support

Array types and how to treat them:

| Input type | Return type | When | Behavior | Notes |
|---|---|---|---|---|
| numpy.ndarray | numpy.ndarray | always | | |
| torch.Tensor (CPU) | torch.Tensor | always | | |
| torch.Tensor (GPU) | torch.Tensor | when possible | raise TypeError otherwise | for pure Python implementations for now; see text below for more details |
| cupy.ndarray | cupy.ndarray | when possible | raise TypeError otherwise | |
| has __array_namespace__ | same as input type | when possible | raise TypeError otherwise | |
| array with object dtype | n/a | | raise TypeError | |
| numpy.matrix | n/a | for NumPy >= 2.0 (or always?) | raise TypeError | |
| numpy.ma.MaskedArray | numpy.ma.MaskedArray | only for stats.mstats | raise TypeError otherwise | future nan_policy/mstats plans are unaffected by this |
| numpy.ndarray subclasses | same subclass type | always | | so always use np.asanyarray checks |
| torch.Tensor subclasses | same subclass type | | same as for torch.Tensor on CPU/GPU | |
| PyData/Sparse arrays | same sparse array type | once it has array API support | TypeError otherwise | TypeError is already the current state; no auto-densification |
| dask.Array | dask.Array | once it has array API support | TypeError otherwise (?) | bc-breaking, but mixing with current conversion to numpy array may not be healthy |
| jax.numpy.Array | jax.numpy.Array | once added to array-api-compat | | same as PyTorch for CPU and GPU |

When a non-NumPy array type hits compiled code in SciPy (which tends to use the NumPy C API), we have a few options:

  1. Dispatch back to the other library (PyTorch, CuPy, etc.).
  2. Convert to a NumPy array when possible (i.e., on CPU via the buffer protocol, DLPack, or __array__), use the compiled code in question, then convert back.
  3. Don't support this at all.

I'll note that (1) is the long-term goal; how to implement this is out of scope for this proposal - for more on that, see the Appendix section. For now we choose to do (2) when possible, and (3) otherwise. Switching from that approach to (1) in the future will be backwards-compatible.
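
To make option (2) concrete, here is a minimal sketch of what such a conversion wrapper could look like. The helper name, its placement, and the device handling are illustrative assumptions, not an agreed-upon SciPy API:

```python
import numpy as np

def _call_compiled(func, x, *args, **kwargs):
    """Illustrative sketch of option (2): convert the input to a NumPy array,
    run the compiled implementation, and convert the result back to the
    input's array type. Names here are hypothetical."""
    if isinstance(x, np.ndarray) or not hasattr(x, "__array_namespace__"):
        # NumPy arrays and unknown types: current behavior, coerce with asarray.
        return func(np.asarray(x), *args, **kwargs)

    xp = x.__array_namespace__()
    try:
        # Works for CPU arrays exposing the buffer protocol, DLPack, or __array__;
        # GPU arrays (e.g. torch CUDA tensors, cupy.ndarray) typically raise here,
        # so we never do silent device transfers.
        x_np = np.asarray(x)
    except (TypeError, ValueError) as exc:
        raise TypeError(
            f"{type(x).__name__} input is not supported by this compiled "
            "SciPy routine"
        ) from exc

    result = func(x_np, *args, **kwargs)
    return xp.asarray(result)  # container type in == container type out
```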

A note on numpy.matrix: the only place it is needed is scipy.sparse; it can be vendored there, and for NumPy >= 2.0 instances of the vendored matrix code can be returned. That allows deprecating numpy.matrix in NumPy itself. We need to support scipy.sparse.*_matrix long-term for backwards compatibility (they're too widely used to deprecate); for new code, however, we have sparse.*_array instances and PyData/Sparse.

Regarding array API support: when it's present in an array library, SciPy will require it to be complete (v2022.12), including complex number support and the linalg and fft submodules - supporting partial implementations without linalg support in particular seems unnecessary.
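
As a rough illustration, such a completeness check might look like the following (the helper name and error messages are hypothetical, not an existing SciPy function):

```python
def _check_array_api_complete(xp):
    """Hypothetical sketch: verify that an array namespace provides the pieces
    SciPy relies on -- the linalg and fft extensions and complex dtypes."""
    missing = [ext for ext in ("linalg", "fft") if not hasattr(xp, ext)]
    if missing:
        raise TypeError(
            f"array namespace {getattr(xp, '__name__', xp)!r} lacks required "
            f"array API extension(s): {', '.join(missing)}"
        )
    if not hasattr(xp, "complex128"):
        raise TypeError("complex number support (complex128) is required")
```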

For as-yet-unsupported GPU execution when hitting compiled code, we will raise exceptions. The alternative considered was to transfer to CPU, execute, and transfer back (e.g., for PyTorch). A pro of doing that would be that everything works, and there may still be performance gains. A con is that it silently does device transfers, which is usually not a good idea. On balance, something like this is only a good idea if there's a well-defined plan to make GPU execution work for most functionality on a reasonable time scale (~12-18 months max), which means both addressing the dispatch (uarray & co) problem that I am trying hard to avoid diving into here, and having time commitments for doing the work. Since we don't have that, raising exceptions is the way to go.

Development and introduction strategy

The following strategy is meant to allow implementing support in a gradual way, and have control over when to default to a change in behavior - which necessarily is going to have some backwards compatibility impact.

  • develop behind an experimental flag (an environment variable and a global setting starting with an underscore)
  • copy the scikit-learn approach, using array-api-compat as an optional dependency for now
  • switch to the new "container type in == container type out" behavior by default once a submodule has complete support
    • at that point, vendor array-api-compat
    • this may be a major release if it can all be done in one release cycle; otherwise, do it submodule by submodule
  • explicit support can and should be tested in CI for: PyTorch CPU tensors, numpy.array_api for compliance testing in APIs that start supporting array API compliant input, and perhaps pandas Series/DataFrame. GPU CI may materialize at some point, but let's not count on or plan for that.
  • implement new type coercion machinery centrally in scipy._lib and depend on that from other submodules. That means we'll have a single replacement for np.asarray/np.asanyarray in one central location; a minimal sketch of what that could look like is shown below.
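
A minimal sketch of such a central helper, assuming array-api-compat is available as an optional dependency (the module path, function name, and environment variable are illustrative assumptions):

```python
# Hypothetical contents of scipy/_lib/_array_api.py -- names are illustrative.
import os
import numpy as np

# Experimental opt-in flag, per the strategy above (exact spelling is a guess).
_ARRAY_API_ENABLED = bool(os.environ.get("SCIPY_ARRAY_API", False))

def array_namespace(*arrays):
    """Return the array API namespace shared by `arrays`, or NumPy.

    With the experimental flag off, or when array-api-compat is not installed,
    fall back to NumPy so that callers keep coercing inputs with np.asarray,
    matching current SciPy behavior.
    """
    if not _ARRAY_API_ENABLED:
        return np
    try:
        import array_api_compat
    except ImportError:
        return np
    return array_api_compat.array_namespace(*arrays)
```

Public functions in the submodules would then call this helper once on their array inputs and use the returned namespace throughout, instead of calling np.asarray/np.asanyarray directly.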

Context

  • For NumPy 2.0 we should have full array API standard support in the main numpy namespace
  • The numpy.array_api module is useful for testing that code is actually using only standard APIs, because it's a minimal implementation. It's not going to be used as a layer for production usage though, it's for portability testing only.
  • The array-api-compat library provides the compatibility we need right now between NumPy, CuPy and PyTorch. Other libraries can be added there too. This bridges any gaps between the standard and real-world implementations. The package is light-weight and can be depended on - but is also designed to be vendored to avoid a new runtime dependency.
  • scikit-learn has had experimental array API support for a little while, and has now settled on array-api-compat as the way to go. This looks quite a bit nicer than the previous approach with numpy.array_api, and the work to support PyTorch that way has been quite well-received.
  • As for the myriad of open NEPs on interoperability topics:
    • NEP 30 (__duckarray__), NEP 31 (uarray) and NEP 37 (__array_module__) are all effectively dead - I'll propose rejecting these.
    • NEP 47 (array API support via numpy.array_api) has been implemented; however, I'm going to propose marking it superseded via a new NEP for the main numpy namespace
    • __array_function__ and __array_ufunc__ are fully implemented and will continue to exist and be supported in NumPy. We won't support those mechanisms in SciPy though, since we coerce unknown input types to ndarray and error out if that fails. The exception here is ufuncs in scipy.special, which happen to work already because we're reusing the numpy ufunc machinery there. We can probably leave that as is, since it's not problematic.
  • NumPy array subclasses are a long-standing problem: masked arrays and np.matrix instances are not Liskov-substitutable (i.e., they are not drop-in replacements that only extend ndarray behavior without changing it), which makes it difficult to accept subclasses. We can explicitly start rejecting those two with clear exceptions; that will make regular subclasses a lot more useful.
  • Missing data support is a broad and tricky subject:
    • numpy.ma has tons of issues and isn't well-maintained. There's a full rewrite floating around that is ~90% complete and with some luck will make it into NumPy 2.0; however, it hasn't seen movement for several years, and finishing that work is not currently planned. numpy.ma in its current form should be considered legacy.
    • scipy.stats.mstats is the only module that specifically supports masked arrays.
    • scipy.stats has a nan_policy keyword in many functions that is well-maintained at this point, and has a documented specification. That is probably not applicable to other submodules though.
    • Support for missing data in other places like interpolate and spatial.KDTree (see gh-18230) may make sense, but ideally that'd use solid numpy support (e.g., via a null-aware dtype) and that does not exist.
    • Dataframe libraries have much better missing-data support, but using non-numpy-backed dataframes at all is a difficult story.
    • For the purposes of this proposal, I'd prefer leaving it at "reject numpy.ma.MaskedArray instances".

Problems & opportunities

The motivation for all the effort on interoperability is that the current state of SciPy's behavior causes issues, and that there's an opportunity/desire to gain a lot of performance by using other libraries (e.g., PyTorch, CuPy, JAX).

Problems include:

  • pandas input with non-numpy dtypes being converted to object arrays and then support being very hit-and-miss; see for example gh-18254
  • general pandas dataframe/series support being patchy; see for example gh-9616 and gh-12164
  • masked arrays giving incorrect results because the mask is silently dropped; see, e.g., gh-1682 and gh-12898
  • sparse array support being patchy; agreement that auto-densification needs to be avoided, but that means current behavior is inconsistent (xref gh-4239)
  • object arrays, including use of Decimal and other random things that people stuff into object arrays on purpose or (more likely) by accident, being handled very inconsistently.

Opportunities include:

  • Large performance gains. See for example mdhaber/scipy#63 (10x-50x for stats bootstrapping functionality with CuPy) and scikit-learn#25956 which shows how PyTorch improves over NumPy (2x-7x on CPU, 60x-100x on GPU)
  • interest in support for Dask, which we haven't really been able to do much with; see for example gh-10204, gh-8810 and gh-10087. Xarray is in the same boat, since it builds on top of Dask (see, e.g., gh-14824)

References

Appendix

Note: these topics are included for reference/context because they are related, but they are out of scope for this proposal. Please avoid diving into these (I suggest pinging me directly first in case you do see a reason to discuss these topics).

dataframe support

For dataframe library support, the situation is a little trickier. We have to think about pandas Series and DataFrame instances with numpy and non-numpy dtypes, the presence of nullable integers, and other dataframe types which may be completely backed by Apache Arrow, another non-numpy library (e.g., cuDF & CuPy), or have implemented things completely within their own library and may or may not have any NumPy compatibility.

This would be one option, which is somewhat similar to what scikit-learn does (except that scikit-learn converts nullable integers to float64 with NaNs):

| Input type | Return type | When | Behavior | Notes |
|---|---|---|---|---|
| pandas.Series | pandas.Series | if dtype is a numpy dtype | raise TypeError otherwise | non-NumPy dtypes otherwise get coerced to object arrays; avoid that |
| pandas.DataFrame | pandas.DataFrame | if dtype is a numpy dtype | raise TypeError otherwise | |
| has __dataframe__ | same as input type | | | needs a small extension to the dataframe interchange protocol to reconstruct the input type |
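
A rough sketch of how the pandas.Series row could be handled (the helper is hypothetical and only for illustration):

```python
import numpy as np
import pandas as pd

def _series_to_ndarray(s: pd.Series) -> np.ndarray:
    """Hypothetical conversion following the table above: accept Series backed
    by a NumPy dtype, reject pandas extension dtypes (nullable integers,
    Arrow-backed dtypes, ...) instead of silently coercing them to object."""
    if isinstance(s.dtype, pd.api.extensions.ExtensionDtype):
        raise TypeError(
            f"pandas Series with dtype {s.dtype} is not supported; "
            "convert it to a NumPy-backed dtype first"
        )
    return s.to_numpy()
```

The DataFrame case would be analogous (checking each column's dtype), and honoring "container type in == container type out" would then mean wrapping results back into a Series/DataFrame with the original index.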

There's a lot to learn from the effort scikit-learn went through to support pandas dataframes better. See, e.g., the scikit-learn 1.2 release highlights showing the set_output feature to request pandas dataframe as the return type.

Note: I'd like to not work this out in lots of detail here, because it will require time and that should not block progress on array library support. I just want to put it on the radar, because we do need to deal with it at some point; current treatment of dataframes is quite patchy.

Dispatching mechanism

For compiled code, other array types (whether CPU, GPU or distributed) are likely not going to work at all; the SciPy code is written for the NumPy C API. It's not impossible that some Cython code will work with other array types if those support the buffer protocol and the Cython code uses memoryviews - but that's uncommon (won't work at all on GPU, and PyTorch doesn't support the buffer protocol on CPU either).

There has been a lot of discussion on how this should work. The leading candidate is uarray, which we already use in scipy.fft (as do matching fft APIs in CuPy and PyFFTW) and which has other PRs pending in both SciPy and CuPy. However, there is also resistance to that because it involves a lot of complexity - perhaps too much. So significant work is needed to simplify it, or to switch to another mechanism. This is important work that has to be done, but I'd prefer not to mix it with this proposal.

Whatever the mechanism, it should work transparently, such that calling scipy.xxx.yyy(x) with a non-numpy array x dispatches to the library that implements the array type of x.
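
scipy.fft is an existing example of this pattern via uarray backends; a usage sketch, assuming CuPy is installed and its scipy.fft-compatible module is used as the backend:

```python
import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft

x = cp.random.standard_normal(1024)

# With CuPy's FFT module registered as the backend, the scipy.fft call
# dispatches to CuPy and returns a cupy.ndarray.
with scipy.fft.set_backend(cufft):
    y = scipy.fft.fft(x)
```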

We have a uarray label in the issue tracker. See gh-14353 for the tracker with completed and pending implementations. For more context and real-world examples, see:
