Description
This is a long read and a proposal with quite a large scope and a decent amount of backwards-compatibility impact. I hope it'll make SciPy's behavior much more consistent and predictable, as well as yield significant performance gains. I'll post this on the mailing list too, and I'd suggest discussing the big picture first; I'll also ask for process suggestions there in case we'd like to split up the discussion, open a doc/proposal PR for detailed review, or anything like that.
The basic design principle I propose aiming for in all of SciPy for array libraries is: container type in == container type out.
Python sequences (lists, tuples, etc.) will continue to be converted to NumPy arrays, as will other unknown types that are coercible with `np.asarray`.
The scope of this proposal is: how to treat array inputs. This includes different kinds of NumPy arrays and ndarrays with particular dtypes, as well as array/tensor objects coming from other libraries.
Out of scope for this proposal are (a) dataframe libraries, and (b) implementation of a dispatch mechanism for when non-numpy array types hit C/C++/Cython/Fortran code inside SciPy. Both of these topics are touched upon in the Appendix section.
I'll dive straight into the design here; for context on why we'd want/need this design or what has been done and discussed before, see the Context and Problems & opportunities sections further down.
array/tensor types support
Array types and how to treat them:
Input type | Return type | When | Behavior | Notes
---|---|---|---|---
`numpy.ndarray` | `numpy.ndarray` | always | |
`torch.Tensor` (CPU) | `torch.Tensor` | always | |
`torch.Tensor` (GPU) | `torch.Tensor` | when possible | raise `TypeError` otherwise | for pure Python implementations for now; see text below for more details
`cupy.ndarray` | `cupy.ndarray` | when possible | raise `TypeError` otherwise |
has `__array_namespace__` | same as input type | when possible | raise `TypeError` otherwise |
array with object dtype | n/a | | raise `TypeError` |
`numpy.matrix` | n/a | for NumPy >= 2.0 (or always?) | raise `TypeError` |
`numpy.ma.MaskedArray` | `numpy.ma.MaskedArray` | only for `stats.mstats` | raise `TypeError` otherwise | future `nan_policy`/`mstats` plans are unaffected by this
`numpy.ndarray` subclasses | same subclass type | always | | so always use `np.asanyarray` checks
`torch.Tensor` subclasses | same subclass type | same as for `torch.Tensor` on CPU/GPU | |
PyData/Sparse arrays | same sparse array type | once it has array API support | `TypeError` otherwise | `TypeError` is already the current state; no auto-densification
`dask.Array` | `dask.Array` | once it has array API support | `TypeError` otherwise (?) | bc-breaking, but mixing with the current conversion to numpy arrays may not be healthy
`jax.numpy.Array` | `jax.numpy.Array` | once added to `array-api-compat` | | same as PyTorch for CPU and GPU
When a non-NumPy array type hits compiled code in SciPy (which tends to use the NumPy C API), we have a couple of options:

1. dispatch back to the other library (PyTorch, CuPy, etc.),
2. convert to a NumPy array when possible (i.e., on CPU via the buffer protocol, DLPack, or `__array__`), use the compiled code in question, then convert back,
3. don't support this at all.
I'll note that (1) is the long-term goal; how to implement this is out of scope for this proposal - for more on that, see the Appendix section. For now we choose to do (2) when possible, and (3) otherwise. Switching from that approach to (1) in the future will be backwards-compatible.
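To make option (2) with the option (3) fallback concrete, here is a minimal sketch; the helper name and structure are hypothetical (not a proposed API), and it assumes `array-api-compat` for namespace and device detection:

```python
import numpy as np
import array_api_compat

def _call_compiled(x, compiled_func):
    # Remember which library `x` came from so we can convert back afterwards.
    xp = array_api_compat.array_namespace(x)
    if str(array_api_compat.device(x)) != "cpu":
        # Option (3): no silent device transfers - ask the caller to move data explicitly.
        raise TypeError("non-CPU arrays are not supported when hitting compiled code")
    # Option (2): on CPU this is typically zero-copy via the buffer protocol,
    # DLPack, or __array__.
    result = compiled_func(np.asarray(x))
    # Convert back so that "container type in == container type out" holds.
    return xp.asarray(result)
```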
A note on `numpy.matrix`: the only place it is still needed is `scipy.sparse`; it can be vendored there, and for NumPy >= 2.0 instances of the vendored matrix code can be returned. That allows deprecating it in NumPy. We need to support `scipy.sparse.*_matrix` long-term for backwards compatibility (they're too widely used to deprecate), however for new code we have `sparse.*_array` instances and PyData/Sparse.
Regarding array API support: when it's present in an array library, SciPy will require it to be complete (v2022.12), including complex number support and the `linalg` and `fft` submodules - supporting partial implementations without `linalg` support in particular seems unnecessary.
For as-yet-unsupported GPU execution when hitting compiled code, we will raise exceptions. The alternative considered was to transfer to CPU, execute, and transfer back (e.g., for PyTorch). A pro of doing that would be that everything works, and there may still be performance gains. A con is that it silently does device transfers, which is usually not a good idea. On balance, something like this is only a good idea if there's a well-defined plan to make GPU execution work for most functionality on a reasonable time scale (~12-18 months max). That would mean both addressing the dispatch (uarray & co) problem that I am trying hard to avoid diving into here, and having time commitments for doing the work. Since we don't have that, raising exceptions is the way to go.
Development and introduction strategy
The following strategy is meant to allow implementing support in a gradual way, and have control over when to default to a change in behavior - which necessarily is going to have some backwards compatibility impact.
- develop behind an experimental flag (an environment variable and a global setting starting with an underscore)
- copy the scikit-learn approach, using `array-api-compat` as an optional dependency for now
- enable the new "input type in == input type out" behavior by default once a submodule has complete support
  - at that point, vendor `array-api-compat`
  - may be a major release if it can all be done in one release cycle, otherwise do it submodule by submodule
- explicit support can and should be tested in CI for: PyTorch CPU tensors, `numpy.array_api` for compliance testing in APIs that start supporting array API compliant input, and perhaps pandas Series/DataFrame. GPU CI may materialize at some point, but let's not count on or plan for that.
- implement new type coercion machinery centrally in `scipy._lib` and depend on that from other submodules. That means we'll have a single replacement for `np.asarray`/`np.asanyarray` in one central location.
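As a rough sketch of what that central machinery could look like - names such as `as_xparray` and the `SCIPY_ARRAY_API` environment variable are placeholders here, not decided API - combining the experimental flag, the rejections from the table above, and the `np.asarray` fallback:

```python
# scipy/_lib/_array_api.py (hypothetical location and names)
import os
import numpy as np

def as_xparray(x):
    """Return ``(array, namespace)``, enforcing the rules proposed above."""
    if os.environ.get("SCIPY_ARRAY_API", "0") != "1":
        # Experimental flag off: keep today's behavior of coercing to NumPy.
        return np.asanyarray(x), np

    if isinstance(x, (np.ma.MaskedArray, np.matrix)):
        raise TypeError(f"{type(x).__name__} input is not supported")
    if isinstance(x, np.ndarray) and x.dtype == object:
        raise TypeError("object arrays are not supported")

    import array_api_compat
    if array_api_compat.is_array_api_obj(x):
        # "Container type in == container type out": hand the array back
        # unchanged together with its array API namespace.
        return x, array_api_compat.array_namespace(x)

    # Lists, tuples and other unknown types are still coerced to NumPy arrays.
    return np.asanyarray(x), np
```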
Context
- For NumPy 2.0 we should have full array API standard support in the main `numpy` namespace
- The `numpy.array_api` module is useful for testing that code is actually using only standard APIs, because it's a minimal implementation. It's not going to be used as a layer for production usage though, it's for portability testing only.
- The array-api-compat library provides the compatibility we need right now between NumPy, CuPy and PyTorch. Other libraries can be added there too. This bridges any gaps between the standard and real-world implementations. The package is light-weight and can be depended on - but is also designed to be vendored to avoid a new runtime dependency.
- scikit-learn has had experimental array API support for a little while, and has now settled on `array-api-compat` as the way to go. This looks quite a bit nicer than the previous approach with `numpy.array_api`, and the work to support PyTorch that way has been quite well-received.
- As for the myriad of open NEPs on interoperability topics:
  - NEP 30 (`__duckarray__`), NEP 31 (`uarray`) and NEP 37 (`__array_module__`) are all effectively dead - I'll propose rejecting these.
  - NEP 47 (array API support via `numpy.array_api`) has been implemented, however I'm going to propose marking it superseded via a new NEP for the main `numpy` namespace.
  - `__array_function__` and `__array_ufunc__` are fully implemented and will continue to exist and be supported in NumPy. We won't support those mechanisms in SciPy though, since we are coercing unknown input types to `ndarray` and erroring out if that fails. The exception here is ufuncs in `scipy.special`, which happen to work already because we're reusing the NumPy ufunc machinery there. We can probably leave that as is, since it's not problematic.
- NumPy array subclasses are a long-standing problem, with masked arrays and `np.matrix` instances not being Liskov-substitutable (i.e., a drop-in replacement with no changes in behavior, only extensions), making it difficult to accept subclasses at all. We can explicitly start rejecting those two with clear exceptions; that will make regular subclasses a lot more useful.
- Missing data support is a broad and tricky subject:
  - `numpy.ma` has tons of issues and isn't well-maintained. There's a full rewrite floating around that is ~90% complete and with some luck will make it into NumPy 2.0. However, it hasn't seen movement for several years, and the work on it is not planned. `numpy.ma` in its current form should be considered legacy.
  - `scipy.stats.mstats` is the only module that specifically supports masked arrays.
  - `scipy.stats` has a `nan_policy` keyword in many functions that is well-maintained at this point, and has a documented specification. That is probably not applicable to other submodules though.
  - Support for missing data in other places like `interpolate` and `spatial.KDTree` (see gh-18230) may make sense, but ideally that'd use solid NumPy support (e.g., via a null-aware dtype) and that does not exist.
  - Dataframe libraries have much better support, but it's a difficult story to use non-numpy-backed dataframes at all.
  - For the purposes of this proposal, I'd prefer leaving it at "reject `numpy.ma.MaskedArray` instances".
Problems & opportunities
The motivation for all the effort on interoperability is that the current state of SciPy's behavior causes issues, and that there's an opportunity/desire to gain a lot of performance by using other libraries (e.g., PyTorch, CuPy, JAX).
Problems include:
- pandas input with non-numpy dtypes being converted to object arrays and then support being very hit-and-miss; see for example gh-18254
- general pandas dataframe/series support being patchy; see for example gh-9616 and gh-12164
- masked arrays giving incorrect results because the mask is silently dropped; see, e.g., gh-1682 and gh-12898
- sparse array support being patchy; agreement that auto-densification needs to be avoided, but that means current behavior is inconsistent (xref gh-4239)
- object arrays, including use of `Decimal` and other random things that people stuff into object arrays on purpose or (more likely) by accident, being handled very inconsistently.
Opportunities include:
- Large performance gains. See for example mdhaber/scipy#63 (10x-50x for stats bootstrapping functionality with CuPy) and scikit-learn#25956, which shows how PyTorch improves over NumPy (2x-7x on CPU, 60x-100x on GPU)
- interest in support for Dask, which we haven't really been able to do much with; see for example gh-10204, gh-8810 and gh-10087. Xarray is in the same boat, since it builds on top of Dask (see, e.g., gh-14824)
References
- Scikit-learn PR for PyTorch support: scikit-learn#25956
- Scikit-learn docs for array API support: https://scikit-learn.org/stable/modules/array_api.html
- Tracking issue for SciPy array API standard support: gh-15354
- Initial PR for SciPy array API support (not even blocked, we just didn't push it forward because `numpy.array_api` is a bit cumbersome): gh-15395
- NumPy 2.0 developer meeting presentation on array API adoption
Appendix
Note: these topics are included for reference/context because they are related, but they are out of scope for this proposal. Please avoid diving into these (I suggest pinging me directly first in case you do see a reason to discuss these topics).
dataframe support
For dataframe library support, the situation is a little trickier. We have to think about pandas `Series` and `DataFrame` instances with numpy and non-numpy dtypes, the presence of nullable integers, and other dataframe types which may be completely backed by Apache Arrow, another non-numpy library (e.g., cuDF & CuPy), or have implemented things completely within their own library and may or may not have any NumPy compatibility.
This would be one option, which is somewhat similar to what scikit-learn does (except, it converts nullable integers to float64 with nans):
Input type | Return type | When | Behavior | Notes
---|---|---|---|---
`pandas.Series` | `pandas.Series` | if dtype is a numpy dtype | raise `TypeError` otherwise | non-NumPy dtypes get coerced to object arrays otherwise; avoid that
`pandas.DataFrame` | `pandas.DataFrame` | if dtype is a numpy dtype | raise `TypeError` otherwise |
has `__dataframe__` | same as input type | | | needs a small extension to the dataframe interchange protocol to reconstruct the input type
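To illustrate the first rows of this table, here is a rough sketch (the helper and the `np.sqrt` stand-in computation are purely illustrative, not a proposed API): a numpy-backed `Series` round-trips to the same container type, while nullable/Arrow dtypes are rejected rather than coerced to object arrays:

```python
import numpy as np
import pandas as pd

def _series_example(s: pd.Series) -> pd.Series:
    if not isinstance(s.dtype, np.dtype):
        # Nullable integer / Arrow-backed dtypes would otherwise end up as object arrays.
        raise TypeError(f"Series with dtype {s.dtype} is not supported")
    result = np.sqrt(s.to_numpy())  # stand-in for the actual SciPy computation
    # Return the same container type, preserving the index.
    return pd.Series(result, index=s.index, name=s.name)

_series_example(pd.Series([1.0, 4.0, 9.0]))              # works, returns a Series
# _series_example(pd.Series([1, None], dtype="Int64"))   # would raise TypeError
```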
There's a lot to learn from the effort scikit-learn went through to support pandas dataframes better. See, e.g., the scikit-learn 1.2 release highlights showing the `set_output` feature to request a pandas dataframe as the return type.
Note: I'd like to not work this out in lots of detail here, because it will require time and that should not block progress on array library support. I just want to put it on the radar, because we do need to deal with it at some point; current treatment of dataframes is quite patchy.
Dispatching mechanism
For compiled code, other array types (whether CPU, GPU or distributed) are likely not going to work at all; the SciPy code is written for the NumPy C API. It's not impossible that some Cython code will work with other array types if those support the buffer protocol and the Cython code uses memoryviews - but that's uncommon (won't work at all on GPU, and PyTorch doesn't support the buffer protocol on CPU either).
There has been a lot of discussion on how this should work. The leading candidate is `uarray`, which we already use in `scipy.fft` (as do matching `fft` APIs in CuPy and PyFFTW) and which has other PRs pending in both SciPy and CuPy. However, there is also resistance to it because it involves a lot of complexity - perhaps too much. So significant work is needed to simplify it, or to switch to another mechanism. This is important work that has to be done, but I'd prefer not to mix it with this proposal.
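For reference, this is what the existing `uarray`-based mechanism in `scipy.fft` already looks like from the user side; it requires CuPy and a GPU, and uses CuPy's matching `cupyx.scipy.fft` module registered as a backend:

```python
import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft

x = cp.random.standard_normal(1024)
with scipy.fft.set_backend(cufft):
    # scipy.fft.fft dispatches to CuPy's implementation and returns a cupy.ndarray
    y = scipy.fft.fft(x)
```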
Whatever the mechanism, it should work transparently, such that `scipy.xxx.yyy(x)`, where `x` is a non-numpy array, dispatches to the library which implements the array type of `x`.
We have a uarray label in the issue tracker. See gh-14353 for the tracker with completed and pending implementations. For more context and real-world examples, see:
- https://labs.quansight.org/blog/pydata-extensibility-vision (and in particular the PyData dispatching system section which covers alternative dispatching options)
- https://labs.quansight.org/blog/making-gpus-accessible-to-pydata-ecosystem-via-array-api
- https://quansight-labs.github.io/array-api-demo/GW_Demo_Array_API.html
- https://data-apis.org/array-api/latest/use_cases.html
- https://github.com/numpy/archive/blob/main/other_meetings/2020-04-21-array-protocols-discussion-notes.md
- Note that there's work to be done to write SPEC-0002 on this dispatching topic, see also Discourse posts on that: here, here and here.
- Scikit-learn is working on a different kind of system for this, with explicit plugins to extend scikit-learn with support for other libraries/devices where it's using compiled code - see Path for pluggable low-level computational routines scikit-learn/scikit-learn#22438.
- PyTorch has a more robust backend system (https://pytorch.org/docs/stable/dynamo/custom-backends.html).
- I'm sure there are more examples of how to enable this kind of thing, inside and outside of Python. It's just a lot of engineering effort.