Description
This is a long read and a proposal with quite a large scope and a decent amount of backwards-compatibility impact. I hope it'll make SciPy's behavior much more consistent and predictable, as well as yield significant performance gains. I'll post this on the mailing list too, and I'd suggest discussing the big picture first; I'll also ask for process suggestions there in case we'd like to split up the discussion, open a doc/proposal PR for detailed review, or anything like that.
The basic design principle I propose aiming for in all of SciPy for array libraries is: container type in == container type out.
Python sequences (lists, tuples, etc.) will continue to be converted to NumPy arrays, as will other unknown types that are coercible with `np.asarray`.
The scope of this proposal is: how to treat array inputs. This includes different kinds of NumPy arrays and ndarrays with particular dtypes, as well as array/tensor objects coming from other libraries.
Out of scope for this proposal are (a) dataframe libraries, and (b) implementation of a dispatch mechanism for when non-numpy array types hit C/C++/Cython/Fortran code inside SciPy. Both of these topics are touched upon in the Appendix section.
I'll dive straight into the design here; for context on why we'd want/need this design or what has been done and discussed before, see the Context and Problems & opportunities sections further down.
array/tensor types support
Array types and how to treat them:
Input type | Return type | When | Behavior | Notes
---|---|---|---|---
`numpy.ndarray` | `numpy.ndarray` | always | |
`torch.Tensor` (CPU) | `torch.Tensor` | always | |
`torch.Tensor` (GPU) | `torch.Tensor` | when possible | raise `TypeError` otherwise | for pure Python implementations for now; see text below for more details
`cupy.ndarray` | `cupy.ndarray` | when possible | raise `TypeError` otherwise |
has `__array_namespace__` | same as input type | when possible | raise `TypeError` otherwise |
array with object dtype | n/a | | raise `TypeError` |
`numpy.matrix` | n/a | for NumPy >= 2.0 (or always?) | raise `TypeError` |
`numpy.ma.MaskedArray` | `numpy.ma.MaskedArray` | only for `stats.mstats` | raise `TypeError` otherwise | future `nan_policy`/`mstats` plans are unaffected by this
`numpy.ndarray` subclasses | same subclass type | always | | so always use `np.asanyarray` checks
`torch.Tensor` subclasses | same subclass type | same as for `torch.Tensor` on CPU/GPU | |
PyData/Sparse arrays | same sparse array type | once it has array API support | `TypeError` otherwise | `TypeError` is already the current state; no auto-densification
`dask.Array` | `dask.Array` | once it has array API support | `TypeError` otherwise (?) | bc-breaking, but mixing with the current conversion to numpy arrays may not be healthy
`jax.numpy.Array` | `jax.numpy.Array` | once added to `array-api-compat` | | same as PyTorch for CPU and GPU
When a non-NumPy array type hits compiled code in SciPy (which tends to use the NumPy C API), we have a couple of options:

1. dispatch back to the other library (PyTorch, CuPy, etc.),
2. convert to a NumPy array when possible (i.e., on CPU via the buffer protocol, DLPack, or `__array__`), use the compiled code in question, then convert back,
3. don't support this at all.
I'll note that (1) is the long-term goal; how to implement this is out of scope for this proposal - for more on that, see the Appendix section. For now we choose to do (2) when possible, and (3) otherwise. Switching from that approach to (1) in the future will be backwards-compatible.
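To make option (2) with the option (3) fallback concrete, here is a minimal sketch; the helper name and structure are hypothetical (not a proposed API), and it assumes `array-api-compat` for namespace and device detection:

```python
import numpy as np
import array_api_compat

def _call_compiled(x, compiled_func):
    # Remember which library `x` came from so we can convert back afterwards.
    xp = array_api_compat.array_namespace(x)
    if str(array_api_compat.device(x)) != "cpu":
        # Option (3): no silent device transfers - ask the caller to move data explicitly.
        raise TypeError("non-CPU arrays are not supported when hitting compiled code")
    # Option (2): on CPU this is typically zero-copy via the buffer protocol,
    # DLPack, or __array__.
    result = compiled_func(np.asarray(x))
    # Convert back so that "container type in == container type out" holds.
    return xp.asarray(result)
```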
A note on `numpy.matrix`: the only place it is still needed is `scipy.sparse`; it can be vendored there, and for NumPy >= 2.0 instances of the vendored matrix code can be returned. That allows deprecating it in NumPy. We need to support `scipy.sparse.*_matrix` long-term for backwards compatibility (they're too widely used to deprecate), however for new code we have `sparse.*_array` instances and PyData/Sparse.
Regarding array API support: when it's present in an array library, SciPy will require it to be complete (v2022.12), including complex number support and the `linalg` and `fft` submodules - supporting partial implementations without `linalg` support in particular seems unnecessary.
For as-yet-unsupported GPU execution when hitting compiled code, we will raise exceptions. The alternative considered was to transfer to CPU, execute, and transfer back (e.g., for PyTorch). A pro of doing that would be that everything works, and there may still be performance gains. A con is that it silently does device transfers, which is usually not a good idea. On balance, something like this is only a good idea if there's a well-defined plan to make GPU execution work for most functionality on a reasonable time scale (~12-18 months max). That would mean both addressing the dispatch (uarray & co) problem that I am trying hard to avoid diving into here, and having time commitments for doing the work. Since we don't have that, raising exceptions is the way to go.
Development and introduction strategy
The following strategy is meant to allow implementing support in a gradual way, and have control over when to default to a change in behavior - which necessarily is going to have some backwards compatibility impact.
- develop behind an experimental flag (an environment variable and a global setting starting with an underscore)
- copy the scikit-learn approach, using `array-api-compat` as an optional dependency for now
- enable the new "input type in == input type out" behavior by default once a submodule has complete support
  - at that point, vendor `array-api-compat`
  - may be a major release if it can all be done in one release cycle, otherwise do it submodule by submodule
- explicit support can and should be tested in CI for: PyTorch CPU tensors, `numpy.array_api` for compliance testing in APIs that start supporting array API compliant input, and perhaps pandas Series/DataFrame. GPU CI may materialize at some point, but let's not count on or plan for that.
- implement new type coercion machinery centrally in `scipy._lib` and depend on that from other submodules. That means we'll have a single replacement for `np.asarray`/`np.asanyarray` in one central location.
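As a rough sketch of what that central machinery could look like - names such as `as_xparray` and the `SCIPY_ARRAY_API` environment variable are placeholders here, not decided API - combining the experimental flag, the rejections from the table above, and the `np.asarray` fallback:

```python
# scipy/_lib/_array_api.py (hypothetical location and names)
import os
import numpy as np

def as_xparray(x):
    """Return ``(array, namespace)``, enforcing the rules proposed above."""
    if os.environ.get("SCIPY_ARRAY_API", "0") != "1":
        # Experimental flag off: keep today's behavior of coercing to NumPy.
        return np.asanyarray(x), np

    if isinstance(x, (np.ma.MaskedArray, np.matrix)):
        raise TypeError(f"{type(x).__name__} input is not supported")
    if isinstance(x, np.ndarray) and x.dtype == object:
        raise TypeError("object arrays are not supported")

    import array_api_compat
    if array_api_compat.is_array_api_obj(x):
        # "Container type in == container type out": hand the array back
        # unchanged together with its array API namespace.
        return x, array_api_compat.array_namespace(x)

    # Lists, tuples and other unknown types are still coerced to NumPy arrays.
    return np.asanyarray(x), np
```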
Context
- For NumPy 2.0 we should have full array API standard support in the main `numpy` namespace
- The `numpy.array_api` module is useful for testing that code is actually using only standard APIs, because it's a minimal implementation. It's not going to be used as a layer for production usage though, it's for portability testing only.
- The array-api-compat library provides the compatibility we need right now between NumPy, CuPy and PyTorch. Other libraries can be added there too. This bridges any gaps between the standard and real-world implementations. The package is light-weight and can be depended on - but is also designed to be vendored to avoid a new runtime dependency.
- scikit-learn has had experimental array API support for a little while, and has now settled on `array-api-compat` as the way to go. This looks quite a bit nicer than the previous approach with `numpy.array_api`, and the work to support PyTorch that way has been quite well-received.
- As for the myriad of open NEPs on interoperability topics:
  - NEP 30 (`__duckarray__`), NEP 31 (`uarray`) and NEP 37 (`__array_module__`) are all effectively dead - I'll propose rejecting these.
  - NEP 47 (array API support via `numpy.array_api`) has been implemented, however I'm going to propose marking it superseded via a new NEP for the main `numpy` namespace.
  - `__array_function__` and `__array_ufunc__` are fully implemented and will continue to exist and be supported in NumPy. We won't support those mechanisms in SciPy though, since we are coercing unknown input types to `ndarray` and erroring out if that fails. The exception here is ufuncs in `scipy.special`, which happen to work already because we're reusing the NumPy ufunc machinery there. We can probably leave that as is, since it's not problematic.
- NumPy array subclasses are a long-standing problem, with masked arrays and `np.matrix` instances not being Liskov-substitutable (i.e., a drop-in replacement with no changes in behavior, only extensions), making it difficult to accept subclasses at all. We can explicitly start rejecting those two with clear exceptions; that will make regular subclasses a lot more useful.
- Missing data support is a broad and tricky subject:
  - `numpy.ma` has tons of issues and isn't well-maintained. There's a full rewrite floating around that is ~90% complete and with some luck will make it into NumPy 2.0. However, it hasn't seen movement for several years, and the work on it is not planned. `numpy.ma` in its current form should be considered legacy.
  - `scipy.stats.mstats` is the only module that specifically supports masked arrays.
  - `scipy.stats` has a `nan_policy` keyword in many functions that is well-maintained at this point, and has a documented specification. That is probably not applicable to other submodules though.
  - Support for missing data in other places like `interpolate` and `spatial.KDTree` (see gh-18230) may make sense, but ideally that'd use solid NumPy support (e.g., via a null-aware dtype) and that does not exist.
  - Dataframe libraries have much better support, but it's a difficult story to use non-numpy-backed dataframes at all.
  - For the purposes of this proposal, I'd prefer leaving it at "reject `numpy.ma.MaskedArray` instances".
Problems & opportunities
The motivation for all the effort on interoperability is that the current state of SciPy's behavior causes issues, and that there's an opportunity/desire to gain a lot of performance by using other libraries (e.g., PyTorch, CuPy, JAX).
Problems include:
- pandas input with non-numpy dtypes being converted to object arrays and then support being very hit-and-miss; see for example gh-18254
- general pandas dataframe/series support being patchy; see for example gh-9616 and gh-12164
- masked arrays giving incorrect results because the mask is silently dropped; see, e.g., gh-1682 and gh-12898
- sparse array support being patchy; agreement that auto-densification needs to be avoided, but that means current behavior is inconsistent (xref gh-4239)
- object arrays, including use of `Decimal` and other random things that people stuff into object arrays on purpose or (more likely) by accident, being handled very inconsistently.
Opportunities include:
- Large performance gains. See for example mdhaber/scipy#63 (10x-50x for stats bootstrapping functionality with CuPy) and scikit-learn#25956, which shows how PyTorch improves over NumPy (2x-7x on CPU, 60x-100x on GPU)
- interest in support for Dask, which we haven't really been able to do much with; see for example gh-10204, gh-8810 and gh-10087. Xarray is in the same boat, since it builds on top of Dask (see, e.g., gh-14824)
References
- Scikit-learn PR for PyTorch support: scikit-learn#25956
- Scikit-learn docs for array API support: https://scikit-learn.org/stable/modules/array_api.html
- Tracking issue for SciPy array API standard support: gh-15354
- Initial PR for SciPy array API support (not even blocked, we just didn't push it forward because `numpy.array_api` is a bit cumbersome): gh-15395
- NumPy 2.0 developer meeting presentation on array API adoption
Appendix
Note: these topics are included for reference/context because they are related, but they are out of scope for this proposal. Please avoid diving into these (I suggest pinging me directly first in case you do see a reason to discuss these topics).
dataframe support
For dataframe library support, the situation is a little trickier. We have to think about pandas `Series` and `DataFrame` instances with numpy and non-numpy dtypes, the presence of nullable integers, and other dataframe types which may be completely backed by Apache Arrow, another non-numpy library (e.g., cuDF & CuPy), or have implemented things completely within their own library and may or may not have any NumPy compatibility.
This would be one option, which is somewhat similar to what scikit-learn does (except, it converts nullable integers to float64 with nans):
Input type | Return type | When | Behavior | Notes
---|---|---|---|---
`pandas.Series` | `pandas.Series` | if dtype is a numpy dtype | raise `TypeError` otherwise | non-NumPy dtypes get coerced to object arrays otherwise; avoid that
`pandas.DataFrame` | `pandas.DataFrame` | if dtype is a numpy dtype | raise `TypeError` otherwise |
has `__dataframe__` | same as input type | | | needs a small extension to the dataframe interchange protocol to reconstruct the input type
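To illustrate the first rows of this table, here is a rough sketch (the helper and the `np.sqrt` stand-in computation are purely illustrative, not a proposed API): a numpy-backed `Series` round-trips to the same container type, while nullable/Arrow dtypes are rejected rather than coerced to object arrays:

```python
import numpy as np
import pandas as pd

def _series_example(s: pd.Series) -> pd.Series:
    if not isinstance(s.dtype, np.dtype):
        # Nullable integer / Arrow-backed dtypes would otherwise end up as object arrays.
        raise TypeError(f"Series with dtype {s.dtype} is not supported")
    result = np.sqrt(s.to_numpy())  # stand-in for the actual SciPy computation
    # Return the same container type, preserving the index.
    return pd.Series(result, index=s.index, name=s.name)

_series_example(pd.Series([1.0, 4.0, 9.0]))              # works, returns a Series
# _series_example(pd.Series([1, None], dtype="Int64"))   # would raise TypeError
```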
There's a lot to learn from the effort scikit-learn went through to support pandas dataframes better. See, e.g., the scikit-learn 1.2 release highlights showing the `set_output` feature to request a pandas dataframe as the return type.
Note: I'd like to not work this out in lots of detail here, because it will require time and that should not block progress on array library support. I just want to put it on the radar, because we do need to deal with it at some point; current treatment of dataframes is quite patchy.
Dispatching mechanism
For compiled code, other array types (whether CPU, GPU or distributed) are likely not going to work at all; the SciPy code is written for the NumPy C API. It's not impossible that some Cython code will work with other array types if those support the buffer protocol and the Cython code uses memoryviews - but that's uncommon (won't work at all on GPU, and PyTorch doesn't support the buffer protocol on CPU either).
There has been a lot of discussion on how this should work. The leading candidate is `uarray`, which we already use in `scipy.fft` (as do matching `fft` APIs in CuPy and PyFFTW) and which has other PRs pending in both SciPy and CuPy. However, there is also resistance to it because it involves a lot of complexity - perhaps too much. So significant work is needed to simplify it, or to switch to another mechanism. This is important work that has to be done, but I'd prefer not to mix it with this proposal.
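For reference, this is what the existing `uarray`-based mechanism in `scipy.fft` already looks like from the user side; it requires CuPy and a GPU, and uses CuPy's matching `cupyx.scipy.fft` module registered as a backend:

```python
import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft

x = cp.random.standard_normal(1024)
with scipy.fft.set_backend(cufft):
    # scipy.fft.fft dispatches to CuPy's implementation and returns a cupy.ndarray
    y = scipy.fft.fft(x)
```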
Whatever the mechanism, it should work transparently, such that `scipy.xxx.yyy(x)`, where `x` is a non-numpy array, dispatches to the library which implements the array type of `x`.
We have a uarray label in the issue tracker. See gh-14353 for the tracker with completed and pending implementations. For more context and real-world examples, see:
- https://labs.quansight.org/blog/pydata-extensibility-vision (and in particular the PyData dispatching system section which covers alternative dispatching options)
- https://labs.quansight.org/blog/making-gpus-accessible-to-pydata-ecosystem-via-array-api
- https://quansight-labs.github.io/array-api-demo/GW_Demo_Array_API.html
- https://data-apis.org/array-api/latest/use_cases.html
- https://github.com/numpy/archive/blob/main/other_meetings/2020-04-21-array-protocols-discussion-notes.md
- Note that there's work to be done to write SPEC-0002 on this dispatching topic, see also Discourse posts on that: here, here and here.
- Scikit-learn is working on a different kind of system for this, with explicit plugins to extend scikit-learn with support for other libraries/devices where it's using compiled code - see Path for pluggable low-level computational routines scikit-learn/scikit-learn#22438.
- PyTorch has a more robust backend system (https://pytorch.org/docs/stable/dynamo/custom-backends.html).
- I'm sure there are more examples of how to enable this kind of thing, inside and outside of Python. It's just a lot of engineering effort.