
DISC: nanoarrow-backed ArrowStringArray #58552

@WillAyd

Description


Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Wanted to open a formal issue around the possibility of using nanoarrow to back the ArrowStringArray class we have today. This could also help move pandas 3.x forward if we decide to drop the pyarrow requirement.

What is nanoarrow?
nanoarrow is a small, lightweight library used to generate data that follows the Arrow format specification. It can be used by libraries that want to work with Arrow but do not want to take on the dependency of the larger Arrow code base. The Arrow ADBC library is an example where this has already been used.
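
For a concrete sense of the scope, the project also ships Python bindings as the nanoarrow package. A minimal sketch of building Arrow-format string data without pyarrow, assuming a recent nanoarrow release (the exact Python API may differ by version):

import nanoarrow as na

# Build an Arrow-format utf-8 array from Python objects; no pyarrow involved.
# None marks a null value.
arr = na.c_array(["foo", "bar", None], na.string())
print(arr.schema.format)            # "u" is the Arrow format string for utf-8
print(arr.length, arr.null_count)   # 3 1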

How could we leverage nanoarrow within pandas?
Setting implementation details aside, a reasonable first use directly within our code base would be to change our existing ArrowStringArray class. Today its constructor looks something like:

def __init__(self, values) -> None:
    # pa and pc refer to pyarrow and pyarrow.compute, imported at module level
    _chk_pyarrow_available()
    if isinstance(values, (pa.Array, pa.ChunkedArray)) and pa.types.is_string(
        values.type
    ):
        values = pc.cast(values, pa.large_string())

    ...

In theory we could do something like:

def __init__(self, values) -> None:
    # Remember which backend constructed the data so later methods can dispatch
    self._uses_pyarrow = pa_installed()
    if self._uses_pyarrow:
        if isinstance(values, (pa.Array, pa.ChunkedArray)) and pa.types.is_string(
            values.type
        ):
            values = pc.cast(values, pa.large_string())
    else:
        values = NanoStringArray(values)

    ...
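
Note that pa_installed() above is not an existing pandas helper; a minimal sketch of what it could look like is below, although in practice we would probably reuse pandas' existing optional-dependency machinery (e.g. import_optional_dependency) instead of adding a new function:

from importlib.util import find_spec

def pa_installed() -> bool:
    # Hypothetical helper: report whether pyarrow is importable without
    # actually importing it
    return find_spec("pyarrow") is not None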

In each method, our internal ArrowStringArray would prioritize pyarrow algorithms if installed, but could fall back to our own functions implemented using nanoarrow (or raise if building such a function is impractical).

def _str_isalnum(self):
    if self._uses_pyarrow:
        result = pc.utf8_is_alnum(self._pa_array)
        return self._result_converter(result)

    # nanoarrow fallback
    return self._pa_array.isalnum()

This repurposes the internal self._pa_array to refer to any Arrow-format array, not necessarily one created by pyarrow. That could definitely be a point of confusion for a developer who is not aware of the distinction.
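
Purely for illustration, a hypothetical NanoStringArray fallback might look like the sketch below. It is written in Python for readability; in the actual proposal the kernel would live in a C++ extension built on nanoarrow/utf8proc and operate on the Arrow buffers directly.

class NanoStringArray:
    """Hypothetical fallback container; not part of pandas today."""

    def __init__(self, values):
        # values: iterable of str or None (None marks a null)
        self._values = list(values)

    def isalnum(self):
        # Element-wise str.isalnum with null propagation; a real implementation
        # would run over the Arrow validity/offset/data buffers in C++
        return [v.isalnum() if v is not None else None for v in self._values]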

Would this be a new dtype?
No, which is what makes this very distinct from other solutions to the pyarrow installation problem. Whether pyarrow is installed or we use nanoarrow behind the scenes, the theory is that we produce Arrow arrays and operate against them. Alternate solutions to the pyarrow installation problem start to direct users towards different data types; this one is merely an implementation detail.

It may be confusing that we named our data type "string[pyarrow]" when it could now be produced without pyarrow. Had we named it "string[arrow]" this would have been fully abstracted; with that said, I don't think it is worth changing.
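
To make that concrete, here is the user-facing surface today (this snippet requires pyarrow); under the proposal the dtype name and array class would stay the same even when nanoarrow constructs the underlying data:

import pandas as pd

s = pd.Series(["a", "b", None], dtype="string[pyarrow]")
print(s.dtype)                 # string[pyarrow]
print(type(s.array).__name__)  # ArrowStringArray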

Would we need to vendor nanoarrow?
No, assuming we drop setuptools support. Historically, when pandas has taken on third party C/C++ libraries, we have copied the files into our code base and maintained them from there. With Meson, we can instead leverage the Meson wrap system to fetch and build nanoarrow as a subproject at build time.

Does this require any new tooling within pandas?
Not really. The expectation is that the algorithms we need would be implemented in C++. pandas already requires a C++ compiler, and the libraries we would need to produce C++ extensions should be installable via Meson.

How do we know this could work for pandas?
I wrote a proof of concept for this a few months back - https://github.com/WillAyd/nanopandas

Getting that to work with pandas directly was challenging because pandas currently requires extension arrays (EAs) to subclass a Python class, which the C++ extension could not do. I do not expect this to be a problem if we use nanoarrow directly within our existing ArrowStringArray class instead of trying to register a new EA.

How fast will it be?
My expectation is that performance will fall somewhere in between where we are today and where pyarrow gets us. pyarrow offers a lot of optimizations, and the goal of this is not to try to match those. Users are still encouraged to install pyarrow; this would only be a fallback for cases where pyarrow installation is not feasible.

Is this a long term solution?
I don't think so. I really want us to align on leveraging all the great work that Arrow/pyarrow has to offer. I only consider this a stepping stone to get past our current 3.x bottleneck and move to a more Arrow-centric future, assuming:

  1. We continually encourage users to install pyarrow with pandas
  2. pyarrow installation becomes less of a concern for users over time (either by pyarrow getting smaller, container environments getting bigger, and/or legacy platforms dying off)

However, even if/when this nanoarrow-backed string code goes away, I do think there are not-yet-known future capabilities that could be built using the nanoarrow library introduced here.

How much larger would this make the pandas installation?

From the nanopandas POC project listed above, release artifacts show the following sizes for me locally:

  • nanobind static library - 376K
  • nanoarrow static library - 96K
  • utf8proc static library - 340K
  • nanopandas shared library - 748K

So overall I would expect a ~1.5 MB increase. (For those who care: utf8proc is the UTF-8 library Arrow uses, and nanobind is a library for binding C++ extensions to Python.)

What are the downsides?
As a team we have not historically created our own C++ extensions, and adding a new language to the mix is not something that should be taken lightly. The flip side is that we already have C extensions with the same maintenance concerns, so I am not really sure how to weigh this issue.

The library used to bridge Python and C++, nanobind, does not offer first-class support for Meson. I believe Meson can still handle this robustly given its wrap system, and it is something that has been discussed upstream in nanobind, but it is still worth calling out as a risk.

Feature Description

See above

Alternative Solutions

#58503

#58551

Additional Context

No response
