
String dtype: fix pyarrow-based IO + update tests #59478

Conversation

@jorisvandenbossche jorisvandenbossche commented Aug 11, 2024

(depends on #59479)

This fixes / removes xfails from the pyarrow-based IO methods (Parquet, Feather, ORC).

Fastparquet will require some update for properly supporting the string dtype on their side, so those tests are still xfailed specifically.

There is also one issue with the string dtype roundtrip. Given the distinct string reprs for the NaN and NA variants of the string dtype ("str" vs "string"), I had assumed that a DataFrame specifically using the NA-variant of the string dtype would roundtrip cleanly (as it does now). However, with the future string option enabled, it no longer does and instead comes back as the default NaN-variant.
PyArrow has multiple mechanisms to decide which pandas data type to use when converting a pyarrow.Table to pandas with to_pandas(). The relevant ones here are that it 1) looks at the pandas metadata stored in the Arrow schema metadata (for a column using the NA-variant string dtype, this metadata records that the column has a pandas "string" dtype, which is then used to correctly roundtrip in the arrow->pandas conversion), and 2) lets the user specify a pandas dtype to use for a certain pyarrow type via the types_mapper keyword (which we for example use to implement the dtype_backend="pyarrow" support returning pd.ArrowDtype()).
However, to support using the future default NaN-variant of the string dtype, we do pass a types_mapper mapping the pyarrow string type to the NaN string dtype, and this overrules the info in the metadata that would otherwise restore the NA string dtype.

In short, I don't think we can realistically solve this on our side (unless we start looking in the schema metadata to check whether we should pass the types_mapper keyword or not, but that is going to get complex), but I can propose upstream in pyarrow that it starts using the string dtype by default when the pandas option is enabled, and then we can stop specifying the types_mapper keyword for those recent pyarrow versions.

(sidenote: longer term, we should maybe consider moving some of that conversion logic from pyarrow to pandas)

xref #54792

@jorisvandenbossche added the Strings (String extension data type and string data) and IO Parquet (parquet, feather) labels on Aug 11, 2024
Comment on lines +30 to +31
pa.string(): pd.StringDtype(),
pa.large_string(): pd.StringDtype(),
jorisvandenbossche (Member, author):

This is to ensure that when dtype_backend="numpy_nullable" is passed, the user gets the NA-variant of the string dtype

@@ -672,6 +675,16 @@ def test_read_empty_array(self, pa, dtype):
df, pa, read_kwargs={"dtype_backend": "numpy_nullable"}, expected=expected
)

@pytest.mark.network
@pytest.mark.single_cpu
def test_parquet_read_from_url(self, httpserver, datapath, df_compat, engine):
jorisvandenbossche (Member, author):

This test is just moved down from the Base class to the TestBasic class (all tests that are generic and run for both engines live in that second class), while also fixing it so that the engine keyword is actually used in the read_parquet call.

Comment on lines +932 to +935
if using_infer_string:
check_round_trip(df, pa, expected=df.astype({"c": "str"}))
else:
check_round_trip(df, pa)
jorisvandenbossche (Member, author):

This is what I explained at length in the top post: although the original dataframe has "string" dtype (NA-variant), it currently comes back as "str" dtype (NaN-variant)

Member (reviewer):

Shouldn't that still be coming back as "string"? By default doesn't parquet just roundtrip the exact type?

jorisvandenbossche (Member, author):

See the long explanation in the top post of why this is currently not the case (Arrow<->Parquet roundtrips are exact for most types, but it's the pandas<->Arrow roundtrip where it goes wrong if pandas has several extension dtypes for the same Arrow type).

Member (reviewer):

OK thanks - that explanation makes sense technically, but I don't think this regression should be shipped. It's also going to be repeated with ADBC drivers.

Maybe our type mapping function just needs to be extended with that data, indicating whether it should be using string inference or whether the string type is well known?

jorisvandenbossche (Member, author):

I don't think this regression should be shipped

Well, I don't think there is any way around it, unfortunately.

With future.infer_string enabled, we definitely want someone who reads a generic Parquet file (one that was not written from a pandas DataFrame using the existing nullable string dtype) to end up with the future default string dtype. PyArrow currently gives us only one way to control this, which is by specifying a types_mapper argument. But that then means it overrules the stored pandas metadata, and so an NA string dtype is not roundtripped.

As far as I can see, we can't have it both ways.
Except if we were to essentially take over the conversion of the pyarrow Table to a pandas DataFrame, parse the pandas metadata from the schema, etc. That is something I think we should actually consider doing at some point, i.e. move all/most of the pandas compat code from the pyarrow project into pandas. But that's a bigger project, not for the coming months.

Now, as I also mentioned, currently the way to fix this is to ensure pyarrow does that for us (so we don't have to specify types_mapper by default), and then pyarrow will continue respecting the pandas metadata and roundtripping the string dtype.
I will make sure this is done for the next pyarrow release (scheduled for October). Given that the future.infer_string option will likely remain opt-in until then, I think that is fine (although of course even once pandas 3.0 is out, people using an older rather than the latest pyarrow version will still run into it).

jorisvandenbossche (Member, author):

Now, as I also mentioned, currently the way to fix this is to ensure pyarrow does that for us

Opened apache/arrow#43683 to track this

Member (reviewer):

But that then means it overrules the stored pandas metadata, and so a NA string dtype is not roundtripped.

Do you know exactly what metadata Parquet stores? I was under the (possibly false) impression that we store "str" or "string" in a Parquet file generated by pandas. If so, can't that be used to disambiguate between the two types?

Ultimately my concern goes back to one of the major points of PDEP-14, in that this is going to break backwards compatibility for users that have been using our StringDtype for the past 5 years, presumably to take them full circle after PDEP-16. Hoping to avoid that confusion.

jorisvandenbossche (Member, author):

I was under the (possibly false) impression that we would be storing "str" or "string" in a parquet file generated by pandas.

That's indeed what happens

If that is so, can't that be used to disambiguate between the two types?

Not if the user passes types_mapper (and the "user" here is pandas), because that overrides the mechanism of getting the resulting dtype from this stored metadata. Quoting myself from above:

PyArrow has multiple mechanisms to decide which pandas data type to use when converting a pyarrow.Table to pandas with to_pandas(). The relevant ones here are that it 1) looks at the pandas metadata stored in the Arrow schema metadata (for a column using the NA-variant string dtype, this metadata records that the column has a pandas "string" dtype, which is then used to correctly roundtrip in the arrow->pandas conversion), and 2) lets the user specify a pandas dtype to use for a certain pyarrow type via the types_mapper keyword (which we for example use to implement the dtype_backend="pyarrow" support returning pd.ArrowDtype()).
However, to support using the future default NaN-variant of the string dtype, we do pass a types_mapper mapping the pyarrow string type to the NaN string dtype, and this overrules the info in the metadata that would otherwise restore the NA string dtype.

The problem is that we specify a types_mapper by default to enable the default string dtype, and as I said, right now this can only realistically be solved in pyarrow, by ensuring that pyarrow uses pandas.StringDtype by default (for pandas>=3 or when the option is enabled) so that we don't have to specify types_mapper.

Ultimately my concern goes back to one of the major points of PDEP-14, in that this is going to break backwards compatibility

I am very well aware of that, and I only realized this issue while working on this PR (when we discussed this in the context of the PDEP, I also assumed that the different string aliases "string" and "str" would ensure such roundtripping kept working).

But as I said, I currently don't see any short-term solution. I will ensure this is fixed in pyarrow in the next release, and maybe for pandas 3.0 or 3.1 we could take over a big part of the pandas compat code that currently lives in pyarrow, which would let us properly fix this for all older pyarrow versions as well and would make similar issues easier to address in the future.
I do think that, as long as this is explicit opt-in (which it is for pandas 2.3), the solution in this PR is acceptable. And by the time pandas 3.0 is out and the change is enabled by default, we will ensure that there is no breaking change at least when using the latest pyarrow, and we still have time to find a solution that keeps it working for older pyarrow versions as well.

Member (reviewer):

Ah OK. I was missing the context that pyarrow was doing the type mapping, instead of us.

I do think that, as long as this is explicit opt-in (which it is for pandas 2.3), the solution in this PR is acceptable.

OK sure I'm convinced. I think for now too, users can at least force retention of the nullable type with dtype_backend="numpy_nullable", which they would be doing in other I/O methods anyway

- if isinstance(values, (pa.Array, pa.ChunkedArray)) and pa.types.is_string(
-     values.type
+ if isinstance(values, (pa.Array, pa.ChunkedArray)) and (
+     pa.types.is_string(values.type)
Member (reviewer):

Is there a reason why here we only test for string but in the dictionary-encoded case we allow both string and large_string?

jorisvandenbossche (Member, author):

This part of the diff went away now that #59479 is merged. But to answer your question: the reason is that we are checking here for the types that we are going to cast to large_string.
If the input is already large_string, we don't need to cast (although the cast would be a cheap zero-copy operation anyway, I think); only if it is string, or dictionary (of either string type), do we need to cast, to ensure the final array we store is always of Arrow type large_string.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review August 12, 2024 20:45
phofl (Member) left a comment:

lgtm

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024
@jorisvandenbossche jorisvandenbossche merged commit 487c585 into pandas-dev:main Aug 22, 2024
47 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-tests-io-pyarrow branch August 22, 2024 09:44
matiaslindgren pushed a commit to matiaslindgren/pandas that referenced this pull request Aug 25, 2024
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
Labels: backported, IO Parquet (parquet, feather), Strings (String extension data type and string data)