
String dtype: fix pyarrow-based IO + update tests #59478

Conversation

@jorisvandenbossche jorisvandenbossche commented Aug 11, 2024

(depends on #59479)

This fixes / removes xfails from the pyarrow-based IO methods (Parquet, Feather, ORC).

Fastparquet will require some update for properly supporting the string dtype on their side, so those tests are still xfailed specifically.

There is also one issue with the string dtype roundtrip. Given the distinct string reprs for the NaN and NA variants of the string dtype ("str" vs "string"), I had assumed that a DataFrame specifically using the NA-variant of the string dtype would roundtrip cleanly (as it does now). However, with the future string option enabled, it no longer does and instead comes back as the default NaN-variant.
PyArrow has multiple mechanisms to decide which pandas data type to use when converting a pyarrow.Table to pandas with to_pandas(). The relevant ones here are that it 1) looks at the pandas metadata stored in the Arrow schema metadata (for a column using the NA-variant string dtype, this metadata records that the column has a pandas "string" dtype, which is then used to correctly roundtrip in the arrow->pandas conversion), and 2) lets the user specify a pandas dtype to use for a certain pyarrow type via the types_mapper keyword (which we for example use to implement the dtype_backend="pyarrow" support returning pd.ArrowDtype()).
However, to support using the future default NaN-variant of the string dtype, we do pass a types_mapper mapping the pyarrow string type to the NaN string dtype, and this overrules the info in the metadata that would otherwise restore the NA string dtype.

In short, I don't think we can realistically solve this on our side (unless we start looking in the schema metadata to check whether we should pass the types_mapper keyword or not, but that is going to get complex), but I can propose upstream in pyarrow that it starts using the string dtype by default when the pandas option is enabled, and then we can stop specifying the types_mapper keyword for those recent pyarrow versions.

(sidenote: longer term, we should maybe consider moving some of that conversion logic from pyarrow to pandas)

xref #54792

@jorisvandenbossche added the Strings (String extension data type and string data) and IO Parquet (parquet, feather) labels on Aug 11, 2024
Comment on lines +30 to +31
pa.string(): pd.StringDtype(),
pa.large_string(): pd.StringDtype(),
jorisvandenbossche (Member, author):

This is to ensure that when dtype_backend="numpy_nullable" is passed, the user gets the NA-variant of the string dtype

@@ -672,6 +675,16 @@ def test_read_empty_array(self, pa, dtype):
df, pa, read_kwargs={"dtype_backend": "numpy_nullable"}, expected=expected
)

@pytest.mark.network
@pytest.mark.single_cpu
def test_parquet_read_from_url(self, httpserver, datapath, df_compat, engine):
jorisvandenbossche (Member, author):

This test is just moved down from the Base class to the TestBasic class (all tests that are generic and run for both engines live in that second class), while also fixing it so that the engine keyword is actually used in the read_parquet call.

Comment on lines +932 to +935
if using_infer_string:
check_round_trip(df, pa, expected=df.astype({"c": "str"}))
else:
check_round_trip(df, pa)
jorisvandenbossche (Member, author):

This is what I explained at length in the top post: although the original dataframe has "string" dtype (NA-variant), it currently comes back as "str" dtype (NaN-variant)

Member (reviewer):

Shouldn't that still be coming back as "string"? By default doesn't parquet just roundtrip the exact type?

jorisvandenbossche (Member, author):

See the long explanation in the top post of why this is currently not the case (Arrow<->Parquet roundtrips are exact for most types, but it's the pandas<->Arrow roundtrip where it goes wrong if pandas has several extension dtypes for the same Arrow type).

Member (reviewer):

OK thanks - that explanation makes sense technically, but I don't think this regression should be shipped. It's also going to be repeated with ADBC drivers.

Maybe our type mapping function just needs to be extended with that data, indicating whether it should be using string inference or whether the string type is well known?

jorisvandenbossche (Member, author):

I don't think this regression should be shipped

Well, I don't think there is any way around it, unfortunately.

With future.infer_string enabled, we definitely want someone who reads a generic Parquet file (one that was not written from a pandas DataFrame using the existing nullable string dtype) to end up with the future default string dtype. PyArrow currently gives us only one way to control this, which is by specifying a types_mapper argument. But that then means it overrules the stored pandas metadata, and so an NA string dtype is not roundtripped.

As far as I can see, we can't have it both ways.
Except if we were to essentially take over the conversion of the pyarrow Table to a pandas DataFrame, parse the pandas metadata from the schema, etc. That is something I think we should actually consider doing at some point, i.e. move all/most of the pandas compat code from the pyarrow project into pandas. But that's a bigger project, not for the coming months.

Now, as I also mentioned, currently the way to fix this is to ensure pyarrow does that for us (so we don't have to specify types_mapper by default), and then pyarrow will continue respecting the pandas metadata and roundtripping the string dtype.
I will make sure this is done for the next pyarrow release (scheduled for October). Given that the future.infer_string option will likely remain opt-in until then, I think that is fine (although of course even once pandas 3.0 is out, people using an older rather than the latest pyarrow version will still run into it).

jorisvandenbossche (Member, author):

Now, as I also mentioned, currently the way to fix this is to ensure pyarrow does that for us

Opened apache/arrow#43683 to track this

Member (reviewer):

But that then means it overrules the stored pandas metadata, and so a NA string dtype is not roundtripped.

Do you know exactly what metadata Parquet stores? I was under the (possibly false) impression that we store "str" or "string" in a Parquet file generated by pandas. If so, can't that be used to disambiguate between the two types?

Ultimately my concern goes back to one of the major points of PDEP-14, in that this is going to break backwards compatibility for users that have been using our StringDtype for the past 5 years, presumably to take them full circle after PDEP-16. Hoping to avoid that confusion.

jorisvandenbossche (Member, author):

I was under the (possibly false) impression that we would be storing "str" or "string" in a parquet file generated by pandas.

That's indeed what happens

If that is so, can't that be used to disambiguate between the two types?

Not if the user passes types_mapper (and the "user" here is pandas), because that overrides the mechanism of getting the resulting dtype from this stored metadata. Quoting myself from above:

PyArrow has multiple mechanisms to decide which pandas data type to use when converting a pyarrow.Table to pandas with to_pandas(). The relevant ones here are that it 1) looks at the pandas metadata stored in the Arrow schema metadata (for a column using the NA-variant string dtype, this metadata records that the column has a pandas "string" dtype, which is then used to correctly roundtrip in the arrow->pandas conversion), and 2) lets the user specify a pandas dtype to use for a certain pyarrow type via the types_mapper keyword (which we for example use to implement the dtype_backend="pyarrow" support returning pd.ArrowDtype()).
However, to support using the future default NaN-variant of the string dtype, we do pass a types_mapper mapping the pyarrow string type to the NaN string dtype, and this overrules the info in the metadata that would otherwise restore the NA string dtype.

The problem is that we specify a types_mapper by default to enable the default string dtype, and as I said, right now this can only realistically be solved in pyarrow, by ensuring that pyarrow uses pandas.StringDtype by default (for pandas>=3 or when the option is enabled) so that we don't have to specify types_mapper.

Ultimately my concern goes back to one of the major points of PDEP-14, in that this is going to break backwards compatibility

I am very well aware of that, and I only realized this issue while working on this PR (when we discussed this in the context of the PDEP, I also assumed that the different string aliases "string" and "str" would ensure such roundtripping kept working).

But as I said, I currently don't see any short-term solution. I will ensure this is fixed in pyarrow in the next release, and maybe for pandas 3.0 or 3.1 we could take over a big part of the pandas compat code that currently lives in pyarrow, which would let us properly fix this for all older pyarrow versions as well and would make similar issues easier to address in the future.
I do think that, as long as this is explicit opt-in (which it is for pandas 2.3), the solution in this PR is acceptable. And by the time pandas 3.0 is out and the change is enabled by default, we will ensure that there is no breaking change at least when using the latest pyarrow, and we still have time to find a solution that keeps it working for older pyarrow versions as well.

Member (reviewer):

Ah OK. I was missing the context that pyarrow was doing the type mapping, instead of us.

I do think that, as long as this is explicit opt-in (which it is for pandas 2.3), the solution in this PR is acceptable.

OK sure I'm convinced. I think for now too, users can at least force retention of the nullable type with dtype_backend="numpy_nullable", which they would be doing in other I/O methods anyway

- if isinstance(values, (pa.Array, pa.ChunkedArray)) and pa.types.is_string(
-     values.type
+ if isinstance(values, (pa.Array, pa.ChunkedArray)) and (
+     pa.types.is_string(values.type)
Member (reviewer):

Is there a reason why here we only test for string but in the dictionary-encoded case we allow both string and large_string?

jorisvandenbossche (Member, author):

This part of the diff went away now that #59479 is merged. But to answer your question: the reason is that we are checking here for the types that we are going to cast to large_string.
If the input is already large_string, we don't need to cast (although the cast would be a cheap zero-copy operation anyway, I think); only if it is string, or dictionary (of either string type), do we need to cast, to ensure the final array we store is always of Arrow type large_string.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review August 12, 2024 20:45
phofl (Member) left a comment:

lgtm

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024
@jorisvandenbossche jorisvandenbossche merged commit 487c585 into pandas-dev:main Aug 22, 2024
47 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-tests-io-pyarrow branch August 22, 2024 09:44
matiaslindgren pushed a commit to matiaslindgren/pandas that referenced this pull request Aug 25, 2024
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
Labels: backported, IO Parquet (parquet, feather), Strings (String extension data type and string data)