
BUG: partitioning parquet by pyarrow.date32 fails when reading #53008

Open
alippai opened this issue Apr 30, 2023 · 14 comments

Labels
Arrow (pyarrow functionality) · Bug · IO Parquet (parquet, feather) · Needs Discussion (requires discussion from core team before further action)

Comments

@alippai (Contributor) commented Apr 30, 2023:

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
s = pd.Series(pa.array([datetime.date.today(), datetime.date.today(), datetime.date.today()]), dtype='date32[pyarrow]') 
df = pd.DataFrame({'c1': s, 'c2': s})
pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=False), 'dataset', ['c1'])
ret = pd.read_parquet('dataset') # exception

Issue Description

When partitioning is used, the pyarrow date32 is written to the path and read back as a dictionary of strings instead of a dictionary of date32 types (or simply date32; I was surprised that dataset writing converts to a category type automatically). When trying to cast the string to date32, an exception is thrown.
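
The inferred type can be checked by reading the dataset back with plain pyarrow (a sketch; the expected output is based on pyarrow 11):

import pyarrow.parquet as pq

# the partition field is reconstructed from the file paths, not from the
# original date32 column
table = pq.read_table('dataset')
print(table.schema.field('c1').type)
# expected: dictionary<values=string, indices=int32, ordered=0>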

Expected Behavior

Something similar to this:

import datetime
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
s = pd.Series(pa.array([datetime.date.today(), datetime.date.today(), datetime.date.today()]), dtype='date32[pyarrow]') 
df = pd.DataFrame({'c1': s, 'c2': s})
t = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(t, 'dataset', ['c1'])
dataset = pq.ParquetDataset('dataset/', schema=t.schema)
ret = dataset.read().to_pandas()

This returns the original DataFrame.

Installed Versions

pandas : 2.0.1
pyarrow : 11.0.0
alippai added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Apr 30, 2023

@alippai (Contributor, Author) commented Apr 30, 2023:

There is a suggestion in apache/arrow#22510 that the pandas metadata could be used to specify the schema needed for the pyarrow Dataset.

@alippai (Contributor, Author) commented Apr 30, 2023:

I'm not familiar with the pandas read_parquet() implementation yet, but something like this could work:

import pyarrow.parquet as pq

dataset_pre = pq.ParquetDataset('dataset/')
pandas_columns = dataset_pre.schema.pandas_metadata['columns']
columns_to_cast = [c for c in pandas_columns
                   if c['field_name'] in dataset_pre.partitioning.schema.names]

correct_schema = dataset_pre.schema
for column in columns_to_cast:
    i = correct_schema.get_field_index(column['field_name'])
    # convert_pandas_to_pyarrow is a placeholder for a helper that maps the
    # stored pandas dtype string back to a pyarrow DataType
    new_type = convert_pandas_to_pyarrow(column['numpy_type'])
    correct_schema = correct_schema.set(i, correct_schema.field(i).with_type(new_type))

dataset_final = pq.ParquetDataset('dataset/', schema=correct_schema)
table = dataset_final.read()
df = table.to_pandas()

Would you be interested in changing the PyArrowImpl to use something like the above or should this be part of pyarrow? I can draft a PR if you think this is desired.

@alippai (Contributor, Author) commented Apr 30, 2023:

@phofl @jorisvandenbossche @mroeschke I see you are the main contributors of this code

@rhshadrach (Member) commented May 2, 2023:

For the partitioning column, I believe the data is not stored in the parquet files themselves. When reading back in, the column is reconstructed from the file paths instead, so roundtripping with dtypes is not possible.

In any case, this is a pyarrow issue, not a pandas one.

rhshadrach added the IO Parquet (parquet, feather), Closing Candidate (may be closeable, needs more eyeballs) and Arrow (pyarrow functionality) labels on May 2, 2023

@alippai (Contributor, Author) commented May 2, 2023:

@rhshadrach pandas stores metadata info in the parquet files, and this extra data makes it possible to restore the schema and dtypes (the sketch below shows what that metadata looks like). Since it needs pandas-specific info, it's debatable which of the two (pyarrow or pandas) should implement it.

This is the same suggestion as @jorisvandenbossche had in the linked JIRA issue.
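
For reference, that metadata can be inspected directly on the table from the example above (the exact field contents shown are illustrative):

import pyarrow as pa

# pyarrow attaches pandas-specific metadata when converting from pandas;
# it records the original dtype of every column, including the partition
# column 'c1'
table = pa.Table.from_pandas(df, preserve_index=False)
print(table.schema.pandas_metadata['columns'])
# each entry looks roughly like:
# {'name': 'c1', 'field_name': 'c1', ..., 'numpy_type': 'date32[day][pyarrow]', ...}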

@rhshadrach (Member) commented:

> pandas stores metadata info in the parquet files

I am no expert here, so correct me if this is wrong, but I believe this is done on the pyarrow side and not within pandas.

@alippai (Contributor, Author) commented May 3, 2023:

@rhshadrach looks like it is there. I guess in this case the only question is whether it's acceptable that storing and loading a pandas DataFrame raises an exception. Note that storing and loading the pyarrow Table has no issues.

Regardless, this can be closed, adding the notes to the pyarrow issue now.

@jorisvandenbossche (Member) commented:

I think there are multiple aspects that interact:

  • When reading a partitioned dataset, the column that originates from the partitioning (embedded in the file paths) is indeed reconstructed from the strings. PyArrow itself does only very little type inference when parsing those strings (I think just numbers or strings).
  • The default for this partitioning field is to reconstruct it as a dictionary array (categorical), because it is always highly repetitive (one value for all rows that come from the same file). This can be turned off in pyarrow (see the sketch after this list); I would have to check whether you can also do this through pd.read_parquet.
  • When converting a pandas DataFrame to a pyarrow Table (when writing), pyarrow does store metadata about the original pandas data types, and this metadata already gets used on the conversion back from pyarrow to pandas.
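
A sketch of turning the dictionary default off, assuming pd.read_parquet (with the pyarrow engine) forwards extra keyword arguments to pyarrow.parquet.read_table: declaring the partition schema explicitly avoids both the string inference and the dictionary encoding.

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

# declare the partition field's type up front instead of letting pyarrow
# infer (and dictionary-encode) it from the file paths
part = ds.partitioning(pa.schema([('c1', pa.date32())]), flavor='hive')
ret = pd.read_parquet('dataset', partitioning=part)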

And it's the combination of those three items that results in having a dictionary-typed column with string categories (the result of the first two bullet points) that we then try to convert to pd.ArrowDtype("date32[day]") (because that is what the stored metadata says the original was; the result of the third bullet point).

And then it's actually the pandas implementation of this conversion (in ArrowDtype.__from_arrow__) that fails: pandas tries to convert the pyarrow array it gets passed (the dictionary) to the pyarrow type of the ArrowDtype (date32) using pyarrow's cast function, and this cast fails with "Cast type date32[day] incompatible with dictionary type string" (which is a difficult way of saying "casting from dictionary with string values to date32[day] is not implemented").
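
For illustration, a minimal reproduction of that failing cast outside of pandas (a sketch; behaviour as observed with pyarrow 11):

import pyarrow as pa

# the partition column arrives as dictionary<values=string, indices=int32>
dict_arr = pa.array(['2023-04-30', '2023-04-30']).dictionary_encode()
# the cast that ArrowDtype.__from_arrow__ effectively attempts; with
# pyarrow 11 this raises ArrowNotImplementedError
dict_arr.cast(pa.date32())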

jorisvandenbossche removed the Closing Candidate (may be closeable, needs more eyeballs) label on May 3, 2023

@rhshadrach (Member) commented:

@jorisvandenbossche - is there something that should be done here on the pandas side?

rhshadrach added the Needs Discussion (requires discussion from core team before further action) label on May 3, 2023

@jorisvandenbossche (Member) commented May 3, 2023:

I was still contemplating that; it might depend on the exact behaviour we would prefer in practice:

  • If we want the dictionary-of-strings array to actually get converted to date32, this conversion could be fixed on the pandas side within __from_arrow__, by special-casing dictionary input: decoding the dictionary first, or casting only its dictionary values (see the sketch after this list).
  • Another option could be that on the pyarrow side, if the conversion of a column via __from_arrow__ (when inferred from the stored pandas metadata) fails, pyarrow falls back to the normal, default conversion (as if there were no metadata). A potential issue is that this might also hide errors that ideally bubble up to the user.
  • If we want the result to be a dictionary of date32, then in theory pyarrow could know that if the original pandas metadata indicated a non-dictionary type, it should cast the dictionary values to that type and then create a pandas.Categorical from it. But that might get a bit complicated: at the point of converting the pyarrow.Table to a pandas.DataFrame, pyarrow no longer knows that a certain dictionary-typed column originates from a partitioning field, so it's difficult to do something specific to this use case.
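
A hypothetical sketch of the first option (the helper name and the decode-then-cast strategy are illustrative, not existing pandas internals):

import pyarrow as pa

def cast_with_dictionary_fallback(arr, target):
    # special-case dictionary input: look up each index in the dictionary
    # to decode back to plain values before casting
    if pa.types.is_dictionary(arr.type):
        arr = arr.dictionary.take(arr.indices)
    # assumes pyarrow can cast the plain value type to the target,
    # e.g. ISO-formatted strings to date32
    return arr.cast(target)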

@rhshadrach (Member) commented May 4, 2023:

As a user, I desire round tripping regardless of whether partitions are used or not. For example:

df1 = pd.DataFrame({'a': [1 ,1, 2], 'b': [3, 4, 5]})
df2 = pd.DataFrame({'a': ['1', '1', '2'], 'b': [3, 4, 5]})

If I call df1.to_parquet('test.parquet', partition_cols=['a']); pd.read_parquet('test.parquet'), then I get a DataFrame with a as int64; likewise, doing the same with df2 gives a as object containing strings. Similarly for any other dtype that is supported by parquet and has a faithful repr.

I personally do not find getting dictionary array / Categorical for partition columns to be a feature.
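
Spelled out as a test, the invariant being asked for would be something like this (hypothetical; with pandas 2.0 / pyarrow 11 the assertion fails because 'a' comes back as categorical):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
df1.to_parquet('test_dataset', partition_cols=['a'])
result = pd.read_parquet('test_dataset')
# desired: the partition column round-trips with its original dtype
assert result['a'].dtype == df1['a'].dtype  # today: category, not int64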

@alippai (Contributor, Author) commented May 4, 2023:

I don't have a strong opinion on categorical vs non-categorical (though if converting by default, a flag to disable it is always nice). The performance improvements are usually non-negligible, so the extra conversion makes sense.

The int vs string vs date issue is more annoying (mainly because it raises an exception). I've just realized that I had similar issues with pyarrow partition filters as well (the behavior has changed in recent versions): apache/arrow#34727

@jorisvandenbossche (Member) commented:

> If I call df1.to_parquet('test.parquet', partition_cols=['a']); pd.read_parquet('test.parquet') then I get a DataFrame with a as int64 and likewise doing the same with df2 gives a as object containing strings. Similarly for any other dtype that is supported by parquet and has a faithful repr.

Hmm, that's not what I see:

In [8]: df1.to_parquet('test.parquet', partition_cols=['a']); df1_roundtrip = pd.read_parquet('test.parquet')

In [9]: df1_roundtrip.dtypes
Out[9]: 
b       int64
a    category
dtype: object

In [10]: df1_roundtrip["a"].dtype
Out[10]: CategoricalDtype(categories=[1, 2], ordered=False)

In [11]: df1_roundtrip["a"].dtype.categories
Out[11]: Int64Index([1, 2], dtype='int64')

So the categories are correctly converted back to integers, but the column itself is categorical.

(above is with pyarrow 11.0.0 and pandas 1.5.3)

@rhshadrach (Member) commented:

> Hmm, that's not what I see:

@jorisvandenbossche: #53008 (comment) was my desired behavior, not the current state. But I'd be more than happy with a flag to disable categorical conversion, and I'm even okay with being stuck with categories. It's the dtype conversion (string -> int), especially the stripping of leading 0s, that is a pain point for me.

> So the categories are correctly converted back to integers, but the column itself is categorical.

The partition values of df2 are also converted to integers, even though they start out as strings.
