Parquet metadata persistence of DataFrame.attrs #54346

Merged (8 commits) on Aug 3, 2023


SanjithChockan
Contributor

@SanjithChockan SanjithChockan commented Aug 1, 2023

DataFrame.attrs does not seem to be attached to the pyarrow schema metadata, since attrs is experimental. I added it to the pyarrow table (schema.metadata) so that it persists when the parquet file is read back. One issue I am facing: the DataFrame.attrs dictionary can't keep int keys, because encoding it for pyarrow converts them to strings. I'm not sure whether this is a problem.

@SanjithChockan SanjithChockan changed the title Parquet metadata persistence Parquet metadata persistence of DataFrame.attrs Aug 1, 2023
@xiki-tempula
Contributor

xiki-tempula commented Aug 1, 2023

On one hand, this is not ideal, as the round trip is not complete. On the other hand, in the current implementation, if a column name of a pandas dataframe is not a string, it is converted to a string when writing to a parquet file, so this is still consistent with other behaviours.

Similar to the existing caveat in the documentation ("Duplicate column names and non-string columns names are not supported."), we just need to add a note saying that non-string attrs keys are not supported and will be converted to strings.

@mroeschke mroeschke added the "IO Parquet" (parquet, feather) and "metadata" (_metadata, .attrs) labels Aug 1, 2023
@SanjithChockan
Contributor Author

"On one hand, this is not ideal as the round trip is not complete."

I'm not sure I understand. Could you explain this?

@xiki-tempula
Contributor

"One issue I am facing is DataFrame.attrs dictionary can't have int as keys as encoding it for pyarrow converts it to string, but not sure if this is a problem."

So I would imagine that if I do

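import pandas as pd
# Any small frame will do; the point is the int key in attrs.
df = pd.DataFrame({'a': [1]})
df.attrs = {1: 1}
df.to_parquet('test.p')
new_df = pd.read_parquet('test.p')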

then new_df.attrs ({'1': 1}) would not be the same as df.attrs ({1: 1}). This is what I mean by round trip: you write to a parquet file and then read it back.
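
The same key coercion can be seen without pandas or parquet at all. The sketch below assumes a JSON-style encoding of attrs, which is an assumption made for illustration; the thread only says that encoding for pyarrow turns the keys into strings.

```python
import json

attrs = {1: 1}

# JSON object keys are always strings, so the int key is coerced on encode.
encoded = json.dumps(attrs)
decoded = json.loads(encoded)

print(encoded)           # {"1": 1}
print(decoded == attrs)  # False: the key is now the string "1"
```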

@SanjithChockan
Contributor Author

Oh yeah. I can't really think of another approach other than casting keys back to int where possible, but that doesn't seem feasible.
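
One way to see why casting back is unreliable: once the keys have been stringified, an original int key and an original string key look identical. `restore_int_keys` below is a hypothetical helper written for this illustration, not something in pandas or in this PR.

```python
def restore_int_keys(attrs):
    """Hypothetical helper: cast keys that parse as integers back to int."""
    restored = {}
    for key, value in attrs.items():
        try:
            restored[int(key)] = value
        except (TypeError, ValueError):
            # Keys that do not parse as integers are kept unchanged.
            restored[key] = value
    return restored


# Both {1: "x"} and {"1": "x"} are stored as {"1": "x"} after stringification,
# so the helper cannot tell which one the user originally wrote.
stored = {"1": "x", "units": "m"}
print(restore_int_keys(stored))
```

If the original attrs had used the string key "1", this helper would silently turn it into the int 1, which trades one round-trip mismatch for another.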

@xiki-tempula
Contributor

@SanjithChockan I think it would be fine. The attrs keys are not the only thing that gets converted to strings when saving to a parquet file.

@SanjithChockan
Contributor Author

I'll fix the failing checks and wait for someone else to review to see if this approach is okay

pandas/io/parquet.py (outdated, resolved)
pandas/io/parquet.py (outdated, resolved)
@@ -176,8 +176,8 @@ Other enhancements
- Performance improvement in :func:`concat` with homogeneous ``np.float64`` or ``np.float32`` dtypes (:issue:`52685`)
- Performance improvement in :meth:`DataFrame.filter` when ``items`` is given (:issue:`52941`)
- Reductions :meth:`Series.argmax`, :meth:`Series.argmin`, :meth:`Series.idxmax`, :meth:`Series.idxmin`, :meth:`Index.argmax`, :meth:`Index.argmin`, :meth:`DataFrame.idxmax`, :meth:`DataFrame.idxmin` are now supported for object-dtype objects (:issue:`4279`, :issue:`18021`, :issue:`40685`, :issue:`43697`)
- Added ``PANDAS_ATTRS`` to :attr:`Schema.metadata` in :class:`PyArrowImpl` for parquet metadata persistence using pyarrow engine (:issue:`54346`)
Member

Suggested change
- Added ``PANDAS_ATTRS`` to :attr:`Schema.metadata` in :class:`PyArrowImpl` for parquet metadata persistence using pyarrow engine (:issue:`54346`)
- :meth:`DataFrame.to_parquet` and :func:`read_parquet` will now write and read ``attrs`` respectively (:issue:`54346`)

Contributor Author

done!

@mroeschke mroeschke added this to the 2.1 milestone Aug 3, 2023
@mroeschke mroeschke merged commit 152595c into pandas-dev:main Aug 3, 2023
30 of 32 checks passed
@mroeschke
Member

Thanks @SanjithChockan

@martindurant
Contributor

fastparquet here - would have appreciated at least some notification of this.

dask/fastparquet#900

@mroeschke
Member

Ah sorry @martindurant. The original request mentioned pyarrow, so fastparquet slipped through the review. Happy to have PRs to add this to the fastparquet engine too!

@aufdenkampe

aufdenkampe commented Nov 21, 2023

@mroeschke, does this PR include the ability to read/write column-level metadata from pandas.Series.attrs? This feature is particularly important to environmental data scientists who need to keep track of column/variable metadata fields such as:

  • long name
  • description
  • units

It's also a data sharing requirement of FAIR Data Principles.

@mroeschke
Member

I believe not, no. This PR only supports attrs from DataFrame

Labels
IO Parquet parquet, feather metadata _metadata, .attrs

5 participants