-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-42168: [Python][Parquet] Pyarrow store decimal as integer #42169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Added store_decimal_as_integer
Added store_decimal_as_integer
Added store_decimal_as_integer
store_decimal_as_integer WIP
store_decimal_as_integer completed
Removed list option from store_decimal_as_integer
|
|
Update tests to use decimal python types
Fixed variable typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General LGTM
BTW, You can follow some logic here: https://github.com/apache/arrow/pull/38360/files
Linting fix
Linting fix
Last linting nit
Remove extra whitespace
|
Sorry for the delay, I'll take a around and try that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for quick iterations @bkief !
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
raulcd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 0 experience on how we are testing parquet from Python. This PR seems to work and we can write and read to/from parquet when we set this new property.
The test does not validate that the stored value is indeed using int but I am not sure if this is already covered on the C++ tests.
I guess my only concern is that we are not really validating that the functionality is doing anything new, we only test we are able to set the flag and writing / reading.
That's why I can't give a +1 (will wait for others with more experience on this part of the code).
Maybe a way is check the parquet schema and verify the column schema is Integer ( https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ColumnSchema.html#pyarrow.parquet.ColumnSchema ). |
That's a good idea, at least we can check that there is a difference on the schema between setting the flag to True or False. |
@raulcd I did consider directly inspecting the schema, and maybe this is better for directness, this but I found another way to test the same. The test with delta encoding ensures the fields are stored as integers, if not the writer would throw an error as |
Thanks! that sounds good to me, could you add a comment to the test to explain the above to make it more explicit? |
* Added explicent physical_type test of parquet schema * Fix whitespace * Fix whitespace
|
I just went ahead and added the explicit test. It would be a good pattern to have for future tests |
|
@raulcd should we better set this a blocker? I'll merge this if all CI pass and no negative comments in 6.30 night |
Looks like there was some filesytem permission errors in the AppVeyor CI. I'll try a different pattern to see if that resolves it |
Change tempdir approach
|
CI failed is unrelated ( Maybe ORC related), merged |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit edfa343. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
This PR exposes the store_decimal_as_integer parquet writer property to pyarrow
Rationale for this change
This will allow storing fixed-point decimals as integers in the parquet format and take advantage of more efficient storage codecs
What changes are included in this PR?
Pyarrow parquet writer and related Cython
Are these changes tested?
Tests were included for the new parameter
Are there any user-facing changes?
Docstrings were updated