Skip to content

Conversation

@bkief
Copy link
Contributor

@bkief bkief commented Jun 17, 2024

This PR exposes the store_decimal_as_integer parquet writer property to pyarrow

Rationale for this change

This will allow storing fixed-point decimals as integers in the parquet format and take advantage of more efficient storage codecs

What changes are included in this PR?

Pyarrow parquet writer and related Cython

Are these changes tested?

Tests were included for the new parameter

Are there any user-facing changes?

Docstrings were updated

bkief added 9 commits June 16, 2024 11:48
Added store_decimal_as_integer
Added store_decimal_as_integer
Added store_decimal_as_integer
store_decimal_as_integer WIP
store_decimal_as_integer completed
Removed list option from store_decimal_as_integer
@github-actions
Copy link

⚠️ GitHub issue #42168 has been automatically assigned in GitHub to PR creator.

bkief added 3 commits June 16, 2024 20:46
Update tests to use decimal python types
Fixed variable typo
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM
BTW, You can follow some logic here: https://github.com/apache/arrow/pull/38360/files

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 17, 2024
bkief added 3 commits June 17, 2024 11:21
Linting fix
Last linting nit
@bkief bkief requested a review from mapleFU June 17, 2024 17:27
@mapleFU mapleFU requested a review from AlenkaF June 17, 2024 17:35
Remove extra whitespace
@bkief
Copy link
Contributor Author

bkief commented Jun 25, 2024

@mapleFU or @AlenkaF - Can we merge this before the July 1 v17 cutoff?

@mapleFU
Copy link
Member

mapleFU commented Jun 25, 2024

Sorry for the delay, I'll take a around and try that

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 25, 2024
Copy link
Member

@rok rok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for quick iterations @bkief !

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Jun 27, 2024
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 0 experience on how we are testing parquet from Python. This PR seems to work and we can write and read to/from parquet when we set this new property.
The test does not validate that the stored value is indeed using int but I am not sure if this is already covered on the C++ tests.
I guess my only concern is that we are not really validating that the functionality is doing anything new, we only test we are able to set the flag and writing / reading.
That's why I can't give a +1 (will wait for others with more experience on this part of the code).

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jun 28, 2024
@mapleFU
Copy link
Member

mapleFU commented Jun 28, 2024

I guess my only concern is that we are not really validating that the functionality is doing anything new, we only test we are able to set the flag and writing / reading.

Maybe a way is check the parquet schema and verify the column schema is Integer ( https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ColumnSchema.html#pyarrow.parquet.ColumnSchema ).

@raulcd
Copy link
Member

raulcd commented Jun 28, 2024

Maybe a way is check the parquet schema and verify the column schema is Integer ( https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ColumnSchema.html#pyarrow.parquet.ColumnSchema ).

That's a good idea, at least we can check that there is a difference on the schema between setting the flag to True or False.

@bkief
Copy link
Contributor Author

bkief commented Jun 28, 2024

I have 0 experience on how we are testing parquet from Python. This PR seems to work and we can write and read to/from parquet when we set this new property. The test does not validate that the stored value is indeed using int but I am not sure if this is already covered on the C++ tests. I guess my only concern is that we are not really validating that the functionality is doing anything new, we only test we are able to set the flag and writing / reading. That's why I can't give a +1 (will wait for others with more experience on this part of the code).

@raulcd I did consider directly inspecting the schema, and maybe this is better for directness, this but I found another way to test the same. The test with delta encoding ensures the fields are stored as integers, if not the writer would throw an error as DELTA_BINARY_PACKED requires the columns to be of type INT32 or INT64.

@raulcd
Copy link
Member

raulcd commented Jun 28, 2024

@raulcd I did consider directly inspecting the schema, and maybe this is better for directness, this but I found another way to test the same. The test with delta encoding ensures the fields are stored as integers, if not the writer would throw an error as DELTA_BINARY_PACKED requires the columns to be of type INT32 or INT64.

Thanks! that sounds good to me, could you add a comment to the test to explain the above to make it more explicit?

* Added explicent physical_type test of parquet schema

* Fix whitespace

* Fix whitespace
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 28, 2024
@bkief
Copy link
Contributor Author

bkief commented Jun 28, 2024

I just went ahead and added the explicit test. It would be a good pattern to have for future tests

@mapleFU
Copy link
Member

mapleFU commented Jun 28, 2024

@raulcd should we better set this a blocker? I'll merge this if all CI pass and no negative comments in 6.30 night

@bkief
Copy link
Contributor Author

bkief commented Jun 28, 2024

I just went ahead and added the explicit test. It would be a good pattern to have for future tests

Looks like there was some filesytem permission errors in the AppVeyor CI. I'll try a different pattern to see if that resolves it

@mapleFU mapleFU merged commit edfa343 into apache:main Jul 1, 2024
@mapleFU mapleFU removed the awaiting change review Awaiting change review label Jul 1, 2024
@mapleFU
Copy link
Member

mapleFU commented Jul 1, 2024

CI failed is unrelated ( Maybe ORC related), merged

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit edfa343.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants