-
Notifications
You must be signed in to change notification settings - Fork 941
Improve performance of reading int8/int16 Parquet data #7055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Benchmarks for integer conversion. int8 details
int16 details
full details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl -- These are some nice speedups ❤️
I pulled the benchmarks out into a separate PR so I can re-run the benchmarks so I can confirm the results. This PR is looking very nice
parquet/benches/arrow_reader.rs
Outdated
@@ -1280,6 +1292,18 @@ fn add_benches(c: &mut Criterion) { | |||
let string_list_desc = schema.column(14); | |||
let mandatory_binary_column_desc = schema.column(15); | |||
let optional_binary_column_desc = schema.column(16); | |||
let mandatory_uint8_column_desc = schema.column(27); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own convenience I pulled the benchmark into its own PR to make it easier to compare this branch to main:
🤖 |
🤖: Benchmark completed Details
|
Thank you @etseidl. This is a great speedup!. Also the changed behavior matches that of |
These are some pretty sweet speedups. Thanks again @etseidl |
Thanks for the review @alamb! |
@@ -261,6 +262,45 @@ where | |||
// - date64: cast int32 to date32, then date32 to date64. | |||
// - decimal: cast int32 to decimal, int64 to decimal | |||
let array = match target_type { | |||
// Using `arrow_cast::cast` has been found to be very slow for converting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there other conversions that can avoid the type cast? i saw this being expensive in some other benchmarks as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I briefly looked at others and it wasn't super obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what benchmark, btw? I am curious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several of the clickbench queries (not sure what data types, but it was spending like 20% of samples in casting during parquet reading).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to burrow down into that code again early next week and see if there are any other obvious candidates. I did try signed->unsigned for 32 and 64 bit ints and there was no difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several of the clickbench queries (not sure what data types, but it was spending like 20% of samples in casting during parquet reading).
FWIW many of the clickbench columns are Int16, as I found when working on #7470.
I started running some benchmarks on a draft update to parquet in this PR (hopefully it will show some improvements)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try signed->unsigned for 32 and 64 bit ints and there was no difference.
Ahh, the reason for this is that I32/64->U32/64 is handled above (around L171). I would think anything that falls through and relies on arrow_cast::cast
is going to be potentially slow due to use of unary_opt
, but a quick glance at the decimal code looks like it will figure out which casts are infallible and use unary
instead. Perhaps other conversions do a similar optimization.
It might be worth exploring enumerating all of the allowed Parquet physical to logical type mappings and account for them here and not rely on arrow_cast
machinery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which issue does this PR close?
Rationale for this change
While investigating #7040 it was found that reading Parquet files with int8/int16 columns was slower than expected. Avoiding the use of
arrow_cast::cast
for these columns and instead directly casting usingPrimitiveArray::unary
is much faster.What changes are included in this PR?
Modifies
PrimitiveArrayReader
to explicitly handle conversion of Parquet physical type INT32 to Arrow (u)int8/(u)int16.This PR also includes additions to the arrow_reader benchmark.
Are there any user-facing changes?
No API changes, but there is a change in behavior. Before, improperly encoded columns would return nulls upon being read, whereas now the columns will be read and truncated to the proper bitwidth. For example,
238u8
might be encoded as0xffffffee
rather than0x000000ee
.arrow_cast::cast
will returnNone
for this conversion, this PR will instead return238u8
.