[SPARK-6195] [SQL] Adds in-memory column type for fixed-precision decimals #4938
Conversation
Test build #28367 has started for PR 4938 at commit
Test build #28367 has finished for PR 4938 at commit
Test PASSed.
```scala
// The first 4 bytes in the buffer indicate the column type. This field is not used now,
// because we always know the data type of the column ahead of time.
dup.getInt()
```
Maybe this line is not necessary any more.
This call has a side effect; we still need it to consume those 4 bytes.
However, we can remove this line after removing the whole column type ID stuff.
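To make the side effect concrete, here is a minimal sketch using a plain JDK `ByteBuffer` (hypothetical illustration, not the actual Spark columnar code): the 4-byte type ID at the head of the buffer must be consumed even when its value is ignored, otherwise later reads start at the wrong offset.

```scala
import java.nio.ByteBuffer

// Hypothetical layout: a 4-byte column type ID followed by the column data.
val buffer = ByteBuffer.allocate(8)
buffer.putInt(5)   // column type ID (value unused when the type is known)
buffer.putInt(42)  // first actual data value
buffer.rewind()

val dup = buffer.duplicate()
dup.getInt()             // side effect only: advances the position past the type ID
val first = dup.getInt() // now positioned at the real data
```

Dropping the first `getInt()` call would make `first` read the type ID instead of the data, which is why the line must stay until the type ID itself is removed from the format.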
Test build #28391 has started for PR 4938 at commit
Test build #28391 has finished for PR 4938 at commit
Test PASSed.
We only enable the specialized column type when the precision is less than 19.
Added micro benchmark result. Notice that explicit casting is required because
Test build #28608 has started for PR 4938 at commit
/cc @yhuai, this should be helpful for the TPC-DS benchmark. Gonna merge this once Jenkins nods.
Test build #28608 has finished for PR 4938 at commit
Test PASSed.
This PR adds a specialized in-memory column type for fixed-precision decimals.
For all other column types, a single integer column type ID is enough to determine which column type to use. However, this doesn't apply to fixed-precision decimal types, which carry varying precision and scale parameters. Moreover, under the previous design there was no trivial way to encode precision and scale information into the columnar byte buffer. On the other hand, we always know the data type of the column to be built / scanned ahead of time. This PR therefore no longer uses the column type ID to construct `ColumnBuilder`s and `ColumnAccessor`s, but resorts to the actual column data type. In this way, we can pass precision / scale information along the way. The column type ID is no longer used and can be removed in a future PR.
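The dispatch described above can be sketched roughly as follows. This is a self-contained, hypothetical model (the trait and case names below are stand-ins, not the actual Catalyst classes): selecting a column type from the data type lets the decimal case carry its precision and scale, and the precision < 19 guard matches the constraint mentioned earlier, since an unscaled value of at most 18 digits fits in a `Long`.

```scala
// Hypothetical stand-ins for the Catalyst data types and columnar types.
sealed trait DataType
case object IntegerType extends DataType
case class DecimalType(precision: Int, scale: Int) extends DataType

sealed trait ColumnType
case object INT extends ColumnType
case class FIXED_DECIMAL(precision: Int, scale: Int) extends ColumnType
case object GENERIC extends ColumnType // fallback, serialized generically

// Choose a column type from the data type instead of an integer type ID,
// so precision / scale travel with the decimal case.
def columnTypeFor(dt: DataType): ColumnType = dt match {
  case IntegerType                 => INT
  // Specialize only when the unscaled value fits in a Long (precision < 19).
  case DecimalType(p, s) if p < 19 => FIXED_DECIMAL(p, s)
  case _                          => GENERIC
}
```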
Micro benchmark result
The following micro benchmark builds a simple table with 2 million decimals (precision = 10, scale = 0), caches it in memory, then counts all the rows. Code (simply paste it into Spark shell):
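The original snippet is not preserved in this copy of the page. A hypothetical reconstruction of what such a benchmark might look like is sketched below, assuming a Spark 1.3-era shell where `sc` and `sqlContext` are predefined; it is not the author's exact code.

```scala
// Hypothetical sketch: requires a running Spark shell (sc, sqlContext).
import org.apache.spark.sql.types._
import sqlContext.implicits._

val df = sc.parallelize(1 to 2000000)
  .map(i => Tuple1(i.toLong))
  .toDF("i")
  // The explicit cast selects the decimal column type; without it the
  // column would simply be LongType.
  .select($"i".cast(DecimalType(10, 0)).as("d"))

df.cache().count() // first count materializes the in-memory columnar buffers
df.count()         // subsequent scans read the cached buffers
```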
With `FIXED_DECIMAL` column type:

Without `FIXED_DECIMAL` column type: