perf: improve decimal read performance in CometVector #756

Closed · wants to merge 9 commits

Conversation

parthchandra (Contributor)

Which issue does this PR close?

Part of #679 and #670

Rationale for this change

Improves the performance of reading decimals in CometVector.

What changes are included in this PR?

There are two changes:

  1. For values with precision up to Decimal.MAX_LONG_DIGITS we read the values into a long rather than into a BigInteger/BigDecimal.
  2. When reading the decimal buffer from native to the JVM, we now read the entire buffer at once instead of one value at a time. We were incurring a bounds check in JVM code for every read from native; that cost is now paid once per vector instead of once per value (see the sketch after this list).
    Caveat: the code now uses num_elements_in_vector * size_of_element = 4K * 4 = 16 KB more memory than previously.
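
A minimal sketch of the bulk-copy idea in point 2, assuming the Spark Platform unsafe helpers; the class, field, and parameter names here are illustrative, not the PR's actual code:

    import org.apache.spark.unsafe.Platform;

    // Sketch: copy the vector's whole decimal buffer from native memory into the JVM
    // once per batch, then slice out individual 16-byte values with System.arraycopy.
    // The JVM bounds check is paid once per vector instead of once per value.
    final class DecimalBufferSketch {
      private static final int DECIMAL_BYTE_WIDTH = 16;
      private byte[] allBytes;                                   // one bulk copy of the whole buffer
      private final byte[] oneValue = new byte[DECIMAL_BYTE_WIDTH];

      byte[] getBinaryDecimal(long nativeAddress, int numValues, int i) {
        if (allBytes == null) {
          allBytes = new byte[numValues * DECIMAL_BYTE_WIDTH];
          // Single native-to-JVM copy for the whole batch.
          Platform.copyMemory(
              null, nativeAddress, allBytes, Platform.BYTE_ARRAY_OFFSET,
              (long) numValues * DECIMAL_BYTE_WIDTH);
        }
        // Cheap on-heap copy of the i-th 16-byte value.
        System.arraycopy(allBytes, i * DECIMAL_BYTE_WIDTH, oneValue, 0, DECIMAL_BYTE_WIDTH);
        return oneValue;
      }
    }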

How are these changes tested?

Existing tests

@parthchandra parthchandra marked this pull request as draft August 1, 2024 19:57
@parthchandra (Contributor Author) commented Aug 1, 2024

(Updated) Micro-benchmark numbers:

Running benchmark: TPCDS Micro Benchmarks
  Running case: scan_decimal
  Stopped after 100 iterations, 584248 ms
  Running case: scan_decimal: Comet (Scan)
24/08/02 15:36:58 INFO core/src/lib.rs: Comet native library initialized
  Stopped after 100 iterations, 504950 ms
  Running case: scan_decimal: Comet (Scan) Decimal 128 enabled 
  Stopped after 100 iterations, 711803 ms
  Running case: scan_decimal: Comet (Scan, Exec)
  Stopped after 100 iterations, 503991 ms

OpenJDK 64-Bit Server VM 11.0.19+7-LTS on Mac OS X 14.6
Apple M3 Max
TPCDS Micro Benchmarks:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
scan_decimal                                              5713           5842         188          0.5        1983.7       1.0X
scan_decimal: Comet (Scan)                                4967           5050          75          0.6        1724.9       1.2X
scan_decimal: Comet (Scan) Decimal 128 enabled            7026           7118          66          0.4        2439.8       0.8X
scan_decimal: Comet (Scan, Exec)                          4965           5040         200          0.6        1724.1       1.2X

Previously, scan_decimal: Comet (Scan) Decimal 128 enabled was 0.5X relative to scan_decimal.

@kazuyukitanimura (Contributor)

Oops, I did not see this earlier... related #758

Comment on lines 91 to 94
    } else if (!useDecimal128 && precision <= Decimal.MAX_LONG_DIGITS()) {
      return createDecimal(getLong(i), precision, scale);
    } else if (useDecimal128 && precision <= Decimal.MAX_LONG_DIGITS()) {
      return createDecimal(getLongFromDecimalBytes(getBinaryDecimal(i)), precision, scale);
@andygrove (Member) commented Aug 2, 2024

nit: for readability, I think it is better not to repeat the precision <= Decimal.MAX_LONG_DIGITS() expression

Suggested change:

    - } else if (!useDecimal128 && precision <= Decimal.MAX_LONG_DIGITS()) {
    -   return createDecimal(getLong(i), precision, scale);
    - } else if (useDecimal128 && precision <= Decimal.MAX_LONG_DIGITS()) {
    -   return createDecimal(getLongFromDecimalBytes(getBinaryDecimal(i)), precision, scale);
    + } else if (precision <= Decimal.MAX_LONG_DIGITS()) {
    +   if (useDecimal128) {
    +     return createDecimal(getLongFromDecimalBytes(getBinaryDecimal(i)), precision, scale);
    +   } else {
    +     return createDecimal(getLong(i), precision, scale);
    +   }

Comment on lines 153 to 154
    System.arraycopy(
        DECIMAL_BYTES_ALL, i * DECIMAL_BYTE_WIDTH, DECIMAL_BYTES, 0, DECIMAL_BYTE_WIDTH);
@andygrove (Member) commented Aug 2, 2024

Should this be copying to dest rather than DECIMAL_BYTES?

Suggested change:

    - System.arraycopy(
    -     DECIMAL_BYTES_ALL, i * DECIMAL_BYTE_WIDTH, DECIMAL_BYTES, 0, DECIMAL_BYTE_WIDTH);
    + System.arraycopy(
    +     DECIMAL_BYTES_ALL, i * DECIMAL_BYTE_WIDTH, dest, 0, DECIMAL_BYTE_WIDTH);

@parthchandra (Contributor Author)

@andygrove @kazuyukitanimura, this is still in progress. There are CI failures and I may have to remove the part where this PR converts a 16-byte array to an int/long. The conversion is not as simple as it looks (a glance at the BigInteger code shows how complex it can be), and the code in this PR fails to convert negative values correctly. Including all the necessary checks would take us back to the same performance as BigInteger.
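
For context, a hedged sketch of what a correct 16-byte-to-long conversion has to check, assuming big-endian two's complement bytes as consumed by new BigInteger(bytes); the byte order and method name are assumptions, not this PR's code:

    // Illustrative only: the value fits in a long only when the high 8 bytes are pure
    // sign extension of the low 64 bits; handling this (and negative values generally)
    // is where the extra checks creep back in.
    static long decimalBytesToLong(byte[] bytes) {
      // bytes[8..15] hold the low 64 bits (big-endian); bytes[0..7] must all equal the
      // sign-extension byte: 0x00 for non-negative values, 0xFF for negative ones.
      byte signExtension = (byte) (bytes[8] < 0 ? 0xFF : 0x00);
      for (int j = 0; j < 8; j++) {
        if (bytes[j] != signExtension) {
          throw new ArithmeticException("value does not fit in a long");
        }
      }
      long result = 0;
      for (int j = 8; j < 16; j++) {
        result = (result << 8) | (bytes[j] & 0xFFL);
      }
      return result;
    }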

@codecov-commenter commented Aug 6, 2024

Codecov Report

Attention: Patch coverage is 61.11111% with 7 lines in your changes missing coverage. Please review.

Project coverage is 33.78%. Comparing base (ffb96c3) to head (82d99e0).
Report is 8 commits behind head on main.

Files Patch % Lines
...main/java/org/apache/comet/vector/CometVector.java 64.70% 3 Missing and 3 partials ⚠️
...in/java/org/apache/comet/parquet/ColumnReader.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #756      +/-   ##
============================================
+ Coverage     33.68%   33.78%   +0.10%     
- Complexity      822      871      +49     
============================================
  Files           111      112       +1     
  Lines         42671    42745      +74     
  Branches       9403     9436      +33     
============================================
+ Hits          14372    14442      +70     
+ Misses        25325    25324       -1     
- Partials       2974     2979       +5     


@parthchandra parthchandra marked this pull request as ready for review August 6, 2024 21:16
@parthchandra (Contributor Author)

@kazuyukitanimura @andygrove After the CI fixes, this PR only brings us up to 0.8x of Spark. I'm looking at another fix to close this gap, but in the meantime, this is good to review.

Comment on lines 145 to 146
// If the index is zero and DECIMAL_BYTES_ALL already has data, we have a new
// batch of data in the vector's backing buffer. So read it again.
Member:

Could you expand on why i==0 means that we have new data? I'm not entirely clear on why this is the case.

Contributor Author:

This check is now unnecessary so I've removed it. For reference, ColumnReader.loadVector specifies the rules under which the vector may be reused for a new batch of data. When a new batch is read into the same vector, the underlying memory buffer is reused and there is no change in the buffer memory address of the value vector. The check for index==0 banked on the iterator reading the values starting from the beginning every time a new batch is read. There was a gotcha, though: if the first N values were null, the iterator would skip this method and call it for the first time with an index of N, which could be non-zero.
The only safe way is to reset the buffer when a new batch is read (which is what this PR does now).
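
As a hedged illustration of the explicit reset; the names here are hypothetical, not the actual Comet API:

    // Illustrative only: the batch-loading path (e.g. ColumnReader.loadVector) calls
    // reset(), so the cached bulk copy is refreshed for the new batch regardless of
    // which index happens to be read first.
    final class BatchResetSketch {
      private boolean decimalBytesLoaded = false;

      void reset() {                      // invoked when a new batch is loaded into the vector
        decimalBytesLoaded = false;
      }

      byte[] getBinaryDecimal(int i) {
        if (!decimalBytesLoaded) {
          copyWholeDecimalBuffer();       // bulk native-to-JVM copy, as sketched earlier
          decimalBytesLoaded = true;
        }
        return sliceValue(i);
      }

      private void copyWholeDecimalBuffer() { /* omitted for brevity */ }

      private byte[] sliceValue(int i) { return new byte[16]; /* placeholder */ }
    }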

@kazuyukitanimura (Contributor) left a comment:

Is it possible to get new speed up numbers?

@@ -88,7 +90,11 @@ public Decimal getDecimal(int i, int precision, int scale) {
      if (!useDecimal128 && precision <= Decimal.MAX_INT_DIGITS() && type instanceof IntegerType) {
        return createDecimal(getInt(i), precision, scale);
      } else if (precision <= Decimal.MAX_LONG_DIGITS()) {
    -   return createDecimal(useDecimal128 ? getLongDecimal(i) : getLong(i), precision, scale);
    +   if (useDecimal128) {
    +     return createDecimal(getLongFromDecimalBytes(getBinaryDecimal(i)), precision, scale);
Contributor:

Hmm, not sure how this works with dictionaries...

Contributor Author:

CometDictionaryVector overrides getBinaryDecimal so we get the correct value in the buffer.

Contributor:

I think getBinaryDecimal calls decodeToBinary, which copies the buffer... I am curious whether it may slow things down for dictionaries...

Contributor Author:

This PR leaves CometDictionary.decodeToBinary untouched (mainly because dictionary encoding for decimals is a very rare case). However, CometDictionary.decodeToBinary does the same copy that CometVector was doing, by calling CometVector.copyBinaryValues, which this PR improves, so one would expect better performance.
I haven't validated performance with dictionary values, though I have checked that the unit tests pass (ParquetReadSuite covers dictionary-encoded decimals).

Contributor:

Thanks, I will look into this more to get a better understanding.
The current getLongDecimal does not call decodeToBinary, so I feel this change adds more copies...

Contributor Author:

Oh, right. I did not look into your getLongDecimal change. That might be better. Let me also verify.

@parthchandra (Contributor Author)

Is it possible to get new speed up numbers?

The original PR comment is updated with the latest benchmark numbers.

@parthchandra (Contributor Author)

@kazuyukitanimura looks like the getLongDecimal change is indeed better (and also uses zero additional memory). Since that is already merged, let me close this PR.
Thanks for reviewing!

@kazuyukitanimura (Contributor)

@parthchandra I think the idea of getting all buffers at once may still help for:

  1. Decimal128 cases
  2. Platform.getLong is still expensive, so calling ByteBuffer.wrap(bytes).getLong(8) might be cheaper
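
A hedged sketch of the two read paths being compared here; the byte offset and the little-endian order are assumptions for this sketch, and the follow-up comment below reports that the ByteBuffer variant measured slower in practice:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import org.apache.spark.unsafe.Platform;

    // Sketch only: two ways to read a 64-bit value out of a 16-byte decimal byte[].
    // Which 8 bytes hold the significant part, and the byte order, depend on how the
    // decimal is laid out, so treat the offset and order as assumptions.
    final class DecimalLongReadSketch {
      static long viaPlatform(byte[] bytes) {
        // Unsafe read at byte offset 8 within the array, in native byte order.
        return Platform.getLong(bytes, Platform.BYTE_ARRAY_OFFSET + 8);
      }

      static long viaByteBuffer(byte[] bytes) {
        // Wrapping allocates a ByteBuffer per call; the explicit order matches native
        // order on little-endian platforms (again, an assumption for this sketch).
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong(8);
      }
    }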

@parthchandra (Contributor Author) commented Aug 9, 2024

@parthchandra I think the idea of getting all buffers at once may help for

  1. Decimal128 cases

I thought about that, but the cost of creating a BigInteger overshadowed the cost of getting the buffer from the vector. For high-precision decimals, I feel we should focus on getting everything done in native code.

  1. Platform.getLong is still expensive so calling ByteBuffer.wrap(bytes).getLong(8) might be cheaper

I tried this. It's slower than your current implementation. ByteBuffer creation itself turned out to be expensive, and I believe ByteBuffer.getLong also ends up calling Unsafe, so I suspect the underlying implementation is shared.
I also tried converting the decimal bytes to a long in native code and calling into native directly from CometVector. The getLongDecimal implementation is the same speed as that, so Platform.getLong is about as fast as we will get for individual values.
We could pass the buffer address to a native method, vectorize the conversion, and return an array/vector of longs, which is likely to be even faster, but I skipped that since the getLongDecimal implementation is already as fast as the JVM version and does not use any additional memory.
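
Purely as an illustration of that last idea, a hypothetical shape of the vectorized native path; the class, method, and signature below do not exist in this PR or in Comet and are invented for illustration only:

    // Hypothetical sketch only: hand the native side the buffer address and let it
    // convert every 16-byte decimal to a long in one vectorized pass.
    final class NativeDecimalSketch {
      static native void decimalBufferToLongs(long bufferAddress, int numValues, long[] dest);

      static long[] readAll(long bufferAddress, int numValues) {
        long[] longs = new long[numValues];
        decimalBufferToLongs(bufferAddress, numValues, longs);
        return longs;
      }
    }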
