perf: improve decimal read performance in CometVector #756

Closed · wants to merge 9 commits

Conversation

parthchandra (Contributor)

Which issue does this PR close?

Part of #679 and #670

Rationale for this change

Improves the performance of reading decimals in CometVector.

What changes are included in this PR?

There are two changes:

  1. For values with precision up to Decimal.MAX_LONG_DIGITS we read the values into a long rather than into a BigInteger/BigDecimal.
  2. When reading the decimal buffer from native to the JVM, we now read the entire buffer at once instead of one value at a time. We were incurring a bounds check in JVM code for every read from native; that cost is now paid once per vector instead of once per value (see the sketch after this list).
    Caveat: the code now uses num_elements_in_vector * size_of_element = 4K * 4 = 16 KB more memory than previously.
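
A minimal sketch of the bulk-copy idea in point 2, assuming the Spark Platform unsafe helpers; the class, field, and parameter names here are illustrative, not the PR's actual code:

    import org.apache.spark.unsafe.Platform;

    // Sketch: copy the vector's whole decimal buffer from native memory into the JVM
    // once per batch, then slice out individual 16-byte values with System.arraycopy.
    // The JVM bounds check is paid once per vector instead of once per value.
    final class DecimalBufferSketch {
      private static final int DECIMAL_BYTE_WIDTH = 16;
      private byte[] allBytes;                                   // one bulk copy of the whole buffer
      private final byte[] oneValue = new byte[DECIMAL_BYTE_WIDTH];

      byte[] getBinaryDecimal(long nativeAddress, int numValues, int i) {
        if (allBytes == null) {
          allBytes = new byte[numValues * DECIMAL_BYTE_WIDTH];
          // Single native-to-JVM copy for the whole batch.
          Platform.copyMemory(
              null, nativeAddress, allBytes, Platform.BYTE_ARRAY_OFFSET,
              (long) numValues * DECIMAL_BYTE_WIDTH);
        }
        // Cheap on-heap copy of the i-th 16-byte value.
        System.arraycopy(allBytes, i * DECIMAL_BYTE_WIDTH, oneValue, 0, DECIMAL_BYTE_WIDTH);
        return oneValue;
      }
    }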

How are these changes tested?

Existing tests

@parthchandra parthchandra marked this pull request as draft August 1, 2024 19:57
@parthchandra (Contributor Author) commented Aug 1, 2024

(Updated) Micro-benchmark numbers:

Running benchmark: TPCDS Micro Benchmarks
  Running case: scan_decimal
  Stopped after 100 iterations, 584248 ms
  Running case: scan_decimal: Comet (Scan)
24/08/02 15:36:58 INFO core/src/lib.rs: Comet native library initialized
  Stopped after 100 iterations, 504950 ms
  Running case: scan_decimal: Comet (Scan) Decimal 128 enabled 
  Stopped after 100 iterations, 711803 ms
  Running case: scan_decimal: Comet (Scan, Exec)
  Stopped after 100 iterations, 503991 ms

OpenJDK 64-Bit Server VM 11.0.19+7-LTS on Mac OS X 14.6
Apple M3 Max
TPCDS Micro Benchmarks:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
scan_decimal                                              5713           5842         188          0.5        1983.7       1.0X
scan_decimal: Comet (Scan)                                4967           5050          75          0.6        1724.9       1.2X
scan_decimal: Comet (Scan) Decimal 128 enabled            7026           7118          66          0.4        2439.8       0.8X
scan_decimal: Comet (Scan, Exec)                          4965           5040         200          0.6        1724.1       1.2X

Previously, scan_decimal: Comet (Scan) Decimal 128 enabled was 0.5X relative to scan_decimal.

@kazuyukitanimura (Contributor)

Oops, I did not see this earlier... related #758

Comment on lines 91 to 94
    } else if (!useDecimal128 && precision <= Decimal.MAX_LONG_DIGITS()) {
      return createDecimal(getLong(i), precision, scale);
    } else if (useDecimal128 && precision <= Decimal.MAX_LONG_DIGITS()) {
      return createDecimal(getLongFromDecimalBytes(getBinaryDecimal(i)), precision, scale);
@andygrove (Member) commented Aug 2, 2024

nit: for readability, I think it is better not to repeat the precision <= Decimal.MAX_LONG_DIGITS() expression

Suggested change:

    - } else if (!useDecimal128 && precision <= Decimal.MAX_LONG_DIGITS()) {
    -   return createDecimal(getLong(i), precision, scale);
    - } else if (useDecimal128 && precision <= Decimal.MAX_LONG_DIGITS()) {
    -   return createDecimal(getLongFromDecimalBytes(getBinaryDecimal(i)), precision, scale);
    + } else if (precision <= Decimal.MAX_LONG_DIGITS()) {
    +   if (useDecimal128) {
    +     return createDecimal(getLongFromDecimalBytes(getBinaryDecimal(i)), precision, scale);
    +   } else {
    +     return createDecimal(getLong(i), precision, scale);
    +   }

Comment on lines 153 to 154
    System.arraycopy(
        DECIMAL_BYTES_ALL, i * DECIMAL_BYTE_WIDTH, DECIMAL_BYTES, 0, DECIMAL_BYTE_WIDTH);
@andygrove (Member) commented Aug 2, 2024

Should this be copying to dest rather than DECIMAL_BYTES?

Suggested change:

    - System.arraycopy(
    -     DECIMAL_BYTES_ALL, i * DECIMAL_BYTE_WIDTH, DECIMAL_BYTES, 0, DECIMAL_BYTE_WIDTH);
    + System.arraycopy(
    +     DECIMAL_BYTES_ALL, i * DECIMAL_BYTE_WIDTH, dest, 0, DECIMAL_BYTE_WIDTH);

@parthchandra (Contributor Author)

@andygrove @kazuyukitanimura, this is still in progress. There are CI failures and I may have to remove the part where this PR converts a 16-byte array to an int/long. The conversion is not as simple as it looks (a glance at the BigInteger code shows how complex it can be), and the code in this PR fails to convert negative values correctly. Including all the necessary checks would take us back to the same performance as BigInteger.
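
For context, a hedged sketch of what a correct 16-byte-to-long conversion has to check, assuming big-endian two's complement bytes as consumed by new BigInteger(bytes); the byte order and method name are assumptions, not this PR's code:

    // Illustrative only: the value fits in a long only when the high 8 bytes are pure
    // sign extension of the low 64 bits; handling this (and negative values generally)
    // is where the extra checks creep back in.
    static long decimalBytesToLong(byte[] bytes) {
      // bytes[8..15] hold the low 64 bits (big-endian); bytes[0..7] must all equal the
      // sign-extension byte: 0x00 for non-negative values, 0xFF for negative ones.
      byte signExtension = (byte) (bytes[8] < 0 ? 0xFF : 0x00);
      for (int j = 0; j < 8; j++) {
        if (bytes[j] != signExtension) {
          throw new ArithmeticException("value does not fit in a long");
        }
      }
      long result = 0;
      for (int j = 8; j < 16; j++) {
        result = (result << 8) | (bytes[j] & 0xFFL);
      }
      return result;
    }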

@codecov-commenter commented Aug 6, 2024

Codecov Report

Attention: Patch coverage is 61.11111% with 7 lines in your changes missing coverage. Please review.

Project coverage is 33.78%. Comparing base (ffb96c3) to head (82d99e0).
Report is 8 commits behind head on main.

Files Patch % Lines
...main/java/org/apache/comet/vector/CometVector.java 64.70% 3 Missing and 3 partials ⚠️
...in/java/org/apache/comet/parquet/ColumnReader.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #756      +/-   ##
============================================
+ Coverage     33.68%   33.78%   +0.10%     
- Complexity      822      871      +49     
============================================
  Files           111      112       +1     
  Lines         42671    42745      +74     
  Branches       9403     9436      +33     
============================================
+ Hits          14372    14442      +70     
+ Misses        25325    25324       -1     
- Partials       2974     2979       +5     


@parthchandra parthchandra marked this pull request as ready for review August 6, 2024 21:16
@parthchandra (Contributor Author)

@kazuyukitanimura @andygrove After the CI fixes, this PR only brings us up to 0.8x of Spark. I'm looking at another fix to close this gap, but in the meantime, this is good to review.

Comment on lines 145 to 146
// If the index is zero and DECIMAL_BYTES_ALL already has data, we have a new
// batch of data in the vector's backing buffer. So read it again.
Member:

Could you expand on why i==0 means that we have new data? I'm not entirely clear on why this is the case.

Contributor Author:

This check is now unnecessary so I've removed it. For reference, ColumnReader.loadVector specifies the rules under which the vector may be reused for a new batch of data. When a new batch is read into the same vector, the underlying memory buffer is reused and there is no change in the buffer memory address of the value vector. The check for index==0 banked on the iterator reading the values starting from the beginning every time a new batch is read. There was a gotcha, though: if the first N values were null, the iterator would skip this method and call it for the first time with an index of N, which could be non-zero.
The only safe way is to reset the buffer when a new batch is read (which is what this PR does now).
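
As a hedged illustration of the explicit reset; the names here are hypothetical, not the actual Comet API:

    // Illustrative only: the batch-loading path (e.g. ColumnReader.loadVector) calls
    // reset(), so the cached bulk copy is refreshed for the new batch regardless of
    // which index happens to be read first.
    final class BatchResetSketch {
      private boolean decimalBytesLoaded = false;

      void reset() {                      // invoked when a new batch is loaded into the vector
        decimalBytesLoaded = false;
      }

      byte[] getBinaryDecimal(int i) {
        if (!decimalBytesLoaded) {
          copyWholeDecimalBuffer();       // bulk native-to-JVM copy, as sketched earlier
          decimalBytesLoaded = true;
        }
        return sliceValue(i);
      }

      private void copyWholeDecimalBuffer() { /* omitted for brevity */ }

      private byte[] sliceValue(int i) { return new byte[16]; /* placeholder */ }
    }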

@kazuyukitanimura (Contributor) left a comment:

Is it possible to get new speed up numbers?

@@ -88,7 +90,11 @@ public Decimal getDecimal(int i, int precision, int scale) {
      if (!useDecimal128 && precision <= Decimal.MAX_INT_DIGITS() && type instanceof IntegerType) {
        return createDecimal(getInt(i), precision, scale);
      } else if (precision <= Decimal.MAX_LONG_DIGITS()) {
    -   return createDecimal(useDecimal128 ? getLongDecimal(i) : getLong(i), precision, scale);
    +   if (useDecimal128) {
    +     return createDecimal(getLongFromDecimalBytes(getBinaryDecimal(i)), precision, scale);
Contributor:

Hmm, not sure how this works with dictionaries...

Contributor Author:

CometDictionaryVector overrides getBinaryDecimal so we get the correct value in the buffer.

Contributor:

I think getBinaryDecimal calls decodeToBinary, which copies the buffer... I am curious whether it may slow things down for dictionaries...

Contributor Author:

This PR leaves CometDictionary.decodeToBinary untouched (mainly because dictionary encoding for decimals is a very rare case). However, CometDictionary.decodeToBinary does the same copy that CometVector was doing, by calling CometVector.copyBinaryValues, which this PR improves, so one would expect better performance.
I haven't validated performance with dictionary values, though I have checked that the unit tests pass (ParquetReadSuite covers dictionary-encoded decimals).

Contributor:

Thanks, I will look into this more to get a better understanding.
The current getLongDecimal does not call decodeToBinary, so I feel this change adds more copies...

Contributor Author:

Oh, right. I did not look into your getLongDecimal change. That might be better. Let me also verify.

@parthchandra (Contributor Author)

Is it possible to get new speed up numbers?

The original PR comment is updated with the latest benchmark numbers.

@parthchandra (Contributor Author)

@kazuyukitanimura looks like the getLongDecimal change is indeed better (and also uses zero additional memory). Since that is already merged, let me close this PR.
Thanks for reviewing!

@kazuyukitanimura (Contributor)

@parthchandra I think the idea of getting all buffers at once may still help for:

  1. Decimal128 cases
  2. Platform.getLong is still expensive, so calling ByteBuffer.wrap(bytes).getLong(8) might be cheaper
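
A hedged sketch of the two read paths being compared here; the byte offset and the little-endian order are assumptions for this sketch, and the follow-up comment below reports that the ByteBuffer variant measured slower in practice:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import org.apache.spark.unsafe.Platform;

    // Sketch only: two ways to read a 64-bit value out of a 16-byte decimal byte[].
    // Which 8 bytes hold the significant part, and the byte order, depend on how the
    // decimal is laid out, so treat the offset and order as assumptions.
    final class DecimalLongReadSketch {
      static long viaPlatform(byte[] bytes) {
        // Unsafe read at byte offset 8 within the array, in native byte order.
        return Platform.getLong(bytes, Platform.BYTE_ARRAY_OFFSET + 8);
      }

      static long viaByteBuffer(byte[] bytes) {
        // Wrapping allocates a ByteBuffer per call; the explicit order matches native
        // order on little-endian platforms (again, an assumption for this sketch).
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong(8);
      }
    }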

@parthchandra (Contributor Author) commented Aug 9, 2024

@parthchandra I think the idea of getting all buffers at once may help for

  1. Decimal128 cases

I thought about that, but the cost of creating a BigInteger overshadowed the cost of getting the buffer from the vector. For high-precision decimals, I feel we should focus on getting everything done in native code.

  1. Platform.getLong is still expensive so calling ByteBuffer.wrap(bytes).getLong(8) might be cheaper

I tried this. It's slower than your current implementation. ByteBuffer creation itself turned out to be expensive, and I believe ByteBuffer.getLong also ends up calling Unsafe, so I suspect the underlying implementation is shared.
I also tried converting the decimal bytes to a long in native code and calling into native directly from CometVector. The getLongDecimal implementation is the same speed as that, so Platform.getLong is about as fast as we will get for individual values.
We could pass the buffer address to a native method, vectorize the conversion, and return an array/vector of longs, which is likely to be even faster, but I skipped that since the getLongDecimal implementation is already as fast as the JVM version and does not use any additional memory.
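
Purely as an illustration of that last idea, a hypothetical shape of the vectorized native path; the class, method, and signature below do not exist in this PR or in Comet and are invented for illustration only:

    // Hypothetical sketch only: hand the native side the buffer address and let it
    // convert every 16-byte decimal to a long in one vectorized pass.
    final class NativeDecimalSketch {
      static native void decimalBufferToLongs(long bufferAddress, int numValues, long[] dest);

      static long[] readAll(long bufferAddress, int numValues) {
        long[] longs = new long[numValues];
        decimalBufferToLongs(bufferAddress, numValues, longs);
        return longs;
      }
    }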
