
Investigate options for improving performance when reading decimals from Parquet #679

Closed
Tracked by #717
andygrove opened this issue Jul 17, 2024 · 10 comments
Labels
enhancement, performance

Comments

@andygrove
Member

What is the problem the feature request solves?

This issue is for discussing the comment from @parthchandra in #671 (comment).

Just looking at this one case, with decimal fields and only scan enabled, we are much slower. This is consistent with something I saw when working on the parallel reader.
From a profiling run I saw that a potential bottleneck was BosonVector.getDecimal, which performs an expensive creation of a BigInteger followed by an expensive creation of a BigDecimal.
However, this path is hit only for precision > 18 or if spark.comet.use.decimal128 is set to true (it is false by default).
I'm not sure there is a way to eliminate this, though.

Describe the potential solution

No response

Additional context

No response

@andygrove added the enhancement and performance labels on Jul 17, 2024
@andygrove
Member Author

Screenshot from 2024-07-17 14-59-18

@andygrove
Member Author

andygrove commented Jul 17, 2024

Here is the full flamegraph for the case where Comet is performing the scan only. The hash aggregate is running in Spark, so it is perhaps not surprising that so much time is spent in calls to getDecimal.

Screenshot from 2024-07-17 15-39-08

@andygrove
Member Author

This is the flamegraph where Comet is performing scan + exec. Because the hash aggregate is operating on Arrow data, we no longer see lots of decimals being accessed from JVM code.

Screenshot from 2024-07-17 15-47-29

@parthchandra
Contributor

This seemingly points the finger at Decimal.createUnsafe. But I fail to see how that is taking so much time.

@parthchandra
Contributor

It might be that the assignment to Decimal.longVal (or Decimal.intVal) in Decimal.createUnsafe is copying data from native to JVM memory, and that copy has a performance cost.
There may be no way to overcome this other than using off-heap memory on the JVM side.

@viirya
Member

viirya commented Jul 18, 2024

It might be that the assignment to Decimal.longVal (or Decimal.intVal) in Decimal.createUnsafe is copying data from native to JVM memory and that has a performance cost. But there may be no way to overcome this other than to use off-heap memory on the JVM side.

Comet first reads the integer value by calling getInt. If any data copying between off-heap and heap memory were significant, it would show up in that call. But you can see from the flame graph that getInt takes much less time.

@viirya
Member

viirya commented Jul 18, 2024

createUnsafe takes a Long. Maybe boxing also takes some time?

def createUnsafe(unscaled: Long, precision: Int, scale: Int): Decimal

EDIT: Oh, it is a Scala Long, so it is already Java's primitive long type.

@kazuyukitanimura
Contributor

kazuyukitanimura commented Jul 20, 2024

It looks like the pure scan is not a problem, based on profiling.

Baseline

Screenshot 2024-07-19 at 10 49 16 PM

Scan-only enabled

Screenshot 2024-07-19 at 10 48 59 PM

Scan comparison (the benchmark name says add_many_decimals, but I was just too lazy to change it)

Screenshot 2024-07-19 at 11 28 02 PM

Now I remember that we need to increase the minimum number of iterations to get stable results. We used to use a remote Linux machine that was stable, so 2-3 iterations were fine, but for local testing we need more. For the third screenshot, I used 33 iterations.

kazuyukitanimura added a commit that referenced this issue Jul 24, 2024
## Which issue does this PR close?

Part of #679 and #670
Related #490

## Rationale for this change

For dictionary decimal vectors, we were unpacking even Int and Long decimals, which used more memory than necessary.

## What changes are included in this PR?

Unpack only for Decimal 128

## How are these changes tested?

Existing test
@parthchandra
Contributor

I see different results in profiling. I ran a simple query, select ss_net_profit from store_sales, for 100 iterations with useDecimal128 enabled, and see the following:
Screenshot 2024-07-24 at 5 48 23 PM

What stands out is that the bulk of the time is spent in comet::parquet::read::values::<impl comet::parquet::read::PlainDecoding for comet::parquet::data_type::Int32DecimalType>::decode.
Within this method, the main time consumers (as a percentage of CPU time) are:
- core::slice::<impl [T]>::fill - 16.76%
- comet::common::bit::memcpy - 7.07%
- core::slice::<impl [T]>::fill - 5.18% (second code path)

Overall Comet is 0.4x of Spark.

I made a change to comet::common::bit::memcpy to use copy_nonoverlapping (which is unsafe) and saw a 25% improvement. (After the change, Comet is 0.5x of Spark.)
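For illustration, here is a minimal sketch of the two copy strategies. These helper names and signatures are assumptions for the example; the actual comet::common::bit::memcpy may look different.

```rust
/// Safe copy: goes through slice length checks before the copy.
fn memcpy_safe(src: &[u8], dst: &mut [u8]) {
    dst[..src.len()].copy_from_slice(src);
}

/// Unsafe copy: calls std::ptr::copy_nonoverlapping directly, which
/// lowers to a plain memcpy with no run-time checks in release builds.
fn memcpy_unchecked(src: &[u8], dst: &mut [u8]) {
    debug_assert!(dst.len() >= src.len());
    // SAFETY: src and dst are distinct borrows, so the regions cannot
    // overlap, and the debug_assert documents the length requirement.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), src.len());
    }
}
```

Both helpers copy the same bytes; whether the unchecked version is measurably faster depends on what checks the original implementation performed at each call site.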

However, I don't know the best way to avoid the slice.fill calls without voiding the warranty. I'm looking at MaybeUninit, but the documentation quite rightly warns that there be dragons.
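For reference, one shape this could take is to reserve capacity and write decoded values into Vec::spare_capacity_mut, skipping the zero-fill entirely. This is a hedged sketch under assumed names (decode_one stands in for whatever per-value decoding the real reader does), not Comet's actual reader:

```rust
use std::mem::MaybeUninit;

/// Build a decoded buffer without first zero-filling it. Every element
/// must be written before set_len, otherwise this is undefined behavior,
/// which is exactly the dragon the MaybeUninit docs warn about.
fn decode_into_uninit(n: usize, mut decode_one: impl FnMut(usize) -> i64) -> Vec<i64> {
    let mut out: Vec<i64> = Vec::with_capacity(n);
    let spare: &mut [MaybeUninit<i64>] = out.spare_capacity_mut();
    for i in 0..n {
        spare[i].write(decode_one(i));
    }
    // SAFETY: all n elements were initialized in the loop above.
    unsafe { out.set_len(n) };
    out
}
```

The design cost is that correctness now depends on the decode loop covering every index, which is easy to break during refactoring; that trade-off is presumably why the fill was there in the first place.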

Also, with useDecimal128 disabled, we are slower than Spark because we treat the value as a Decimal irrespective of precision, whereas Spark reads and processes the value as an Int.
A minor change to Comet makes Comet 1.2x of Spark for this query with useDecimal128 disabled.
I'll post a PR after some testing.
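As a sketch of that kind of precision dispatch (illustrative names, not Comet's actual types): a decimal with precision at most 9 fits in a 32-bit unscaled integer and one with precision at most 18 fits in 64 bits, so only wider decimals need the 128-bit path.

```rust
/// Narrowest in-memory representation for a decimal column, chosen by
/// precision. 10^9 - 1 < i32::MAX and 10^18 - 1 < i64::MAX, so these
/// thresholds are exact.
#[derive(Debug, PartialEq)]
enum DecimalStorage {
    Int32,
    Int64,
    Int128,
}

fn storage_for_precision(precision: u8) -> DecimalStorage {
    match precision {
        0..=9 => DecimalStorage::Int32,
        10..=18 => DecimalStorage::Int64,
        _ => DecimalStorage::Int128,
    }
}
```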

kazuyukitanimura added a commit that referenced this issue Aug 2, 2024
## Which issue does this PR close?

Part of #679 and #670

## Rationale for this change

The improvement could be negligible in real use cases, but I see some improvements in microbenchmarks.

## What changes are included in this PR?

Optimizations in some bit functions

## How are these changes tested?

Existing tests
@kazuyukitanimura
Contributor

We have made a number of fixes. Closing for now.

himadripal pushed a commit to himadripal/datafusion-comet that referenced this issue Sep 7, 2024
(cherry picked from commit c1b7c7d)
himadripal pushed a commit to himadripal/datafusion-comet that referenced this issue Sep 7, 2024
(cherry picked from commit ffb96c3)