
Investigate options for improving performance when reading decimals from Parquet #679

Closed
Tracked by #717
andygrove opened this issue Jul 17, 2024 · 10 comments
Labels
enhancement, performance

Comments

@andygrove
Member

What is the problem the feature request solves?

This issue is for discussing the comment from @parthchandra in #671 (comment).

Just looking at this one case, with decimal fields and only scan enabled, we are much slower. This is consistent with something I saw when working on the parallel reader.
From a profiling run I saw that a potential bottleneck was BosonVector.getDecimal, which performs an expensive creation of a BigInteger followed by an expensive creation of a BigDecimal.
However, this path is hit only for precision > 18 or if spark.comet.use.decimal128 is set to true (it is false by default).
I'm not sure there is a way to eliminate this, though.

Describe the potential solution

No response

Additional context

No response

@andygrove added the enhancement and performance labels on Jul 17, 2024
@andygrove
Member Author

Screenshot from 2024-07-17 14-59-18

@andygrove
Member Author

andygrove commented Jul 17, 2024

Here is the full flamegraph for the case where Comet is performing the scan only. The hash aggregate is running in Spark, so it is perhaps not surprising that so much time is spent in calls to getDecimal.

Screenshot from 2024-07-17 15-39-08

@andygrove
Member Author

This is the flamegraph where Comet is performing scan + exec. Because the hash aggregate is operating on Arrow data, we no longer see lots of decimals being accessed from JVM code.

Screenshot from 2024-07-17 15-47-29

@parthchandra
Contributor

This seemingly points the finger at Decimal.createUnsafe. But I fail to see how that is taking so much time.

@parthchandra
Contributor

It might be that the assignment to Decimal.longVal (or Decimal.intVal) in Decimal.createUnsafe is copying data from native to JVM memory, and that copy has a performance cost.
There may be no way to overcome this other than using off-heap memory on the JVM side.

@viirya
Member

viirya commented Jul 18, 2024

It might be that the assignment to Decimal.longVal (or Decimal.intVal) in Decimal.createUnsafe is copying data from native to JVM memory and that has a performance cost. But there may be no way to overcome this other than to use off-heap memory on the JVM side.

Comet first reads the integer value by calling getInt. If any data copying between off-heap and heap memory were significant, it would show up in that call. But you can see from the flame graph that getInt takes much less time.

@viirya
Member

viirya commented Jul 18, 2024

createUnsafe takes a Long. Maybe boxing also takes some time?

def createUnsafe(unscaled: Long, precision: Int, scale: Int): Decimal

EDIT: Oh, it is a Scala Long, so it is already Java's primitive long type.

@kazuyukitanimura
Contributor

kazuyukitanimura commented Jul 20, 2024

It looks like the pure scan is not a problem, based on profiling.

Baseline

Screenshot 2024-07-19 at 10 49 16 PM

Scan-only enabled

Screenshot 2024-07-19 at 10 48 59 PM

Scan comparison (the benchmark name says add_many_decimals, but I was just too lazy to change it)

Screenshot 2024-07-19 at 11 28 02 PM

Now I remember that we need to increase the minimum number of iterations to get stable results. We used to use a remote Linux machine that was stable, so 2-3 iterations were fine, but for local testing we need more. For the third screenshot, I used 33 iterations.

kazuyukitanimura added a commit that referenced this issue Jul 24, 2024
## Which issue does this PR close?

Part of #679 and #670
Related #490

## Rationale for this change

For dictionary decimal vectors, we were unpacking even Int and Long decimals, which used more memory than necessary.

## What changes are included in this PR?

Unpack only for Decimal 128

## How are these changes tested?

Existing test
@parthchandra
Contributor

I see different results in profiling. I ran a simple query, select ss_net_profit from store_sales, for 100 iterations with useDecimal128 enabled, and see the following:
Screenshot 2024-07-24 at 5 48 23 PM

What stands out is that the bulk of the time is spent in comet::parquet::read::values::<impl comet::parquet::read::PlainDecoding for comet::parquet::data_type::Int32DecimalType>::decode.
Within this method, the main time consumers (as a percentage of CPU time) are:
- core::slice::<impl [T]>::fill - 16.76%
- comet::common::bit::memcpy - 7.07%
- core::slice::<impl [T]>::fill - 5.18% (second code path)

Overall Comet is 0.4x of Spark.

I made a change to comet::common::bit::memcpy to use copy_nonoverlapping (which is unsafe) and saw a 25% improvement. (After the change, Comet is 0.5x of Spark.)
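For illustration, here is a minimal sketch of the two copy strategies. These helper names and signatures are assumptions for the example; the actual comet::common::bit::memcpy may look different.

```rust
/// Safe copy: goes through slice length checks before the copy.
fn memcpy_safe(src: &[u8], dst: &mut [u8]) {
    dst[..src.len()].copy_from_slice(src);
}

/// Unsafe copy: calls std::ptr::copy_nonoverlapping directly, which
/// lowers to a plain memcpy with no run-time checks in release builds.
fn memcpy_unchecked(src: &[u8], dst: &mut [u8]) {
    debug_assert!(dst.len() >= src.len());
    // SAFETY: src and dst are distinct borrows, so the regions cannot
    // overlap, and the debug_assert documents the length requirement.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), src.len());
    }
}
```

Both helpers copy the same bytes; whether the unchecked version is measurably faster depends on what checks the original implementation performed at each call site.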

However, I don't know the best way to avoid the slice.fill calls without voiding the warranty. I'm looking at MaybeUninit, but the documentation quite rightly warns that there be dragons.
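For reference, one shape this could take is to reserve capacity and write decoded values into Vec::spare_capacity_mut, skipping the zero-fill entirely. This is a hedged sketch under assumed names (decode_one stands in for whatever per-value decoding the real reader does), not Comet's actual reader:

```rust
use std::mem::MaybeUninit;

/// Build a decoded buffer without first zero-filling it. Every element
/// must be written before set_len, otherwise this is undefined behavior,
/// which is exactly the dragon the MaybeUninit docs warn about.
fn decode_into_uninit(n: usize, mut decode_one: impl FnMut(usize) -> i64) -> Vec<i64> {
    let mut out: Vec<i64> = Vec::with_capacity(n);
    let spare: &mut [MaybeUninit<i64>] = out.spare_capacity_mut();
    for i in 0..n {
        spare[i].write(decode_one(i));
    }
    // SAFETY: all n elements were initialized in the loop above.
    unsafe { out.set_len(n) };
    out
}
```

The design cost is that correctness now depends on the decode loop covering every index, which is easy to break during refactoring; that trade-off is presumably why the fill was there in the first place.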

Also, with useDecimal128 disabled, we are slower than Spark because we treat the value as a Decimal irrespective of precision, whereas Spark reads and processes the value as an Int.
A minor change to Comet makes Comet 1.2x of Spark for this query with useDecimal128 disabled.
I'll post a PR after some testing.
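As a sketch of that kind of precision dispatch (illustrative names, not Comet's actual types): a decimal with precision at most 9 fits in a 32-bit unscaled integer and one with precision at most 18 fits in 64 bits, so only wider decimals need the 128-bit path.

```rust
/// Narrowest in-memory representation for a decimal column, chosen by
/// precision. 10^9 - 1 < i32::MAX and 10^18 - 1 < i64::MAX, so these
/// thresholds are exact.
#[derive(Debug, PartialEq)]
enum DecimalStorage {
    Int32,
    Int64,
    Int128,
}

fn storage_for_precision(precision: u8) -> DecimalStorage {
    match precision {
        0..=9 => DecimalStorage::Int32,
        10..=18 => DecimalStorage::Int64,
        _ => DecimalStorage::Int128,
    }
}
```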

kazuyukitanimura added a commit that referenced this issue Aug 2, 2024
## Which issue does this PR close?

Part of #679 and #670

## Rationale for this change

The improvement could be negligible in real use cases, but I see some improvements in microbenchmarks.

## What changes are included in this PR?

Optimizations in some bit functions

## How are these changes tested?

Existing tests
@kazuyukitanimura
Contributor

We have made a number of fixes. Closing for now.

himadripal pushed a commit to himadripal/datafusion-comet that referenced this issue Sep 7, 2024
(cherry picked from commit c1b7c7d)
himadripal pushed a commit to himadripal/datafusion-comet that referenced this issue Sep 7, 2024
(cherry picked from commit ffb96c3)