@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Dec 30, 2022

What changes were proposed in this pull request?

This PR regenerates the benchmark results on the master branch as part of preparing the Apache Spark 3.4.0 release, in order to identify potential regressions during the QA period.

In addition,

  • MetadataStructBenchmark was added by SPARK-37980, but its results are added here.

Why are the changes needed?

These are reference values.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual review.

Join w 2 ints wholestage off 320974 321015 57 0.1 15305.3 1.0X
Join w 2 ints wholestage on 237636 238622 567 0.1 11331.4 1.4X
Join w 2 ints wholestage off 206309 209960 2865 0.1 9837.6 1.0X
Join w 2 ints wholestage on 260860 265467 1202 0.1 12438.8 0.8X
Member Author

@dongjoon-hyun dongjoon-hyun Dec 30, 2022

This looks like a regression because Java 8/11/17 all show the same slowdown here.
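For readers checking the numbers, here is a minimal sketch of how the last column of the rows quoted above can be recomputed. It assumes the six numeric columns follow Spark's usual benchmark layout (best time in ms, average time in ms, standard deviation in ms, rate in M/s, per-row time in ns, relative speed) and that the first pair of rows is the previous result and the second pair is the regenerated one; neither the column headers nor the diff markers survive in this thread. The names `BenchmarkRow`, `parse_row`, and `relative_to` are hypothetical helpers, not part of Spark.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRow:
    name: str
    best_ms: float
    avg_ms: float
    stdev_ms: float
    rate_m_per_s: float
    per_row_ns: float
    relative: str  # kept as text, e.g. "1.4X"


def parse_row(line: str) -> BenchmarkRow:
    """Split one whitespace-separated result row (assumed column order)."""
    # The six trailing fields are the numeric columns; the rest is the case name.
    *name_parts, best, avg, stdev, rate, per_row, relative = line.split()
    return BenchmarkRow(
        name=" ".join(name_parts),
        best_ms=float(best),
        avg_ms=float(avg),
        stdev_ms=float(stdev),  # float() also accepts "NaN"
        rate_m_per_s=float(rate),
        per_row_ns=float(per_row),
        relative=relative,
    )


def relative_to(baseline: BenchmarkRow, case: BenchmarkRow) -> float:
    """Speed of `case` relative to `baseline`; values below 1.0 mean slower."""
    return baseline.best_ms / case.best_ms


# Rows copied from the comment above; the old/new grouping is an assumption.
first_pair = [
    parse_row("Join w 2 ints wholestage off 320974 321015 57 0.1 15305.3 1.0X"),
    parse_row("Join w 2 ints wholestage on 237636 238622 567 0.1 11331.4 1.4X"),
]
second_pair = [
    parse_row("Join w 2 ints wholestage off 206309 209960 2865 0.1 9837.6 1.0X"),
    parse_row("Join w 2 ints wholestage on 260860 265467 1202 0.1 12438.8 0.8X"),
]

for off, on in (first_pair, second_pair):
    print(f"{on.name}: {relative_to(off, on):.1f}X (reported {on.relative})")
```

Under these assumptions the recomputed values agree with the reported `1.4X` and `0.8X`, which is the signal flagged here: whole-stage codegen goes from faster than the baseline to slower than it.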

Join w 2 ints wholestage off 358176 358191 22 0.1 17079.1 1.0X
Join w 2 ints wholestage on 207044 207239 239 0.1 9872.6 1.7X
Join w 2 ints wholestage off 169271 170528 1778 0.1 8071.5 1.0X
Join w 2 ints wholestage on 162252 164248 NaN 0.1 7736.8 1.0X
Member Author

ditto.

Join w 2 ints wholestage off 426979 427823 1194 0.0 20359.9 1.0X
Join w 2 ints wholestage on 208958 209733 509 0.1 9963.9 2.0X
Join w 2 ints wholestage off 183695 184609 1292 0.1 8759.3 1.0X
Join w 2 ints wholestage on 180862 181268 311 0.1 8624.2 1.0X
Member Author

ditto.

@@ -0,0 +1,40 @@
================================================================================================
Member Author

This seems to have been missed in SPARK-37980.

Native ORC Vectorized 30460 30772 441 0.0 29049.4 0.9X
Hive built-in ORC 32291 36141 NaN 0.0 30795.2 1.0X
Native ORC MR 94939 95045 149 0.0 90541.2 0.3X
Native ORC Vectorized 93062 93335 386 0.0 88750.4 0.3X
Member Author

@dongjoon-hyun dongjoon-hyun Dec 30, 2022

I'll take a look at this one, Single Struct Column Scan with 600 Fields. It seems to be related to some recent Spark patches.
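To put a number on the slowdown in the rows above, here is a small sketch comparing the two `Native ORC Vectorized` entries. It assumes the first numeric column is the best time in milliseconds and that the 30460 ms row is the earlier result while the 93062 ms row is the regenerated one; the diff markers are not preserved in this thread. The helper name `slowdown` is hypothetical.

```python
def slowdown(before_ms: float, after_ms: float) -> float:
    """Return how many times slower `after_ms` is compared with `before_ms`."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return after_ms / before_ms


# Best times (ms) of the two "Native ORC Vectorized" rows quoted above;
# which one is old and which is new is an assumption.
orc_vectorized_before_ms = 30460
orc_vectorized_after_ms = 93062

factor = slowdown(orc_vectorized_before_ms, orc_vectorized_after_ms)
print(f"Native ORC Vectorized is about {factor:.1f}x slower")
```

Under those assumptions the vectorized reader takes roughly three times as long on this case, and its best time ends up close to the `Native ORC MR` row (94939 ms).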

@dongjoon-hyun
Member Author

Could you review this PR, @HyukjinKwon ?

@HyukjinKwon
Member

Should be good to go since we don't test this in CI

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon. Yes, right. The YARN UT failure is a flaky one. I'll merge this to proceed.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-41782 branch December 30, 2022 06:12
@LuciferYang
Contributor

late LGTM

@dongjoon-hyun
Member Author

Hi, FYI, @HyukjinKwon and @LuciferYang .
I found the root cause of the regression. Here is my comment about that Spark commit.

@dongjoon-hyun
Member Author

dongjoon-hyun added a commit that referenced this pull request Jan 3, 2023
…e feature

### What changes were proposed in this pull request?

This PR is a partial and logical revert of SPARK-39862, #37280, to fix the huge ORC reader perf regression (3x slower).

SPARK-39862 should propose a fix without perf regression.

### Why are the changes needed?

During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression.
- #39301 (comment)

### Does this PR introduce _any_ user-facing change?

After this PR, the regression is removed. However, the bug in the DEFAULT value feature will remain. This should be handled separately.

### How was this patch tested?

Pass the CI.

Closes #39362 from dongjoon-hyun/SPARK-41858.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>