[SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark #22617
Conversation
Retest this please.
Test build #97348 has finished for PR 22617 at commit
Force-pushed from f016e20 to 1c30755.
Retest this please.
Test build #97438 has finished for PR 22617 at commit
Retest this please
Test build #97504 has finished for PR 22617 at commit
@dongjoon-hyun, @kiszk, could you please help me take a step forward with this PR?
```diff
-    val array = new ExternalAppendOnlyUnsafeRowArray(
-      ExternalAppendOnlyUnsafeRowArray.DefaultInitialSizeOfInMemoryBuffer,
-      numSpillThreshold)
+    val array = new ExternalAppendOnlyUnsafeRowArray(numSpillThreshold, numSpillThreshold)
```
Hi, @peter-toth. Could you explain why we need to replace `ExternalAppendOnlyUnsafeRowArray.DefaultInitialSizeOfInMemoryBuffer` with `numSpillThreshold` here? Actually, this is not an obvious refactoring. If this is related to "Fix issue in ExternalAppendOnlyUnsafeRowArray creation", please add some comments here or to the PR description to make it clear.
Thanks for the feedback @dongjoon-hyun. I added some details to the description.
Actually, following my logic in the description, I think `val array = new ExternalAppendOnlyUnsafeRowArray(numSpillThreshold, numSpillThreshold)` should be changed to `val array = new ExternalAppendOnlyUnsafeRowArray(0, numSpillThreshold)` in `testAgainstRawUnsafeExternalSorter` in the "WITH SPILL" cases, so as to compare `ExternalAppendOnlyUnsafeRowArray` to `UnsafeExternalSorter` when it behaves like one. But it would be great if someone could confirm this idea.
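To make the proposal concrete, here is a toy Python model of the two-phase behavior being discussed. This is illustration only, not Spark's implementation: the class, method, and field names are invented. The point it sketches is that the first constructor argument controls how long rows stay in an in-memory buffer, so passing `0` makes every row take the spill/sorter path from the start, which is the `UnsafeExternalSorter`-like behavior the benchmark would then be comparing against.

```python
class ToyRowArray:
    """Toy model (NOT Spark code) of an append-only row array that keeps
    rows in an in-memory buffer until a threshold, then moves everything
    to a spill-based "sorter" path. A threshold of 0 means the sorter
    path is used from the very first row."""

    def __init__(self, in_memory_threshold, spill_threshold):
        self.in_memory_threshold = in_memory_threshold
        # Second argument kept only to mirror the two-argument constructor
        # shape discussed above; unused in this sketch.
        self.spill_threshold = spill_threshold
        self.buffer = []   # in-memory phase
        self.sorter = []   # stand-in for the spill/sorter phase

    def add(self, row):
        if self.sorter or len(self.buffer) >= self.in_memory_threshold:
            # Switch-over: drain any buffered rows into the sorter once,
            # then append directly to the sorter from here on.
            self.sorter.extend(self.buffer)
            self.buffer.clear()
            self.sorter.append(row)
        else:
            self.buffer.append(row)
```

With `ToyRowArray(0, n)` the buffer is never used, while `ToyRowArray(k, n)` with `k > 0` buffers the first `k` rows before switching over.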
In that case, we need a two-step comparison:
- Refactoring only, to ensure no regression.
- Change that value to check the performance difference.

Could you roll back this line and let us finish step 1 first?
Ok. Reverted the change.
Force-pushed from 41c202e to 3a52abc.
Retest this please
retest this please
Please help review it @dongjoon-hyun @kiszk @wangyum
Test build #100609 has finished for PR 22617 at commit
Sure, @gatorsmile.
@dongjoon-hyun, sure, I will fix it soon.
Change-Id: I245c16a77a071c6386ce671a9c4a6d8f8fe3b78d
Force-pushed from 3a52abc to 0b04fa3.
@dongjoon-hyun, I rebased my commits and fixed the build issue.
Thank you for updating! I'll review today. |
```
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
```
It seems that 2~3 should be the same except for `SPARK_GENERATE_BENCHMARK_FILES=1`. Also, we need the `spark.memory.debugFill` configuration for 1 (spark-submit).
Actually, I'm removing `spark.memory.debugFill=true` from the configuration of 3 to make it similar to 1 (spark-submit). `spark.memory.debugFill` is `false` by default, and setting it to `true` adds enormous overhead. I can change it to `+= \"-Dspark.memory.debugFill=false\"` if that fits better here.
```scala
val spillThreshold = 100 * 1000
testAgainstRawArrayBuffer(spillThreshold, 100 * 1000, 1 << 10)
testAgainstRawArrayBuffer(spillThreshold, 1000, 1 << 18)
testAgainstRawArrayBuffer(spillThreshold, 30 * 1000, 1 << 14)
```
Let's keep the original sequence: `1000` -> `30 * 1000` -> `100 * 1000`. Increasing order is more intuitive.

Ah, I got it. This is reordered by the calculation. Please forget about the above comment.
```python
>>> 1000 * (1 << 18)
262144000
>>> 30 * 1000 * (1 << 14)
491520000
>>> 100 * 1000 * (1 << 10)
102400000
```
```
Spilling with 1000 rows:            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                    15829 / 15845         16.6          60.4       1.0X
ExternalAppendOnlyUnsafeRowArray        10158 / 10174         25.8          38.7       1.6X
```
It looks meaningfully different from the previous result. Let's see the server result together. I'm running this.
@peter-toth. What do you mean by
It seems that we need to update the command in the PR description and in the comment;

I think I should change it; it is a bit confusing now. I used

If I changed to
```diff
  * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt ";project sql;set javaOptions
- *    in Test -= \"-Dspark.memory.debugFill=true\";test:runMain <this class>"
+ *    in Test += \"-Dspark.memory.debugFill=false\";test:runMain <this class>"
```
Got it. I was confused with `-=`.
Sorry, I did it in a bit of a confusing way, but it is updated now to `+= ...=false` in a new commit.
Ur, in the PR description, `runMain` is repeated twice: `test:runMain test:runMain`.
Fixed.
```
Array with 1000 rows:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                               8839 / 8951         29.7          33.7       1.0X
ExternalAppendOnlyUnsafeRowArray          9884 / 9888         26.5          37.7       0.9X
```
Could you run this once more on your side? For me, I've got the following. The ratio difference is too big.
Mac:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.2
[info] Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[info] Array with 1000 rows:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] ArrayBuffer                            10226 / 10272         25.6          39.0       1.0X
[info] ExternalAppendOnlyUnsafeRowArray       24301 / 24425         10.8          92.7       0.4X
```
EC2 Server:
```
[info] OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
[info] Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[info] Array with 1000 rows:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] ArrayBuffer                            11988 / 12027         21.9          45.7       1.0X
[info] ExternalAppendOnlyUnsafeRowArray       37480 / 37574          7.0         143.0       0.3X
```
This is the only difference.
I'm rerunning it soon
Got the same ratio as you this time:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6
[info] Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
[info] Array with 1000 rows:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] ArrayBuffer                            10028 / 10197         26.1          38.3       1.0X
[info] ExternalAppendOnlyUnsafeRowArray       30053 / 30312          8.7         114.6       0.3X
```
Thank you for the confirmation. The master branch seems to have changed.
I made a PR to you, @peter-toth.
And, for the record, the following is the Mac result. Most results are consistent with this PR except
EC2 result
```
 * 1. without sbt:
 *    bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 * 2. build/sbt build/sbt ";project sql;set javaOptions
 * *    in Test += \"-Dspark.memory.debugFill=false\";test:runMain <this class>"
```
Could you fix the `* *`?
Done.
Change-Id: I8bcecad2863c97091d8bfb4c65386a59051938c1
```
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                          11 / 11         14.8          67.4       1.0X
ExternalAppendOnlyUnsafeRowArray               9 / 9          17.6          56.8       1.2X
```
I ran the original master branch and got the following. Since the trend is the same, this refactoring PR looks safe.
```
$ bin/spark-submit --class org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark --jars core/target/scala-2.12/spark-core_2.12-3.0.0-SNAPSHOT-tests.jar sql/core/target/scala-2.12/spark-sql_2.12-3.0.0-SNAPSHOT-tests.jar
...
Array with 1000 rows:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                               9556 / 9633         27.4          36.5       1.0X
ExternalAppendOnlyUnsafeRowArray         18514 / 18700        14.2          70.6       0.5X

Array with 30000 rows:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                              22180 / 22195        22.2          45.1       1.0X
ExternalAppendOnlyUnsafeRowArray         24254 / 24331        20.3          49.3       0.9X

Array with 100000 rows:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                               4998 / 5052         20.5          48.8       1.0X
ExternalAppendOnlyUnsafeRowArray          4778 / 4821         21.4          46.7       1.0X

Spilling with 1000 rows:            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                     17536 / 17596        14.9          66.9       1.0X
ExternalAppendOnlyUnsafeRowArray         10380 / 10451        25.3          39.6       1.7X

Spilling with 10000 rows:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                          6 / 7           25.3          39.5       1.0X
ExternalAppendOnlyUnsafeRowArray              6 / 7           26.3          38.0       1.0X
```
+1, LGTM. (Pending Jenkins).
Thank you so much, @peter-toth.
Retest this please.
Thanks for the review @dongjoon-hyun.
Test build #100966 has finished for PR 22617 at commit

Merged to master.
…chmark

## What changes were proposed in this pull request?

Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method.

## How was this patch tested?

Manually tested and regenerated results. Please note that the `spark.memory.debugFill` setting has a huge impact on this benchmark. Since it is set to true by default when running the benchmark from SBT, we need to disable it:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt ";project sql;set javaOptions in Test += \"-Dspark.memory.debugFill=false\";test:runMain org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
```

Closes apache#22617 from peter-toth/SPARK-25484.

Lead-authored-by: Peter Toth <peter.toth@gmail.com>
Co-authored-by: Peter Toth <ptoth@hortonworks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method.
How was this patch tested?
Manually tested and regenerated results.
Please note that the `spark.memory.debugFill` setting has a huge impact on this benchmark. Since it is set to `true` by default when running the benchmark from SBT, we need to disable it:
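The command itself, as recorded in the merged commit message earlier in this thread, is:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt ";project sql;set javaOptions in Test += \"-Dspark.memory.debugFill=false\";test:runMain org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
```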