[SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark #22617


Closed
peter-toth wants to merge 9 commits into apache:master from peter-toth:SPARK-25484

Conversation

Contributor

@peter-toth peter-toth commented Oct 2, 2018

What changes were proposed in this pull request?

Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method.

How was this patch tested?

Manually tested and regenerated results.
Please note that spark.memory.debugFill setting has a huge impact on this benchmark. Since it is set to true by default when running the benchmark from SBT, we need to disable it:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt ";project sql;set javaOptions in Test += \"-Dspark.memory.debugFill=false\";test:runMain org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
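
For reference, the main-method benchmark pattern this PR moves to looks roughly like the minimal sketch below (assuming org.apache.spark.benchmark.{Benchmark, BenchmarkBase}; the object name and case bodies are placeholders, not this PR's exact code):

```scala
import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}

// Hypothetical example of the main-method pattern: BenchmarkBase supplies main(),
// handles SPARK_GENERATE_BENCHMARK_FILES result files, and calls runBenchmarkSuite.
object ExampleRowArrayBenchmark extends BenchmarkBase {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("WITHOUT SPILL") {
      val numRows = 100 * 1000
      val benchmark = new Benchmark(s"Array with $numRows rows", numRows, output = output)
      benchmark.addCase("ArrayBuffer") { _ =>
        // append numRows rows to a plain ArrayBuffer and iterate over them
      }
      benchmark.addCase("ExternalAppendOnlyUnsafeRowArray") { _ =>
        // append numRows rows to the array under test and iterate over them
      }
      benchmark.run()
    }
  }
}
```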

@peter-toth
Contributor Author

cc @dongjoon-hyun @seancxmao

@peter-toth peter-toth changed the title [SPARK-25484][TEST] Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark [SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark Oct 11, 2018
@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Oct 13, 2018

Test build #97348 has finished for PR 22617 at commit f016e20.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Oct 16, 2018

Test build #97438 has finished for PR 22617 at commit 1c30755.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Oct 17, 2018

Retest this please

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97504 has finished for PR 22617 at commit 1c30755.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth
Contributor Author

@dongjoon-hyun, @kiszk could you please help me take this PR a step forward?

val array = new ExternalAppendOnlyUnsafeRowArray(
  ExternalAppendOnlyUnsafeRowArray.DefaultInitialSizeOfInMemoryBuffer,
  numSpillThreshold)
val array = new ExternalAppendOnlyUnsafeRowArray(numSpillThreshold, numSpillThreshold)
Member

@dongjoon-hyun dongjoon-hyun Oct 28, 2018

Hi, @peter-toth. Could you explain why we need to replace ExternalAppendOnlyUnsafeRowArray.DefaultInitialSizeOfInMemoryBuffer with numSpillThreshold here? This is not an obvious refactoring.
If this is related to Fix issue in ExternalAppendOnlyUnsafeRowArray creation, please explain it clearly here or in the PR description.

Contributor Author

Thanks for the feedback @dongjoon-hyun. I added some details to the description.

Contributor Author

Actually, following my logic in the description, I think

val array = new ExternalAppendOnlyUnsafeRowArray(numSpillThreshold, numSpillThreshold)

should be changed to

val array = new ExternalAppendOnlyUnsafeRowArray(0, numSpillThreshold)

in testAgainstRawUnsafeExternalSorter in the "WITH SPILL" cases, so that ExternalAppendOnlyUnsafeRowArray is compared to UnsafeExternalSorter while it actually behaves like one.

But it would be great if someone could confirm this idea.
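
For context, a minimal sketch of the two setups being discussed, reusing the identifiers from the quoted diff and assuming the first constructor argument controls when the array switches to its sorter-backed (spilling) mode:

```scala
// Current "WITH SPILL" setup in testAgainstRawUnsafeExternalSorter:
// the in-memory buffer is as large as the spill threshold.
val current = new ExternalAppendOnlyUnsafeRowArray(numSpillThreshold, numSpillThreshold)

// Proposed alternative: with an in-memory buffer size of 0 the array switches to its
// UnsafeExternalSorter-backed storage right away, so it is compared to UnsafeExternalSorter
// while actually behaving like one.
val proposed = new ExternalAppendOnlyUnsafeRowArray(0, numSpillThreshold)
```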

Member

In that case, we need a two-step comparison.

  1. Refactor only, to ensure there is no regression.
  2. Change that value to check the performance difference.

Could you roll back this line and let us finish Step 1 first?

Contributor Author

Ok. Reverted the change.

@peter-toth peter-toth force-pushed the SPARK-25484 branch 2 times, most recently from 41c202e to 3a52abc on October 30, 2018 at 20:47
@wangyum
Member

wangyum commented Nov 7, 2018

Retest this please

@gatorsmile
Member

retest this please

@gatorsmile
Member

Please help review it @dongjoon-hyun @kiszk @wangyum

@SparkQA

SparkQA commented Jan 1, 2019

Test build #100609 has finished for PR 22617 at commit 3a52abc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Sure, @gatorsmile.
Could you rebase this and fix the build error, @peter-toth?

@peter-toth
Contributor Author

@dongjoon-hyun, sure, I will fix it soon.

@peter-toth
Contributor Author

@dongjoon-hyun, I rebased my commits and fixed the build issue.

@dongjoon-hyun
Member

Thank you for updating! I'll review today.

* {{{
* 1. without sbt:
* bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
* 2. build/sbt "sql/test:runMain <this class>"
Member

@dongjoon-hyun dongjoon-hyun Jan 9, 2019

It seems that 2 and 3 should be the same except for SPARK_GENERATE_BENCHMARK_FILES=1.
Also, we need the spark.memory.debugFill configuration for 1 (spark-submit).

Contributor Author

Actually, I'm removing spark.memory.debugFill=true from the configuration of 3 so that it becomes similar to 1 (spark-submit). spark.memory.debugFill is false by default, and setting it to true adds enormous overhead.
I think I can change it to += \"-Dspark.memory.debugFill=false\" if that fits better here.

val spillThreshold = 100 * 1000
testAgainstRawArrayBuffer(spillThreshold, 100 * 1000, 1 << 10)
testAgainstRawArrayBuffer(spillThreshold, 1000, 1 << 18)
testAgainstRawArrayBuffer(spillThreshold, 30 * 1000, 1 << 14)
Member

@dongjoon-hyun dongjoon-hyun Jan 9, 2019

Let's keep the original sequence; 1000 -> 30 * 1000 -> 100 * 1000. Increasing order is more intuitive.
Ah, I got it. The order comes from the calculation below. Please forget about the above comment.

>>> 1000 * (1<<18)
262144000
>>> 30 * 1000 * (1<<14)
491520000
>>> 100 * 1000 * (1<<10)
102400000

Spilling with 1000 rows:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                        15829 / 15845         16.6          60.4       1.0X
ExternalAppendOnlyUnsafeRowArray            10158 / 10174         25.8          38.7       1.6X
Member

It looks meaningfully different from the previous result. Let's see the server result together. I'm running this.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 9, 2019

@peter-toth, what do you mean by disable? The command and comment are using spark.memory.debugFill=true.

Please note that spark.memory.debugFill setting has a huge impact on this benchmark. Since it is set to true by default when running the benchmark from SBT, we need to disable it:

It seems that we need to update the command in the PR description and in the comment; spark.memory.debugFill=false.

@peter-toth
Contributor Author

I think I should change it; it is a bit confusing now. I used javaOptions in Test -= ... to disable it.
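
To make the two sbt forms concrete (both appear later in this thread): the first removes the -D flag that the SBT build adds by default, the second states the desired value explicitly, and either way the benchmark runs without debug filling:

```
set javaOptions in Test -= "-Dspark.memory.debugFill=true"
set javaOptions in Test += "-Dspark.memory.debugFill=false"
```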

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 9, 2019

If I changed to false, the result looks very different.

$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt ";project sql;set javaOptions in Test -= \"-Dspark.memory.debugFill=false\";test:runMain org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
...
[info] OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
[info] Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[info] Array with 100000 rows:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] ArrayBuffer                                   6653 / 6916         15.4          65.0       1.0X
[info] ExternalAppendOnlyUnsafeRowArray            25856 / 25968          4.0         252.5       0.3X

* 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt ";project sql;set javaOptions
* in Test -= \"-Dspark.memory.debugFill=true\";test:runMain <this class>"
* in Test += \"-Dspark.memory.debugFill=false\";test:runMain <this class>"
Member

Got it. I was confused by the -=.

Contributor Author

Sorry, I did it in a bit of a confusing way, but it is now updated to += ...=false in a new commit.

Member

Ur, in the PR description, runMain is repeated twice: test:runMain test:runMain

Contributor Author

fixed

Array with 1000 rows:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                                   8839 / 8951         29.7          33.7       1.0X
ExternalAppendOnlyUnsafeRowArray              9884 / 9888         26.5          37.7       0.9X
Member

Could you run this once more on your side? I got the following results; the ratio difference is too big.

Mac

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.2
[info] Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
[info] Array with 1000 rows:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] ArrayBuffer                                 10226 / 10272         25.6          39.0       1.0X
[info] ExternalAppendOnlyUnsafeRowArray            24301 / 24425         10.8          92.7       0.4X

EC2 Server

[info] OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
[info] Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[info] Array with 1000 rows:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] ArrayBuffer                                 11988 / 12027         21.9          45.7       1.0X
[info] ExternalAppendOnlyUnsafeRowArray            37480 / 37574          7.0         143.0       0.3X

Member

This is the only difference.

Contributor Author

I'm rerunning it soon

Contributor Author

@peter-toth peter-toth Jan 9, 2019

Got the same ratio as you this time:

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6
[info] Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
[info] Array with 1000 rows:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] ArrayBuffer                                 10028 / 10197         26.1          38.3       1.0X
[info] ExternalAppendOnlyUnsafeRowArray            30053 / 30312          8.7         114.6       0.3X

Member

Thank you for the confirmation. The master branch seems to have changed.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 9, 2019

I made a PR to your branch, @peter-toth.

And, for the record, the following is the Mac result. Most of it is consistent with this PR except Array with 1000 rows.

================================================================================================
WITHOUT SPILL
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Array with 100000 rows:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                                   5071 / 5178         20.2          49.5       1.0X
ExternalAppendOnlyUnsafeRowArray              5564 / 5583         18.4          54.3       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Array with 1000 rows:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                                 10146 / 10169         25.8          38.7       1.0X
ExternalAppendOnlyUnsafeRowArray            24099 / 24414         10.9          91.9       0.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Array with 30000 rows:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                                 22738 / 22748         21.6          46.3       1.0X
ExternalAppendOnlyUnsafeRowArray            28000 / 28096         17.6          57.0       0.8X


================================================================================================
WITH SPILL
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Spilling with 1000 rows:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                        18076 / 18179         14.5          69.0       1.0X
ExternalAppendOnlyUnsafeRowArray            11861 / 11872         22.1          45.2       1.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.2
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Spilling with 10000 rows:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                             7 /    7         23.1          43.3       1.0X
ExternalAppendOnlyUnsafeRowArray                 7 /    8         22.3          44.8       1.0X

* 1. without sbt:
* bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
* 2. build/sbt build/sbt ";project sql;set javaOptions
* * in Test += \"-Dspark.memory.debugFill=false\";test:runMain <this class>"
Member

Could you fix * *?

Contributor Author

done

Change-Id: I8bcecad2863c97091d8bfb4c65386a59051938c1
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                            11 /   11         14.8          67.4       1.0X
ExternalAppendOnlyUnsafeRowArray                 9 /    9         17.6          56.8       1.2X

Member

I ran the original master branch and got the following. Since the trend is the same, this refactoring PR looks safe.

$ bin/spark-submit --class org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark --jars core/target/scala-2.12/spark-core_2.12-3.0.0-SNAPSHOT-tests.jar sql/core/target/scala-2.12/spark-sql_2.12-3.0.0-SNAPSHOT-tests.jar
...
Array with 1000 rows:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                                   9556 / 9633         27.4          36.5       1.0X
ExternalAppendOnlyUnsafeRowArray            18514 / 18700         14.2          70.6       0.5X

Array with 30000 rows:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                                 22180 / 22195         22.2          45.1       1.0X
ExternalAppendOnlyUnsafeRowArray            24254 / 24331         20.3          49.3       0.9X

Array with 100000 rows:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ArrayBuffer                                   4998 / 5052         20.5          48.8       1.0X
ExternalAppendOnlyUnsafeRowArray              4778 / 4821         21.4          46.7       1.0X

Spilling with 1000 rows:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                        17536 / 17596         14.9          66.9       1.0X
ExternalAppendOnlyUnsafeRowArray            10380 / 10451         25.3          39.6       1.7X

Spilling with 10000 rows:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UnsafeExternalSorter                             6 /    7         25.3          39.5       1.0X
ExternalAppendOnlyUnsafeRowArray                 6 /    7         26.3          38.0       1.0X

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. (Pending Jenkins).

@dongjoon-hyun
Member

Thank you so much, @peter-toth.

@dongjoon-hyun
Member

Retest this please.

@peter-toth
Contributor Author

Thanks for the review @dongjoon-hyun.

@SparkQA

SparkQA commented Jan 9, 2019

Test build #100966 has finished for PR 22617 at commit b0d829e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Merged to master.

@asfgit asfgit closed this in 49c062b Jan 9, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark

## What changes were proposed in this pull request?

Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method.

## How was this patch tested?

Manually tested and regenerated results.
Please note that `spark.memory.debugFill` setting has a huge impact on this benchmark. Since it is set to true by default when running the benchmark from SBT, we need to disable it:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt ";project sql;set javaOptions in Test += \"-Dspark.memory.debugFill=false\";test:runMain org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
```

Closes apache#22617 from peter-toth/SPARK-25484.

Lead-authored-by: Peter Toth <peter.toth@gmail.com>
Co-authored-by: Peter Toth <ptoth@hortonworks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>