
[SPARK-31471][TESTS][BUILD] Add a python script to run multiple benchmarks #28244


Closed
wants to merge 1 commit

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Apr 17, 2020

What changes were proposed in this pull request?

Added the script from PR #27078, which allows running multiple benchmarks.

The script should be executed from Spark's home directory:

$ ./dev/run-benchmarks.py

Why are the changes needed?

Currently, I have to track when one benchmark finishes in order to launch the next one. This is inconvenient, especially when I need to run several benchmarks (3-5 or more), since I have to periodically check whether the current benchmark has completed.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the script manually

@MaxGekk
Member Author

MaxGekk commented Apr 17, 2020

@HyukjinKwon @dongjoon-hyun I think this script is useful for anyone who cares about Spark performance.

@MaxGekk MaxGekk changed the title [SPARK-31471][TESTS] Add a python script to run multiple benchmarks [SPARK-31471][TESTS][BUILD] Add a python script to run multiple benchmarks Apr 17, 2020
from sparktestsupport.shellutils import run_cmd

benchmarks = [
['sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'],
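The diff excerpt above is truncated. As a minimal, self-contained sketch of what such a sequential runner could look like (the `build_command` helper, the `dry_run` flag, and the use of the standard `subprocess` module instead of `sparktestsupport.shellutils.run_cmd` are illustrative assumptions, not the PR's exact code):

```python
#!/usr/bin/env python3
# Hypothetical sketch of a sequential benchmark runner. The single entry
# below mirrors the excerpt above; build_command and dry_run are
# illustrative assumptions, not code from this PR.
import subprocess

benchmarks = [
    # (sbt project, fully qualified benchmark class)
    ('sql/test', 'org.apache.spark.sql.execution.benchmark.AggregateBenchmark'),
]

def build_command(project, benchmark_class):
    # Benchmarks are launched via sbt's runMain, as the review thread notes.
    return ['build/sbt', f'{project}:runMain {benchmark_class}']

def run_all(dry_run=False):
    commands = [build_command(p, c) for p, c in benchmarks]
    if not dry_run:
        for cmd in commands:
            # Run the benchmarks one after another, failing fast on error.
            subprocess.run(cmd, check=True)
    return commands
```

Because the benchmarks run strictly one after another, a script of this shape can be started once and left unattended, which is the workflow described in the PR description.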
Member


Max, if this targets listing all benchmarks, can we use testOnly or list the files in specific directories, so that we don't need to add an entry here every time a new benchmark is added?
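One way to realize this suggestion, sketched here purely as an illustration (the `path_to_class` and `discover_benchmarks` helpers and the `src/test/scala` layout convention are assumptions, not code from this PR), is to derive each fully qualified class name from the `*Benchmark.scala` files under a module's test tree:

```python
# Hypothetical sketch: discover benchmark classes from the filesystem
# instead of listing them by hand. The helpers and directory convention
# are illustrative assumptions.
import glob
import os

def path_to_class(path, src_root='src/test/scala'):
    # e.g. 'sql/core/src/test/scala/org/apache/spark/.../FooBenchmark.scala'
    #   -> 'org.apache.spark....FooBenchmark'
    marker = src_root + os.sep
    relative = path.split(marker, 1)[1]   # strip everything up to the source root
    no_ext = relative[:-len('.scala')]    # drop the file extension
    return no_ext.replace(os.sep, '.')    # directories become package segments

def discover_benchmarks(module_dir):
    # Find every *Benchmark.scala under the module's test sources.
    pattern = os.path.join(module_dir, 'src/test/scala', '**', '*Benchmark.scala')
    return sorted(path_to_class(p) for p in glob.glob(pattern, recursive=True))
```

This trades the explicit, hand-edited list for a filesystem convention.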

Member Author


It does runMain of a specific class. That is the recommended way of running benchmarks; see the comments in the benchmark classes themselves.

Member Author


listing the files in specific directories so that we don't need to add an entry here every time a new benchmark is added, if this targets listing all benchmarks?

I don't like this automation because I usually run a subset of the benchmarks, and for me it is easier to just comment out the unneeded ones in the script.

Member Author


My use case is running some of the benchmarks overnight, say the date-time related ones plus parquet/orc. So I don't need to run all of them, since that can take a day to complete. I included the other benchmarks (their project and class names) because I had already tested them in PR #27078.

@SparkQA

SparkQA commented Apr 17, 2020

Test build #121411 has finished for PR 28244 at commit 2afdc11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


I also made and used a private one. So an official script to run all benchmarks might be helpful for release managers and anyone who wants one. However, if this PR doesn't aim to cover all benchmarks, I'm -1 for maintaining this partial one.

For personal use cases, I'd like to recommend maintaining your custom one privately.

@MaxGekk
Member Author

MaxGekk commented Apr 17, 2020

However, if this PR doesn't aim to cover all benchmarks, I'm -1 for maintaining this partial one.

I don't understand your position. Let me ask you: how often do you run all benchmarks? My use case, and I believe the most common one, is to run a few benchmarks overnight without tracking each of them. That is the main purpose of the script. Each time, I have to find my PR with the script and copy-paste it to a new EC2 instance.

For personal use cases, I'd like to recommend to maintain your custom one privately.

But why? Why don't you want to make the lives of other developers easier?

for maintaining this partial one.

Maintaining? I don't think the cost of maintaining a few lines is that high.

@dongjoon-hyun
Member

dongjoon-hyun commented Apr 17, 2020

I'm not sure why you don't automate your environment. I have been using Terraform/Ansible for a long time to create the same EC2 benchmark machine, clone Spark, check out the PR (or a dev branch), and upload my benchmark script to the machine. It's a pretty standard automation approach.

@dongjoon-hyun
Member

cc @gatorsmile.

@MaxGekk
Member Author

MaxGekk commented Apr 21, 2020

Thank you @dongjoon-hyun @HyukjinKwon for review. I am closing this.

@MaxGekk MaxGekk closed this Apr 21, 2020
@MaxGekk MaxGekk deleted the dev-run-benchmarks branch June 5, 2020 19:47