Skip to content

[SPARK-25865][CORE] Add GC information to ExecutorMetrics #22874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from

Conversation

LantaoJin
Copy link
Contributor

@LantaoJin LantaoJin commented Oct 29, 2018

What changes were proposed in this pull request?

Only memory usage without GC information could not help us to determinate the proper settings of memory. We need the GC metrics about frequency of major & minor GC. For example, two cases, their configured memory for executor are all 10GB and their usages are all near 10GB. So should we increase or decrease the configured memory for them? This metrics may be helpful. We can increase configured memory for the first one if it has very frequency major GC and decrease the second one if only some minor GC and none major GC.
GC metrics are only useful in entire lifetime of executors instead of separated stages.

How was this patch tested?

Adding UT.

@LantaoJin LantaoJin changed the title [WIP][SPARK-25865][CORE] Add GC information to ExecutorMetrics [SPARK-25865][WIP][CORE] Add GC information to ExecutorMetrics Oct 29, 2018
@squito
Copy link
Contributor

squito commented Dec 11, 2018

@LantaoJin #22612 has been merged (and unfortunately seems like there are many merge conflicts now), would you like to update this?

@LantaoJin
Copy link
Contributor Author

@squito Yes, I will update when I get a chance.

@LantaoJin LantaoJin changed the title [SPARK-25865][WIP][CORE] Add GC information to ExecutorMetrics [SPARK-25865][CORE] Add GC information to ExecutorMetrics Dec 17, 2018
@LantaoJin
Copy link
Contributor Author

LantaoJin commented Dec 17, 2018

Updated. cc @squito @rezasafi

@rezasafi
Copy link
Contributor

@LantaoJin Thank you very much for the work. Is that possible for you to add an actual spark event log with these new GC metrics under the core/test/resources/spark-events and then update HistoryServerSuite and other places accordingly? (similar to what was done for SPARK-23429(application_1506645932520_24630151) and SPARK-24958(application_1538416563558_0014)). It would be nice to have that here as well.

Copy link
Contributor

@squito squito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure these metrics make sense with ExecutorMetrics, or maybe it just needs extra handling. The memory metrics make sense as instaneous snapshots, but gc count & gc time are constantly increasing. They make sense over the entire lifetime of the executor, but not when viewed within one stage -- you'd want to subtract out the value at the beginning of the stage.

BTW spark already tracks total gc time as a task metrics -- of course that's a little misleading since gc is not assignable to just one task, but its at least available, and can already be aggregated for the whole executor. So we should think about what is the value add over that existing metric.

Sorry for only commenting about this now, I hadn't thought carefully about what these metrics mean before this.

@LantaoJin
Copy link
Contributor Author

They make sense over the entire lifetime of the executor, but not when viewed within one stage -- you'd want to subtract out the value at the beginning of the stage.

You are right. They only make sense over the entire lifetime. I don't want to separate this metrics to multi-stage. I will check the current implementation. One of purposes to add this is determining the frequency of major & minor GC. Only memory usage couldn't tell us the rationality of memory allocation. For example, two cases, their configured memory for executor are all 10GB and their usages are all near 10GB. So should we increase or decrease the configured memory for them? This metrics may be helpful. We can increase configured memory for the first one if it has very frequency major GC and decrease the second one if only some minor GC and none major GC.

@LantaoJin
Copy link
Contributor Author

retest this please

Copy link
Contributor

@squito squito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a smaller event log file? you really just need one, maybe two executors with this update, right? This one is really too big to be checked in.

-rwxr-xr-x  1 irashid  staff    34M Jan  7 15:43 application_1536831636016_56998
-rw-r--r--  1 irashid  staff   336K Jan  3 14:52 application_1538416563558_0014
-rw-r--r--  1 irashid  staff   148K Jan  3 14:52 application_1506645932520_24630151
-rwxr-xr-x  1 irashid  staff   135K Oct 14  2016 local-1430917381534
-rwxr-xr-x  1 irashid  staff   126K Aug  6 12:13 app-20161115172038-0000
-rwxr-xr-x  1 irashid  staff   121K Aug  6 12:13 app-20161116163331-0000
-rwxr-xr-x  1 irashid  staff   120K Nov 28 16:47 application_1516285256255_0012
-rwxr-xr-x  1 irashid  staff   111K Nov 28 16:47 app-20180109111548-0000
-rwxr-xr-x  1 irashid  staff    67K Oct 26  2016 local-1422981780767
-rwxr-xr-x  1 irashid  staff    65K Oct 26  2016 local-1422981759269
-rwxr-xr-x  1 irashid  staff    65K Oct 26  2016 local-1425081759269
-rwxr-xr-x  1 irashid  staff    17K Oct 26  2016 local-1426533911241
-rwxr-xr-x  1 irashid  staff    17K Oct 26  2016 local-1426633911242
-rw-r--r--  1 irashid  staff   5.1K Oct 26  2016 local-1430917381535_1
-rw-r--r--  1 irashid  staff   5.1K Oct 26  2016 local-1430917381535_2

@LantaoJin
Copy link
Contributor Author

@squito Replaced with a small one:

285K Jan 8 12:42 core/src/test/resources/spark-events/application_1536831636016_59384_1

@squito
Copy link
Contributor

squito commented Jan 8, 2019

Jenkins, ok to test

@SparkQA
Copy link

SparkQA commented Jan 8, 2019

Test build #100934 has finished for PR 22874 at commit c3faf61.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

retest this please

@gaborgsomogyi
Copy link
Contributor

@LantaoJin it won't work even with retest because of the following:

Could not find Apache license headers in the following files:
 !????? /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/resources/spark-events/application_1536831636016_59384_1

@rezasafi
Copy link
Contributor

rezasafi commented Jan 9, 2019

@LantaoJin you should add "application_1536831636016_59384_1 "to /dev/.rat-excludes

@LantaoJin
Copy link
Contributor Author

@gaborgsomogyi @rezasafi Thank you for the kindly reminding.

@LantaoJin
Copy link
Contributor Author

Jenkins, ok to test

@SparkQA
Copy link

SparkQA commented Jan 10, 2019

Test build #101001 has finished for PR 22874 at commit 3d9a02c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

Fails by Flaky Test: HiveClientSuites

@LantaoJin
Copy link
Contributor Author

retest this please

@LantaoJin
Copy link
Contributor Author

Jenkins, ok to test

@SparkQA
Copy link

SparkQA commented Jan 11, 2019

Test build #101079 has finished for PR 22874 at commit 6dede3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

Gentle ping @squito @kiszk

Copy link
Contributor

@squito squito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks good, a couple of minor things

@LantaoJin
Copy link
Contributor Author

Fix the conflicts of master.
retest this please

@SparkQA
Copy link

SparkQA commented Feb 19, 2019

Test build #102507 has finished for PR 22874 at commit 6d5643c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 20, 2019

Test build #102534 has finished for PR 22874 at commit 4d1021c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

retest this please

Copy link
Contributor

@squito squito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other than a nit, lgtm, assuming tests pass.

Can you please update the description, as that becomes the commit msg? I would add some of your prior explanation for the utility of this as a metric over the lifetime of the executor #22874 (comment), though its not so useful per-stage.

@SparkQA
Copy link

SparkQA commented Feb 21, 2019

Test build #4561 has finished for PR 22874 at commit 4d1021c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 21, 2019

Test build #102572 has finished for PR 22874 at commit ba500fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else if (!nonBuiltInCollectors.contains(mxBean.getName)) {
nonBuiltInCollectors = mxBean.getName +: nonBuiltInCollectors
// log it when first seen
logWarning(s"Add the non-built-in garbage collector(s) $nonBuiltInCollectors " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: to me, the subject of this sentence is unclear. At first, I interpreted as (Spark) add the ..... Can we make it clear?
For example, Please add ... or Users should add ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk Yes, I've modified it now.

@kiszk
Copy link
Member

kiszk commented Feb 24, 2019

LGTM except one minor comment.

@SparkQA
Copy link

SparkQA commented Feb 25, 2019

Test build #102745 has finished for PR 22874 at commit 958a728.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

Gently ping @squito @kiszk

@squito
Copy link
Contributor

squito commented Mar 4, 2019

merged to master, thanks @LantaoJin -- sorry for the delay from me, was out last week

@asfgit asfgit closed this in e5c502c Mar 4, 2019
Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LantaoJin Could you submit another PR to document these new metrics in https://spark.apache.org/docs/latest/monitoring.html

If we do not document them, the Spark users will not be aware of them.

@LantaoJin
Copy link
Contributor Author

@LantaoJin Could you submit another PR to document these new metrics in https://spark.apache.org/docs/latest/monitoring.html

If we do not document them, the Spark users will not be aware of them.

@gatorsmile PR: #24090 , Could you help to review when you have time?

@drauschenbach
Copy link

drauschenbach commented Jul 27, 2020

I use the IBM java sdk 8.0, and after upgrading to Spark 3 I see these warnings on every single job run:

20/07/26 11:40:38 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(scavenge), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
20/07/26 11:40:38 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(global, scavenge), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors

Does anyone know what to do?

I'm using the Baizaicloud Prometheus push gateway metrics sink, if that's relevant.

@LantaoJin
Copy link
Contributor Author

LantaoJin commented Jul 28, 2020

@drauschenbach I am not sure, but looks like scavenge is the young generation collector of IBM JDK and the global may be the old gen. (To confirm it, you'd better to refer the document of IBM JDK). So you can add spark.eventLog.gcMetrics.youngGenerationGarbageCollectors scavenge and spark.eventLog.gcMetrics.oldGenerationGarbageCollectors global in spark-defaults.conf or set --conf spark.eventLog.gcMetrics.youngGenerationGarbageCollectors=scavenge when submitting. This is used to collect GC metrics from Spark executors. If you don't need these GC metrics, just ignore the warn log.

@drauschenbach
Copy link

@LantaoJin Works great, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants