[SPARK-25865][CORE] Add GC information to ExecutorMetrics #22874
Conversation
@LantaoJin #22612 has been merged (and unfortunately seems like there are many merge conflicts now), would you like to update this?
@squito Yes, I will update when I get a chance.
Force-pushed from a049454 to 6a7d06b
@LantaoJin Thank you very much for the work. Would it be possible for you to add an actual Spark event log with these new GC metrics under core/test/resources/spark-events and then update HistoryServerSuite and other places accordingly? (Similar to what was done for SPARK-23429 (application_1506645932520_24630151) and SPARK-24958 (application_1538416563558_0014).) It would be nice to have that here as well.
core/src/main/scala/org/apache/spark/metrics/ExecutorMetricType.scala
I'm not entirely sure these metrics make sense with ExecutorMetrics, or maybe they just need extra handling. The memory metrics make sense as instantaneous snapshots, but GC count & GC time are constantly increasing. They make sense over the entire lifetime of the executor, but not when viewed within one stage -- you'd want to subtract out the value at the beginning of the stage.
BTW Spark already tracks total GC time as a task metric -- of course that's a little misleading since GC is not assignable to just one task, but it's at least available, and can already be aggregated for the whole executor. So we should think about what the value add is over that existing metric.
Sorry for only commenting about this now, I hadn't thought carefully about what these metrics mean before this.
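(To illustrate the concern: the JVM's GC counters are cumulative since JVM start, so a per-stage view would have to snapshot and subtract. A minimal sketch with hypothetical helper names, not the PR's actual code:)

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Cumulative totals since JVM start, as exposed by the GarbageCollectorMXBeans.
def totalGcCountAndTime(): (Long, Long) = {
  val beans = ManagementFactory.getGarbageCollectorMXBeans.asScala
  (beans.map(_.getCollectionCount).sum, beans.map(_.getCollectionTime).sum)
}

// Hypothetical per-stage view: snapshot at stage start, subtract at stage end.
val (countAtStart, timeAtStart) = totalGcCountAndTime()
// ... run the stage ...
val (countAtEnd, timeAtEnd) = totalGcCountAndTime()
val gcCountDuringStage = countAtEnd - countAtStart
val gcTimeMsDuringStage = timeAtEnd - timeAtStart
```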
You are right. They only make sense over the entire lifetime. I don't want to split these metrics per stage. I will check the current implementation. One of the purposes of adding this is determining the frequency of major & minor GC. Memory usage alone can't tell us whether the memory allocation is reasonable. For example, consider two executors, both configured with 10GB of memory and both using nearly 10GB. Should we increase or decrease the configured memory for them? These metrics may be helpful: we can increase the configured memory for the first one if it has very frequent major GCs, and decrease it for the second one if it has only occasional minor GCs and no major GCs.
retest this please
can you add a smaller event log file? you really just need one, maybe two executors with this update, right? This one is really too big to be checked in.
-rwxr-xr-x 1 irashid staff 34M Jan 7 15:43 application_1536831636016_56998
-rw-r--r-- 1 irashid staff 336K Jan 3 14:52 application_1538416563558_0014
-rw-r--r-- 1 irashid staff 148K Jan 3 14:52 application_1506645932520_24630151
-rwxr-xr-x 1 irashid staff 135K Oct 14 2016 local-1430917381534
-rwxr-xr-x 1 irashid staff 126K Aug 6 12:13 app-20161115172038-0000
-rwxr-xr-x 1 irashid staff 121K Aug 6 12:13 app-20161116163331-0000
-rwxr-xr-x 1 irashid staff 120K Nov 28 16:47 application_1516285256255_0012
-rwxr-xr-x 1 irashid staff 111K Nov 28 16:47 app-20180109111548-0000
-rwxr-xr-x 1 irashid staff 67K Oct 26 2016 local-1422981780767
-rwxr-xr-x 1 irashid staff 65K Oct 26 2016 local-1422981759269
-rwxr-xr-x 1 irashid staff 65K Oct 26 2016 local-1425081759269
-rwxr-xr-x 1 irashid staff 17K Oct 26 2016 local-1426533911241
-rwxr-xr-x 1 irashid staff 17K Oct 26 2016 local-1426633911242
-rw-r--r-- 1 irashid staff 5.1K Oct 26 2016 local-1430917381535_1
-rw-r--r-- 1 irashid staff 5.1K Oct 26 2016 local-1430917381535_2
@squito Replaced with a small one:
Jenkins, ok to test
Test build #100934 has finished for PR 22874 at commit
retest this please
@LantaoJin it won't work even with retest because of the following:
@LantaoJin you should add "application_1536831636016_59384_1" to /dev/.rat-excludes
@gaborgsomogyi @rezasafi Thank you for the kind reminder.
Jenkins, ok to test
Test build #101001 has finished for PR 22874 at commit
Failed due to flaky test: HiveClientSuites
retest this please
Jenkins, ok to test
Test build #101079 has finished for PR 22874 at commit
generally looks good, a couple of minor things
Force-pushed from 3e8c6b7 to 6d5643c
Fixed the conflicts with master.
Test build #102507 has finished for PR 22874 at commit
retest this please
Test build #102534 has finished for PR 22874 at commit
retest this please
other than a nit, lgtm, assuming tests pass.
Can you please update the description, as that becomes the commit msg? I would add some of your prior explanation for the utility of this as a metric over the lifetime of the executor #22874 (comment), though it's not so useful per-stage.
Test build #4561 has finished for PR 22874 at commit
Test build #102572 has finished for PR 22874 at commit
    } else if (!nonBuiltInCollectors.contains(mxBean.getName)) {
      nonBuiltInCollectors = mxBean.getName +: nonBuiltInCollectors
      // log it when first seen
      logWarning(s"Add the non-built-in garbage collector(s) $nonBuiltInCollectors " +
nit: to me, the subject of this sentence is unclear. At first, I interpreted it as "(Spark) adds the ...". Can we make it clearer? For example, "Please add ..." or "Users should add ...".
@kiszk Yes, I've modified it now.
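(For context, the classification logic this warning belongs to works roughly as follows: each GarbageCollectorMXBean name is matched against configurable lists of known young/old generation collectors, and anything unrecognized is logged once. The lists and identifiers below are illustrative assumptions, not the exact PR code.)

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Illustrative defaults; in the PR the lists are configurable.
val youngGenCollectors = Set("Copy", "PS Scavenge", "ParNew", "G1 Young Generation")
val oldGenCollectors = Set("MarkSweepCompact", "PS MarkSweep", "ConcurrentMarkSweep", "G1 Old Generation")

var minorGcCount, minorGcTime, majorGcCount, majorGcTime = 0L
var nonBuiltInCollectors: Seq[String] = Nil

ManagementFactory.getGarbageCollectorMXBeans.asScala.foreach { mxBean =>
  if (youngGenCollectors.contains(mxBean.getName)) {
    minorGcCount += mxBean.getCollectionCount
    minorGcTime += mxBean.getCollectionTime
  } else if (oldGenCollectors.contains(mxBean.getName)) {
    majorGcCount += mxBean.getCollectionCount
    majorGcTime += mxBean.getCollectionTime
  } else if (!nonBuiltInCollectors.contains(mxBean.getName)) {
    nonBuiltInCollectors = mxBean.getName +: nonBuiltInCollectors
    // Warn once per unknown collector so users know to extend the configured lists.
    println(s"Unrecognized garbage collector(s): $nonBuiltInCollectors")
  }
}
```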
LGTM except one minor comment.
Test build #102745 has finished for PR 22874 at commit
merged to master, thanks @LantaoJin -- sorry for the delay from me, was out last week
@LantaoJin Could you submit another PR to document these new metrics in https://spark.apache.org/docs/latest/monitoring.html? If we do not document them, Spark users will not be aware of them.
@gatorsmile PR: #24090. Could you help review it when you have time?
I use the IBM Java SDK 8.0, and after upgrading to Spark 3 I see these warnings on every single job run:
Does anyone know what to do? I'm using the Baizaicloud Prometheus push gateway metrics sink, if that's relevant.
@drauschenbach I am not sure, but it looks like
@LantaoJin Works great, thanks!
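(For anyone else hitting this warning: the PR also added configuration to declare which collector names count as young/old generation, so JVMs with non-HotSpot collector names can be covered. A hedged example; the IBM/OpenJ9 collector names "scavenge" and "global" are an assumption for the default gencon GC policy, so check what your JVM's GC MXBeans actually report.)

```scala
import org.apache.spark.SparkConf

// Collector names below are assumptions for IBM/OpenJ9's default gencon policy;
// verify against the names your JVM's GarbageCollectorMXBeans report.
val conf = new SparkConf()
  .set("spark.eventLog.gcMetrics.youngGenerationGarbageCollectors", "scavenge")
  .set("spark.eventLog.gcMetrics.oldGenerationGarbageCollectors", "global")
```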
What changes were proposed in this pull request?
Memory usage alone, without GC information, cannot help us determine the proper memory settings. We need GC metrics about the frequency of major & minor GC. For example, consider two executors, both configured with 10GB of memory and both using nearly 10GB. Should we increase or decrease the configured memory for them? These metrics may be helpful: we can increase the configured memory for the first one if it has very frequent major GCs, and decrease it for the second one if it has only occasional minor GCs and no major GCs.
GC metrics are only meaningful over the entire lifetime of an executor, not for individual stages.
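(A rough illustration of the tuning signal described above; the numbers and thresholds are made up, and the field names are hypothetical rather than the exact metric names.)

```scala
// Hypothetical stats for the two 10GB executors described above.
case class GcStats(minorGcCount: Long, majorGcCount: Long, uptimeMinutes: Long)

val executorA = GcStats(minorGcCount = 1200, majorGcCount = 300, uptimeMinutes = 60)
val executorB = GcStats(minorGcCount = 40, majorGcCount = 0, uptimeMinutes = 60)

def advice(s: GcStats): String = {
  val majorPerMinute = s.majorGcCount.toDouble / s.uptimeMinutes
  if (majorPerMinute > 1.0) "frequent major GC: consider increasing executor memory"
  else if (s.majorGcCount == 0) "no major GC and little pressure: memory could likely be reduced"
  else "GC activity looks reasonable"
}

println(advice(executorA)) // frequent major GC: consider increasing executor memory
println(advice(executorB)) // no major GC and little pressure: memory could likely be reduced
```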
How was this patch tested?
Added unit tests.