[SPARK-26260][Core]For disk store tasks summary table should show only successful tasks summary #26508


Closed

shahidki31 wants to merge 16 commits into master from the task branch

Conversation

@shahidki31 (Contributor) commented Nov 13, 2019

…sks metrics for disk store

What changes were proposed in this pull request?

After #23088, the task summary table in the stage page shows successful task metrics for the InMemory store. This PR adds the same for the disk store.

Why are the changes needed?

Now both the InMemory and disk stores will be consistent in showing the task summary table in the UI when there are non-successful tasks.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT. Manually verified.

Test steps:

  1. Add the config in `spark-defaults.conf`: `spark.history.store.path /tmp/store`
  2. `sbin/start-history-server.sh`
  3. `bin/spark-shell`
  4. `sc.parallelize(1 to 1000, 2).map(x => throw new Exception("fail")).count`

Screenshot 2019-11-14 at 3 51 39 AM

@shahidki31 (Contributor Author)

cc @vanzin @srowen. This PR is a follow-up to PR #23088. Kindly review.

@shahidki31 shahidki31 changed the title [SPARK-26260][Core]Tasks summary table should show only successful ta… [SPARK-26260][Core]For disk store tasks summary table should show only successful tasks summary Nov 13, 2019
SparkQA commented Nov 14, 2019

Test build #113730 has finished for PR 26508 at commit df3e56b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 14, 2019

Test build #113732 has finished for PR 26508 at commit 03304e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 14, 2019

Test build #113800 has finished for PR 26508 at commit a5690f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented Nov 15, 2019

So, to solve this problem, you're changing the format of the data on disk. That breaks backwards compatibility (you'll run into problems if you run with an old disk store). So you have to update AppStatusStore.CURRENT_VERSION.

Given that I don't see a way to fix this without changing the disk format, I wonder if there are better alternatives than basically doubling the amount of disk space needed to store metrics.

What if you use the same existing fields and indices, but record successful tasks with positive numbers, and in progress or failed tasks as negative? You need some slight adjustments (e.g. TaskDataWrapper.hasMetrics needs to be a field now instead of computed from metrics, you need to use math.abs when returning metrics, etc), but wouldn't it result in the same thing?
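A minimal, hedged sketch of that encoding (hypothetical names; the real `TaskDataWrapper` change would differ in details, e.g. how legitimately zero-valued metrics are disambiguated):

```scala
// Hedged sketch of the sign-encoding idea, not the actual TaskDataWrapper
// code. Successful tasks keep positive metric values; other tasks store
// them negated, so an ascending index scan starting at 0 skips them.
class TaskMetricSketch(
    val status: String,
    val hasMetrics: Boolean, // now an explicit field, not computed from metrics
    rawRunTime: Long) {

  // The sign encodes task state (a real implementation must also handle
  // zero-valued metrics, which negation alone cannot mark).
  private val storedRunTime: Long =
    if (status == "SUCCESS") rawRunTime else -rawRunTime

  // Readers always recover the true (positive) value.
  def executorRunTime: Long = math.abs(storedRunTime)
}
```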

@shahidki31 (Contributor Author)

Thanks @vanzin for the input. Let me check and update.

SparkQA commented Nov 16, 2019

Test build #113942 has finished for PR 26508 at commit f7a15d6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 16, 2019

Test build #113943 has finished for PR 26508 at commit c8196b8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31 (Contributor Author)

> What if you use the same existing fields and indices, but record successful tasks with positive numbers, and in progress or failed tasks as negative? You need some slight adjustments (e.g. TaskDataWrapper.hasMetrics needs to be a field now instead of computed from metrics, you need to use math.abs when returning metrics, etc), but wouldn't it result in the same thing?

Thanks @vanzin, I have updated accordingly. It is working fine for both the live and history UI.

@shahidki31 (Contributor Author)

I have tested manually and also updated the UT:

  1. Failed tasks
    Screenshot 2019-11-17 at 1 04 19 AM

  2. 100,000 tasks per stage
    bin/spark-shell
    sc.parallelize(1 to 100000, 100000).count()

  • Time to load the task page the first time from InMemory (for both Live and History UI): ~9-10 sec
  • Time to load the task page the first time from DiskStore: ~14-15 sec

SparkQA commented Nov 16, 2019

Test build #113944 has finished for PR 26508 at commit e8d0d14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 16, 2019

Test build #113946 has finished for PR 26508 at commit b0988c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 16, 2019

Test build #113947 has finished for PR 26508 at commit 34420ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 17, 2019

Test build #113958 has finished for PR 26508 at commit f52eed8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 17, 2019

Test build #113961 has finished for PR 26508 at commit af7244e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) left a comment


Small test suggestion, otherwise looks ok.

@shahidki31 (Contributor Author)

@vanzin Is it ok to go in? Thanks

@shahidki31 (Contributor Author)

Thanks @vanzin for the comments. I have updated accordingly.

SparkQA commented Nov 22, 2019

Test build #114295 has finished for PR 26508 at commit a8233dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 22, 2019

Test build #114294 has finished for PR 26508 at commit c5d76fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 22, 2019

Test build #114299 has finished for PR 26508 at commit d11859c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 22, 2019

Test build #114302 has finished for PR 26508 at commit 07e4350.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) left a comment


Couple of small nits, then good to go.

SparkQA commented Nov 23, 2019

Test build #114317 has finished for PR 26508 at commit 8c5a37d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 23, 2019

Test build #114316 has finished for PR 26508 at commit 733b20e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented Nov 25, 2019

Merging to master.

@vanzin vanzin closed this in bec2068 Nov 25, 2019
@shahidki31 (Contributor Author)

Thanks a lot @vanzin, @srowen.

@shahidki31 shahidki31 deleted the task branch November 25, 2019 18:25
asfgit pushed a commit that referenced this pull request Mar 8, 2022
…ks` percentile metrics

### What changes were proposed in this pull request?

#### Background
In PR #26508 (SPARK-26260) the SHS stage metric percentiles were updated to only include successful tasks when using disk storage. It did this by making the values for each metric negative when the task is not in a successful state. This approach was chosen to avoid breaking changes to disk storage. See [this comment](#26508 (comment)) for context.

To get the percentiles, it reads the metric values, starting at 0, in ascending order. This filters out all tasks that are not successful because their values are less than 0. To get the percentile values it scales the percentiles to the list indices of successful tasks. For example, if there are 200 tasks and you want percentiles [0, 25, 50, 75, 100], the lookup indexes in the task collection are [0, 50, 100, 150, 199].
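A minimal sketch of that index scaling (hypothetical helper name; not the actual `AppStatusStore` code):

```scala
// Hedged sketch of the percentile-to-index scaling described above.
def lookupIndices(quantiles: Seq[Double], successfulCount: Long): Seq[Long] =
  quantiles.map(q => math.min((q * successfulCount).toLong, successfulCount - 1))

// 200 successful tasks, percentiles [0, 25, 50, 75, 100]:
// lookupIndices(Seq(0.0, 0.25, 0.5, 0.75, 1.0), 200) == Seq(0, 50, 100, 150, 199)
```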

#### Issue
For metrics 1) shuffle total reads and 2) shuffle total blocks, PR #26508 incorrectly makes the metric indices positive. This means tasks that are not successful are included in the percentile calculations. The percentile lookup index calculation is still based on the number of successful tasks, so the wrong task metric is returned for a given percentile. This was not caught because the unit test only verified values for one metric, `executorRunTime`.

#### Fix
The index values for `SHUFFLE_TOTAL_READS` and `SHUFFLE_TOTAL_BLOCKS` should not convert back to positive metric values for tasks that are not successful. I believe the conversion was done because these metric values are summed from two other metrics. Using the raw values still creates the desired outcome: `negative + negative = negative` and `positive + positive = positive`. There is no case where one component is negative and the other positive. I also verified that these two metrics are only used in the percentile calculations, where only successful tasks are used.
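As a hedged illustration (hypothetical parameter names), summing the raw stored component values keeps the sign marker intact:

```scala
// SHUFFLE_TOTAL_READS index sketch: both stored components are >= 0 for a
// successful task and negative otherwise, so their raw sum preserves the
// sign encoding without any math.abs conversion.
def shuffleTotalReadsIndex(storedRemoteBytesRead: Long,
                           storedLocalBytesRead: Long): Long =
  storedRemoteBytesRead + storedLocalBytesRead
```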

### Why are the changes needed?
This change is required so that the SHS stage percentile metrics for shuffle read bytes and shuffle total blocks are correct.

### Does this PR introduce _any_ user-facing change?
Yes. The user will see the correct percentile values for the stage summary shuffle read bytes.

### How was this patch tested?
I updated the unit test to verify the percentile values for every task metric. I also modified the unit test to have unique values for every metric. Previously the test had the same metrics for every field. This would not catch bugs like the wrong field being read by accident.

I manually validated the fix in the UI.

**BEFORE**
![image](https://user-images.githubusercontent.com/5604993/155433460-322078c5-1821-4f2e-8e53-8fc3902eb7fe.png)

**AFTER**
![image](https://user-images.githubusercontent.com/5604993/155433491-25ce3acf-290b-4b83-a0a9-0f9b71c7af04.png)

I manually validated the fix in the task summary API (`/api/v1/applications/application_123/1/stages/14/0/taskSummary?quantiles=0,0.25,0.5,0.75,1.0`). See `shuffleReadMetrics.readBytes` and `shuffleReadMetrics.totalBlocksFetched`.

Before:
```json
{
   "quantiles":[
      0.0,
      0.25,
      0.5,
      0.75,
      1.0
   ],
   "shuffleReadMetrics":{
      "readBytes":[
         -2.0,
         -2.0,
         -2.0,
         -2.0,
         5.63718681E8
      ],
      "totalBlocksFetched":[
         -2.0,
         -2.0,
         -2.0,
         -2.0,
         2.0
      ],
      ...
   },
   ...
}
```

After:
```json
{
   "quantiles":[
      0.0,
      0.25,
      0.5,
      0.75,
      1.0
   ],
   "shuffleReadMetrics":{
      "readBytes":[
         5.62865286E8,
         5.63779421E8,
         5.63941681E8,
         5.64327925E8,
         5.7674183E8
      ],
      "totalBlocksFetched":[
         2.0,
         2.0,
         2.0,
         2.0,
         2.0
      ],
      ...
   }
   ...
}
```

Closes #35637 from robreeves/SPARK-38309.

Authored-by: Rob Reeves <roreeves@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
asfgit pushed a commit that referenced this pull request Mar 8, 2022
…ks` percentile metrics

Closes #35637 from robreeves/SPARK-38309.

Authored-by: Rob Reeves <roreeves@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 0ad7677)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
asfgit pushed a commit that referenced this pull request Mar 8, 2022
…ks` percentile metrics

Closes #35637 from robreeves/SPARK-38309.

Authored-by: Rob Reeves <roreeves@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 0ad7677)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…ks` percentile metrics

Closes apache#35637 from robreeves/SPARK-38309.

Authored-by: Rob Reeves <roreeves@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 0ad7677)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
(cherry picked from commit e067b12)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>