[SPARK-36070][CORE] Log time cost info for writing rows out and committing the task #33279
Conversation
cc @cloud-fan @maropu @dongjoon-hyun, thanks
Kubernetes integration test starting
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test status success
Test build #140844 has finished for PR 33279 at commit
Test build #140847 has finished for PR 33279 at commit
thanks, merged to master
dataWriter.writeWithIterator(iterator)
dataWriter.commit()
}
logInfo(s"$taskAttemptID finished to write and commit. Elapsed time: $timeCost ms.")
After some more thought, I think it's better to use SQL metrics for it. It's very hard to know max/min/avg by reading the logs.
@AngersZhuuuu I think you tried it before. Can you restore the work?
Yea, working on this
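For context on the suggestion above, a minimal hypothetical sketch (not code from this PR) of how the per-task write/commit time could be surfaced as a Spark SQL timing metric instead of a log line. The class name, metric name, and method are illustrative; it assumes Spark's `SQLMetrics.createTimingMetric` API:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical sketch: a timing SQL metric declared on the driver and updated
// from tasks, so the UI can aggregate values across all tasks of a stage.
class WriteTimingMetricsSketch(sc: SparkContext) {
  // "task commit time" is an illustrative metric name.
  val taskCommitTime: SQLMetric =
    SQLMetrics.createTimingMetric(sc, "task commit time")

  // Called with the measured elapsed time (milliseconds) for one task.
  def recordCommitTime(elapsedMs: Long): Unit = taskCommitTime.add(elapsedMs)
}
```

A timing metric is aggregated per operator, so the UI would show min/median/max of the write-and-commit time across tasks, which is exactly the information that is hard to recover from per-task log lines.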
What changes were proposed in this pull request?
We have a job with a stage containing about 8k tasks. Most tasks take about 1-10 minutes to finish, but 3 of them run extremely slowly despite having similar data sizes: they take about 1 hour each to finish, and so do their speculative attempts.
The root cause is most likely latency in the storage system, but it is not straightforward to tell where the performance issue occurs: in shuffle read, task execution, writing the output, committing the task, etc.
Why are the changes needed?
On the Spark side, we can record the time cost in the logs, which makes bug hunting and performance tuning easier.
Does this PR introduce any user-facing change?
no
How was this patch tested?
Passing GA (GitHub Actions).