[SPARK-36070][CORE] Log time cost info for writing rows out and committing the task #33279
Conversation
cc @cloud-fan @maropu @dongjoon-hyun, thanks
Kubernetes integration test starting
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test status success
Test build #140844 has finished for PR 33279 at commit
Test build #140847 has finished for PR 33279 at commit
thanks, merged to master
dataWriter.writeWithIterator(iterator)
dataWriter.commit()
}
logInfo(s"$taskAttemptID finished to write and commit. Elapsed time: $timeCost ms.")
After some more thought, I think it's better to use SQL metrics for it. It's very hard to know max/min/avg by reading the logs.
@AngersZhuuuu I think you tried it before. Can you restore the work?
Yea, working on this
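For context on the suggestion above, a minimal hypothetical sketch (not code from this PR) of how the per-task write/commit time could be surfaced as a Spark SQL timing metric instead of a log line. The class name, metric name, and method are illustrative; it assumes Spark's `SQLMetrics.createTimingMetric` API:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical sketch: a timing SQL metric declared on the driver and updated
// from tasks, so the UI can aggregate values across all tasks of a stage.
class WriteTimingMetricsSketch(sc: SparkContext) {
  // "task commit time" is an illustrative metric name.
  val taskCommitTime: SQLMetric =
    SQLMetrics.createTimingMetric(sc, "task commit time")

  // Called with the measured elapsed time (milliseconds) for one task.
  def recordCommitTime(elapsedMs: Long): Unit = taskCommitTime.add(elapsedMs)
}
```

A timing metric is aggregated per operator, so the UI would show min/median/max of the write-and-commit time across tasks, which is exactly the information that is hard to recover from per-task log lines.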
What changes were proposed in this pull request?
We have a job with a stage containing about 8k tasks. Most tasks take about 1-10 minutes to finish, but 3 of them run extremely slowly despite having similar data sizes: they take about 1 hour each to finish, and so do their speculative attempts.
The root cause is most likely latency in the storage system, but it is not straightforward to tell where the performance issue occurs: in shuffle read, task execution, writing the output, committing the task, etc.
Why are the changes needed?
On the Spark side, we can record the time cost in the logs, which makes bug hunting and performance tuning easier.
Does this PR introduce any user-facing change?
no
How was this patch tested?
Passing GA (GitHub Actions).