[WIP][SPARK-37877][SQL] Support clear shuffle dependencies eagerly for thrift server #35178
Conversation
FWIW, there's a bug in updating the status in the check (#35179 (comment)). We should check the actual tests, e.g., at https://github.com/yaooqinn/spark/runs/4798556788
    }
+   if (cleanShuffleDeps && rdd != null) {
+     rdd.cleanShuffleDependencies()
+     rdd = null
When THRIFTSERVER_INCREMENTAL_COLLECT is true, we will end up cleaning up the dependencies before we have consumed all of the data from df.
Wrap iter (the IterableFetchIterator) in a CompletionIterator and move this cleanup there?
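A minimal sketch of that suggestion, assuming Spark's internal org.apache.spark.util.CompletionIterator helper; the names `rdd` and `cleanShuffleDeps` come from the surrounding diff, while `withEagerCleanup` and `resultIter` are hypothetical stand-ins for the iterator wiring:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.util.CompletionIterator  // note: private[spark]

// `rdd` is the stored df.rdd (a var so it can be nulled out) and
// `cleanShuffleDeps` is the new config value, both from the diff above.
// The cleanup now runs only after the iterator is fully consumed, so
// incremental collect can finish reading before shuffle files go away.
def withEagerCleanup(resultIter: Iterator[Row]): Iterator[Row] =
  CompletionIterator[Row, Iterator[Row]](resultIter, {
    if (cleanShuffleDeps && rdd != null) {
      rdd.cleanShuffleDependencies()
      rdd = null
    }
  })
```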
Thanks for the input.
    HiveThriftServer2.eventManager.onStatementParsed(statementId,
-     result.queryExecution.toString())
+     df.queryExecution.toString())
+   rdd = df.rdd
df.rdd has tripped me up in the past: the fact that it actually ends up executing the prefix DAG was surprising (one team was relying on df.rdd.partitions.length to make some decisions).
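A hypothetical snippet illustrating that caveat: Dataset.rdd is not a free conversion, since producing the RDD forces physical planning of everything upstream (and under adaptive execution can run the shuffle stages), so a seemingly cheap call like partitions.length may trigger real work.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// A query whose plan contains a shuffle (the GROUP BY).
val df = spark.range(0, 1000).groupBy((col("id") % 10).as("k")).count()
// Looks like a metadata call, but it plans (and may execute) the prefix DAG.
val numParts = df.rdd.partitions.length
```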
The changes here cannot trigger the eager cleanup in all cases. For example, in CTAS, where the SELECT contains a shuffle, that shuffle will not be cleared. The possible and proper place for this may be SparkPlan.
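For illustration, a hypothetical statement of that shape (table names are made up):

```scala
// The SELECT side of this CTAS performs a shuffle for the GROUP BY, but the
// statement returns no rows to fetch, so a cleanup hook tied to consuming
// the result RDD never observes that shuffle.
spark.sql("CREATE TABLE agg_t AS SELECT k, COUNT(*) AS c FROM src GROUP BY k")
```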
Still a POC; any input is welcome :)
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
In this PR, we add a config to control whether thrift server operations clear their shuffle dependencies eagerly after execution.
In long-running applications like the thrift server, with large driver JVMs where there is little memory pressure on the driver, driver GC may happen only occasionally or not at all. Not cleaning at all may lead to executors running out of disk space after a while.
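A hypothetical sketch of what the proposed flag could look like as a SQLConf entry; the key, default, and doc text are illustrative, since the PR was still WIP:

```scala
// Assumes SQLConf's buildConf helper; the actual config in the PR may differ.
val THRIFTSERVER_CLEAN_SHUFFLE_DEPENDENCIES =
  buildConf("spark.sql.thriftServer.cleanShuffleDependencies.enabled")
    .doc("When true, thrift server operations eagerly clean up their " +
      "shuffle dependencies after execution instead of waiting for driver " +
      "GC to collect the referencing RDDs.")
    .booleanConf
    .createWithDefault(false)
```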
Why are the changes needed?
The thrift server currently relies on a periodic System.gc() to release shuffle metadata and data, which is not efficient enough for this usage scenario, as the GC does not always occur promptly. Uncleaned data may cause memory issues on the thrift server, like OOM, or disk issues on the worker side, like "no space left on device".
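A minimal sketch of the eager path this PR builds on, assuming a running SparkSession `spark` and a table `t` (names are illustrative); RDD.cleanShuffleDependencies(), the API used in the diff above, asks Spark's ContextCleaner to remove the RDD's shuffle state eagerly instead of waiting for a JVM GC to collect the references:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.sql("SELECT k, COUNT(*) AS c FROM t GROUP BY k")
val rdd = df.rdd                                // materialize the plan's RDD
val rows = rdd.collect()                        // consume the results first
rdd.cleanShuffleDependencies(blocking = false)  // then drop shuffle files eagerly
```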
Does this PR introduce any user-facing change?
Yes, a new configuration was added.
How was this patch tested?