@yaooqinn yaooqinn commented Jan 12, 2022

What changes were proposed in this pull request?

This PR adds a config to control whether Thrift Server operations clear their shuffle dependencies eagerly after execution.

In long-running applications such as Thrift Server, with large driver JVMs and little memory pressure on the driver, driver GC may occur only occasionally, or not at all. Never cleaning up may lead to executors running out of disk space after a while.

Why are the changes needed?

The Thrift Server currently relies on a periodic System.gc() to release shuffle metadata, shuffle data, etc., which is not efficient for its usage scenario, since GC does not always occur promptly. Uncleaned data may cause memory issues on the Thrift Server side, such as OOM, or disk issues on the worker side, such as "no space left on device".
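
As a plain-JVM sketch (not Spark code; Spark's ContextCleaner tracks shuffles with weak references in a similar way), the snippet below shows why GC-driven cleanup is lazy: a weakly referenced object is only enqueued for cleanup after a GC cycle actually runs, so on a large, low-pressure driver heap cleanup can be delayed indefinitely.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

object GcCleanupDemo {
  def main(args: Array[String]): Unit = {
    val queue = new ReferenceQueue[AnyRef]()
    // The byte array stands in for shuffle metadata held by the driver.
    var payload: AnyRef = new Array[Byte](1024)
    val ref = new WeakReference[AnyRef](payload, queue)

    payload = null // drop the last strong reference
    // Nothing is enqueued yet: cleanup has to wait for a GC cycle.
    var enqueued: AnyRef = null
    var attempts = 0
    while (enqueued == null && attempts < 50) {
      System.gc()                   // request a collection
      enqueued = queue.remove(100L) // wait up to 100 ms for the reference
      attempts += 1
    }
    println(enqueued eq ref)
  }
}
```

Until a collection actually runs, the reference stays out of the queue; an eager, explicit cleanup removes that dependence on GC timing.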

Does this PR introduce any user-facing change?

Yes, a new configuration is added.

How was this patch tested?

@github-actions github-actions bot added the SQL label Jan 12, 2022
@yaooqinn yaooqinn self-assigned this Jan 12, 2022
@yaooqinn yaooqinn changed the title [SPARK-37877][SQL] Support clear shuffle dependencies eagerly for thrift server [WIP][SPARK-37877][SQL] Support clear shuffle dependencies eagerly for thrift server Jan 13, 2022
@yaooqinn yaooqinn marked this pull request as draft January 13, 2022 03:35
@HyukjinKwon
Member

FWIW, there's a bug in updating the status in the check (#35179 (comment)). We should check the actual tests, e.g. at https://github.com/yaooqinn/spark/runs/4798556788

}
if (cleanShuffleDeps && rdd != null) {
rdd.cleanShuffleDependencies()
rdd = null
Contributor

@mridulm mridulm Jan 13, 2022

When THRIFTSERVER_INCREMENTAL_COLLECT is true, we will end up cleaning the dependencies before we have consumed all the data from df.
Wrap iter (an IterableFetchIterator) in a CompletionIterator and move this cleanup there?
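
A simplified, self-contained sketch of the pattern suggested here (Spark's real CompletionIterator lives in org.apache.spark.util and has a slightly different signature; the class name below is made up for illustration): run a cleanup callback only once the wrapped iterator is exhausted, so shuffle files are not removed while an incremental-collect client is still fetching rows.

```scala
// Runs onComplete exactly once, after the wrapped iterator is drained.
class CleanupOnCompletionIterator[A](sub: Iterator[A], onComplete: () => Unit)
    extends Iterator[A] {
  private var completed = false
  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) {
      completed = true
      onComplete() // e.g. rdd.cleanShuffleDependencies() in the PR's setting
    }
    more
  }
  override def next(): A = sub.next()
}

object CompletionDemo {
  def main(args: Array[String]): Unit = {
    var cleaned = false
    val it = new CleanupOnCompletionIterator(Iterator(1, 2, 3), () => cleaned = true)
    it.next()          // consume one element
    println(cleaned)   // cleanup has not run mid-iteration
    it.toList          // drain the rest
    println(cleaned)   // cleanup ran once the iterator was exhausted
  }
}
```

One caveat of this pattern: if the client abandons iteration early, the callback never fires, so a fallback cleanup path is still needed.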

Member Author


Thanks for the input.

HiveThriftServer2.eventManager.onStatementParsed(statementId,
result.queryExecution.toString())
df.queryExecution.toString())
rdd = df.rdd
Contributor

@mridulm mridulm Jan 13, 2022


df.rdd has tripped me up in the past - the fact that it actually ends up executing the prefix DAG was surprising (one team was relying on df.rdd.partitions.length to make some decisions).

Member Author


The changes here cannot trigger the eager cleanup in all cases. For example, for CTAS, where the SELECT part contains a shuffle, the dependencies will not be cleared. A possible and proper place may be SparkPlan.

@yaooqinn
Member Author

Still a POC; any input is welcome :)

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 25, 2022
@github-actions github-actions bot closed this Apr 26, 2022