[SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer (For branch-1.1)#3321
Conversation
|
QA tests have started for PR 3321 at commit
|
|
@JoshRosen can you take a look since you wrote the original patch? |
|
QA tests have finished for PR 3321 at commit
|
|
Test PASSed. |
|
This looks good to me. I tried doing a cherry-pick myself and it looks like the original merge conflicts were caused by the ConnectionManager being moved to the |
|
(Edit: earlier comment said |
|
Yeah it would be good to get this in before we cut another RC for 1.1.1 |
|
Alright, this looks good, so I'm going to merge it into branch-1.1. Thanks @sarutak! |
|
I've merged this, but GitHub won't automatically close PRs made against non-master branches. Here's the commit in the Apache Git (the mirroring hasn't picked it up yet): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=91b5fa82477e5fd43712fdf067d92a31d4037a83 Do you mind closing this now that I've merged it? Thanks again! |
|
Thanks for notification. I close this PR. |
This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs.
This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case.