[SPARK-49479][CORE] Use daemon ScheduledThreadPoolExecutor for BarrierCoordinator #47956
Conversation
This commit depends on #47957 and #47945 for fixing SPARK-49479 in branch 3.5.
@LuciferYang @yaooqinn Please take a look, thanks!
Thanks for the reproducible test - let me try to debug with it.
So the issue here is the mixing of two different APIs - hence, when ... In SPARK-46895, ...
That is what causes the issue here. To test, the following diff ensures that the attached test that @jshmchenxi provided works fine (it is just a POC, not a complete fix).
Note that there are other cases where this happens ... and simply changing to ...
+CC @jiangxb1987, who wrote the initial impl, and @LuciferYang, who merged SPARK-46895, for thoughts.
…rCoordinator It is observed that when using the Spark TorchDistributor, the Spark driver pod can hang around after calling spark.stop(). Although the SparkContext was shut down, the JVM was still running. The reason was a non-daemon timer thread named "BarrierCoordinator barrier epoch increment timer", which prevented the driver JVM from stopping. In SPARK-46895 we replaced the timer with a non-daemon single-thread scheduled executor, but the issue still exists. We should use a daemon single-thread scheduled executor instead.
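The fix described above boils down to constructing the scheduled executor with a thread factory that marks its threads as daemon, since the JVM exits only once all non-daemon threads have finished. A minimal plain-JDK sketch of the idea (the helper and thread names here are illustrative, not Spark's actual code):

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class DaemonTimerSketch {
    // Hypothetical helper (not Spark code): a ThreadFactory that marks its
    // threads as daemon so they cannot keep the JVM alive on their own.
    static ThreadFactory daemonFactory(String name) {
        return runnable -> {
            Thread t = new Thread(runnable, name);
            t.setDaemon(true); // daemon threads do not block JVM shutdown
            return t;
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledThreadPoolExecutor timer =
            new ScheduledThreadPoolExecutor(1, daemonFactory("barrier-epoch-timer"));
        // Schedule a periodic task, loosely analogous to the barrier epoch timer.
        timer.scheduleAtFixedRate(
            () -> System.out.println("tick"), 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(120);
        // Even if shutdown() were never called, a pool of only daemon threads
        // would not prevent the JVM from exiting after main() returns.
        timer.shutdownNow();
    }
}
```

Spark's own utilities follow the same pattern; the essential change is just the daemon flag on the timer thread.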
What changes were proposed in this pull request?
Use a daemon ScheduledThreadPoolExecutor instead of a non-daemon one as the timer for BarrierCoordinator.
Why are the changes needed?
In Barrier Execution Mode, the Spark driver JVM can hang around after calling spark.stop(). Although the SparkContext was shut down, the JVM was still running. The reason was a non-daemon timer thread named "BarrierCoordinator barrier epoch increment timer", which prevented the driver JVM from stopping.
Does this PR introduce any user-facing change?
No
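The hang described above follows from a basic JVM rule: the runtime exits only when every remaining live thread is a daemon. A small sketch showing that the JDK's default thread factory (which a plainly constructed executor uses) creates non-daemon threads:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class NonDaemonDemo {
    public static void main(String[] args) {
        // The JVM exits only when every remaining live thread is a daemon.
        // The JDK's default thread factory creates NON-daemon threads, so an
        // executor built with it keeps the JVM alive until it is shut down,
        // which is the behavior this PR fixes for the barrier timer.
        ThreadFactory defaultFactory = Executors.defaultThreadFactory();
        Thread worker = defaultFactory.newThread(() -> {});
        System.out.println("default factory thread is daemon? " + worker.isDaemon());
        // prints: default factory thread is daemon? false
    }
}
```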
How was this patch tested?
Manual test.
Run the barrier_example.py script locally using ./bin/spark-submit barrier_example.py. Without this change, the JVM would hang and not exit. With this change, it exits successfully.
Was this patch authored or co-authored using generative AI tooling?
No