-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[coe] Remove gil when submit actor and tasks to avoid deadlock for some cases. #26421
Conversation
I tried to get a script to reproduce the issue, but it seems hard. Also, I actually can't find the root cause, backtrace is in here The thing here is that submit task is holding GIL and trying to hold a mutex. The mutex should have been released by the other thread, but somehow not. But just with this fix, the KubeRay HttpProxy won't hang there. @edoakes are you think it's a safe fix? Do you know how can I verify the original issue is not there anymore? |
Also cc @scv119 please share some ideas if you have. |
@iycheng I read through the old thread and vaguely remember the context, but unfortunately I don't think I was ever able to get a minimal repo:
I would be pretty nervous about merging this PR as-is though given that we don't understand what would have fixed it independently... |
@edoakes finally I figure out what's the deadlock here. It's actually very simple, but the bad naming of the lock (we have a lot of mutex named as mu_ :( ) direct me to the wrong place. Here when actor died, In SubmitActorTask, GIL is not released, and it tries to acquire I don't think it can be easily fixed since in the disconnect actor, it actually need the mutex to be held. We probably should give this PR a try (whether nightly test or other test failed) first. Let me see whether a test can be added. |
@iycheng I'm mostly following -- why does |
Anyways, let's definitely run a full suite of nightly tests before merging. |
@edoakes yes, the callback contains a python function, so it'll need to acquire the GIL. |
Makes sense, Python callbacks have been a big source of deadlocks in the past. Would be great if we had a way to catch GIL-related deadlocks with absl somehow... |
Running here: https://buildkite.com/ray-project/release-tests-pr/builds/9465#018200d7-56d9-4926-8fe3-530358fe5f16 I'll keep an eye on the result. |
There are many tests failing... In a glance, there should be no issues holding gil when calling c++ funcs... do you know why this was written this way in the first place? (wonder if we are missing any edge cases) |
Please check the description. I believe it's because of memory corruption in some places before. (maybe cython bugs) |
Btw, @rkooo567 CI test failure is not related. Most nightly tests seems not related as well. I'll take another look. |
nightly tests are ok (either not related or failed in master). Merge this PR. |
…me cases. (ray-project#26421) When submit task, GIL is not released due to this PR. This cause a potential deadlock when actor died and got notified by GCS. In this case, the callback function submitted by GetAsync is going to execute some python function and the GIL is still hold by submit task. And submit task is blocking by a lock which is not released. In the previous PR, it seems to fix some memory issue, but it's seems not there any more. Signed-off-by: Yi Cheng <chengyidna@gmail.com> Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
…me cases. (ray-project#26421) When submit task, GIL is not released due to this PR. This cause a potential deadlock when actor died and got notified by GCS. In this case, the callback function submitted by GetAsync is going to execute some python function and the GIL is still hold by submit task. And submit task is blocking by a lock which is not released. In the previous PR, it seems to fix some memory issue, but it's seems not there any more. Signed-off-by: Yi Cheng <chengyidna@gmail.com> Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Why are these changes needed?
When submit task, GIL is not released due to this PR.
This cause a potential deadlock when actor died and got notified by GCS. In this case, the callback function submitted by GetAsync is going to execute some python function and the GIL is still hold by submit task. And submit task is blocking by a lock which is not released.
In the previous PR, it seems to fix some memory issue, but it's seems not there any more.
Related issue number
#26414
Checks
scripts/format.sh
to lint the changes in this PR.