Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[coe] Remove gil when submit actor and tasks to avoid deadlock for some cases. #26421

Merged
merged 1 commit into from
Jul 18, 2022

Conversation

fishbone
Copy link
Contributor

@fishbone fishbone commented Jul 9, 2022

Why are these changes needed?

When submit task, GIL is not released due to this PR.
This cause a potential deadlock when actor died and got notified by GCS. In this case, the callback function submitted by GetAsync is going to execute some python function and the GIL is still hold by submit task. And submit task is blocking by a lock which is not released.

In the previous PR, it seems to fix some memory issue, but it's seems not there any more.

Related issue number

#26414

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone changed the title fix [coe] Remove gil when submit actor and tasks to avoid deadlock for some cases. Jul 9, 2022
@fishbone fishbone marked this pull request as ready for review July 10, 2022 06:35
@fishbone
Copy link
Contributor Author

I tried to get a script to reproduce the issue, but it seems hard. Also, I actually can't find the root cause, backtrace is in here

The thing here is that submit task is holding GIL and trying to hold a mutex. The mutex should have been released by the other thread, but somehow not.

But just with this fix, the KubeRay HttpProxy won't hang there.

@edoakes are you think it's a safe fix? Do you know how can I verify the original issue is not there anymore?

@fishbone
Copy link
Contributor Author

Also cc @scv119 please share some ideas if you have.

@edoakes
Copy link
Collaborator

edoakes commented Jul 14, 2022

@iycheng I read through the old thread and vaguely remember the context, but unfortunately I don't think I was ever able to get a minimal repo:

This fixes the issue by not releasing the GIL while calling SubmitActorTask (also included SubmitTask because it's likely to have the same issue). The fix was verified by manual testing using Serve's test_api.py. Unfortunately I don't fully understand the offending memory invalidation and I wasn't able to come up with a minimal regression test, but I think we should merge the fix regardless.

I would be pretty nervous about merging this PR as-is though given that we don't understand what would have fixed it independently...

@fishbone
Copy link
Contributor Author

fishbone commented Jul 15, 2022

@edoakes finally I figure out what's the deadlock here. It's actually very simple, but the bad naming of the lock (we have a lot of mutex named as mu_ :( ) direct me to the wrong place.

Here when actor died, mu_ will be held and also later, it'll try to execute python function which require GIL

In SubmitActorTask, GIL is not released, and it tries to acquire mu_ and thus deadlock.

I don't think it can be easily fixed since in the disconnect actor, it actually need the mutex to be held. We probably should give this PR a try (whether nightly test or other test failed) first.

Let me see whether a test can be added.

Signed-off-by: Yi Cheng <chengyidna@gmail.com>
@edoakes
Copy link
Collaborator

edoakes commented Jul 15, 2022

@iycheng I'm mostly following -- why does DisconnectActor need to acquire the GIL? Is it in the callbacks executed at the end of the function?

@edoakes
Copy link
Collaborator

edoakes commented Jul 15, 2022

Anyways, let's definitely run a full suite of nightly tests before merging.

@fishbone
Copy link
Contributor Author

@iycheng I'm mostly following -- why does DisconnectActor need to acquire the GIL? Is it in the callbacks executed at the end of the function?

@edoakes yes, the callback contains a python function, so it'll need to acquire the GIL.

@edoakes
Copy link
Collaborator

edoakes commented Jul 15, 2022

Makes sense, Python callbacks have been a big source of deadlocks in the past. Would be great if we had a way to catch GIL-related deadlocks with absl somehow...

@fishbone
Copy link
Contributor Author

@rkooo567
Copy link
Contributor

There are many tests failing... In a glance, there should be no issues holding gil when calling c++ funcs... do you know why this was written this way in the first place? (wonder if we are missing any edge cases)

@fishbone
Copy link
Contributor Author

There are many tests failing... In a glance, there should be no issues holding gil when calling c++ funcs... do you know why this was written this way in the first place? (wonder if we are missing any edge cases)

Please check the description. I believe it's because of memory corruption in some places before. (maybe cython bugs)

@fishbone
Copy link
Contributor Author

fishbone commented Jul 16, 2022

Btw, @rkooo567 CI test failure is not related. Most nightly tests seems not related as well. I'll take another look.

@fishbone
Copy link
Contributor Author

nightly tests are ok (either not related or failed in master). Merge this PR.

@fishbone fishbone merged commit df421ad into ray-project:master Jul 18, 2022
@fishbone fishbone deleted the no-gil branch July 18, 2022 19:48
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
…me cases. (ray-project#26421)

When submit task, GIL is not released due to this PR.
This cause a potential deadlock when actor died and got notified by GCS. In this case, the callback function submitted by GetAsync is going to execute some python function and the GIL is still hold by submit task. And submit task is blocking by a lock which is not released.

In the previous PR, it seems to fix some memory issue, but it's seems not there any more.

Signed-off-by: Yi Cheng <chengyidna@gmail.com>
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…me cases. (ray-project#26421)

When submit task, GIL is not released due to this PR.
This cause a potential deadlock when actor died and got notified by GCS. In this case, the callback function submitted by GetAsync is going to execute some python function and the GIL is still hold by submit task. And submit task is blocking by a lock which is not released.

In the previous PR, it seems to fix some memory issue, but it's seems not there any more.

Signed-off-by: Yi Cheng <chengyidna@gmail.com>
Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants