Skip test_multithreading in Python 2 (#4107)

Conversation
(force-pushed 0e893fe to b867dbc)
Test FAILed.
@raulchen can you elaborate a bit on that? The plasma store is dying? Do we know why?

I don't know the exact reason yet. My observation is that this issue only occurs in Python 2, and the error is a broken pipe (see the paste below). It looks like the plasma store dies (though I can't imagine why the Python version would affect the plasma store). Also, this test was still stable after @guoyuhong fixed the random ID issue. It seems to me that a recent arrow update triggered this.
Reducing the workload couldn't fix this test. I was able to reproduce the error on my machine, but I didn't see any error in the plasma log. I'm not familiar with the plasma code. @pcmoritz do you have any idea why the store would close the connection?
@raulchen @pcmoritz if the plasma store is crashing in a multi-threading situation, my best guess is the issue described in #2610 (comment). Normally, we make the plasma client thread safe in Ray because the C++ client is not thread safe. However, whenever a Python object backed by the object store goes out of scope, its deallocation calls release on the underlying client without holding the lock.
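To illustrate the race described above, here is a minimal, hypothetical Python sketch (`FakePlasmaClient`, `ThreadSafeClient`, and `TrackedBuffer` are invented names standing in for the real Arrow/plasma classes): a buffer's destructor triggers a release, and routing that release through the same lock as every other client call serializes access instead of racing.

```python
import threading


class FakePlasmaClient(object):
    """Stand-in for the real (not thread-safe) plasma client."""

    def __init__(self):
        self.release_count = 0

    def release(self, object_id):
        # A read-modify-write that must not be interleaved across threads.
        count = self.release_count
        self.release_count = count + 1


class ThreadSafeClient(object):
    """Hypothetical wrapper: every call, including releases triggered by
    buffer deallocation, is serialized through one lock."""

    def __init__(self, client):
        self._client = client
        self.lock = threading.Lock()

    def release(self, object_id):
        with self.lock:
            self._client.release(object_id)


class TrackedBuffer(object):
    """Sketch of a buffer that releases its object on deallocation,
    loosely mirroring what PlasmaBuffer's __dealloc__ does."""

    def __init__(self, client, object_id):
        self._client = client
        self._object_id = object_id

    def __del__(self):
        # In the reported bug, this release ran WITHOUT holding the lock.
        self._client.release(self._object_id)


client = ThreadSafeClient(FakePlasmaClient())


def worker(n):
    for i in range(n):
        buf = TrackedBuffer(client, i)
        del buf  # drops the last reference; release runs under the lock


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With all releases funneled through the lock, the four threads' 4000 releases are counted without lost updates.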
@robertnishihara thanks, that sounds like the cause. But I still don't understand why this only happens on Python 2.
I tried locking the release with the following changes:

```diff
diff --git a/python/ray/worker.py b/python/ray/worker.py
index 42735604..218d56fe 100644
--- a/python/ray/worker.py
+++ b/python/ray/worker.py
@@ -404,6 +404,8 @@ class Worker(object):
                             object_ids[i + j],
                             serialization_context,
                         ))
+                with self.plasma_client.lock:
+                    del metadata_data_pairs
             return results
         except pyarrow.DeserializationCallbackError:
             # Wait a little bit for the import thread to import the class.
```

Then the plasma store crashed with this error:
Looks like a bug in the plasma store. Update: this also happens only on Python 2.
I think this is the same problem as lines 536 to 542 in 692bb33.

The timing of GC in Python 2 is non-deterministic. So I think we can only fix this by making the Cython plasma client internally thread-safe.
@raulchen I see, thanks for looking into it. I think that's the right strategy (maybe even in C++ instead of Cython).
@raulchen Thanks for looking into this. Python 2 vs. Python 3 differences often bring strange issues. Looking forward to the deprecation of Python 2... ;-)
Note that there is still a bug here which should be fixed even if we only support Python 3.
(force-pushed b867dbc to d45c235)
Protect PlasmaBuffer.__dealloc__ with lock
I'd like to fix Python 3 and skip the test for Python 2 first, to unblock CI. We can fix the Cython/C++ part later.
@raulchen what's the fix that you have in mind? |
@robertnishihara In this PR, protect PlasmaBuffer.__dealloc__ with a lock.
python/ray/worker.py (outdated)

                # called right after the reference count decreases to
                # 0. We need to fix this by making the underlying
                # Cython or C++ plasma client thread-safe.
                del metadata_data_pairs
This just decrements a reference count, right? But the deallocation won't actually happen here because metadata and data are still in scope.

If the retrieved value is a numpy array, then the __dealloc__ won't happen until the object ID goes out of scope (once the user is done using it).

I don't see how to protect the release call with a lock other than by putting a lock in the actual __dealloc__ method.
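The point above, that `del` only drops one reference and finalization waits for the last holder, can be seen in a small self-contained sketch (`Tracked` is a made-up class standing in for a PlasmaBuffer-backed object; the immediate finalization on the last `del` is CPython refcounting behavior):

```python
class Tracked(object):
    """Stand-in for a buffer-like object; records when it is finalized."""
    finalized = False

    def __del__(self):
        # In the real code this is where plasma_client.release would run,
        # possibly without holding the client lock.
        Tracked.finalized = True


buf = Tracked()
view = buf   # e.g. a numpy array keeping the underlying buffer alive
del buf      # drops one reference; __del__ does NOT run yet
assert not Tracked.finalized

del view     # last reference gone; CPython finalizes immediately
assert Tracked.finalized
```

So a `del metadata_data_pairs` under the lock only helps if nothing else (such as a deserialized numpy array) still references the buffer.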
metadata and data should be out of scope, because the del is outside the for loop.

However, if the deserialized numpy array still holds the PlasmaBuffer, that's indeed still a problem.

Also, I'm still confused about why Python 3 could work before this PR. __dealloc__ should be called without the lock. Do you have any idea?

Maybe I can just skip the test for Python 2 in this PR?
@raulchen, it's failing in both Python 2 and Python 3, isn't it? Is it really just Python 2?
We can skip the test in Python 2 if you think that's best, but please add a TODO and note that this is a bug and that it should be re-enabled when we fix it in Plasma.
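A skip of that shape could look like the following sketch (written with the stdlib `unittest.skipIf` for self-containedness; the real Ray test suite and its test names may differ, and the TODO wording here is illustrative):

```python
import sys
import unittest


class MultithreadingTest(unittest.TestCase):
    # TODO: this skip papers over a real bug -- PlasmaBuffer.__dealloc__
    # calls release without holding the client lock. Re-enable on Python 2
    # once the plasma client is made internally thread-safe.
    @unittest.skipIf(sys.version_info[0] < 3,
                     "plasma client release race crashes the store on Python 2")
    def test_multithreading(self):
        self.assertTrue(True)  # placeholder body for this sketch


# Run the suite programmatically so the result can be inspected.
suite = unittest.TestLoader().loadTestsFromTestCase(MultithreadingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

On Python 3 the test runs normally; on Python 2 it is reported as skipped, with the reason string preserving the TODO.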
Yeah, it only fails in Python 2. I confirmed this again by running it 50 times on my machine.
Test FAILed.
Protect PlasmaBuffer.__dealloc__ with lock
Skip test_multithreading in Python 2
Test FAILed.
@robertnishihara Can this PR be merged now?
LGTM
What do these changes do?

- When a plasma buffer is deallocated, plasma_client.release will be called. Moreover, the timing of GC is out of our control, so the real fix is to make the plasma client itself thread-safe, as opposed to guarding it with a lock. For now, skip test_multithreading in Python 2.
- Speed up test_multithreading by reducing the workload (18s -> 7s on my machine).

Related issue number