
Skip test_multithreading in Python 2 #4107

Merged · 4 commits · Feb 27, 2019

Conversation

@raulchen (Contributor) commented Feb 21, 2019

What do these changes do?

  • This test fails in Python 2, likely because the plasma client is not thread-safe: when a PlasmaBuffer object gets GC'ed, plasma_client.release is called, and the timing of GC is out of our control. So we can only fix this by making the plasma client itself thread-safe, rather than guarding it with a lock.
  • Speed up test_multithreading by reducing the workload (18s -> 7s on my machine).

Related issue number

@raulchen force-pushed the multithreading_test branch from 0e893fe to b867dbc on February 21, 2019 04:14
@AmplabJenkins commented

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12185/

@AmplabJenkins commented

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12186/

@robertnishihara (Collaborator) commented

@raulchen can you elaborate a bit on that? The plasma store is dying? Do we know why?

@raulchen (Contributor, author) commented

I don't know the exact reason yet. My observation is that this issue only occurs in Python 2, and the error is a broken pipe (see the paste below). It looks like the plasma store dies (though I can't imagine why the Python version would affect the plasma store).

Also, this test was still stable after @guoyuhong fixed the random ID issue. It seems to me that a recent Arrow update triggered this.

test/runtest.py:1333: in test_api_in_multi_threads
    run_test_in_multi_threads(test_put_and_get)
test/runtest.py:1293: in run_test_in_multi_threads
    assert future.result() == "ok"
../../../miniconda/lib/python2.7/site-packages/concurrent/futures/_base.py:462: in result
    return self.__get_result()
../../../miniconda/lib/python2.7/site-packages/concurrent/futures/thread.py:63: in run
    result = self.fn(*self.args, **self.kwargs)
test/runtest.py:1286: in wrapper
    test_case()
test/runtest.py:1330: in test_put_and_get
    result = ray.get(ray.put(value))
../../../.local/lib/python2.7/site-packages/ray-0.7.0.dev0-py2.7-linux-x86_64.egg/ray/worker.py:2286: in put
    worker.put_object(object_id, value)
../../../.local/lib/python2.7/site-packages/ray-0.7.0.dev0-py2.7-linux-x86_64.egg/ray/worker.py:355: in put_object
    self.store_and_register(object_id, value)
../../../.local/lib/python2.7/site-packages/ray-0.7.0.dev0-py2.7-linux-x86_64.egg/ray/worker.py:289: in store_and_register
    self.task_driver_id))
../../../.local/lib/python2.7/site-packages/ray-0.7.0.dev0-py2.7-linux-x86_64.egg/ray/utils.py:443: in _wrapper
    return orig_attr(*args, **kwargs)
pyarrow/_plasma.pyx:494: in pyarrow._plasma.PlasmaClient.put
    buffer = self.create(target_id, serialized.total_bytes)
pyarrow/_plasma.pyx:325: in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   raise ArrowIOError(message)
E   ArrowIOError: Broken pipe
pyarrow/error.pxi:83: ArrowIOError

@raulchen (Contributor, author) commented

Reducing the workload didn't fix this test.

I was able to reproduce this error on my machine, but I didn't see any errors in the plasma log.

I'm not familiar with the plasma code. @pcmoritz, do you have any idea why the store would close the connection?

@robertnishihara (Collaborator) commented

@raulchen @pcmoritz if the plasma store is crashing in a multi-threading situation, my best guess is the issue described in #2610 (comment).

Normally, we make the plasma client thread-safe in Ray because the C++ client is not thread-safe. However, whenever a Python object backed by the object store goes out of scope, it calls plasma_client.release(object_id), which is not protected by a lock. This can cause issues. One solution is to make the C++ client thread-safe.
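To illustrate the hazard and the proposed fix with a minimal sketch (FakePlasmaClient and ThreadSafeClient are hypothetical names, not Ray's or Arrow's actual API): every call into the non-thread-safe client, including the release() path that finalizers trigger at unpredictable times, has to go through one lock owned by the wrapper itself:

```python
import threading

class FakePlasmaClient:
    """Stand-in for a client that is NOT thread-safe on its own."""
    def __init__(self):
        self.refcounts = {}

    def create(self, object_id):
        self.refcounts[object_id] = self.refcounts.get(object_id, 0) + 1

    def release(self, object_id):
        self.refcounts[object_id] -= 1

class ThreadSafeClient:
    """Guard every client call, including release() from finalizer
    paths, with a single lock so no two threads touch the client
    concurrently."""
    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()

    def create(self, object_id):
        with self._lock:
            self._client.create(object_id)

    def release(self, object_id):
        # This is the call that __del__/__dealloc__ can issue at an
        # arbitrary time, from an arbitrary thread.
        with self._lock:
            self._client.release(object_id)
```

The point of the sketch is that the lock lives inside the client wrapper, so callers (and finalizers) cannot bypass it, which is exactly what guarding individual call sites with an external lock fails to guarantee.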

@raulchen (Contributor, author) commented

@robertnishihara thanks, that sounds like the cause. But I still don't understand why this only happens in Python 2.

@raulchen (Contributor, author) commented Feb 22, 2019

I tried locking the release with the following changes:

diff --git a/python/ray/worker.py b/python/ray/worker.py
index 42735604..218d56fe 100644
--- a/python/ray/worker.py
+++ b/python/ray/worker.py
@@ -404,6 +404,8 @@ class Worker(object):
                                 object_ids[i + j],
                                 serialization_context,
                             ))
+                    with self.plasma_client.lock:
+                        del metadata_data_pairs
                 return results
             except pyarrow.DeserializationCallbackError:
                 # Wait a little bit for the import thread to import the class.

Then plasma store crashed with this error:

F0222 11:00:55.044409 230585792 store.cc:458]  Check failed: entry != nullptr 
*** Check failure stack trace: ***
    @        0x1097034fd  google::LogMessage::Fail()
    @        0x10970131e  google::LogMessage::SendToLog()
    @        0x10970219f  google::LogMessage::Flush()
    @        0x109701fd9  google::LogMessage::~LogMessage()
    @        0x109702295  google::LogMessage::~LogMessage()
    @        0x10971ae45  arrow::util::ArrowLog::~ArrowLog()
    @        0x1096d8ccf  plasma::PlasmaStore::ReleaseObject()
    @        0x1096dc498  plasma::PlasmaStore::ProcessMessage()
    @        0x1096e1b8c  std::__1::__function::__func<>::operator()()
    @        0x1096e562e  plasma::EventLoop::FileEventCallback()
    @        0x1096fb879  aeProcessEvents
    @        0x1096fbbcb  aeMain
    @        0x1096ddfac  plasma::PlasmaStoreRunner::Start()
    @        0x1096ddc9d  plasma::StartServer()
    @        0x1096de823  main

Looks like a bug in the plasma store.

Update: this also happens only in Python 2.

@raulchen (Contributor, author) commented

I think this is the same problem as

ray/python/ray/actor.py

Lines 536 to 542 in 692bb33

# If the worker is a driver and driver id has changed because
# Ray was shut down re-initialized, the actor is already cleaned up
# and we don't need to send `__ray_terminate__` again.
logger.warning(
    "Actor is garbage collected in the wrong driver." +
    " Actor id = %s, class name = %s.", self._ray_actor_id,
    self._ray_class_name)
(I couldn't find in which issue we discussed this before).

The timing of GC in Python 2 is non-deterministic: the __dealloc__ and __del__ functions won't necessarily be called as soon as an object's reference count drops to 0, whereas in Python 3 they will.

So I think we can only fix this by making the Cython plasma client internally thread-safe.

@robertnishihara (Collaborator) commented

@raulchen I see, thanks for looking into it, I think that's the right strategy (maybe even in C++ instead of Cython).

@guoyuhong (Contributor) commented

@raulchen Thanks for looking into this. Differences between Python 2 and Python 3 often cause strange issues. Looking forward to the deprecation of Python 2... ;-)

@robertnishihara (Collaborator) commented

Note that there is still a bug here which should be fixed even if we only support Python 3.

@raulchen force-pushed the multithreading_test branch from b867dbc to d45c235 on February 22, 2019 05:25
@raulchen changed the title from "[WIP] Make test_multithreading less stressful" to "Fix thread-safety issue by protecting PlasmaBuffer.__dealloc__ with lock" on Feb 22, 2019
@raulchen (Contributor, author) commented

I'd like to fix Python 3 and skip the test for Python 2 first to unblock CI. We can fix the Cython/C++ part later.

@robertnishihara (Collaborator) commented

@raulchen what's the fix that you have in mind?

@raulchen (Contributor, author) commented

@robertnishihara In this PR, protect PlasmaBuffer.__dealloc__ with a lock (the code is already updated). Later, also fix plasma's C++ client by adding locks. I didn't look into that code, but it should be similar to what I did for the raylet client.

# called right after the reference count decreases to
# 0. We need to fix this by making the underlying
# Cython or C++ plasma client thread-safe.
del metadata_data_pairs
Collaborator:

This just decrements a reference count, right? But the deallocation won't actually happen here because metadata and data are still in scope.

If the retrieved value is a numpy array, then the __dealloc__ won't happen until the object ID goes out of scope (once the user is done using it).

I don't see how to protect the release call with a lock other than by putting a lock in the actual __dealloc__ method.
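A tiny CPython sketch of that point (Buf is a hypothetical stand-in for PlasmaBuffer): wrapping `del` in a lock only guards a reference-count decrement; the finalizer itself still fires later, outside the lock, when the last remaining reference (e.g. a numpy array's backing buffer) dies:

```python
import threading

lock = threading.Lock()
released = []

class Buf:
    def __del__(self):
        # In CPython this runs when the LAST reference dies, which can
        # be long after (and outside) the `with lock:` block below.
        released.append("released")

b = Buf()
alias = b              # e.g. a deserialized array still holding the buffer
with lock:
    del b              # refcount 2 -> 1: __del__ does NOT run here
assert released == []  # the lock protected nothing
del alias              # refcount 1 -> 0: __del__ runs now, lock not held
assert released == ["released"]
```

This is why the lock has to live inside the finalizer (or inside the client it calls) rather than around any particular `del` statement.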

Contributor (author):

metadata and data should be out of scope, because the del is outside the for loop. However, if the deserialized numpy array still holds the PlasmaBuffer, that's indeed still a problem.

Also, I'm still confused about why Python 3 worked before this PR; __dealloc__ should be called without the lock there too. Do you have any idea?

Maybe I can just skip the test for Python 2 in this PR?

Collaborator:

@raulchen, it's failing in both Python 2 and Python 3, isn't it? Is it really just Python 2?

We can skip the test in Python 2 if you think that's best, but please add a TODO and note that this is a bug and that it should be re-enabled when we fix it in Plasma.
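The skip-with-TODO suggested here can be sketched with pytest's skipif marker (the exact TODO wording and test body in the merged commit may differ):

```python
import sys

import pytest

# TODO: this test is flaky in Python 2 because the plasma client is not
# thread-safe and PlasmaBuffer.__dealloc__ may call release() from
# another thread at GC time. Re-enable once this is fixed in Plasma.
@pytest.mark.skipif(
    sys.version_info < (3, 0),
    reason="Plasma client is not thread-safe; flaky in Python 2.")
def test_multithreading():
    ...  # original test body unchanged
```

Under Python 3 the test runs normally; under Python 2 it is reported as skipped with the given reason, keeping the bug visible in the test output.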

Contributor (author):

Yeah, it only fails in Python 2. I confirmed this again by running it 50 times on my machine.

@AmplabJenkins commented

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12218/

@raulchen changed the title from "Fix thread-safety issue by protecting PlasmaBuffer.__dealloc__ with lock" to "Skip test_multithreading in Python 2" on Feb 22, 2019
@AmplabJenkins commented

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12220/

@raulchen (Contributor, author) commented

@robertnishihara Can this PR be merged now?

@guoyuhong (Contributor) left a comment

LGTM

@raulchen raulchen merged commit d583edb into ray-project:master Feb 27, 2019