-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sgd] Add file lock to protect compilation of sgd op #3486
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for doing this!
@@ -113,7 +115,11 @@ def __init__(self, | |||
manager_socket = ( | |||
ray.worker.global_worker.plasma_client.manager_socket_name) | |||
if not plasma.tf_plasma_op: | |||
lock_path = os.path.join(pyarrow.__path__[0], "tensorflow", "compile_op.lock") | |||
lock = SoftFileLock(lock_path) | |||
lock.acquire() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this work with a with
statement?
If not, use try/finally
, but I think with
is better.
python/ray/experimental/sgd/util.py
Outdated
# Soft file lock -- behavior depends on the underlying filesystem. | ||
# For a proper lock with additional dependencies see the "filelock" | ||
# package. | ||
class SoftFileLock(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document
Is this portable across OS's?
python/ray/experimental/sgd/util.py
Outdated
|
||
import threading | ||
import random | ||
print("Testing file lock, this will take a few seconds...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use logger
instead of print
python/ray/experimental/sgd/util.py
Outdated
print("Testing file lock, this will take a few seconds...") | ||
lock_id = random.randint(0, 100000) | ||
lock_path = "/tmp/test" + str(lock_id) + ".lock" | ||
lock1 = SoftFileLock(lock_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test actually ever get run?
python/ray/experimental/sgd/util.py
Outdated
|
||
threading.Thread(target=try_hold_lock).start() | ||
lock1.acquire() | ||
print("Locking test successful!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger
python/ray/experimental/sgd/util.py
Outdated
time.sleep(10.0) | ||
lock2.release() | ||
|
||
threading.Thread(target=try_hold_lock).start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the test, maybe run 100 of them in separate threads (where each one sleeps for 0.1s) and make sure the whole thing takes at least 10s or something like that?
plasma.build_plasma_tensorflow_op() | ||
lock.release() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of calling plasma.build_plasma_tensorflow_op()
here, maybe we should provide a wrapped version that includes the filelock and call that here.
@@ -136,6 +136,7 @@ def find_version(*filepath): | |||
|
|||
requires = [ | |||
"numpy", | |||
"filelock", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check that this works in Python 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does
If we want to be fancy here it makes more sense to depend on the filelock package directly (which I wanted to avoid but it is also used in the local version of the autoscaler). I made the appropriate changes. |
Test FAILed. |
@pcmoritz this doesn't quite protect against the scenario where we call |
Test PASSed. |
Test FAILed. |
Yeah I think you are right. This needs apache/arrow#3117 |
Test FAILed. |
Test PASSed. |
Test PASSed. |
@@ -14,11 +14,11 @@ | |||
# - PLASMA_STATIC_LIB | |||
# - PLASMA_SHARED_LIB | |||
|
|||
set(arrow_URL https://github.com/apache/arrow.git) | |||
set(arrow_URL https://github.com/pcmoritz/arrow.git) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this temporary or are we're officially building from @pcmoritz's arrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will not merge this until it's changed back to apache/arrow. This is always just temporary for testing.
Test FAILed. |
Test FAILed. |
jenkins retest this please |
Test FAILed. |
Jenkins, retest this please. |
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good aside from lint errors
@@ -2,16 +2,18 @@ | |||
from __future__ import division | |||
from __future__ import print_function | |||
|
|||
import filelock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed?
Test FAILed. |
What do these changes do?
Related issue number
#3405