
[sgd] Add file lock to protect compilation of sgd op #3486

Merged
pcmoritz merged 10 commits into ray-project:master from fix-sgd-compiling on Dec 9, 2018

Conversation

pcmoritz
Contributor

@pcmoritz pcmoritz commented Dec 6, 2018

What do these changes do?

Related issue number

#3405

Collaborator

@robertnishihara robertnishihara left a comment


Nice! Thanks for doing this!

@@ -113,7 +115,11 @@ def __init__(self,
        manager_socket = (
            ray.worker.global_worker.plasma_client.manager_socket_name)
        if not plasma.tf_plasma_op:
            lock_path = os.path.join(pyarrow.__path__[0], "tensorflow", "compile_op.lock")
            lock = SoftFileLock(lock_path)
            lock.acquire()
Collaborator


Let's make this work with a with statement?

If not, use try/finally, but I think with is better.
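
As a minimal sketch (assuming SoftFileLock is given __enter__/__exit__ methods, which the diff above does not show), the acquire/release pair could become:

lock_path = os.path.join(pyarrow.__path__[0], "tensorflow", "compile_op.lock")
with SoftFileLock(lock_path):
    # The lock is released on exit even if compilation raises.
    if not plasma.tf_plasma_op:
        plasma.build_plasma_tensorflow_op()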

# Soft file lock -- behavior depends on the underlying filesystem.
# For a proper lock with additional dependencies see the "filelock"
# package.
class SoftFileLock(object):
Collaborator


document

Is this portable across OS's?
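
One portable approach, sketched here (not necessarily the PR's implementation): os.open with O_CREAT | O_EXCL is atomic on local filesystems on both POSIX and Windows, though not reliable on NFS. The poll interval is illustrative.

import errno
import os
import time


class SoftFileLock(object):
    """Best-effort lock based on atomic exclusive creation of the lock file."""

    def __init__(self, lock_path, poll_interval=0.05):
        self.lock_path = lock_path
        self.poll_interval = poll_interval
        self.fd = None

    def acquire(self):
        while True:
            try:
                # Fails with EEXIST if another process already holds the lock.
                self.fd = os.open(self.lock_path,
                                  os.O_CREAT | os.O_EXCL | os.O_RDWR)
                return
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
                time.sleep(self.poll_interval)

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            self.fd = None
        os.remove(self.lock_path)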


import threading
import random
print("Testing file lock, this will take a few seconds...")
Collaborator


Use logger instead of print
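
For example (a sketch; the logger name follows Ray's usual module-level pattern):

import logging

logger = logging.getLogger(__name__)

logger.info("Testing file lock, this will take a few seconds...")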

print("Testing file lock, this will take a few seconds...")
lock_id = random.randint(0, 100000)
lock_path = "/tmp/test" + str(lock_id) + ".lock"
lock1 = SoftFileLock(lock_path)
Collaborator


Does this test actually ever get run?


threading.Thread(target=try_hold_lock).start()
lock1.acquire()
print("Locking test successful!")
Collaborator


logger

time.sleep(10.0)
lock2.release()

threading.Thread(target=try_hold_lock).start()
Collaborator


For the test, maybe run 100 of them in separate threads (where each one sleeps for 0.1s) and make sure the whole thing takes at least 10s or something like that?
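
A sketch of such a test, assuming the SoftFileLock from this PR is importable; the thread count, hold time, and lock path are illustrative:

import threading
import time

NUM_THREADS = 100
HOLD_TIME = 0.1
lock_path = "/tmp/test_filelock_stress.lock"


def hold_lock():
    lock = SoftFileLock(lock_path)
    lock.acquire()
    try:
        time.sleep(HOLD_TIME)
    finally:
        lock.release()


start = time.time()
threads = [threading.Thread(target=hold_lock) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# If the lock really serializes the critical sections, the total runtime
# must be at least NUM_THREADS * HOLD_TIME (10 seconds here).
assert time.time() - start >= NUM_THREADS * HOLD_TIME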

plasma.build_plasma_tensorflow_op()
lock.release()
Collaborator


Instead of calling plasma.build_plasma_tensorflow_op() here, maybe we should provide a wrapped version that includes the filelock and call that here.
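
For instance, a sketch of such a wrapper (the helper name is hypothetical, not an existing API; it assumes the SoftFileLock and lock path from the diff above):

import os

import pyarrow
from pyarrow import plasma


def build_plasma_tensorflow_op_with_lock():
    # Hypothetical wrapper: serialize compilation across processes so that
    # callers never invoke plasma.build_plasma_tensorflow_op() directly.
    lock_path = os.path.join(pyarrow.__path__[0], "tensorflow",
                             "compile_op.lock")
    lock = SoftFileLock(lock_path)
    lock.acquire()
    try:
        if not plasma.tf_plasma_op:
            plasma.build_plasma_tensorflow_op()
    finally:
        lock.release()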

@@ -136,6 +136,7 @@ def find_version(*filepath):

requires = [
    "numpy",
    "filelock",
Collaborator


Can you check that this works in Python 2?

Contributor Author


It does.

@pcmoritz
Contributor Author

pcmoritz commented Dec 6, 2018

If we want to be fancy here, it makes more sense to depend on the filelock package directly (which I wanted to avoid, but it is also used in the local version of the autoscaler). I made the appropriate changes.
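
With the filelock package, the same section can use filelock.FileLock, which already supports the context-manager protocol; a sketch with the paths from the diff above:

import os

import pyarrow
from filelock import FileLock
from pyarrow import plasma

lock_path = os.path.join(pyarrow.__path__[0], "tensorflow", "compile_op.lock")

# FileLock blocks until the lock is acquired and releases it on exit, even
# if build_plasma_tensorflow_op() raises.
with FileLock(lock_path):
    if not plasma.tf_plasma_op:
        plasma.build_plasma_tensorflow_op()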

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9817/
Test FAILed.

@robertnishihara
Collaborator

@pcmoritz this doesn't quite protect against the scenario where tf.load_op_library(TF_PLASMA_OP_PATH) is called during import pyarrow while the file is only half written, right?

https://github.com/apache/arrow/blob/b731b583e870726121d0adeebad014981e05b36c/python/pyarrow/plasma.py#L37-L44

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9820/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9819/
Test FAILed.

@pcmoritz
Contributor Author

pcmoritz commented Dec 6, 2018

Yeah, I think you are right. This needs apache/arrow#3117.
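
For illustration only (this is not the actual apache/arrow#3117 change): the race disappears if the same lock also guards the tf.load_op_library call on the pyarrow side, so a reader can never observe a half-written .so. The helper name below is hypothetical; TF_PLASMA_OP_PATH is the constant referenced above.

import tensorflow as tf
from filelock import FileLock


def load_plasma_tensorflow_op(lock_path, op_path):
    # Take the compile lock before loading, so a concurrent compilation
    # cannot be observed half-written.
    with FileLock(lock_path):
        return tf.load_op_library(op_path)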

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9816/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9821/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9824/
Test PASSed.

@@ -14,11 +14,11 @@
# - PLASMA_STATIC_LIB
# - PLASMA_SHARED_LIB

-set(arrow_URL https://github.com/apache/arrow.git)
+set(arrow_URL https://github.com/pcmoritz/arrow.git)
Contributor


Is this temporary, or are we officially building from @pcmoritz's arrow?

Collaborator

@robertnishihara robertnishihara Dec 6, 2018


We will not merge this until it's changed back to apache/arrow. This is always just temporary for testing.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9848/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9850/
Test FAILed.

@pcmoritz
Contributor Author

pcmoritz commented Dec 8, 2018

jenkins retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9855/
Test FAILed.

@robertnishihara
Collaborator

Jenkins, retest this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9871/
Test FAILed.

Contributor

@richardliaw richardliaw left a comment


Looks good aside from lint errors

@@ -2,16 +2,18 @@
from __future__ import division
from __future__ import print_function

import filelock
Contributor


not needed?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9903/
Test FAILed.

@pcmoritz pcmoritz merged commit 87c0d24 into ray-project:master Dec 9, 2018
@pcmoritz pcmoritz deleted the fix-sgd-compiling branch December 9, 2018 21:52