Closed
Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Ray installed from (source or binary): Latest
- Ray version: 0.5.3
- Python version: 3.6
- Exact command to reproduce:
python mnist_example.py --num-workers=8 --devices-per-worker=2 --redis-address="localhost:6379" --gpu
Describe the problem
SGD often fails (even on Jenkins) with the "file too short" error shown in the logs below, raised while loading plasma_op.so. It can happen even after a previous run succeeded.
Source code / logs
(tensorflow_p36) ubuntu@ip-172-31-77-180:~$ python mnist_example.py --num-workers=8 --devices-per-worker=2 --redis-address="localhost:6379" --gpu
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` to enable debugging of memory-related crashes.
Creating SGD workers (8 total, 2 devices per worker)
Waiting for gradient configuration
Remote function __init__ failed with:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 848, in _process_task
    *arguments)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/experimental/sgd/sgd_worker.py", line 116, in __init__
    plasma.build_plasma_tensorflow_op()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/plasma.py", line 73, in build_plasma_tensorflow_op
    tf_plasma_op = tf.load_op_library(TF_PLASMA_OP_PATH)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/tensorflow/plasma_op.so: file too short
Waiting for actors to start
Remote function shard_shapes failed with:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 825, in _process_task
    self.reraise_actor_init_error()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 264, in reraise_actor_init_error
    raise self.actor_init_error
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 848, in _process_task
    *arguments)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/experimental/sgd/sgd_worker.py", line 116, in __init__
    plasma.build_plasma_tensorflow_op()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/plasma.py", line 73, in build_plasma_tensorflow_op
    tf_plasma_op = tf.load_op_library(TF_PLASMA_OP_PATH)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/tensorflow/plasma_op.so: file too short
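
For anyone trying to narrow this down: "file too short" is the dynamic loader's message when the file is smaller than an ELF header, so plasma_op.so was probably empty or truncated at the moment it was loaded. The cause is not confirmed in this report. One possibility, stated purely as an assumption, is that several workers build or write the op at the same time. Below is a minimal diagnostic sketch that could be run on the path from the traceback just before the load. The helper name `describe_shared_object` and the 64-byte threshold (the ELF64 header size) are illustrative choices and not part of Ray or pyarrow.

```python
import os

ELF_MAGIC = b"\x7fELF"      # first four bytes of every ELF file
ELF64_HEADER_SIZE = 64      # size in bytes of a 64-bit ELF header


def describe_shared_object(path):
    """Return a short diagnosis string for the file at `path`.

    Hypothetical helper for debugging only. It does not prove the library
    is loadable; it only flags the obvious truncated or empty cases.
    """
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) < ELF64_HEADER_SIZE:
        # The situation the loader reports as "file too short".
        return "too short"
    with open(path, "rb") as f:
        magic = f.read(len(ELF_MAGIC))
    if magic != ELF_MAGIC:
        return "not ELF"
    return "looks ok"
```

Calling `describe_shared_object(...)` on the plasma_op.so path from the traceback in each worker before the load would show whether the file is empty or partially written at that moment, which would help confirm or rule out the concurrent-write guess.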