[sgd] File Too Short Error #3405

Closed
@richardliaw

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): Latest
  • Ray version: 0.5.3
  • Python version: 3.6
  • Exact command to reproduce:
    python mnist_example.py --num-workers=8 --devices-per-worker=2 --redis-address="localhost:6379" --gpu

Describe the problem

SGD often fails (even on Jenkins) with a "file too short" error that is raised while loading plasma_op.so. The failure can occur even after a previous run succeeded.
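
To help confirm that the shared object on disk is actually incomplete when the error appears, a quick check along these lines may be useful. This is only a diagnostic sketch: `looks_truncated` is a hypothetical helper (not part of Ray, pyarrow, or TensorFlow), and it assumes a 64-bit little-endian ELF file, where it compares the section header table location recorded in the ELF header against the real file size.

```python
import os
import struct

ELF_MAGIC = b"\x7fELF"
ELF64_HEADER_SIZE = 64  # size of the ELF64 file header in bytes


def looks_truncated(path):
    """Return True if `path` is missing, too small, or visibly incomplete.

    Hypothetical helper for debugging only. The section header check
    assumes a 64-bit little-endian ELF file; other ELF flavors only get
    the size and magic checks.
    """
    try:
        size = os.path.getsize(path)
    except OSError:
        return True  # missing or unreadable file
    if size < ELF64_HEADER_SIZE:
        return True
    with open(path, "rb") as f:
        header = f.read(ELF64_HEADER_SIZE)
    if header[:4] != ELF_MAGIC:
        return True
    if header[4] != 2 or header[5] != 1:
        return False  # not ELF64 little-endian: nothing more to check here
    # e_shoff lives at offset 0x28; e_shentsize and e_shnum at 0x3A and 0x3C.
    (e_shoff,) = struct.unpack_from("<Q", header, 0x28)
    e_shentsize, e_shnum = struct.unpack_from("<HH", header, 0x3A)
    # The section header table must fit inside the file.
    return e_shoff + e_shentsize * e_shnum > size
```

Calling `looks_truncated(...)` on the plasma_op.so path from the traceback right after a failed run would show whether the file was still partially written at that moment.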

Source code / logs

(tensorflow_p36) ubuntu@ip-172-31-77-180:~$ python mnist_example.py --num-workers=8 --devices-per-worker=2 --redis-address="localhost:6379" --gpu
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` to enable debugging of memory-related crashes.
Creating SGD workers (8 total, 2 devices per worker)
Waiting for gradient configuration
Remote function __init__ failed with:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 848, in _process_task
    *arguments)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/experimental/sgd/sgd_worker.py", line 116, in __init__
    plasma.build_plasma_tensorflow_op()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/plasma.py", line 73, in build_plasma_tensorflow_op
    tf_plasma_op = tf.load_op_library(TF_PLASMA_OP_PATH)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/tensorflow/plasma_op.so: file too short

Waiting for actors to start
Remote function shard_shapes failed with:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 825, in _process_task
    self.reraise_actor_init_error()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 264, in reraise_actor_init_error
    raise self.actor_init_error
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/worker.py", line 848, in _process_task
    *arguments)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/experimental/sgd/sgd_worker.py", line 116, in __init__
    plasma.build_plasma_tensorflow_op()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/plasma.py", line 73, in build_plasma_tensorflow_op
    tf_plasma_op = tf.load_op_library(TF_PLASMA_OP_PATH)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/tensorflow/plasma_op.so: file too short
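
One plausible explanation, which is an assumption and not confirmed by the logs above, is that several SGD workers on the same machine call `plasma.build_plasma_tensorflow_op()` at the same time, so one worker tries to load plasma_op.so while another is still writing it. If that is the cause, serializing the build with a file lock and publishing the result with an atomic rename would avoid readers ever seeing a partial file. The sketch below shows that general pattern only. `build_once` and `build_fn` are hypothetical names, not Ray or pyarrow APIs, and `fcntl` makes it Unix-only.

```python
import fcntl
import os
import tempfile


def build_once(target_path, build_fn):
    """Build `target_path` at most once, safely across processes.

    `build_fn(tmp_path)` is a hypothetical callback that writes the
    artifact to `tmp_path`. Returns True if this call did the build,
    False if the target already existed.
    """
    lock_path = target_path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(target_path):
                return False  # another process already built it
            directory = os.path.dirname(os.path.abspath(target_path))
            # Write to a temp file in the same directory so the final
            # rename stays on one filesystem.
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            os.close(fd)
            try:
                build_fn(tmp_path)
                # Atomic on POSIX: readers see the old state or the
                # complete file, never a half-written one.
                os.replace(tmp_path, target_path)
            except BaseException:
                if os.path.exists(tmp_path):
                    os.remove(tmp_path)
                raise
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With this pattern, the first worker to take the lock performs the build and every later worker finds the finished file and skips straight to loading it.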
