Same function assigned different IDs on driver and worker #2089

Closed
@AdamGleave

Description

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 17.10
  • Ray installed from (source or binary): Binary
  • Ray version: 0.4.0
  • Python version: Python 3.6.5
  • Exact command to reproduce: python -m foo.main

Describe the problem

ray/worker.py:compute_function_id checks whether __main__ has the __file__ attribute to decide whether to include the source code in the function hash. When starting a Python executable using python -m mypkg.executable, dependencies in mypkg can be imported before __main__ is loaded, so the check fails (and the source code is excluded) in the driver. By the time the code actually runs, however, __main__ has been loaded. This can lead to function IDs differing between the driver and worker in nested parallelism settings.
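The effect of that check on the hash can be illustrated with a minimal sketch. This is not Ray's actual compute_function_id: the helper name sketch_function_id, its parameters, and the choice of SHA-1 over module name, function name, and source are assumptions made only to show how a conditional on __file__ yields two different IDs for the same function.

```python
import hashlib
import types


def sketch_function_id(module_name, function_name, source, main_module):
    """Hypothetical stand-in for the check described above; not Ray's code."""
    digest = hashlib.sha1()
    digest.update(module_name.encode("utf-8"))
    digest.update(function_name.encode("utf-8"))
    # The source only enters the hash when the main module has __file__.
    if hasattr(main_module, "__file__"):
        digest.update(source.encode("utf-8"))
    return digest.digest()


SOURCE = "def f(x):\n    return x\n"

# Stand-in for __main__ before it is fully set up: no __file__ attribute.
early_main = types.ModuleType("__main__")

# Stand-in for __main__ once a script is running: __file__ is present.
late_main = types.ModuleType("__main__")
late_main.__file__ = "/path/to/script.py"  # placeholder path

driver_id = sketch_function_id("foo.bar", "f", SOURCE, early_main)
worker_id = sketch_function_id("foo.bar", "f", SOURCE, late_main)

# Same function, two different IDs, purely because of the __file__ check.
print(driver_id != worker_id)
```

Under these assumptions, a lookup keyed by the second ID fails against a table populated with the first, which is the shape of failure reported in this issue.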

The error I get is similar to that of #1446, so the two may share a common cause. Having diagnosed this, I'm now convinced that executables are best kept at the top level rather than nested inside the package structure. I'm posting this mostly as a warning to others, since it is far from obvious what causes the error.

I think we can make this more robust by using a try-except statement rather than guessing when the source code is present; I'll submit a PR.
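A rough sketch of what the try-except idea could look like follows. The helper names source_or_empty and sketch_function_id are hypothetical, the hashing scheme is an assumption, and the eventual PR may well differ. The point is only that the decision is based on whether the source can actually be retrieved, not on the state of __main__.

```python
import hashlib
import inspect


def source_or_empty(func):
    """Return the function's source, or "" when it cannot be retrieved."""
    try:
        return inspect.getsource(func)
    except (OSError, TypeError):
        # OSError: the source file is unavailable (e.g. interactive session).
        # TypeError: the object has no Python source (e.g. a builtin).
        return ""


def sketch_function_id(func):
    """Hypothetical ID: hash of module name, function name, and source."""
    digest = hashlib.sha1()
    digest.update(func.__module__.encode("utf-8"))
    digest.update(func.__name__.encode("utf-8"))
    digest.update(source_or_empty(func).encode("utf-8"))
    return digest.digest()


# A builtin has no retrievable source, so the fallback path is taken.
print(source_or_empty(len) == "")
```

With this shape, the result no longer depends on when __main__ was loaded, although driver and worker would still need to see the same source for the IDs to agree.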

Source code / logs

This bug is only triggered with a particular package structure, so reproducing it needs a few files. Create a directory foo containing:

# __init__.py
from foo import bar
# main.py
import ray

from foo import bar

if __name__ == '__main__':
    ray.init(redirect_worker_output=True)
    bar.run()
# bar.py
import ray

@ray.remote
def f(x):
    return x

def g(x):
    return f.remote(x)

@ray.remote
def h(x):
    return ray.get(g(x))

def run():
    print(ray.get(h.remote(42)))

Running this produces:

python -m foo.main
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:39255 to respond...
Waiting for redis server at 127.0.0.1:20998 to respond...
Starting local scheduler with the following resources: {'CPU': 12, 'GPU': 2}.

======================================================================
View the web UI at http://localhost:8889/notebooks/ray_ui70409.ipynb?token=d6d894f5459ef32b1986d4cdf348574421d1787c71960fef
======================================================================

Traceback (most recent call last):
  File "/home/adam/bin/anaconda3/envs/mypirl/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/adam/bin/anaconda3/envs/mypirl/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/adam/dev/ray/foo/main.py", line 7, in <module>
    bar.run()
  File "/home/adam/dev/ray/foo/bar.py", line 15, in run
    print(ray.get(h.remote(42)))
Remote function foo.bar.h failed with:

Traceback (most recent call last):
  File "/home/adam/dev/ray/foo/bar.py", line 12, in h
    return ray.get(g(x))
  File "/home/adam/dev/ray/foo/bar.py", line 8, in g
    return f.remote(x)
  File "/home/adam/bin/anaconda3/envs/mypirl/lib/python3.6/site-packages/ray/worker.py", line 2602, in func_call
    return _submit(args=args, kwargs=kwargs)
  File "/home/adam/bin/anaconda3/envs/mypirl/lib/python3.6/site-packages/ray/worker.py", line 2622, in _submit
    resources=resources)
  File "/home/adam/bin/anaconda3/envs/mypirl/lib/python3.6/site-packages/ray/worker.py", line 2430, in _submit_task
    return global_worker.submit_task(function_id, *args, **kwargs)
  File "/home/adam/bin/anaconda3/envs/mypirl/lib/python3.6/site-packages/ray/worker.py", line 580, in submit_task
    self.task_driver_id.id()][function_id.id()]
KeyError: b'\x8d\x04\x11\xd9\xea\xd4\xbe\xad\x90\x0c\x0b\x10\x0f^\x0cq\xcc\xe3\xb1\x07'
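The import-order effect behind this KeyError can be checked without Ray. The sketch below builds a throwaway package with the same layout as foo (the names pkg, dep, and run_demo are made up for illustration) and runs it with python -m, printing whether __main__ has __file__ at import time of the dependency and again once the main module is executing.

```python
import os
import subprocess
import sys
import tempfile

# Printed when pkg.dep is first imported (triggered by pkg/__init__.py).
DEP = (
    "import sys\n"
    "print('import-time:', hasattr(sys.modules['__main__'], '__file__'))\n"
)
# Printed while pkg.main is executing as __main__.
MAIN = (
    "import sys\n"
    "from pkg import dep\n"
    "print('run-time:', hasattr(sys.modules['__main__'], '__file__'))\n"
)


def run_demo():
    """Create a temporary package, run `python -m pkg.main`, return stdout lines."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = os.path.join(tmp, "pkg")
        os.mkdir(pkg_dir)
        files = {
            "__init__.py": "from pkg import dep\n",  # mirrors foo/__init__.py
            "dep.py": DEP,
            "main.py": MAIN,
        }
        for name, body in files.items():
            with open(os.path.join(pkg_dir, name), "w") as fh:
                fh.write(body)
        # Put the temporary directory on the import path explicitly.
        env = dict(os.environ, PYTHONPATH=tmp)
        result = subprocess.run(
            [sys.executable, "-m", "pkg.main"],
            cwd=tmp,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        return result.stdout.splitlines()


lines = run_demo()
print("\n".join(lines))
```

On CPython 3 this reports False at import time and True at run time, because the parent package is imported before runpy sets __file__ on __main__. That matches the diagnosis above: a function decorated while pkg/__init__.py is being imported sees a __main__ without __file__.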
