Specifying Worker Listen Port #1253

Open

Description

Greetings!

I am having trouble specifying the port a worker listens on. With plain Dask Distributed, dask-worker (no GPU) accepts the --worker-port parameter for this. With dask-cuda-worker (version 23.10.0), I cannot find an equivalent option; the closest I found is the --host parameter.
When I try to pass the port through --host by running CUDA_VISIBLE_DEVICES=0 dask-cuda-worker --scheduler-file scheduler.json --host 127.0.0.1:12345, the worker fails to start with the following error:

warnings.warn(f'''
2023-09-29 13:39:00,329 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-bpnddwo9', purging
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-09-29 13:39:00,337 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-09-29 13:39:00,338 - distributed.worker - ERROR - Failed to log closing event
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1540, in close
    self.log_event(self.address, {"action": "closing-worker", "reason": reason})
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 723, in address
    raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
2023-09-29 13:39:00,340 - distributed.worker - INFO - Stopping worker. Reason: failure-to-start-<class 'OSError'>
2023-09-29 13:39:00,340 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2023-09-29 13:39:00,341 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,386 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
    result = await self.process.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
    msg = await self._wait_until_connected(uid)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
    raise msg["exception"]
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:12345'. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,391 - distributed.nanny - INFO - Nanny asking worker to close. Reason: nanny-instantiate-failed
2023-09-29 13:39:00,406 - distributed.nanny - INFO - Worker process 15064 was killed by signal 15
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/worker.py", line 1391, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 810, in listen
    listener = await listen(
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 573, in start
    sockets = netutil.bind_sockets(
  File "/usr/local/lib/python3.10/dist-packages/tornado/netutil.py", line 161, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 616, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 362, in start_unsafe
    response = await self.instantiate()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
    result = await self.process.start()
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
    msg = await self._wait_until_connected(uid)
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 889, in _wait_until_connected
    raise msg["exception"]
  File "/home/user/.local/lib/python3.10/site-packages/distributed/nanny.py", line 953, in run
    async with worker:
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 630, in __aenter__
    await self
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/bin/dask-cuda-worker", line 8, in <module>
    sys.exit(worker())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 442, in worker
    loop.run_sync(run)
  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cli.py", line 434, in run
    await worker
  File "/home/user/.local/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 244, in _wait
    await asyncio.gather(*self.nannies)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home/user/.local/lib/python3.10/site-packages/distributed/core.py", line 624, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.

Without the --host parameter, everything works as expected, but then I cannot choose the port. Is there a way to specify the worker's listen port with dask-cuda-worker?
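
For reference, the log reports a Nanny at 'tcp://127.0.0.1:12345' while the worker's own bind fails with Errno 98. One possible reading, which is an assumption and not something the log confirms, is that the nanny and the worker it spawns both try to bind the host:port given to --host. The sketch below does not use Dask at all; it uses only the Python standard library to show that a second bind to an address that is already bound fails with the same errno. The helper name second_bind_errno is hypothetical.

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind one socket, then try to bind a second socket to the same address.

    Returns the errno raised by the second bind, or None if it succeeded.
    This only illustrates the OS-level error; it is not how Dask binds ports.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 lets the OS pick a free port
        first.listen(1)
        addr = first.getsockname()  # the (host, port) pair that is now taken
        try:
            second.bind(addr)  # same host and port as the first socket
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = second_bind_errno()
    # Compare against the symbolic constant rather than a hard-coded number,
    # since the numeric value differs between platforms.
    print("address already in use:", err == errno.EADDRINUSE)
```

If that reading is right, giving each process its own port would avoid the clash, but whether dask-cuda-worker exposes such an option is exactly the open question here.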
