
Running multiple workers on a single GPU machine #430

Open
@hamidralmasi

Description

I have two machines, each equipped with one GPU, and I want to run multiple workers on each machine. Is this possible in BytePS? I tried to run 4 worker processes (2 processes on each machine) and 2 servers (1 server process on each machine), but the last 3 worker processes fail with the error below and the first worker hangs. I launched everything the same way I would for a normal 1-worker-per-GPU-machine setup, which works fine in that case (a sketch of the launch commands is included below the traceback).

BytePS launching worker
enable NUMA finetune...
Command: numactl --physcpubind 0-4,20-24 python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64

[19:28:20] src/postoffice.cc:63: Creating Van: zmq. group_size=1
[19:28:20] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[2022-02-17 19:29:02.800368: F byteps/common/operations.cc:290] Check failed: (size) > (0) init tensor size not larger than 0
Aborted (core dumped)
Traceback (most recent call last):
  File "/usr/local/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 254, in <module>
    launch_bps()
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 240, in launch_bps
    t[i].join()
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 34, in join
    raise self.exc
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 27, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 192, in worker
    subprocess.check_call(command, env=my_env,
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'numactl --physcpubind 0-4,20-24 python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64' returned non-zero exit status 134.
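
For reference, here is roughly the launch sequence I used, following the standard DMLC_* environment variables from the BytePS step-by-step tutorial. The IP address, port, and worker IDs below are placeholders for my actual values, and this is only a sketch of my setup, not the exact scripts:

# scheduler (one process, on the first machine)
export DMLC_ROLE=scheduler
export DMLC_NUM_WORKER=4
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.1      # placeholder scheduler IP
export DMLC_PS_ROOT_PORT=1234         # placeholder scheduler port
bpslaunch

# server (one process per machine)
export DMLC_ROLE=server
export DMLC_NUM_WORKER=4
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234
bpslaunch

# worker (two processes per machine, all sharing the single local GPU)
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_WORKER_ID=0               # 0..3 across the two machines
export DMLC_ROLE=worker
export DMLC_NUM_WORKER=4
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234
bpslaunch python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64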

I have two questions here:

  1. Is it possible to run BytePS with multiple workers on a single GPU machine?
  2. Is it possible to use CPU-only machines as BytePS workers?

Thank you!
