I have two machines, each equipped with one GPU, and I want to run multiple workers on each machine. Is this possible in BytePS? I tried running 4 worker processes (2 processes on each machine) and 2 servers (1 server process on each machine), but the last 3 worker processes fail with the following error and the first worker hangs. I used the same commands as for the usual setup of 1 worker per GPU machine, which works in that case.
BytePS launching worker
enable NUMA finetune...
Command: numactl --physcpubind 0-4,20-24 python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64
[19:28:20] src/postoffice.cc:63: Creating Van: zmq. group_size=1
[19:28:20] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[2022-02-17 19:29:02.800368: F byteps/common/operations.cc:290] Check failed: (size) > (0) init tensor size not larger than 0
Aborted (core dumped)
Traceback (most recent call last):
File "/usr/local/bin/bpslaunch", line 4, in <module>
__import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 254, in <module>
launch_bps()
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 240, in launch_bps
t[i].join()
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 34, in join
raise self.exc
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 192, in worker
subprocess.check_call(command, env=my_env,
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'numactl --physcpubind 0-4,20-24 python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64' returned non-zero exit status 134.
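For reference, the exit status 134 above is consistent with the common shell convention of reporting 128 plus the signal number when a child is killed by a signal. 134 - 128 = 6, which is SIGABRT on Linux and matches the "Aborted (core dumped)" line that follows the failed check in operations.cc. Below is a minimal sketch of that decoding; `decode_exit_status` is a hypothetical helper written for illustration, not part of BytePS or bpslaunch, and it assumes the command was run through a shell.

```python
import signal
from typing import Optional


def decode_exit_status(status: int) -> Optional[signal.Signals]:
    """Return the signal implied by a shell-style exit status, or None.

    Assumption: the status follows the 128 + signal-number convention
    used by common shells; this is not BytePS-specific behavior.
    """
    if status <= 128:
        # Ordinary exit code, not a signal-induced termination.
        return None
    try:
        return signal.Signals(status - 128)
    except ValueError:
        # The number does not map to a signal known on this platform.
        return None


sig = decode_exit_status(134)
print(sig.name if sig else "not a signal exit")
```

This only confirms that the worker process aborted itself after the size check failed; it does not explain why the tensor size was 0.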
I have two questions here:
- Is it possible to run BytePS with multiple workers on a single-GPU machine?
- Is it possible to run BytePS on CPU-only machines as the workers?
Thank you!