Launcher not registering the user_script as argument. #2612
Closed
Description
Hello, I'm trying to run a basic multi-node DeepSpeed setup on a pod.
When I run `deepspeed --hostfile=myhostfile basic_deepspeed.py`, I get:
[2022-12-15 21:00:19,543] [INFO] [runner.py:417:main] Using IP address of for node ddp-0.ddp.ml-dev.svc.cluster.local
[2022-12-15 21:00:19,544] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: ddp-0.ddp.ml-dev.svc.cluster.local,ddp-1.ddp.ml-dev.svc.cluster.local
[2022-12-15 21:00:19,545] [INFO] [runner.py:508:main] cmd = pdsh -S -f 1024 -w ddp-0.ddp.ml-dev.svc.cluster.local,ddp-1.ddp.ml-dev.svc.cluster.local export PYTHON_VERSION=3.9.13; export PYTHON_SETUPTOOLS_VERSION=58.1.0; export PYTHON_PIP_VERSION=22.0.4; export PYTHON_GET_PIP_SHA256=5aefe6ade911d997af080b315ebcb7f882212d070465df544e1175ac2be519b4; export PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/5eaac1050023df1f5c98b173b248c260023f2278/public/get-pip.py; export PYTHONPATH=/; cd /; /usr/local/bin/python -u -m deepspeed.launcher.launch --world_info=eyJkZHAtMC5kZHAubWwtZGV2LnN2Yy5jbHVzdGVyLmxvY2FsIjogWzAsIDFdLCAiZGRwLTEuZGRwLm1sLWRldi5zdmMuY2x1c3Rlci5sb2NhbCI6IFswLCAxXX0= --node_rank=%n --master_addr= --master_port=29500 scripts/basic_deepspeed.py
ddp-0:
ddp-0: _ _ _ _ _ _ _
ddp-0: /\_\ /\ \ /\ \ _ / /\ / /\ / /\ /\ \
ddp-0: / / / _ / \ \ / \ \ /\_\ / / \ / / / / / // \ \
ddp-0: / / / /\_\ / /\ \ \ / /\ \ \_/ / // / /\ \__ / /_/ / / // /\ \ \
ddp-0: / / /__/ / / / / /\ \_\ / / /\ \___/ // / /\ \___\ / /\ \__/ / // / /\ \ \
ddp-0: / /\_____/ / / /_/_ \/_/ / / / \/____/ \ \ \ \/___// /\ \___\/ // / / \ \_\
ddp-0: / /\_______/ / /____/\ / / / / / / \ \ \ / / /\/___/ // / / / / /
ddp-0: / / /\ \ \ / /\____\/ / / / / / /_ \ \ \ / / / / / // / / / / /
ddp-0: / / / \ \ \ / / /______ / / / / / //_/\__/ / / / / / / / // / /___/ / /
ddp-0: / / / \ \ \ / / /_______\/ / / / / / \ \/___/ / / / / / / // / /____\/ /
ddp-0: \/_/ \_\_\\/__________/\/_/ \/_/ \_____\/ \/_/ \/_/ \/_________/
ddp-0:
ddp-0:
ddp-0:
ddp-1:
ddp-1: _ _ _ _ _ _ _
ddp-1: /\_\ /\ \ /\ \ _ / /\ / /\ / /\ /\ \
ddp-1: / / / _ / \ \ / \ \ /\_\ / / \ / / / / / // \ \
ddp-1: / / / /\_\ / /\ \ \ / /\ \ \_/ / // / /\ \__ / /_/ / / // /\ \ \
ddp-1: / / /__/ / / / / /\ \_\ / / /\ \___/ // / /\ \___\ / /\ \__/ / // / /\ \ \
ddp-1: / /\_____/ / / /_/_ \/_/ / / / \/____/ \ \ \ \/___// /\ \___\/ // / / \ \_\
ddp-1: / /\_______/ / /____/\ / / / / / / \ \ \ / / /\/___/ // / / / / /
ddp-1: / / /\ \ \ / /\____\/ / / / / / /_ \ \ \ / / / / / // / / / / /
ddp-1: / / / \ \ \ / / /______ / / / / / //_/\__/ / / / / / / / // / /___/ / /
ddp-1: / / / \ \ \ / / /_______\/ / / / / / \ \/___/ / / / / / / // / /____\/ /
ddp-1: \/_/ \_\_\\/__________/\/_/ \/_/ \_____\/ \/_/ \/_/ \/_________/
ddp-1:
ddp-1:
ddp-1:
ddp-0: usage: launch.py [-h] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
ddp-0: [--master_port MASTER_PORT] [--world_info WORLD_INFO]
ddp-0: [--module] [--no_python] [--enable_elastic_training]
ddp-0: [--min_elastic_nodes MIN_ELASTIC_NODES]
ddp-0: [--max_elastic_nodes MAX_ELASTIC_NODES] [--no_local_rank]
ddp-0: [--save_pid SAVE_PID]
ddp-0: [--enable_each_rank_log ENABLE_EACH_RANK_LOG]
ddp-0: training_script ...
ddp-0: launch.py: error: the following arguments are required: training_script, training_script_args
ddp-1: usage: launch.py [-h] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
ddp-1: [--master_port MASTER_PORT] [--world_info WORLD_INFO]
ddp-1: [--module] [--no_python] [--enable_elastic_training]
ddp-1: [--min_elastic_nodes MIN_ELASTIC_NODES]
ddp-1: [--max_elastic_nodes MAX_ELASTIC_NODES] [--no_local_rank]
ddp-1: [--save_pid SAVE_PID]
ddp-1: [--enable_each_rank_log ENABLE_EACH_RANK_LOG]
ddp-1: training_script ...
ddp-1: launch.py: error: the following arguments are required: training_script, training_script_args
ddp-0: bash: line 1: 31m: command not found
pdsh@ddp-0: ddp-0: ssh exited with exit code 127
ddp-1: bash: line 1: 31m: command not found
pdsh@ddp-0: ddp-1: ssh exited with exit code 127
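As a sanity check on the hostfile, the `--world_info` value in the `cmd = ...` line above can be decoded by hand. The sketch below assumes the value is base64-encoded JSON, which is what its shape suggests. The helper name `decode_world_info` is made up for illustration and is not a DeepSpeed API.

```python
import base64
import json

# Copied verbatim from the --world_info flag in the logged command.
WORLD_INFO = (
    "eyJkZHAtMC5kZHAubWwtZGV2LnN2Yy5jbHVzdGVyLmxvY2FsIjogWzAsIDFdLCAi"
    "ZGRwLTEuZGRwLm1sLWRldi5zdmMuY2x1c3Rlci5sb2NhbCI6IFswLCAxXX0="
)


def decode_world_info(encoded: str) -> dict:
    """Hypothetical helper: decode base64 text and parse it as JSON."""
    return json.loads(base64.urlsafe_b64decode(encoded))


world_info = decode_world_info(WORLD_INFO)
for host, slots in world_info.items():
    print(host, slots)
# ddp-0.ddp.ml-dev.svc.cluster.local [0, 1]
# ddp-1.ddp.ml-dev.svc.cluster.local [0, 1]
```

Both pods appear with two slots each, so the hostfile itself seems to have been read as intended.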
It successfully connects to both pods (named ddp-0 and ddp-1) over ssh, but for some reason `runner.py` doesn't pass my script through to `launch.py`. Any ideas why?
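One unconfirmed guess: the `31m: command not found` lines look like the tail of an ANSI colour sequence such as `\x1b[1;31m`. If such a sequence ended up inside the remote command (for example in the empty-looking `--master_addr=` value), the shell would end the command at the `;`, and `launch.py` would never see the script path. The sketch below only illustrates that mechanism. The parser is a hypothetical stand-in built from the usage text in the log, not the real `launch.py`, and the command string is invented.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in mirroring part of the usage message printed by launch.py.
    parser = argparse.ArgumentParser(prog="launch.py")
    parser.add_argument("--node_rank", type=int, default=0)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=int, default=29500)
    parser.add_argument("training_script", type=str)
    parser.add_argument("training_script_args", nargs=argparse.REMAINDER)
    return parser


def first_shell_command(cmd: str) -> str:
    # Crude model of the shell: an unquoted ';' ends the current command.
    return cmd.split(";", 1)[0]


def parses_ok(argv: list) -> bool:
    # argparse reports a usage error by raising SystemExit.
    try:
        build_parser().parse_args(argv)
        return True
    except SystemExit:
        return False


# Invented command with a colour escape sequence in place of the master address.
cmd = (
    "launch.py --node_rank=0 --master_addr=\x1b[1;31m "
    "--master_port=29500 scripts/basic_deepspeed.py"
)
head, tail = cmd.split(";", 1)

# What launch.py would receive: everything before the ';', minus the program name.
print(parses_ok(first_shell_command(cmd).split()[1:]))
# What the shell would try to run as a second command.
print(tail.split()[0])
```

Under this assumption the first command fails with the "arguments are required" usage error, and the shell then tries to execute `31m`, which would match both symptoms in the log.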
I'm running on Debian 11 with torch 1.13.0 and deepspeed 0.7.7.