
Launcher not registering the user_script as argument. #2612

Closed
@dogacancolak

Description

Hello, I'm trying to run a basic multi-node DeepSpeed setup across two Kubernetes pods.

When I run deepspeed --hostfile=myhostfile basic_deepspeed.py, I get:
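(For context, DeepSpeed's hostfile format is one "hostname slots=gpu_count" entry per line; a hostfile matching the two nodes in the log below would look like this, with the slot counts being an assumption inferred from the two ranks per node reported in the world info.)

```
ddp-0.ddp.ml-dev.svc.cluster.local slots=2
ddp-1.ddp.ml-dev.svc.cluster.local slots=2
```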

[2022-12-15 21:00:19,543] [INFO] [runner.py:417:main] Using IP address of  for node ddp-0.ddp.ml-dev.svc.cluster.local
[2022-12-15 21:00:19,544] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: ddp-0.ddp.ml-dev.svc.cluster.local,ddp-1.ddp.ml-dev.svc.cluster.local
[2022-12-15 21:00:19,545] [INFO] [runner.py:508:main] cmd = pdsh -S -f 1024 -w ddp-0.ddp.ml-dev.svc.cluster.local,ddp-1.ddp.ml-dev.svc.cluster.local export PYTHON_VERSION=3.9.13; export PYTHON_SETUPTOOLS_VERSION=58.1.0; export PYTHON_PIP_VERSION=22.0.4; export PYTHON_GET_PIP_SHA256=5aefe6ade911d997af080b315ebcb7f882212d070465df544e1175ac2be519b4; export PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/5eaac1050023df1f5c98b173b248c260023f2278/public/get-pip.py; export PYTHONPATH=/;  cd /; /usr/local/bin/python -u -m deepspeed.launcher.launch --world_info=eyJkZHAtMC5kZHAubWwtZGV2LnN2Yy5jbHVzdGVyLmxvY2FsIjogWzAsIDFdLCAiZGRwLTEuZGRwLm1sLWRldi5zdmMuY2x1c3Rlci5sb2NhbCI6IFswLCAxXX0= --node_rank=%n --master_addr= --master_port=29500 scripts/basic_deepspeed.py
ddp-0: 
ddp-0:          _              _            _             _            _       _    _
ddp-0:         /\_\           /\ \         /\ \     _    / /\         / /\    / /\ /\ \
ddp-0:        / / /  _       /  \ \       /  \ \   /\_\ / /  \       / / /   / / //  \ \
ddp-0:       / / /  /\_\    / /\ \ \     / /\ \ \_/ / // / /\ \__   / /_/   / / // /\ \ \
ddp-0:      / / /__/ / /   / / /\ \_\   / / /\ \___/ // / /\ \___\ / /\ \__/ / // / /\ \ \
ddp-0:     / /\_____/ /   / /_/_ \/_/  / / /  \/____/ \ \ \ \/___// /\ \___\/ // / /  \ \_\
ddp-0:    / /\_______/   / /____/\    / / /    / / /   \ \ \     / / /\/___/ // / /   / / /
ddp-0:   / / /\ \ \     / /\____\/   / / /    / / /_    \ \ \   / / /   / / // / /   / / /
ddp-0:  / / /  \ \ \   / / /______  / / /    / / //_/\__/ / /  / / /   / / // / /___/ / /
ddp-0: / / /    \ \ \ / / /_______\/ / /    / / / \ \/___/ /  / / /   / / // / /____\/ /
ddp-0: \/_/      \_\_\\/__________/\/_/     \/_/   \_____\/   \/_/    \/_/ \/_________/
ddp-0: 
ddp-0: 
ddp-0: 
ddp-1: 
ddp-1:          _              _            _             _            _       _    _
ddp-1:         /\_\           /\ \         /\ \     _    / /\         / /\    / /\ /\ \
ddp-1:        / / /  _       /  \ \       /  \ \   /\_\ / /  \       / / /   / / //  \ \
ddp-1:       / / /  /\_\    / /\ \ \     / /\ \ \_/ / // / /\ \__   / /_/   / / // /\ \ \
ddp-1:      / / /__/ / /   / / /\ \_\   / / /\ \___/ // / /\ \___\ / /\ \__/ / // / /\ \ \
ddp-1:     / /\_____/ /   / /_/_ \/_/  / / /  \/____/ \ \ \ \/___// /\ \___\/ // / /  \ \_\
ddp-1:    / /\_______/   / /____/\    / / /    / / /   \ \ \     / / /\/___/ // / /   / / /
ddp-1:   / / /\ \ \     / /\____\/   / / /    / / /_    \ \ \   / / /   / / // / /   / / /
ddp-1:  / / /  \ \ \   / / /______  / / /    / / //_/\__/ / /  / / /   / / // / /___/ / /
ddp-1: / / /    \ \ \ / / /_______\/ / /    / / / \ \/___/ /  / / /   / / // / /____\/ /
ddp-1: \/_/      \_\_\\/__________/\/_/     \/_/   \_____\/   \/_/    \/_/ \/_________/
ddp-1: 
ddp-1: 
ddp-1: 
ddp-0: usage: launch.py [-h] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
ddp-0:                  [--master_port MASTER_PORT] [--world_info WORLD_INFO]
ddp-0:                  [--module] [--no_python] [--enable_elastic_training]
ddp-0:                  [--min_elastic_nodes MIN_ELASTIC_NODES]
ddp-0:                  [--max_elastic_nodes MAX_ELASTIC_NODES] [--no_local_rank]
ddp-0:                  [--save_pid SAVE_PID]
ddp-0:                  [--enable_each_rank_log ENABLE_EACH_RANK_LOG]
ddp-0:                  training_script ...
ddp-0: launch.py: error: the following arguments are required: training_script, training_script_args
ddp-1: usage: launch.py [-h] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
ddp-1:                  [--master_port MASTER_PORT] [--world_info WORLD_INFO]
ddp-1:                  [--module] [--no_python] [--enable_elastic_training]
ddp-1:                  [--min_elastic_nodes MIN_ELASTIC_NODES]
ddp-1:                  [--max_elastic_nodes MAX_ELASTIC_NODES] [--no_local_rank]
ddp-1:                  [--save_pid SAVE_PID]
ddp-1:                  [--enable_each_rank_log ENABLE_EACH_RANK_LOG]
ddp-1:                  training_script ...
ddp-1: launch.py: error: the following arguments are required: training_script, training_script_args
ddp-0: bash: line 1: 31m: command not found
pdsh@ddp-0: ddp-0: ssh exited with exit code 127
ddp-1: bash: line 1: 31m: command not found
pdsh@ddp-0: ddp-1: ssh exited with exit code 127

It successfully connects to both pods (named ddp-0 and ddp-1) over SSH, but for some reason runner.py doesn't pass my script through to launch.py. Any ideas why?
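As a side note, the --world_info payload in the pdsh command is just base64-encoded JSON, and decoding it shows that both nodes and their local GPU ranks were registered, so the world info itself looks fine:

```python
import base64
import json

# The --world_info value copied verbatim from the launcher log above.
world_info = (
    "eyJkZHAtMC5kZHAubWwtZGV2LnN2Yy5jbHVzdGVyLmxvY2FsIjogWzAsIDFdLCAi"
    "ZGRwLTEuZGRwLm1sLWRldi5zdmMuY2x1c3Rlci5sb2NhbCI6IFswLCAxXX0="
)

# Decode base64 -> JSON: a mapping of hostname -> list of local GPU ranks.
decoded = json.loads(base64.b64decode(world_info))
print(decoded)
# {'ddp-0.ddp.ml-dev.svc.cluster.local': [0, 1],
#  'ddp-1.ddp.ml-dev.svc.cluster.local': [0, 1]}
```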

I'm running on Debian 11 with torch 1.13.0 and deepspeed 0.7.7.
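Also worth noting: the first log line prints an empty IP ("Using IP address of  for node ..."), so one thing to check is whether each hostname actually resolves from the launch pod. A minimal sketch of such a check (hostnames copied from the log; this is a generic DNS probe, not DeepSpeed's own resolution code):

```python
import socket

def resolve(host):
    """Return the IPv4 address for host, or None if it does not resolve."""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        return None

# Hostnames taken from the launcher log above.
for host in ("ddp-0.ddp.ml-dev.svc.cluster.local",
             "ddp-1.ddp.ml-dev.svc.cluster.local"):
    print(host, "->", resolve(host))
```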

Labels

bug (Something isn't working)
