
multi GPU error #228

Open
yuxuan2015 opened this issue Mar 22, 2023 · 3 comments
Labels
model-usage: issues related to how models are used/loaded

Comments


yuxuan2015 commented Mar 22, 2023

torchrun --nproc_per_node gpu example.py --ckpt_dir pyllama_data/7B --tokenizer_path pyllama_data/tokenizer.model

Each of the four worker processes fails with the same traceback:

Traceback (most recent call last):
  File "/data1/Semantic_team/speech/chatgpt/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/data1/anaconda3/envs/llama_py310/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/data1/anaconda3/envs/llama_py310/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/data1/anaconda3/envs/llama_py310/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data1/Semantic_team/speech/chatgpt/llama/example.py", line 78, in main
    generator = load(
  File "/data1/Semantic_team/speech/chatgpt/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=1 but world size is 4
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8748) of binary: /data1/anaconda3/envs/llama_py310/bin/python3.1
Traceback (most recent call last):
  File "/data1/anaconda3/envs/llama_py310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data1/anaconda3/envs/llama_py310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data1/anaconda3/envs/llama_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/data1/anaconda3/envs/llama_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/data1/anaconda3/envs/llama_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data1/anaconda3/envs/llama_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
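For context, line 42 of example.py (visible in the traceback above) checks that the torchrun world size matches the number of checkpoint shards. A rough sketch of that check follows; only the assert itself appears in the log, and the shard-globbing line is an assumption about how the checkpoints are enumerated:

from pathlib import Path

def load(ckpt_dir: str, local_rank: int, world_size: int):
    # One consolidated *.pth shard per model-parallel rank; the 7B checkpoint ships a single shard (MP=1).
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert world_size == len(
        checkpoints
    ), f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"

With --nproc_per_node gpu on a 4-GPU machine, torchrun starts four workers, but pyllama_data/7B contains only one shard, so every rank trips this assertion.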

@papasega

Hello, I have the same error:

""" AssertionError: Loading a checkpoint for MP=1 but world size is 4
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8748) of binary: /data1/anaconda3/envs/llama_py310/bin/python3."""

And I fixed it with:

torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/7B ......

I think you have to set --nproc_per_node to the number of GPUs you have. In my case I have just 1.
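As a rough rule of thumb (a hypothetical snippet, not from this repo): the value to pass to --nproc_per_node is the number of *.pth shards in the checkpoint directory, which is what the assertion compares against the world size.

from pathlib import Path

ckpt_dir = "pyllama_data/7B"  # path taken from the original command; adjust to your setup
num_shards = len(list(Path(ckpt_dir).glob("*.pth")))  # 7B: 1, 13B: 2, 30B: 4, 65B: 8

print(f"torchrun --nproc_per_node {num_shards} example.py "
      f"--ckpt_dir {ckpt_dir} --tokenizer_path pyllama_data/tokenizer.model")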


kechan commented Mar 30, 2023

I ran into this issue on an Apple M2 Max. My nproc_per_node is 1, and the prominent error messages are:

File "/Users//Developer/python39_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
default_pg = _new_process_group_helper(
File "/Users/
/Developer/python39_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 998, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")

File "/Users//Developer/python39_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/Users/
/Developer/python39_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Any idea what I should do?
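Not an answer confirmed in this thread, but the RuntimeError above means the macOS build of PyTorch has no NCCL backend. A minimal sketch of the usual workaround, assuming example.py itself calls init_process_group with "nccl", is to select the backend at runtime so CPU/MPS machines fall back to gloo:

import torch
import torch.distributed as dist

# Assumption: the example initializes the default process group with the "nccl" backend.
# NCCL is not built into macOS wheels, so fall back to gloo when CUDA is unavailable.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend)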

ejsd1989 added the model-usage label Sep 6, 2023

ejsd1989 commented Sep 6, 2023

@yuxuan2015 @kechan checking in to see if this was resolved in your testing?
