Traceback (most recent call last):
File "/data1/yanzh/project/ml-mgie/LLaVA/llava/train/train_mem.py", line 6, in <module>
from llava.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
File "/data1/yanzh/project/ml-mgie/LLaVA/llava/train/llama_flash_attn_monkey_patch.py", line 12, in <module>
from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
ImportError: cannot import name 'flash_attn_unpadded_qkvpacked_func' from 'flash_attn.flash_attn_interface' (/data1/yanzh/.conda/envs/mgie/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py)
W1016 02:57:37.220000 140512597451840 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2606919 closing signal SIGTERM
W1016 02:57:37.221000 140512597451840 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2606920 closing signal SIGTERM
W1016 02:57:37.221000 140512597451840 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2606921 closing signal SIGTERM
W1016 02:57:37.221000 140512597451840 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2606922 closing signal SIGTERM
W1016 02:57:37.221000 140512597451840 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2606925 closing signal SIGTERM
W1016 02:57:37.221000 140512597451840 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2606926 closing signal SIGTERM
E1016 02:57:37.271000 140512597451840 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 4 (pid: 2606923) of binary: /data1/yanzh/.conda/envs/mgie/bin/python
Traceback (most recent call last):
File "/data1/yanzh/.conda/envs/mgie/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data1/yanzh/.conda/envs/mgie/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/data1/yanzh/.conda/envs/mgie/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/data1/yanzh/.conda/envs/mgie/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/data1/yanzh/.conda/envs/mgie/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data1/yanzh/.conda/envs/mgie/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
LLaVA/llava/train/train_mem.py FAILED
Failures:
[1]:
time : 2024-10-16_02:57:37
host : node28
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 2606924)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-10-16_02:57:37
host : node28
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 2606923)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
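This ImportError is a version mismatch rather than a broken install: flash-attn 2.x renamed its `unpadded` variable-length kernels to `varlen`, so `flash_attn_unpadded_qkvpacked_func` no longer exists under that name, while LLaVA's monkey patch still imports the 1.x name. One workaround (a sketch, not an official fix from this repo) is either to pin `flash-attn` below 2.0 in the environment, or to alias the renamed function at the top of `llama_flash_attn_monkey_patch.py`:

```python
# Compatibility shim (sketch): flash-attn 2.x renamed the "unpadded"
# kernels to "varlen". Try the old 1.x name first, then fall back to
# aliasing the 2.x name so the rest of the monkey patch is unchanged.
try:
    from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
except ImportError:
    try:
        from flash_attn.flash_attn_interface import (
            flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func,
        )
    except ImportError:
        # flash-attn not installed at all; leave a sentinel so the
        # failure is explicit if the patch is actually applied.
        flash_attn_unpadded_qkvpacked_func = None
```

Note the 2.x `varlen` function is a drop-in rename for this qkv-packed call, but other parts of the 1.x-era patch may still need adjustment for the 2.x API, so pinning the older release is the lower-risk option.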