Reminder
- I have read the above rules and searched the existing issues.
System Info
(llama_factory) PS D:\Ljx\Llama-finetuning> llamafactory-cli env
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
- llamafactory version: 0.9.2.dev0
- Platform: Windows-10-10.0.22631-SP0
- Python version: 3.10.16
- PyTorch version: 2.2.2+cu121 (GPU)
- Transformers version: 4.46.1
- Datasets version: 3.1.0
- Accelerate version: 1.0.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 4090
Reproduction
(llama_factory) D:\L\LLaMA-Factory>llamafactory-cli webui
* Running on local URL: http://0.0.0.0:7860
To create a public link, set share=True in launch().
# --------------- After configuring the dataset in the web UI and clicking Start, the following error occurred ---------------
[INFO|2025-01-20 22:34:25] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:27432
[2025-01-20 22:34:26,079] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING]
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] *****************************************
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
# Both worker processes print the same interleaved traceback; shown once:
Traceback (most recent call last):
  File "D:\L\LLaMA-Factory\src\llamafactory\launcher.py", line 23, in <module>
    launch()
  File "D:\L\LLaMA-Factory\src\llamafactory\launcher.py", line 19, in launch
    run_exp()
  File "D:\L\LLaMA-Factory\src\llamafactory\train\tuner.py", line 92, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "D:\L\LLaMA-Factory\src\llamafactory\train\tuner.py", line 52, in _training_function
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 182, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 162, in _parse_train_args
    return _parse_args(parser, args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 74, in _parse_args
    return parser.parse_dict(args, allow_extra_keys=allow_extra_keys)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\hf_argparser.py", line 387, in parse_dict
    obj = dtype(**inputs)
  File "<string>", line 142, in __init__
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\training_args.py", line 47, in __post_init__
    Seq2SeqTrainingArguments.__post_init__(self)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 1764, in __post_init__
    self.device
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 2277, in device
    return self._setup_devices
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\utils\generic.py", line 60, in __get__
    cached = self.fget(obj)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 2207, in _setup_devices
    self.distributed_state = PartialState(**accelerator_state_kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\accelerate\state.py", line 212, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\distributed_c10d.py", line 1302, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2025-01-20 22:34:31,115] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 15908) of binary: D:\Anaconda3\envs\llama_factory\python.exe
Traceback (most recent call last):
File "D:\Anaconda3\envs\llama_factory\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\Anaconda3\envs\llama_factory\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\Anaconda3\envs\llama_factory\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\run.py", line 812, in main
run(args)
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\run.py", line 803, in run
elastic_launch(
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\launcher\api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
D:\L\LLaMA-Factory\src\llamafactory\launcher.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-20_22:34:31
host : DESKTOP-K8BKR7S
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 16304)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-20_22:34:31
host : DESKTOP-K8BKR7S
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15908)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
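The failing line is the final RuntimeError: the Windows wheels of PyTorch ship without the NCCL backend (NCCL is Linux-only), so any multi-GPU launch that requests NCCL fails; only gloo is built in on Windows. A minimal diagnostic sketch to confirm this on the affected machine, using only public torch.distributed APIs:

import torch
import torch.distributed as dist

# On a Windows build of PyTorch 2.2.2 this is expected to print
# nccl=False, gloo=True, since NCCL is not shipped on Windows.
print("torch:", torch.__version__)
print("nccl available:", dist.is_nccl_available())
print("gloo available:", dist.is_gloo_available())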
Others
Windows environment, single machine with dual RTX 4090 GPUs. The web UI opens normally, but after configuring the parameters and clicking Start, fine-tuning fails with the error above. Any help would be appreciated.
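If dual-GPU training is not strictly required, one possible workaround (a sketch, assuming llamafactory-cli only spawns a multi-process torchrun job when it sees more than one CUDA device) is to expose a single GPU before launching, so no distributed backend is initialized at all:

import os
import subprocess

# Hypothetical launcher wrapper: restrict training to one RTX 4090 so that
# no process group (NCCL/gloo) needs to be created. The variable must be
# set before any CUDA initialization, i.e. before the CLI starts.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
subprocess.run(["llamafactory-cli", "webui"], env=os.environ)

For true dual-GPU training on Windows, the process group would have to be created with the gloo backend instead of NCCL, e.g. torch.distributed.init_process_group(backend="gloo", ...), since gloo is the only distributed backend included in Windows builds of PyTorch.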