
Fine-tuning llama3.1-instruction on Windows: error after starting training, fine-tuning fails to launch #6725

@LJXCMQ

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

(llama_factory) PS D:\Ljx\Llama-finetuning> llamafactory-cli env
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).

  • llamafactory version: 0.9.2.dev0
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.10.16
  • PyTorch version: 2.2.2+cu121 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090

Reproduction

(llama_factory) D:\L\LLaMA-Factory>llamafactory-cli webui
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set share=True in launch().
# --------------- The following error occurred after configuring the dataset in the web UI and clicking Start ---------------
[INFO|2025-01-20 22:34:25] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:27432
[2025-01-20 22:34:26,079] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING]
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] *****************************************
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-01-20 22:34:26,090] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:27432 (system error: 10049 - The requested address is not valid in its context.).
# (both worker processes raise the same exception; the interleaved tracebacks are shown once)
Traceback (most recent call last):
  File "D:\L\LLaMA-Factory\src\llamafactory\launcher.py", line 23, in <module>
    launch()
  File "D:\L\LLaMA-Factory\src\llamafactory\launcher.py", line 19, in launch
    run_exp()
  File "D:\L\LLaMA-Factory\src\llamafactory\train\tuner.py", line 92, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "D:\L\LLaMA-Factory\src\llamafactory\train\tuner.py", line 52, in _training_function
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 182, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 162, in _parse_train_args
    return _parse_args(parser, args)
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\parser.py", line 74, in _parse_args
    return parser.parse_dict(args, allow_extra_keys=allow_extra_keys)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\hf_argparser.py", line 387, in parse_dict
    obj = dtype(**inputs)
  File "<string>", line 142, in __init__
  File "D:\L\LLaMA-Factory\src\llamafactory\hparams\training_args.py", line 47, in __post_init__
    Seq2SeqTrainingArguments.__post_init__(self)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 1764, in __post_init__
    self.device
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 2277, in device
    return self._setup_devices
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\utils\generic.py", line 60, in __get__
    cached = self.fget(obj)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\transformers\training_args.py", line 2207, in _setup_devices
    self.distributed_state = PartialState(**accelerator_state_kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\accelerate\state.py", line 212, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\distributed_c10d.py", line 1302, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in

[2025-01-20 22:34:31,115] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 15908) of binary: D:\Anaconda3\envs\llama_factory\python.exe
Traceback (most recent call last):
  File "D:\Anaconda3\envs\llama_factory\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\Anaconda3\envs\llama_factory\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda3\envs\llama_factory\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\run.py", line 812, in main
    run(args)
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\run.py", line 803, in run
    elastic_launch(
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\Anaconda3\envs\llama_factory\lib\site-packages\torch\distributed\launcher\api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
D:\L\LLaMA-Factory\src\llamafactory\launcher.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-20_22:34:31
  host      : DESKTOP-K8BKR7S
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 16304)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-20_22:34:31
  host      : DESKTOP-K8BKR7S
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 15908)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
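
The root cause is the final RuntimeError: the Windows builds of PyTorch do not ship NCCL, so the multi-worker launch that the WebUI starts via torchrun cannot create its default NCCL process group. The following check (a minimal sketch, run inside the same llama_factory conda environment) confirms which distributed backends this build actually provides:

import torch
import torch.distributed as dist

print(torch.__version__)          # 2.2.2+cu121, matching the env report above
print(dist.is_available())        # True: distributed support is compiled in
print(dist.is_nccl_available())   # expected False on Windows wheels
print(dist.is_gloo_available())   # True: gloo is the backend usable on Windows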

Others

Single-machine setup with dual RTX 4090s on Windows. The WebUI opens normally, but after configuring the parameters and clicking Start, fine-tuning fails with the error above. Any help is appreciated.
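
Since NCCL is unavailable on Windows, the two workarounds I can think of are restricting training to a single visible GPU (so that, presumably, no distributed process group is needed) or switching the distributed backend to gloo, which PyTorch does support on Windows. Below is a minimal sketch showing only that a gloo process group can initialize on this machine; the single-process rank/world_size values and the master address are placeholder assumptions for the demonstration:

import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single-process demonstration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# gloo, unlike NCCL, ships in the Windows wheels and initializes normally.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.get_backend())   # gloo
dist.destroy_process_group()

If that check passes, the options on the training side would be either setting CUDA_VISIBLE_DEVICES=0 before launching the WebUI to fall back to single-GPU training, or passing ddp_backend="gloo" (a standard transformers TrainingArguments option) so the torchrun workers initialize gloo instead of NCCL; I have not verified which of these the WebUI exposes.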

    Labels

    bug (Something isn't working), pending (This problem is yet to be addressed)
