LLM fine-tuning of OneKE #561
For single-GPU training, run it like this: CUDA_VISIBLE_DEVICES="0" python src/finetune.py
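As a quick side note (my own sketch, not from the issue): launching with CUDA_VISIBLE_DEVICES="0" makes only physical GPU 0 visible to the process, which you can confirm from Python:

    import os
    import torch

    # With CUDA_VISIBLE_DEVICES="0", only one device is visible and it appears as cuda:0.
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # "0"
    print(torch.cuda.device_count())               # expected: 1
    print(torch.cuda.get_device_name(0))           # name of the single visible GPU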
Thanks for your reply. I'd also like to ask the following questions:
When converting the data format, should the split option be train or test? I previously chose test and got a KeyError on 'response'; after switching to train it ran fine.
For question 1, I installed accelerate and bitsandbytes at the versions pinned in requirements.txt.
Questions 1 and 3 are now solved. The earlier error was caused by a torch/CUDA mismatch; after adding the 4-bit quantization (bits 4) parameter back, training runs normally and the GPU is being used. I hope you can answer my question 2 when you see this, many thanks!
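For readers hitting the same point, a minimal sketch of what a 4-bit quantized load looks like through transformers + bitsandbytes (the model id and config values are assumptions for illustration; the repository's finetune.py wires this up through its own quantization/bits argument):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Common NF4 settings; not copied from finetune.py.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # "zjunlp/OneKE" is the published OneKE checkpoint on Hugging Face.
    model = AutoModelForCausalLM.from_pretrained(
        "zjunlp/OneKE",
        quantization_config=bnb_config,
        device_map="auto",
    )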
Keep the dev file in the same format as train, and choose --split train.
dev is the validation set used to evaluate the model at the end of each training epoch; it is not the test set.
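To illustrate the KeyError above (field names are inferred from the error message, not taken from the conversion script): a train/dev-style record keeps the gold answer under a response key, while a test-style record drops it, so code that expects that key fails on test-formatted data:

    # Hypothetical record shapes for illustration only.
    train_record = {"instruction": "...", "input": "...", "response": "..."}  # train/dev style
    test_record = {"instruction": "...", "input": "..."}                      # test style: no gold answer

    print(train_record["response"])  # fine
    print(test_record["response"])   # KeyError: 'response'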
Thanks!
One last question: with an average sample length of about 1000 and a training set of 300 examples, roughly how many epochs are appropriate for LoRA fine-tuning of OneKE? A rough range is fine.
The data is in Chinese.
10+ |
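For context, a LoRA run in that range might be configured roughly like the sketch below (all hyperparameters here are illustrative assumptions, not the repository's defaults):

    from peft import LoraConfig
    from transformers import TrainingArguments

    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumed Llama-style attention projections
        task_type="CAUSAL_LM",
    )

    training_args = TrainingArguments(
        output_dir="output/oneke-lora",   # hypothetical output path
        num_train_epochs=10,              # the "10+" suggested above; watch dev loss for overfitting
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        logging_steps=10,
        evaluation_strategy="epoch",      # run the dev-set evaluation at the end of each epoch
    )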
Describe the bug
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25570 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 25571) of binary: /disk1/miniconda3/envs/deepke-llm/bin/python
Traceback (most recent call last):
File "/disk1/miniconda3/envs/deepke-llm/bin/torchrun", line 8, in
sys.exit(main())
File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
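As a troubleshooting aside (not part of the original report): since the reporter later traced the failure to a torch/CUDA mismatch, a quick sanity check is:

    import torch

    print(torch.__version__)          # installed PyTorch version
    print(torch.version.cuda)         # CUDA version the wheel was built against
    print(torch.cuda.is_available())  # False often indicates a torch/CUDA driver mismatch
    print(torch.cuda.device_count())  # GPUs visible to this process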
Environment (please complete the following information):
All package versions match those pinned in requirements.txt.
Screenshots
Additional context