
LLM fine-tuning of OneKE #561

Closed
jack9193 opened this issue Aug 5, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@jack9193 commented Aug 5, 2024

Describe the bug

Hello, I ran into two main problems:

  1. The bf16 argument raises an error; after searching, it seems the V100 does not support bf16.
  2. After setting bf16 to False, the run fails with the traceback below (a quick check of both points is sketched after it):
    Loading checkpoint shards:   0%| | 0/3 [00:00<?, ?it/s]
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25570 closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 25571) of binary: /disk1/miniconda3/envs/deepke-llm/bin/python
    Traceback (most recent call last):
      File "/disk1/miniconda3/envs/deepke-llm/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
        run(args)
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
        elastic_launch(
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
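
For reference, both points can be confirmed with a minimal check (plain PyTorch, not DeepKE code):

# Minimal device sanity check; run inside the deepke-llm environment.
import torch

print(torch.cuda.device_count())       # GPUs visible to this process (CUDA_VISIBLE_DEVICES applies)
print(torch.cuda.is_bf16_supported())  # expected False on a V100 (compute capability 7.0)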

Environment (please complete the following information):

  • OS: Linux (per the paths in the traceback)
  • Python Version: 3.9
  • All packages are installed at the versions pinned in requirements.txt.

Additional context

The script fine_continue.sh is shown below; the data was converted to the required format following the README:
output_dir='lora/oneke-continue'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0" torchrun --nproc_per_node=4 --master_port=1287 src/finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file 'data/NER/train.json' \
    --valid_file 'data/NER/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --bf16 False \
    --bits 4

@jack9193 jack9193 added the bug Something isn't working label Aug 5, 2024
@guihonghao (Contributor)

For a single GPU, run it like this: CUDA_VISIBLE_DEVICES="0" python src/finetune.py
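
(Context: the script above sets CUDA_VISIBLE_DEVICES="0", exposing a single GPU, while --nproc_per_node=4 asks torchrun to spawn four workers; the extra ranks presumably fail to get a device, and the elastic agent then tears the job down with the SIGTERM/ChildFailedError seen above. Running plain python sidesteps the launcher; alternatively, if the machine has four GPUs, expose them all, e.g. CUDA_VISIBLE_DEVICES="0,1,2,3".)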

@jack9193 (Author) commented Aug 5, 2024

> For a single GPU, run it like this: CUDA_VISIBLE_DEVICES="0" python src/finetune.py

Hello, the error has now changed to:

AttributeError: /disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cquantize_blockwise_fp16_nf4
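
The symbol cquantize_blockwise_fp16_nf4 exists only in the CUDA build of bitsandbytes, so hitting libbitsandbytes_cpu.so means the library fell back to its CPU-only binary (typically a CUDA/torch version mismatch). A minimal repro of the failing path, assuming bitsandbytes >= 0.39 — illustrative only, not DeepKE code:

# nf4 quantization needs the CUDA build of bitsandbytes; on a CPU-only
# fallback this raises the same undefined-symbol AttributeError.
import torch
import bitsandbytes.functional as bnbf

x = torch.randn(64, 64, dtype=torch.float16, device="cuda")
quantized, quant_state = bnbf.quantize_nf4(x)  # calls cquantize_blockwise_fp16_nf4 internally
print(quantized.shape)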

@guihonghao (Contributor)

See artidoro/qlora#31, bitsandbytes-foundation/bitsandbytes#156, and bitsandbytes-foundation/bitsandbytes#134.

@jack9193 (Author) commented Aug 5, 2024

> See artidoro/qlora#31, bitsandbytes-foundation/bitsandbytes#156, and bitsandbytes-foundation/bitsandbytes#134.

Thanks for the pointers. I have a few follow-up questions:

  1. I hit the error: Using load_in_8bit=True requires Accelerate: pip install accelerate and the latest version of bitsandbytes: pip install -i https://test.pypi.org/simple/ bitsandbytes or pip install bitsandbytes.
     I then removed --bits 4 from the .sh file, and the error no longer appears. Could removing it cause any problems?
  2. dev.json was converted with:

python ie2instruction/convert_func.py \
    --src_path data/NER/dev.json \
    --tgt_path data/NER/new/dev.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --split_num -1 \
    --random_sort \
    --split train

     When converting, should --split be train or test? I chose test at first and got a KeyError: 'response'; after switching to train it runs (see the sanity check after this list).
  3. The run is now in progress, but it seems stuck somewhere inside trainer.train() and never uses the GPU. Is this normal?
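
For question 2, the KeyError suggests the pipeline expects a gold "response" field that --split test omits. A hedged sanity check over the converted file (the field name is taken from the KeyError above, and one JSON object per line is assumed; adjust to your DeepKE version):

# Check that every converted record carries the gold field the trainer reads.
import json

with open("data/NER/new/dev.json", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        if "response" not in record:
            raise KeyError(f"line {i}: missing 'response' field")
print("all records contain a 'response' field")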

@jack9193 (Author) commented Aug 5, 2024

For question 1: I installed accelerate and bitsandbytes at exactly the versions pinned in requirements.txt.

@jack9193 (Author) commented Aug 5, 2024

Questions 1 and 3 are resolved: the earlier errors came from a torch/CUDA mismatch. With that fixed, I added the --bits 4 quantization argument back, and training now runs and uses the GPU. I would still appreciate an answer to question 2. Many thanks!

@guihonghao (Contributor)

The dev set uses the same format as train, so choose --split train.

@guihonghao (Contributor)

dev is the validation set, used to evaluate the model at the end of each training epoch; it is not the test set.

@jack9193 (Author) commented Aug 5, 2024

> The dev set uses the same format as train, so choose --split train.

Thanks!

@jack9193 (Author) commented Aug 5, 2024

One last question: with an average sample length of about 1000 and a training set of 300 examples, roughly how many epochs are appropriate for LoRA fine-tuning OneKE? A rough range is fine.

@jack9193 (Author) commented Aug 5, 2024

> One last question: with an average sample length of about 1000 and a training set of 300 examples, roughly how many epochs are appropriate for LoRA fine-tuning OneKE? A rough range is fine.

The data is in Chinese.

@guihonghao (Contributor)

10+
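
(For scale, using the numbers from the script above on a single GPU: 300 samples with per_device_train_batch_size 1 and gradient_accumulation_steps 4 come to about 75 optimizer steps per epoch, so 10+ epochs is roughly 750+ steps.)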

@jack9193 (Author) commented Aug 5, 2024

> 10+

OK, thank you very much!

@jack9193 jack9193 closed this as completed Aug 5, 2024