
LLM fine-tuning of OneKE #561

Closed
jack9193 opened this issue Aug 5, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@jack9193 commented Aug 5, 2024

Describe the bug

Hello, I ran into two main problems:

  1. The bf16 argument raises an error; after searching, it seems the V100 does not support bf16.
  2. After setting bf16 to False, the run fails with the traceback below (a quick check of both points is sketched after it):
    Loading checkpoint shards:   0%| | 0/3 [00:00<?, ?it/s]
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25570 closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 25571) of binary: /disk1/miniconda3/envs/deepke-llm/bin/python
    Traceback (most recent call last):
      File "/disk1/miniconda3/envs/deepke-llm/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
        run(args)
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
        elastic_launch(
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
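
For reference, both points can be confirmed with a minimal check (plain PyTorch, not DeepKE code):

# Minimal device sanity check; run inside the deepke-llm environment.
import torch

print(torch.cuda.device_count())       # GPUs visible to this process (CUDA_VISIBLE_DEVICES applies)
print(torch.cuda.is_bf16_supported())  # expected False on a V100 (compute capability 7.0)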

Environment (please complete the following information):

  • OS: Linux (per the paths in the traceback)
  • Python Version: 3.9
  • All packages are installed at the versions pinned in requirements.txt.

Additional context

The script fine_continue.sh is shown below; the data was converted to the required format following the README:
output_dir='lora/oneke-continue'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0" torchrun --nproc_per_node=4 --master_port=1287 src/finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file 'data/NER/train.json' \
    --valid_file 'data/NER/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --bf16 False \
    --bits 4

@jack9193 jack9193 added the bug Something isn't working label Aug 5, 2024
@guihonghao (Contributor)

For a single GPU, run it like this: CUDA_VISIBLE_DEVICES="0" python src/finetune.py
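
(Context: the script above sets CUDA_VISIBLE_DEVICES="0", exposing a single GPU, while --nproc_per_node=4 asks torchrun to spawn four workers; the extra ranks presumably fail to get a device, and the elastic agent then tears the job down with the SIGTERM/ChildFailedError seen above. Running plain python sidesteps the launcher; alternatively, if the machine has four GPUs, expose them all, e.g. CUDA_VISIBLE_DEVICES="0,1,2,3".)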

@jack9193 (Author) commented Aug 5, 2024

> For a single GPU, run it like this: CUDA_VISIBLE_DEVICES="0" python src/finetune.py

Hello, the error has now changed to:

AttributeError: /disk1/miniconda3/envs/deepke-llm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cquantize_blockwise_fp16_nf4
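
The symbol cquantize_blockwise_fp16_nf4 exists only in the CUDA build of bitsandbytes, so hitting libbitsandbytes_cpu.so means the library fell back to its CPU-only binary (typically a CUDA/torch version mismatch). A minimal repro of the failing path, assuming bitsandbytes >= 0.39 — illustrative only, not DeepKE code:

# nf4 quantization needs the CUDA build of bitsandbytes; on a CPU-only
# fallback this raises the same undefined-symbol AttributeError.
import torch
import bitsandbytes.functional as bnbf

x = torch.randn(64, 64, dtype=torch.float16, device="cuda")
quantized, quant_state = bnbf.quantize_nf4(x)  # calls cquantize_blockwise_fp16_nf4 internally
print(quantized.shape)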

@guihonghao (Contributor)

See artidoro/qlora#31, bitsandbytes-foundation/bitsandbytes#156, and bitsandbytes-foundation/bitsandbytes#134.

@jack9193 (Author) commented Aug 5, 2024

> See artidoro/qlora#31, bitsandbytes-foundation/bitsandbytes#156, and bitsandbytes-foundation/bitsandbytes#134.

Thanks for the pointers. I have a few follow-up questions:

  1. I hit the error: Using load_in_8bit=True requires Accelerate: pip install accelerate and the latest version of bitsandbytes: pip install -i https://test.pypi.org/simple/ bitsandbytes or pip install bitsandbytes.
     I then removed --bits 4 from the .sh file, and the error no longer appears. Could removing it cause any problems?
  2. dev.json was converted with:

python ie2instruction/convert_func.py \
    --src_path data/NER/dev.json \
    --tgt_path data/NER/new/dev.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --split_num -1 \
    --random_sort \
    --split train

     When converting, should --split be train or test? I chose test at first and got a KeyError: 'response'; after switching to train it runs (see the sanity check after this list).
  3. The run is now in progress, but it seems stuck somewhere inside trainer.train() and never uses the GPU. Is this normal?
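
For question 2, the KeyError suggests the pipeline expects a gold "response" field that --split test omits. A hedged sanity check over the converted file (the field name is taken from the KeyError above, and one JSON object per line is assumed; adjust to your DeepKE version):

# Check that every converted record carries the gold field the trainer reads.
import json

with open("data/NER/new/dev.json", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        if "response" not in record:
            raise KeyError(f"line {i}: missing 'response' field")
print("all records contain a 'response' field")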

@jack9193 (Author) commented Aug 5, 2024

For question 1: I installed accelerate and bitsandbytes at exactly the versions pinned in requirements.txt.

@jack9193 (Author) commented Aug 5, 2024

Questions 1 and 3 are resolved: the earlier errors came from a torch/CUDA mismatch. With that fixed, I added the --bits 4 quantization argument back, and training now runs and uses the GPU. I would still appreciate an answer to question 2. Many thanks!

@guihonghao (Contributor)

The dev set uses the same format as train, so choose --split train.

@guihonghao (Contributor)

dev is the validation set, used to evaluate the model at the end of each training epoch; it is not the test set.

@jack9193 (Author) commented Aug 5, 2024

> The dev set uses the same format as train, so choose --split train.

Thanks!

@jack9193 (Author) commented Aug 5, 2024

One last question: with an average sample length of about 1000 and a training set of 300 examples, roughly how many epochs are appropriate for LoRA fine-tuning OneKE? A rough range is fine.

@jack9193 (Author) commented Aug 5, 2024

> One last question: with an average sample length of about 1000 and a training set of 300 examples, roughly how many epochs are appropriate for LoRA fine-tuning OneKE? A rough range is fine.

The data is in Chinese.

@guihonghao (Contributor)

10+
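
(For scale, using the numbers from the script above on a single GPU: 300 samples with per_device_train_batch_size 1 and gradient_accumulation_steps 4 come to about 75 optimizer steps per epoch, so 10+ epochs is roughly 750+ steps.)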

@jack9193 (Author) commented Aug 5, 2024

> 10+

OK, thank you very much!

@jack9193 jack9193 closed this as completed Aug 5, 2024