
[Bug]: Changing src_length for fine-tuning raises an error #9232

Open · littlesmallrookie opened this issue Oct 9, 2024 · 3 comments
Labels: bug (Something isn't working)

@littlesmallrookie

Software environment

- paddlepaddle:   
- paddlepaddle-gpu:  3.0.0b1
- paddlenlp: 3.0.0b1.post20241009

Duplicate issues

  • I have searched the existing issues

Error description

When fine-tuning Qwen/Qwen2-0.5B, setting src_length=10240 in lora_argument.json and then running training raises an error.
Error output:
```
[2024-10-09 11:59:07,127] [   DEBUG] -   Number of trainable parameters = 3,784,704 (per device)
W1009 11:59:08.997602 31629 multiply_fwd_func.cc:75] got different data type, run type promotion automatically, this may cause data type been changed.
Traceback (most recent call last):
  File "/home/aistudio/work/PaddleNLP/llm/run_finetune.py", line 689, in <module>
    main()
  File "/home/aistudio/work/PaddleNLP/llm/run_finetune.py", line 564, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp-3.0.0b1.post20241009-py3.10.egg/paddlenlp/trainer/trainer.py", line 799, in train
    return self._inner_training_loop(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp-3.0.0b1.post20241009-py3.10.egg/paddlenlp/trainer/trainer.py", line 993, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp-3.0.0b1.post20241009-py3.10.egg/paddlenlp/trainer/trainer.py", line 2122, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp-3.0.0b1.post20241009-py3.10.egg/paddlenlp/trainer/trainer.py", line 2067, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp-3.0.0b1.post20241009-py3.10.egg/paddlenlp/transformers/qwen2/modeling.py", line 1365, in forward
    loss = self.criterion(logits, labels)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddlenlp-3.0.0b1.post20241009-py3.10.egg/paddlenlp/transformers/qwen2/modeling.py", line 1142, in forward
    loss = paddle.mean(masked_lm_loss)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/tensor/stat.py", line 90, in mean
    return _C_ops.mean(x, axis, keepdim)
ValueError: (InvalidArgument) Tensor need be reduced must not empty.
  [Hint: Expected x.numel() > 0, but received x.numel():0 <= 0:0.] (at ../paddle/phi/kernels/funcs/reduce_function.h:1055)
```

Steps to reproduce & code

[A screenshot of the reproduction steps was attached here.]

littlesmallrookie added the bug (Something isn't working) label on Oct 9, 2024
@ZHUI (Collaborator) commented Oct 9, 2024

It looks like there are no tokens left to compute the loss on, which causes the error: the gather for masked_lm_loss comes back empty.
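
A minimal, self-contained sketch of the failure mode @ZHUI describes (this is not the PaddleNLP code itself; the shape [8] and the usual -100 ignore index are illustrative assumptions): when every label is masked out, the selected loss tensor is empty and paddle.mean raises exactly the error seen in the traceback.

```python
import paddle

# Hypothetical per-token losses; in the real model these come from the LM head.
masked_lm_loss = paddle.rand([8])

# If every label is the ignore index (-100), no position contributes to the loss.
labels = paddle.full([8], -100, dtype="int64")

# Keeping only positions with a real label yields an empty tensor here.
active_loss = paddle.masked_select(masked_lm_loss, labels != -100)
print(active_loss.shape)  # [0]

# Reducing an empty tensor raises the same error as in the traceback:
# ValueError: (InvalidArgument) Tensor need be reduced must not empty.
loss = paddle.mean(active_loss)
```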

@littlesmallrookie (Author) commented

> It looks like there are no tokens left to compute the loss on, which causes the error: the gather for masked_lm_loss comes back empty.

How can this be fixed?

@DrownFish19 (Collaborator) commented

> It looks like there are no tokens left to compute the loss on, which causes the error: the gather for masked_lm_loss comes back empty.

Check src_length and max_length: max_length should be greater than src_length, since max_length = src_length + output_length.
So lowering src_length or raising max_length should resolve the problem.
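
Following that advice, a hedged sketch of the relevant lora_argument.json fields (the field names match the PaddleNLP llm example configs; the values are illustrative): with src_length=10240, max_length must be raised above 10240 so that max_length - src_length tokens remain for the response. If max_length stays at a smaller default such as 2048, the response tokens are presumably truncated away entirely, every label becomes the ignore index, and the loss tensor ends up empty as shown above.

```json
{
    "model_name_or_path": "Qwen/Qwen2-0.5B",
    "src_length": 10240,
    "max_length": 11264
}
```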
