[Bug]: PaddleNLP large-model training job fails to run on machines with older CPUs #9194

Open
1 task done
hjx620 opened this issue Sep 25, 2024 · 0 comments
Labels
bug Something isn't working

hjx620 commented Sep 25, 2024

Software environment

- paddlepaddle-gpu: 0.0.0.post120
- paddlenlp: 2.8.0

Duplicate issues

  • I have searched the existing issues

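Bug description

I am running LoRA fine-tuning of a large model with PaddleNLP. The job runs on some machines but fails on others.
On the failing machines the job aborts with Illegal instruction (core dumped), and the system reports an error in libphi.so.
After comparing the two kinds of machines, the only difference I found is the CPU. I suspect that some Paddle operators do not support older CPUs.

The failing machines all appear to have this CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
The working machines all have: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz or newer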
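One way to narrow this down is to compare the instruction-set flags the two CPUs expose. The E5-2680 v4 is a Broadwell part without AVX-512, while the Silver 4110 is a Skylake-SP part with it, so a binary compiled with AVX-512 instructions would crash on the former. Whether libphi.so actually contains such instructions is not confirmed here; this is only a sketch for gathering evidence. The helper names and the list of flags are hypothetical, and the script assumes a Linux host with /proc/cpuinfo.

```python
# Hypothetical diagnostic helper; not part of Paddle or PaddleNLP.
from pathlib import Path

# Instruction-set extensions to compare between hosts (an assumed, non-exhaustive list).
FLAGS_OF_INTEREST = ("avx", "avx2", "fma", "avx512f", "avx512bw", "avx512vl")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Every logical core lists the same flags, so the first entry is enough.
            return set(value.split())
    return set()


def report(flags):
    """Map each flag of interest to whether the CPU advertises it."""
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    for name, present in report(parse_cpu_flags(text)).items():
        print(f"{name:10s} {'yes' if present else 'NO'}")
```

Running this on both sugon-gpu-4 and sugon-gpu-6 and diffing the output would show which extensions the older machine lacks.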

Steps to reproduce & code

  • Go to the PaddleNLP-develop/PaddleNLP-develop/llm directory

  • Run the command python3 -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./chatglm2/lora_argument.json

  • On sugon-gpu-4, the job fails with Illegal instruction (core dumped)
    (screenshot: 2e6263f1df3726887fc8b64a83c3a53)

  • On sugon-gpu-6, the job runs normally
    (screenshot: e042300333f4be8f73744b31ae7715a)

Here, sugon-gpu-4 uses an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, and sugon-gpu-6 uses an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz or newer. Both machines have four V100 GPUs, and the CUDA version is 12.5.
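To separate the crash from the distributed launcher and the fine-tuning script, it may help to check whether a bare Paddle import or Paddle's built-in self-check already dies with SIGILL on the failing host. The sketch below runs each step in a child process and inspects the exit status. `died_with_sigill` is a hypothetical helper, the choice of steps is an assumption, and the signal check relies on POSIX semantics; `paddle.utils.run_check()` is Paddle's installation check.

```python
# Hypothetical diagnostic helper; not part of Paddle or PaddleNLP.
import signal
import subprocess
import sys


def died_with_sigill(cmd):
    """Run cmd in a child process and return True if SIGILL killed it."""
    result = subprocess.run(cmd)
    # On POSIX, a negative return code is the negated number of the
    # signal that terminated the child.
    return result.returncode == -signal.SIGILL


if __name__ == "__main__":
    # Assumed steps, ordered from least to most Paddle code exercised.
    steps = {
        "import paddle": "import paddle",
        "paddle.utils.run_check()": "import paddle; paddle.utils.run_check()",
    }
    for label, code in steps.items():
        crashed = died_with_sigill([sys.executable, "-c", code])
        print(f"{label}: {'SIGILL' if crashed else 'no SIGILL'}")
```

If the first step already reports SIGILL on sugon-gpu-4, the problem lies in the installed paddlepaddle-gpu build rather than in the PaddleNLP training code.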

hjx620 added the bug label Sep 25, 2024