[Bug]: Gloo Connection reset by peer #6308
Comments
maybe worthwhile to try the new pipeline parallel? check out https://docs.vllm.ai/en/latest/serving/distributed_serving.html for more details. basically
I tried this (--tensor-parallel-size 8 --pipeline-parallel-size 2) as well, after a couple of successful requests I get this error:
The issue in both cases seems similar - running for a few (< 100) iterations and then running into this type of error suggests to me that there's something wrong with the instances/network setup itself.
Ok, thank you. I'll check the HW and network then. Are there any pointers (best practices) for checking this? I'm aware of nccl-tests and those seem to run fine for this setup.
Indeed. I changed instances and network and now it seems stable (no crash in 1h of nonstop testing). In particular, I now use different networks for ray/gloo and nccl (IB network), but I'm not sure if that's the reason. Anyway, closing this, thanks @andoorve!
I also encountered the same problem. How can it be solved? @youkaichao
Did you solve this problem? How did you solve it? @thies1006
usually it is caused by a network setup problem. try to set
@youkaichao I set NCCL_SOCKET_IFNAME=eth0 and GLOO_SOCKET_IFNAME=eth0, but the issue is not resolved. Do you have any suggestions?
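For readers trying the same setting: a minimal sketch, assuming the interface name must match one that actually exists on each node. The helper name `socket_ifname_env` is hypothetical and not part of vLLM; only the two environment variable names come from this thread.

```python
import socket


def socket_ifname_env(ifname, available=None):
    """Return env vars that pin NCCL and Gloo sockets to one interface.

    `available` defaults to the interfaces reported by the OS, so a typo
    such as "eth0" on a host that only has "ens5" is caught early.
    """
    if available is None:
        # socket.if_nameindex() returns (index, name) pairs (Unix only).
        available = [name for _, name in socket.if_nameindex()]
    if ifname not in available:
        raise ValueError(
            f"interface {ifname!r} not found; available: {sorted(available)}"
        )
    return {
        "NCCL_SOCKET_IFNAME": ifname,
        "GLOO_SOCKET_IFNAME": ifname,
    }
```

The returned dict can be merged into `os.environ` on every node before the server starts. This only verifies that the name exists, not that the interface is reachable from the other node.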
does the sanity check script run normally?
@youkaichao No, I executed the corresponding check script, the error message is as below:
Context: when the service is first deployed, requests are answered normally. After the service has been idle for more than 1 hour, it becomes abnormal and no longer responds to new requests, and the following error is reported
My model is Mixtral-8x7B-moe, and I used 2 * A800-40G to deploy it with vllm==0.5.0. I also tried setting NCCL_SOCKET_IFNAME=eth0 and GLOO_SOCKET_IFNAME=eth0, but it does not work.
this is another problem like #5084. And I think #5399 should help. You can try the latest version. cc @njhill it is in the broadcast operation even if no requests are running. looks strange.
I used different networks for gloo and nccl. Could you try this if possible? I'm not sure whether this really solved the problem (I'm out of office right now), but at least it got much better. I also saw a significant improvement in the metrics with this.
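One way the split described above could be expressed is with separate interface names for the two variables. This is a sketch under assumptions: the thread does not say how the networks were actually configured, the interface names `eth0` and `ib0` are placeholders, and `split_network_env` is a hypothetical helper.

```python
import os


def split_network_env(gloo_ifname, nccl_ifname, base_env=None):
    """Copy an environment and point Gloo and NCCL at different interfaces."""
    # Work on a copy so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["GLOO_SOCKET_IFNAME"] = gloo_ifname  # e.g. the Ethernet interface
    env["NCCL_SOCKET_IFNAME"] = nccl_ifname  # e.g. the IP-over-IB interface
    return env


# Placeholder names; substitute the interfaces present on your nodes.
example_env = split_network_env("eth0", "ib0", base_env={})
```

The resulting dict can be passed as `env=` to `subprocess.Popen` when launching the server process on each node.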
@FangxuLiu this is a known issue that was fixed by #5987 which is in 0.5.1. Please upgrade.
Your current environment
🐛 Describe the bug
I'm running Llama3-70B on two nodes with 8 GPUs each using TP=16. I tried adding the options eager-mode and disable-custom-all-reduce without any success. The first ~100 queries always run fine, but after a while I get this Runtime Error: