[WIP] Add fluid dist train gpu benchmark data #8550
Conversation
I notice there are some TensorFlow numbers to be filled in.
helinwang left a comment
Thank you! LGTM!
> *The performance gap between Fluid and v2 comes from the network interference.*
> ## Test Result (GPU with flowers)
Maybe change "flowers" to "flowers dataset", otherwise it's hard to understand what "flowers" means. Btw, does the dataset we use matter for the performance benchmark?
> ### Hardware Information
> - GPU: NVIDIA Tesla P40, Driver Version: 375.26
FYI: we may need to update the NVIDIA driver in the cluster soon; the NCCL 2 we currently use does not support 380.x (tested working on the latest 390.x).
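(As an aside, a rough sketch of how the installed driver on a node could be checked before scheduling such an upgrade. The `--query-gpu=driver_version` query is standard `nvidia-smi`; looping over the cluster hosts and error handling are left out.)

```python
# Rough sketch: print the NVIDIA driver version of the local node, e.g. to verify
# NCCL 2 compatibility before/after a driver upgrade. Run it on every node of the
# cluster; host iteration and error handling are intentionally omitted.
import subprocess

def driver_version():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"])
    return out.decode().strip().splitlines()[0]

if __name__ == "__main__":
    print("NVIDIA driver: %s" % driver_version())
```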
> | PaddlePaddle v2 (1 port) | 22.06 | 6.0 | - | - |
> NOTE: GPU cluster training has a bottleneck when doing communication.
> NOTE: increasing the number of ports for PaddlePaddle v2 will not increase the total throughput.
It's a little hard to reason about why increasing the number of ports could increase the throughput. I thought the throughput is limited by the hardware layer, not the TCP protocol layer.
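To make this concrete, here is a minimal, self-contained sketch (plain Python sockets, all names illustrative, nothing PaddlePaddle-specific) of how one could check whether adding parallel TCP streams raises aggregate throughput. As written it only exercises loopback; on a real cluster the receiving half would run on the pserver node, and once the NIC is saturated, extra streams/ports should not add anything, which would be consistent with the note above.

```python
# Illustrative micro-benchmark (not part of PaddlePaddle): measure aggregate TCP
# throughput with N parallel streams to see whether extra connections/ports help
# once the link is saturated. Uses loopback here; split sender/receiver across
# two nodes for a meaningful cluster measurement.
import socket
import threading
import time

PAYLOAD = b"x" * (1 << 20)   # 1 MiB per send() call
DURATION = 5.0               # seconds of sending per run

def serve(port):
    # Accept a single connection and drain everything it sends.
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(1)
    conn, _ = srv.accept()
    while conn.recv(1 << 20):
        pass
    conn.close()
    srv.close()

def send(host, port, counter, lock):
    # Push data for DURATION seconds and record how many bytes went out.
    sock = socket.create_connection((host, port))
    sent = 0
    deadline = time.time() + DURATION
    while time.time() < deadline:
        sock.sendall(PAYLOAD)
        sent += len(PAYLOAD)
    sock.close()
    with lock:
        counter[0] += sent

def run(num_streams, host="127.0.0.1", base_port=7164):
    counter, lock = [0], threading.Lock()
    servers = [threading.Thread(target=serve, args=(base_port + i,))
               for i in range(num_streams)]
    for t in servers:
        t.start()
    time.sleep(0.5)  # give the listeners time to bind
    clients = [threading.Thread(target=send, args=(host, base_port + i, counter, lock))
               for i in range(num_streams)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    for t in servers:
        t.join()
    gbits = counter[0] * 8 / DURATION / 1e9
    print("%d stream(s): %.2f Gbit/s aggregate" % (num_streams, gbits))

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        run(n)
```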
Do you use multiple threads on multiple CPUs? @typhoonzero
> - Batch Size: 20
> - Metrics: samples / sec
> | Trainer Count | local | 4 | 10 | 20 |
Does the trainer run locally when the pserver count is 4?
gongweibao left a comment
Just simple questions.
Will update the latest data in a new PR.
Related: #6908
As shown in the README.md, fluid distributed training with GPUs across multiple nodes hits a network bottleneck, yet fluid still performs much better than v2. I'll look into the details to find out the reason, and then try to increase the throughput of the send/recv op. For fluid, `send_op` takes 90% of the time of each mini-batch. I'll add more test results with more nodes, to see how far multi-GPU multi-node training can scale.
PS: increasing the number of ports to 8 does not improve Paddle v2 dist train performance at all, but slows it down.
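For reference, a minimal sketch of how the samples/sec metric and the share of step time spent in communication could be computed around the training loop. `run_minibatch` is a hypothetical stand-in for the real `exe.run(...)` call, not a Fluid API, and the numbers at the bottom are placeholders.

```python
# Sketch only: compute the "samples / sec" metric from wall-clock time per step,
# and the fraction of a step spent in send/recv (reported above as ~90% for send_op).
# `run_minibatch` is a hypothetical callable standing in for the real training step.
import time

BATCH_SIZE = 20  # matches "Batch Size: 20" in the benchmark setup

def throughput(run_minibatch, num_batches=100, warmup=10):
    """Time num_batches training steps and return samples/sec."""
    for _ in range(warmup):          # skip warm-up steps (program build, cuDNN autotune, ...)
        run_minibatch()
    start = time.time()
    for _ in range(num_batches):
        run_minibatch()
    elapsed = time.time() - start
    return BATCH_SIZE * num_batches / elapsed

def comm_share(step_time, send_recv_time):
    """Fraction of one step spent in communication."""
    return send_recv_time / step_time

if __name__ == "__main__":
    fake_step = lambda: time.sleep(0.01)   # placeholder for the real exe.run(...) call
    print("throughput: %.2f samples/sec" % throughput(fake_step))
    print("comm share: %.0f%%" % (100 * comm_share(0.50, 0.45)))  # placeholder timings
```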