
Conversation

@typhoonzero (Contributor) commented Feb 24, 2018

Related: #6908

As shown in the README.md, fluid distributed training with GPUs on multiple nodes encounters a network bottleneck, although fluid still performs much better than v2. I'll look into the details to find out the reason, and then try to increase the throughput of the send/recv op.

For fluid, send_op takes about 90% of the time of each mini-batch.

I'll add more test results with an increased number of nodes, to see how much multi-GPU multi-node training can achieve.

PS: increasing the number of ports to 8 does not improve Paddle v2 distributed training performance at all; it actually slows it down.
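
For reference, the per-mini-batch breakdown and the samples/sec metric used below can be measured with a timing loop like the following minimal sketch. It is not the benchmark script behind these numbers; `forward_backward` and `send_gradients` are hypothetical callables standing in for the compute and communication parts of one training step.

```python
import time

def profile_steps(forward_backward, send_gradients, batch_size, num_batches=100):
    """Attribute per-mini-batch time to compute vs. parameter send and
    report throughput in samples/sec."""
    compute_time = 0.0
    send_time = 0.0
    for _ in range(num_batches):
        t0 = time.time()
        forward_backward()      # local forward + backward pass
        t1 = time.time()
        send_gradients()        # send gradients / fetch updated parameters
        t2 = time.time()
        compute_time += t1 - t0
        send_time += t2 - t1
    total = compute_time + send_time
    print("send fraction: %.1f%%" % (100.0 * send_time / total))
    print("throughput: %.2f samples/sec" % (num_batches * batch_size / total))
```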

@panyx0718 self-requested a review February 24, 2018 13:24
@panyx0718 (Contributor) commented:

I notice there are some TensorFlow numbers still to be filled in.
This is the benchmark the TF team recommends using; it may be useful in the future:
https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks

@helinwang (Contributor) left a comment:

Thank you! LGTM!

*The performance gap between Fluid and v2 comes from network interference.*


## Test Result (GPU with flowers)
@helinwang (Contributor) commented Feb 28, 2018:

Maybe change "flowers" to "flowers dataset", otherwise it's hard to understand what "flowers" means. Btw, does which dataset we use matter for the performance benchmark?


### Hardware Information

- GPU: NVIDIA Tesla P40, Driver Version: 375.26
A contributor commented:

FYI: we may need to update the NVIDIA driver in the cluster soon; the NCCL 2 we currently use does not support 380.x (tested to work on the latest 390.x).

| PaddlePaddle v2 (1 port) | 22.06 | 6.0 | - | - |

NOTE: GPU cluster training has a communication bottleneck.
NOTE: increasing the number of ports for PaddlePaddle v2 will not increase the total throughput.
A contributor commented:

It's a little hard to see why increasing the number of ports could increase the throughput. I thought the throughput is limited by the hardware layer, not the TCP protocol layer.
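
To make the multi-port question concrete, the sketch below shows the idea behind using several ports: shard one payload across multiple TCP connections and send the shards in parallel. This is plain Python sockets, not PaddlePaddle's actual transport, and `host`/`base_port` are hypothetical pserver endpoints; if a single connection already saturates the NIC, the extra connections bring no gain, which matches the v2 observation above.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def send_chunk(host, port, payload):
    """Send one shard of the update over its own TCP connection."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(payload)

def send_over_ports(host, base_port, num_ports, blob):
    """Split `blob` into num_ports shards and send them concurrently,
    one connection per port (illustration only)."""
    shard = (len(blob) + num_ports - 1) // num_ports
    chunks = [blob[i:i + shard] for i in range(0, len(blob), shard)]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [pool.submit(send_chunk, host, base_port + i, chunk)
                   for i, chunk in enumerate(chunks)]
        for f in futures:
            f.result()  # surface any connection error
```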

A contributor commented:

Do you use multiple threads on multiple CPUs? @typhoonzero

- Batch Size: 20
- Metrics: samples / sec

| Trainer Count | local | 4 | 10 | 20 |
A contributor commented:

Does the trainer run locally when the pserver count is 4?

@gongweibao (Contributor) left a comment:

Just simple questions.

@typhoonzero (Contributor, Author) commented:

Will update the latest data in a new PR.

@typhoonzero closed this Apr 2, 2018
@typhoonzero deleted the update_v2_vgg_benchmard branch April 2, 2018 02:09