Dev paddle plsc arcface #130


Open · wants to merge 5 commits into base: master

Conversation

Flowingsun007 (Contributor)

No description provided.

@Flowingsun007 Flowingsun007 requested a review from guo-ran March 12, 2021 09:22
@Flowingsun007 Flowingsun007 marked this pull request as ready for review March 15, 2021 01:52
@Flowingsun007 Flowingsun007 requested a review from nlqq March 20, 2021 00:25
@yuanms2 commented Mar 29, 2021

What is the conclusion of the tests? Could you put the oneflow and plsc results side by side for comparison?

@Flowingsun007 (Contributor, Author) commented Mar 29, 2021

> What is the conclusion of the tests? Could you put the oneflow and plsc results side by side for comparison?

Sure, I will add a comparison section to the README.

@yuanms2 commented Mar 29, 2021

> > What is the conclusion of the tests? Could you put the oneflow and plsc results side by side for comparison?
>
> Sure, I will add a comparison section to the README.

You could also put it in the PR discussion. None of the other comparison experiments in this repository put other frameworks side by side with oneflow.

@Flowingsun007 (Contributor, Author) commented Mar 29, 2021

ArcFace-ResNet50 test results: comparison and notes

Using the same hardware environment, the same network, and the same dataset, we benchmarked two frameworks (paddle-plsc and oneflow) on clusters ranging from a single GPU on one machine up to 4 machines with 32 GPUs, comparing throughput, speedup, and other key performance metrics for large-scale face-recognition training with model parallelism.

Test environment

To measure the performance of the frameworks themselves as fairly as possible, all tests were run on the same physical cluster with the same software environment. The test cluster consists of 4 machines, each equipped with 8 V100 GPUs (each machine's configuration is close to an NVIDIA DGX-1). The hardware and software configuration of each machine is as follows:

  • Tesla V100-SXM2-16GB x 8
  • InfiniBand 100 Gb/sec (4X EDR), Mellanox Technologies MT27700 Family
  • Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
  • Memory 384G
  • Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
  • CUDA Version: 10.2, Driver Version: 440.33.01
```
nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  CPU Affinity
GPU0     X      NV1     NV1     NV2     NV2     SYS     SYS     SYS     NODE    0-11,24-35
GPU1    NV1      X      NV2     NV1     SYS     NV2     SYS     SYS     NODE    0-11,24-35
GPU2    NV1     NV2      X      NV2     SYS     SYS     NV1     SYS     PIX     0-11,24-35
GPU3    NV2     NV1     NV2      X      SYS     SYS     SYS     NV1     PIX     0-11,24-35
GPU4    NV2     SYS     SYS     SYS      X      NV1     NV1     NV2     SYS     12-23,36-47
GPU5    SYS     NV2     SYS     SYS     NV1      X      NV2     NV1     SYS     12-23,36-47
GPU6    SYS     SYS     NV1     SYS     NV1     NV2      X      NV2     SYS     12-23,36-47
GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2      X      SYS     12-23,36-47
mlx5_0  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Test network and dataset

All tests use the same large-scale face-classification network with a ResNet50 backbone and an ArcFace loss, trained on the MS1M-ArcFace dataset. The parameters of the final fully connected (FC) layer are partitioned with model parallelism, i.e. the FC weights are split across the GPUs. In addition, all tests use regular FP32 precision and the same per-device batch size (128). For more details, see the paddle-plsc README and the oneflow README.
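The model-parallel FC split described above can be illustrated with a small NumPy sketch (a hypothetical illustration, not the actual OneFlow or Paddle API): the classification weight matrix is partitioned column-wise, so each device stores and computes logits for only its own slice of classes, and concatenating the per-device partial logits recovers the full-FC result.

```python
# Hypothetical NumPy sketch of a column-wise model-parallel FC layer.
# The sizes below are tiny stand-ins: MS1M-ArcFace has ~85K identities
# and the face embedding is typically 512-d.
import numpy as np

num_classes = 8   # stand-in for the real number of identities
emb_dim = 4       # stand-in for the embedding dimension
num_devices = 2   # stand-in for the GPUs that hold FC shards

rng = np.random.default_rng(0)
features = rng.standard_normal((3, emb_dim))          # a batch of embeddings
weight = rng.standard_normal((emb_dim, num_classes))  # the full FC weight

# Each "device" keeps only num_classes / num_devices columns of the weight.
shards = np.split(weight, num_devices, axis=1)

# Every device sees the full feature batch but computes logits
# only for its own shard of classes.
partial_logits = [features @ w for w in shards]

# Concatenating the per-device shards reproduces the full-FC logits.
full_logits = np.concatenate(partial_logits, axis=1)
assert np.allclose(full_logits, features @ weight)
```

In a real training run each shard lives on a different GPU, so no single device ever materializes the full weight matrix; that is what makes the very large FC layer fit in memory.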

Test result comparison

paddle-plsc

| node_num | gpu_num_per_node | batch_size_per_device | samples/s | speedup |
| -------- | ---------------- | --------------------- | --------- | ------- |
| 1        | 1                | 128                   | 397.78    | 1.00    |
| 1        | 4                | 128                   | 1539.66   | 3.87    |
| 1        | 8                | 128                   | 2545.30   | 6.40    |
| 2        | 8                | 128                   | 5953.84   | 14.97   |
| 4        | 8                | 128                   | 11084.53  | 27.87   |

oneflow

| node_num | gpu_num_per_node | batch_size_per_device | samples/s | speedup |
| -------- | ---------------- | --------------------- | --------- | ------- |
| 1        | 1                | 128                   | 424.75    | 1.00    |
| 1        | 4                | 128                   | 1652.16   | 3.89    |
| 1        | 8                | 128                   | 3278.55   | 7.72    |
| 2        | 8                | 128                   | 6343.74   | 14.94   |
| 4        | 8                | 128                   | 12320.24  | 29.01   |
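The speedup column in both tables is throughput relative to the single-GPU run. As a quick sanity check, the oneflow column can be reproduced in a few lines of Python (the throughput numbers are taken directly from the table above):

```python
# Speedup = throughput on N total GPUs / single-GPU throughput.
# Keys are total GPU counts; values are samples/s from the oneflow table.
oneflow_throughput = {1: 424.75, 4: 1652.16, 8: 3278.55, 16: 6343.74, 32: 12320.24}

base = oneflow_throughput[1]
speedup = {gpus: round(tput / base, 2) for gpus, tput in oneflow_throughput.items()}
print(speedup)  # {1: 1.0, 4: 3.89, 8: 7.72, 16: 14.94, 32: 29.01}
```

Perfectly linear scaling would give a speedup equal to the GPU count (e.g. 32 on 4 machines), so the 29.01 figure corresponds to roughly 91% scaling efficiency.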

Comparison summary

  • On a single GPU, oneflow's throughput is 424.75 samples/s versus paddle-plsc's 397.78, about 6.8% faster;
  • On a single machine with 8 GPUs, throughput is oneflow 3278.55 vs paddle-plsc 2545.30, making oneflow about 28.8% faster;
  • At 4 machines, the speedup is oneflow 29.01 vs paddle-plsc 27.87; oneflow is closer to linear speedup (32), so it trains faster in multi-machine settings.

Conclusion: overall, oneflow trains this large-scale face-classification model faster and with a higher speedup in both single-machine and multi-machine settings, showing better framework performance.
(In addition, it uses GPU memory more efficiently: under the same conditions its GPU memory footprint is lower.)

@yuanms2 commented Mar 29, 2021

OK, thanks for the work. Note that 韩广云's tests were run in an environment with 10 Gbps network bandwidth, which differs from our configuration.

@Flowingsun007 (Contributor, Author) commented Mar 29, 2021

> OK, thanks for the work. Note that 韩广云's tests were run in an environment with 10 Gbps network bandwidth, which differs from our configuration.

Right, all of the numbers posted above were measured in the same environment on the leinao cluster.
