
update the performance page of MXNet. #10761

Merged 1 commit into apache:master on May 2, 2018

Conversation

zheng-da
Contributor

@zheng-da zheng-da commented May 1, 2018

Description

This updates the MXNet performance page with new CPU and GPU results.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zheng-da zheng-da requested a review from szha as a code owner May 1, 2018 03:01
@zheng-da
Contributor Author

zheng-da commented May 1, 2018

@mli do you want to review and merge it?

@pengzhao-intel
Contributor

@zheng-da did you set KMP affinity when testing CPU performance?

@zheng-da
Contributor Author

zheng-da commented May 1, 2018

@pengzhao-intel Yes, I did. Did you find anything unexpected?

| Batch | Alexnet | VGG   | Inception-BN | Inception-v3 | Resnet 50 | Resnet 152 |
|-------|---------|-------|--------------|--------------|-----------|------------|
| 1     | 243.93  | 43.59 | 68.62        | 35.52        | 67.41     | 23.65      |
| 32    | 4883.77 | 854.4 | 1197.74      | 493.72       | 713.17    | 294.17     |

Contributor

Seems like we're having a few regressions :(

Contributor Author

First, @mli told me that M60 is supposed to be slower than M40.
VGG is slower because a different model was used. The original performance was measured a long time ago. Since then, the implementation of VGG has changed. The current version of VGG has many more layers. If you want to know more details, I think @TaoLv can tell you more.

Member

Right. benchmark_score.py was changed last December (commit) and the VGG test was updated from VGG-11 to VGG-16. The perf numbers in this PR are measured on VGG-16, while the previous MKLML perf numbers were measured on VGG-11.
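
For context, here is a minimal throughput sketch in the spirit of what benchmark_score.py measures (the helper below, its input shapes, and batch counts are illustrative assumptions, not the script's actual code); running it for both depths shows why VGG-11 and VGG-16 numbers are not directly comparable:

```python
# Illustrative sketch only; not the actual benchmark_score.py code.
# Assumes an MXNet build with the Gluon model zoo available.
import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

def images_per_sec(net, batch_size=32, num_batches=20, ctx=mx.cpu()):
    net.initialize(ctx=ctx)
    x = mx.nd.random.uniform(shape=(batch_size, 3, 224, 224), ctx=ctx)
    net(x).wait_to_read()          # warm-up pass
    start = time.time()
    for _ in range(num_batches):
        net(x).wait_to_read()      # block until the forward pass finishes
    return batch_size * num_batches / (time.time() - start)

# VGG-16 is much deeper than VGG-11, so its images/sec is expected to be
# noticeably lower on the same hardware.
print("vgg11:", images_per_sec(vision.vgg11()))
print("vgg16:", images_per_sec(vision.vgg16()))
```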

Contributor Author

Actually, this change in VGG applies to all benchmark results, not just MKLML vs. MKLDNN.

Contributor

@marcoabreu marcoabreu left a comment

Thanks for updating the chart! It seems like we're having some regressions on the GPU version of MXNet. It would be great if somebody could follow up on those.

@pengzhao-intel
Contributor

That's good.
Maybe my data is a little out of date; we tested about a month ago with the master branch.
The data below is from an AWS EC2 C5.18xlarge.

For the large batch size (BS=32), the performance of Alexnet/VGG/Inception-xx has a slight drop.
The other results improved a lot, especially for small batch sizes.

Each cell shows this PR's number / our earlier measurement (images/sec):

| Batch | Alexnet | VGG | Inception-BN | Inception-v3 | Resnet 50 | Resnet 152 |
|-------|---------|-----|--------------|--------------|-----------|------------|
| 1 | 390.53 / 253.92 | 81.57 / 74.66 | 124.13 / 99.56 | 62.26 / 52.91 | 76.22 / 69.99 | 32.92 / 27.96 |
| 2 | 596.45 / 441.89 | 100.84 / 101.57 | 206.58 / 165.66 | 93.36 / 87.23 | 119.55 / 105.04 | 46.8 / 40.27 |
| 4 | 710.77 / 584.52 | 119.04 / 109.45 | 275.55 / 266.70 | 127.86 / 132.27 | 148.62 / 149.86 | 59.36 / 56.50 |
| 8 | 921.4 / 810.96 | 120.38 / 115.13 | 380.82 / 372.59 | 157.11 / 167.24 | 167.95 / 181.04 | 70.78 / 75.85 |
| 16 | 1018.43 / 1146.89 | 115.3 / 124.87 | 411.67 / 466.26 | 168.71 / 181.07 | 178.54 / 188.69 | 75.13 / 82.46 |
| 32 | 1290.31 / 1458.73 | 107.19 / 126.04 | 483.34 / 518.18 | 179.38 / 182.45 | 193.47 / 186.43 | 85.86 / 81.10 |

@zheng-da
Contributor Author

zheng-da commented May 1, 2018

@pengzhao-intel I'm not sure why the performance for large batch sizes gets worse. It seems to me that your performance was measured after the PRs that improved the performance of MKLDNN a while ago (otherwise, Alexnet should have much worse performance). However, the performance I just measured on C5.18x matches the performance I saw when I wrote the blog.
Can you find out which commit you used to measure the performance? Or which day?

@pengzhao-intel
Contributor

pengzhao-intel commented May 1, 2018

SW: the master branch of MXNet, commit id: 48749a5

@zheng-da
Contributor Author

zheng-da commented May 1, 2018

The commit was two months ago. It's surprising that the performance was better for large batch sizes. I'll try it again tomorrow.

@zheng-da
Contributor Author

zheng-da commented May 1, 2018

@pengzhao-intel Here is the performance result on C5.18x for commit id: 48749a5. It's different from yours. Before running the benchmark, I set the thread affinity and the number of OMP threads as below. Do you see anything wrong?

ubuntu@ip-172-31-14-124:~/incubator-mxnet$  export KMP_AFFINITY=granularity=fine,compact,1,0
ubuntu@ip-172-31-14-124:~/incubator-mxnet$ cat /proc/cpuinfo  | grep processor | wc -l
72
ubuntu@ip-172-31-14-124:~/incubator-mxnet$ export OMP_NUM_THREADS=36
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 281.905581
INFO:root:batch size  2, image/sec: 476.257437
INFO:root:batch size  4, image/sec: 638.438876
INFO:root:batch size  8, image/sec: 909.938360
INFO:root:batch size 16, image/sec: 1072.400037
INFO:root:batch size 32, image/sec: 1439.211819
INFO:root:network: vgg-16
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 48.835769
INFO:root:batch size  2, image/sec: 107.220745
INFO:root:batch size  4, image/sec: 113.991915
INFO:root:batch size  8, image/sec: 122.755506
INFO:root:batch size 16, image/sec: 113.957188
INFO:root:batch size 32, image/sec: 108.641076
INFO:root:network: inception-bn
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 103.943418
INFO:root:batch size  2, image/sec: 175.506375
INFO:root:batch size  4, image/sec: 235.703251
INFO:root:batch size  8, image/sec: 372.001180
INFO:root:batch size 16, image/sec: 432.571657
INFO:root:batch size 32, image/sec: 510.245680
INFO:root:network: inception-v3
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 55.605913
INFO:root:batch size  2, image/sec: 89.754883
INFO:root:batch size  4, image/sec: 130.370849
INFO:root:batch size  8, image/sec: 160.991983
INFO:root:batch size 16, image/sec: 167.312594
INFO:root:batch size 32, image/sec: 183.668760
INFO:root:network: resnet-50
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 69.624827
INFO:root:batch size  2, image/sec: 104.606805
INFO:root:batch size  4, image/sec: 154.999145
INFO:root:batch size  8, image/sec: 171.728000
INFO:root:batch size 16, image/sec: 167.945690
INFO:root:batch size 32, image/sec: 169.422334
INFO:root:network: resnet-152
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 30.524941
INFO:root:batch size  2, image/sec: 41.807083
INFO:root:batch size  4, image/sec: 59.314175
INFO:root:batch size  8, image/sec: 74.127132
INFO:root:batch size 16, image/sec: 73.972906
INFO:root:batch size 32, image/sec: 71.838468
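
In case it helps with reproducing the run above, here is a small Python sketch that applies the same thread settings and invokes the benchmark script (the example/image-classification path is my assumption about where benchmark_score.py lives in the source tree; adjust if it differs):

```python
# Sketch: reproduce the CPU run above with the same thread settings.
# Assumes it is run from the root of the incubator-mxnet source tree and that
# the benchmark script lives at example/image-classification/benchmark_score.py.
import multiprocessing
import os
import subprocess

env = dict(os.environ)
env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
# 36 OMP threads: half of the 72 hyperthreaded vCPUs on a C5.18xlarge.
env["OMP_NUM_THREADS"] = str(multiprocessing.cpu_count() // 2)

subprocess.run(
    ["python", "example/image-classification/benchmark_score.py"],
    env=env,
    check=True,
)
```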

@pengzhao-intel
Contributor

The settings are the same as ours.

From your log:
Alexnet BS=32 is 1439, which matches our 1458 (your current number is 1290).
VGG-16 BS=32 is 108, which doesn't match our 126 (your current number is 108).
Inception-BN BS=32 is 510, which matches our 518 (your current number is 483).
Inception-v3 BS=32 is 183, which matches our 182 (your current number is 179).

So the Alexnet and Inception data are similar but VGG is different, right?

@huangzhiyuan can provide more details.

@piiswrong piiswrong merged commit ebd8a6b into apache:master May 2, 2018
marcoabreu added a commit that referenced this pull request May 3, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 7, 2018
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request May 29, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
@zheng-da zheng-da deleted the update_perf branch September 29, 2018 21:34