This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-99] Upgrade to cuda 9.1 cudnn 7 #10108

Merged

marcoabreu merged 4 commits into apache:master from the Upgrade-cuda9-cudnn7 branch on Mar 21, 2018

Conversation

marcoabreu
Contributor

@marcoabreu marcoabreu commented Mar 14, 2018

Description

This PR upgrades our CI from CUDA 8.0 to CUDA 9.1 and CuDNN from 5 to 7.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • CUDA 8.0 -> 9.1
  • CuDNN 5 -> 7

Comments

This change requires an update to our slaves because CUDA 9.1 is not supported by nvidia-docker; it requires nvidia-docker2, which will be provided with the new slaves.

I had to make some lint fixes. Although we pin the pylint version, it still seems to be flaky about detecting some things; I don't really understand what's going on there. I've tried older versions as well, but it always found some problems.

@marcoabreu marcoabreu changed the title Upgrade to cuda 9.1 cudnn 7 [MXNET-99] Upgrade to cuda 9.1 cudnn 7 Mar 14, 2018
@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch from 4b5ae82 to d3cac2f Compare March 14, 2018 15:58
@szha
Member

szha commented Mar 14, 2018

How about keeping both?

@marcoabreu
Contributor Author

This would increase maintenance effort and result in a lot more jobs and thus more cost. We should try to avoid mixing integration tests (to that extent) with PR validation.

@szha
Member

szha commented Mar 14, 2018

We still need a way to track changes that break cuda/cudnn compatibility. For example, when depth-wise convolution was added, a low-level shuffle intrinsic was used in the implementation, which is not supported in sm_20 (as in the symbol doesn't exist for sm_20). This broke the build because sm_20 was still enabled at the time.

A good place to put this verification is the nightly tests. Given that the nightly tests will eventually be moved to the public CI, the maintenance effort needs to be invested sooner or later. It would be good if we take that into consideration.

@piiswrong
Contributor

sm_20 has nothing to do with cuda or cudnn versions.
We should add release tests that verify against various cuda and cudnn versions. But that's another discussion.

@marcoabreu
Contributor Author

marcoabreu commented Mar 16, 2018 via email

@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch 5 times, most recently from 3843bf5 to fdf6739 Compare March 20, 2018 17:27
@marcoabreu marcoabreu requested a review from szha as a code owner March 20, 2018 17:27
@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch 2 times, most recently from 8d9ae97 to b644156 Compare March 20, 2018 17:32
@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch 2 times, most recently from c3878cd to f21c468 Compare March 20, 2018 22:01
@marcoabreu
Contributor Author

Hello, I have just finished working on this PR. It would be great if somebody could review and approve it. This would allow me to self-merge the PR while everybody in the US is asleep - that way, we won't have any interruption of development due to downtimes.

@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch from f21c468 to f81a8fa Compare March 20, 2018 23:12
@@ -424,13 +424,15 @@ def set_gradient_compression(self, compression_params):
        Other keys in this dictionary are optional and specific to the type
        of gradient compression.
        """
        # pylint: disable=unsupported-membership-test
        if ('device' in self.type) or ('dist' in self.type):
Member

Please use inline disable on the if block. Same for other occurrences.
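
For illustration, here is a minimal sketch (not part of the PR) of the difference between the standalone disable shown in the diff above and the inline form requested here; the class and attribute names are hypothetical stand-ins for the kvstore code.

class KVStoreLike:
    """Hypothetical stand-in for the kvstore object touched in the diff."""

    def __init__(self, type_str):
        self.type = type_str

    def block_level(self):
        # A standalone disable comment suppresses the check from this point
        # to the end of the enclosing scope.
        # pylint: disable=unsupported-membership-test
        if ('device' in self.type) or ('dist' in self.type):
            return True
        return False

    def inline(self):
        # A trailing (inline) disable comment suppresses the check only on
        # the line that carries it, which is what the reviewer is asking for.
        if ('device' in self.type) or ('dist' in self.type):  # pylint: disable=unsupported-membership-test
            return True
        return False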

@@ -26,10 +26,7 @@
#
# BUILD_ID: the current build ID for the specified PR
#
set -ex
Member

Do you need set +ex at the end to disable?

Contributor Author

No, this is restricted to the current scope (aka this script).

Member

Cool. As long as the script is not executed through source it's fine.

@@ -848,7 +848,7 @@ def fit(self, X, y=None, eval_data=None, eval_metric='acc',
        # init optmizer
        if isinstance(self.optimizer, str):
            batch_size = data.batch_size
-           if kvstore and 'dist' in kvstore.type and not '_async' in kvstore.type:
+           if kvstore and 'dist' in kvstore.type and '_async' not in kvstore.type:
                batch_size *= kvstore.num_workers
Member

Is this unit tested to validate that it still works after these changes? If so, where?

Member

We need a unit test for both cases (in and not in).

Contributor Author

@marcoabreu marcoabreu Mar 21, 2018

I didn't change the logic; this is logically equivalent and was forced by pylint. Feel free to ask @mli, it's his code. I think that request is out of scope for this PR.
I'm happy to explain why NOT(TRUE) is equivalent to IS(FALSE).
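
As a quick, hedged illustration of the equivalence being discussed (the kvstore type strings below are made-up examples, not from the PR):

for kv_type in ('dist_sync', 'dist_async', 'device', 'local'):
    old_form = not '_async' in kv_type   # the original expression (flagged by pylint)
    new_form = '_async' not in kv_type   # the form pylint asks for
    assert old_form == new_form          # both parse to the same membership test, so behaviour is unchanged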

Member

Let's not use a veto for this purpose if possible.

Member

Why did lint never catch this before?
How can we be sure this has the same effect?

Member

@cjolivier01 I'm referring to your comment in #9931 (comment)
[screenshot of the referenced comment]

Member

Then came the following:
[two screenshots of the follow-up comments]
Clearly there are conflicting messages, so I think it would be better to clarify them.

Member

@szha see the email there where you personally pointed out your view that committer questions need not always be answered, thus leaving no choice but to block with a "request changes" review until the requester makes the changes or makes sufficient arguments against them (or, in this case, shows they are not needed).

5th paragraph of second email from top:

https://lists.apache.org/thread.html/5f7843a77f58239dbf317f8fc811be1560d0f6602514adc5487811c6@%3Cdev.mxnet.apache.org%3E

Member

@cjolivier01 I'm merely trying to understand what your take is on that, given that you expressed that you didn't intend to veto while still submitting a "request changes" review.

Member

So what's the verdict?

@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch from 1534b09 to 1d5edae Compare March 21, 2018 00:27
@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch from 1d5edae to b76809f Compare March 21, 2018 00:30
Member

@cjolivier01 cjolivier01 left a comment

.

Contributor

@KellenSunderland KellenSunderland left a comment

Looks good. This probably could have been split up into a pylint change PR and a CUDA change PR.

@marcoabreu marcoabreu merged commit b0a6760 into apache:master Mar 21, 2018
@marcoabreu marcoabreu deleted the Upgrade-cuda9-cudnn7 branch March 21, 2018 15:13
ashokei pushed a commit to ashokei/incubator-mxnet that referenced this pull request Mar 27, 2018
* Upgrade to cuda 9.1 cudnn 7

* Pin pylint version

* Remove aws cli installation

* Address review comments
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
* Upgrade to cuda 9.1 cudnn 7

* Pin pylint version

* Remove aws cli installation

* Address review comments
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Upgrade to cuda 9.1 cudnn 7

* Pin pylint version

* Remove aws cli installation

* Address review comments
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* Upgrade to cuda 9.1 cudnn 7

* Pin pylint version

* Remove aws cli installation

* Address review comments