This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-99] Upgrade to cuda 9.1 cudnn 7 #10108

Merged

marcoabreu merged 4 commits into apache:master from the Upgrade-cuda9-cudnn7 branch on Mar 21, 2018

Conversation

marcoabreu
Contributor

@marcoabreu marcoabreu commented Mar 14, 2018

Description

This PR upgrades our CI from CUDA 8.0 to CUDA 9.1 and CuDNN from 5 to 7.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • CUDA 8.0 -> 9.1
  • CuDNN 5 -> 7

Comments

This change requires an update to our slaves because CUDA 9.1 is not supported by nvidia-docker; it requires nvidia-docker2, which will be provided with the new slaves.

I had to make some lint fixes. Although we pin the pylint version, it still seems to be flaky about detecting some things; I don't really understand what's going on there. I've tried older versions as well, but it always found some problems.

@marcoabreu marcoabreu changed the title Upgrade to cuda 9.1 cudnn 7 [MXNET-99] Upgrade to cuda 9.1 cudnn 7 Mar 14, 2018
@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch from 4b5ae82 to d3cac2f Compare March 14, 2018 15:58
@szha
Member

szha commented Mar 14, 2018

How about keeping both?

@marcoabreu
Contributor Author

This would increase maintenance effort and result in a lot more jobs and thus more cost. We should try to avoid mixing integration tests (to that extent) with PR validation.

@szha
Member

szha commented Mar 14, 2018

We still need a way to track changes that break cuda/cudnn compatibility. For example, when depth-wise convolution was added, a low-level shuffle intrinsic was used in the implementation, which is not supported in sm_20 (as in the symbol doesn't exist for sm_20). This broke the build because sm_20 was still enabled at the time.

A good place to put this verification is the nightly tests. Given that the nightly tests will eventually be moved to the public CI, the maintenance effort needs to be invested sooner or later. It would be good if we take that into consideration.

@piiswrong
Contributor

sm_20 has nothing to do with cuda or cudnn versions.
We should add release tests that verify against various cuda and cudnn versions. But that's another discussion.

@marcoabreu
Contributor Author

marcoabreu commented Mar 16, 2018 via email

@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch 5 times, most recently from 3843bf5 to fdf6739 Compare March 20, 2018 17:27
@marcoabreu marcoabreu requested a review from szha as a code owner March 20, 2018 17:27
@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch 2 times, most recently from 8d9ae97 to b644156 Compare March 20, 2018 17:32
@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch 2 times, most recently from c3878cd to f21c468 Compare March 20, 2018 22:01
@marcoabreu
Contributor Author

Hello, I have just finished working on this PR. It would be great if somebody could review and approve it. This would allow me to self-merge the PR while everybody in the US is asleep - that way, we won't have any interruption of development due to downtimes.

@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch from f21c468 to f81a8fa Compare March 20, 2018 23:12
@@ -424,13 +424,15 @@ def set_gradient_compression(self, compression_params):
        Other keys in this dictionary are optional and specific to the type
        of gradient compression.
        """
        # pylint: disable=unsupported-membership-test
        if ('device' in self.type) or ('dist' in self.type):
Member

Please use inline disable on the if block. Same for other occurrences.
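
For illustration, here is a minimal sketch (not part of the PR) of the difference between the standalone disable shown in the diff above and the inline form requested here; the class and attribute names are hypothetical stand-ins for the kvstore code.

class KVStoreLike:
    """Hypothetical stand-in for the kvstore object touched in the diff."""

    def __init__(self, type_str):
        self.type = type_str

    def block_level(self):
        # A standalone disable comment suppresses the check from this point
        # to the end of the enclosing scope.
        # pylint: disable=unsupported-membership-test
        if ('device' in self.type) or ('dist' in self.type):
            return True
        return False

    def inline(self):
        # A trailing (inline) disable comment suppresses the check only on
        # the line that carries it, which is what the reviewer is asking for.
        if ('device' in self.type) or ('dist' in self.type):  # pylint: disable=unsupported-membership-test
            return True
        return False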

@@ -26,10 +26,7 @@
#
# BUILD_ID: the current build ID for the specified PR
#
set -ex
Member

Do you need set +ex at the end to disable?

Contributor Author

No, this is restricted to the current scope (aka this script).

Member

Cool. As long as the script is not executed through source it's fine.

@@ -848,7 +848,7 @@ def fit(self, X, y=None, eval_data=None, eval_metric='acc',
        # init optmizer
        if isinstance(self.optimizer, str):
            batch_size = data.batch_size
-           if kvstore and 'dist' in kvstore.type and not '_async' in kvstore.type:
+           if kvstore and 'dist' in kvstore.type and '_async' not in kvstore.type:
                batch_size *= kvstore.num_workers
Member

Is this unit tested to validate that it still works after these changes? If so, where?

Member

We need a unit test for both cases (in and not in).

Contributor Author

@marcoabreu marcoabreu Mar 21, 2018

I didn't change the logic; this is logically equivalent and was forced by pylint. Feel free to ask @mli, it's his code. I think that request is out of scope for this PR.
I'm happy to explain why NOT(TRUE) is equivalent to IS(FALSE).
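
As a quick, hedged illustration of the equivalence being discussed (the kvstore type strings below are made-up examples, not from the PR):

for kv_type in ('dist_sync', 'dist_async', 'device', 'local'):
    old_form = not '_async' in kv_type   # the original expression (flagged by pylint)
    new_form = '_async' not in kv_type   # the form pylint asks for
    assert old_form == new_form          # both parse to the same membership test, so behaviour is unchanged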

Member

Let's not use a veto for this purpose if possible.

Member

Why did lint never catch this before?
How can we be sure this has the same effect?

Member

@cjolivier01 I'm referring to your comment in #9931 (comment)
[screenshot of the referenced comment]

Member

Then came the following:
[two screenshots of the follow-up comments]
Clearly there are conflicting messages, so I think it would be better to clarify them.

Member

@szha see the email there where you personally pointed out your view that committer questions need not always be answered, thus leaving no choice but to block with a "request changes" review until the requester makes the changes or makes sufficient arguments against them (or, in this case, shows they are not needed).

5th paragraph of second email from top:

https://lists.apache.org/thread.html/5f7843a77f58239dbf317f8fc811be1560d0f6602514adc5487811c6@%3Cdev.mxnet.apache.org%3E

Member

@cjolivier01 I'm merely trying to understand what your take is on that, given that you expressed that you didn't intend to veto while still submitting a "request changes" review.

Member

So what's the verdict?

@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch from 1534b09 to 1d5edae Compare March 21, 2018 00:27
@marcoabreu marcoabreu force-pushed the Upgrade-cuda9-cudnn7 branch from 1d5edae to b76809f Compare March 21, 2018 00:30
Member

@cjolivier01 cjolivier01 left a comment

.

Contributor

@KellenSunderland KellenSunderland left a comment

Looks good. This probably could have been split up into a pylint change PR and a CUDA change PR.

@marcoabreu marcoabreu merged commit b0a6760 into apache:master Mar 21, 2018
@marcoabreu marcoabreu deleted the Upgrade-cuda9-cudnn7 branch March 21, 2018 15:13
ashokei pushed a commit to ashokei/incubator-mxnet that referenced this pull request Mar 27, 2018
* Upgrade to cuda 9.1 cudnn 7

* Pin pylint version

* Remove aws cli installation

* Address review comments
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
* Upgrade to cuda 9.1 cudnn 7

* Pin pylint version

* Remove aws cli installation

* Address review comments
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Upgrade to cuda 9.1 cudnn 7

* Pin pylint version

* Remove aws cli installation

* Address review comments
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* Upgrade to cuda 9.1 cudnn 7

* Pin pylint version

* Remove aws cli installation

* Address review comments