-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[MXNET-99] Upgrade to cuda 9.1 cudnn 7 #10108
Conversation
4b5ae82
to
d3cac2f
Compare
How about keeping both? |
This would increase maintenance efforts and result in a lot more jobs and thus more costs. We should try to prevent mixing integration tests (in that extent) with PR validation. |
We still need a way to track changes that break cuda/cudnn compatibility. For example, when depth-wise convolution was added, a low-level shuffle intrinsic was used in the implementation, which is not supported in sm_20 (as in the symbol doesn't exist for sm_20). This broke the build because sm_20 was still enabled at the time. A good place to put verification is nightly test. Given that eventually the nightly test would be moved to the public CI, the maintenance efforts need to be put in sooner or later. It would be good if we take that into consideration. |
sm_20 has nothing to do with cuda or cudnn versions. |
It does in fact because 2.x architecture is deprecated with cuda 9. But I
agree, integration tests before a release would be the way to go here.
|
3843bf5
to
fdf6739
Compare
8d9ae97
to
b644156
Compare
c3878cd
to
f21c468
Compare
Hello, I have just finished working on this PR. It would be great if somebody could review and approve it. This would allow me to self-merge the PR while everybody in the US is asleep - that way, we won't have any interruption of development due to downtimes. |
f21c468
to
f81a8fa
Compare
python/mxnet/kvstore.py
Outdated
@@ -424,13 +424,15 @@ def set_gradient_compression(self, compression_params): | |||
Other keys in this dictionary are optional and specific to the type | |||
of gradient compression. | |||
""" | |||
# pylint: disable=unsupported-membership-test | |||
if ('device' in self.type) or ('dist' in self.type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use inline disable on the if block. Same for other occurrences.
@@ -26,10 +26,7 @@ | |||
# | |||
# BUILD_ID: the current build ID for the specified PR | |||
# | |||
set -ex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need set +ex at the end to disable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is restricted to the current scope (aka this script).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. As long as the script is not executed through source
it's fine.
@@ -848,7 +848,7 @@ def fit(self, X, y=None, eval_data=None, eval_metric='acc', | |||
# init optmizer | |||
if isinstance(self.optimizer, str): | |||
batch_size = data.batch_size | |||
if kvstore and 'dist' in kvstore.type and not '_async' in kvstore.type: | |||
if kvstore and 'dist' in kvstore.type and '_async' not in kvstore.type: | |||
batch_size *= kvstore.num_workers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this unit tested to validate it still works after these changes? if so, where?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need unit test for both cases (in and not in)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't change the logic, this is logically equivalent and was forced by pylint. Feel free to ask @mli, it's his code. I think that request is out of scope for this PR.
I'm happy to explain why NOT(TRUE) is equivalent to IS(FALSE).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not use veto for this purpose if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did lint never catch this before?
how can we be sure this has the same effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cjolivier01 I'm referring to your comment in #9931 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szha see email there where you personally pointed out your view that committer questions need not always be answered, thus leaving no choice but to block with a request changes until such time that requester makes the changes or makes sufficient arguments against them (or in this case shows they are not needed).
5th paragraph of second email from top:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cjolivier01 I'm merely trying to understand what your take is on that, given that you expressed that you didn't intend to veto while still made "request changes" review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what's the verdict?
1534b09
to
1d5edae
Compare
1d5edae
to
b76809f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Probably could have split this up into a pylint change PR / and a CUDA change PR.
* Upgrade to cuda 9.1 cudnn 7 * Pin pylint version * Remove aws cli installation * Address review comments
* Upgrade to cuda 9.1 cudnn 7 * Pin pylint version * Remove aws cli installation * Address review comments
* Upgrade to cuda 9.1 cudnn 7 * Pin pylint version * Remove aws cli installation * Address review comments
* Upgrade to cuda 9.1 cudnn 7 * Pin pylint version * Remove aws cli installation * Address review comments
Description
This PR upgrades our CI from CUDA 8.0 to CUDA 9.1 and CuDNN from 5 to 7.
Checklist
Essentials
Changes
Comments
This change requires an update on our slaves because CUDA 9.1 is not supported by nvidia-docker. It requires nvidia-docker2 which will be provided with the new slaves.
I had to do some lint fixes. Although we stick the version, it still seems to be flaky about detecting some things - I don't really understand what's going on there. I've tried to use older versions as well, but it always found some problems.