This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Expose the number of GPUs. #10354

Merged
merged 8 commits into from
May 15, 2018

Conversation

tdomhan
Contributor

@tdomhan tdomhan commented Mar 31, 2018

Description

Exposes the number of GPUs available on the system, as reported by cudaGetDeviceCount. Right now Sockeye resorts to calling nvidia-smi, which is far from optimal. With this change we expose the number of GPUs through both the C and the Python API.
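For context, the usage pattern this enables on the Python side can be sketched roughly as follows. This is a hedged illustration only: `pick_contexts` and the string context markers are hypothetical stand-ins for this example, not part of the PR's API.

```python
# Sketch of the pattern enabled by exposing a GPU count: enumerate one
# context per device, falling back to CPU when no GPUs are reported.
# `pick_contexts` and the string markers are illustrative stand-ins.
def pick_contexts(num_gpus_value):
    """Map a GPU count to a list of device contexts, falling back to CPU."""
    if num_gpus_value > 0:
        return ["gpu(%d)" % i for i in range(num_gpus_value)]
    return ["cpu(0)"]

print(pick_contexts(2))  # ['gpu(0)', 'gpu(1)']
print(pick_contexts(0))  # ['cpu(0)']
```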

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@marcoabreu
Contributor

Hey Tobias, definitely helpful - I've seen myself requiring that information quite a few times :)

I'm currently thinking about how to test this properly, but the only way I can think of is calling nvidia-smi, and that's pretty ugly... I think we could go with a test in https://github.com/apache/incubator-mxnet/tree/master/tests/python/gpu which checks for num_gpu > 0 and leave out CPU tests, as we don't have CPU-only tests so far. What do you think?

@tdomhan
Contributor Author

tdomhan commented Apr 1, 2018

yeah, I think that makes sense. I will add that test.

int32_t count;
cudaError_t e = cudaGetDeviceCount(&count);
if (e != cudaSuccess) {
return 0;
Contributor

This should result in an error on the frontend. Use CUDA_CALL checks.

Contributor Author

The reason I did it like this is that on a CPU host, where I would expect to get a device count of 0, I got error 30 when calling cudaGetDeviceCount. This is probably why we have the same logic in storage.cc.

Contributor Author

@tdomhan tdomhan Apr 3, 2018

That said, it is probably indeed cleaner to raise an exception in case CUDA reports one. The only downside is that we then couldn't add a test as suggested by Marco, since we shouldn't be calling num_gpus on CPU-only hosts.

Contributor

Any idea why CUDA reports an error? Maybe we can special-case error 30 and raise the other errors?

Contributor Author

No, I'm not sure why this happens. The error message and error code are unfortunately not very informative.

With respect to CUDA_CALL, it is unfortunately defined in src/common/cuda_utils.h, so I can't include it from base.h.

It is probably cleanest to bubble up any CUDA errors, so I changed the code accordingly per your suggestion.

}
return count;
#else
return 0;
Contributor

A machine that has no GPU and MXNet not being compiled with GPU support should be treated separately. This should raise an error too.

Contributor Author

So you would raise an exception saying that MXNet was not compiled with CUDA support?

Contributor

yes

Contributor

use LOG(FATAL) << "xxx"

check_call(_LIB.MXGetGPUCount(ctypes.byref(count)))
return count.value

def gpus():
Contributor

This doesn't need to be a core API.

Contributor Author

where do you suggest I should add this, or would you like me to remove this?

Contributor

please remove for now.

@@ -212,6 +216,14 @@ def gpu(device_id=0):
return Context('gpu', device_id)


def num_gpus():
Contributor

Please add documentation

Contributor Author

done

return count.value

def gpus():
return [gpu(idx) for idx in range(0, num_gpus())]
Member

can there ever be a case where the gpu device indices are not consecutive or not starting from 0?

Contributor Author

According to http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb this cannot be the case: "Sets device as the current device for the calling host thread. Valid device id's are 0 to (cudaGetDeviceCount() - 1)."

@asitstands
Contributor

This is great. I needed this, thank you. And, if possible, could I ask you a favor? If you could add functions to query the amount of global/shared memory on a CUDA device, it would be very helpful. I'm asking because their implementations would likely parallel the changes in this PR.

@KellenSunderland
Contributor

This has been really useful in other projects. I think it'd be great to have a few utility functions exposed in python to tell you a bit about the build you're using and the environment you're running in. Good first step. Maybe a new non-core API could be exposed for this?

@tdomhan
Contributor Author

tdomhan commented Apr 3, 2018

@KellenSunderland I also wasn't entirely sure whether this deserves a place in the core API, but also wasn't sure where else to put it. What would be a good alternative place?

The idea of exposing other device information such as memory also makes sense to me. I guess we can do that separately.

@anirudh2290
Member

@tdomhan any updates ?

@tdomhan
Contributor Author

tdomhan commented Apr 10, 2018

Tried to address all comments. I did not add a test, as CUDA seems to run into an issue on a CPU only machine, namely raising the following:

MXNetError: [20:09:38] include/mxnet/base.h:328: Check failed: e == cudaSuccess (30 vs. 0)  CUDA: unknown error

Unfortunately error 30 is an internal CUDA error, which is not very informative.

@cjolivier01
Member

test_operator_gpu.py only runs on GPU machines, so you can put a test there.

@marcoabreu
Contributor

I think the right place would be in the general unit test directory as well as in the GPU directory. Otherwise, we don't have coverage on CPU-only instances. Considering this API should be callable in any case and report the necessary information, I think we should test on CPU as well as on GPU.

In the general test, you could make sure that the API is callable in general and only throws the expected errors; since we can't know which instance we're running on, it's hard to validate the actual return value. In the GPU test, you could simply check whether the return value is > 0.

@tdomhan
Contributor Author

tdomhan commented Apr 10, 2018

I added a test to test_operator_gpu.py.

For CPU only hosts I'm not sure what to test, as a CUDA internal error isn't exactly the expected behavior, but is what is happening currently. However, a different version of CUDA may behave differently.

@marcoabreu
Contributor

marcoabreu commented Apr 10, 2018

I think that we should be returning 0 since that's the actual number of GPUs present.

Additionally, we could return a special value like -1, or raise an exception, to indicate the binary was built without CUDA support. In that case, a test would either expect an exception for our CPU builds or a positive number for our GPU builds.

@anirudh2290
Member

Can we check assertRaises for MXNetError for CPU ?

-------
count : int
The number of GPUs.

Member

nit: remove whitespace

@marcoabreu
Contributor

I think we can't use assertRaises because we don't have CPU-only tests (besides the MKL ones) as of now. Instead, we'd have to go with a try/catch approach: if we stay in the try clause, we're probably on a GPU instance and the value should be > 0; if we land in the catch clause, we should be on a CPU instance.

@tdomhan
Contributor Author

tdomhan commented Apr 12, 2018

I changed the code now to raise an exception when MXNet is built without CUDA support as per suggestion by @piiswrong.

For the CPU tests I'm still not sure what test would make sense.

@marcoabreu
Contributor

What do you think about something along the lines of:

try:
    assert num_gpus() > 0, "Expected an exception on a CPU-only build"  # GPU instance with 1 or more GPUs
except CUDA_NOT_PRESENT:
    pass  # We're on a CPU-only build

This will allow to distinguish between a CPU and GPU build as well as a CPU and GPU instance.
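A runnable rendering of this pattern, with stubs standing in for the real pieces: `FakeCudaNotPresentError` and `fake_num_gpus` are hypothetical stand-ins for the MXNetError and the `num_gpus()` call the actual test would use.

```python
# Stand-in for the error a CPU-only build would raise.
class FakeCudaNotPresentError(RuntimeError):
    pass

def fake_num_gpus(cuda_build, gpu_count):
    """Stub for num_gpus(): raises when the build has no CUDA support."""
    if not cuda_build:
        raise FakeCudaNotPresentError("MXNet not compiled with CUDA support")
    return gpu_count

def check_num_gpus(cuda_build, gpu_count):
    """The proposed pattern: a positive count (GPU build) or the exception."""
    try:
        assert fake_num_gpus(cuda_build, gpu_count) > 0
        return "gpu build"
    except FakeCudaNotPresentError:
        return "cpu-only build"

print(check_num_gpus(True, 2))   # gpu build
print(check_num_gpus(False, 0))  # cpu-only build
```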

@tdomhan
Contributor Author

tdomhan commented Apr 18, 2018

So for the CPU tests we run MXNet without CUDA built in, meaning we compile once for the CPU tests and then again with different flags for the GPU tests?

@marcoabreu
Contributor

Exactly

CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
return count;
#else
LOG(FATAL) << "Please compile with CUDA support to query the number of GPUs.";
Contributor

I think it is valid to make this call, since the frontend APIs do not know whether the underlying library was compiled with CUDA support. Instead, we should also have a second API that provides that information.

@piiswrong
Contributor

So the current issue is if mxnet is compiled with GPU but run on gpu-less machine, cudaGetDeviceCount will return 30 instead of cudaErrorNoDevice?

Any idea why this happens? If we can't fix this, I think it's good enough to raise an error saying "failed getting number of gpus".

@tdomhan
Contributor Author

tdomhan commented Apr 20, 2018

That is correct. I'm not sure why it happens or how it can be fixed. With the recent changes, an error will now be raised in this case, and I tried to document this in the docstring.

if (e == cudaErrorNoDevice) {
return 0;
}
CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
Contributor

Failed querying the number of GPUs. CUDA Error: xxx

CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
return count;
#else
LOG(FATAL) << "Please compile with CUDA support to query the number of GPUs.";
Contributor

I agree with @marcoabreu
it's probably better to just return 0 when mxnet is not compiled with gpu.

Contributor Author

that's what I had originally before you suggested I should raise an error. Quoting:

piiswrong reviewed 22 days ago
Machine has no gpu and mxnet not compiled with gpu should be treated separately. This should raise an error too.

I will happily change this back though.

Contributor

Yes please. Now I think marco has a point.

Contributor

Sorry for the back and forth

Contributor Author

no worries. will change this back.
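To summarize the semantics the thread settled on (return 0 both for a CPU-only build and for a GPU build on a GPU-less machine; raise on any other CUDA error), here is a small Python simulation of the C-side logic. The error-code values, `WITH_CUDA` flag, and `fake_cuda_get_device_count` stub are illustrative stand-ins; the real code calls cudaGetDeviceCount via the C API.

```python
CUDA_SUCCESS = 0
CUDA_ERROR_NO_DEVICE = 38  # stand-in numeric value for cudaErrorNoDevice
WITH_CUDA = True           # stand-in for the compile-time CUDA flag

def fake_cuda_get_device_count(error_code, count):
    """Stub for cudaGetDeviceCount: returns (error_code, device_count)."""
    return error_code, count

def num_gpus(error_code=CUDA_SUCCESS, count=0):
    if not WITH_CUDA:
        return 0  # CPU-only build: simply report zero GPUs
    e, n = fake_cuda_get_device_count(error_code, count)
    if e == CUDA_ERROR_NO_DEVICE:
        return 0  # GPU build running on a GPU-less machine
    if e != CUDA_SUCCESS:
        raise RuntimeError("Failed querying the number of GPUs. CUDA error: %d" % e)
    return n

print(num_gpus(CUDA_SUCCESS, 4))          # 4
print(num_gpus(CUDA_ERROR_NO_DEVICE, 0))  # 0
```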

@tdomhan
Contributor Author

tdomhan commented Apr 24, 2018

so any test in the unittest directory is run with a version of MXNet that is not compiled with CUDA, is that correct?

@piiswrong
Contributor

they are run with both. gpu/* tests are run with only gpu

@szha
Member

szha commented May 3, 2018

Ping

@tdomhan
Contributor Author

tdomhan commented May 11, 2018

Finally got around to updating this. I reverted the logic to what I had originally (as requested) and added a CPU test.

@szha
Member

szha commented May 11, 2018

This PR needs a rebase.

Raises
------
Will raise an exception on any CUDA error or in case MXNet was not
compiled with CUDA support.
Contributor

I think we changed the part about throwing an exception if compiled without cuda, right? It should only throw an exception in case of an actual error

Contributor Author

good point. updated.

@tdomhan
Contributor Author

tdomhan commented May 14, 2018

rebased. let me know if there are any remaining concerns or if we can merge this.

@tdomhan
Contributor Author

tdomhan commented May 15, 2018

So it seems that test_operator.py is indeed also run with GPUs; all GPU tests failed for this reason:


    assert mx.context.num_gpus() == 0

AssertionError

I will modify the test to check for >= 0, unless someone has a better suggestion.
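The relaxed test then reduces to a non-negativity check. A hedged sketch, with `fake_num_gpus` as a stand-in for the real `mx.context.num_gpus()` call (the stub's return value is arbitrary):

```python
def fake_num_gpus():
    """Stand-in for mx.context.num_gpus(); any host may report 0..N GPUs."""
    return 0  # arbitrary stub value for illustration

def test_num_gpus():
    # Since the file runs on both CPU and GPU hosts, assert only that the
    # reported count is a non-negative integer, not a specific value.
    n = fake_num_gpus()
    assert isinstance(n, int) and n >= 0

test_num_gpus()
```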

@piiswrong piiswrong merged commit 1214205 into apache:master May 15, 2018
@tdomhan tdomhan deleted the num_gpus branch May 28, 2018 09:06
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request May 29, 2018
* Expose the number of GPUs.

* Added GPU test.

* Removed trailing whitespace.

* making the compiler happy

* Reverted CPU only logic and added CPU test.

* Updated python docs.

* Removing break from test.

* no longer assert on 0 gpus
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
8 participants