Conversation
Hey Tobias, definitely helpful - I've seen myself requiring that information quite a few times :) I'm currently thinking about how to test this properly, but the only way I can think of is calling nvidia-smi, and that's pretty ugly... I think we could go with a test in https://github.com/apache/incubator-mxnet/tree/master/tests/python/gpu which checks for
Yeah, I think that makes sense. I will add that test.
include/mxnet/base.h
Outdated
  int32_t count;
  cudaError_t e = cudaGetDeviceCount(&count);
  if (e != cudaSuccess) {
    return 0;
This should result in an error in the front end. Use CUDA_CALL checks.
The reason I did it like this is that on a CPU host, where I would expect to get a device count of 0, I got error 30 when calling cudaGetDeviceCount. This is probably why we have the same logic in storage.cc.
That said, it is probably indeed cleaner to raise an exception in case CUDA reports one. The only downside is that this way we wouldn't be able to add a test as suggested by Marco, as we then shouldn't be calling num_gpus on CPU-only hosts.
Any idea why CUDA reports an error? Maybe we can special-case error 30 and raise the other errors?
No, I'm not sure why this happens. The error message and error code are unfortunately not very informative.
With respect to CUDA_CALL, it is unfortunately defined in src/common/cuda_utils.h, so I can't include it from base.h.
It is probably cleanest to bubble up any CUDA errors, so I changed the code accordingly, following your suggestion.
include/mxnet/base.h
Outdated
  }
  return count;
#else
  return 0;
The case where the machine has no GPU and the case where MXNet was not compiled with GPU support should be treated separately. This should raise an error too.
So you would raise an exception saying that MXNet was not compiled with CUDA support?
yes
Use LOG(FATAL) << "xxx"
python/mxnet/context.py
Outdated
    check_call(_LIB.MXGetGPUCount(ctypes.byref(count)))
    return count.value

def gpus():
This doesn't need to be a core API.
Where do you suggest I should add this, or would you like me to remove it?
please remove for now.
@@ -212,6 +216,14 @@ def gpu(device_id=0):
    return Context('gpu', device_id)


def num_gpus():
Please add documentation
done
python/mxnet/context.py
Outdated
    return count.value

def gpus():
    return [gpu(idx) for idx in range(0, num_gpus())]
Can there ever be a case where the GPU device indices are not consecutive or do not start from 0?
According to http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb this cannot be the case: "Sets device as the current device for the calling host thread. Valid device id's are 0 to (cudaGetDeviceCount() - 1)."
This is great. I needed this, thank you. And, if possible, could I ask you a favor? If you could add functions to query the amount of global/shared memory on a CUDA device, it would be very helpful. I'm asking because their implementations might be almost parallel to the changes in this PR.
This has been really useful in other projects. I think it'd be great to have a few utility functions exposed in Python to tell you a bit about the build you're using and the environment you're running in. Good first step. Maybe a new non-core API could be exposed for this?
@KellenSunderland I also wasn't entirely sure whether this deserves a place in the core API, but also wasn't sure where else to put it. What would be a good alternative place? The idea of exposing other device information such as memory also makes sense to me. I guess we can do that separately.
@tdomhan any updates?
Tried to address all comments. I did not add a test, as CUDA seems to run into an issue on a CPU-only machine, namely raising the following:
Unfortunately error 30 is an internal CUDA error, which is not very informative.
test_operator_gpu.py only runs on GPU machines, so you can put a test there.
I think the right place would be in the general unit test directory as well as in the GPU directory. Otherwise, we don't have coverage on CPU-only instances. Considering this API should be callable in any case and report the necessary information, I think we should test on CPU as well as on GPU. In the general test, you could make sure that the API is callable in general and only throws the expected errors - since we can't know on which instance we're running, it's hard to validate the actual return value. In the GPU test, you could simply check whether the return value is > 0.
I added a test. For CPU-only hosts I'm not sure what to test, as a CUDA internal error isn't exactly the expected behavior, but it is what happens currently. However, a different version of CUDA may behave differently.
I think that we should be returning 0 since that's the actual number of GPUs present. Additionally, we could return a special value like -1, or raise an exception, to indicate that the binary was built without CUDA support. In that case, a test would either expect an exception for our CPU builds or a positive number for our GPU builds.
Can we check
python/mxnet/context.py
Outdated
    -------
    count : int
        The number of GPUs.

nit: remove whitespace
I think we can't use assertRaises because we don't have CPU-only tests (besides the MKL ones) as of now. Instead, we'd have to go with a try/except approach: if we stay in the try clause, we're probably on a GPU instance and the value should be > 0, and if we end up in the except clause, we should be on a CPU instance.
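A minimal sketch of what such a try/except test could look like (assuming the function is exposed as mx.context.num_gpus() and that failures surface as mxnet.base.MXNetError, as discussed above):

```python
import mxnet as mx
from mxnet.base import MXNetError

def test_num_gpus():
    try:
        # On a GPU instance the call should succeed and report at least one device.
        assert mx.context.num_gpus() > 0
    except MXNetError:
        # On a CPU instance (or when CUDA reports an error) the call raises instead.
        pass
```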
I changed the code now to raise an exception when MXNet is built without CUDA support, as per @piiswrong's suggestion. For the CPU tests I'm still not sure what test would make sense.
What do you think about something along the lines of:
This would allow us to distinguish between a CPU and a GPU build as well as between a CPU and a GPU instance.
So for CPU tests we run MXNet without CUDA built in, meaning that we compile once for CPU tests and then again for GPU tests with different flags?
Exactly
include/mxnet/base.h
Outdated
  CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
  return count;
#else
  LOG(FATAL) << "Please compile with CUDA support to query the number of GPUs.";
I think it is valid to do this call since the front-end APIs do not know whether the underlying library was compiled with or without CUDA support. Instead, we should also have a second API which provides that information.
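A rough sketch of what such a second API could look like on the Python side; the function name and the underlying C symbol are hypothetical and do not exist in this PR:

```python
import ctypes
from mxnet.base import _LIB, check_call

def built_with_cuda():
    """Return True if the loaded libmxnet was built with CUDA support.

    Hypothetical helper sketched only to illustrate the suggestion above;
    MXBuiltWithCUDA is not an existing C API call.
    """
    flag = ctypes.c_int()
    check_call(_LIB.MXBuiltWithCUDA(ctypes.byref(flag)))
    return bool(flag.value)
```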
So the current issue is that if MXNet is compiled with GPU support but run on a GPU-less machine, cudaGetDeviceCount will return error 30 instead of cudaErrorNoDevice? Any idea why this happens? If we can't fix this, I think it's good enough to raise an error saying "failed getting number of GPUs".
That is correct. I'm not sure why it happens or how it can be fixed. With the recent changes it will now raise an error in this case, and I tried documenting this in the docstring.
  if (e == cudaErrorNoDevice) {
    return 0;
  }
  CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
Failed querying the number of GPUs. CUDA Error: xxx
include/mxnet/base.h
Outdated
  CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
  return count;
#else
  LOG(FATAL) << "Please compile with CUDA support to query the number of GPUs.";
I agree with @marcoabreu. It's probably better to just return 0 when MXNet is not compiled with GPU support.
That's what I had originally, before you suggested I should raise an error. Quoting:
piiswrong reviewed 22 days ago
> The case where the machine has no GPU and the case where MXNet was not compiled with GPU support should be treated separately. This should raise an error too.
I will happily change this back though.
Yes, please. Now I think Marco has a point.
Sorry for the back and forth
no worries. will change this back.
So any test in the
They are run with both. gpu/* tests are run with GPU only.
Ping
Finally got around to updating this. I reverted the logic to what I had originally (as requested) and added a CPU test.
This PR needs a rebase.
python/mxnet/context.py
Outdated
    Raises
    ------
    Will raise an exception on any CUDA error or in case MXNet was not
    compiled with CUDA support.
I think we changed the part about throwing an exception if compiled without CUDA, right? It should only throw an exception in case of an actual error.
good point. updated.
Rebased. Let me know if there are any remaining concerns or if we can merge this.
So it seems to be the case that test_operator.py is indeed also run with GPUs. See (all GPU tests failed for this reason):
I will modify the test to check for
* Expose the number of GPUs.
* Added GPU test.
* Removed trailing whitespace.
* Making the compiler happy.
* Reverted CPU-only logic and added CPU test.
* Updated Python docs.
* Removing break from test.
* No longer assert on 0 GPUs.
Description
Exposes the number of GPUs available on the system as reported by cudaGetDeviceCount. Right now, for Sockeye we resort to calling nvidia-smi (see), which is far from optimal. With this change we export the number of GPUs through both the C and the Python API. A short usage sketch follows after the checklist below.

Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
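As a usage reference, a minimal sketch of how the new Python API could be called (assuming the function is exposed as mx.context.num_gpus() and returns 0 on a CPU-only host, as in the final revision of this PR):

```python
import mxnet as mx

# Usage sketch: pick GPU contexts when any are available,
# otherwise fall back to the CPU context.
n = mx.context.num_gpus()
ctxs = [mx.gpu(i) for i in range(n)] if n > 0 else [mx.cpu()]
print("Using contexts:", ctxs)
```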