This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Expose the number of GPUs. #10354

Merged
merged 8 commits into from
May 15, 2018

Conversation

tdomhan
Contributor

@tdomhan tdomhan commented Mar 31, 2018

Description

Exposes the number of GPUs available on the system, as reported by cudaGetDeviceCount. Right now Sockeye resorts to calling nvidia-smi, which is far from optimal. With this change we expose the number of GPUs through both the C and the Python API.
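For context, the usage pattern this enables on the Python side can be sketched roughly as follows. This is a hedged illustration only: `pick_contexts` and the string context markers are hypothetical stand-ins for this example, not part of the PR's API.

```python
# Sketch of the pattern enabled by exposing a GPU count: enumerate one
# context per device, falling back to CPU when no GPUs are reported.
# `pick_contexts` and the string markers are illustrative stand-ins.
def pick_contexts(num_gpus_value):
    """Map a GPU count to a list of device contexts, falling back to CPU."""
    if num_gpus_value > 0:
        return ["gpu(%d)" % i for i in range(num_gpus_value)]
    return ["cpu(0)"]

print(pick_contexts(2))  # ['gpu(0)', 'gpu(1)']
print(pick_contexts(0))  # ['cpu(0)']
```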

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@marcoabreu
Contributor

Hey Tobias, definitely helpful - I've seen myself requiring that information quite a few times :)

I'm currently thinking about how to test this properly, but the only way I can think of is calling nvidia-smi, and that's pretty ugly... I think we could go with a test in https://github.com/apache/incubator-mxnet/tree/master/tests/python/gpu which checks for num_gpu > 0 and leave out CPU tests, as we don't have CPU-only tests so far. What do you think?

@tdomhan
Contributor Author

tdomhan commented Apr 1, 2018

yeah, I think that makes sense. I will add that test.

int32_t count;
cudaError_t e = cudaGetDeviceCount(&count);
if (e != cudaSuccess) {
return 0;
Contributor

This should result in an error on the frontend. Use CUDA_CALL checks.

Contributor Author

The reason I did it like this is that on a CPU host, where I would expect to get a device count of 0, I got error 30 when calling cudaGetDeviceCount. This is probably why we have the same logic in storage.cc.

Contributor Author

@tdomhan tdomhan Apr 3, 2018

That said, it is probably indeed cleaner to raise an exception in case CUDA reports one. The only downside is that we then couldn't add a test as suggested by Marco, since we shouldn't be calling num_gpus on CPU-only hosts.

Contributor

Any idea why CUDA reports an error? Maybe we can special-case error 30 and raise the other errors?

Contributor Author

No, I'm not sure why this happens. The error message and error code are unfortunately not very informative.

With respect to CUDA_CALL, it is unfortunately defined in src/common/cuda_utils.h, so I can't include it from base.h.

It is probably cleanest to bubble up any CUDA errors, so I changed the code accordingly per your suggestion.

}
return count;
#else
return 0;
Contributor

A machine that has no GPU and MXNet not being compiled with GPU support should be treated separately. This should raise an error too.

Contributor Author

So you would raise an exception saying that MXNet was not compiled with CUDA support?

Contributor

yes

Contributor

use LOG(FATAL) << "xxx"

check_call(_LIB.MXGetGPUCount(ctypes.byref(count)))
return count.value

def gpus():
Contributor

This doesn't need to be a core API.

Contributor Author

where do you suggest I should add this, or would you like me to remove this?

Contributor

please remove for now.

@@ -212,6 +216,14 @@ def gpu(device_id=0):
return Context('gpu', device_id)


def num_gpus():
Contributor

Please add documentation

Contributor Author

done

return count.value

def gpus():
return [gpu(idx) for idx in range(0, num_gpus())]
Member

can there ever be a case where the gpu device indices are not consecutive or not starting from 0?

Contributor Author

According to http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb this cannot be the case: "Sets device as the current device for the calling host thread. Valid device id's are 0 to (cudaGetDeviceCount() - 1)."

@asitstands
Contributor

This is great. I needed this, thank you. And, if possible, could I ask you a favor? If you could add functions to query the amount of global/shared memory on a CUDA device, it would be very helpful. I'm asking because their implementations would likely parallel the changes in this PR.

@KellenSunderland
Contributor

This has been really useful in other projects. I think it'd be great to have a few utility functions exposed in python to tell you a bit about the build you're using and the environment you're running in. Good first step. Maybe a new non-core API could be exposed for this?

@tdomhan
Contributor Author

tdomhan commented Apr 3, 2018

@KellenSunderland I also wasn't entirely sure whether this deserves a place in the core API, but also wasn't sure where else to put it. What would be a good alternative place?

The idea of exposing other device information such as memory also makes sense to me. I guess we can do that separately.

@anirudh2290
Member

@tdomhan any updates ?

@tdomhan
Contributor Author

tdomhan commented Apr 10, 2018

Tried to address all comments. I did not add a test, as CUDA seems to run into an issue on a CPU only machine, namely raising the following:

MXNetError: [20:09:38] include/mxnet/base.h:328: Check failed: e == cudaSuccess (30 vs. 0)  CUDA: unknown error

Unfortunately error 30 is an internal CUDA error, which is not very informative.

@cjolivier01
Member

test_operator_gpu.py only runs on GPU machines, so you can put a test there.

@marcoabreu
Contributor

I think the right place would be in the general unit test directory as well as in the GPU directory. Otherwise, we don't have coverage on CPU-only instances. Considering this API should be callable in any case and report the necessary information, I think we should test on CPU as well as on GPU.

In the general test, you could make sure that the API is callable in general and only throws the expected errors; since we can't know which instance we're running on, it's hard to validate the actual return value. In the GPU test, you could simply check whether the return value is > 0.

@tdomhan
Contributor Author

tdomhan commented Apr 10, 2018

I added a test to test_operator_gpu.py.

For CPU only hosts I'm not sure what to test, as a CUDA internal error isn't exactly the expected behavior, but is what is happening currently. However, a different version of CUDA may behave differently.

@marcoabreu
Contributor

marcoabreu commented Apr 10, 2018

I think that we should be returning 0 since that's the actual number of GPUs present.

Additionally, we could return a special value like -1, or raise an exception, to indicate the binary was built without CUDA support. In that case, a test would either expect an exception for our CPU builds or a positive number for our GPU builds.

@anirudh2290
Member

Can we check assertRaises for MXNetError for CPU ?

-------
count : int
The number of GPUs.

Member

nit: remove whitespace

@marcoabreu
Contributor

I think we can't use assertRaises because we don't have CPU-only tests (besides the MKL ones) as of now. Instead, we'd have to go with a try/catch approach: if we stay in the try clause, we're probably on a GPU instance and the value should be > 0; if we land in the catch clause, we should be on a CPU instance.

@tdomhan
Contributor Author

tdomhan commented Apr 12, 2018

I changed the code now to raise an exception when MXNet is built without CUDA support as per suggestion by @piiswrong.

For the CPU tests I'm still not sure what test would make sense.

@marcoabreu
Contributor

What do you think about something along the lines of:

try:
    assert num_gpus() > 0, "Expected an exception on a CPU-only build"  # GPU instance with 1 or more GPUs
except CUDA_NOT_PRESENT:
    pass  # We're on a CPU-only build

This will allow to distinguish between a CPU and GPU build as well as a CPU and GPU instance.
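A runnable rendering of this pattern, with stubs standing in for the real pieces: `FakeCudaNotPresentError` and `fake_num_gpus` are hypothetical stand-ins for the MXNetError and the `num_gpus()` call the actual test would use.

```python
# Stand-in for the error a CPU-only build would raise.
class FakeCudaNotPresentError(RuntimeError):
    pass

def fake_num_gpus(cuda_build, gpu_count):
    """Stub for num_gpus(): raises when the build has no CUDA support."""
    if not cuda_build:
        raise FakeCudaNotPresentError("MXNet not compiled with CUDA support")
    return gpu_count

def check_num_gpus(cuda_build, gpu_count):
    """The proposed pattern: a positive count (GPU build) or the exception."""
    try:
        assert fake_num_gpus(cuda_build, gpu_count) > 0
        return "gpu build"
    except FakeCudaNotPresentError:
        return "cpu-only build"

print(check_num_gpus(True, 2))   # gpu build
print(check_num_gpus(False, 0))  # cpu-only build
```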

@tdomhan
Contributor Author

tdomhan commented Apr 18, 2018

So for the CPU tests we run MXNet without CUDA built in, meaning we compile once for the CPU tests and then again with different flags for the GPU tests?

@marcoabreu
Contributor

Exactly

CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
return count;
#else
LOG(FATAL) << "Please compile with CUDA support to query the number of GPUs.";
Contributor

I think it is valid to make this call, since the frontend APIs do not know whether the underlying library was compiled with CUDA support. Instead, we should also have a second API that provides that information.

@piiswrong
Contributor

So the current issue is if mxnet is compiled with GPU but run on gpu-less machine, cudaGetDeviceCount will return 30 instead of cudaErrorNoDevice?

Any idea why this happens? If we can't fix this, I think it's good enough to raise an error saying "failed getting number of gpus".

@tdomhan
Contributor Author

tdomhan commented Apr 20, 2018

That is correct. I'm not sure why it happens or how it can be fixed. With the recent changes, an error will now be raised in this case, and I tried to document this in the docstring.

if (e == cudaErrorNoDevice) {
return 0;
}
CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
Contributor

Failed querying the number of GPUs. CUDA Error: xxx

CHECK_EQ(e, cudaSuccess) << " CUDA: " << cudaGetErrorString(e);
return count;
#else
LOG(FATAL) << "Please compile with CUDA support to query the number of GPUs.";
Contributor

I agree with @marcoabreu
it's probably better to just return 0 when mxnet is not compiled with gpu.

Contributor Author

that's what I had originally before you suggested I should raise an error. Quoting:

piiswrong reviewed 22 days ago
Machine has no gpu and mxnet not compiled with gpu should be treated separately. This should raise an error too.

I will happily change this back though.

Contributor

Yes please. Now I think marco has a point.

Contributor

Sorry for the back and forth

Contributor Author

no worries. will change this back.
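To summarize the semantics the thread settled on (return 0 both for a CPU-only build and for a GPU build on a GPU-less machine; raise on any other CUDA error), here is a small Python simulation of the C-side logic. The error-code values, `WITH_CUDA` flag, and `fake_cuda_get_device_count` stub are illustrative stand-ins; the real code calls cudaGetDeviceCount via the C API.

```python
CUDA_SUCCESS = 0
CUDA_ERROR_NO_DEVICE = 38  # stand-in numeric value for cudaErrorNoDevice
WITH_CUDA = True           # stand-in for the compile-time CUDA flag

def fake_cuda_get_device_count(error_code, count):
    """Stub for cudaGetDeviceCount: returns (error_code, device_count)."""
    return error_code, count

def num_gpus(error_code=CUDA_SUCCESS, count=0):
    if not WITH_CUDA:
        return 0  # CPU-only build: simply report zero GPUs
    e, n = fake_cuda_get_device_count(error_code, count)
    if e == CUDA_ERROR_NO_DEVICE:
        return 0  # GPU build running on a GPU-less machine
    if e != CUDA_SUCCESS:
        raise RuntimeError("Failed querying the number of GPUs. CUDA error: %d" % e)
    return n

print(num_gpus(CUDA_SUCCESS, 4))          # 4
print(num_gpus(CUDA_ERROR_NO_DEVICE, 0))  # 0
```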

@tdomhan
Contributor Author

tdomhan commented Apr 24, 2018

so any test in the unittest directory is run with a version of MXNet that is not compiled with CUDA, is that correct?

@piiswrong
Contributor

they are run with both. gpu/* tests are run with only gpu

@szha
Member

szha commented May 3, 2018

Ping

@tdomhan
Contributor Author

tdomhan commented May 11, 2018

Finally got around to updating this. I reverted the logic to what I had originally (as requested) and added a CPU test.

@szha
Member

szha commented May 11, 2018

This PR needs a rebase.

Raises
------
Will raise an exception on any CUDA error or in case MXNet was not
compiled with CUDA support.
Contributor

I think we changed the part about throwing an exception if compiled without cuda, right? It should only throw an exception in case of an actual error

Contributor Author

good point. updated.

@tdomhan
Contributor Author

tdomhan commented May 14, 2018

rebased. let me know if there are any remaining concerns or if we can merge this.

@tdomhan
Contributor Author

tdomhan commented May 15, 2018

So it seems that test_operator.py is indeed also run with GPUs; all GPU tests failed for this reason:


    assert mx.context.num_gpus() == 0

AssertionError

I will modify the test to check for >= 0, unless someone has a better suggestion.
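The relaxed test then reduces to a non-negativity check. A hedged sketch, with `fake_num_gpus` as a stand-in for the real `mx.context.num_gpus()` call (the stub's return value is arbitrary):

```python
def fake_num_gpus():
    """Stand-in for mx.context.num_gpus(); any host may report 0..N GPUs."""
    return 0  # arbitrary stub value for illustration

def test_num_gpus():
    # Since the file runs on both CPU and GPU hosts, assert only that the
    # reported count is a non-negative integer, not a specific value.
    n = fake_num_gpus()
    assert isinstance(n, int) and n >= 0

test_num_gpus()
```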

@piiswrong piiswrong merged commit 1214205 into apache:master May 15, 2018
@tdomhan tdomhan deleted the num_gpus branch May 28, 2018 09:06
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request May 29, 2018
* Expose the number of GPUs.

* Added GPU test.

* Removed trailing whitespace.

* making the compiler happy

* Reverted CPU only logic and added CPU test.

* Updated python docs.

* Removing break from test.

* no longer assert on 0 gpus
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
8 participants