-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove direct UCX and NCCL dependencies #5038
Conversation
CC @cjnolet I've targeted 23.02 just in case there are any surprises that pop up, don't want to break us right before code freeze. |
This looks good, but I think we will also need to remove the kmeans mg test from the build, which is trying to use std comms directly. It should instead be using the mpi comms for integration tests. |
Changes look good pending conflict resolution. Also @vyasr could you open an issue to fix the issue with the |
@@ -16,7 +16,7 @@ | |||
|
|||
function(ConfigureTest) | |||
|
|||
set(options OPTIONAL NCCL CUMLPRIMS MPI ML_INCLUDE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about this, I believe we are still going to need to link against NCCL to use raft/comms/mpi_comms.hpp
but only for the mg tests and nowhere else. The mpi_comms.hpp
are essentially creating a NCCL cluster using MPI and then handing off to NCCL everywhere (think of it like replacing Dask w/ MPI).
Maybe we could use the raft::distributed
COMPONENT for that? (or maybe not, since it also brings in UCX...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Components are neat, added raft::distributed
linkage and notes of added dependencies (MPI, NCCL) for C++ MG tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are cuml's MG tests using libraft for NCCL in the same way that raft-dask uses libraft for UCX? I'm not quite sure how this code works; does cuml's MG code not directly use NCCL, but the MG tests do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cuml's MG googletests should be using MPI, however they should still be tied to the comms interface so shouldn't need to know NCCL exists at all (this NCCL can be provided transitively through the raft::distributed component). Note also that wherher the underlying MPI configuration has been configured with UCX underneath is up to the user running the tests (so the raft MPI communicator will use ucx by proxy and not directly)
Cuml's mg pytests all use raft-dask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right the python tests are fine. I guess the difference is that with the pytests raft-dask is a Python package whose underlying extension modules are linked to NCCL, and then cuml Python only interacts with it at the Python layer, whereas the cuml C++ tests are either linking directly to libraft or building against raft's headers, and in either case the distributed components are not precompiled anyway so there's always NCCL linking required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are using PIMPL for the comms, the mg tests shouldnt need nccl during compile time. But yes, the mg tests binary will need to have raft::distributed on the link libraries (which should pass nccl through)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup OK cool. makes sense. Dante has already made the necessary change here by adding the distributed component in get_raft.cmake, so I think we're good here.
cpp/test/CMakeLists.txt
Outdated
@@ -197,14 +196,15 @@ endif() | |||
|
|||
if(BUILD_CUML_MG_TESTS) | |||
|
|||
ConfigureTest(PREFIX MG NAME KMEANS_TEST PATH mg/kmeans_test.cu OPTIONAL NCCL CUMLPRIMS ML_INCLUDE) | |||
# This test needs to be rewritten to use the MPI comms, not the std comms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to comment this out, but can you also create a github issue just so it doesn't get forgotten and reference the issue # here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added link to GH issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving ops-codeowner
file changes
@cjnolet looks like there's still an open question re:NCCL linking here, once we resolve that this PR should be good to go. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-23.02 #5038 +/- ##
===============================================
Coverage ? 79.21%
===============================================
Files ? 192
Lines ? 12377
Branches ? 0
===============================================
Hits ? 9805
Misses ? 2572
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
@gpucibot merge |
these probably should've been removed in rapidsai#5038
libcuml does not compile against UCX and NCCL, it relies on raft for those communicators. Moreover, because of raft's design cuML does not even need to link against those libraries. It should be sufficient to simply use `raft-dask`'s Python API since that package is what will link to UCX and NCCL. Authors: - Vyas Ramasubramani (https://github.com/vyasr) - Dante Gama Dessavre (https://github.com/dantegd) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) - AJ Schmidt (https://github.com/ajschmidt8) - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#5038
libcuml does not compile against UCX and NCCL, it relies on raft for those communicators. Moreover, because of raft's design cuML does not even need to link against those libraries. It should be sufficient to simply use
raft-dask
's Python API since that package is what will link to UCX and NCCL.