Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove direct UCX and NCCL dependencies #5038

Merged
merged 9 commits into from
Dec 17, 2022
Merged

Remove direct UCX and NCCL dependencies #5038

merged 9 commits into from
Dec 17, 2022

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Nov 30, 2022

libcuml does not compile against UCX and NCCL, it relies on raft for those communicators. Moreover, because of raft's design cuML does not even need to link against those libraries. It should be sufficient to simply use raft-dask's Python API since that package is what will link to UCX and NCCL.

@vyasr vyasr added 2 - In Progress Currenty a work in progress Tech Debt Issues related to debt improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 30, 2022
@vyasr vyasr requested a review from a team as a code owner November 30, 2022 18:28
@vyasr vyasr self-assigned this Nov 30, 2022
@vyasr
Copy link
Contributor Author

vyasr commented Nov 30, 2022

CC @cjnolet

I've targeted 23.02 just in case there are any surprises that pop up, don't want to break us right before code freeze.

@cjnolet
Copy link
Member

cjnolet commented Nov 30, 2022

This looks good, but I think we will also need to remove the kmeans mg test from the build, which is trying to use std comms directly. It should instead be using the mpi comms for integration tests.

@vyasr vyasr added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currenty a work in progress labels Dec 1, 2022
@dantegd
Copy link
Member

dantegd commented Dec 8, 2022

Changes look good pending conflict resolution. Also @vyasr could you open an issue to fix the issue with the kmeans_mg test? It is not being built currently, but we should fix (or move) it later.

@dantegd dantegd added 4 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Dec 8, 2022
@dantegd dantegd requested a review from a team as a code owner December 12, 2022 19:44
@github-actions github-actions bot added the conda conda issue label Dec 12, 2022
@@ -16,7 +16,7 @@

function(ConfigureTest)

set(options OPTIONAL NCCL CUMLPRIMS MPI ML_INCLUDE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about this, I believe we are still going to need to link against NCCL to use raft/comms/mpi_comms.hpp but only for the mg tests and nowhere else. The mpi_comms.hpp are essentially creating a NCCL cluster using MPI and then handing off to NCCL everywhere (think of it like replacing Dask w/ MPI).

Maybe we could use the raft::distributed COMPONENT for that? (or maybe not, since it also brings in UCX...).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Components are neat, added raft::distributed linkage and notes of added dependencies (MPI, NCCL) for C++ MG tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are cuml's MG tests using libraft for NCCL in the same way that raft-dask uses libraft for UCX? I'm not quite sure how this code works; does cuml's MG code not directly use NCCL, but the MG tests do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cuml's MG googletests should be using MPI, however they should still be tied to the comms interface so shouldn't need to know NCCL exists at all (this NCCL can be provided transitively through the raft::distributed component). Note also that wherher the underlying MPI configuration has been configured with UCX underneath is up to the user running the tests (so the raft MPI communicator will use ucx by proxy and not directly)

Cuml's mg pytests all use raft-dask.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right the python tests are fine. I guess the difference is that with the pytests raft-dask is a Python package whose underlying extension modules are linked to NCCL, and then cuml Python only interacts with it at the Python layer, whereas the cuml C++ tests are either linking directly to libraft or building against raft's headers, and in either case the distributed components are not precompiled anyway so there's always NCCL linking required?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are using PIMPL for the comms, the mg tests shouldnt need nccl during compile time. But yes, the mg tests binary will need to have raft::distributed on the link libraries (which should pass nccl through)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup OK cool. makes sense. Dante has already made the necessary change here by adding the distributed component in get_raft.cmake, so I think we're good here.

@@ -197,14 +196,15 @@ endif()

if(BUILD_CUML_MG_TESTS)

ConfigureTest(PREFIX MG NAME KMEANS_TEST PATH mg/kmeans_test.cu OPTIONAL NCCL CUMLPRIMS ML_INCLUDE)
# This test needs to be rewritten to use the MPI comms, not the std comms.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to comment this out, but can you also create a github issue just so it doesn't get forgotten and reference the issue # here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added link to GH issue

Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving ops-codeowner file changes

@dantegd dantegd requested a review from a team as a code owner December 12, 2022 21:51
@vyasr
Copy link
Contributor Author

vyasr commented Dec 16, 2022

@cjnolet looks like there's still an open question re:NCCL linking here, once we resolve that this PR should be good to go.

@codecov-commenter
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.02@38dea2f). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.02    #5038   +/-   ##
===============================================
  Coverage                ?   79.21%           
===============================================
  Files                   ?      192           
  Lines                   ?    12377           
  Branches                ?        0           
===============================================
  Hits                    ?     9805           
  Misses                  ?     2572           
  Partials                ?        0           
Flag Coverage Δ
dask 45.81% <0.00%> (?)
non-dask 69.23% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@cjnolet
Copy link
Member

cjnolet commented Dec 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1ea431d into rapidsai:branch-23.02 Dec 17, 2022
ajschmidt8 added a commit to csadorf/cuml that referenced this pull request Dec 17, 2022
these probably should've been removed in rapidsai#5038
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
libcuml does not compile against UCX and NCCL, it relies on raft for those communicators. Moreover, because of raft's design cuML does not even need to link against those libraries. It should be sufficient to simply use `raft-dask`'s Python API since that package is what will link to UCX and NCCL.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Dante Gama Dessavre (https://github.com/dantegd)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#5038
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Waiting on Author Waiting for author to respond to review CMake conda conda issue CUDA/C++ improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Tech Debt Issues related to debt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants