Skip to content

Conversation

jnke2016
Copy link
Contributor

@jnke2016 jnke2016 commented Mar 11, 2025

This PR leverages the CAPI to optimize the check for vertex existence which can lead to +100x speedup compared to the cudf based version as it can be noticed from the performance figure below

Screenshot 2025-03-17 at 9 47 37 AM

closes #4956

@jnke2016 jnke2016 requested a review from a team as a code owner March 11, 2025 13:56
@rlratzel rlratzel added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 11, 2025
@rlratzel
Copy link
Contributor

@jnke2016 does this close #4956 by itself or are there other changes to the Python layer needed?

Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this PR ready for review?

I see just debug print statements and commented out code. If the PR is not ready for review, you may better create a draft PR (this also saves CI resources).

@@ -132,6 +132,7 @@ struct two_hop_neighbors_functor : public cugraph::c_api::abstract_functor {
bool multi_gpu>
void operator()()
{
printf("\nin two_hop_neighbors \n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete debug print statements.

@jnke2016 jnke2016 requested a review from ChuckHastings March 17, 2025 15:11
Comment on lines 300 to 316
rmm::device_uvector<vertex_t> vertex_array(1, handle_.get_stream());

cugraph::detail::sequence_fill(
handle_.get_stream(), vertex_array.data(), vertex_array.size(), vertex_t(vertex_));

if constexpr (multi_gpu) {
vertex_array = cugraph::shuffle_ext_vertices(handle_, std::move(vertex_array));
}

cugraph::renumber_ext_vertices<vertex_t, multi_gpu>(
handle_,
vertex_array.data(),
vertex_array.size(),
number_map->data(),
graph_view.local_vertex_partition_range_first(),
graph_view.local_vertex_partition_range_last(),
do_expensive_check_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the same vertex_t value is passed to every GPU in multi-GPU, right? Or are you assuming that each GPU can call this function with different vertex_ values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the former is the case, you don't need to really shuffle vertices, just need to call compute_gpu_id_from_ext_vertex_t to find the owning GPU. The owning GPU performs the check and broadcast the check result.

Comment on lines 269 to 291
raft::copy<vertex_t>(
vertices.data(), edgelist_srcs.data(), edgelist_srcs.size(), handle_.get_stream());

cugraph::detail::sort_ints(handle_.get_stream(),
raft::device_span<vertex_t>{vertices.data(), vertices.size()});

size_t unique_vertices_size = cugraph::detail::unique_ints(
handle_.get_stream(), raft::device_span<vertex_t>{vertices.data(), vertices.size()});

vertices.resize(unique_vertices_size + edgelist_dsts.size(), handle_.get_stream());

raft::copy<vertex_t>(vertices.data() + unique_vertices_size,
edgelist_dsts.data(),
edgelist_dsts.size(),
handle_.get_stream());

cugraph::detail::sort_ints(handle_.get_stream(),
raft::device_span<vertex_t>{vertices.data(), vertices.size()});

unique_vertices_size = cugraph::detail::unique_ints(
handle_.get_stream(), raft::device_span<vertex_t>{vertices.data(), vertices.size()});

vertices.resize(unique_vertices_size, handle_.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce memory footprint, you can sort & unique sources & destinations separately and merge instead of sort & unique sources, then copy all the destinations, and sort & unique.

You may further cut the memory footprint by first performing hash based group by (https://github.com/rapidsai/cugraph/blob/branch-25.04/cpp/include/cugraph/utilities/shuffle_comm.cuh#L761 with mem_frugal_threshold set), but this might be little bit of an overkill and you may defer this till this actually becomes a bottlenck (especially considering that renumber = false is here mainly for historical/debugging reasons).

If you just want to put working simple implementation here (as this won't be a case actual users will heavily use), just copy all sources & destinations to a single array and sort & unique.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And don't forget to call shrink_to_fit() to actually release memory after resize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce memory footprint, you can sort & unique sources & destinations separately and

Right. I now sort & unique sources & destinations separately. I also call shrink_to_fit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may further cut the memory footprint by first performing hash based group by ...

Right, we can add this optimization if this becomes a bottleneck which might not be for now. And as you mention, the case this code is targeting is for historical/debugging reasons

*/
cugraph_error_code_t cugraph_has_vertex(const cugraph_resource_handle_t* handle,
cugraph_graph_t* graph,
const int vertex,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use an int here. A vertex can be int32_t or int64_t.

For SSSP, we pass in a size_t and cast it to either int32_t or int64_t inside the C API implementation.

struct has_vertex_functor : public cugraph::c_api::abstract_functor {
raft::handle_t const& handle_{};
cugraph::c_api::cugraph_graph_t* graph_{nullptr};
const int vertex_{};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't be an int.


has_vertex_functor(::cugraph_resource_handle_t const* handle,
::cugraph_graph_t* graph,
const int vertex,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't be an int.

cdef cugraph_error_code_t cugraph_has_vertex(
const cugraph_resource_handle_t* handle,
const cugraph_graph_t* graph,
const int vertex,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't be an int

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

I just have a few minor requests.

@@ -145,6 +147,10 @@ def bfs(ResourceHandle handle, _GPUGraph graph,

assert_CAI_type(sources, "sources")

# Check if sources are valid
for v in sources.values_host:
if not pylibcugraph.has_vertex(handle, graph, v, do_expensive_check):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you just import has_vertex from pylibcugraph, you can change this:

Suggested change
if not pylibcugraph.has_vertex(handle, graph, v, do_expensive_check):
if not has_vertex(handle, graph, v, do_expensive_check):

@@ -20,6 +20,7 @@ from libc.stdint cimport uintptr_t
from libc.stdint cimport int32_t
from libc.limits cimport INT_MAX

import pylibcugraph
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this if you import has_vertex directly from pylibcugraph, which you're doing below.

Comment on lines 90 to 92


return True if result else False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would only have one blank line here.
Also, will this work instead?

Suggested change
return True if result else False
return bool(result)

Comment on lines 28 to 31
from pylibcugraph._cugraph_c.array cimport (
cugraph_type_erased_device_array_view_t,
cugraph_type_erased_device_array_view_free,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see these being used here, can they be removed?

Comment on lines +46 to +47
copy_to_cupy_array,
create_cugraph_type_erased_device_array_view_from_py_obj
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see these being used here, can they be removed?

Comment on lines 119 to 133
/**
* @ingroup utility_wrappers_cpp
* @brief Update a value in a device span to 0 if it matches the target_value or 1
*
* @tparam value_t type of the value to operate on. Must be either int32_t or int64_t.
*
* @param[out] values device span to update
* @param[in] target_value value to be querried
* @param[in] stream_view stream view
*
*/
template <typename value_t>
void transform_binary(raft::device_span<value_t> values,
value_t target_value,
raft::handle_t const& handle);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look generic enough to be included as a utility function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we at least need a better name for this if you need this to call thrust function from .cpp file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better store/return results in bool array than using value_t.

Something like

rmm::device_uvector<bool>
elementwise_not_equal(
  raft::device_span<value_t> values,
  value_t compare,
  rmm::cuda_stream_view const& stream_view);

or transform_not_equal or transform_compare_not_equal? I won't sure about the best name but transform_binary doesn't match very well with what this function does.

Copy link
Contributor

@seunghwak seunghwak Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI: I asked this to chatgpt and got this.

A good name for this function should align with STL naming conventions, which are often verb-based and descriptive of their operation. Some well-known STL functions that perform element-wise transformations or checks include std::transform, std::count_if, and std::any_of.

Since your function checks whether each element differs from a given value and returns a boolean result for each, a name that follows STL style could be:

Suggested names:
mismatch_mask – Inspired by std::mismatch, emphasizing that it creates a mask of mismatches.
not_equal_mask – Mirrors std::not_equal_to, clearly indicating the operation.
transform_not_equal – Follows the std::transform pattern, emphasizing element-wise transformation.
compare_not_equal – Explicit about the comparison operation.
Recommended choice:
not_equal_mask
This name aligns well with STL conventions, is concise, and clearly communicates that the function generates a mask indicating inequality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I might go for chatgpt next time I am trying to find a name for my thrust-like function. I chose transform_not_equal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... you may try cursor chatting as well. Haven't compared but in theory, it can better consider the other code in cugraph codebase.

Comment on lines 305 to 309
/*
if constexpr (multi_gpu) {
vertices = cugraph::shuffle_ext_vertices(handle_, std::move(vertices));
}
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete dead code.

Comment on lines 320 to 323
cugraph::detail::transform_binary(
raft::device_span<vertex_t>{vertices.data(), vertices.size()},
cugraph::invalid_vertex_id<vertex_t>::value,
handle_.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comparison result should better be stored in rmm::device_uvector<bool>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a rmm::device_uvector<bool> to store the result of the comparison.


vertices.resize(unique_edgelist_srcs_size + unique_edgelist_dsts_size,
handle_.get_stream());
vertices.shrink_to_fit(handle_.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shrink_to_fit is unnecessary. You are not shrinking a vector here. You just increased the size of the vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right this is addressed

return;
}

*number_map = std::move(vertices);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this unnecessary. number_map already holds consecutive integers starting from 0.

void transform_not_equal(raft::device_span<value_t> values,
raft::device_span<bool> result,
value_t compare,
raft::handle_t const& handle);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our convention is to pass handle as the first input argument (and stream as the last).

Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (except for unnecessary empty lines).

@@ -292,11 +294,34 @@ struct create_graph_functor : public cugraph::c_api::abstract_functor {
if (renumber_) {
*number_map = std::move(new_number_map.value());
} else {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This empty line is unnecessary.

Comment on lines 322 to 323


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These empty lines are unnecessary.

Comment on lines 266 to 267


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for 3 empty lines. 1 is sufficient.

@rlratzel rlratzel added this to the 25.04 milestone Mar 26, 2025
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recent changes look good but I have a question.

Comment on lines 100 to 107
cdef cugraph_type_erased_device_array_view_t* \
result_view_ptr = \
cugraph_type_erased_device_array_view(
result_ptr)

cupy_has_vertex = copy_to_cupy_array(c_resource_handle_ptr, result_view_ptr)

return True if result else False
return cupy_has_vertex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this copy operation added, and are we no longer returning a bool from this function as implied in the docstring? Also, do the benchmarks in the PR description include this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked to @jnke2016 offline: reason for returning the array is to facilitate passing/querying multiple vertices and getting individual bools back for each. This also helps facilitate MG use cases. The benchmarks were not impacted significantly with this change.
@jnke2016 is going to update the docstring and example now that the return value is different.

@rlratzel
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 96e275f into rapidsai:branch-25.04 Mar 27, 2025
81 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cuGraph improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[QST]: Why are we experiencing high execution times with the BFS algorithm?
4 participants