-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Arrow Interop Benchmarks #17194
base: branch-24.12
Are you sure you want to change the base?
Conversation
Could you post some results? I'm expecting to_arrow_device to be almost free and to_arrow_host to be bounded by memcpy basically. |
Can you post the benchmark comparison output by NVBENCH compare tool please? We cannot see anything from just the numbers like these. |
sure! |
…/cudf into arrow-interop-benchmarks
I think this work is great to merge! Nice work Some follow-ups I would consider:
|
cpp/benchmarks/interop/interop.cpp
Outdated
void BM_to_arrow_device(nvbench::state& state, nvbench::type_list<nvbench::enum_type<data_type>>) | ||
{ | ||
auto const num_rows = static_cast<cudf::size_type>(state.get_int64("num_rows")); | ||
int32_t const num_columns = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend making this an axis parameter with a default value of 1.
Example:
.add_int64_axis("num_cols", {1});
Then it could be modified from the command-line without changing the source code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add support for multiple columns initially. I've added that now and also updated the benchmarks with numbers for DECIMAL32 and DECIMAL64. thanks!
cc: @GregoryKimball
…w interop benchmarks
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
@karthikeyann would you please share your review? |
Including extensive benchmark results in the PR description may not be ideal, as they will become part of the squash commit message. Also, @lamarrr could you please update the PR description to outline the changes introduced in this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Will approve it once the PR description is updated.
Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
std::vector<cudf::type_id> types; | ||
|
||
std::fill_n(std::back_inserter(types), num_columns, data_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: similar suggestion at other places too.
std::vector<cudf::type_id> types; | |
std::fill_n(std::back_inserter(types), num_columns, data_type); | |
std::vector<cudf::type_id> types(num_columns, data_type); |
is this better than using fill_n
with back_inserter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't trust std::vector's constructors, they easily become ambiguous and do the wrong thing
std::vector<cudf::column_metadata> children_metadata; | ||
std::fill_n(std::back_inserter(children_metadata), | ||
table->get_column(column).num_children(), | ||
cudf::column_metadata{""}); | ||
column_metadata.children_meta = children_metadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar suggestion here to use vector constructor.
std::vector<cudf::column_metadata> children_metadata; | |
std::fill_n(std::back_inserter(children_metadata), | |
table->get_column(column).num_children(), | |
cudf::column_metadata{""}); | |
column_metadata.children_meta = children_metadata; | |
auto num_children = table->get_column(column).num_children(); | |
column_metadata.children_meta = std::vector<cudf::column_metadata>(num_children, cudf::column_metadata{""}); |
static char const* stringify_type(cudf::type_id value) | ||
{ | ||
switch (value) { | ||
case cudf::type_id::INT8: return "INT8"; | ||
case cudf::type_id::INT16: return "INT16"; | ||
case cudf::type_id::INT32: return "INT32"; | ||
case cudf::type_id::INT64: return "INT64"; | ||
case cudf::type_id::UINT8: return "UINT8"; | ||
case cudf::type_id::UINT16: return "UINT16"; | ||
case cudf::type_id::UINT32: return "UINT32"; | ||
case cudf::type_id::UINT64: return "UINT64"; | ||
case cudf::type_id::FLOAT32: return "FLOAT32"; | ||
case cudf::type_id::FLOAT64: return "FLOAT64"; | ||
case cudf::type_id::BOOL8: return "BOOL8"; | ||
case cudf::type_id::TIMESTAMP_DAYS: return "TIMESTAMP_DAYS"; | ||
case cudf::type_id::TIMESTAMP_SECONDS: return "TIMESTAMP_SECONDS"; | ||
case cudf::type_id::TIMESTAMP_MILLISECONDS: return "TIMESTAMP_MILLISECONDS"; | ||
case cudf::type_id::TIMESTAMP_MICROSECONDS: return "TIMESTAMP_MICROSECONDS"; | ||
case cudf::type_id::TIMESTAMP_NANOSECONDS: return "TIMESTAMP_NANOSECONDS"; | ||
case cudf::type_id::DURATION_DAYS: return "DURATION_DAYS"; | ||
case cudf::type_id::DURATION_SECONDS: return "DURATION_SECONDS"; | ||
case cudf::type_id::DURATION_MILLISECONDS: return "DURATION_MILLISECONDS"; | ||
case cudf::type_id::DURATION_MICROSECONDS: return "DURATION_MICROSECONDS"; | ||
case cudf::type_id::DURATION_NANOSECONDS: return "DURATION_NANOSECONDS"; | ||
case cudf::type_id::DICTIONARY32: return "DICTIONARY32"; | ||
case cudf::type_id::STRING: return "STRING"; | ||
case cudf::type_id::LIST: return "LIST"; | ||
case cudf::type_id::DECIMAL32: return "DECIMAL32"; | ||
case cudf::type_id::DECIMAL64: return "DECIMAL64"; | ||
case cudf::type_id::DECIMAL128: return "DECIMAL128"; | ||
case cudf::type_id::STRUCT: return "STRUCT"; | ||
default: return "unknown"; | ||
} | ||
} | ||
|
||
NVBENCH_DECLARE_ENUM_TYPE_STRINGS(cudf::type_id, stringify_type, stringify_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be used by other benchmarks too.
Please consider moving these to benchmark common header file or common utilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. which header would you suggest that be moved to? same header as type_id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is a good idea. Many times, I had to implement this function for my debugging. Let's make a follow up PR, moving this into types.hpp
(same header as type_id
), or cudf_test/debug_utilities.hpp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a similar file here: https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/benchmarks/io/nvbench_helpers.hpp
Worth considering a new file in https://github.com/rapidsai/cudf/tree/branch-24.12/cpp/benchmarks/common.
Declaring an NVBench-specific utility in a non-cuDF namespace within types.hpp
is inappropriate, as it also forces types.hpp
to include an NVBench header, which is something we want to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with a new header in the benchmarks/common.
We could refactor some of the elements from io/nvbench_helpers.hpp
with the new header as well.
Though the refactor may be too much for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That utility function will be used not just in benchmark but also testing/debugging.
Could you elaborate on this further? How does this feature specifically benefit unit tests?
Benchmarks and unit tests are treated as "external" users of libcudf, which is why they are not included within the cudf::
namespace. The cudf/utilities
directory is reserved for use cases within libcudf or those under the cudf::
namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this feature specifically benefit unit tests?
I didn't mean it will benefit unit tests, but meant it will benefit developers in debugging/testing.
I usually debug cudf APIs by printing out the type id of the input column. For example:
printf("input type: %s\n", type_to_string(input.type().id()));
However, most of the time, for fast iteration with small issues, I just printed the type id integer value, and look up the enum, trying to identify the enum type from value. For example:
printf("input type: %d\n",(int)input.type().id());
Output: input type: 23
(Looking at the enum class type_id
, and count from the first to the 23rd line, I can know type 23 is STRING
)
As such, having such type_to_string
utility is very helpful. This debug print statement can be inserted anywhere. Most of the time I insert it into the middle of a cudf API. Sometimes I use it in unit tests too. It would be most beneficial to add it into libcudf's utilities module. We can either use it for debugging libcudf APIs or for external usage in benchmark/unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean. It's a different feature request.
You are referring to the fact that stringify_type
can be a common utility for both benchmarks and tests. A host-device print utility like print_type_id
together with stringify_type
can be added to https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/include/cudf_test/print_utilities.cuh. Once it's done, this file can simply include that print utility header and add a single line for nvbench readability enhancement:
NVBENCH_DECLARE_ENUM_TYPE_STRINGS(cudf::type_id, cudf::test::print::stringify_type, stringify_type)
I’m inclined to leave this refactoring for a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #17376 to track this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for filing the issue 👍
Benchmark Resultsto_arrow_host[0] NVIDIA RTX A6000
to_arrow_device[0] NVIDIA RTX A6000
from_arrow_host[0] NVIDIA RTX A6000
from_arrow_device[0] NVIDIA RTX A6000
|
Description
This merge request adds benchmarks for the Arrow Interop APIs:
from_arrow_host
to_arrow_host
from_arrow_device
to_arrow_device
Closes #17104
Checklist