Skip to content

[FEA] Refactor hash-based algorithms with new cuco data structures #12261

Open
@PointKernel

Description

Is your feature request related to a problem? Please describe.
cuco::static_map and cuco::static_multimap are used to perform hash-based operations in libcudf. Depending on NVIDIA/cuCollections#110, a lot of existing use cases could be replaced with cuco::static_set or cuco:static_multiset since the payload part in the hash map is not used.

Moreover, some libcudf implementations are still using concurrent_unordered_map or unordered_multiset and they should be refactored with cuco::static_set/multiset as well.

Describe the solution you'd like
Replace existing implementations with new data structures like cuco::static_set, cuco::static_multiset or cuco::static_map.

There should also be benchmarks showing the performance changes after replacement for most works listed below.
Note: under "Related APIs", the ref:: prefix refers to the cuco device-side API.

Part 1: Updates ready to be addressed in libcudf

Algorithm Desired Data Structure Related APIs PRs Notes
distinct_count cuco::static_set insert, insert_if #13343
orc dictionary encoding (issue #10495) cuco::static_map (legacy) ref::insert, ref::find #13580 Similar to parquet dictionary encoding but the current implementation is still using a custom dictionary
byte_pair_encoding cuco::static_map insert_async, ref::find #13807 uses heterogeneous lookup
json_tree cuco::static_set insert_if_async, ref::find #13928
search/contains_table cuco::static_set insert_async, insert_if_async, contains_async, contains_if_async #14064 Needs migration from cudf::detail::unordered_multiset
search/contains_column cuco::static_set insert_async, insert_if_async, contains_async #14238 Needs migration from cudf::detail::unordered_multiset. unordered_multiset can be removed once this is done.
hash-based groupby (issue #10401) cuco::static_set ref::insert_and_find, retrieve_all #14813 Needs migration from concurrent_unordered_map
legacy json reader cuco::static_map (legacy) insert_async, ref::find #15813 Needs migration from concurrent_unordered_map. Looks like a rational use of hash map. To be replaced by cuco::static_map Deleted in #15813
distinct cuco::static_set insert_async retrieve_all, ref::insert_and_find #15984 We have work in flight to move this to cuco::static_map, but it is blocked on performance concerns ⚠️ #16484
parquet dictionary encoding cuco::static_map (experimental) ref::insert, ref::find #16541 Needs migration from cuco::static_map (legacy)
mixed_semi_joins cuco::static_set insert, insert_if, ref::contains #16230 Needs migration from cuco::static_map (legacy)
semi/anti joins Refactoring not needed anymore, as they use contains
histogram cuco::static_set insert_async retrieve_all, ref::insert_and_find #16485
orc dictionary encoding cuco::static_map (experimental) ref::insert, ref::find #17049 Needs migration from cuco::static_map (legacy)
joins cuco::static_multiset count, count_outer, retrieve, retrieve_outer Needs migration from cuco::static_multimap (legacy) Related issues: #11176, #13116,
mixed joins cuco::static_multiset Needs migration from cuco::static_multimap (legacy)

Metadata

Assignees

No one assigned

    Labels

    2 - In ProgressCurrently a work in progressfeature requestNew feature or requestimprovementImprovement / enhancement to an existing functionlibcudfAffects libcudf (C++/CUDA) code.

    Type

    No type

    Projects

    • Status

      Story Issue

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions