Description
Is your feature request related to a problem? Please describe.
cuco::static_map
and cuco::static_multimap
are used to perform hash-based operations in libcudf. Depending on NVIDIA/cuCollections#110, a lot of existing use cases could be replaced with cuco::static_set
or cuco:static_multiset
since the payload part in the hash map is not used.
Moreover, some libcudf implementations are still using concurrent_unordered_map
or unordered_multiset
and they should be refactored with cuco::static_set/multiset
as well.
Describe the solution you'd like
Replace existing implementations with new data structures like cuco::static_set
, cuco::static_multiset
or cuco::static_map
.
There should also be benchmarks showing the performance changes after replacement for most works listed below.
Note: under "Related APIs", the ref::
prefix refers to the cuco device-side API.
Part 1: Updates ready to be addressed in libcudf
Algorithm | Desired Data Structure | Related APIs | PRs | Notes |
---|---|---|---|---|
distinct_count | cuco::static_set |
insert , insert_if |
✅#13343 | |
orc dictionary encoding (issue #10495) | cuco::static_map (legacy) |
ref::insert , ref::find |
✅#13580 | Similar to parquet dictionary encoding but the current implementation is still using a custom dictionary |
byte_pair_encoding | cuco::static_map |
insert_async , ref::find |
✅#13807 | uses heterogeneous lookup |
json_tree | cuco::static_set |
insert_if_async , ref::find |
✅#13928 | |
search/contains_table | cuco::static_set |
insert_async , insert_if_async , contains_async , contains_if_async |
✅#14064 | Needs migration from cudf::detail::unordered_multiset |
search/contains_column | cuco::static_set |
insert_async , insert_if_async , contains_async |
✅#14238 | Needs migration from cudf::detail::unordered_multiset . unordered_multiset can be removed once this is done. |
hash-based groupby (issue #10401) | cuco::static_set |
ref::insert_and_find , retrieve_all |
✅#14813 | Needs migration from concurrent_unordered_map |
legacy json reader | cuco::static_map (legacy) |
insert_async , ref::find |
✅ #15813 | concurrent_unordered_map . Looks like a rational use of hash map. To be replaced by cuco::static_map |
distinct | cuco::static_set |
insert_async retrieve_all , ref::insert_and_find |
✅ #15984 | We have work in flight to move this to cuco::static_map , but it is blocked on performance concerns |
parquet dictionary encoding | cuco::static_map (experimental) |
ref::insert , ref::find |
✅ #16541 | Needs migration from cuco::static_map (legacy) |
mixed_semi_joins | cuco::static_set |
insert , insert_if , ref::contains |
✅#16230 | Needs migration from cuco::static_map (legacy) |
semi/anti joins | ✅ | Refactoring not needed anymore, as they use contains |
||
histogram | cuco::static_set |
insert_async retrieve_all , ref::insert_and_find |
✅ #16485 | |
orc dictionary encoding | cuco::static_map (experimental) |
ref::insert , ref::find |
✅ #17049 | Needs migration from cuco::static_map (legacy) |
joins | cuco::static_multiset |
count , count_outer , retrieve , retrieve_outer |
Needs migration from cuco::static_multimap (legacy) Related issues: #11176, #13116, |
|
mixed joins | cuco::static_multiset |
Needs migration from cuco::static_multimap (legacy) |
Metadata
Assignees
Labels
Type
Projects
Status
Story Issue