Releases: ROCm/rocPRIM
rocPRIM 3.2.1 for ROCm 6.2.2
rocPRIM code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.
rocPRIM 3.2.1 for ROCm 6.2.1
Optimizations
- Improved performance of block_reduce_warp_reduce when warp size == block size.
rocPRIM 3.2.0 for ROCm 6.2.0
Additions
- New overloads for warp_scan::exclusive_scan that take no initial value. These new overloads will write an unspecified result to the first value of each warp.
- The internal accumulator type of inclusive_scan(_by_key) and exclusive_scan(_by_key) is now exposed as an optional type parameter.
  - The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan). This is the same behaviour as before this change.
- New overload for device_adjacent_difference_inplace that allows separate input and output iterators, but allows them to point to the same element.
- New public API for deriving the resulting type of device-only functions:
  - rocprim::invoke_result
  - rocprim::invoke_result_t
  - rocprim::invoke_result_binary_op
  - rocprim::invoke_result_binary_op_t
- New rocprim::batch_copy function added. Similar to rocprim::batch_memcpy, but copies by element, not with memcpy.
- Added more test cases, to better cover supported data types.
- Updated some tests to work with supported data types.
- An optional decomposer argument for all member functions of rocprim::block_radix_sort and all functions of device_radix_sort. To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a rocprim::tuple of references to arithmetic types from the key.
- New rocprim::predicate_iterator which acts as a proxy for an underlying iterator based on a predicate. It iterates over proxies that hold references to the underlying values, but only allow reading and writing if the predicate is true. It can be instantiated with:
  - rocprim::make_predicate_iterator
  - rocprim::make_mask_iterator
- Added custom radix sizes as the last parameter for block_radix_sort. The default value is 4; it can be a number between 0 and 32.
- New rocprim::radix_key_codec, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
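The decomposer requirement described above can be sketched on the host as follows. This is a minimal illustration, not rocPRIM code: the key type rgb_key, the functor rgb_decomposer, and the helper sort_by_decomposed_key are hypothetical names, and std::tuple stands in for rocprim::tuple so that the example compiles without ROCm. It also assumes that the first tuple element is the most significant one, which should be checked against the rocPRIM documentation.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical user-defined key type; not part of rocPRIM.
struct rgb_key
{
    std::uint8_t r;
    std::uint8_t g;
    std::uint8_t b;
};

// Decomposer sketch: returns a tuple of references to the arithmetic
// members of the key. std::tuple is a host-side stand-in for rocprim::tuple.
struct rgb_decomposer
{
    std::tuple<std::uint8_t&, std::uint8_t&, std::uint8_t&> operator()(rgb_key& key) const
    {
        return std::tie(key.r, key.g, key.b);
    }
};

// Host-side reference: order keys lexicographically by the decomposed tuple.
// This only illustrates the ordering a decomposer describes; it is a
// comparison sort, not a radix sort.
inline void sort_by_decomposed_key(std::vector<rgb_key>& keys)
{
    rgb_decomposer decompose;
    std::sort(keys.begin(), keys.end(), [&](rgb_key a, rgb_key b) {
        return decompose(a) < decompose(b);
    });
}
```

With rocPRIM itself, a functor of this shape (returning a rocprim::tuple) would be the object passed as the decomposer argument; the exact call signature is not shown here because it is not given in these notes.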
Optimizations
- Improved the performance of warp_sort_shuffle and block_sort_bitonic.
- Created an optimized version of the warp_exchange functions blocked_to_striped_shuffle and striped_to_blocked_shuffle when the warp size is equal to the items per thread.
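For readers unfamiliar with the two arrangements, the following host-side model sketches what a blocked-to-striped exchange computes. It is an illustration under assumptions, not the rocPRIM implementation: thread_items and blocked_to_striped_host are hypothetical names, and the model uses the usual definitions (blocked: thread t holds ranks t*IPT to t*IPT + IPT - 1; striped: thread t holds ranks t, t + W, t + 2W, and so on).

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side model: items[t][i] is item i held by thread t.
template <class T, std::size_t WarpSize, std::size_t ItemsPerThread>
using thread_items = std::array<std::array<T, ItemsPerThread>, WarpSize>;

// Moves every item from its slot in the blocked arrangement to its slot
// in the striped arrangement.
template <class T, std::size_t WarpSize, std::size_t ItemsPerThread>
thread_items<T, WarpSize, ItemsPerThread>
blocked_to_striped_host(const thread_items<T, WarpSize, ItemsPerThread>& blocked)
{
    thread_items<T, WarpSize, ItemsPerThread> striped{};
    for (std::size_t t = 0; t < WarpSize; ++t)
    {
        for (std::size_t i = 0; i < ItemsPerThread; ++i)
        {
            // Position of this item in the flat sequence.
            const std::size_t rank = t * ItemsPerThread + i;
            striped[rank % WarpSize][rank / WarpSize] = blocked[t][i];
        }
    }
    return striped;
}
```

In this model, when the warp size equals the items per thread the mapping reduces to striped[i][t] = blocked[t][i], that is, a transpose of a square matrix. Whether the optimized rocPRIM path exploits exactly this property is not stated in these notes.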
Fixes
- Fixed incorrect results of warp_exchange::blocked_to_striped_shuffle and warp_exchange::striped_to_blocked_shuffle when the block size is larger than the logical warp size. The test suite has been updated with such cases.
- Fixed incorrect results returned when calling device unique_by_key with overlapping values_input and values_output.
- Fixed incorrect output type used in device_adjacent_difference.
- Hotfix for incorrect results on the GFX10 (Navi 10/RDNA1, Navi 20/RDNA2) ISA and GFX11 ISA (Navi 30 GPUs) on device scan algorithms rocprim::inclusive_scan(_by_key) and rocprim::exclusive_scan(_by_key) with large input types.
- device_adjacent_difference now considers both the input and the output type for selecting the appropriate kernel launch config. Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.
- Fixed incorrect data being loaded with rocprim::thread_load when compiling with -O0.
- Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64.
Deprecations
- The internal header detail/match_result_type.hpp has been deprecated.
- TwiddleIn and TwiddleOut have been deprecated in favor of radix_key_codec.
- The internal ::rocprim::detail::radix_key_codec has been deprecated in favor of the new public utility with the same name.
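As background on what a key codec for radix sorting does, here is a sketch of a widely used bit transformation for 32-bit floating-point keys. It is not taken from rocPRIM and does not use radix_key_codec; encode_float_key and decode_float_key are hypothetical names. The idea is that after encoding, plain unsigned comparison of the bits matches the numeric order of the original values (NaNs aside), which is what a radix sort needs.

```cpp
#include <cstdint>
#include <cstring>

// Encode a float so that unsigned integer order of the result matches the
// numeric order of the input.
inline std::uint32_t encode_float_key(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits)); // bit copy, avoids aliasing issues
    const std::uint32_t sign = 0x80000000u;
    // Negative values: flip every bit. Non-negative values: set the sign bit.
    return (bits & sign) ? ~bits : (bits | sign);
}

// Inverse of encode_float_key.
inline float decode_float_key(std::uint32_t key)
{
    const std::uint32_t sign = 0x80000000u;
    // Encoded keys with the sign bit set came from non-negative inputs.
    const std::uint32_t bits = (key & sign) ? (key & ~sign) : ~key;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

Integer keys need less work (signed integers typically only need the sign bit flipped), and user-defined keys are handled field by field through a decomposer, as noted in the additions above.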
rocPRIM 3.1.0 for ROCm 6.1.2
rocPRIM code for ROCm 6.1.2 did not change. The library was rebuilt for the updated ROCm 6.1.2 stack.
rocPRIM 3.1.0 for ROCm 6.1.1
rocPRIM code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.
rocPRIM 3.1.0 for ROCm 6.1.0
Additions
- New primitive: block_run_length_decode
- New primitive: batch_memcpy
Changes
- Renamed:
  - scan_config_v2 to scan_config
  - scan_by_key_config_v2 to scan_by_key_config
  - radix_sort_config_v2 to radix_sort_config
  - reduce_by_key_config_v2 to reduce_by_key_config
- Removed support for custom config types for device algorithms
- host_warp_size() was moved into rocprim/device/config_types.hpp; it now uses either device_id or a stream parameter to query the proper device and a device_id out parameter
  - The return type is hipError_t
- Added support for __int128_t in device_radix_sort and block_radix_sort
- Improved the performance of match_any, and block_histogram which uses it
Deprecations
- Removed reduce_by_key_config, MatchAny, scan_config, scan_by_key_config, and radix_sort_config
Fixes
- Build issues with rmake.py on Windows when using VS 2017 15.8 or later (due to a breaking fix with extended aligned storage)
rocPRIM 3.0.0 for ROCm 6.0.2
rocPRIM code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.
rocPRIM 3.0.0 for ROCm 6.0.0
Added
- block_sort::sort() overload for keys and values with a dynamic size, for all block sort algorithms. Additionally, all block_sort::sort() overloads with a dynamic size are now supported for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort.
- New two-way partition primitive partition_two_way which can write to two separate iterators.
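The idea of a two-way partition, where selected and rejected items go to two separate outputs, has a close host-side analogue in the C++ standard library: std::partition_copy. The sketch below uses it only to illustrate the concept. It does not call rocprim::partition_two_way, whose signature is not given in these notes, and split_even_odd is a hypothetical helper name.

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Splits the input into (even values, odd values), writing each group
// through its own output iterator.
inline std::pair<std::vector<int>, std::vector<int>>
split_even_odd(const std::vector<int>& input)
{
    std::vector<int> selected;
    std::vector<int> rejected;
    std::partition_copy(input.begin(), input.end(),
                        std::back_inserter(selected),  // predicate is true
                        std::back_inserter(rejected),  // predicate is false
                        [](int x) { return x % 2 == 0; });
    return {selected, rejected};
}
```

For example, splitting {1, 2, 3, 4, 5} this way puts 2 and 4 in the first output and 1, 3 and 5 in the second.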
Optimizations
- Improved the performance of partition.
Fixed
- Fixed rocprim::MatchAny for devices with 64-bit warp size. The function rocprim::MatchAny is deprecated and rocprim::match_any is preferred instead.
rocPRIM 2.13.1 for ROCm 5.7.1
rocPRIM code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.
rocPRIM 2.13.1 for ROCm 5.7.0
Changed
- Deprecated configuration radix_sort_config for device-level radix sort as it no longer matches the algorithm's parameters. New configuration radix_sort_config_v2 is preferred instead.
- Removed erroneous implementation of device-level inclusive_scan and exclusive_scan. The prior default implementation using lookback-scan is now the only available implementation.
- The benchmark metric indicating the bytes processed for exclusive_scan_by_key and inclusive_scan_by_key has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as scan and scan_by_key instead of scan_exclusive and scan_inclusive.
- Deprecated configurations scan_config and scan_by_key_config for device-level scans, as they no longer match the algorithm's parameters. New configurations scan_config_v2 and scan_by_key_config_v2 are preferred instead.
Fixed
- Fixed build issue caused by missing header in thread/thread_search.hpp.