[FEA] Migrate to Thrust `nosync` stream policy for performance.

### Motivation & Description

Thrust 1.16 added an execution policy, `thrust::cuda::par_nosync`, that skips Thrust's internal stream synchronizations unless they are required for correctness (e.g. an algorithm like `thrust::reduce` returns a value to the host, so it must still synchronize).
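A minimal sketch of the difference, assuming Thrust 1.16 or newer compiled with `nvcc`. The functor `negate_value` and the function `negated_sum` are hypothetical names for illustration only, CUDA error checking is omitted, and RMM is deliberately left out so the sketch depends on Thrust alone:

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>

// Hypothetical functor used only for this illustration.
struct negate_value {
  __host__ __device__ int operator()(int x) const { return -x; }
};

// Negates n ones on the device and returns their sum.
int negated_sum(int n)
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  thrust::device_vector<int> values(n, 1);

  // Same stream as a `par.on(stream)` call, but without the trailing sync.
  auto policy = thrust::cuda::par_nosync.on(stream);

  // transform returns nothing to the host, so with par_nosync this call
  // may return before the kernel has finished running on `stream`.
  thrust::transform(policy, values.begin(), values.end(), values.begin(), negate_value{});

  // reduce returns its result to the host, so it still has to synchronize
  // `stream` internally. Stream ordering guarantees it sees the transform.
  int const sum = thrust::reduce(policy, values.begin(), values.end(), 0);

  cudaStreamDestroy(stream);
  return sum;
}
```

Calling `negated_sum(1024)` enqueues the transform without blocking the host; only the reduction waits on the stream.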

In PR #11577, I experimented with a bulk find-replace of `rmm::exec_policy` with `rmm::exec_policy_nosync`. This resulted in notable performance improvements, particularly for small data sizes. See: https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70

However, a plain find-replace may introduce stream-safety issues. For example, an internal detail API might build a host buffer, enqueue an asynchronous copy of it to the device, and return. If nothing synchronizes the stream before the function returns, the host buffer's lifetime no longer guarantees that the copy is safe (thought experiment identified by @jrhemstad). Changing to `nosync` execution policies therefore requires analyzing every use case individually, at both the detail and public API levels.
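To make the thought experiment concrete, here is a sketch of the conservative pattern, not code taken from libcudf. The helper `negate_host_built_values`, the driver `first_negated_value`, and the functor `negate_float` are hypothetical, and error checking is omitted. Whether a particular copy is really at risk depends on the host memory type and allocator, so treat this only as an illustration of where an explicit sync would go once the Thrust call no longer provides one:

```cuda
#include <cuda_runtime.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>

#include <vector>

// Hypothetical functor used only for this illustration.
struct negate_float {
  __host__ __device__ float operator()(float x) const { return -x; }
};

// Hypothetical detail helper: builds values on the host, copies them to
// d_out on `stream`, then negates them on the device.
void negate_host_built_values(float* d_out, int n, cudaStream_t stream)
{
  std::vector<float> h_values(n, 2.0f);  // host staging buffer, destroyed at return

  cudaMemcpyAsync(d_out, h_values.data(), n * sizeof(float), cudaMemcpyHostToDevice, stream);

  // With a synchronizing policy this call would also wait for the copy above.
  // With par_nosync it may return while work on `stream` is still pending.
  thrust::transform(thrust::cuda::par_nosync.on(stream), d_out, d_out + n, d_out, negate_float{});

  // Explicit sync so h_values provably outlives the asynchronous copy.
  cudaStreamSynchronize(stream);
}

// Small driver that returns the first element of the result.
float first_negated_value(int n)
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  float* d_out = nullptr;
  cudaMalloc(&d_out, n * sizeof(float));

  negate_host_built_values(d_out, n, stream);

  float first = 0.0f;
  cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_out);
  cudaStreamDestroy(stream);
  return first;
}
```

The explicit `cudaStreamSynchronize` gives back some of the `nosync` gain for this one function, which is the kind of per-call-site tradeoff the analysis has to weigh.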

(As a reminder of the current stream policy: [libcudf APIs called on the host do not guarantee that the stream is synchronized before returning](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#treat-libcudf-apis-as-if-they-were-asynchronous), but this does not mean we can always use `nosync` safely.)

### Tasks

I am planning to open PRs to use `nosync` across the libcudf codebase. Below is a list of benchmarks that showed improvements. I will need to analyze each benchmark to identify the primary algorithms being called that should be refactored to use `nosync`. This list is loosely sorted by largest impact for small data sizes (or whatever fixed data sizes are in the benchmark), which indicates overhead that we can systematically eliminate. Note that these measurements were taken with `nosync` used everywhere, which may not be stream-safe in all cases, so the real performance gains may be lower if some call sites cannot use `nosync`. Additionally, the improvement for a given algorithm may depend on improvements in other algorithms it calls, so the full improvement may not be achieved until all tasks are complete.

- [x] Gather/scatter. #12038
  - [Gather is 42% faster for 1024 rows, 1 column.](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L449)
  - [Scatter is 37% faster for 1024 rows, 1 column.](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2052)
- [ ] Search
  - [14% faster for Table, 1000 rows](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L4315)
  - [48% faster for ColumnContains_AllValid, 1024 rows](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L4339)
- [ ] ReductionScan
  - [39% faster for 10k rows of floats, no nulls](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2396)
- [ ] Rank
  - [37% faster for 1024 rows, no nulls](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2417)
  - [13% faster for 1024 rows, nulls](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2424)
- [ ] Sort
  - [31% faster for 1024 rows, unstable, no nulls](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2431)
- [ ] Repeat
  - [33% faster for 1024 rows, double, no nulls](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2508)
- [ ] Groupby
  - [28% faster for basic, 10k rows](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2776)
- [ ] Hash
  - [25% faster for Murmur3, no nulls, 16k rows](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2835)
- [ ] Compound reductions (like std, var)
  - [25% faster for std over 10k rows of floats](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2376)
- [ ] ReductionDictionary
  - [24% faster for 10k rows of floats.](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L2226)
- [ ] Quantiles
  - [15% faster for 65k rows](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L3642)
- [ ] Merge
  - [8% faster for 512k rows, 2 tables.](https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70#file-nosync_benchmarks_all-txt-L222)

Other notes:
- I excluded I/O benchmarks from the prioritized list of algorithms above, because I/O benchmarks are a little noisy on the system I used for benchmarking.

### Further work

After addressing the major tasks above where we have clearly identified speedups resulting from `nosync` policies, I will re-assess the rest of the code base to evaluate a broader replacement to use `nosync` policies everywhere in libcudf.

[FEA] Migrate to Thrust `nosync` stream policy for performance. #12086
