[FEA] Migrate to Thrust nosync stream policy for performance. #12086

Open
@bdice

Motivation & Description

Thrust 1.16 added an execution policy thrust::cuda::par_nosync that removes unnecessary internal stream synchronizations, except when required for correctness (e.g. if the algorithm returns a value to the host like thrust::reduce, a sync is required).
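To make the distinction concrete, here is a minimal sketch of the policy in use, assuming Thrust 1.16 or newer compiled with nvcc. The `negate_int` functor and the buffer sizes are made up for illustration and are not taken from libcudf. The `thrust::transform` call returns nothing to the host, so under `par_nosync` it may return before the kernel finishes; the `thrust::reduce` call returns a value to the host, so it still has to wait for the stream.

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>

#include <cuda_runtime.h>

#include <cstdio>

// Hypothetical functor, used only for this illustration.
struct negate_int {
  __host__ __device__ int operator()(int x) const { return -x; }
};

int main()
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  thrust::device_vector<int> in(1024, 1);
  thrust::device_vector<int> out(in.size());

  // No value is returned to the host, so with par_nosync this call is
  // allowed to return while the work is still queued on `stream`.
  thrust::transform(thrust::cuda::par_nosync.on(stream),
                    in.begin(), in.end(), out.begin(), negate_int{});

  // The result is returned to the host, so a synchronization is still
  // required here for correctness, even under par_nosync.
  int const sum = thrust::reduce(thrust::cuda::par_nosync.on(stream),
                                 out.begin(), out.end(), 0);

  std::printf("sum = %d\n", sum);  // prints "sum = -1024"

  cudaStreamDestroy(stream);
  return 0;
}
```

In libcudf the same idea is spelled `rmm::exec_policy_nosync(stream)` instead of `rmm::exec_policy(stream)`; the raw Thrust policy is used above only to keep the sketch free of extra dependencies.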

In PR #11577, I experimented with a bulk find-replace of rmm::exec_policy with rmm::exec_policy_nosync. This resulted in notable performance improvements, particularly for small data sizes. See: https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70

However, a plain find-replace may introduce stream-safety bugs. For example, an internal detail API could construct host memory and copy it to the device asynchronously; without a stream sync before returning, the host memory's lifetime may end before the copy completes (thought experiment identified by @jrhemstad). Changing to nosync execution policies therefore requires analyzing every use case individually, at both the detail and public API levels.
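A rough sketch of that thought experiment follows. Everything in it is hypothetical: `detail_fill`, `add_one`, and the pinned staging buffer are illustrative and do not come from libcudf. The point is that a synchronizing policy on the Thrust call would typically have waited on the stream as a side effect, which also happened to cover the earlier asynchronous copy. With `par_nosync` that side effect is gone, so the function needs its own explicit sync before the host buffer is released.

```cuda
#include <thrust/execution_policy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>

#include <cuda_runtime.h>

#include <cstdio>

// Hypothetical functor, used only for this illustration.
struct add_one {
  __host__ __device__ int operator()(int x) const { return x + 1; }
};

// Hypothetical detail API: stages values on the host, copies them to the
// device, then post-processes them with a Thrust algorithm.
void detail_fill(int* d_out, int n, cudaStream_t stream)
{
  // Pinned host memory, so the copy below is asynchronous with respect
  // to the host.
  int* h_staging = nullptr;
  cudaMallocHost(&h_staging, n * sizeof(int));
  for (int i = 0; i < n; ++i) { h_staging[i] = i; }

  cudaMemcpyAsync(d_out, h_staging, n * sizeof(int),
                  cudaMemcpyHostToDevice, stream);

  // With par_nosync this call may return while both the copy and the
  // transform are still pending on `stream`.
  thrust::transform(thrust::cuda::par_nosync.on(stream),
                    d_out, d_out + n, d_out, add_one{});

  // Explicit sync: the copy must have finished reading h_staging before
  // the buffer is released or reused.
  cudaStreamSynchronize(stream);
  cudaFreeHost(h_staging);
}

int main()
{
  constexpr int n = 8;
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int* d_out = nullptr;
  cudaMalloc(&d_out, n * sizeof(int));
  detail_fill(d_out, n, stream);

  int h_result[n];
  cudaMemcpy(h_result, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
  std::printf("last = %d\n", h_result[n - 1]);  // prints "last = 8"

  cudaFree(d_out);
  cudaStreamDestroy(stream);
  return 0;
}
```

Whether a given call site needs such a sync depends on what host-side state it touches, which is why each use has to be reviewed rather than replaced mechanically.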

(As a reminder of the current stream policy: libcudf APIs called on the host do not guarantee that the stream is synchronized before returning, but this does not mean we can always use nosync safely.)

Tasks

I am planning to open PRs to use nosync across the libcudf codebase. Below is a list of benchmarks that showed improvements. I will need to analyze each benchmark to identify the primary algorithms being called that should be refactored to use nosync. This list is loosely sorted by largest impact for small data sizes (or whatever fixed data sizes are in the benchmark), which indicates overhead that we can systematically eliminate. Note that these performance improvements use nosync everywhere, which may not be stream-safe in all cases, so the real performance gains may be lower if not all executions can use nosync. Additionally, the performance improvements for a given algorithm may rely on improvements in other algorithms, so the full improvement may not be achieved until all tasks are complete.

Other notes:

  • I excluded I/O benchmarks from the prioritized list of algorithms above, because I/O benchmarks are a little noisy on the system I used for benchmarking.

Further work

After addressing the major tasks above where we have clearly identified speedups resulting from nosync policies, I will re-assess the rest of the code base to evaluate a broader replacement to use nosync policies everywhere in libcudf.

Labels: Performance (Performance related issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)