[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620
Description
We have seen patterns where small cudaMemcpyAsync
collide with large cudaMemcpyAsync
being handled by the copy engine. Importantly, the small copy is in a different stream than the large copy. In the example below, we can see a H2D pinned copy of 51KB that was scheduled with a latency of 12ms because there is another pinned H2D copy happening at the same time (the larger copy is ~200MB).
The big issue behind this pattern is that Stream 30 in this case is serializing because nothing else will run in the stream until this small copy is done. Usually when we invoke kernels in cuDF there is a pattern of: small H2Ds, followed by kernel invocation, then small D2Hs. Any of the pre/post copies done around a kernel is a candidate to get stuck, serializing all the work in that stream.
We have a PoC that uses thrust::copy_n
to copy from pinned to device memory and viceversa using SMs instead of the copy engine, as kernels can directly touch pinned memory. When such an approach is followed, the small copy is able to run with much less latency and subsequent work in the stream is unblocked. This leads to kernels running at the same time as large copies, which is a desirable pattern.
This issue was created in order to track this work and to figure out how to bring these changes to cuDF in a configurable way so we don't affect serial workloads, as the effect is really prominent for parallel workloads such as Spark.
This is NDS q9 at 3TB, looking at the CUDA HW row, we can see how the compute (blue) overlaps more often than not with pinned H2Ds (green).
Metadata
Assignees
Type
Projects
Status
Story Issue