Open
Description
While debugging another issue on our systems, I realized that the btl/self component is not accelerator aware. As of right now, in a send-to-self operation from GPU memory to GPU memory, the data is first copied to host memory and then back to GPU memory. We should probably fix that and perform a direct accelerator.memcpy() operation instead, at least for contiguous buffers.
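
To illustrate the idea, here is a rough, host-only sketch of the proposed fast path. Everything in it is hypothetical: `mock_accelerator_t`, `mock_is_device_ptr`, `mock_device_memcpy` and `self_send_contiguous` are made-up names, the "device heap" is just a static host array, and none of this reflects the actual btl/self code or the real accelerator framework interface. It only shows the intended decision: if both buffers of a contiguous send-to-self are device buffers, do one direct device copy and skip the host staging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an accelerator module (NOT Open MPI's real API). */
typedef struct {
    bool (*is_device_ptr)(const void *ptr);
    int (*memcpy_fn)(void *dst, const void *src, size_t len);
} mock_accelerator_t;

/* A static host array pretending to be device memory, plus copy counters. */
static char mock_device_heap[256];
static int mock_device_copies = 0;
static int mock_host_copies = 0;

/* Assumption: a pointer is "device memory" if it lies inside the mock heap. */
static bool mock_is_device_ptr(const void *ptr)
{
    uintptr_t p = (uintptr_t) ptr;
    uintptr_t base = (uintptr_t) mock_device_heap;
    return p >= base && p < base + sizeof(mock_device_heap);
}

/* Stand-in for a device-to-device copy; a real module would call the
 * vendor runtime here instead of memmove(). */
static int mock_device_memcpy(void *dst, const void *src, size_t len)
{
    memmove(dst, src, len);
    mock_device_copies++;
    return 0;
}

/* Contiguous send-to-self: take the direct device path when both buffers
 * are device buffers, otherwise fall back to a plain host copy. */
static int self_send_contiguous(const mock_accelerator_t *acc,
                                void *dst, const void *src, size_t len)
{
    assert(acc != NULL && dst != NULL && src != NULL);

    if (acc->is_device_ptr(src) && acc->is_device_ptr(dst)) {
        /* One direct copy, no host bounce buffer. */
        return acc->memcpy_fn(dst, src, len);
    }

    memcpy(dst, src, len);
    mock_host_copies++;
    return 0;
}
```

Non-contiguous datatypes would still need the existing pack/unpack path (or something smarter), which is why I would start with contiguous buffers only.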