btl/self is not accelerator aware

While debugging another issue on our systems, I realized that the btl/self component is not accelerator aware. As of right now, in a send-to-self operation from GPU memory to GPU memory, the data is first copied to host memory and then back to GPU memory. We should probably fix that and perform a direct accelerator.memcpy() operation instead, at least for contiguous buffers.
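
To illustrate the idea, here is a minimal, self-contained sketch of the dispatch I have in mind for the contiguous case. It is not the actual btl/self code or the real accelerator framework interface: `sketch_accel_t`, `is_device_addr`, `memcpy_fn`, and `self_copy_contiguous` are hypothetical names, and the "device" is just a static host array so that the example compiles and runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an accelerator module (NOT the real interface). */
typedef struct {
    /* Returns nonzero if addr refers to device memory. */
    int (*is_device_addr)(const void *addr);
    /* Copies len bytes between buffers the accelerator can reach; 0 on success. */
    int (*memcpy_fn)(void *dst, const void *src, size_t len);
} sketch_accel_t;

/* Records which copy path was taken, for illustration only. */
typedef enum { PATH_HOST_MEMCPY, PATH_ACCEL_MEMCPY } sketch_path_t;

/* Send-to-self for a contiguous buffer: if either side is device memory,
 * issue one direct accelerator copy instead of staging through the host. */
static int self_copy_contiguous(const sketch_accel_t *accel, void *dst,
                                const void *src, size_t len,
                                sketch_path_t *path)
{
    assert(NULL != accel && NULL != path);

    if (accel->is_device_addr(src) || accel->is_device_addr(dst)) {
        *path = PATH_ACCEL_MEMCPY;
        return accel->memcpy_fn(dst, src, len);
    }

    /* Plain host-to-host case: a regular memcpy is enough. */
    *path = PATH_HOST_MEMCPY;
    memcpy(dst, src, len);
    return 0;
}

/* Mock "device": a static host array standing in for GPU memory. */
static unsigned char mock_device_mem[128];
static int mock_accel_copies = 0;

static int mock_is_device_addr(const void *addr)
{
    uintptr_t p = (uintptr_t) addr;
    uintptr_t base = (uintptr_t) mock_device_mem;
    return p >= base && p < base + sizeof(mock_device_mem);
}

static int mock_accel_memcpy(void *dst, const void *src, size_t len)
{
    mock_accel_copies++;       /* count direct accelerator copies */
    memcpy(dst, src, len);     /* the mock device is host memory underneath */
    return 0;
}
```

Non-contiguous datatypes are deliberately left out of this sketch; they would still need the pack/unpack path or some other handling.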