[Core] InfiniBand and RDMA support for Ray object store #30094
Description
Description
I use Ray in an HPC cluster. The cluster has InfiniBand which has low latency and high bandwidth. Ray is based on gRPC and data transferring uses gRPC, too. I can use IPoIB(Internet Protocol over InfiniBand) in the cluster. In this way, I can not make full use of IB's bandwidth. It has the potential to get better performance.
Use case
I want to help ray to support RDMA for object transferring.
RDMA, which can reduce CPU interruptions for network processing and increase CPU utilization, is good at transferring memory data with a better performance. With the help of InfiniBand, I think ray's performance will be improved.
Tensorflow, which is also based on gRPC, has supported many ways for distributed environments such as grpc+verbs, and grpc+MPI, etc. The gRPC is used for controlling computing. The verbs and MPI are used for data transfer. As for Ray, I think they are familiar. I hope to separate the object store part and make it use RDMA as TensorFlow does.
I also do some tests on ray and MPI with different net environments. Ray and MPI used TCP over ethernet for data transfer as a baseline. MPI can speed up 16X with IPoIB and Ray can speed up about 10X. MPI can speed up 90X with RDMA and I think ray can get familiar improvement.
I have reviewed Ray’s code and it is a sophisticated project. I try to focus on the object store part in Ray’s withepaper. It help me but I still can find out where I can start with. I lost myself in the code ocean. And the whitepaper is different from the code. I cannot figure out how the data is transferred when the ray run. Are there any up-to-date documents I can refer to?
Where should I start if I want to contribute to the Ray project? Is any import class or file I should pay attention to?