[Core] InfiniBand and RDMA support for Ray object store #30094

Open
@hellofinch

Description

I use Ray on an HPC cluster equipped with InfiniBand, which offers low latency and high bandwidth. Ray is built on gRPC and uses gRPC for data transfer as well. I can run it over IPoIB (Internet Protocol over InfiniBand) on the cluster, but that does not make full use of InfiniBand's bandwidth, so there is room for better performance.

Use case

I want to help Ray support RDMA for object transfer.

RDMA reduces the CPU interrupts spent on network processing, which frees CPU time for application work, and it is well suited to transferring in-memory data. With InfiniBand available, I think Ray's performance would improve.

TensorFlow, which is also based on gRPC, supports several transports for distributed environments, such as grpc+verbs and grpc+MPI. gRPC is used to control the computation, while verbs and MPI are used for data transfer. I think Ray is similar. I hope to separate out the object store part and have it use RDMA as TensorFlow does.
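To illustrate the split described above, here is a minimal Python sketch, not Ray's or TensorFlow's actual code. A control channel announces an object's ID and size, and a pluggable data transport moves the bytes. All names (`DataTransport`, `SocketTransport`, `push_object`, `pull_object`) are hypothetical. The `queue.Queue` stands in for gRPC control messages, and the local socket pair stands in for a TCP or IPoIB connection. An RDMA backend would be another `DataTransport` subclass and is not implemented here.

```python
import abc
import queue
import socket
import threading


class DataTransport(abc.ABC):
    """Hypothetical data-plane interface: moves raw object bytes."""

    @abc.abstractmethod
    def send(self, payload: bytes) -> None:
        ...

    @abc.abstractmethod
    def recv(self, nbytes: int) -> bytes:
        ...


class SocketTransport(DataTransport):
    """Baseline data plane over a connected stream socket (e.g. TCP or IPoIB)."""

    def __init__(self, sock: socket.socket) -> None:
        self._sock = sock

    def send(self, payload: bytes) -> None:
        self._sock.sendall(payload)

    def recv(self, nbytes: int) -> bytes:
        # A stream socket may return partial reads, so loop until complete.
        buf = bytearray()
        while len(buf) < nbytes:
            chunk = self._sock.recv(min(65536, nbytes - len(buf)))
            if not chunk:
                raise ConnectionError("peer closed before the full object arrived")
            buf.extend(chunk)
        return bytes(buf)


def push_object(control: queue.Queue, transport: DataTransport,
                object_id: str, payload: bytes) -> None:
    # Control plane: announce the object's ID and size (gRPC in a real system).
    control.put({"object_id": object_id, "size": len(payload)})
    # Data plane: ship the bytes over whichever transport was selected.
    transport.send(payload)


def pull_object(control: queue.Queue, transport: DataTransport):
    # Read the control message first so the receiver knows how much to expect.
    header = control.get(timeout=5)
    return header["object_id"], transport.recv(header["size"])


def demo() -> bool:
    # A local socket pair stands in for a network connection between two nodes.
    left, right = socket.socketpair()
    control: queue.Queue = queue.Queue()
    payload = bytes(range(256)) * 4096  # 1 MiB test object

    # Send from a separate thread so a full socket buffer cannot deadlock us.
    sender = threading.Thread(
        target=push_object,
        args=(control, SocketTransport(left), "obj-1", payload),
    )
    sender.start()
    object_id, received = pull_object(control, SocketTransport(right))
    sender.join()
    left.close()
    right.close()
    return object_id == "obj-1" and received == payload


if __name__ == "__main__":
    print("round trip ok:", demo())
```

The point of this design is that `push_object` and `pull_object` never touch the socket directly, so swapping the data plane would not change the control path.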

I also ran some tests with Ray and MPI in different network environments, using TCP over Ethernet for data transfer as the baseline for both. With IPoIB, MPI is about 16x faster and Ray about 10x faster. With RDMA, MPI is about 90x faster, and I think Ray could see a similar improvement.

I have reviewed Ray's code, and it is a sophisticated project. I tried to focus on the object store part of Ray's whitepaper. It helped, but I still cannot figure out where to start. I got lost in the sea of code, and the whitepaper differs from the code. I cannot work out how data is transferred while Ray is running. Are there any up-to-date documents I can refer to?

Where should I start if I want to contribute to the Ray project? Are there any important classes or files I should pay attention to?

Metadata

Labels

P1: Issue that should be fixed within a few weeks
core: Issues that should be addressed in Ray Core
core-object-store
enhancement: Request for new feature and/or capability
