UCX_MULTIRAIL is a proof-of-concept demonstrating multi-rail communication using the Unified Communication X (UCX) framework across multiple GPU nodes. It showcases how a single message can be split and transmitted over multiple network interface cards (communication rails) to increase effective bandwidth and overall throughput.
Multi-rail communication is achieved by dividing a message buffer into multiple chunks and distributing them across multiple GPUs using `cudaMemcpy()`. Each chunk is transmitted via a distinct UCX endpoint, enabling parallel communication. On the receiving side, the chunks are gathered from the respective GPUs and reassembled into one final receive buffer.
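The split-and-reassemble idea can be sketched on the host alone. The following is a minimal illustration, not the project's implementation: `Chunk`, `split_even`, and `roundtrip` are hypothetical names, the even split is an assumption, and plain host copies stand in for both the `cudaMemcpy()` staging and the UCX transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper type: one contiguous slice of the message assigned to a rail.
struct Chunk {
    std::size_t offset;  // start of the slice within the message buffer
    std::size_t length;  // number of bytes in the slice
};

// Divide `total` bytes into `rails` near-equal chunks. The first
// `total % rails` chunks receive one extra byte so that nothing is lost.
std::vector<Chunk> split_even(std::size_t total, std::size_t rails) {
    assert(rails > 0);
    std::vector<Chunk> chunks;
    const std::size_t base = total / rails;
    const std::size_t extra = total % rails;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < rails; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        chunks.push_back({offset, len});
        offset += len;
    }
    return chunks;
}

// Host-only stand-in for the send/receive path: each chunk is staged into its
// own per-rail buffer (a GPU buffer in the real setting), then gathered back
// into a single receive buffer at its original offset.
std::vector<unsigned char> roundtrip(const std::vector<unsigned char>& msg,
                                     std::size_t rails) {
    const std::vector<Chunk> chunks = split_even(msg.size(), rails);

    // "Send" side: one staging buffer per rail.
    std::vector<std::vector<unsigned char>> staged;
    for (const Chunk& c : chunks) {
        const auto first = msg.begin() + static_cast<std::ptrdiff_t>(c.offset);
        staged.emplace_back(first, first + static_cast<std::ptrdiff_t>(c.length));
    }

    // "Receive" side: reassemble the chunks into one buffer.
    std::vector<unsigned char> recv(msg.size());
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (chunks[i].length > 0) {
            std::memcpy(recv.data() + chunks[i].offset, staged[i].data(),
                        chunks[i].length);
        }
    }
    return recv;
}
```

Because every chunk carries its own offset, the receiver can place chunks in any arrival order, which is what allows the rails to progress independently.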
The project supports:
- Configurable number of communication rails (1, 2, 4)
- Pipelined communication to enhance overlap and throughput
- Benchmarking modes for evaluating bandwidth and scalability
Key parameters:
- Split ratio: Defines how the message is divided among communication rails
- Pipeline stages: Controls the number of overlapping communication steps
Note: Optimal settings depend on the message size and hardware. A parameter sweep is recommended to identify the best configuration.
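How the two parameters could translate into byte counts is sketched below. This is one possible reading, not the project's confirmed behavior: it assumes the split ratio is the percentage of the message shared evenly among the peer rails, with the remainder staying on the local rail, and that each rail's share is then cut into equal pipeline stages. `rail_sizes` and `stage_sizes` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes assigned to each rail. Index 0 is the local rail; indices 1..rails-1
// are peer rails. ASSUMPTION: `ratio_percent` of the message is divided evenly
// among the peer rails and the rest stays on the local rail.
std::vector<std::size_t> rail_sizes(std::size_t total, std::size_t rails,
                                    unsigned ratio_percent) {
    assert(rails >= 1 && ratio_percent <= 100);
    std::vector<std::size_t> sizes(rails, 0);
    if (rails == 1) {
        sizes[0] = total;  // single rail: nothing to hand off
        return sizes;
    }
    const std::size_t peers = rails - 1;
    const std::size_t peer_total = total * ratio_percent / 100;
    const std::size_t per_peer = peer_total / peers;
    for (std::size_t i = 1; i < rails; ++i) {
        sizes[i] = per_peer;
    }
    sizes[0] = total - per_peer * peers;  // local rail absorbs rounding remainders
    return sizes;
}

// Cut one rail's share into `stages` pipeline pieces; the last piece absorbs
// the remainder so the pieces always sum to `rail_bytes`.
std::vector<std::size_t> stage_sizes(std::size_t rail_bytes, std::size_t stages) {
    assert(stages >= 1);
    std::vector<std::size_t> sizes(stages, rail_bytes / stages);
    sizes.back() += rail_bytes % stages;
    return sizes;
}
```

Under this reading, a 50% ratio with two rails and a 75% ratio with four rails both give every rail an equal share of the message. A parameter sweep would then loop over candidate ratios and stage counts and time each combination.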
Ensure the following dependencies are installed:
- CMake >= 3.21
- CUDA >= 12.0
- UCX >= 1.17
Build with CMake:

    mkdir build
    cd build
    cmake ..
    make

- `run_basic_`: Basic test to validate the correctness of multi-rail communication. Sends a single message and prints the result.
- `run_bench_`: Executes a single benchmark with configurable parameters.
Each benchmark supports the following flags:
- `-T`: Select test type
  - `TEST`: Basic communication test
  - `SPLIT`: One message is split across multiple rails
  - `PROFILE`: Enables profiling
  - `MR`: Multi-rail parallel send
  - `SINGLE`: Single-rail performance
- `-n`: Number of communication rails (e.g., 1, 2, 4)
- `-k`: Number of pipeline stages
- `-r`: Split ratio (percentage of message sent to peer rails)
Note: Update the receiver's address in the sender script before running any test.
Benchmarks were conducted on Hawk-AI at the High-Performance Computing Center Stuttgart (HLRS).
| Configuration | Message Size | Rails | Pipeline Stages | Split Ratio (%) | Observed Bandwidth |
|---|---|---|---|---|---|
| Baseline (Single-Rail) | 10 MB | 1 | - | - | ~20 GB/s |
| Two-Rail | 20 MB | 2 | 1 | 50 | ~38 GB/s |
| Four-Rail | 40 MB | 4 | 1 | 75 | ~63 GB/s |
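To put the figures above in terms of scaling, speedup and per-rail efficiency can be computed against the single-rail baseline. The helper below is a small illustrative sketch (`Scaling` and `scaling` are hypothetical names), and the inputs are the approximate bandwidths from the table.

```cpp
#include <cassert>

// Scaling of a multi-rail run relative to the single-rail baseline.
struct Scaling {
    double speedup;     // observed bandwidth / baseline bandwidth
    double efficiency;  // speedup / number of rails (1.0 = ideal linear scaling)
};

// Both bandwidths must use the same unit (GB/s here).
Scaling scaling(double baseline_gb_per_s, double observed_gb_per_s, int rails) {
    assert(baseline_gb_per_s > 0.0 && rails > 0);
    const double speedup = observed_gb_per_s / baseline_gb_per_s;
    return {speedup, speedup / rails};
}
```

For example, `scaling(20.0, 38.0, 2)` gives a speedup of about 1.9 (roughly 95% efficiency), and `scaling(20.0, 63.0, 4)` gives about 3.15 (roughly 79% efficiency). These numbers are only indicative, since the table values are approximate and the message size differs between configurations.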
If you use this work in academic or scientific contexts, please cite:
M. Rose, S. Homes, L. Ramsperger, J. Gracia, C. Niethammer, and J. Vrabec.
Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics.
HeteroPar 2025, accepted.