UCX_MULTIRAIL - Proof of Concept

Overview

UCX_MULTIRAIL is a proof-of-concept demonstrating multi-rail communication using the Unified Communication X (UCX) framework across multiple GPU nodes. It showcases how a single message can be split and transmitted over multiple network interface cards (communication rails) to increase effective bandwidth and overall throughput.

Concept

Multi-rail communication is achieved by dividing a message buffer into multiple chunks and distributing them across multiple GPUs using cudaMemcpy(). Each chunk is transmitted via a distinct UCX endpoint, enabling parallel communication. On the receiving side, the chunks are gathered from the respective GPUs and reassembled into one final receive buffer.
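
The split-and-reassemble idea can be illustrated on the host with plain byte buffers. This is only a minimal sketch, not the project's implementation: it assumes equal-sized chunks (one per rail) and leaves out the GPU copies and UCX transfers entirely. The function names split_into_chunks and reassemble are hypothetical.

```python
def split_into_chunks(message: bytes, num_rails: int) -> list[bytes]:
    """Split message into num_rails contiguous chunks of near-equal size."""
    if num_rails < 1:
        raise ValueError("num_rails must be at least 1")
    base, extra = divmod(len(message), num_rails)
    chunks = []
    offset = 0
    for rail in range(num_rails):
        # The first `extra` rails carry one additional byte each.
        size = base + (1 if rail < extra else 0)
        chunks.append(message[offset:offset + size])
        offset += size
    return chunks


def reassemble(chunks: list[bytes]) -> bytes:
    """Concatenate the chunks in rail order to rebuild the original message."""
    return b"".join(chunks)


message = bytes(range(10))
chunks = split_into_chunks(message, 4)
print([len(c) for c in chunks])  # [3, 3, 2, 2]
print(reassemble(chunks) == message)  # True
```

Keeping the chunks contiguous and ordered by rail means the receiver only needs the rail index and chunk size to place each piece at the right offset.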

The project supports:

Configurable number of communication rails (1, 2, 4)
Pipelined communication to enhance overlap and throughput
Benchmarking modes for evaluating bandwidth and scalability

Key parameters:

Split ratio: Defines how the message is divided among communication rails
Pipeline stages: Controls the number of overlapping communication steps
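
To make the two parameters concrete, the sketch below computes how many bytes each rail would carry per pipeline stage. It rests on assumptions that the README does not confirm: the split ratio is read as the percentage of the message sent to peer rails, the peer rails share that portion evenly, rail 0 is the local rail, and each rail's share is cut into equal pipeline stages. The function name plan_transfers is hypothetical.

```python
def plan_transfers(message_bytes: int, num_rails: int,
                   split_ratio_pct: int, pipeline_stages: int) -> list[list[int]]:
    """Return, per rail, the list of stage sizes in bytes (rail 0 = local rail)."""
    if num_rails < 1 or pipeline_stages < 1:
        raise ValueError("num_rails and pipeline_stages must be at least 1")
    if not 0 <= split_ratio_pct <= 100:
        raise ValueError("split_ratio_pct must be between 0 and 100")

    if num_rails == 1:
        # Single rail: the split ratio has no effect.
        rail_bytes = [message_bytes]
    else:
        # Assumption: peers share the split-off portion evenly.
        peer_total = message_bytes * split_ratio_pct // 100
        peers = num_rails - 1
        base, extra = divmod(peer_total, peers)
        peer_bytes = [base + (1 if i < extra else 0) for i in range(peers)]
        rail_bytes = [message_bytes - peer_total] + peer_bytes

    plan = []
    for total in rail_bytes:
        # Assumption: each rail's share is cut into equal pipeline stages.
        base, extra = divmod(total, pipeline_stages)
        plan.append([base + (1 if s < extra else 0) for s in range(pipeline_stages)])
    return plan


# 40 MB over 4 rails with a 75% split ratio and 2 pipeline stages.
for rail, stages in enumerate(plan_transfers(40_000_000, 4, 75, 2)):
    print(rail, stages)
```

Under these assumptions, a 75% split ratio on four rails gives every rail the same 10 MB share, and a 50% split ratio on two rails does the same.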

Note: Optimal settings depend on the message size and hardware. A parameter sweep is recommended to identify the best configuration.

Build Instructions

Prerequisites

Ensure the following dependencies are installed:

CMake >= 3.21
CUDA >= 12.0
UCX >= 1.17

Build

mkdir build
cd build
cmake ..
make

Usage

Executables

run_basic_: Basic test to validate correctness of multi-rail communication. Sends a single message and prints the result.
run_bench_: Executes a single benchmark with configurable parameters.

Example CLI Options

Each benchmark supports the following flags:

-T: Test type, one of:
  - TEST: Basic communication test
  - SPLIT: One message is split across multiple rails
  - PROFILE: Enables profiling
  - MR: Multi-rail parallel send
  - SINGLE: Single-rail performance
-n: Number of communication rails (e.g., 1, 2, 4)
-k: Number of pipeline stages
-r: Split ratio (percentage of the message sent to peer rails)
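
The parameter sweep recommended in the Concept section could be driven by a small script that enumerates flag combinations. The sketch below only builds the command lines and does not run anything. The flags -T, -n, -k and -r come from the list above; the value ranges, the space-separated flag syntax and the placeholder path ./BENCH_EXECUTABLE are assumptions and need to be adapted to the actual benchmark binary or script.

```python
import itertools
import shlex


def sweep_commands(executable, rails=(2, 4), stages=(1, 2, 4), ratios=(25, 50, 75)):
    """Build one SPLIT-benchmark command line per (rails, stages, ratio) combination."""
    commands = []
    for n, k, r in itertools.product(rails, stages, ratios):
        args = [executable, "-T", "SPLIT", "-n", str(n), "-k", str(k), "-r", str(r)]
        commands.append(shlex.join(args))
    return commands


# ./BENCH_EXECUTABLE is a hypothetical placeholder, not a file in the repository.
cmds = sweep_commands("./BENCH_EXECUTABLE", rails=(2, 4), stages=(1, 2), ratios=(50, 75))
print(len(cmds))  # 8
print(cmds[0])    # ./BENCH_EXECUTABLE -T SPLIT -n 2 -k 1 -r 50
```

Generating the commands as a list first makes it easy to review the sweep before submitting it, for example as separate jobs per configuration.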

Note: Update the receiver's address in the sender script before running any test.

Benchmark Results

Benchmarks were conducted on Hawk-AI at the High-Performance Computing Center Stuttgart (HLRS).

| Configuration | Message Size | Rails | Pipeline Stages | Split Ratio (%) | Observed Bandwidth |
| --- | --- | --- | --- | --- | --- |
| Baseline (Single-Rail) | 10 MB | 1 | - | - | ~20 GB/s |
| Two-Rail | 20 MB | 2 | 1 | 50 | ~38 GB/s |
| Four-Rail | 40 MB | 4 | 1 | 75 | ~63 GB/s |
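
To read the table as scaling numbers, the approximate bandwidths can be turned into a speedup over the single-rail baseline and a per-rail efficiency. The short sketch below does this arithmetic with the rounded values from the table. Since the message size differs between rows, the result is only a rough indicator; the function name scaling is hypothetical.

```python
# (configuration, rails, observed bandwidth in GB/s), taken from the table above.
results = [
    ("Baseline (Single-Rail)", 1, 20.0),
    ("Two-Rail", 2, 38.0),
    ("Four-Rail", 4, 63.0),
]


def scaling(rows):
    """Return (name, speedup over first row, efficiency in percent) per row."""
    baseline = rows[0][2]
    out = []
    for name, rails, bandwidth in rows:
        speedup = bandwidth / baseline
        # Efficiency: achieved bandwidth relative to rails x baseline bandwidth.
        efficiency_pct = 100 * bandwidth / (baseline * rails)
        out.append((name, round(speedup, 2), round(efficiency_pct, 2)))
    return out


for name, speedup, efficiency_pct in scaling(results):
    print(f"{name}: {speedup}x speedup, {efficiency_pct}% efficiency")
```

With these values, two rails reach about 1.9x the baseline (95% efficiency) and four rails about 3.15x (78.75% efficiency).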

Citation

If you use this work in academic or scientific contexts, please cite:

M. Rose, S. Homes, L. Ramsperger, J. Gracia, C. Niethammer, and J. Vrabec.
Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics.
HeteroPar 2025, accepted.
