Automated inter-node bandwidth testing and visualization for GPU clusters using NCCL.
Example topology graph of an 8-node H100 cluster, with 8 GPUs per node. (alltoall_perf)
Key Features:
- Run single-node NCCL tests for intra-node performance evaluation
- Run pairwise NCCL tests across all node combinations for inter-node evaluation
- Parse logs and generate summary reports (CSV/Markdown)
- Visualize network topology with bandwidth graphs
- Support for SLURM clusters
Testing Strategy:
- Single-node tests: Evaluate intra-node GPU communication performance on each node individually
- Pairwise tests: For N nodes, test all pairs (e.g., 4 nodes → 6 pairs: A-B, A-C, A-D, B-C, B-D, C-D) to evaluate inter-node communication
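The pairwise expansion is the standard n-choose-2 enumeration. A minimal sketch of the idea (node names below are placeholders; the actual scripts derive the list from the SLURM partition or `--nodelist`):

```bash
# Placeholder node names; the real scripts expand the SLURM nodelist instead.
nodes=(cnode-001 cnode-002 cnode-003 cnode-004)

# N nodes -> N*(N-1)/2 unordered pairs (here: 4 nodes -> 6 pairs).
for ((i = 0; i < ${#nodes[@]}; i++)); do
  for ((j = i + 1; j < ${#nodes[@]}; j++)); do
    echo "${nodes[$i]},${nodes[$j]}"
  done
done
```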
- Motivation
- Limitations
- Project Structure
- Prerequisites
- Usage
- Useful Links
- Troubleshooting
- Known Issues
NVIDIA’s NCCL Tests already provides a reliable and privilege-free way to benchmark GPU communication performance. However, in real-world HPC or cloud environments, users without administrative access often face limited visibility into the system. Tools such as NVIDIA DCGM or low-level network profilers are typically unavailable, making it difficult to obtain a clear picture of inter-node communication performance.
This raises a practical challenge:
How can we systematically evaluate and visualize inter-node bandwidth patterns in a GPU cluster without relying on administrative tools or privileges?
NCCL Tests Cluster bridges this gap by automating and extending NCCL Tests for scalable, user-level performance evaluation. It enables users to:
- Automatically run pairwise NCCL benchmarks across all node combinations
- Parse and summarize logs into structured CSV/Markdown reports
- Visualize network topology with bandwidth-based heatmaps and graphs
Together, these capabilities extend NCCL testing into a fully automated and scalable workflow—making it easier to verify cluster health, identify communication bottlenecks, and optimize resource allocation even without system-level monitoring tools.
- Scheduler: Only SLURM is supported currently
- GPU/NIC Selection:
  - No automatic testing of all GPU/NIC combinations
  - Manual configuration via environment variables (e.g., CUDA_VISIBLE_DEVICES, NCCL variables) is possible; see the sketch after this list
  - GPU/NIC details are only visible in debug logs (--debug enables NCCL_DEBUG=INFO)
- Test Configuration: Only pairwise (N=2) tests are supported
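As an illustration, a manual GPU/NIC selection exported before submitting the scripts might look like the following (interface and device names are system-specific placeholders, not values from this project):

```bash
# Illustration only: adjust names and values for your own hardware.
export CUDA_VISIBLE_DEVICES=0,1,2,3   # restrict which GPUs the test may use
export NCCL_SOCKET_IFNAME=eth0        # pin the interface NCCL uses for sockets/bootstrap
export NCCL_IB_HCA=mlx5_0,mlx5_1      # restrict InfiniBand HCAs, if applicable
export NCCL_DEBUG=INFO                # what --debug enables, to see the chosen GPU/NIC in logs
```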
This structure allows users to easily manage benchmark results across multiple clusters, with each cluster maintaining its own NCCL test logs and summaries. Additional documents or scripts related to specific clusters—such as hardware specifications or other types of benchmarks—can also be included as needed.
    benchmarks/
      {cluster_name}/                  # e.g., cluster01: 8 nodes × 8 H100 GPUs each
        nccl-benchmark-results/
          single-node/                 # Single-node test results
            with-debug/
              logs/
              summary.csv
              summary.md
            without-debug/
              (same as above)
          multi-node/                  # Multi-node test results
            (same as above)
          pairwise/                    # Pairwise test results
            with-debug/
              logs/
              topology/
              summary.csv
              summary.md
            without-debug/
              (same as above)
        # ... other documents/scripts of this cluster
      {cluster_name2}/
    nccl/
      build/                           # Compiled NCCL library (NCCL_HOME)
      nccl-tests/
        build/                         # Compiled NCCL test binaries (NCCL_TEST)
    # ... scripts

For convenience, it is recommended to clone this repository into $HOME/ by default. Otherwise, you may need to adjust the paths in sbatch_run_nccl_tests_pairs.sh accordingly.
cd $HOME
git clone https://github.com/xxrjun/nccl-tests-cluster.git
cd nccl-tests-cluster

Tip
This project is built on NVIDIA/nccl and NVIDIA/nccl-tests. Please refer to their README files for more information about NCCL and the NCCL tests.
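For reference, building both projects by hand roughly follows the upstream instructions; the sketch below assumes CUDA (and optionally MPI) are already available, and the paths are placeholders:

```bash
# Rough manual build following the upstream READMEs; paths are placeholders.
cd /path/to/nccl
make -j src.build                          # produces build/, used as NCCL_HOME

cd /path/to/nccl-tests
make MPI=1 NCCL_HOME=/path/to/nccl/build   # MPI=1 may also need MPI_HOME=<your MPI install>
```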
Alternatively, run the provided build script build_nccl_and_tests.sh to build NCCL and the NCCL tests automatically.
bash build_nccl_and_tests.sh

Install the required packages for log parsing and topology visualization.
Option 1: Using uv (recommended)
If you don't have uv installed, you can install it via
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Create and activate a virtual environment, then install the required packages:
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

Option 2: Using pip:
pip install -r requirements.txt

The plotting feature for single-node tests is still a work in progress.
Test intra-node GPU communication performance on individual nodes.
View help:
bash sbatch_run_nccl_tests_single.sh --help

Basic usage:
# Test all nodes in a partition with default GPU counts (4, 8)
bash sbatch_run_nccl_tests_single.sh -p gpu-partition -c cluster01
# Test specific nodes
bash sbatch_run_nccl_tests_single.sh -p gpu-partition -c cluster01 -n "cnode-[001-004]"
# Custom GPU counts
bash sbatch_run_nccl_tests_single.sh -p gpu-partition -c cluster01 --gpn "2 4 8"
# Dry run (preview without submitting)
bash sbatch_run_nccl_tests_single.sh -p gpu-partition -c cluster01 --dry-run
# Enable debug mode
bash sbatch_run_nccl_tests_single.sh -p gpu-partition -c cluster01 --debug

Example output:
Submitting 4 single-node jobs...
cnode-001
cnode-002
cnode-003
cnode-004
Submit: NCCL_N1_G4_cnode-001 --nodelist=cnode-001 --gpus-per-node=4
Submitted batch job 1234
# ...
==========================================
Submission Summary
==========================================
Total nodes: 4
Jobs per node: 2
Total jobs: 8
Submitted: 8
Skipped: 0
DRY RUN: 0
NCCL DEBUG: 0
==========================================

Test inter-node GPU communication performance across all node pairs.
View help:
bash sbatch_run_nccl_tests_pairs.sh --help

Basic usage:
# Test all node pairs in a partition with default GPU counts (1, 2, 4, 8)
bash sbatch_run_nccl_tests_pairs.sh -p gpu-partition -c cluster01
# Test specific nodes
bash sbatch_run_nccl_tests_pairs.sh -p gpu-partition -c cluster01 -n "cnode-[001-004]"
# Custom GPU counts
bash sbatch_run_nccl_tests_pairs.sh -p gpu-partition -c cluster01 --gpn "2 4 8"
# Dry run (preview without submitting)
bash sbatch_run_nccl_tests_pairs.sh -p gpu-partition -c cluster01 --dry-run
# Enable debug mode
bash sbatch_run_nccl_tests_pairs.sh -p gpu-partition -c cluster01 --debug

Tip
It is highly recommended to first test with only two nodes to verify that your NCCL environment is working correctly:
bash sbatch_run_nccl_tests_pairs.sh -p gpu-partition -c cluster01 -n "cnode-[001-002]"

Example output:
Submitting 6 pairs...
cnode-001,cnode-002
cnode-001,cnode-003
# ...
==========================================
Submission Summary
==========================================
Total pairs: 6
Jobs per pair: 4
Total jobs: 24
Submitted: 24
Skipped: 0
DRY RUN: 0
NCCL DEBUG: 0
==========================================

Cancel jobs if needed:
scancel -u $USER

Common CLI Options:
| Option | Description | Default |
|---|---|---|
| `-p, --partition` | SLURM partition name | Required |
| `-c, --cluster` | Cluster name for log organization | `cluster01` |
| `-n, --nodelist` | Compressed nodelist (e.g., `"cnode-[001-004]"`) | All nodes in partition |
| `-l, --log-dir` | Custom log directory | `benchmarks/<CLUSTER>/nccl-tests-{single\|pairs}/without-debug/logs` |
| `--gpn` | Space-separated GPU counts | Single: `"4 8"`, Pairs: `"1 2 4 8"` |
| `--dry-run` | Preview commands without submitting | `false` |
| `--debug` | Enable NCCL debug mode (affects performance) | `false` |
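The options can be combined freely; for example, the command below previews 8-GPU pairwise jobs for a subset of nodes without submitting anything (partition and node names are placeholders):

```bash
# Placeholder partition/node names; --dry-run prints the sbatch commands without submitting.
bash sbatch_run_nccl_tests_pairs.sh \
  -p gpu-partition \
  -c cluster01 \
  -n "cnode-[001-004]" \
  --gpn "8" \
  --dry-run
```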
Parse NCCL test logs and generate summary reports (CSV + Markdown).
# Process single-node test logs
python summarize_nccl_logs.py --input benchmarks/cluster01/nccl-benchmark-results/single-node/without-debug/logs
# Process pairwise test logs
python summarize_nccl_logs.py --input benchmarks/cluster01/nccl-benchmark-results/pairwise/without-debug/logs
# Batch mode: process both with-debug/ and without-debug/
python summarize_nccl_logs.py --input benchmarks/cluster01/nccl-benchmark-results/pairwise/
# Custom output paths
python summarize_nccl_logs.py \
--input benchmarks/.../logs \
--save-csv /path/to/summary.csv \
--save-md /path/to/summary.md

Filename Format:
- Single-node: `..._N1_G{G}_node.log` (e.g., `nccl_N1_G8_cnode-001.log`)
- Pairs: `..._N2_G{G}_node1_node2.log` (e.g., `nccl_N2_G8_cnode-005_cnode-006.log`)
- The `_debug` suffix is automatically ignored
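For a quick sanity check before running the parser, the final bandwidth line of each log can be inspected directly (assuming the standard nccl-tests footer; the path below is only an example):

```bash
# Print each log's average bus bandwidth line together with its filename.
grep -H "Avg bus bandwidth" \
  benchmarks/cluster01/nccl-benchmark-results/pairwise/without-debug/logs/*.log
```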
Visualize network topology with bandwidth heatmaps from summary.csv.
# Process all tests and G values (recommended)
python generate_topology.py --csv benchmarks/cluster01/nccl-benchmark-results/pairwise/without-debug/summary.csv --all
# Single test, all G values
python generate_topology.py --csv ./summary.csv --test alltoall_perf
# With custom styling
python generate_topology.py --csv ./summary.csv --all \
--vmin 0 --vmax 80 --layout shell --adjust-labels

Output: topology/{test_name}/G{n}.png + allG.png (combined grid)
Key Options:
- `--all`: Process all tests and G values
- `--test NAME`: Process specific test only
- `--adjust-labels`: Auto-adjust overlapping labels (useful for dense graphs)
- `--layout`: Algorithm (kamada, shell, spring, circular, bipartite, cluster)
- `--vmin` / `--vmax`: Bandwidth color scale range
- `--dpi`: Resolution (default: 300)
Run python generate_topology.py --help for all options.
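If you maintain several clusters under benchmarks/, the graphs can be regenerated in one pass; a small sketch (the cluster names are examples):

```bash
# Example cluster names; adjust to match the directories under benchmarks/.
for c in cluster01 cluster02; do
  python generate_topology.py \
    --csv "benchmarks/$c/nccl-benchmark-results/pairwise/without-debug/summary.csv" \
    --all --adjust-labels
done
```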
- [docs] NVIDIA NCCL Documentation
- [paper] [2507.04786] Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
Tip
If you encounter issues related to NCCL, it is highly recommended to search for or post your questions on NCCL GitHub Issues and NCCL Tests GitHub Issues.
- If single-node tests succeed but multi-node tests fail, try specifying the network interface used for communication:

  export NCCL_SOCKET_IFNAME=<iface>
- If average bus bandwidth is significantly below theoretical limits when using small transfer sizes (e.g., 32 MB), consider increasing `MINIMUM_TRANSFER_SIZE` in the scripts (default: 32M). Larger transfer sizes typically achieve higher sustained bandwidth.
- If you see red lines in the topology graphs, they indicate failed tests or missing data. Check the corresponding log files for detailed error messages.
Example topology graph of a 17-node H100 cluster, with 8 GPUs per node. (sendrecv_perf)
- If you see multiple processes using the same Rank in the logs, make sure NCCL Tests was compiled with MPI support enabled.
      # nccl-tests version 2.17.6 nccl-headers=22807 nccl-library=22807
      # Collective test starting: alltoall_perf
      # nThread 1 nGpus 1 minBytes 33554432 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
      # Using devices
      # Rank 0 Group 0 Pid 223120 on cnode2-002 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 223121 on cnode2-002 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 256267 on cnode2-001 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 223123 on cnode2-002 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 256268 on cnode2-001 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 223122 on cnode2-002 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 223125 on cnode2-002 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 223124 on cnode2-002 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 256264 on cnode2-001 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 256265 on cnode2-001 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 256266 on cnode2-001 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
      # Rank 0 Group 0 Pid 256269 on cnode2-001 device 0 [0000:1b:00] NVIDIA H100 80GB HBM3
Try loading the MPI module and recompiling NCCL Tests:

module load openmpi
cd nccl/nccl-tests
make clean
make MPI=1
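After rebuilding, a quick sanity run inside a two-node allocation should list distinct ranks in the "Using devices" block; a sketch only, since the exact srun flags and binary path depend on your SLURM configuration and build layout:

```bash
# Expect 16 distinct ranks (0-15) across 2 nodes x 8 GPUs if MPI support is working;
# flags and the binary path are assumptions, adjust to your setup.
srun -N 2 --ntasks-per-node=8 --gpus-per-node=8 \
  ./build/alltoall_perf -b 32M -e 1G -f 2 -g 1
```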
- For large clusters, the topology graph becomes too crowded to identify performance differences between groups (if any). Consider splitting the cluster into smaller sub-clusters for better visualization; better support for large clusters is planned for future releases.
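One way to split the work today is to run the pairwise tests on node subsets via `-n` and inspect each subset's graphs separately (the node ranges below are placeholders):

```bash
# Placeholder node ranges: benchmark two halves of the cluster separately.
bash sbatch_run_nccl_tests_pairs.sh -p gpu-partition -c cluster01 -n "cnode-[001-008]"
bash sbatch_run_nccl_tests_pairs.sh -p gpu-partition -c cluster01 -n "cnode-[009-016]"
```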
