pplx-garden

Perplexity AI open source garden for inference technology

Research Paper

RDMA Point-to-Point Communication for LLM Systems

P2P MoE dispatch/combine kernel

Support both NVIDIA ConnectX-7 and AWS EFA (potentially other RDMA NICs as wel)
Use NVLink for intra-node data transfer and RDMA for inter-node
Optimize for decode, while also support prefill
Split send and recv stages for both dispatch and combine, allow micro-batching
SM-free while RDMA transfer
Support CUDA Graph

RDMA TransferEngine library

Support both NVIDIA ConnectX-7 and AWS EFA (potentially other RDMA NICs as well)
Support aggregation of multiple NICs per GPU
Support reliable unordered transport protocol

System requirements

(Recommended) Linux Kernel 5.12 or higher (for DMA-BUF support)
CUDA 12.8 or higher
libfabric
libibverbs
GDRCopy
SYS_PTRACE and SYS_ADMIN Linux capabilities for pidfd_getfd. You can obtain these by running as root, with sudo, or inside docker with --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN.
RDMA network with GPUDirect RDMA support. Each GPU should have at least one dedicated RDMA NIC.

Docker dev image

We provide a docker image for the convenience of development. You can build it with the following command:

docker build -t pplx-garden-dev - < docker/dev.Dockerfile

Run the container with the following command:

./scripts/run-docker.sh

Run fabric-debug

This is the benchmark for our network library.

Build the benchmark binary:

cd /app
cargo build --release --bin fabric-debug

Usage:

Server: fabric-debug [GPUs separated by comma] [NICs per GPU]
Client: fabric-debug [GPUs separated by comma] [NICs per GPU] [server address] where the server address is the one printed by the server.

Example:

server$ /app/target/release/fabric-debug 0,1,2,3,4,5,6,7 2
client$ /app/target/release/fabric-debug 0,1,2,3,4,5,6,7 2 fe80xxxx

Build and Install Python Wheel

cd /app
export TORCH_CMAKE_PREFIX_PATH=$(python3 -c "import torch; print(torch.utils.cmake_prefix_path)")
python3 -m build --wheel
python3 -m pip install /app/dist/*.whl

Run All-to-All Benchmark

# Environment variables
NUM_NODES=...
NODE_RANK=...  # [0, NUM_NODES)
MASTER_IP=...

# Run on all nodes
cd /app
python3 -m benchmarks.bench_all_to_all \
    --world-size $((NUM_NODES * 8)) --nets-per-gpu 2 --init-method=tcp://$MASTER_IP:29500 \
    --node-rank=$NODE_RANK --nvlink=8

Note:

Remove --nvlink flag if you want to use RDMA only.
Set --nets-per-gpu accordingly based on the VM instance type.

All-to-All Performance Results

Decode (128 tokens) Dispatch and Combine:

	pplx-EFA	pplx-CX7	DeepEP-CX7	x	pplx-EFA	pplx-CX7	DeepEP-CX7
EP64	266.7 μs	187.5 μs	177.9 μs	x	391.2 μs	309.1 μs	325.0 μs
EP32	229.1 μs	153.9 μs	159.1 μs	x	335.0 μs	266.3 μs	285.0 μs
EP16	214.8 μs	110.2 μs	123.9 μs	x	241.5 μs	185.5 μs	203.0 μs
EP8	49.7 μs	50.5 μs	42.6 μs	x	64.2 μs	65.3 μs	72.0 μs

Prefill (4096 tokens) Dispatch and Combine:

x	pplx-EFA	pplx-CX7	DeepEP-CX7	x	pplx-EFA	pplx-CX7	DeepEP-CX7
EP64	5334.3 μs	4665.2 μs	5071.6 μs	x	9779.3 μs	8771.1 μs	5922.7 μs
EP32	4619.0 μs	4011.8 μs	3680.2 μs	x	8271.5 μs	7526.8 μs	3565.4 μs
EP16	3196.7 μs	2734.8 μs	2481.9 μs	x	5379.1 μs	1062.2 μs	1863.9 μs
EP8	1052.4 μs	5071.1 μs	1810.3 μs	x	1396.7 μs	1405.1 μs	962.9 μs

Directory Structure

fabric-lib/: RDMA TransferEngine library
p2p-all-to-all/: P2P MoE All-to-All implementation
python-ext/: Python extension module from Rust code
python/pplx_garden/: Python code for the pplx_garden package
rust/: Rust utility libraries

Acknowledgments

Our RDMA library is inspired by MoonCake. Our MoE kernel is inspired by DeepEP.

Citation

If you find this work useful, please cite:

@misc{pplx-rdma-p2p,
      title={RDMA Point-to-Point Communication for LLM Systems}, 
      author={Nandor Licker and Kevin Hu and Vladimir Zaytsev and Lequn Chen},
      year={2025},
      eprint={2510.27656},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2510.27656}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
benchmarks		benchmarks
docker		docker
fabric-lib		fabric-lib
p2p-all-to-all		p2p-all-to-all
python-ext		python-ext
python/pplx_garden		python/pplx_garden
rust		rust
scripts		scripts
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
rust-toolchain.toml		rust-toolchain.toml
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

pplx-garden

Research Paper

P2P MoE dispatch/combine kernel

RDMA TransferEngine library

System requirements

Docker dev image

Run fabric-debug

Build and Install Python Wheel

Run All-to-All Benchmark

All-to-All Performance Results

Directory Structure

Acknowledgments

Citation

About

Uh oh!

Releases

Packages

Contributors 2

Languages

License

perplexityai/pplx-garden

Folders and files

Latest commit

History

Repository files navigation

pplx-garden

Research Paper

P2P MoE dispatch/combine kernel

RDMA TransferEngine library

System requirements

Docker dev image

Run fabric-debug

Build and Install Python Wheel

Run All-to-All Benchmark

All-to-All Performance Results

Directory Structure

Acknowledgments

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages