PyTorch w/ MPI Backend for ANL Aurora

This repository contains files and instructions to compile PyTorch on ANL Aurora with the MPI distributed backend.

To compile PyTorch, follow these instructions on a compute node:

Start an interactive job and set up access to the internet.
Load the frameworks module: module load frameworks
Clone PyTorch from GitHub (tested with v2.8).
Add "xpu" to the list of devices supported by the MPI backend here.
Copy the pytorch_build.sh script from this repository into the root of the PyTorch directory and run it to compile PyTorch (./pytorch_build.sh).
Install your PyTorch build as a user package in the frameworks module: pip install --user dist/*.
To test your installation, run run_dist_test.sh in a multi-node interactive job.
To understand how to use the MPI backend in your application, use dist_test.py as an example.
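
For orientation, here is a minimal sketch of what an MPI-backend program can look like. It is not the contents of dist_test.py; the function names (expected_allreduce_sum, main) and the device-selection rule (rank modulo the number of visible XPUs) are assumptions made for illustration. It only uses standard torch.distributed calls and skips cleanly if the installed PyTorch has no MPI support.

```python
import importlib.util


def expected_allreduce_sum(world_size: int) -> int:
    """Value every rank should hold after a SUM all-reduce when
    rank r contributes (r + 1): 1 + 2 + ... + world_size."""
    return world_size * (world_size + 1) // 2


def main() -> None:
    # Skip gracefully when PyTorch is missing or was built without MPI.
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed; skipping")
        return
    import torch
    import torch.distributed as dist

    if not dist.is_available() or not dist.is_mpi_available():
        print("this PyTorch build has no MPI backend; skipping")
        return

    # With the MPI backend, rank and world size are taken from the MPI
    # launcher, so no rank/world_size arguments are passed here.
    dist.init_process_group(backend="mpi")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Assumption: pick an XPU by global rank if any are visible,
    # otherwise fall back to CPU. Real jobs may prefer a local rank.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        device = torch.device("xpu", rank % torch.xpu.device_count())
    else:
        device = torch.device("cpu")

    # Each rank contributes (rank + 1); after all_reduce every rank
    # should hold the same total.
    x = torch.full((4,), float(rank + 1), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    ok = bool((x == float(expected_allreduce_sum(world_size))).all())
    print(f"rank {rank}/{world_size} on {device}: all_reduce ok = {ok}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A script like this is started through the MPI launcher rather than torchrun, for example with mpiexec -n 4 python followed by the script name; the exact launcher options for a given job are not covered here.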
