# torchrunx 🔥

[Python versions](https://github.com/apoorvkh/torchrunx/blob/main/pyproject.toml)
[PyPI](https://pypi.org/project/torchrunx/)
[Documentation](https://torchrunx.readthedocs.io)
[License](https://github.com/apoorvkh/torchrunx/blob/main/LICENSE)

By [Apoorv Khandelwal](http://apoorvkh.com) and [Peter Curtin](https://github.com/pmcurtin)

**Automatically distribute PyTorch functions onto multiple machines or GPUs**

## Installation

```bash
pip install torchrunx
```

Requires: Linux, Python >= 3.8.1, PyTorch >= 2.0

A shared filesystem & SSH access between hosts are also required if using multiple machines.

## Why should I use this?

[`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) is a hammer. `torchrunx` is a chisel.

Whether you have 1 GPU, 8 GPUs, or 8 machines:

Convenience:

- If you don't want to set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself (see the sketch after this list)
- If you want to run `python myscript.py` instead of `torchrun myscript.py`
- If you don't want to manually SSH into every machine and run `torchrun --master_addr --master_port ...` (or babysit those machines for hanging failures)
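
For example, here is a minimal sketch of that workflow. The parameter names (`func`, `hostnames`, `workers_per_host`) are assumptions based on the usage example below, not the authoritative API; see the documentation for the real signature:

```python
# Minimal sketch (not the official example): torchrunx runs an ordinary Python
# function on every worker and returns each worker's result, with no torchrun
# CLI invocation and no manual dist.init_process_group. Parameter names here
# are assumptions.
import torch.distributed as dist
import torchrunx as trx

def hello() -> str:
    # The process group is already initialized inside the worker,
    # so torch.distributed calls work directly.
    return f"hello from rank {dist.get_rank()} of {dist.get_world_size()}"

if __name__ == "__main__":
    results = trx.launch(
        func=hello,
        hostnames=["localhost"],  # or several machines reachable over SSH
        workers_per_host=2,       # e.g. one worker per GPU
    )
    print(results["localhost"])   # list of per-worker return values for this host
```

You run this with plain `python myscript.py`; no `torchrun` command is needed.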

Robustness:

- If you want to run a complex, _modular_ workflow in one script (see the sketch after this list)
  - no worries about memory leaks or OS failures
  - don't parallelize your entire script: just the functions you want
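
As a sketch of what that modularity looks like (the function names, return values, and `trx.launch` parameters here are illustrative assumptions): each stage is its own launch, and the code between stages is ordinary single-process Python:

```python
# Illustrative sketch of a modular workflow; names and values are placeholders.
import torchrunx as trx

def train() -> str:
    # ... distributed training would happen here ...
    return "path/to/checkpoint.pt"   # placeholder return value

def evaluate() -> float:
    # ... distributed evaluation would happen here ...
    return 0.98                      # placeholder accuracy

if __name__ == "__main__":
    # Stage 1: a distributed launch just for training.
    ckpt = trx.launch(func=train, hostnames=["localhost"], workers_per_host=2)["localhost"][0]

    # Ordinary single-process Python between stages: the stage-1 worker
    # processes (and their memory) are already gone at this point.
    print(f"checkpoint: {ckpt}")

    # Stage 2: a separate distributed launch for evaluation.
    accuracy = trx.launch(func=evaluate, hostnames=["localhost"], workers_per_host=2)["localhost"][0]
    print(f"accuracy: {accuracy}")
```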

Features:

- Our launch utility is super _Pythonic_
- Run distributed PyTorch functions from Python notebooks
- Automatic integration with SLURM (see the sketch below)
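
A hedged sketch of what the SLURM integration could look like. The helper names `trx.slurm_hosts()` and `trx.slurm_workers()` are assumptions, not confirmed API; check the torchrunx documentation for the actual SLURM support:

```python
# Hypothetical sketch: the helper names below are assumptions, not confirmed API.
# The idea is that hostnames and per-host worker counts come from the SLURM
# job environment instead of being hard-coded.
import torchrunx as trx

def work() -> str:
    return "ran inside the SLURM allocation"

if __name__ == "__main__":
    results = trx.launch(
        func=work,
        hostnames=trx.slurm_hosts(),           # assumed helper: nodes in this SLURM job
        workers_per_host=trx.slurm_workers(),  # assumed helper: e.g. GPUs per node
    )
    print(results)
```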

Why not?

- We don't support fault tolerance via torch elastic, which is probably only useful if you are using 1000 GPUs. Maybe someone can make a PR.

## Usage

```python
import torchrunx as trx

# ... (earlier part of the example not shown) ...

accuracy = trx.launch(
    ...
)["localhost"][0]

print(f'Accuracy: {accuracy}')
```