README.md

By [Apoorv Khandelwal](https://apoorvkh.com) and [Peter Curtin](https://github.c

---
**`torchrunx`** is a *functional* utility for distributing PyTorch code across devices. This is a [more convenient, robust, and featureful](https://torchrun.xyz/features.html) alternative to CLI-based launchers, like `torchrun`, `accelerate launch`, and `deepspeed`.
It enables complex workflows within a single script and has useful features even if only using 1 GPU.
Requires: Linux. If using multiple machines: SSH & shared filesystem.

Suppose we have some distributed training function (which needs to run on every GPU):
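The actual example is elided in this diff. As a rough, hypothetical sketch of the pattern (every name below is invented for illustration, and plain threads stand in for per-GPU worker processes; a real per-rank function would set up `torch.distributed`), a per-rank function plus a launcher that collects its return values might look like:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-rank "training" function: in real usage this would
# initialize torch.distributed and train a model; here each rank just
# computes its share of a sum so the sketch stays runnable.
def distributed_training(rank: int, world_size: int) -> int:
    return sum(range(rank, 100, world_size))

# Minimal stand-in for a functional launcher: runs the function once per
# rank and returns every rank's result to the caller.
def launch(fn, world_size: int):
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        futures = [pool.submit(fn, rank, world_size) for rank in range(world_size)]
        return [f.result() for f in futures]

results = launch(distributed_training, world_size=4)
# The 4 ranks together cover range(100) exactly once.
```

The key point of the functional style: the launcher is an ordinary function call, and each rank's return value comes back to the caller as an ordinary Python object.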
---
## `torchrunx` uniquely offers
1. **An automatic launcher that "just works" for everyone** 🚀
> `torchrunx` is an SSH-based, pure-Python library that is universally easy to install.<br>
> No system-specific dependencies and orchestration for *automatic* multi-node distribution.
2. **Conventional CLI commands** 🖥️
> Run familiar commands, like `python my_script.py ...`, and customize arguments as you wish.
>
> Other launchers override `python` in a cumbersome way: e.g. `torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=100.43.331.111 --master_port=1234 my_script.py ...`.
3. **Support for more complex workflows in a single script** 🎛️
> Your workflow may have steps that are complex (e.g. pre-train, fine-tune, test) or may use different parallelizations (e.g. training on 8 GPUs, testing on 1 GPU). In these cases, CLI-based launchers require each step to live in its own script. Our library treats these steps in a modular way, so they can cleanly fit together in a single script!
>
> We clean memory leaks as we go, so previous steps won't crash or adversely affect future steps.
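The single-script, multi-step idea above can be sketched in plain Python (all names below are hypothetical, and threads stand in for GPU worker processes; a real script would use the library's launcher):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in launcher: one "worker" per rank, results gathered in rank order.
def run_step(fn, world_size: int, **kwargs):
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        futs = [pool.submit(fn, rank, world_size, **kwargs) for rank in range(world_size)]
        return [f.result() for f in futs]

def train(rank: int, world_size: int, steps: int) -> str:
    # A real step would run data-parallel training on this rank's GPU.
    return f"model(steps={steps}, trained by rank {rank}/{world_size})"

def evaluate(rank: int, world_size: int, model: str) -> str:
    return f"accuracy of {model} measured on rank {rank}"

# Train on 8 "workers", then evaluate the rank-0 result on a single worker.
# Both steps live in one script and pass ordinary objects to each other.
model = run_step(train, world_size=8, steps=100)[0]
report = run_step(evaluate, world_size=1, model=model)[0]
```

Because each step is just a function call, changing a step's parallelization is a matter of changing one argument rather than maintaining a separate launch script.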
4. **Better handling of system failures. No more zombies!** 🧟
> With `torchrun`, your "work" is inherently coupled to your main Python process. If the system kills one of your workers (e.g. due to RAM OOM or segmentation faults), there is no way to fail gracefully in Python. Your processes might hang for 10 minutes (the NCCL timeout) or become perpetual zombies.
>
> `torchrunx` decouples "launcher" and "worker" processes. If the system kills a worker, our launcher immediately raises a `WorkerFailure` exception, which users can handle as they wish. We always clean up all nodes, so no more zombies!
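The launcher/worker decoupling described above can be illustrated with the stdlib alone (`WorkerFailure` and `launch_worker` here are invented stand-ins, not the library's actual API):

```python
import subprocess
import sys

class WorkerFailure(Exception):
    """Raised by the 'launcher' when a worker process dies unexpectedly."""

def launch_worker(code: str) -> str:
    # The launcher runs the worker as a separate process, so a system-level
    # kill (OOM killer, segfault) surfaces immediately as a nonzero exit
    # status instead of hanging or zombifying the launcher.
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
    if proc.returncode != 0:
        raise WorkerFailure(f"worker exited with status {proc.returncode}")
    return proc.stdout

# A healthy worker returns its output to the launcher...
out = launch_worker("print('step done')")

# ...while a worker killed by the system (simulated via SIGKILL) raises a
# catchable exception in the launcher, which can clean up and continue.
try:
    launch_worker("import os, signal; os.kill(os.getpid(), signal.SIGKILL)")
except WorkerFailure as e:
    failure = str(e)
```

On Linux, a process killed by SIGKILL reports exit status `-9` to its parent, which is what makes the failure immediately detectable here.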
5. **Bonus features** 🎁
> - Return objects from distributed functions.
> - [Automatic detection of SLURM environments.](https://torchrun.xyz/usage/slurm.html)
> - Start multi-node training from Python notebooks!
> - Our library is fully typed!
> - Custom, fine-grained handling of [logging](https://torchrun.xyz/usage/logging.html), [environment variables](https://torchrun.xyz/usage/general.html#environment-variables), and [exception propagation](https://torchrun.xyz/usage/general.html#exceptions). We have nice defaults too: no more interleaved logs and irrelevant exceptions!
**On our [roadmap](https://github.com/apoorvkh/torchrunx/issues?q=is%3Aopen+is%3Aissue+label%3Aenhancement): higher-order parallelism, support for debuggers, and more!**
**Refer to our [API](https://torchrun.xyz/api.html), [Features](https://torchrun.xyz/features.html), and [Usage](https://torchrun.xyz/usage/general.html) for many more capabilities!**
---

docs/source/artifacts/deepspeed_help.txt:

```
[2025-06-25 15:33:02,489] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /users/akhand10/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
```