Commit 1f04a00

kwen2501 authored and pytorchmergebot committed
[PyTorch Distributed] Update documentation about NCCL environment variables (#74006)
Summary:
Pull Request resolved: #74006

updated recommendations about environment variables to use during debug and performance tuning

Test Plan: `make html`

Reviewed By: rohan-varma

Differential Revision: D34767454

fbshipit-source-id: 08cd58469bf72b58702e50e82020fa19b43b5911
(cherry picked from commit ac7e663)
1 parent 56aa1ab commit 1f04a00

1 file changed: +18 -8 lines changed

docs/source/distributed.rst

Lines changed: 18 additions & 8 deletions
@@ -123,14 +123,24 @@ It is imperative that all processes specify the same number of interfaces in thi
 Other NCCL environment variables
 """"""""""""""""""""""""""""""""
 
-NCCL has also provided a number of environment variables for fine-tuning purposes.
-
-Commonly used ones include the following for debugging purposes:
-
-- ``export NCCL_DEBUG=INFO``
-- ``export NCCL_DEBUG_SUBSYS=ALL``
-
-For the full list of NCCL environment variables, please refer to
+**Debugging** - in case of NCCL failure, you can set ``NCCL_DEBUG=INFO`` to print an explicit
+warning message as well as basic NCCL initialization information.
+
+You may also use ``NCCL_DEBUG_SUBSYS`` to get more details about a specific
+aspect of NCCL. For example, ``NCCL_DEBUG_SUBSYS=COLL`` would print logs of
+collective calls, which may be helpful when debugging hangs, especially those
+caused by collective type or message size mismatch. In case of topology
+detection failure, it would be helpful to set ``NCCL_DEBUG_SUBSYS=GRAPH``
+to inspect the detailed detection result and save as reference if further help
+from NCCL team is needed.
+
+**Performance tuning** - NCCL performs automatic tuning based on its topology detection to save users'
+tuning effort. On some socket-based systems, users may still try tuning
+``NCCL_SOCKET_NTHREADS`` and ``NCCL_NSOCKS_PERTHREAD`` to increase socket
+network bandwidth. These two environment variables have been pre-tuned by NCCL
+for some cloud providers, such as AWS or GCP.
+
+For a full list of NCCL environment variables, please refer to
 `NVIDIA NCCL's official documentation <https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html>`_
 

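The variables recommended in the new documentation text can also be set from Python instead of the shell. The sketch below is illustrative only and is not part of this commit. The helper names `nccl_debug_env` and `nccl_socket_env` are hypothetical, and the limit of 64 on the product of the two socket variables is taken from NVIDIA's NCCL documentation as an assumption to verify against the NCCL version in use. The variables most likely need to be in the environment before NCCL is initialized, for example before `torch.distributed.init_process_group(backend="nccl")` is called.

```python
import os

# Assumed upper bound on NCCL_SOCKET_NTHREADS * NCCL_NSOCKS_PERTHREAD, as
# described in NVIDIA's NCCL documentation; verify it for your NCCL version.
MAX_SOCKET_PRODUCT = 64


def nccl_debug_env(subsystems=None):
    """Build the debugging variables (hypothetical helper, not a PyTorch API)."""
    env = {"NCCL_DEBUG": "INFO"}
    if subsystems:
        # NCCL_DEBUG_SUBSYS takes a comma-separated list, e.g. "COLL" or "GRAPH".
        env["NCCL_DEBUG_SUBSYS"] = ",".join(subsystems)
    return env


def nccl_socket_env(nthreads, nsocks_per_thread):
    """Build the socket tuning variables (hypothetical helper, not a PyTorch API)."""
    if nthreads < 1 or nsocks_per_thread < 1:
        raise ValueError("both values must be positive integers")
    if nthreads * nsocks_per_thread > MAX_SOCKET_PRODUCT:
        raise ValueError(
            f"product {nthreads * nsocks_per_thread} exceeds {MAX_SOCKET_PRODUCT}"
        )
    return {
        "NCCL_SOCKET_NTHREADS": str(nthreads),
        "NCCL_NSOCKS_PERTHREAD": str(nsocks_per_thread),
    }


if __name__ == "__main__":
    # Log collective calls, which the doc text suggests for debugging hangs.
    os.environ.update(nccl_debug_env(["COLL"]))
    # Example socket settings; the values 4 and 4 are placeholders, not advice.
    os.environ.update(nccl_socket_env(4, 4))
    # Initialize the process group only after the variables are in place, e.g.
    # torch.distributed.init_process_group(backend="nccl")
```

Keeping the settings in plain dictionaries makes it easy to pass the same values to worker processes through a launcher's `env` argument rather than mutating the parent environment.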