training-checkpoints-develop

@binzily binzily tagged this 08 Apr 15:19
## Description

Adds a troubleshooting note to the multi-GPU training docs for Linux systems where distributed training may fail with `CUDA error: an illegal memory access was encountered` reported by `ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default distributed training behavior in IsaacLab or `rsl_rl`. The note documents NCCL environment-variable workarounds that were observed to restore stability on some affected systems:

- `NCCL_SHM_DISABLE=1`
- `NCCL_IB_DISABLE=1`
- `NCCL_ALGO=Ring`
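
As a minimal sketch of how these variables could be applied from a Python launcher rather than typed on the command line, the helper below copies an environment and sets the three values listed above. The names `NCCL_WORKAROUNDS` and `with_nccl_workarounds` are hypothetical and are not part of IsaacLab or `rsl_rl`.

```python
import os

# Workaround variables listed in this PR. NCCL reads them as strings
# from the environment of each training process.
NCCL_WORKAROUNDS = {
    "NCCL_SHM_DISABLE": "1",  # disable the shared-memory transport
    "NCCL_IB_DISABLE": "1",   # disable the InfiniBand/RoCE transport
    "NCCL_ALGO": "Ring",      # restrict NCCL to the ring algorithm
}


def with_nccl_workarounds(base_env=None, names=None):
    """Return a copy of base_env with the selected workaround variables set.

    base_env defaults to the current process environment; names defaults
    to all three variables. The input mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in (NCCL_WORKAROUNDS if names is None else names):
        env[name] = NCCL_WORKAROUNDS[name]
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` together with whatever distributed launch command is already in use, so the overrides apply only to that training run and not to the parent shell.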

The motivation for this change is to provide an official troubleshooting path for users who hit NCCL transport or algorithm-selection issues on specific Linux multi-GPU setups. In our local reproduction, the failure was not caused by IsaacLab task logic itself, but occurred in the distributed training stack when using NCCL with humanoid locomotion workloads.

Dependencies: none.

Refs #4011
Refs #2756

## Type of change

- Documentation update

## Screenshots

N/A

## Checklist

- [x] I have read and understood the [contribution guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [ ] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

## Context

Local reproduction environment:
- Ubuntu 22.04.5
- RTX 5090 x2
- Isaac Sim / IsaacLab multi-GPU training
- official distributed minimal reproduction with `Isaac-Velocity-Flat-G1-v0`

Observed behavior:
- the default distributed launch failed with NCCL illegal memory access
- `NCCL_SHM_DISABLE=1` was sufficient to make the official dual-GPU minimal reproduction pass
- `NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring` also restored stability in a longer validation run
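
The observations above suggest trying the smallest change first and escalating only if the failure persists. The following is a rough sketch of that idea, not a procedure taken from the IsaacLab docs: `TIERS` and `first_passing_tier` are hypothetical names, `cmd` stands for whatever distributed launch command is already in use, and a zero exit code is assumed to mean the run passed.

```python
import os
import subprocess

# Tiers taken from the observed behavior, smallest change first.
TIERS = [
    {},  # default launch, no overrides
    {"NCCL_SHM_DISABLE": "1"},
    {"NCCL_SHM_DISABLE": "1", "NCCL_IB_DISABLE": "1", "NCCL_ALGO": "Ring"},
]


def first_passing_tier(cmd, tiers=TIERS, runner=subprocess.run):
    """Run cmd under each tier in order.

    Returns the first overrides dict whose run exits with code 0,
    or None if every tier fails. runner is injectable for testing.
    """
    for overrides in tiers:
        # Layer the overrides on top of the current environment.
        env = {**os.environ, **overrides}
        result = runner(cmd, env=env)
        if result.returncode == 0:
            return overrides
    return None
```

A short run that exits cleanly does not guarantee long-run stability, so the tier this returns is best treated as a starting point for a longer validation run like the one described above.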

This PR documents those workarounds without changing defaults, since NCCL transport and algorithm selection is handled below the IsaacLab task layer.

---------

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>