NCCL is installed on cluster GPU hosts by default #1014

@aybchan

Description

NCCL is installed on cluster GPU hosts by default (e.g. on a3-megagpu-8g clusters), but this may not be the most desirable default behaviour.

User containers (the 'gpu-image' in xpk) typically bundle their own NCCL version, which user code depends on and expects to use. However, in GKE we have found that this expectation is broken: NCCL binaries from the GPU hosts are mounted into the containers and take precedence on LD_LIBRARY_PATH by default. This can break user code whenever the container and host NCCL versions diverge.
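To make the shadowing concrete, here is a minimal Python sketch of the dynamic loader's left-to-right search over LD_LIBRARY_PATH. The directory names are hypothetical stand-ins for the host-mounted and container-bundled NCCL locations; the point is only that whichever directory appears first wins.

```python
import os
import tempfile

def find_library(name, ld_library_path):
    """Return the first entry on a colon-separated search path that
    contains `name`, mimicking the loader's left-to-right order."""
    for d in ld_library_path.split(":"):
        candidate = os.path.join(d, name)
        if os.path.exists(candidate):
            return candidate
    return None

# Hypothetical layout: both the host mount and the container image
# ship libnccl.so.2, and the host directory is prepended to the path.
with tempfile.TemporaryDirectory() as root:
    host_dir = os.path.join(root, "host-nccl")        # mounted from the GPU host
    container_dir = os.path.join(root, "container-nccl")  # bundled in the image
    for d in (host_dir, container_dir):
        os.makedirs(d)
        open(os.path.join(d, "libnccl.so.2"), "w").close()

    search_path = f"{host_dir}:{container_dir}"
    # The host copy shadows the container copy, even though user code
    # was built and tested against the bundled version.
    print(find_library("libnccl.so.2", search_path))
```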

It would be helpful to be able to easily disable the NCCL install on GPU hosts performed by the nccl-tcpxo-installer pods. The manifest provided to xpk from container-engine-accelerators already supports this: removing the -install-nccl flag (nccl-tcpxo-installer.yaml#L88) skips the install.

With xpk cluster creation I was only able to do this by manually editing the deployment blueprints after they were generated by xpk, and then deploying them manually. I was not able to get an end-to-end cluster deployment with xpk by, for example, repointing the hardcoded paths to the manifest (e.g.).
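As a sketch of what the manual blueprint edit amounts to, the snippet below strips the -install-nccl argument from an installer DaemonSet manifest loaded as a plain dict. The manifest shape and container name here are assumptions for illustration; only the -install-nccl flag itself comes from the manifest referenced above. In practice one would round-trip the real nccl-tcpxo-installer.yaml through a YAML parser before and after this step.

```python
def drop_install_nccl_flag(manifest):
    """Remove the -install-nccl argument from every container in a
    DaemonSet-shaped manifest dict, leaving other args untouched."""
    containers = manifest["spec"]["template"]["spec"]["containers"]
    for c in containers:
        c["args"] = [a for a in c.get("args", []) if a != "-install-nccl"]
    return manifest

# Minimal stand-in for the relevant fragment of the installer manifest
# (structure and flag list are hypothetical apart from -install-nccl).
manifest = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "nccl-tcpxo-installer",
         "args": ["-install-nccl", "-setup-fastrak"]},
    ]}}}
}

patched = drop_install_nccl_flag(manifest)
print(patched["spec"]["template"]["spec"]["containers"][0]["args"])
# → ['-setup-fastrak']
```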
