NCCL is installed on cluster GPU hosts by default #1014

@aybchan

Description

NCCL is installed on cluster GPU hosts by default (e.g. on a3-megagpu-8g clusters), but this may not be the most desirable default behaviour.

User containers (the 'gpu-image' in xpk) typically bundle their own NCCL version, which user code depends on and expects to use. However, in GKE we have found that this expectation is broken: NCCL binaries from the GPU hosts are mounted into the containers and take precedence on LD_LIBRARY_PATH by default. This can break user code whenever the container and host NCCL versions diverge.
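To make the shadowing concrete, here is a minimal Python sketch of the dynamic loader's left-to-right search over LD_LIBRARY_PATH. The directory names are hypothetical stand-ins for the host-mounted and container-bundled NCCL locations; the point is only that whichever directory appears first wins.

```python
import os
import tempfile

def find_library(name, ld_library_path):
    """Return the first entry on a colon-separated search path that
    contains `name`, mimicking the loader's left-to-right order."""
    for d in ld_library_path.split(":"):
        candidate = os.path.join(d, name)
        if os.path.exists(candidate):
            return candidate
    return None

# Hypothetical layout: both the host mount and the container image
# ship libnccl.so.2, and the host directory is prepended to the path.
with tempfile.TemporaryDirectory() as root:
    host_dir = os.path.join(root, "host-nccl")        # mounted from the GPU host
    container_dir = os.path.join(root, "container-nccl")  # bundled in the image
    for d in (host_dir, container_dir):
        os.makedirs(d)
        open(os.path.join(d, "libnccl.so.2"), "w").close()

    search_path = f"{host_dir}:{container_dir}"
    # The host copy shadows the container copy, even though user code
    # was built and tested against the bundled version.
    print(find_library("libnccl.so.2", search_path))
```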

It would be helpful to be able to easily disable the NCCL install on GPU hosts performed by the nccl-tcpxo-installer pods. The manifest provided to xpk from container-engine-accelerators already supports this: removing the -install-nccl flag (nccl-tcpxo-installer.yaml#L88) skips the install.

With xpk cluster creation I was only able to do this by manually editing the deployment blueprints after they were generated by xpk, and then deploying them manually. I was not able to get an end-to-end cluster deployment with xpk by, for example, repointing the hardcoded paths to the manifest (e.g.).
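As a sketch of what the manual blueprint edit amounts to, the snippet below strips the -install-nccl argument from an installer DaemonSet manifest loaded as a plain dict. The manifest shape and container name here are assumptions for illustration; only the -install-nccl flag itself comes from the manifest referenced above. In practice one would round-trip the real nccl-tcpxo-installer.yaml through a YAML parser before and after this step.

```python
def drop_install_nccl_flag(manifest):
    """Remove the -install-nccl argument from every container in a
    DaemonSet-shaped manifest dict, leaving other args untouched."""
    containers = manifest["spec"]["template"]["spec"]["containers"]
    for c in containers:
        c["args"] = [a for a in c.get("args", []) if a != "-install-nccl"]
    return manifest

# Minimal stand-in for the relevant fragment of the installer manifest
# (structure and flag list are hypothetical apart from -install-nccl).
manifest = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "nccl-tcpxo-installer",
         "args": ["-install-nccl", "-setup-fastrak"]},
    ]}}}
}

patched = drop_install_nccl_flag(manifest)
print(patched["spec"]["template"]["spec"]["containers"][0]["args"])
# → ['-setup-fastrak']
```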
