Prevent user from defining NCCL_TOPO_FILE when topologyFileConfigMap is set #189

Merged
merged 1 commit into from May 30, 2025
6 changes: 6 additions & 0 deletions tools/pytorchjob-generator/chart/templates/_helpers.tpl
@@ -90,6 +90,12 @@ env:
fieldPath: metadata.labels['sakkara.member.rank']
{{- end }}
{{- if .Values.topologyFileConfigMap }}
{{- range $variable := .Values.environmentVariables }}
{{- if eq $variable.name "NCCL_TOPO_FILE" }}
{{ required "If topologyFileConfigMap is defined, environment variables must not define NCCL_TOPO_FILE" nil }}
{{- end }}
{{- end }}
# Put the path to virtualTopology.xml file that was volume-mounted into the expected environment variable for CUDA
- name: NCCL_TOPO_FILE
value: /var/run/nvidia-topologyd/virtualTopology.xml
{{- end }}
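
As a minimal sketch (illustrative, not part of this diff), user-supplied values like the following would now abort rendering with the `required` message above; the config map name and the NCCL_TOPO_FILE value are taken from the chart tests further down:

topologyFileConfigMap: nvidia-topo-gdr    # chart mounts virtualTopology.xml and sets NCCL_TOPO_FILE itself
environmentVariables:
  - name: NCCL_TOPO_FILE                  # user override is now rejected by the required guard
    value: myFile

Removing the NCCL_TOPO_FILE entry, or replacing it with a harmless variable such as EXAMPLE_VAR1, lets the chart render as before, which is what the snapshot below captures.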
@@ -1362,6 +1362,160 @@ Enabling sshGitConfig injects the envvars, volumes, and volumeMounts:
- emptyDir:
medium: Memory
name: dshm
Harmless environment variables can be set when topologyFileConfigMap is provided:
1: |
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
annotations:
workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.9
labels:
kueue.x-k8s.io/queue-name: default-queue
name: my-job
namespace: my-namespace
spec:
components:
- template:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: my-job
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: Never
template:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: autopilot.ibm.com/gpuhealth
operator: NotIn
values:
- ERR
- TESTING
- EVICT
containers:
- command:
- sh
- -c
- |
echo "Environment variables set by the kubeflow training operator:"
echo ${MASTER_ADDR}:${MASTER_PORT}
echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
echo My global rank is ${RANK} / ${WORLD_SIZE}
echo "Other injected environment variables:"
echo "NVME_MOUNT_PATH: "${NVME_MOUNT_PATH}
#
# User commands
#
git clone https://github.com/dbarnett/python-helloworld
cd python-helloworld
echo executing: torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
env:
- name: NCCL_TOPO_FILE
value: /var/run/nvidia-topologyd/virtualTopology.xml
- name: EXAMPLE_VAR1
value: "42"
image: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126
imagePullPolicy: IfNotPresent
name: pytorch
resources:
limits:
cpu: 500m
memory: 1Gi
nvidia.com/gpu: 8
nvidia.com/roce_gdr: 0
requests:
cpu: 500m
memory: 1Gi
nvidia.com/gpu: 8
nvidia.com/roce_gdr: 0
volumeMounts:
- mountPath: /var/run/nvidia-topologyd
name: topology-volume
- mountPath: /dev/shm
name: dshm
imagePullSecrets: []
priorityClassName: default-priority
volumes:
- configMap:
name: nvidia-topo-gdr
name: topology-volume
- emptyDir:
medium: Memory
name: dshm
Worker:
replicas: 3
restartPolicy: Never
template:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: autopilot.ibm.com/gpuhealth
operator: NotIn
values:
- ERR
- TESTING
- EVICT
containers:
- command:
- sh
- -c
- |
echo "Environment variables set by the kubeflow training operator:"
echo ${MASTER_ADDR}:${MASTER_PORT}
echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
echo My global rank is ${RANK} / ${WORLD_SIZE}
echo "Other injected environment variables:"
echo "NVME_MOUNT_PATH: "${NVME_MOUNT_PATH}
#
# User commands
#
git clone https://github.com/dbarnett/python-helloworld
cd python-helloworld
echo executing: torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
torchrun --nnodes=${WORLD_SIZE} --node_rank=${RANK} --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" helloworld.py
env:
- name: NCCL_TOPO_FILE
value: /var/run/nvidia-topologyd/virtualTopology.xml
- name: EXAMPLE_VAR1
value: "42"
image: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126
imagePullPolicy: IfNotPresent
name: pytorch
resources:
limits:
cpu: 500m
memory: 1Gi
nvidia.com/gpu: 8
nvidia.com/roce_gdr: 0
requests:
cpu: 500m
memory: 1Gi
nvidia.com/gpu: 8
nvidia.com/roce_gdr: 0
volumeMounts:
- mountPath: /var/run/nvidia-topologyd
name: topology-volume
- mountPath: /dev/shm
name: dshm
imagePullSecrets: []
priorityClassName: default-priority
volumes:
- configMap:
name: nvidia-topo-gdr
name: topology-volume
- emptyDir:
medium: Memory
name: dshm
scheduler can be set:
1: |
apiVersion: workload.codeflare.dev/v1beta2
19 changes: 19 additions & 0 deletions tools/pytorchjob-generator/chart/tests/helloworld_test.yaml
@@ -270,3 +270,22 @@ tests:
asserts:
- matchSnapshot:
path: spec.components[0].template

- it: Harmless environment variables can be set when topologyFileConfigMap is provided
set:
topologyFileConfigMap: nvidia-topo-gdr
environmentVariables:
- name: EXAMPLE_VAR1
value: 42
asserts:
- matchSnapshot:
path: spec.components[0].template

- it: NCCL_TOPO_FILE environment variables cannot be set when topologyFileConfigMap is provided
set:
topologyFileConfigMap: nvidia-topo-gdr
environmentVariables:
- name: NCCL_TOPO_FILE
value: myFile
asserts:
- failedTemplate: {}
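
Both cases can be exercised locally with the helm-unittest plugin; the invocation below assumes the plugin is installed and follows its standard form, and is illustrative rather than part of this change:

helm unittest tools/pytorchjob-generator/chart

The first test snapshots the rendered AppWrapper (the large document added above), while the second uses `failedTemplate` to assert that rendering aborts once NCCL_TOPO_FILE is supplied alongside topologyFileConfigMap.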