Hi team, I have an example based on the latest NVIDIA image nvcr.io/nvidia/tensorflow:24.07-tf2-py3, but I run the MPI job across different nodes. However, it complains that the launcher could not identify the worker. Is it supported to have the launcher and workers running on separate nodes?
```yaml
kind: MPIJob
metadata:
  name: xxx
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: xxx
            # env:
            # - name: TF_USE_LEGACY_KERAS
            #   value: "1"
            # resources:
            #   limits:
            #     nvidia.com/gpu: 1 # Request 1 GPU
            #   requests:
            #     nvidia.com/gpu: 1 # Optionally set requests equal to limits
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /nvidia-examples/movielens-1m-keras-with-horovod.py
            - --mode=train
            - --model_dir="./model_dir"
            - --export_dir="./export_dir"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: xxx
            name: mpi-worker
            # env:
            # - name: TF_USE_LEGACY_KERAS
            #   value: "1"
            resources:
              limits:
                nvidia.com/gpu: 1 # Request 1 GPU
              requests:
                nvidia.com/gpu: 1 # Optionally set requests equal to limits
```
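Note that the spec above omits the apiVersion line; a minimal header sketch, assuming the MPI Operator v2 API (kubeflow.org/v2beta1; older operators use kubeflow.org/v1), would look like this:

```yaml
# Assumed v2 API group/version; adjust to match the installed mpi-operator.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: xxx
```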
Also, I am curious where the code pointer is that starts the worker. Thanks!