Issue connecting to nodes that are not within the same cluster

Hi team, I have an example based on the latest NVIDIA image, nvcr.io/nvidia/tensorflow:24.07-tf2-py3, but I run the MPI job across different nodes. However, the launcher complains that it cannot identify the worker. Is it supported to have the launcher and workers running on separate nodes?
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: xxx
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: xxx
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            # resources:
            #   limits:
            #     nvidia.com/gpu: 1  # Request 1 GPU
            #   requests:
            #     nvidia.com/gpu: 1  # Optionally set requests equal to limits
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /nvidia-examples/movielens-1m-keras-with-horovod.py
            - --mode=train
            - --model_dir="./model_dir" 
            - --export_dir="./export_dir"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: xxx
            name: mpi-worker
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            resources:
              limits:
                nvidia.com/gpu: 1  # Request 1 GPU
              requests:
                nvidia.com/gpu: 1  # Optionally set requests equal to limits

```

I am also curious where the code that starts the worker lives. Could you point me to it? Thanks!

Issue connecting to nodes that are not within the same cluster #658