Issue connecting to nodes that are not within the same cluster #658

Open
@yxusnapchat

Description

Hi team, I have an example based on the latest NVIDIA image, nvcr.io/nvidia/tensorflow:24.07-tf2-py3, and I am running the MPI job with the launcher and the workers on different nodes. However, the launcher complains that it cannot identify the worker. Is it supported to have the launcher and the workers running on separate nodes?

kind: MPIJob
metadata:
  name: xxx
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: xxx
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            # resources:
            #   limits:
            #     nvidia.com/gpu: 1  # Request 1 GPU
            #   requests:
            #     nvidia.com/gpu: 1  # Optionally set requests equal to limits
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /nvidia-examples/movielens-1m-keras-with-horovod.py
            - --mode=train
            - --model_dir="./model_dir" 
            - --export_dir="./export_dir"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: xxx
            name: mpi-worker
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            resources:
              limits:
                nvidia.com/gpu: 1  # Request 1 GPU
              requests:
                nvidia.com/gpu: 1  # Optionally set requests equal to limits

I am also curious where the code that starts the worker lives. Could you point me to it? Thanks!
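One way to narrow down a "launcher cannot identify the worker" error is to check, from inside the launcher pod, whether every worker hostname that mpirun will use actually resolves. The following is only a minimal sketch under stated assumptions: it assumes the launcher has an Open MPI style hostfile (lines of the form `hostname slots=N`), and the default path `/etc/mpi/hostfile` is an assumption that may not match this setup. The helper names `parse_hostfile` and `unresolved_hosts` are hypothetical and not part of the MPI Operator.

```python
import socket
import sys


def parse_hostfile(text):
    """Return (hostname, slots) pairs from Open MPI style hostfile text.

    Blank lines and '#' comments are skipped. When a line has no
    'slots=N' field, slots is set to 1 here as a simplification.
    """
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        slots = 1
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        hosts.append((fields[0], slots))
    return hosts


def unresolved_hosts(hosts, resolve=socket.gethostbyname):
    """Return the hostnames that fail name resolution.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    bad = []
    for name, _slots in hosts:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Assumed default location; pass the real path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/mpi/hostfile"
    with open(path) as fh:
        hosts = parse_hostfile(fh.read())
    missing = unresolved_hosts(hosts)
    for name, slots in hosts:
        status = "UNRESOLVED" if name in missing else "ok"
        print(f"{name} slots={slots} {status}")
    sys.exit(1 if missing else 0)
```

If any host is reported as unresolved, the problem is likely name resolution between pods rather than the mpirun arguments. If all hosts resolve, the next thing to check would be whether the launcher can actually open a connection to each worker.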
