Insufficient Slots for MPIJob with 2 Worker Pods and 2 Gaudi Cards Each #680

Open

Description

@gera-aldama

Based on the Multi-Gaudi Workloads example, I am trying to run an MPIJob with the following configuration:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpijob
spec:
  slotsPerWorker: 2
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;

                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  echo "HOSTSFILE=${HOSTSFILE}";
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  echo "MASTER_ADDR=${MASTER_ADDR}";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  echo "NUM_NODES=${NUM_NODES}";
                  CARDS_PER_NODE=2;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
                  echo "N_CARDS=${N_CARDS}";

                  SETUP_CMD="git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git; \
                             pip install -r optimum-habana/examples/language-modeling/requirements.txt; \
                             pip install --no-cache-dir optimum-habana==1.15.0; \
                             pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0";

                  eval $SETUP_CMD;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install -r optimum-habana/examples/language-modeling/requirements.txt;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install --no-cache-dir optimum-habana==1.15.0;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0;

                  MODEL_PATH=/optimum-habana/examples/language-modeling;
                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    -mca btl_tcp_if_include eth0 \
                    -mca oob_tcp_if_include eth0 \
                    -mca plm_rsh_no_tree_spawn 1 \
                    python $MODEL_PATH/run_lora_clm.py \
                    --model_name_or_path huggyllama/llama-7b \
                    --dataset_name tatsu-lab/alpaca \
                    --bf16 \
                    --output_dir /tmp/pvc-mount \
                    --num_train_epochs 1 \
                    --per_device_train_batch_size 12 \
                    --evaluation_strategy no \
                    --save_strategy no \
                    --learning_rate 1e-4 \
                    --warmup_ratio 0.03 \
                    --lr_scheduler_type constant \
                    --max_grad_norm 0.3 \
                    --logging_steps 1 \
                    --do_train \
                    --do_eval \
                    --use_habana \
                    --use_lazy_mode \
                    --throughput_warmup_steps 3 \
                    --lora_rank 8 \
                    --lora_alpha 16 \
                    --lora_dropout 0.05 \
                    --lora_target_modules q_proj v_proj \
                    --dataset_concatenation \
                    --max_seq_length 512 \
                    --low_cpu_mem_usage True \
                    --validation_split_percentage 4 \
                    --adam_epsilon 1e-08;
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
              resources:
                limits:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage

When I run this configuration, I encounter the following error:

There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.
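Since the launcher computes -np from the hostfile (N_CARDS = NUM_NODES * CARDS_PER_NODE = 2 * 2 = 4), a useful check is whether the hostfile the MPI Operator hands to mpirun actually advertises 4 slots. A minimal sketch, assuming the launcher pod name contains "mpijob-launcher" and kubectl has access to the namespace:

    # Find the launcher pod and print the hostfile it was given.
    LAUNCHER=$(kubectl get pods -o name | grep mpijob-launcher)
    kubectl exec "$LAUNCHER" -- bash -c 'cat "$OMPI_MCA_orte_default_hostfile"'

    # With slotsPerWorker: 2 and 2 worker replicas, each worker line should
    # end in "slots=2", i.e. 4 slots total -- exactly what -np $N_CARDS asks for.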

Observations:

  1. The example works fine when using either 1 worker pod with 2 Gaudi cards, or 2 worker pods with 1 Gaudi card each (the 2-worker variant is sketched after this list).
  2. Using the --oversubscribe flag results in the following error:
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
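For comparison, here is how the working 2-workers-with-1-card-each run from observation 1 maps onto the spec above; the field values are my reading of that observation rather than a manifest copied from the cluster, and everything not shown is unchanged:

    # Only the fields that differ for the working 2-workers x 1-card-each run.
    spec:
      slotsPerWorker: 1                  # 1 slot per worker -> 2 slots total
      mpiReplicaSpecs:
        Worker:
          replicas: 2
          template:
            spec:
              containers:
                - name: mpijob-container
                  resources:
                    limits:
                      habana.ai/gaudi: 1
                    requests:
                      habana.ai/gaudi: 1
    # In the launcher args, CARDS_PER_NODE=1, so N_CARDS and -np become 2,
    # matching the 2 available slots.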
