Insufficient Slots for MPIJob with 2 Worker Pods and 2 Gaudi Cards Each #680

Open

Description

@gera-aldama

Based on the Multi-Gaudi Workloads example, I am trying to run an MPIJob with the following configuration:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpijob
spec:
  slotsPerWorker: 2
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;

                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  echo "HOSTSFILE=${HOSTSFILE}";
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  echo "MASTER_ADDR=${MASTER_ADDR}";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  echo "NUM_NODES=${NUM_NODES}";
                  CARDS_PER_NODE=2;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
                  echo "N_CARDS=${N_CARDS}";

                  SETUP_CMD="git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git; \
                             pip install -r optimum-habana/examples/language-modeling/requirements.txt; \
                             pip install --no-cache-dir optimum-habana==1.15.0; \
                             pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0";

                  eval $SETUP_CMD;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install -r optimum-habana/examples/language-modeling/requirements.txt;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install --no-cache-dir optimum-habana==1.15.0;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0;

                  MODEL_PATH=/optimum-habana/examples/language-modeling;
                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    -mca btl_tcp_if_include eth0 \
                    -mca oob_tcp_if_include eth0 \
                    -mca plm_rsh_no_tree_spawn 1 \
                    python $MODEL_PATH/run_lora_clm.py \
                    --model_name_or_path huggyllama/llama-7b \
                    --dataset_name tatsu-lab/alpaca \
                    --bf16 \
                    --output_dir /tmp/pvc-mount \
                    --num_train_epochs 1 \
                    --per_device_train_batch_size 12 \
                    --evaluation_strategy no \
                    --save_strategy no \
                    --learning_rate 1e-4 \
                    --warmup_ratio 0.03 \
                    --lr_scheduler_type constant \
                    --max_grad_norm 0.3 \
                    --logging_steps 1 \
                    --do_train \
                    --do_eval \
                    --use_habana \
                    --use_lazy_mode \
                    --throughput_warmup_steps 3 \
                    --lora_rank 8 \
                    --lora_alpha 16 \
                    --lora_dropout 0.05 \
                    --lora_target_modules q_proj v_proj \
                    --dataset_concatenation \
                    --max_seq_length 512 \
                    --low_cpu_mem_usage True \
                    --validation_split_percentage 4 \
                    --adam_epsilon 1e-08;
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
              resources:
                limits:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage

When I run this configuration, I encounter the following error:

There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.
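Since the launcher computes -np from the hostfile (N_CARDS = NUM_NODES * CARDS_PER_NODE = 2 * 2 = 4), a useful check is whether the hostfile the MPI Operator hands to mpirun actually advertises 4 slots. A minimal sketch, assuming the launcher pod name contains "mpijob-launcher" and kubectl has access to the namespace:

    # Find the launcher pod and print the hostfile it was given.
    LAUNCHER=$(kubectl get pods -o name | grep mpijob-launcher)
    kubectl exec "$LAUNCHER" -- bash -c 'cat "$OMPI_MCA_orte_default_hostfile"'

    # With slotsPerWorker: 2 and 2 worker replicas, each worker line should
    # end in "slots=2", i.e. 4 slots total -- exactly what -np $N_CARDS asks for.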

Observations:

  1. The example works fine when using either 1 worker pod with 2 Gaudi cards, or 2 worker pods with 1 Gaudi card each (the 2-worker variant is sketched after this list).
  2. Using the --oversubscribe flag results in the following error:
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
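For comparison, here is how the working 2-workers-with-1-card-each run from observation 1 maps onto the spec above; the field values are my reading of that observation rather than a manifest copied from the cluster, and everything not shown is unchanged:

    # Only the fields that differ for the working 2-workers x 1-card-each run.
    spec:
      slotsPerWorker: 1                  # 1 slot per worker -> 2 slots total
      mpiReplicaSpecs:
        Worker:
          replicas: 2
          template:
            spec:
              containers:
                - name: mpijob-container
                  resources:
                    limits:
                      habana.ai/gaudi: 1
                    requests:
                      habana.ai/gaudi: 1
    # In the launcher args, CARDS_PER_NODE=1, so N_CARDS and -np become 2,
    # matching the 2 available slots.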
