Based on the Multi-Gaudi Workloads example, I am trying to run an MPIJob with the following configuration:
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpijob
spec:
  slotsPerWorker: 2
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  echo "HOSTSFILE=${HOSTSFILE}";
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  echo "MASTER_ADDR=${MASTER_ADDR}";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  echo "NUM_NODES=${NUM_NODES}";
                  CARDS_PER_NODE=2;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
                  echo "N_CARDS=${N_CARDS}";
                  SETUP_CMD="git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git; \
                    pip install -r optimum-habana/examples/language-modeling/requirements.txt; \
                    pip install --no-cache-dir optimum-habana==1.15.0; \
                    pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0";
                  eval $SETUP_CMD;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    pip install -r optimum-habana/examples/language-modeling/requirements.txt;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    pip install --no-cache-dir optimum-habana==1.15.0;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0;
                  MODEL_PATH=/optimum-habana/examples/language-modeling;
                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    -mca btl_tcp_if_include eth0 \
                    -mca oob_tcp_if_include eth0 \
                    -mca plm_rsh_no_tree_spawn 1 \
                    python $MODEL_PATH/run_lora_clm.py \
                      --model_name_or_path huggyllama/llama-7b \
                      --dataset_name tatsu-lab/alpaca \
                      --bf16 \
                      --output_dir /tmp/pvc-mount \
                      --num_train_epochs 1 \
                      --per_device_train_batch_size 12 \
                      --evaluation_strategy no \
                      --save_strategy no \
                      --learning_rate 1e-4 \
                      --warmup_ratio 0.03 \
                      --lr_scheduler_type constant \
                      --max_grad_norm 0.3 \
                      --logging_steps 1 \
                      --do_train \
                      --do_eval \
                      --use_habana \
                      --use_lazy_mode \
                      --throughput_warmup_steps 3 \
                      --lora_rank 8 \
                      --lora_alpha 16 \
                      --lora_dropout 0.05 \
                      --lora_target_modules q_proj v_proj \
                      --dataset_concatenation \
                      --max_seq_length 512 \
                      --low_cpu_mem_usage True \
                      --validation_split_percentage 4 \
                      --adam_epsilon 1e-08;
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
              resources:
                limits:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage
```
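For reference, the launcher script's host/slot arithmetic can be reproduced locally against a hostfile shaped like the one the MPI Operator generates for 2 workers with `slotsPerWorker: 2`. This is only a sketch: the worker hostnames below are hypothetical stand-ins for the generated hostfile entries.

```shell
# Simulate the launcher's parsing of $OMPI_MCA_orte_default_hostfile.
# Hostnames are fabricated; a real hostfile has one "host slots=N" per worker.
HOSTSFILE=$(mktemp)
printf 'mpijob-worker-0 slots=2\nmpijob-worker-1 slots=2\n' > "$HOSTSFILE"

# Same extraction steps as the launcher args above (sed expression quoted here).
MASTER_ADDR="$(head -n 1 "$HOSTSFILE" | sed -n 's/[[:space:]]slots.*//p')"
NUM_NODES=$(wc -l < "$HOSTSFILE")
CARDS_PER_NODE=2
N_CARDS=$((NUM_NODES * CARDS_PER_NODE))
echo "MASTER_ADDR=${MASTER_ADDR} NUM_NODES=${NUM_NODES} N_CARDS=${N_CARDS}"
# With 2 workers x 2 cards this yields N_CARDS=4, the -np value mpirun rejects.
```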
When I run this configuration, I encounter the following error:
```
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.
```
Observations:
- The example works fine when using either 1 worker pod with 2 Gaudi cards, or 2 worker pods with 1 Gaudi card each.
- Using the `--oversubscribe` flag results in the following error:

  ```
  RuntimeError: synStatus=8 [Device not found] Device acquire failed.
  ```
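When this slot error appears, one useful check is to total up the slots the hostfile actually declares and compare that against the `-np` value passed to `mpirun` (`N_CARDS=4` here). Below is a minimal sketch of that check with a fabricated hostfile; inside the launcher pod you would read `"$OMPI_MCA_orte_default_hostfile"` instead.

```shell
# Sketch: sum the declared slots in a hostfile. mpirun -np must not exceed
# this total (without --oversubscribe). Hostnames below are hypothetical.
HOSTSFILE=$(mktemp)
printf 'mpijob-worker-0 slots=2\nmpijob-worker-1 slots=2\n' > "$HOSTSFILE"
TOTAL_SLOTS=$(awk -F'slots=' '{s += $2} END {print s}' "$HOSTSFILE")
echo "declared slots: ${TOTAL_SLOTS}"
```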