
Failure in FSDP Benchmark Experiment using QLoRA with Custom Fused Modules #3

Closed
@achew010

Description

Problem

Distributed experiments in the benchmarks fail when using BNB's nf4 QLoRA with unsloth fused module optimizations.

Cause

Non-distributed (single-device) runs of BNB's nf4 QLoRA do not throw any errors; the failure appears only under FSDP. The suspected cause is an incompatibility between FSDP, the BNB kernels, and Unsloth's fused matmul (a simplified sketch of that matmul path follows the stacktrace below).

Stacktrace from test repo:

 File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/fast_lora.py", line 227, in forward
    Q = matmul_lora(X, QW, QW_quant, QA, QB, QS)
  File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/utils.py", line 235, in matmul_lora
    out = torch.matmul(X, W, out = out)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  0%|                                                                                                                  | 0/100 [00:01<?, ?it/s]
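For context, `matmul_lora` dequantizes the frozen 4-bit base weight and fuses the base projection with the low-rank LoRA update. The sketch below is paraphrased from the kernel named in the stacktrace and simplified (no 3-D reshaping or buffer reuse); treat it as an approximation, not the exact implementation.

```python
import torch
import bitsandbytes.functional as F  # bnb blockwise (de)quantization API

def matmul_lora(X, QW, QW_quant, QA, QB, QS, out=None):
    """Simplified sketch: QW is the packed nf4 base weight, QW_quant its
    bnb quant_state, QA/QB the LoRA adapters, QS the LoRA scaling."""
    # Dequantize the frozen base weight back to fp16; this is the call
    # that reaches dequantizeBlockwise on the CUDA side.
    W = F.dequantize_4bit(QW, QW_quant).t()
    # Base projection -- the torch.matmul line in the stacktrace. The
    # CUBLAS_STATUS_NOT_INITIALIZED it raises is a downstream symptom of
    # the earlier illegal access, not the root cause.
    out = torch.matmul(X, W, out=out)
    # Low-rank correction: X @ A^T @ B^T, scaled by QS.
    if QA is not None:
        out += (X @ QA.t().to(X.dtype)) @ (QS * QB.t().to(X.dtype))
    return out
```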

Setting the debug environment variable `CUDA_LAUNCH_BLOCKING=1` surfaces the underlying error: `an illegal memory access was encountered at line 90 in file /src/csrc/ops.cu`. This traces to the `dequantizeBlockwise` CUDA function in bitsandbytes.
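A plausible mechanism, offered here as an assumption rather than a confirmed diagnosis: FSDP flattens and shards parameters across ranks, so each rank may hold only a slice of the packed 4-bit buffer while the bnb `quant_state` still describes the full tensor, and `dequantizeBlockwise` then reads past the end of the shard. The sketch below (requires a CUDA device and bitsandbytes) emulates that size mismatch:

```python
import torch
import bitsandbytes.functional as F

# Quantize a full fp16 weight to nf4. The returned quant_state records
# the ORIGINAL shape and the per-block absmax statistics.
W = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
qW, quant_state = F.quantize_4bit(W, quant_type="nf4")

# Emulate an FSDP rank holding only half of the flattened packed buffer.
shard = qW.flatten()[: qW.numel() // 2]

# Dequantizing the shard against the full tensor's quant_state asks the
# kernel to reconstruct 1024x1024 values from half the packed data --
# an out-of-bounds read of the same kind that surfaces as "illegal
# memory access" in dequantizeBlockwise (ops.cu:90).
F.dequantize_4bit(shard, quant_state)  # expected to fault or corrupt memory
```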

Reproduce

```bash
accelerate launch \
    --config_file ./accelerate.yaml \
    --num_processes=2 \
    --main_process_port=29500 -m tuning.sft_trainer \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v0.3 \
    --acceleration_framework_config_file ./sample-configurations/accelerated-peft-bnb-nf4-unsloth-sample-configuration.yaml \
    --packing True \
    --max_seq_len 2048 \
    --fp16 True \
    --learning_rate 2e-4 \
    --torch_dtype float16 \
    --peft_method lora \
    --r 16 \
    --lora_alpha 16 \
    --lora_dropout 0.0 \
    --target_modules q_proj k_proj v_proj o_proj \
    --use_flash_attn True \
    --response_template "\n### Response:" \
    --dataset_text_field output \
    --include_tokens_per_second True \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --weight_decay 0.01 \
    --warmup_steps 10 \
    --adam_epsilon 1e-4 \
    --lr_scheduler_type linear \
    --logging_strategy steps \
    --logging_steps 10 \
    --max_steps 100 \
    --training_data_path ./data/benchmark_data.json \
    --per_device_train_batch_size 2 \
    --output_dir results/exp_5/hf
```
