Failure in FSDP Benchmark Experiment using QLoRA with Custom Fused Modules #3
Closed
Description
Problem
Distributed experiments in the benchmarks fail when using BNB's nf4
QLoRA with Unsloth's fused-module optimizations.
Cause
Distributed experiments for BNB's nf4 QLoRA on their own don't throw any errors;
the failure appears only with the fused modules enabled. Suspected incompatibility between FSDP, the BNB kernels, and Unsloth's matmul.
Stacktrace from test repo:
File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/fast_lora.py", line 227, in forward
Q = matmul_lora(X, QW, QW_quant, QA, QB, QS)
File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/utils.py", line 235, in matmul_lora
out = torch.matmul(X, W, out = out)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Setting the debug environment variable CUDA_LAUNCH_BLOCKING=1
produces the error "an illegal memory access was encountered" at line 90 in file /src/csrc/ops.cu.
This is traced to the dequantizeBlockwise CUDA function.
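For context, the failing call computes a LoRA-augmented matmul over a blockwise-quantized base weight: the base weight must first be dequantized (the dequantizeBlockwise step above) before the dense matmul, and under FSDP the quantized storage is flat-sharded across ranks, which is where the incompatibility is suspected. A minimal pure-Python sketch of the computation (an illustration under assumptions; names, block size, and the toy 4-bit scheme are ours, not the actual unsloth/bitsandbytes code):

```python
BLOCK = 4  # toy block size; bitsandbytes uses larger blocks (e.g. 64)

def quantize_blockwise(w_flat):
    """Per-block absmax quantization of a flat weight to small signed ints."""
    codes, absmaxes = [], []
    for i in range(0, len(w_flat), BLOCK):
        block = w_flat[i:i + BLOCK]
        amax = max(abs(v) for v in block) or 1.0  # per-block scale
        absmaxes.append(amax)
        codes.extend(round(v / amax * 7) for v in block)  # 4-bit range [-7, 7]
    return codes, absmaxes

def dequantize_blockwise(codes, absmaxes):
    """Inverse of the above: rescale each code by its block's absmax."""
    return [c / 7 * absmaxes[i // BLOCK] for i, c in enumerate(codes)]

def matmul(X, W):
    """Plain dense matmul on nested lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

def matmul_lora(X, W_codes, W_absmax, shape, A, B, s):
    """out = X @ dequant(W) + s * (X @ A) @ B  -- the LoRA forward pass."""
    flat = dequantize_blockwise(W_codes, W_absmax)  # crash site in the real kernel
    rows, cols = shape
    W = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    base = matmul(X, W)                 # base matmul on dequantized weight
    lora = matmul(matmul(X, A), B)      # low-rank LoRA correction
    return [[b + s * l for b, l in zip(br, lr)] for br, lr in zip(base, lora)]
```

If a rank only holds a shard of W_codes/W_absmax (as FSDP's flat parameters do), the dequantize step reads past valid storage, consistent with the illegal-memory-access trace.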
Reproduce
accelerate launch \
--config_file ./accelerate.yaml \
--num_processes=2 \
--main_process_port=29500 -m tuning.sft_trainer \
--model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v0.3 \
--acceleration_framework_config_file ./sample-configurations/accelerated-peft-bnb-nf4-unsloth-sample-configuration.yaml \
--packing True \
--max_seq_len 2048 \
--fp16 True \
--learning_rate 2e-4 \
--torch_dtype float16 \
--peft_method lora \
--r 16 \
--lora_alpha 16 \
--lora_dropout 0.0 \
--target_modules q_proj k_proj v_proj o_proj \
--use_flash_attn True \
--response_template "\n### Response:" \
--dataset_text_field output \
--include_tokens_per_second True \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--evaluation_strategy no \
--save_strategy no \
--weight_decay 0.01 \
--warmup_steps 10 \
--adam_epsilon 1e-4 \
--lr_scheduler_type linear \
--logging_strategy steps \
--logging_steps 10 \
--max_steps 100 \
--training_data_path ./data/benchmark_data.json \
--per_device_train_batch_size 2 \
--output_dir results/exp_5/hf
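The launch command references ./accelerate.yaml, which is not shown in the issue. A minimal FSDP config consistent with the flags above might look like the following (an assumption sketched from Accelerate's standard config keys; the repo's actual file may differ):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: fp16
num_machines: 1
num_processes: 2
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
```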