
Failure in FSDP Benchmark Experiment using QLoRA with Custom Fused Modules #3

Closed
@achew010

Description

Problem

Distributed experiments in the benchmarks fail when using BNB's nf4 QLoRA with unsloth fused module optimizations.

Cause

Non-distributed (single-device) runs of BNB's nf4 QLoRA do not throw any errors; the failure appears only under FSDP. The suspected cause is an incompatibility between FSDP, the BNB kernels, and Unsloth's fused matmul (a simplified sketch of that matmul path follows the stacktrace below).

Stacktrace from test repo:

 File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/fast_lora.py", line 227, in forward
    Q = matmul_lora(X, QW, QW_quant, QA, QB, QS)
  File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/utils.py", line 235, in matmul_lora
    out = torch.matmul(X, W, out = out)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  0%|                                                                                                                  | 0/100 [00:01<?, ?it/s]
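For context, `matmul_lora` dequantizes the frozen 4-bit base weight and fuses the base projection with the low-rank LoRA update. The sketch below is paraphrased from the kernel named in the stacktrace and simplified (no 3-D reshaping or buffer reuse); treat it as an approximation, not the exact implementation.

```python
import torch
import bitsandbytes.functional as F  # bnb blockwise (de)quantization API

def matmul_lora(X, QW, QW_quant, QA, QB, QS, out=None):
    """Simplified sketch: QW is the packed nf4 base weight, QW_quant its
    bnb quant_state, QA/QB the LoRA adapters, QS the LoRA scaling."""
    # Dequantize the frozen base weight back to fp16; this is the call
    # that reaches dequantizeBlockwise on the CUDA side.
    W = F.dequantize_4bit(QW, QW_quant).t()
    # Base projection -- the torch.matmul line in the stacktrace. The
    # CUBLAS_STATUS_NOT_INITIALIZED it raises is a downstream symptom of
    # the earlier illegal access, not the root cause.
    out = torch.matmul(X, W, out=out)
    # Low-rank correction: X @ A^T @ B^T, scaled by QS.
    if QA is not None:
        out += (X @ QA.t().to(X.dtype)) @ (QS * QB.t().to(X.dtype))
    return out
```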

Setting the debug environment variable `CUDA_LAUNCH_BLOCKING=1` surfaces the underlying error: `an illegal memory access was encountered at line 90 in file /src/csrc/ops.cu`. This traces to the `dequantizeBlockwise` CUDA function in bitsandbytes.
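A plausible mechanism, offered here as an assumption rather than a confirmed diagnosis: FSDP flattens and shards parameters across ranks, so each rank may hold only a slice of the packed 4-bit buffer while the bnb `quant_state` still describes the full tensor, and `dequantizeBlockwise` then reads past the end of the shard. The sketch below (requires a CUDA device and bitsandbytes) emulates that size mismatch:

```python
import torch
import bitsandbytes.functional as F

# Quantize a full fp16 weight to nf4. The returned quant_state records
# the ORIGINAL shape and the per-block absmax statistics.
W = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
qW, quant_state = F.quantize_4bit(W, quant_type="nf4")

# Emulate an FSDP rank holding only half of the flattened packed buffer.
shard = qW.flatten()[: qW.numel() // 2]

# Dequantizing the shard against the full tensor's quant_state asks the
# kernel to reconstruct 1024x1024 values from half the packed data --
# an out-of-bounds read of the same kind that surfaces as "illegal
# memory access" in dequantizeBlockwise (ops.cu:90).
F.dequantize_4bit(shard, quant_state)  # expected to fault or corrupt memory
```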

Reproduce

```bash
accelerate launch \
    --config_file ./accelerate.yaml \
    --num_processes=2 \
    --main_process_port=29500 -m tuning.sft_trainer \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v0.3 \
    --acceleration_framework_config_file ./sample-configurations/accelerated-peft-bnb-nf4-unsloth-sample-configuration.yaml \
    --packing True \
    --max_seq_len 2048 \
    --fp16 True \
    --learning_rate 2e-4 \
    --torch_dtype float16 \
    --peft_method lora \
    --r 16 \
    --lora_alpha 16 \
    --lora_dropout 0.0 \
    --target_modules q_proj k_proj v_proj o_proj \
    --use_flash_attn True \
    --response_template "\n### Response:" \
    --dataset_text_field output \
    --include_tokens_per_second True \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --weight_decay 0.01 \
    --warmup_steps 10 \
    --adam_epsilon 1e-4 \
    --lr_scheduler_type linear \
    --logging_strategy steps \
    --logging_steps 10 \
    --max_steps 100 \
    --training_data_path ./data/benchmark_data.json \
    --per_device_train_batch_size 2 \
    --output_dir results/exp_5/hf
```
