I am using an NF4-quantized base model and fine-tuning it with LoRA.
When I call `prepare_model_for_kbit_training` to wrap the model, memory consumption is significantly higher than with the bf16 counterpart, especially at the point where backpropagation is run.
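For reference, here is a minimal sketch of the kind of setup I'm describing (the model name, target modules, and LoRA hyperparameters below are illustrative placeholders, not my exact configuration):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model

# Quantize the base model weights to NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Wrap the quantized model for k-bit training -- this is the call after
# which memory usage is much higher than with the bf16 baseline.
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters (placeholder hyperparameters).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# The memory spike is most noticeable when backpropagation runs,
# i.e. at loss.backward() in the training loop.
```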