Are there any plans to support this? From reading some of the past issues, it seems the main blocker is that CUDA uses f32 tensors whilst LoRA uses f16 tensors. Is that still the case? I can take a shot at implementing this if someone can give me a rough rundown of the hurdles.