## Environment info

- `adapters` version: 1.0.0.dev0 (latest main)
- Platform: Debian
- Python version: 3.10
- PyTorch version (GPU?): 2.3.1
- Using GPU in script?: Yes, A100
- Using distributed or parallel set-up in script?: torch DDP using accelerate
## Information
Model I am using: Llama-3-8B
Language I am using the model on: English
Adapter setup I am using: LoRA
The LoRA implementation for Llama only works with the "eager" attention implementation: with "sdpa", training does not converge, and with "flash_attention_2", it raises an exception.
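A minimal sketch of the setup that triggers this, assuming the standard `adapters` API (`adapters.init`, `add_adapter`, `train_adapter`); the checkpoint name and LoRA hyperparameters below are illustrative, and only the `attn_implementation` argument is the relevant variable:

```python
import torch
import adapters
from adapters import LoRAConfig
from transformers import AutoModelForCausalLM

# Illustrative setup; checkpoint and LoRA hyperparameters are placeholders.
# Only the attn_implementation value matters for reproducing the behavior:
# "eager" trains fine, "sdpa" does not converge, "flash_attention_2" raises.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
adapters.init(model)  # enable adapter support on the vanilla HF model
model.add_adapter("lora", config=LoRAConfig(r=8, alpha=16))
model.train_adapter("lora")  # freeze base weights, train only the LoRA params
```

With `attn_implementation="eager"` the same script trains normally under DDP via accelerate; switching only that argument reproduces the two failure modes described above.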