## Environment info

- `adapters` version: 1.0.0.dev0 (latest main)
- Platform: Debian
- Python version: 3.10
- PyTorch version (GPU?): 2.3.1
- Using GPU in script?: Yes, A100
- Using distributed or parallel set-up in script?: torch DDP using accelerate
## Information
Model I am using: Llama-3-8B
Language I am using the model on: English
Adapter setup I am using: LoRA
The LoRA implementation for Llama only works with the "eager" attention implementation: with "sdpa", training does not converge, and with "flash_attention_2", it raises an exception.
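A minimal sketch of the setup that triggers this, assuming the standard `adapters` API (`adapters.init`, `add_adapter`, `train_adapter`); the checkpoint name and LoRA hyperparameters below are illustrative, and only the `attn_implementation` argument is the relevant variable:

```python
import torch
import adapters
from adapters import LoRAConfig
from transformers import AutoModelForCausalLM

# Illustrative setup; checkpoint and LoRA hyperparameters are placeholders.
# Only the attn_implementation value matters for reproducing the behavior:
# "eager" trains fine, "sdpa" does not converge, "flash_attention_2" raises.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
adapters.init(model)  # enable adapter support on the vanilla HF model
model.add_adapter("lora", config=LoRAConfig(r=8, alpha=16))
model.train_adapter("lora")  # freeze base weights, train only the LoRA params
```

With `attn_implementation="eager"` the same script trains normally under DDP via accelerate; switching only that argument reproduces the two failure modes described above.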