Llama LoRA training not working with sdpa and flash attention #721

@calpt

Description

Environment info

  • adapters version: 1.0.0.dev0 (latest main)
  • Platform: Debian
  • Python version: 3.10
  • PyTorch version (GPU?): 2.3.1
  • Using GPU in script?: Yes, A100
  • Using distributed or parallel set-up in script?: torch DDP using accelerate

Information

Model I am using: Llama-3-8B

Language I am using the model on: English

Adapter setup I am using: LoRA

The LoRA implementation for Llama only works with the "eager" attention implementation: with "sdpa", training does not converge, and with "flash_attention_2", an exception is thrown.
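
A minimal sketch of the setup that triggers this, assuming the standard `adapters` + Transformers APIs (the checkpoint name and LoRA hyperparameters are illustrative, not taken from the original report):

```python
import torch
import adapters
from adapters import LoRAConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Switch between "eager", "sdpa", and "flash_attention_2" here.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)

# Wrap the model with the adapters library and attach a LoRA adapter.
adapters.init(model)
model.add_adapter("lora_adapter", config=LoRAConfig(r=8, alpha=16))
model.train_adapter("lora_adapter")  # freeze base weights, activate the adapter
```

Training this model (e.g. with the Hugging Face Trainer under accelerate/DDP) converges with "eager", fails to converge with "sdpa", and raises an exception with "flash_attention_2".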
