Fully transition from flash-attn to kernels #4380

@qgallouedec

Description

The new recommended way to use flash attention is to use kernels. We should update our tests and documentation to use kernels instead of "flash_attention_2". For example:

- training_args = DPOConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"}) 
+ training_args = DPOConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "kernels-community/flash-attn2"}) 
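
For context, here is a rough sketch of what a TRL training script would look like after the switch. The model name, dataset, and output directory below are placeholders for illustration, not something prescribed by this issue:

```python
# Sketch of a DPO run using the kernels-based flash attention implementation.
# Assumes `trl`, `transformers`, `datasets`, and `kernels` are installed.
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Placeholder dataset, used only for illustration.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="dpo-flash-attn-kernels",
    padding_free=True,
    # Fetch the flash-attention 2 kernel from the Hub instead of
    # requiring a locally built flash-attn package.
    model_init_kwargs={"attn_implementation": "kernels-community/flash-attn2"},
)

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```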
