Flash-attn performance: remove cuda sync during inference #33570

Cyrilvallez · 2024-09-18T16:32:23Z

What does this PR do?

#31629 & #32241 introduced functionality in FA2 intended for training efficiency. However, it adds an unnecessary CUDA synchronization to every forward pass at inference time, because the elif condition always evaluates (torch.diff(position_ids, dim=-1) >= 0).all(). This PR fixes the performance issue by simply reordering the checks in the elif condition to take advantage of Python's short-circuit evaluation. At inference time, query_length is always 1 except during prefill, so the check that triggers the synchronization is skipped on every decoding step.
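To illustrate the reordering described above, here is a minimal pure-Python sketch. It does not use the library's actual code: `SyncCounter`, `is_packed`, `branch_before`, `branch_after`, and `count_checks` are hypothetical names, the exact shape of the real elif condition is an assumption, and a call counter stands in for the tensor check whose conversion to a Python bool forces the synchronization.

```python
class SyncCounter:
    """Hypothetical stand-in for the tensor check; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def is_packed(self, position_ids):
        # Plays the role of `not (torch.diff(position_ids, dim=-1) >= 0).all()`.
        # In PyTorch, using that result in an `if`/`elif` forces a host sync.
        self.calls += 1
        return not all(b >= a for a, b in zip(position_ids, position_ids[1:]))


def branch_before(counter, position_ids, query_length):
    # Assumed original order: the expensive check runs on every call.
    return position_ids is not None and counter.is_packed(position_ids) and query_length != 1


def branch_after(counter, position_ids, query_length):
    # Reordered: the cheap integer comparison comes first, so `and`
    # short-circuits before the expensive check when query_length == 1.
    return position_ids is not None and query_length != 1 and counter.is_packed(position_ids)


def count_checks(branch, decode_steps):
    # Simulate decoding: one new token per step, so query_length is 1.
    counter = SyncCounter()
    for step in range(decode_steps):
        branch(counter, [step], 1)
    return counter.calls


before_calls = count_checks(branch_before, 100)
after_calls = count_checks(branch_after, 100)
print(before_calls, after_calls)
```

Both orderings return the same boolean for any input, since `and` is commutative in its result here; only the number of times the expensive check is evaluated changes.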

The performance degradation was not very significant, but in the quick tests I ran, this PR recovers around 5-10% of inference speed.

HuggingFaceDocBuilderDev · 2024-09-18T16:56:33Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Cyrilvallez · 2024-09-20T11:02:57Z

cc @ArthurZucker, forgot to ping you

ArthurZucker

👀 nice hack, awesome that you found out about it!

Switch conditions to use short-circuit during inference

a55d5f2

ArthurZucker approved these changes Sep 20, 2024

Cyrilvallez merged commit 1f33023 into huggingface:main Oct 7, 2024
17 checks passed

NielsRogge pushed a commit to NielsRogge/transformers that referenced this pull request Oct 21, 2024

Flash-attn performance: remove cuda sync during inference (huggingfac…

18bae9f

…e#33570) Switch conditions to use short-circuit during inference

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024

Flash-attn performance: remove cuda sync during inference (huggingfac…

803e2e1

…e#33570) Switch conditions to use short-circuit during inference

BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024

Flash-attn performance: remove cuda sync during inference (huggingfac…

5ce351a

…e#33570) Switch conditions to use short-circuit during inference
