
update Bark FA2 docs #27400

Merged: 5 commits into huggingface:main on Nov 10, 2023

Conversation

@ylacombe (Contributor) commented Nov 9, 2023

What does this PR do?

Following @ArthurZucker's review in #27634, I've added a section on FA2 to the Bark readme, and mentioned Bark's FA2 support in the general FA2 section of the docs!

Note that this comment #27364 (comment) is already addressed, since `self.dropout` is already a float!
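
For context, the pattern the new docs section covers looks roughly like the sketch below: loading Bark in half precision with Flash Attention 2 enabled and generating speech. The checkpoint name and prompt are illustrative choices, and this is not the literal snippet from the merged bark.md (it assumes the `flash-attn` package and a supported CUDA GPU):

```python
import torch
from transformers import AutoProcessor, BarkModel

# Load Bark in half precision with Flash Attention 2 enabled
# (needs the flash-attn package and a supported CUDA GPU).
processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained(
    "suno/bark-small",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")

# Generate speech for a short prompt.
inputs = processor("Hello, this is a test sentence.", return_tensors="pt").to("cuda")
audio = model.generate(**inputs)
```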

Before submitting

• [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

cc @amyeroberts and @ArthurZucker

@HuggingFaceDocBuilderDev commented Nov 9, 2023

The documentation is not available anymore as the PR was closed or merged.

@amyeroberts (Collaborator) left a comment

Thanks for adding!

Overall looks good but the comparison values should be updated to make this more useful for readers.

Comment on lines 93 to 95
Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase.

To put this into perspective, you can generate 17 times more text and still be 2s faster than the unoptimized version. At batch size 8, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.
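
For context, the two configurations compared in this excerpt can be set up roughly as in the sketch below (assuming the `suno/bark-small` checkpoint, `optimum` installed for the Better Transformer path, and `flash-attn` for the FA2 path; this is not the literal snippet from the docs):

```python
import torch
from transformers import BarkModel

# Better Transformer baseline (conversion relies on the optimum package).
model_bt = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to("cuda")
model_bt = model_bt.to_bettertransformer()

# Flash Attention 2 variant (requires flash-attn and a supported GPU).
model_fa2 = BarkModel.from_pretrained(
    "suno/bark-small",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")
```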
Collaborator

This isn't really a comparison: to be useful, the specific tests, hardware, and runtime numbers should be shown. "17 times more text" is undefined - do you mean tokens?

Contributor Author

You're right, of course; I'll be more specific.

Contributor Author

It's difficult to be specific without taking too much time, since the tokenizer by default pads the input to 256 tokens, but I'll be more precise in my wording anyway!

Contributor Author

Updated! What do you think of this?

Collaborator

Better, but it still needs some more information and structure.

For example, see the graphs we have here for Mistral. At the moment the sentence gives a single data point without the context of the test being run, and so can't be used to inform any decisions.

> It's difficult to be specific without taking too much time, since the tokenizer by default pads the input to 256 tokens, but I'll be more precise in my wording anyway!

Could you explain this further? What do you mean by too much time? As in, to get the numbers? I don't understand the relation to the max length either. Couldn't you run on a dummy model with a small context length and benchmark on that? The generations don't have to be good (they can be nonsense); this is just about providing experimental values.

Contributor Author

I basically wanted to avoid making the readme more packed than it already is, but a graph will do the trick!

> Could you explain this further? What do you mean by too much time? As in, to get the numbers? I don't understand the relation to the max length either. Couldn't you run on a dummy model with a small context length and benchmark on that? The generations don't have to be good (they can be nonsense); this is just about providing experimental values.

I just meant that I've already run the benchmarks, and that counting tokens (i.e. getting the average context length) wasn't straightforward due to how the tokenizer works.
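
For anyone wanting to reproduce a rough throughput number themselves, a minimal timing sketch could look like the following. The batch size, prompt, and checkpoint here are hypothetical choices, and this is not the script used to produce the benchmark graph:

```python
import time
import torch
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained(
    "suno/bark-small",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # drop this line to time the default attention as a baseline
).to("cuda")

# A hypothetical batch of 8 identical prompts; the processor pads each one to 256 tokens by default.
prompts = ["This is a short benchmark sentence."] * 8
inputs = processor(prompts, return_tensors="pt").to("cuda")

with torch.inference_mode():
    model.generate(**inputs)          # warm-up run
    torch.cuda.synchronize()
    start = time.perf_counter()
    audio = model.generate(**inputs)  # timed run
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{audio.shape[0]} samples in {elapsed:.1f}s ({audio.shape[0] / elapsed:.2f} samples/s)")
```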

@sanchit-gandhi (Contributor) left a comment

Just a few minor suggestions: I would re-order the structure a bit, and maybe explain a bit more where the 17x number comes from.

I think for a high-level comparison these numbers are enough; the docs should serve as an indication of the performance gain we should expect, but they're not academic benchmarks, so they don't have to be fully watertight.

docs/source/en/model_doc/bark.md: 6 outdated review suggestions, all resolved
<img src="https://huggingface.co/datasets/ylacombe/benchmark-comparison/resolve/main/Bark%20Optimization%20Benchmark.png">
</div>

To put this into perspective, on an NVIDIA A100, you can get 17 times the [throughput](https://huggingface.co/blog/optimizing-bark#throughput) and still be 2 seconds faster than the unoptimized, non-batch version.
Contributor

I don't really follow where this 17x number comes from. Are we comparing FA2 batched vs. un-optimised non-batched?

Contributor Author

Exactly, I'll make it clearer.

ylacombe and others added 2 commits November 10, 2023 10:52
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@amyeroberts (Collaborator) left a comment

The graph is great 📈 - thanks for adding!

@ylacombe (Contributor Author)

Thanks! Merging!

@ylacombe ylacombe merged commit 9dd58c5 into huggingface:main Nov 10, 2023
8 checks passed
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* update Bark FA2 docs

* update benchmark section

* Update bark.md

* Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* rephrase

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>