
enable internal kv bucket in llama #720

Merged: 4 commits merged into huggingface:main on Feb 23, 2024
Conversation

xt574chen (Contributor)

What does this PR do?

To enhance throughput in scenarios with long new tokens, break down the KV cache into multiples of the bucket width. Use this to compute attention rather than using the entire KV cache.

LLaMA v2 70B (8x, max_input_tokens 128, max_new_tokens 2048, batch_size 240):
5528 tps (original performance) -> 6378 tps (with internal bucket size 128)

Add --bucket_size=128 --bucket_internal to the run command to enable the feature.
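The idea described above can be sketched as follows. This is an illustrative approximation only, not the PR's actual implementation; the function and variable names are hypothetical. With internal bucketing, attention at each decode step covers only the smallest multiple of the bucket width that holds the tokens generated so far, rather than the full pre-allocated KV cache.

```python
import math

def effective_kv_length(current_length: int, bucket_size: int) -> int:
    """Round the current sequence length up to the next bucket boundary.

    Attention is then computed over only this prefix of the KV cache,
    instead of over the full pre-allocated cache (illustrative sketch).
    """
    return math.ceil(current_length / bucket_size) * bucket_size

# Example: with max_input_tokens=128 and max_new_tokens=2048, the full
# cache holds 128 + 2048 = 2176 slots. Once 328 tokens are in the cache,
# a bucket size of 128 means attention runs over only 384 slots.
print(effective_kv_length(328, 128))  # 384
```

Early in generation the attended prefix is much shorter than the full cache, which is where the throughput gain for long max_new_tokens comes from; the bucket-multiple shapes also keep tensor sizes static-shape friendly for compiled execution.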

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

if not generation_config.bucket_internal:
    assert generation_config.bucket_size <= 0, "reuse_cache and bucketing flags set together"
else:
    assert generation_config.bucket_size >= 0, "bucket_internal and bucket_size flags set together"
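The allowed flag combinations implied by these asserts can be exercised with a minimal sketch (a hypothetical SimpleNamespace stands in for the generation config; the messages mirror the snippet above verbatim):

```python
from types import SimpleNamespace

def check_flags(cfg):
    # Mirrors the asserts in the snippet above (illustrative only).
    if not cfg.bucket_internal:
        assert cfg.bucket_size <= 0, "reuse_cache and bucketing flags set together"
    else:
        assert cfg.bucket_size >= 0, "bucket_internal and bucket_size flags set together"

# Internal bucketing with a positive bucket size: allowed.
check_flags(SimpleNamespace(bucket_internal=True, bucket_size=128))
# No bucketing at all: allowed.
check_flags(SimpleNamespace(bucket_internal=False, bucket_size=0))
# External bucketing without bucket_internal: raises AssertionError.
```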
Collaborator
Here we are in the case where generation_config.bucket_internal is True, so if this assert fails (i.e. generation_config.bucket_size < 0), it means that bucket_size is not set correctly, right? But the error message says otherwise.

Contributor (Author)

@puneeshkhanna I see you've corrected some error messages. I hope this update won't cause merge conflicts.

Contributor

@xt574chen - I will update my PR once this gets merged first

regisss (Collaborator) commented Feb 22, 2024

@xt574chen Can you run the following from the root of the repo to make the code style check pass please?

pip install -U ruff
make style

@regisss regisss added the run-test Run CI for PRs from external contributors label Feb 23, 2024
@regisss regisss merged commit e328e21 into huggingface:main Feb 23, 2024
11 of 12 checks passed
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024
@xt574chen xt574chen deleted the bucket_internal branch March 1, 2024 01:03
HolyFalafel pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Mar 11, 2024