[V1][TPU] Fix TPU kv sharing tests #19155
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Hello @sarckk, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, authored by @sarckk, aims to fix some 'soft failures' in the TPU KV sharing tests. According to the description, these failures were introduced by a previous PR (#18212) but were not causing the CI checks to fail outright (hence 'soft failures'). The changes primarily involve updating the test suite in test_tpu_model_runner.py to correctly reflect expected behavior and memory configurations when testing KV sharing on TPUs.
Highlights
- Test Fixes: The core purpose of this PR is to fix existing, but non-blocking, failures in the TPU KV sharing tests.
- Test Parameter Updates: Several test cases related to KV cache initialization and memory estimation have been updated with new parameters, including increased simulated head sizes and available memory amounts.
- Test Setup Adjustment: Tests verifying error conditions for KV sharing initialization now use a context manager (set_current_vllm_config) to ensure the vLLM configuration is properly set during test execution.
Changelog
- tests/v1/tpu/worker/test_tpu_model_runner.py
  - Added the model_runner fixture to test_init_kv_cache_with_kv_sharing_invalid_target_layer_order, test_init_kv_cache_with_kv_sharing_target_layer_not_exist, and test_init_kv_cache_with_kv_sharing_target_same_as_current.
  - Wrapped the pytest.raises blocks in the above tests with set_current_vllm_config(vllm_config) (see the sketch after this list).
  - Increased the simulated head_size from 64 to 128 in the Attention object initialization across multiple test functions (test_init_kv_cache_with_kv_sharing_invalid_target_layer_order, test_init_kv_cache_with_kv_sharing_target_layer_not_exist, test_init_kv_cache_with_kv_sharing_target_same_as_current, test_init_kv_cache_without_kv_sharing, test_init_kv_cache_with_kv_sharing_valid).
  - Increased the simulated available_memory from 20 GiB to 40 GiB in test_init_kv_cache_without_kv_sharing and test_init_kv_cache_with_kv_sharing_valid.
  - Increased the memory argument for estimate_max_model_len from 5 GiB to 10 GiB in test_init_kv_cache_without_kv_sharing and test_init_kv_cache_with_kv_sharing_valid.
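For context, the wrapping pattern the changelog refers to looks roughly like this (a minimal sketch with illustrative names and a hypothetical helper, not the exact test code from this PR):

```python
import pytest

from vllm.config import set_current_vllm_config


def test_init_kv_cache_with_invalid_kv_sharing_target(model_runner, vllm_config):
    # set_current_vllm_config makes vllm_config the active config for code that
    # reads it implicitly (e.g. Attention layer construction), which these
    # tests previously relied on without setting it.
    with set_current_vllm_config(vllm_config):
        with pytest.raises(ValueError):  # exact error type may differ
            build_layers_with_invalid_kv_sharing_target()  # hypothetical helper
```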
Tests were green, a lie,
Soft failures hid from view,
Code fix brings truth back.
Code Review
This pull request aims to fix TPU test failures related to KV sharing, which were previously showing up as soft failures. The changes involve updating test configurations, primarily by introducing the model_runner fixture, using set_current_vllm_config to ensure the correct vLLM configuration is active, and adjusting parameters like head_size and available_memory.
Overall, the changes appear logical and directly address the stated problem. The consistent updates across multiple tests are good. I found one minor issue regarding a stale comment that should be addressed for clarity.
Summary of Findings
- Stale Comment: In test_init_kv_cache_without_kv_sharing, the comment explaining the calculation of num_expected_blocks is outdated due to changes in available_memory and head_size. While the numerical value of num_expected_blocks might remain correct due to proportional changes, the comment should be updated to reflect the new parameters (40GB available memory and 64KB page size) for accuracy and maintainability.
Merge Readiness
The pull request seems to effectively address the TPU test failures. However, there is a stale comment in test_init_kv_cache_without_kv_sharing that should be updated for clarity. Once this minor issue is addressed, the PR should be ready for merge. I am unable to approve the pull request myself, so please ensure it is reviewed and approved by others before merging.
```
@@ -484,7 +491,7 @@ def test_init_kv_cache_without_kv_sharing(model_runner):
     assert len(kv_cache_spec) == 2
     assert len(model_runner.shared_kv_cache_layers) == 0

-    available_memory = 20 * GiB_bytes
+    available_memory = 40 * GiB_bytes
     # page size for layer 0's kv_cache_spec is 32KB
     num_expected_blocks = 327680  # 20GB / 32KB / 2 (num layers)
```
The comment for num_expected_blocks seems to be outdated. With available_memory changed to 40 * GiB_bytes and head_size (which affects page_size) changed to 128, the calculation for num_expected_blocks would be based on these new values.
Previously, with head_size=64, page_size was 32KB: num_blocks = 20GB / 32KB / 2 = 327680.
Now, with head_size=128, page_size becomes 64KB: num_blocks = 40GB / 64KB / 2 = 327680.
So, while the value 327680 for num_expected_blocks might still be correct (both the available memory and the page size effectively doubled), the comment # 20GB / 32KB / 2 (num layers) is no longer accurate.
Could you update the comment to reflect the current parameters, for example: # 40GB / 64KB / 2 (num layers)? This would improve clarity for future maintainers.
Suggested change:
```
-num_expected_blocks = 327680  # 20GB / 32KB / 2 (num layers)
+num_expected_blocks = 327680  # 40GB / 64KB / 2 (num layers)
```
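The equivalence noted above can be checked directly (a quick sketch, assuming 2 layers and the page sizes discussed in the comment):

```python
# Quick arithmetic check: both parameter sets yield the same block count.
GiB = 1 << 30
KiB = 1 << 10

old_blocks = 20 * GiB // (32 * KiB) // 2   # head_size=64  -> 32KB pages, 20GB
new_blocks = 40 * GiB // (64 * KiB) // 2   # head_size=128 -> 64KB pages, 40GB
assert old_blocks == new_blocks == 327680
```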
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @sarckk, thank you for the follow-up! I tested it on TPU and the vllm_config issue is gone, but I'm seeing 2 errors. Wonder if you have quick ideas on this? I will take a deeper look later.
Force-pushed from d1bed8f to c2b79b8
Force-pushed from c2b79b8 to 61918a4
@lsy323 could you try again? I think it's because we weren't properly setting the vllm config for the TPU model runner in those tests.
Sorry I didn't catch this in my review of #18212.
```
kv_cache_spec = model_runner.get_kv_cache_spec()
assert len(kv_cache_spec) == 2
assert len(model_runner.shared_kv_cache_layers) == 0

available_memory = 20 * GiB_bytes
# page size for layer 0's kv_cache_spec is 32KB
num_expected_blocks = 327680  # 20GB / 32KB / 2 (num layers)
# page size for layer 0's kv_cache_spec is 64KB
```
Why are these numbers different from the GPU model runner?
It's because the Pallas attention backend expects head_size to be divisible by 128. In the original PR, the test was using a head size of 64, and in #19108 it was changed to 128, so the downstream calculations need to assume a page_size_bytes that is 2x what it was before.
PS: I'm not 100% sure the changes fix all the tests as I don't have a TPU; I was hoping I could trigger the test in CI. Seems like we might need @lsy323 or someone from Google to confirm and stamp otherwise.
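A rough sketch of the page-size arithmetic behind this point, assuming the usual non-MLA layout (separate K and V tensors, 8 KV heads, bfloat16 values, and the GPU-style block size of 16 used before the later block-size change); illustrative only:

```python
def page_size_bytes(block_size: int, num_kv_heads: int, head_size: int,
                    dtype_bytes: int) -> int:
    # non-MLA KV cache: separate K and V tensors per block
    return 2 * block_size * num_kv_heads * head_size * dtype_bytes


# Doubling head_size from 64 to 128 doubles the per-layer page size.
assert page_size_bytes(16, 8, 64, 2) == 32 * 1024    # 32KB
assert page_size_bytes(16, 8, 128, 2) == 64 * 1024   # 64KB
```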
I'm not 100% sure if the changes fix all the tests as I don't have a TPU
I've also triggered the TPU CI test. Let's see what happens: https://buildkite.com/vllm/fastcheck/builds/26383#01973df8-1422-4d9e-9957-9ecd999dcd6d/2005-2462
I think the issue is that the TPU worker passes attn_module.dtype, which defaults to float32, instead of using the model config dtype as is done in the GPU model runner. Wonder if this is intentional? This would mean the memory estimations would need to be 2x'd.
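A tiny illustration of the 2x factor (assumption: the KV cache page size scales linearly with the element size of its dtype):

```python
import torch

# Element sizes: float32 is 4 bytes, bfloat16 is 2 bytes, hence the 2x factor
# mentioned above for memory estimations.
f32 = torch.tensor([], dtype=torch.float32).element_size()
bf16 = torch.tensor([], dtype=torch.bfloat16).element_size()
assert f32 == 2 * bf16
```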
#14570 seems to have fixed this bug on the GPU model runner, but the fix wasn't synced to the TPU model runner. I'm asking here: https://vllm-dev.slack.com/archives/C07SN1S35FE/p1749176611991269.
#19244
Adding changes from #19244 to test the fix; will rebase once that is merged. Another issue was that the Pallas backend uses a block size of 128 instead of 16.
Force-pushed from fe7551a to 9107b51
```
# page size for each layer KV can be calculated as
# 2 (non-MLA) * 8 (num_heads) * 128 (head_dim)
# * 2 (bfloat16, kv_cache dtype) * 128 (block_size) = 512KB
num_expected_blocks = 20480  # 20GB / 512KB / 2 (num layers)
kv_cache_config = get_kv_cache_config(vllm_config, kv_cache_spec,
                                      available_memory)
assert kv_cache_config.num_blocks == num_expected_blocks
```
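A quick sanity check of the numbers in the snippet above (assumptions: Pallas-style block size of 128, 8 KV heads, head_dim 128, bfloat16 KV cache, 2 layers, 20GB available memory); a sketch, not test code:

```python
GiB = 1 << 30

# 2 (K and V) * 8 heads * 128 head_dim * 2 bytes (bfloat16) * 128 block_size
page_size = 2 * 8 * 128 * 2 * 128
assert page_size == 512 * 1024                     # 512KB per layer per block

num_expected_blocks = 20 * GiB // page_size // 2   # 2 layers, no KV sharing
assert num_expected_blocks == 20480
```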
Looks like kv_cache_config.tensors[layer_0].size should be kv_cache_config.kv_cache_tensors[0]; there are several similar cases like this.
Yeah, this was changed in #17996.
Thanks!
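For readers following along, a hedged sketch of the access-pattern change referenced above (field names taken from this thread's description of #17996, not verified against the current code; layer_0 is an illustrative layer name):

```python
def layer0_tensor_size(kv_cache_config, layer_0: str) -> int:
    # Post-#17996 layout (per the comment above): a list of KV cache tensors,
    # indexed positionally.
    if hasattr(kv_cache_config, "kv_cache_tensors"):
        return kv_cache_config.kv_cache_tensors[0].size
    # Older layout: a dict keyed by layer name.
    return kv_cache_config.tensors[layer_0].size
```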
Hi @sarckk, you may be able to run this test on a CPU machine, since it doesn't involve TPU execution. Please note that the installation may modify your torch version, so you may need to reinstall the torch build for your GPU machine afterwards.
```
@@ -501,7 +516,7 @@ def test_init_kv_cache_without_kv_sharing(model_runner):
     max_context_len =\
         estimate_max_model_len(vllm_config, kv_cache_spec, 5 * GiB_bytes)
```
May I ask why we are setting 5GB here, instead of the available memory?
I'm just using an arbitrarily small amount of memory (5GB) so we can test the logic that constrains the maximum possible model len (3M in this case).
Got it, thank you for the explanation, that makes sense
```
@@ -501,7 +516,7 @@ def test_init_kv_cache_without_kv_sharing(model_runner):
     max_context_len =\
         estimate_max_model_len(vllm_config, kv_cache_spec, 5 * GiB_bytes)
     # max context len with KV sharing should be 2x as large as without
-    assert max_context_len == 1310720
+    assert max_context_len == 81920
```
Could we add a comment on how the expected number is calculated?
Actually, I ran it locally and found max_context_len is 655360, which failed this check.
I think max_context_len should be 655360; here is how I calculated it.
From the lines above, page_size is 512KB and block_size is 128, therefore each token takes 4KB in each KV cache. Here we have 5GB, so the max context len is 5GB / 4KB (cache memory used by each token) / 2 (2 layers, not sharing KV cache) = 655360.
Yes, your calculations are right.
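The same arithmetic as a tiny script (same assumptions as above: 512KB pages, 128 tokens per block, 2 layers without KV sharing, and a 5GB budget):

```python
GiB, KiB = 1 << 30, 1 << 10

# 512KB page / 128 tokens per block = 4KB of KV cache per token per layer
bytes_per_token_per_layer = (512 * KiB) // 128
max_context_len = 5 * GiB // bytes_per_token_per_layer // 2  # 2 layers
assert max_context_len == 655360
```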
```
@@ -501,7 +516,7 @@ def test_init_kv_cache_without_kv_sharing(model_runner):
     max_context_len =\
         estimate_max_model_len(vllm_config, kv_cache_spec, 5 * GiB_bytes)
     # max context len with KV sharing should be 2x as large as without
-    assert max_context_len == 1310720
+    assert max_context_len == 81920

     # important: override tensor size to prevent large mem alloc during test
     # this will only allocate 2 block worth of memory (2 * 32kb)
```
Could we update the comment to say 512KB, since the page size is 512KB for TPU?
Hi @sarckk, @heheda12345, I fixed it locally and put the fix in commit ef53a49, which you can cherry-pick onto this PR. I verified the test passed with my fix. Since I'm not that familiar with the code around the test, please take a look at my fix, thanks.
Force-pushed from 9107b51 to d6e4c9c
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Force-pushed from d6e4c9c to d13b5e7
@lsy323 CI still fails. It is quite weird to me, as I think both dtypes should always be the same.
Hi @heheda12345, I hit the same issue when debugging this earlier. I found the block table is a mocked object instead of a real tensor; that's why I removed the mocking. I verified the fix in #19371. Could we remove the mocking in this PR, so that the issue will be fixed?
Fixes TPU test failures caused by #18212 (soft failures, so they show up green in CI). Note that these tests are currently skipped in CI.
Key differences with the GPU runner:
- The Pallas attention backend requires head_size to be divisible by 128, so the tests use a head size of 128 instead of 64.
- The Pallas backend uses a block size of 128 instead of 16, which changes the per-layer page size and the expected block counts.