[V1][TPU] Fix TPU kv sharing tests #19155

Closed
sarckk wants to merge 7 commits from the fix-tpu-kv-sharing-tests branch

Conversation

@sarckk (Collaborator) commented Jun 4, 2025

Fixes TPU test failures caused by #18212 (soft failures, so they show up green in CI). Note that these tests are currently skipped in CI.

Key differences with the GPU runner (see the sketch below):

  • Pallas attention expects head size to be divisible by 128.
  • The Pallas backend also uses block size 128, so the test sets it accordingly.
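As a quick illustration of both constraints, here is a minimal hypothetical sketch (the constant names and helper below are illustrative, not part of the test file):

    # Constraints the Pallas attention backend imposes on the test parameters.
    # The GPU-oriented values (head_size=64, block_size=16) violate both.
    PALLAS_HEAD_SIZE_MULTIPLE = 128
    PALLAS_BLOCK_SIZE = 128

    def check_tpu_test_params(head_size: int, block_size: int) -> None:
        assert head_size % PALLAS_HEAD_SIZE_MULTIPLE == 0, \
            "Pallas attention expects head size divisible by 128"
        assert block_size == PALLAS_BLOCK_SIZE, \
            "Pallas backend uses block size 128"

    check_tpu_test_params(head_size=128, block_size=128)  # values used by the fixed tests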


github-actions bot commented Jun 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, a small and essential subset of CI tests meant to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot, Contributor) left a comment

Hello @sarckk, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, authored by @sarckk, aims to address and fix some 'soft failures' in the TPU KV sharing tests. According to the description, these failures were introduced by a previous PR (#18212) but were not causing the CI checks to fail outright (hence 'soft failures'). The changes primarily involve updating the test suite in test_tpu_model_runner.py to correctly reflect expected behavior and memory configurations when testing KV sharing on TPUs.

Highlights

  • Test Fixes: The core purpose of this PR is to fix existing, but non-blocking, failures in the TPU KV sharing tests.
  • Test Parameter Updates: Several test cases related to KV cache initialization and memory estimation have been updated with new parameters, including increased simulated head sizes and available memory amounts.
  • Test Setup Adjustment: Tests verifying error conditions for KV sharing initialization now use a context manager (set_current_vllm_config) to ensure the VLLM configuration is properly set during the test execution.

Changelog

  • tests/v1/tpu/worker/test_tpu_model_runner.py
    • Added model_runner fixture to test_init_kv_cache_with_kv_sharing_invalid_target_layer_order, test_init_kv_cache_with_kv_sharing_target_layer_not_exist, and test_init_kv_cache_with_kv_sharing_target_same_as_current.
    • Wrapped the pytest.raises blocks in the above tests with set_current_vllm_config(vllm_config) (see the sketch after this changelog).
    • Increased the simulated head_size from 64 to 128 in the Attention object initialization across multiple test functions (test_init_kv_cache_with_kv_sharing_invalid_target_layer_order, test_init_kv_cache_with_kv_sharing_target_layer_not_exist, test_init_kv_cache_with_kv_sharing_target_same_as_current, test_init_kv_cache_without_kv_sharing, test_init_kv_cache_with_kv_sharing_valid).
    • Increased the simulated available_memory from 20 GiB to 40 GiB in test_init_kv_cache_without_kv_sharing and test_init_kv_cache_with_kv_sharing_valid.
    • Increased the memory argument for estimate_max_model_len from 5 GiB to 10 GiB in test_init_kv_cache_without_kv_sharing and test_init_kv_cache_with_kv_sharing_valid.
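
A minimal sketch of the set_current_vllm_config wrapping pattern mentioned above, assuming the context manager is importable from vllm.config and that the runner exposes its config as model_runner.vllm_config (the test body is abbreviated):

    import pytest

    from vllm.config import set_current_vllm_config

    def test_init_kv_cache_with_kv_sharing_target_layer_not_exist(model_runner):
        vllm_config = model_runner.vllm_config
        # The config must be active while layers are constructed so that
        # layer registration sees the right vLLM config during the test.
        with set_current_vllm_config(vllm_config):
            with pytest.raises(ValueError):
                # In the real test: build Attention layers with a
                # non-existent kv_sharing_target_layer_name.
                raise ValueError("placeholder for the real layer construction")
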
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Tests were green, a lie,
Soft failures hid from view,
Code fix brings truth back.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the v1 and tpu (Related to Google TPUs) labels Jun 4, 2025
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request aims to fix TPU test failures related to KV sharing, which were previously showing as soft failures. The changes involve updating test configurations, primarily by introducing the model_runner fixture, using set_current_vllm_config to ensure the correct vLLM configuration is active, and adjusting parameters like head_size and available_memory.

Overall, the changes appear logical and directly address the stated problem. The consistent updates across multiple tests are good. I found one minor issue regarding a stale comment that should be addressed for clarity.

Summary of Findings

  • Stale Comment: In test_init_kv_cache_without_kv_sharing, the comment explaining the calculation of num_expected_blocks is outdated due to changes in available_memory and head_size. While the numerical value of num_expected_blocks might remain correct due to proportional changes, the comment should be updated to reflect the new parameters (40GB available memory and 64KB page size) for accuracy and maintainability.

Merge Readiness

The pull request seems to effectively address the TPU test failures. However, there is a stale comment in test_init_kv_cache_without_kv_sharing that should be updated for clarity. Once this minor issue is addressed, the PR should be ready for merge. I am unable to approve the pull request myself, so please ensure it is reviewed and approved by others before merging.

@@ -484,7 +491,7 @@ def test_init_kv_cache_without_kv_sharing(model_runner):
     assert len(kv_cache_spec) == 2
     assert len(model_runner.shared_kv_cache_layers) == 0

-    available_memory = 20 * GiB_bytes
+    available_memory = 40 * GiB_bytes
     # page size for layer 0's kv_cache_spec is 32KB
     num_expected_blocks = 327680  # 20GB / 32KB / 2 (num layers)
gemini-code-assist bot (Contributor) commented (severity: medium):

The comment for num_expected_blocks seems to be outdated. With available_memory changed to 40 * GiB_bytes and head_size (which affects page_size) changed to 128, the calculation for num_expected_blocks would be based on these new values.

Previously, with head_size=64, page_size was 32KB. num_blocks = (20GB) / (32KB) / 2 = 327680.
Now, with head_size=128, page_size becomes 64KB. num_blocks = (40GB) / (64KB) / 2 = 327680.

So, while the value 327680 for num_expected_blocks might still be correct (as both available memory and page size effectively doubled), the comment # 20GB / 32KB / 2 (num layers) is no longer accurate.

Could you update the comment to reflect the current parameters, for example: # 40GB / 64KB / 2 (num layers)? This would improve clarity for future maintainers.

Suggested change:
-    num_expected_blocks = 327680  # 20GB / 32KB / 2 (num layers)
+    num_expected_blocks = 327680  # 40GB / 64KB / 2 (num layers)
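
A quick arithmetic check of this point (plain Python, not test code):

    GiB, KiB = 1 << 30, 1 << 10
    num_layers = 2

    # old parameters: head_size=64 -> 32 KiB page, 20 GiB available
    assert 20 * GiB // (32 * KiB) // num_layers == 327680
    # new parameters: head_size=128 -> 64 KiB page, 40 GiB available
    assert 40 * GiB // (64 * KiB) // num_layers == 327680

Both the available memory and the page size doubled, so the expected block count is unchanged; only the comment was stale.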


mergify bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sarckk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 4, 2025
@lsy323 (Collaborator) commented Jun 4, 2025

Hi @sarckk, thank you for the follow up!

I tested it on TPU and the vllm_config issue is gone. But I'm seeing these 2 errors:

FAILED tests/v1/tpu/worker/test_tpu_model_runner.py::test_init_kv_cache_without_kv_sharing - ValueError: To serve at least one request with the models's max seq len (3000000), (45.78 GiB KV cache is needed, which is larger than the available KV cache memory (40.00 GiB). Based on the available memory, t...
FAILED tests/v1/tpu/worker/test_tpu_model_runner.py::test_init_kv_cache_with_kv_sharing_valid - AssertionError: assert 40960 == 655360

I wonder if you have any quick ideas on this? I will take a deeper look later.

@sarckk force-pushed the fix-tpu-kv-sharing-tests branch from d1bed8f to c2b79b8 on June 5, 2025 00:08
@mergify mergify bot removed the needs-rebase label Jun 5, 2025
@sarckk force-pushed the fix-tpu-kv-sharing-tests branch from c2b79b8 to 61918a4 on June 5, 2025 00:10
@sarckk (Collaborator, Author) commented Jun 5, 2025

@lsy323 Could you try again? I think it's because we weren't properly setting the vLLM config for the TPU model runner in those tests.

@sarckk sarckk marked this pull request as ready for review June 5, 2025 01:03
@heheda12345 (Collaborator) left a comment

Sorry I didn't catch this in my review of #18212.

    kv_cache_spec = model_runner.get_kv_cache_spec()
    assert len(kv_cache_spec) == 2
    assert len(model_runner.shared_kv_cache_layers) == 0

    available_memory = 20 * GiB_bytes
-    # page size for layer 0's kv_cache_spec is 32KB
+    # page size for layer 0's kv_cache_spec is 64KB
    num_expected_blocks = 327680  # 20GB / 32KB / 2 (num layers)
Collaborator:

Why are these numbers different from the GPU model runner?

@sarckk (Collaborator, Author) commented Jun 5, 2025:

It's because the Pallas attention backend expects head_size to be divisible by 128. In the original PR, the test was using a head size of 64, and in #19108 it was changed to 128. So the downstream calculations need to assume 2x the page_size_bytes used before.

PS: I'm not 100% sure the changes fix all the tests, as I don't have a TPU; I was hoping I could trigger the test in CI. It seems we might need @lsy323 or someone from Google to confirm and stamp otherwise.

Collaborator:

> I'm not 100% sure if the changes fix all the tests as I don't have a TPU

I've also triggered the TPU CI test. Let's see what happens.

@sarckk sarckk changed the title Fix TPU kv sharing tests [V1] Fix TPU kv sharing tests Jun 5, 2025
@sarckk sarckk changed the title [V1] Fix TPU kv sharing tests [V1][TPU] Fix TPU kv sharing tests Jun 5, 2025
@heheda12345 (Collaborator) commented:

https://buildkite.com/vllm/fastcheck/builds/26383#01973df8-1422-4d9e-9957-9ecd999dcd6d/2005-2462
CI gets an error like this. If the calculated number is correct, maybe you can just reduce max_model_len.

@sarckk (Collaborator, Author) commented Jun 6, 2025

I think the issue is that the TPU worker passes attn_module.dtype, which defaults to float32, instead of using the model config dtype as is done in the GPU model runner. I wonder if this is intentional?

This would mean memory estimations would need to be 2x'd.

@heheda12345 (Collaborator) commented:

#14570 seems to be a PR that fixed this bug in the GPU model runner but was not synced with the TPU model runner. I'm asking here: https://vllm-dev.slack.com/archives/C07SN1S35FE/p1749176611991269.
For this PR, I think we can force kv_cache_dtype to be float32 by setting cache_config.cache_dtype (and leave a comment), so that the test can pass regardless of whether this bug is fixed.
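
A hedged sketch of that suggestion (the exact accepted string for a float32 KV cache dtype is an assumption here; it may be "float32" or "float" depending on the vLLM version):

    # Pin the KV cache dtype in the test config so the test passes
    # whether or not the TPU runner dtype bug is fixed.
    # (Hypothetical value: check CacheConfig for the accepted spelling.)
    vllm_config.cache_config.cache_dtype = "float32"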

@heheda12345 (Collaborator) commented:

#19244 is the PR for the TPU KV cache dtype problem.

@sarckk (Collaborator, Author) commented Jun 6, 2025

Adding changes from #19244 to test the fix; will rebase once that is merged. Another issue was that the Pallas backend uses block size 128 instead of 16.

@sarckk force-pushed the fix-tpu-kv-sharing-tests branch from fe7551a to 9107b51 on June 6, 2025 18:27
@mgoin added the ready label Jun 6, 2025
    # page size for each layer KV can be calculated as
    # 2 (non-MLA) * 8 (num_heads) * 128 (head_dim)
    # * 2 (bfloat16, kv_cache dtype) * 128 (block_size) = 512KB
    num_expected_blocks = 20480  # 20GB / 512KB / 2 (num layers)
    kv_cache_config = get_kv_cache_config(vllm_config, kv_cache_spec,
                                          available_memory)
    assert kv_cache_config.num_blocks == num_expected_blocks
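
The page-size comment in this snippet can be verified directly (a sanity-check sketch, not test code):

    # 2 (K and V, non-MLA) * 8 heads * 128 head_dim * 2 bytes (bfloat16) * 128 block_size
    page_size_bytes = 2 * 8 * 128 * 2 * 128
    assert page_size_bytes == 512 * 1024                    # 512KB per layer per block
    assert 20 * (1 << 30) // page_size_bytes // 2 == 20480  # 20GB / 512KB / 2 layers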
Collaborator:

Looks like kv_cache_config.tensors[layer_0].size should be kv_cache_config.kv_cache_tensors[0]; there are several similar cases like this.
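
For reference, a sketch of the access-pattern change from #17996 (attribute names as given in the comment above):

    # old (pre-#17996): dict-style, keyed by layer name
    # size = kv_cache_config.tensors["layer_0"].size
    # new: a list of KV cache tensor specs
    size = kv_cache_config.kv_cache_tensors[0].size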

Collaborator Author:

Yeah, this was changed in #17996.

Collaborator:

Thanks!

@lsy323 (Collaborator) commented Jun 6, 2025

Hi @sarckk, you may be able to run this test on a CPU machine, since it doesn't involve TPU execution.

You may do:

  1. pip install -r requirements/tpu.txt
  2. PJRT_DEVICE=CPU python3 -m pytest -s -v tests/v1/tpu/worker/test_tpu_model_runner.py -k test_init_kv_cache_without_kv_sharing

Please note that the installation may modify your torch version, so you may need to reinstall the torch build for your GPU machine afterwards.

@@ -501,7 +516,7 @@ def test_init_kv_cache_without_kv_sharing(model_runner):
    max_context_len = \
        estimate_max_model_len(vllm_config, kv_cache_spec, 5 * GiB_bytes)
Collaborator:

May I ask why we are setting 5GB here, instead of the available memory?

Collaborator Author:

I'm just using an arbitrarily small memory (5GB) so we can test the logic that constrains the maximum possible model len (3M in this case).

Collaborator:

Got it, thank you for the explanation, that makes sense.

@@ -501,7 +516,7 @@ def test_init_kv_cache_without_kv_sharing(model_runner):
    max_context_len = \
        estimate_max_model_len(vllm_config, kv_cache_spec, 5 * GiB_bytes)
    # max context len with KV sharing should be 2x as large as without
-   assert max_context_len == 1310720
+   assert max_context_len == 81920
Collaborator:

Could we add a comment on how the expected number is calculated?

Collaborator:

Actually, I ran it locally and found max_context_len is 655360, which fails this check.

Collaborator:

I think max_context_len should be 655360; here is how I calculated it.

From the lines above, page_size is 512KB and block_size is 128, so each token takes 4KB in each KV cache.

With 5GB here, the max context len is:

5GB / 4KB (cache memory used by each token) / 2 (2 layers, not sharing KV cache) = 655360

Collaborator Author:

Yes, your calculations are right.
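
For future readers, the confirmed calculation as a small sketch (variable names are illustrative):

    GiB, KiB = 1 << 30, 1 << 10
    page_size = 512 * KiB                       # per-layer page size derived above
    block_size = 128                            # Pallas block size
    bytes_per_token = page_size // block_size   # 4 KiB per token per layer
    num_layers = 2                              # no KV sharing: both layers need cache
    assert 5 * GiB // bytes_per_token // num_layers == 655360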

@@ -501,7 +516,7 @@ def test_init_kv_cache_without_kv_sharing(model_runner):
    max_context_len = \
        estimate_max_model_len(vllm_config, kv_cache_spec, 5 * GiB_bytes)
    # max context len with KV sharing should be 2x as large as without
-   assert max_context_len == 1310720
+   assert max_context_len == 81920

    # important: override tensor size to prevent large mem alloc during test
    # this will only allocate 2 block worth of memory (2 * 32kb)
Collaborator:

Could we update the comment to say 512KB, since the page size is 512KB on TPU?

@lsy323 (Collaborator) commented Jun 6, 2025

Hi @sarckk, @heheda12345, I fixed it locally and put the fix in commit ef53a49, which you can cherry-pick onto this PR. I verified the test passed with my fix.

Since I'm not that familiar with the code around the test, please take a look at my fix. Thanks!

@sarckk sarckk force-pushed the fix-tpu-kv-sharing-tests branch from 9107b51 to d6e4c9c Compare June 7, 2025 04:22
sarckk added 7 commits June 7, 2025 08:23, each signed off by Yong Hoon Shin <yhshin@meta.com>.
@sarckk force-pushed the fix-tpu-kv-sharing-tests branch from d6e4c9c to d13b5e7 on June 7, 2025 15:24
@heheda12345 (Collaborator) commented:

@lsy323 CI still fails. It is quite weird to me, as I think both dtypes should always be torch.int32. Do you have any idea?
https://buildkite.com/vllm/ci/builds/21641#01974afe-cfbb-42ff-bc2c-726188a882ab/2531-2996

@lsy323 (Collaborator) commented Jun 9, 2025

> @lsy323 CI still fails. It is quite weird to me, as I think both dtypes should always be torch.int32. Do you have any idea? https://buildkite.com/vllm/ci/builds/21641#01974afe-cfbb-42ff-bc2c-726188a882ab/2531-2996

Hi @heheda12345, I hit the same issue when debugging it earlier. I found the block table is a mocked object instead of a real tensor; that's why I removed the mocking. I verified the fix in #19371.

Could we remove the mocking in this PR, so that the issue gets fixed?

@sarckk (Collaborator, Author) commented Jun 9, 2025

> I found the block table is a mocked object instead of a real tensor, that's why I removed the mocking

I see, I kept it because I thought that was just for local testing.

Closing in favour of #19371. Thanks @lsy323

@sarckk sarckk closed this Jun 9, 2025
@sarckk sarckk deleted the fix-tpu-kv-sharing-tests branch June 12, 2025 22:08
Labels: ready, tpu, v1