[Core] direct indexing on self.block_table_np in compute_slot_mapping #22940
Code Review
This pull request introduces a performance optimization in compute_slot_mapping by replacing PyTorch tensor operations with direct NumPy indexing. The change from self.get_cpu_tensor().flatten()[...].numpy() to self.block_table_np.ravel()[...] is more direct, avoids unnecessary function calls and tensor conversions, and leverages NumPy's efficiency for this operation. The provided benchmarks confirm the significant performance improvement. The change is correct and well-justified.
One-liner with significant perf improvement. @heheda12345, @LucasWilkinson, @houseroad, and @njhill, can you take a look?
Wow, nice!
Signed-off-by: linzebing <linzebing1995@gmail.com>
Nice! TIL
    block_table_indices = (req_indices * self.max_num_blocks_per_req +
                           positions // self.block_size)
    block_table_cpu = self.get_cpu_tensor()
    block_numbers = block_table_cpu.flatten()[block_table_indices].numpy()
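For readers following the diff, here is a minimal NumPy-only sketch of the flat-index lookup these lines perform, written the way the PR proposes (indexing `block_table_np.ravel()` directly). The table contents and sizes are made up for illustration, and the final `slot_mapping` step is an assumption about how the block numbers are used afterwards; it is not part of the quoted lines.

```python
import numpy as np

# Hypothetical sizes and contents, chosen only for illustration.
block_size = 4
max_num_blocks_per_req = 3

# Row r holds the block numbers assigned to request r.
block_table_np = np.array([[10, 11, 12],
                           [20, 21, 22]], dtype=np.int32)

# One entry per token: the owning request and the token's position in it.
req_indices = np.array([0, 0, 1, 1, 1])
positions = np.array([0, 5, 3, 4, 11])

# Flat index into the row-major table: row offset plus block column.
block_table_indices = (req_indices * max_num_blocks_per_req +
                       positions // block_size)

# Direct NumPy lookup on the flattened table, as in the PR.
block_numbers = block_table_np.ravel()[block_table_indices]

# The same lookup expressed with 2D indexing, as a cross-check.
block_numbers_2d = block_table_np[req_indices, positions // block_size]

# Assumed follow-up step: combine block number and in-block offset.
slot_mapping = block_numbers * block_size + positions % block_size

print(block_numbers)  # [10 11 20 21 22]
print(slot_mapping)   # [40 45 83 84 91]
```

The flat index works because the table is stored row-major with a fixed row length of `max_num_blocks_per_req`, so one 1D gather replaces a 2D lookup.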
IIUC, the win mainly comes from avoiding the tensor-to-np copy, while tensor.flatten and np.ravel should have similar performance?
torch tensor's indexing is slow
Interesting. Ideally, the indexing cost should be similar as well. I guess the difference might come from:
- An actual indexing performance gap
- A layout difference, so torch.Tensor invokes a copy while np still creates a view.
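The view-versus-copy point can be checked on the NumPy side with a small sketch. It uses a made-up array rather than vLLM's block table, and it says nothing about what torch does internally, so it does not settle which of the two explanations above is right.

```python
import numpy as np

# Made-up C-contiguous table; not the real block table.
table = np.arange(12, dtype=np.int32).reshape(3, 4)

# ravel() on a C-contiguous array returns a view: no data is copied.
flat = table.ravel()
ravel_is_view = np.shares_memory(flat, table)

# On a non-contiguous layout (here, a transpose) ravel() has to copy.
flat_t = table.T.ravel()
ravel_t_is_view = np.shares_memory(flat_t, table)

# Integer-array (fancy) indexing always produces a new array.
idx = np.array([0, 5, 11])
gathered = flat[idx]
gather_is_view = np.shares_memory(gathered, table)

print(ravel_is_view, ravel_t_is_view, gather_is_view)  # True False False
```

So for a contiguous table the only allocation on the NumPy path is the gathered result itself.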
    # block_size.
    block_table_indices = (req_indices * self.max_num_blocks_per_req +
                           positions // self.block_size)
    block_table_cpu = self.get_cpu_tensor()
Maybe we could check the other get_cpu_tensor references to see if there are any other opportunities :)
The rest of the call sites are in tpu_model_runner.py; I don't see anything obvious yet.
@linzebing has imported this pull request. If you are a Meta employee, you can view this in D80367371.
…vllm-project#22940) Signed-off-by: linzebing <linzebing1995@gmail.com> Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu>
…vllm-project#22940) Signed-off-by: linzebing <linzebing1995@gmail.com>
…vllm-project#22940) Signed-off-by: linzebing <linzebing1995@gmail.com> Signed-off-by: Duncan Moss <djm.moss@gmail.com>
…vllm-project#22940) Signed-off-by: linzebing <linzebing1995@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>

Purpose
Streamline slot mapping computation by replacing Torch tensor flattening and conversion with direct NumPy indexing via ravel, eliminating redundant copies and conversions.
Test Plan
Also ran throughput benchmark test:
Test Result
Reduced compute_slot_mapping from 400+μs to 15-30μs.
Throughput improved 2.07% for opt-125m with input=800 and output=75.
Before:
Throughput: 268.45 requests/s, 234888.85 total tokens/s, 20133.98 output tokens/s
After:

Throughput: 274.01 requests/s, 239750.19 total tokens/s, 20550.72 output tokens/s
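As a quick sanity check on the reported 2.07% figure, the relative gain can be recomputed from the Before and After lines above. This is plain arithmetic on the quoted numbers; the helper name `percent_gain` is just for illustration.

```python
# Throughput numbers copied from the Before / After lines above.
before = {
    "requests/s": 268.45,
    "total tokens/s": 234888.85,
    "output tokens/s": 20133.98,
}
after = {
    "requests/s": 274.01,
    "total tokens/s": 239750.19,
    "output tokens/s": 20550.72,
}


def percent_gain(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


gains = {name: percent_gain(before[name], after[name]) for name in before}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}%")
```

All three metrics come out at roughly 2.07%, consistent with the stated improvement.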