-
-
Notifications
You must be signed in to change notification settings - Fork 9.3k
WIP - Hack xpu memory - my attempt at fixing issue #20743 #22415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add s_aux parameter to ipex_ops.flash_attn_varlen_func to fix TypeError when running Qwen3 on Intel XPU. The parameter is accepted but ignored to maintain API compatibility. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add has_xmx_support() method to detect XMX capability on Intel GPUs - Add fallback to PyTorch scaled_dot_product_attention when XMX is not available - Handle "SDP kernel requires XMX" runtime errors gracefully - Enable Qwen3 to run on Intel Arc integrated GPUs without XMX support This fixes RuntimeError when running models on Intel integrated GPUs that lack XMX (Matrix Extensions) hardware acceleration. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces fallback mechanisms to improve support for Intel XPU devices, particularly for integrated GPUs where features like XMX or certain memory APIs may be unavailable. The changes focus on attention operations and memory management within the XPU worker.
While the goal is valuable, I've found a critical issue in the paged attention fallback logic that will cause incorrect outputs. There is also a high-severity issue with overly broad exception handling that could hide bugs. My review provides details on these points and suggestions for how to address them to ensure the correctness and robustness of the XPU backend.
vllm/_ipex_ops.py
Outdated
if "XMX" in str(e) or "chunked_prefill" in str(e): | ||
# Fallback to basic attention implementation without XMX | ||
from vllm.platforms import current_platform | ||
import torch.nn.functional as F | ||
import warnings | ||
warnings.warn( | ||
f"XMX acceleration not available ({e}). " | ||
"Falling back to basic attention implementation. " | ||
"Performance will be reduced on Intel integrated GPUs.", | ||
UserWarning | ||
) | ||
|
||
# Basic scaled dot product attention fallback | ||
q = q.contiguous() | ||
k = k.contiguous() | ||
v = v.contiguous() | ||
|
||
# Reshape for batch processing | ||
batch_size = cu_seqlens_q.shape[0] - 1 | ||
outputs = [] | ||
|
||
for i in range(batch_size): | ||
start_q = cu_seqlens_q[i].item() | ||
end_q = cu_seqlens_q[i + 1].item() | ||
start_k = cu_seqlens_k[i].item() | ||
end_k = cu_seqlens_k[i + 1].item() | ||
|
||
q_seq = q[start_q:end_q].unsqueeze(0) | ||
k_seq = k[start_k:end_k].unsqueeze(0) | ||
v_seq = v[start_k:end_k].unsqueeze(0) | ||
|
||
# Use PyTorch's scaled dot product attention as fallback | ||
attn_out = F.scaled_dot_product_attention( | ||
q_seq, k_seq, v_seq, | ||
scale=softmax_scale, | ||
is_causal=causal | ||
) | ||
outputs.append(attn_out.squeeze(0)) | ||
|
||
result = torch.cat(outputs, dim=0) | ||
out.copy_(result) | ||
return out | ||
else: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fallback attention implementation in this except
block is incorrect for paged attention, which will lead to incorrect model outputs. Here's why:
-
Ignoring
block_table
: The function signature includes ablock_table
parameter, which is essential for paged attention to map logical token blocks to physical memory blocks in the KV cache. The fallback implementation completely ignores thisblock_table
. -
Assuming Contiguous KV Cache: The fallback slices the
k
andv
tensors (which represent the KV caches in paged attention) directly, likek_seq = k[start_k:end_k]
. This assumes that the key/value tensors for all sequences are stored contiguously, which is not the case for paged attention. This will result in gathering incorrect key/value pairs and producing erroneous attention results.
To fix this, the fallback logic must be updated to correctly handle paged KV caches by using the block_table
to gather the appropriate KV blocks for each sequence before performing the attention computation.
Additionally, the import from vllm.platforms import current_platform
on line 315 appears to be unused within this block and can be removed.
except Exception: | ||
# Conservative default - assume no XMX support | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The except Exception:
is too broad. It will catch any error that occurs within the try
block and silently return False
. This can mask underlying bugs (e.g., a typo in a method name, an unexpected error from get_device_name
) and make debugging very difficult, potentially leading to silent performance degradation on devices that should have XMX support.
It would be more robust to either catch more specific exceptions or, at a minimum, to log the exception that was caught. This would provide visibility into any unexpected failures during the check, similar to the logging you've added in vllm/v1/worker/xpu_worker.py
.
except Exception: | |
# Conservative default - assume no XMX support | |
return False | |
except Exception as e: | |
# Conservative default - assume no XMX support | |
logger.warning(f"Failed to determine XMX support: {e}. Assuming no XMX.") | |
return False |
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
- Replace scaled_dot_product_attention with manual attention computation - Properly handle [total_tokens, num_heads, head_dim] tensor format - Add correct tensor transposition for multi-head attention - Implement causal masking and softmax manually - Handle variable-length sequences with proper reshaping Fixes "The size of tensor a (31) must match the size of tensor b (64)" error when falling back from XMX-accelerated attention on Intel integrated GPUs. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
This pull request has merge conflicts that must be resolved before it can be |
Add extensive debug prints to understand tensor shapes and sequence length handling in the XMX fallback path. This will help diagnose the "tensor size mismatch" error. - Debug tensor shapes for q, k, v - Debug cumulative sequence lengths - Debug per-batch sequence dimensions - Add tensor reshaping logic for 2D vs 3D tensors 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
may I know what intel GPU you are using? |
Replace complex variable-length attention processing with simple full-sequence attention computation. This fixes tensor dimension mismatch errors when Q and KV sequences have different lengths (common in KV-cached attention scenarios). Key changes: - Remove per-batch sequence processing that caused dimension issues - Use simple attention over full concatenated Q/K/V tensors - Handle 2D to 3D tensor reshaping cleanly - Apply causal masking only when Q and K lengths match - Support different query and key/value sequence lengths Fixes "The size of tensor a (16) must match the size of tensor b (31)" error in KV-cached attention on Intel integrated GPUs. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Fix dimension mismatch in KV cache scenarios where query tokens (16) and cached key/value tokens (815) have different lengths. Restore proper variable-length sequence processing that handles each batch item separately. Key fixes: - Use cu_seqlens_q/cu_seqlens_k to process sequences individually - Handle different query and key/value sequence lengths properly - Apply causal masking only when Q and K lengths match - Add extensive debug logging for troubleshooting - Correct tensor concatenation for batch outputs This addresses the "tensor a (16) vs b (815)" dimension mismatch error in KV-cached attention on Intel integrated GPUs. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Fix tensor dimension mismatch by properly handling paged K/V cache format [num_blocks, block_size, num_kv_heads, head_dim] vs flat query format [total_tokens, num_q_heads, head_dim]. Key fixes: - Detect 4D paged vs 3D flat K/V tensor formats - Flatten first blocks of paged K/V cache for attention computation - Handle Grouped Query Attention (GQA) with different head counts - Use dimension-safe operations to prevent tensor mismatches - Add padding for output shape consistency - Support both paged and flat K/V cache formats This enables functional (though simplified) attention computation on Intel integrated GPUs without XMX support, allowing models like Qwen3 to run despite reduced performance. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
#14295 (comment) |
for |
so no, seems this still need my workaround:
but I can try just that on top of this branch, let's see if it's better |
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.Purpose
Test Plan
Test Result
(Optional) Documentation Update