
WIP - Hack xpu memory - my attempt at fixing issue #20743 #22415


Open
yuvalk wants to merge 8 commits into main

Conversation


@yuvalk yuvalk commented Aug 7, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

yuvalk and others added 3 commits August 7, 2025 01:25
Add s_aux parameter to ipex_ops.flash_attn_varlen_func to fix
TypeError when running Qwen3 on Intel XPU. The parameter is
accepted but ignored to maintain API compatibility.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add has_xmx_support() method to detect XMX capability on Intel GPUs
- Add fallback to PyTorch scaled_dot_product_attention when XMX is not available
- Handle "SDP kernel requires XMX" runtime errors gracefully
- Enable Qwen3 to run on Intel Arc integrated GPUs without XMX support

This fixes RuntimeError when running models on Intel integrated GPUs
that lack XMX (Matrix Extensions) hardware acceleration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
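
For orientation, a minimal, self-contained sketch of what such a capability check might look like; the device-name heuristic and helper below are illustrative assumptions, not the PR's actual has_xmx_support() implementation:

```python
# Illustrative sketch only: the real has_xmx_support() in this PR may differ.
import torch

def has_xmx_support(device_index: int = 0) -> bool:
    """Best-effort check for XMX (Intel Matrix Extensions) on an XPU device."""
    try:
        name = torch.xpu.get_device_name(device_index).lower()
        # Heuristic assumption: discrete Arc / data-center parts expose XMX,
        # while generic "Intel(R) Graphics" integrated GPUs do not.
        return "arc" in name or "data center" in name
    except Exception as e:
        # Conservative default: assume no XMX rather than fail the worker,
        # but keep the error visible for debugging.
        print(f"Could not determine XMX support: {e}")
        return False
```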
Contributor
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces fallback mechanisms to improve support for Intel XPU devices, particularly for integrated GPUs where features like XMX or certain memory APIs may be unavailable. The changes focus on attention operations and memory management within the XPU worker.

While the goal is valuable, I've found a critical issue in the paged attention fallback logic that will cause incorrect outputs. There is also a high-severity issue with overly broad exception handling that could hide bugs. My review provides details on these points and suggestions for how to address them to ensure the correctness and robustness of the XPU backend.

Comment on lines 313 to 356
if "XMX" in str(e) or "chunked_prefill" in str(e):
# Fallback to basic attention implementation without XMX
from vllm.platforms import current_platform
import torch.nn.functional as F
import warnings
warnings.warn(
f"XMX acceleration not available ({e}). "
"Falling back to basic attention implementation. "
"Performance will be reduced on Intel integrated GPUs.",
UserWarning
)

# Basic scaled dot product attention fallback
q = q.contiguous()
k = k.contiguous()
v = v.contiguous()

# Reshape for batch processing
batch_size = cu_seqlens_q.shape[0] - 1
outputs = []

for i in range(batch_size):
start_q = cu_seqlens_q[i].item()
end_q = cu_seqlens_q[i + 1].item()
start_k = cu_seqlens_k[i].item()
end_k = cu_seqlens_k[i + 1].item()

q_seq = q[start_q:end_q].unsqueeze(0)
k_seq = k[start_k:end_k].unsqueeze(0)
v_seq = v[start_k:end_k].unsqueeze(0)

# Use PyTorch's scaled dot product attention as fallback
attn_out = F.scaled_dot_product_attention(
q_seq, k_seq, v_seq,
scale=softmax_scale,
is_causal=causal
)
outputs.append(attn_out.squeeze(0))

result = torch.cat(outputs, dim=0)
out.copy_(result)
return out
else:
raise

critical

The fallback attention implementation in this except block is incorrect for paged attention, which will lead to incorrect model outputs. Here's why:

  1. Ignoring block_table: The function signature includes a block_table parameter, which is essential for paged attention to map logical token blocks to physical memory blocks in the KV cache. The fallback implementation completely ignores this block_table.

  2. Assuming Contiguous KV Cache: The fallback slices the k and v tensors (which represent the KV caches in paged attention) directly, like k_seq = k[start_k:end_k]. This assumes that the key/value tensors for all sequences are stored contiguously, which is not the case for paged attention. This will result in gathering incorrect key/value pairs and producing erroneous attention results.

To fix this, the fallback logic must be updated to correctly handle paged KV caches by using the block_table to gather the appropriate KV blocks for each sequence before performing the attention computation.

Additionally, the import from vllm.platforms import current_platform on line 315 appears to be unused within this block and can be removed.
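
For illustration, a hedged sketch of the block_table-aware gather the reviewer is describing, assuming the paged cache layout [num_blocks, block_size, num_kv_heads, head_dim] mentioned elsewhere in this PR; the helper is not the PR's code:

```python
import torch

def gather_paged_kv(kv_cache: torch.Tensor,         # [num_blocks, block_size, num_kv_heads, head_dim]
                    block_table_row: torch.Tensor,  # block indices for one sequence
                    seq_len: int) -> torch.Tensor:
    """Assemble one sequence's keys (or values) from a paged cache into a
    contiguous [seq_len, num_kv_heads, head_dim] tensor."""
    block_size = kv_cache.shape[1]
    num_blocks = (seq_len + block_size - 1) // block_size
    blocks = kv_cache[block_table_row[:num_blocks].long()]   # [num_blocks, block_size, H, D]
    flat = blocks.reshape(-1, kv_cache.shape[2], kv_cache.shape[3])
    return flat[:seq_len]  # drop padding slots in the last block
```

Each sequence's k_seq and v_seq could then be assembled this way before the per-sequence attention call, instead of slicing k and v directly.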

Comment on lines +199 to +201
except Exception:
    # Conservative default - assume no XMX support
    return False

high

The except Exception: is too broad. It will catch any error that occurs within the try block and silently return False. This can mask underlying bugs (e.g., a typo in a method name, an unexpected error from get_device_name) and make debugging very difficult, potentially leading to silent performance degradation on devices that should have XMX support.

It would be more robust to either catch more specific exceptions or, at a minimum, to log the exception that was caught. This would provide visibility into any unexpected failures during the check, similar to the logging you've added in vllm/v1/worker/xpu_worker.py.

Suggested change, from:

except Exception:
    # Conservative default - assume no XMX support
    return False

to:

except Exception as e:
    # Conservative default - assume no XMX support
    logger.warning(f"Failed to determine XMX support: {e}. Assuming no XMX.")
    return False


github-actions bot commented Aug 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

- Replace scaled_dot_product_attention with manual attention computation
- Properly handle [total_tokens, num_heads, head_dim] tensor format
- Add correct tensor transposition for multi-head attention
- Implement causal masking and softmax manually
- Handle variable-length sequences with proper reshaping

Fixes "The size of tensor a (31) must match the size of tensor b (64)"
error when falling back from XMX-accelerated attention on Intel
integrated GPUs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
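
For reference, a self-contained sketch of the manual attention this commit describes, assuming a single sequence in flat [seq_len, num_heads, head_dim] layout (illustrative only, not the PR's diff):

```python
import torch

def manual_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     scale: float, causal: bool = True) -> torch.Tensor:
    # [seq, heads, dim] -> [heads, seq, dim] so the matmuls run per head
    q, k, v = (t.transpose(0, 1) for t in (q, k, v))
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale        # [heads, q_len, k_len]
    if causal and q.shape[1] == k.shape[1]:
        # Mask out future positions (upper triangle above the diagonal)
        mask = torch.triu(torch.ones_like(scores[0], dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    out = torch.matmul(torch.softmax(scores, dim=-1), v)         # [heads, q_len, dim]
    return out.transpose(0, 1)                                   # back to [q_len, heads, dim]
```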

mergify bot commented Aug 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yuvalk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 7, 2025
Add extensive debug prints to understand tensor shapes and
sequence length handling in the XMX fallback path. This will
help diagnose the "tensor size mismatch" error.

- Debug tensor shapes for q, k, v
- Debug cumulative sequence lengths
- Debug per-batch sequence dimensions
- Add tensor reshaping logic for 2D vs 3D tensors

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jikunshang
Collaborator

May I know what Intel GPU you are using?

yuvalk and others added 3 commits August 7, 2025 03:10
Replace complex variable-length attention processing with simple
full-sequence attention computation. This fixes tensor dimension
mismatch errors when Q and KV sequences have different lengths
(common in KV-cached attention scenarios).

Key changes:
- Remove per-batch sequence processing that caused dimension issues
- Use simple attention over full concatenated Q/K/V tensors
- Handle 2D to 3D tensor reshaping cleanly
- Apply causal masking only when Q and K lengths match
- Support different query and key/value sequence lengths

Fixes "The size of tensor a (16) must match the size of tensor b (31)"
error in KV-cached attention on Intel integrated GPUs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fix dimension mismatch in KV cache scenarios where query tokens (16)
and cached key/value tokens (815) have different lengths. Restore
proper variable-length sequence processing that handles each batch
item separately.

Key fixes:
- Use cu_seqlens_q/cu_seqlens_k to process sequences individually
- Handle different query and key/value sequence lengths properly
- Apply causal masking only when Q and K lengths match
- Add extensive debug logging for troubleshooting
- Correct tensor concatenation for batch outputs

This addresses the "tensor a (16) vs b (815)" dimension mismatch
error in KV-cached attention on Intel integrated GPUs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
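
A hedged aside on the "causal masking only when lengths match" point above: when cached keys outnumber new queries, a causal mask can still be built by aligning each query to its absolute key position. A purely illustrative helper:

```python
import torch

def aligned_causal_mask(q_len: int, k_len: int, device=None) -> torch.Tensor:
    """Boolean mask of shape [q_len, k_len]; True marks positions to block.
    Assumes the q_len queries correspond to the last q_len key positions."""
    offset = k_len - q_len
    q_idx = torch.arange(q_len, device=device).unsqueeze(1)   # [q_len, 1]
    k_idx = torch.arange(k_len, device=device).unsqueeze(0)   # [1, k_len]
    return k_idx > (q_idx + offset)
```

The logical inverse of this mask could be passed as a boolean attn_mask to scaled_dot_product_attention, or it could be applied via masked_fill in a manual implementation, rather than skipping causality whenever q_len != k_len.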
Fix tensor dimension mismatch by properly handling paged K/V cache
format [num_blocks, block_size, num_kv_heads, head_dim] vs flat
query format [total_tokens, num_q_heads, head_dim].

Key fixes:
- Detect 4D paged vs 3D flat K/V tensor formats
- Flatten first blocks of paged K/V cache for attention computation
- Handle Grouped Query Attention (GQA) with different head counts
- Use dimension-safe operations to prevent tensor mismatches
- Add padding for output shape consistency
- Support both paged and flat K/V cache formats

This enables functional (though simplified) attention computation
on Intel integrated GPUs without XMX support, allowing models
like Qwen3 to run despite reduced performance.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
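
On the GQA point in this commit, a minimal illustration of the usual head-expansion step (an assumption about the approach, not the PR's diff): key/value heads are repeated until they match the query head count before the attention matmuls.

```python
import torch

def expand_kv_heads(k: torch.Tensor, v: torch.Tensor, num_q_heads: int):
    """k, v: [seq_len, num_kv_heads, head_dim]; returns tensors with num_q_heads heads."""
    num_kv_heads = k.shape[1]
    if num_q_heads != num_kv_heads:
        assert num_q_heads % num_kv_heads == 0, "query heads must be a multiple of KV heads"
        repeat = num_q_heads // num_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
    return k, v
```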
@yuvalk
Author

yuvalk commented Aug 7, 2025

May I know what Intel GPU you are using?

#14295 (comment)
:-(
(found this after you asked)

@jikunshang
Collaborator

For the torch.xpu.mem_get_info() API, this should be fixed on oneAPI 2025.1.3 + torch 2.8; we verified it works on BMG, but I am not sure whether MTL also works.
If possible, can you use this branch to verify #22300?

@yuvalk
Author

yuvalk commented Aug 7, 2025

For the torch.xpu.mem_get_info() API, this should be fixed on oneAPI 2025.1.3 + torch 2.8; we verified it works on BMG, but I am not sure whether MTL also works. If possible, can you use this branch to verify #22300?

So no, it seems this still needs my workaround:

(EngineCore_0 pid=119) ERROR 08-07 18:33:54 [core.py:683] RuntimeError: The device (Intel(R) Graphics) doesn't support querying the available free memory. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize its implementation.

But I can try just that on top of this branch; let's see if it's better.
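
For context, a hedged sketch of the kind of workaround being discussed, assuming torch.xpu.mem_get_info(), get_device_properties(), and memory_allocated() behave like their CUDA counterparts; this is not the PR's actual change:

```python
import torch

def xpu_mem_get_info(device: int = 0):
    try:
        return torch.xpu.mem_get_info(device)            # (free_bytes, total_bytes)
    except RuntimeError:
        # Devices like "Intel(R) Graphics" iGPUs cannot report free memory;
        # fall back to a rough estimate: total minus what this process allocated.
        total = torch.xpu.get_device_properties(device).total_memory
        free = total - torch.xpu.memory_allocated(device)
        return free, total
```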
