
[webgpu] support intel subgroup matrix on matmul_nbits #24898

Open · xhcao wants to merge 1 commit into main

Conversation

xhcao (Contributor) commented May 29, 2025

The patch enables Intel subgroup matrix support in the matmul_nbits operator. For now it is enabled only on the Vulkan backend and the Intel xe-2lpg architecture; we will extend it to more subgroup matrix configs and platforms.

xhcao (Contributor, Author) commented May 29, 2025

1. The subgroup matrix feature is closely tied to hardware vendors and their architectures: vendors expose different subgroup matrix configs, and the best config differs per vendor and architecture. Optimizing the algorithm for one piece of hardware can easily hurt others, so at this early stage of development we generate shader code separately for each vendor (a minimal sketch follows this list).
2. The PR currently supports only the Intel xe-2lpg architecture on Vulkan, with the subgroup matrix config f16(8x16) x f16(16x16) = f32(8x16); we will extend the feature as Dawn enables more configs.
3. On Intel xe-2lpg, current performance is ~20% slower than the dp4a path and ~10% faster than the non-dp4a path.
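To make the per-vendor split concrete, here is a minimal hypothetical C++ sketch of keying code generation on a vendor/architecture-specific subgroup matrix shape. The names SubgroupMatrixShape and PickSubgroupMatrixShape are illustrative, not from the PR; only the Intel xe-2lpg shape is taken from the description above.

// Hypothetical sketch, not the PR's actual code: select a subgroup matrix
// shape per vendor/architecture, falling back when none is wired up.
#include <cstdint>
#include <optional>
#include <string_view>

struct SubgroupMatrixShape {
  uint32_t M, N, K;        // A is MxK, B is KxN, accumulator is MxN
  const char* input_type;  // component type of A/B, e.g. "f16"
  const char* output_type; // component type of the accumulator, e.g. "f32"
};

std::optional<SubgroupMatrixShape> PickSubgroupMatrixShape(
    std::string_view vendor, std::string_view arch) {
  if (vendor == "intel" && arch == "xe-2lpg") {
    // f16(8x16) x f16(16x16) = f32(8x16), per the PR description.
    return SubgroupMatrixShape{8, 16, 16, "f16", "f32"};
  }
  return std::nullopt;  // fall back to the dp4a / generic path
}

Keeping the selection explicit per vendor is what lets one architecture be tuned without risking regressions on another, which is the rationale given in point 1.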

@jchen10 @daijh PTAL, thanks.

xhcao (Contributor, Author) commented May 29, 2025

[Screenshot 2025-05-29 161718]
Currently, the subgroup matrix config UINT8(8x32) x UINT8(32x8) = UINT32(8x8) is being implemented in Dawn; its result is expected to be better than dp4a.
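For intuition about why that config should help, here is a scalar C++ reference of what one UINT8(8x32) x UINT8(32x8) = UINT32(8x8) accumulate computes. This is illustrative only: the real operation is a cooperative subgroup instruction, not a per-thread loop. Each output element folds 32 int8 products per matrix instruction, versus 4 per dp4a instruction.

// Scalar reference for UINT8(8x32) x UINT8(32x8) = UINT32(8x8).
// Illustrative only; the hardware executes this cooperatively.
#include <cstdint>

void ReferenceSubgroupMatMul(const uint8_t A[8][32], const uint8_t B[32][8],
                             uint32_t C[8][8]) {
  for (int m = 0; m < 8; ++m) {
    for (int n = 0; n < 8; ++n) {
      uint32_t acc = 0;
      for (int k = 0; k < 32; ++k) {  // 32 products per element; dp4a does 4
        acc += uint32_t(A[m][k]) * uint32_t(B[k][n]);
      }
      C[m][n] = acc;
    }
  }
}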

@@ -150,6 +148,11 @@ void WebGpuContext::Initialize(const WebGpuBufferCacheConfig& buffer_cache_confi
   for (size_t i = 0; i < supported_features.featureCount; i++) {
     device_features_.insert(supported_features.features[i]);
   }
   // cache adapter info
   if (DeviceHasFeature(wgpu::FeatureName::ChromiumExperimentalSubgroupMatrix)) {
A contributor commented on this DeviceHasFeature check:
Is this feature always available on all platforms? (win/linux/mac/wasm)
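For context on that question: ChromiumExperimentalSubgroupMatrix is hardware- and driver-dependent, so it is not uniformly available across platforms, which is why it is probed at runtime. Below is a minimal sketch of querying the adapter's supported configs; it assumes Dawn's experimental C++ API, and the chained-struct and field names reflect my understanding of Dawn's headers and may differ or change.

// Sketch: query the subgroup matrix configs the adapter supports, if any.
// Assumes Dawn's experimental API; struct and field names may change.
#include <vector>
#include <webgpu/webgpu_cpp.h>

std::vector<wgpu::SubgroupMatrixConfig> QuerySubgroupMatrixConfigs(
    const wgpu::Adapter& adapter) {
  std::vector<wgpu::SubgroupMatrixConfig> result;
  // The feature is hardware/driver dependent, so probe before querying.
  if (!adapter.HasFeature(wgpu::FeatureName::ChromiumExperimentalSubgroupMatrix)) {
    return result;
  }
  // Chain the config query onto the regular adapter-info request.
  wgpu::AdapterPropertiesSubgroupMatrixConfigs configs{};
  wgpu::AdapterInfo info{};
  info.nextInChain = &configs;
  (void)adapter.GetInfo(&info);
  // Each config carries M, N, K and component types; the shader generator
  // can pick a shape it has a kernel for (e.g. 8x16x16 f16 on xe-2lpg).
  result.assign(configs.configs, configs.configs + configs.configCount);
  return result;
}

Whether a given shape is actually fast remains vendor-specific, which is why the PR gates this path to Intel xe-2lpg for now.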
