A bit more text

raphlinus · raphlinus · commit 3c0ee434b094 · 2023-06-12T10:41:51.000-07:00
Clarify role of the barrier within the broader GPU ecosystem. Thanks Daniel for the comments!
diff --git a/_posts/2023-06-08-shader-converter.md b/_posts/2023-06-08-shader-converter.md
@@ -6,7 +6,7 @@ categories: [gpu]
 ---
 At WWDC, Apple introduced [Metal shader converter], a tool for converting shaders from DXIL (the main compilation target of HLSL in DirectX12) to Metal. While it is no doubt useful for reducing the cost of porting games from DirectX to Metal, I feel it does not move us any closer to a world of robust GPU infrastructure, and in many ways just adds more underspecified layers of complexity.
 
-The specific feature I'm salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when [Direct3D 11] and Shader Model 5 were first introduced.
+The specific feature I'm salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when [Direct3D 11] and Shader Model 5 were first introduced. This barrier is not supported in Metal, and of the major GPU APIs, Metal is the only one that doesn't support it. That holds back WebGPU's performance (see [gpuweb#3935] for discussion), as WebGPU must be portable across the major APIs.
 
 I've discussed the value of this barrier in my blog post [Prefix sum on portable compute shaders], but I'll briefly recap. Among other things, it enables a single-pass implementation of prefix sum, using a technique such as decoupled look-back or the [SAM prefix sum] algorithm. A single-pass implementation can achieve the same throughput as memcpy, while a more traditional tree-reduction approach can at best achieve 2/3 that throughput, as it has to read the entire input in two separate dispatches. Further, tree reduction can actually be more complex to implement in practice, as the number of dispatches varies with the input size (it is typically `2 * ceil(log(n) / log(threadgroup size))`). Prefix sum, in turn is an important primitive for advanced compute workloads. There are a number of instances of it in the [Vello] pipeline, and it's also commonly used in stream compaction, decoding of variable length data streams, and compression.
 
@@ -18,7 +18,9 @@ The GPU ecosystem exists at the knife edge of being strangled by complexity. A b
 
 The widespread use of shader translation makes the situation even worse. When writing HLSL that will be translated into other shader languages, it's no longer sufficient to consider [Shader Model 5] to be a baseline, but rather the developer needs to keep in mind all the features that don't translate to other languages. In some cases, the semantics change subtly (the rules for the various flavors "count leading zeros" when the input is 0 vary), and in other cases, like these device scoped barriers.
 
-A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see gpuweb#2229). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I'll also not that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
+A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see gpuweb#2229). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I'll also note that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
+
+I understand the incentives, but overall I find it disappointing that Metal chases shiny new features like ray-tracing, while failing to provide a solid, spec-compliant foundation for GPU compute.
 
 ## Onward
 
@@ -36,12 +38,13 @@ I have a recommendations for Apple as well. I hope that they document which HLSL
 [CDSChecker]: http://plrg.eecs.uci.edu/software_page/42-2/
 [loom]: https://github.com/tokio-rs/loom
 [OpenGL 3.1 support]: https://asahilinux.org/2023/06/opengl-3-1-on-asahi-linux/
+[gpuweb#2229]: https://github.com/gpuweb/gpuweb/issues/2229
 [gpuweb#2297]: https://github.com/gpuweb/gpuweb/pull/2297
+[gpuweb#3935]: https://github.com/gpuweb/gpuweb/discussions/3935
 [Metal Shading Language Specification]: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
 [SAM prefix sum]: https://dl.acm.org/doi/10.1145/2980983.2908089
 [Vello]: https://github.com/linebender/vello
 [extension list at vulkan.gpuinfo.org]: https://vulkan.gpuinfo.org/listfeaturesextensions.php
 [Shader Model 5]: https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/d3d11-graphics-reference-sm5
-[ghpuweb#2229]: https://github.com/gpuweb/gpuweb/issues/2229
 [std::atomic_ref]: https://en.cppreference.com/w/cpp/atomic/atomic_ref
 [reasonable IR]: https://gist.github.com/raphlinus/a8e0a3a3683127149b746eb37822bdc8