+++
date = "2023-06-06T22:00:00+09:00"
draft = false
title = "OpenGL 3.1 on Asahi Linux"
slug = "opengl-3-1-on-asahi-linux"
author = "Alyssa Rosenzweig"
+++

Upgrade your [Asahi Linux](https://asahilinux.org/) systems, because your
graphics drivers are getting a big boost: leapfrogging from OpenGL 2.1 over
OpenGL 3.0 up to OpenGL 3.1! Similarly, the OpenGL ES 2.0 support is bumping up
to OpenGL ES 3.0. That means more playable games and more functioning
applications.

Back in December, I teased an early screenshot of SuperTuxKart's deferred
renderer working on Asahi, using OpenGL ES 3.0 features like multiple render
targets and instancing. Now you too can enjoy SuperTuxKart with advanced
lighting the way it's meant to be:
{{< captioned caption="SuperTuxKart rendering with advanced lighting" >}}
<img src="/img/blog/2023/06/STK-1080p.webp" alt="SuperTuxKart rendering with advanced lighting">
{{< /captioned >}}

As before, these drivers are experimental and not yet conformant to the OpenGL
or OpenGL ES specifications. For now, you'll need to run our `-edge` packages
to opt in to the work-in-progress drivers, understanding that there may be
bugs. Please refer to [our previous
post](https://asahilinux.org/2022/12/gpu-drivers-now-in-asahi-linux/)
explaining how to install the drivers and how to report bugs to help us
improve.

With that disclaimer out of the way, there's a LOT of new functionality packed
into OpenGL 3.0, 3.1, and OpenGL ES 3.0 to make this release. Highlights
include:

* Multiple render targets
* Multisampling
* [Transform feedback](https://cgit.freedesktop.org/mesa/mesa/commit/?id=d72e1418ce4f66c42f20779f50f40091d3d310b0)
* [Texture buffer objects](https://social.treehouse.systems/@alyssa/109542058314148170)
* ...and more.

For now, let's talk about...

## Multisampling

Vulkan and OpenGL support _multisampling_, short for _multisampled
anti-aliasing_. In graphics, _aliasing_ causes jagged diagonal edges due to
rendering at insufficient resolution. One solution to aliasing is rendering at
higher resolutions and scaling down. Edges will be blurred, not jagged, which
looks better. Multisampling is an efficient implementation of that idea.

A _multisampled_ image contains multiple _samples_ for every pixel. After
rendering, a multisampled image is _resolved_ to a regular image with one
sample per pixel, typically by averaging the samples within a pixel.

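To make the resolve step concrete, here's a toy Python sketch (illustrative only, nothing like the hardware's actual resolve path) that averages the samples of a single 4x multisampled pixel:

```python
# Toy model (not driver code): resolve one multisampled pixel to a
# single colour by averaging its per-sample RGBA colours.

def resolve_average(samples):
    """Average per-sample RGBA colours into one pixel colour."""
    n = len(samples)
    return tuple(sum(channel) / n for channel in zip(*samples))

# An edge pixel: two samples covered by white geometry, two by the
# black background. The resolved pixel is mid-grey: a smoothed edge.
samples = [(1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0),
           (0.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 1.0)]
print(resolve_average(samples))  # (0.5, 0.5, 0.5, 1.0)
```
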
Apple GPUs support multisampled images and framebuffers. There's quite a bit of
typing to plumb the programmer's view of multisampling into the form understood
by the hardware, but there's no fundamental incompatibility.

The trouble comes with _sample shading_. Recall that in modern graphics, the
colour of each _fragment_ is determined by running a _fragment shader_ given by
the programmer. If the fragments are pixels, then each sample within that pixel
gets the same colour. Running the fragment shader once per pixel still benefits
from multisampling thanks to higher quality rasterization, but it's not as good
as *actually* rendering at a higher resolution. If instead the fragments are
samples, each sample gets a unique colour, equivalent to rendering at a higher
resolution (supersampling). In Vulkan and OpenGL, fragment shaders generally
run per-pixel, but with "sample shading", the application can force the
fragment shader to run per-sample.

How does sample shading work from the driver's perspective? On a typical GPU,
it is simple: the driver compiles a fragment shader that calculates the colour
of a single sample, and sets a hardware bit to execute it per-sample instead of
per-pixel. There is only one bit of state associated with sample shading. The
hardware will execute the fragment shader multiple times per pixel, writing out
sample colours independently.

Easy, right?

Alas, Apple's "AGX" GPU is not typical.

Like older GPUs that did not support sample shading, AGX always executes the
fragment shader once per pixel, never once per sample. And yet, AGX _does_
support sample shading.

How? The AGX instruction set allows pixel shaders to output different colours
to each sample. The instruction used to output a colour[^1] takes a _set_ of
samples to modify, encoded as a bit mask. The default all-1's mask writes the
same value to all samples in a pixel, but a mask setting a single bit will
write only the single corresponding sample.

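Here's a toy Python model of that masked store (the names are made up; the real instruction writes to tile memory):

```python
# Toy model of the masked colour store described above. The mask
# selects which samples of the pixel receive the new colour.

def output_samples(pixel, mask, colour):
    """Write `colour` to each sample of `pixel` whose bit is set in `mask`."""
    for s in range(len(pixel)):
        if mask & (1 << s):
            pixel[s] = colour

pixel = ["old"] * 4

# The default all-1's mask writes every sample in the pixel...
output_samples(pixel, 0b1111, "red")
assert pixel == ["red", "red", "red", "red"]

# ...while a single-bit mask writes only the corresponding sample.
output_samples(pixel, 0b0100, "blue")
assert pixel == ["red", "red", "blue", "red"]
```
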
This design is unusual, and it requires driver backflips to translate "fragment
shaders" into hardware pixel shaders. How do we do it?

Physically, the hardware executes our shader once per pixel. Logically, we're
supposed to execute the application's fragment shader once per sample. If we
know the number of samples per pixel, then we can wrap the application's shader
in a loop over each sample. So, if the original fragment shader is:

```
interpolated colour = interpolate at current sample(input colour);
output current sample(interpolated colour);
```

then we will transform the program to the pixel shader:

```
for (sample = 0; sample < number of samples; ++sample) {
  sample mask = (1 << sample);
  interpolated colour = interpolate at sample(input colour, sample);
  output samples(sample mask, interpolated colour);
}
```

The original fragment shader runs inside the loop, once per sample. Whenever it
interpolates inputs at the current sample position, we change it to instead
interpolate at a specific sample given by the loop counter `sample`. Likewise,
when it outputs a colour for a sample, we change it to output the colour to the
single sample given by the loop counter.

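To convince ourselves the transform is sound, here's a toy Python model (all names hypothetical, interpolation faked) checking that the wrapped pixel shader, using masked stores, fills in the same per-sample colours as running the fragment shader once per sample would:

```python
# Toy model of the loop transform above: one pixel-shader invocation
# with masked stores produces the same framebuffer contents as one
# fragment-shader invocation per sample.

NUM_SAMPLES = 4

def interpolate_at_sample(input_colour, sample):
    # Stand-in for attribute interpolation: pretend each sample
    # position shifts the input colour slightly.
    return input_colour + 0.1 * sample

def transformed_pixel_shader(pixel, input_colour):
    """The pixel shader after the driver wraps it in a sample loop."""
    for sample in range(NUM_SAMPLES):
        mask = 1 << sample
        colour = interpolate_at_sample(input_colour, sample)
        for s in range(NUM_SAMPLES):  # masked store
            if mask & (1 << s):
                pixel[s] = colour

pixel = [0.0] * NUM_SAMPLES
transformed_pixel_shader(pixel, 1.0)

# Matches running the original shader independently for each sample:
assert pixel == [interpolate_at_sample(1.0, s) for s in range(NUM_SAMPLES)]
```
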
If the story ended here, this mechanism would be silly. Adding sample masks to
the instruction set is more complicated than a single bit to invoke the shader
multiple times, as other GPUs do. Even Apple's own Metal driver has to
implement this dance, because Metal has a similar approach to sample shading as
OpenGL and Vulkan. With all this extra complexity, is there a benefit?

If we generated that loop at the end, maybe not. But if we know at compile-time
that sample shading is used, we can run our full optimizer on this sample loop.
If there is an expression that is the same for all samples in a pixel, it can
be hoisted out of the loop.[^3] Instead of calculating the same value multiple
times, as other GPUs do, the value can be calculated just once and reused for
each sample. Although it complicates the driver, this approach to sample
shading isn't Apple cutting corners. If we slapped on the loop at the end and
did no optimizations, the resulting code would be comparable to what other GPUs
execute in hardware. There might be slight differences from spawning fewer
threads but executing more control flow instructions[^2], but that's minor.
Generating the loop early and running the optimizer enables better performance
than possible on other GPUs.

So is the mechanism only an optimization? Did Apple stumble on a better
approach to sample shading that other GPUs should adopt? I wouldn't be so sure.

Let's pull the curtain back. AGX has its roots as a _mobile_ GPU intended for
iPhones, with significant PowerVR heritage. Even if it powers Mac Pros today,
the mobile legacy means AGX prefers software implementations of many features
that desktop GPUs implement with dedicated hardware.

Yes, I'm talking about blending.

Blending is an operation in graphics APIs to combine the fragment shader
output colour with the existing colour in the framebuffer. It is usually used
to implement [alpha blending](https://en.wikipedia.org/wiki/Alpha_compositing),
to let the background poke through translucent objects.

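For reference, straight (unpremultiplied) alpha blending combines the two colours channel by channel. A toy Python sketch, not driver code:

```python
# Straight alpha blending: blended = a*src + (1-a)*dest per channel,
# where a is the source alpha.

def alpha_blend(src, dest):
    """Blend RGBA `src` over RGBA `dest` using the source alpha."""
    alpha = src[3]
    return tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(src, dest))

# A 50%-translucent red fragment over an opaque blue background:
src = (1.0, 0.0, 0.0, 0.5)
dest = (0.0, 0.0, 1.0, 1.0)
print(alpha_blend(src, dest))  # (0.5, 0.0, 0.5, 0.75)
```
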
When multisampling is used _without_ sample shading, although the fragment
shader only runs once per pixel, blending happens per-sample. Even if the
fragment shader outputs the same colour to each sample, if the framebuffer
already had different colours in different samples, blending needs to happen
per-sample to avoid losing that information already in the framebuffer.

A traditional desktop GPU blends with dedicated hardware. In the mobile space,
there's a mix of dedicated hardware and software. On AGX, blending is purely
software. Rather than configure blending hardware, the driver must produce
_variants_ of the fragment shader that include instructions to implement the
desired blend mode. With alpha blending, a fragment shader like:

```
colour = calculate lighting();
output(colour);
```

becomes:

```
colour = calculate lighting();
dest = load destination colour;
alpha = colour.alpha;
blended = (alpha * colour) + ((1 - alpha) * dest);
output(blended);
```

Where's the problem?

Blending happens per sample. Even if the application intends to run the
fragment shader per pixel, the shader _must_ run per sample for correct
blending. Compared to other GPUs, this approach to blending would regress
performance when blending and multisampling are enabled but sample shading is
not.

On the other hand, exposing multisample pixel shaders to the driver solves the
problem neatly. If both the blending and the multisample state are known, we
can first insert instructions for blending, and then wrap with the sample loop.
The above program would then become:

```
for (sample = 0; sample < number of samples; ++sample) {
  colour = calculate lighting();

  dest = load destination colour at sample (sample);
  alpha = colour.alpha;
  blended = (alpha * colour) + ((1 - alpha) * dest);

  sample mask = (1 << sample);
  output samples(sample mask, blended);
}
```

In this form, the fragment shader is asymptotically worse than the application
wanted: the fragment shader is executed inside the loop, running per-sample
unnecessarily.

Have no fear, the optimizer is here. Since `colour` is the same for each sample
in the pixel, it does not depend on the sample ID. The compiler can move the
entire original fragment shader (and related expressions) out of the per-sample
loop:

```
colour = calculate lighting();
alpha = colour.alpha;
inv_alpha = 1 - alpha;
colour_alpha = alpha * colour;

for (sample = 0; sample < number of samples; ++sample) {
  dest = load destination colour at sample (sample);
  blended = colour_alpha + (inv_alpha * dest);

  sample mask = (1 << sample);
  output samples(sample mask, blended);
}
```

Now blending happens per sample but the application's fragment shader runs just
once, matching the performance characteristics of traditional GPUs. Even
better, all of this happens without any special work from the compiler. There's
no magic multisampling optimization happening here: it's just a loop.

By the way, what do we do if we _don't_ know the blending and multisample state
at compile-time? Hope is not lost...

...but that's a story for another day.

## What's next?

While OpenGL ES 3.0 is an improvement over ES 2.0, we're not done. In my
work-in-progress branch, OpenGL ES 3.1 support is nearly finished, which will
unlock compute shaders.

The final goal is a Vulkan driver running modern games. We're a while away, but
the baseline Vulkan 1.0 requirements parallel OpenGL ES 3.1, so our work
translates to Vulkan. For example, the multisampling compiler passes described
above are common code between the drivers. We've tested them against OpenGL,
and now they're ready to go for Vulkan.

And yes, [the team](https://github.com/ella-0) is already working on Vulkan.

Until then, you're one `pacman -Syu` away from enjoying OpenGL 3.1!

[^1]: Store a formatted value to local memory acting as a tilebuffer.
[^2]: Since the number of samples is constant, all threads branch in the same direction, so the usual "GPUs are bad at branching" advice does not apply.
[^3]: Via [common subexpression elimination](https://en.wikipedia.org/wiki/Common_subexpression_elimination) if the [loop is unrolled](https://en.wikipedia.org/wiki/Loop_unrolling), otherwise via [code motion](https://en.wikipedia.org/wiki/Code_motion).