Commit 406754b

2023/06/06-opengl-3.1.md: New article

Signed-off-by: Hector Martin <marcan@marcan.st>
1 parent 85a2137 commit 406754b

2 files changed, +264 -0 lines changed

content/blog/2023/06/06-opengl-3.1.md (264 additions, 0 deletions)

+++
date = "2023-06-06T22:00:00+09:00"
draft = false
title = "OpenGL 3.1 on Asahi Linux"
slug = "opengl-3-1-on-asahi-linux"
author = "Alyssa Rosenzweig"
+++

Upgrade your [Asahi Linux](https://asahilinux.org/) systems, because your
graphics drivers are getting a big boost: leapfrogging from OpenGL 2.1 over
OpenGL 3.0 up to OpenGL 3.1! Similarly, the OpenGL ES 2.0 support is bumping up
to OpenGL ES 3.0. That means more playable games and more functioning
applications.

Back in December, I teased an early screenshot of SuperTuxKart's deferred
renderer working on Asahi, using OpenGL ES 3.0 features like multiple render
targets and instancing. Now you too can enjoy SuperTuxKart with advanced
lighting the way it's meant to be:

{{< captioned caption="SuperTuxKart rendering with advanced lighting" >}}
<img src="/img/blog/2023/06/STK-1080p.webp" alt="SuperTuxKart rendering with advanced lighting">
{{< /captioned >}}

As before, these drivers are experimental and not yet conformant to the OpenGL
or OpenGL ES specifications. For now, you'll need to run our `-edge` packages
to opt in to the work-in-progress drivers, understanding that there may be
bugs. Please refer to [our previous
post](https://asahilinux.org/2022/12/gpu-drivers-now-in-asahi-linux/)
explaining how to install the drivers and how to report bugs to help us
improve.

With that disclaimer out of the way, there's a LOT of new functionality packed
into OpenGL 3.0, 3.1, and OpenGL ES 3.0 that went into making this release.
Highlights include:

* Multiple render targets
* Multisampling
* [Transform feedback](https://cgit.freedesktop.org/mesa/mesa/commit/?id=d72e1418ce4f66c42f20779f50f40091d3d310b0)
* [Texture buffer objects](https://social.treehouse.systems/@alyssa/109542058314148170)
* ...and more.

For now, let's talk about...

## Multisampling

Vulkan and OpenGL support _multisampling_, short for _multisampled
anti-aliasing_. In graphics, _aliasing_ causes jagged diagonal edges due to
rendering at insufficient resolution. One solution to aliasing is rendering at
higher resolutions and scaling down. Edges will be blurred, not jagged, which
looks better. Multisampling is an efficient implementation of that idea.

A _multisampled_ image contains multiple _samples_ for every pixel. After
rendering, a multisampled image is _resolved_ to a regular image with one
sample per pixel, typically by averaging the samples within a pixel.
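
From the application's point of view, this is ordinary OpenGL (ES 3.0)
framebuffer plumbing. As a rough sketch (the 4x sample count, sizes, and helper
names are arbitrary, and a current context is assumed), allocating a
multisampled colour buffer and resolving it looks like:

```
#include <GLES3/gl3.h>

/* Sketch: allocate a 4x multisampled colour buffer and later resolve it to
 * the (single-sampled) default framebuffer. Assumes a current ES 3.0 context;
 * names and the sample count are illustrative. */
static GLuint make_msaa_framebuffer(GLsizei width, GLsizei height)
{
    GLuint fbo, rbo;

    glGenRenderbuffers(1, &rbo);
    glBindRenderbuffer(GL_RENDERBUFFER, rbo);
    /* 4 samples per pixel: the GPU keeps 4 colours for every pixel. */
    glRenderbufferStorageMultisample(GL_RENDERBUFFER, 4, GL_RGBA8, width, height);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                              GL_RENDERBUFFER, rbo);
    return fbo;
}

static void resolve_to_window(GLuint msaa_fbo, GLsizei width, GLsizei height)
{
    /* The resolve: blit the multisampled image to a single-sampled one,
     * averaging the samples within each pixel. */
    glBindFramebuffer(GL_READ_FRAMEBUFFER, msaa_fbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
    glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);
}
```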

Apple GPUs support multisampled images and framebuffers. There's quite a bit of
typing to plumb the programmer's view of multisampling into the form understood
by the hardware, but there's no fundamental incompatibility.

The trouble comes with _sample shading_. Recall that in modern graphics, the
colour of each _fragment_ is determined by running a _fragment shader_ given by
the programmer. If the fragments are pixels, then each sample within that pixel
gets the same colour. Running the fragment shader once per pixel still benefits
from multisampling thanks to higher quality rasterization, but it's not as good
as *actually* rendering at a higher resolution. If instead the fragments are
samples, each sample gets a unique colour, equivalent to rendering at a higher
resolution (supersampling). In Vulkan and OpenGL, fragment shaders generally
run per-pixel, but with "sample shading", the application can force the
fragment shader to run per-sample.
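
For illustration, here is roughly what that knob looks like on the application
side. Note this is the OpenGL ES 3.2 / desktop GL 4.0 sample shading API,
newer than the versions this release targets; it's shown only to make "force
the fragment shader to run per-sample" concrete:

```
#include <GLES3/gl32.h>

/* Sketch: force per-sample shading from the API. This is the ES 3.2 /
 * desktop GL 4.0 interface; assumes a context that exposes it. */
void enable_sample_shading(void)
{
    glEnable(GL_SAMPLE_SHADING);
    /* 1.0 = shade every sample; smaller values let the driver shade only a
     * fraction of the samples per pixel. */
    glMinSampleShading(1.0f);
}
```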

How does sample shading work from the drivers' perspective? On a typical GPU,
it is simple: the driver compiles a fragment shader that calculates the colour
of a single sample, and sets a hardware bit to execute it per-sample instead of
per-pixel. There is only one bit of state associated with sample shading. The
hardware will execute the fragment shader multiple times per pixel, writing out
sample colours independently.

Easy, right?

Alas, Apple's "AGX" GPU is not typical.

AGX always executes the shader once per pixel, not once per sample, like older
GPUs that did not support sample shading. AGX _does_ support it, though.

How? The AGX instruction set allows pixel shaders to output different colours
to each sample. The instruction used to output a colour[^1] takes a _set_ of samples to
modify, encoded as a bit mask. The default all-1's mask writes the same value
to all samples in a pixel, but a mask setting a single bit will write only the
single corresponding sample.
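
The mask itself is ordinary bit-twiddling. As a sketch (this is only the
logical encoding, not the actual AGX instruction format):

```
#include <stdint.h>

/* Illustration of the mask encoding only; not the real AGX instruction bits. */
static inline uint32_t all_samples_mask(unsigned nr_samples)
{
    return (1u << nr_samples) - 1; /* e.g. 0b1111 writes all 4 samples */
}

static inline uint32_t single_sample_mask(unsigned sample)
{
    return 1u << sample; /* e.g. 0b0100 writes only sample 2 */
}
```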

This design is unusual, and it requires driver backflips to translate "fragment
shaders" into hardware pixel shaders. How do we do it?

Physically, the hardware executes our shader once per pixel. Logically, we're
supposed to execute the application's fragment shader once per sample. If we
know the number of samples per pixel, then we can wrap the application's shader
in a loop over each sample. So, if the original fragment shader is:

```
interpolated colour = interpolate at current sample(input colour);
output current sample(interpolated colour);
```

then we will transform the program to the pixel shader:

```
for (sample = 0; sample < number of samples; ++sample) {
    sample mask = (1 << sample);
    interpolated colour = interpolate at sample(input colour, sample);
    output samples(sample mask, interpolated colour);
}
```

The original fragment shader runs inside the loop, once per sample. Whenever it
interpolates inputs at the current sample position, we change it to instead
interpolate at a specific sample given by the loop counter `sample`. Likewise,
when it outputs a colour for a sample, we change it to output the colour to the
single sample given by the loop counter.

If the story ended here, this mechanism would be silly. Adding
sample masks to the instruction set is more complicated than a single bit to
invoke the shader multiple times, as other GPUs do. Even Apple's own Metal
driver has to implement this dance, because Metal has a similar approach to
sample shading as OpenGL and Vulkan. With all this extra complexity, is there a
benefit?

If we generated that loop at the end of compilation, maybe not. But if we know
at compile-time that sample shading is used, we can run our full optimizer on
this sample loop. If there is an expression that is the same for all samples in
a pixel, it can be hoisted out of the loop.[^3] Instead of
calculating the same value multiple times, as other GPUs do, the value can be
calculated just once and reused for each sample. Although it complicates the
driver, this approach to sample shading isn't Apple cutting corners. If we
slapped on the loop at the end and did no optimizations, the resulting code
would be comparable to what other GPUs execute in hardware. There might be
slight differences from spawning fewer threads but executing more control flow
instructions[^2], but that's minor. Generating the loop early and running the optimizer
enables better performance than possible on other GPUs.

So is the mechanism only an optimization? Did Apple stumble on a better
approach to sample shading that other GPUs should adopt? I wouldn't be so sure.

Let's pull the curtain back. AGX has its roots as a _mobile_ GPU intended for
iPhones, with significant PowerVR heritage. Even if it powers Mac Pros today,
the mobile legacy means AGX prefers software implementations of many features
that desktop GPUs implement with dedicated hardware.

Yes, I'm talking about blending.

Blending is an operation in graphics APIs to combine the fragment shader
output colour with the existing colour in the framebuffer. It is usually used
to implement [alpha blending](https://en.wikipedia.org/wiki/Alpha_compositing),
to let the background poke through translucent objects.
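
In OpenGL, the blend mode is plain API state. A minimal sketch of the classic
alpha blending setup looks like this:

```
#include <GLES3/gl3.h>

/* Sketch: classic alpha blending state.
 * result = src.alpha * src + (1 - src.alpha) * dst */
void enable_alpha_blending(void)
{
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
}
```

On AGX there is no blend hardware to consume that state; as the next paragraphs
explain, the driver folds it into the fragment shader instead.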

When multisampling is used _without_ sample shading, although the fragment
shader only runs once per pixel, blending happens per-sample. Even if the
fragment shader outputs the same colour to each sample, if the framebuffer
already had different colours in different samples, blending needs to happen
per-sample to avoid losing that information already in the framebuffer.

A traditional desktop GPU blends with dedicated hardware. In the
mobile space, there's a mix of dedicated hardware and software. On AGX,
blending is purely software. Rather than configure blending hardware, the
driver must produce _variants_ of the fragment shader that include
instructions to implement the desired blend mode. With alpha
blending, a fragment shader like:

```
colour = calculate lighting();
output(colour);
```

becomes:

```
colour = calculate lighting();
dest = load destination colour;
alpha = colour.alpha;
blended = (alpha * colour) + ((1 - alpha) * dest);
output(blended);
```

Where's the problem?

Blending happens per sample. Even if the application intends to run
the fragment shader per pixel, the shader _must_ run per sample for
correct blending. Compared to other GPUs, this approach to blending would
regress performance when blending and multisampling are enabled but sample
shading is not.

On the other hand, exposing multisample pixel shaders to the driver solves the
problem neatly. If both the blending and the multisample state are known, we
can first insert instructions for blending, and then wrap with the sample loop.
The above program would then become:

```
for (sample = 0; sample < number of samples; ++sample) {
    colour = calculate lighting();

    dest = load destination colour at sample (sample);
    alpha = colour.alpha;
    blended = (alpha * colour) + ((1 - alpha) * dest);

    sample mask = (1 << sample);
    output samples(sample mask, blended);
}
```

In this form, the fragment shader is asymptotically worse than the application
wanted: the fragment shader is executed inside the loop, running per-sample
unnecessarily.

Have no fear, the optimizer is here. Since `colour` is the same for each sample
in the pixel, it does not depend on the sample ID. The compiler can move the
entire original fragment shader (and related expressions) out of the per-sample
loop:

```
colour = calculate lighting();
alpha = colour.alpha;
inv_alpha = 1 - alpha;
colour_alpha = alpha * colour;

for (sample = 0; sample < number of samples; ++sample) {
    dest = load destination colour at sample (sample);
    blended = colour_alpha + (inv_alpha * dest);

    sample mask = (1 << sample);
    output samples(sample mask, blended);
}
```

Now blending happens per sample but the application's fragment shader runs just
once, matching the performance characteristics of traditional GPUs. Even
better, all of this happens without any special work from the compiler. There's
no magic multisampling optimization happening here: it's just a loop.

By the way, what do we do if we _don't_ know the blending and multisample state
at compile-time? Hope is not lost...

...but that's a story for another day.

## What's next?

While OpenGL ES 3.0 is an improvement over ES 2.0, we're not done. In my
work-in-progress branch, OpenGL ES 3.1 support is nearly finished, which will
unlock compute shaders.

The final goal is a Vulkan driver running modern games. We're a while away, but
the baseline Vulkan 1.0 requirements parallel OpenGL ES 3.1, so our work
translates to Vulkan. For example, the multisampling compiler passes described
above are common code between the drivers. We've tested them against OpenGL,
and now they're ready to go for Vulkan.

And yes, [the team](https://github.com/ella-0) is already working on Vulkan.

Until then, you're one `pacman -Syu` away from enjoying OpenGL 3.1!

[^1]: Store a formatted value to local memory acting as a tilebuffer.
[^2]: Since the number of samples is constant, all threads branch in the same direction so the usual "GPUs are bad at branching" advice does not apply.
[^3]: Via [common subexpression
elimination](https://en.wikipedia.org/wiki/Common_subexpression_elimination) if
the [loop is unrolled](https://en.wikipedia.org/wiki/Loop_unrolling), otherwise
via [code motion](https://en.wikipedia.org/wiki/Code_motion).

Binary file (123 KB) not shown.