Switch primitives to be stored in SoA style in vertex textures. #448

glennw · 2016-10-18T05:20:30Z

The primary goals of this patch are:

(1) Store primitives in SoA style. This makes some passes, such as
culling primitives much faster due to better cache coherency.
(2) Store primitives in flat arrays, for direct upload to GPU. This
reduces the amount of redundant copying on the CPU during
frame construction.
(3) Decouple primitive information from geometry instances. This allows
us to improve occlusion culling, by submitting segments of primitives
without any extra CPU overhead of copying primitive data.
(4) Allow incremental updates of the GPU SoA arrays during scrolling and
zoom. This isn't implemented yet, but the framework is in place and
will allow significant backend/compositor thread improvements during
scrolling and zoom frames.

There's also a number of other changes that have been rolled in to this
patch since it made sense to fix them at the same time. In particular:

Blend/composite batches are stored as ints, avoiding CPU float packing overhead.
Border segment rectangles are calculated in the vertex shader rather than CPU.
Gradient colors use vertex shader interpolators for axis aligned gradients, reducing FS overhead.
Removed separate Text/TextRun shader types, resulting in better batching and fewer draw calls.
Angle gradients support arbitrary stop count in primitive data.
Removed some unused interpolators from the border shader.
Axis aligned gradient segments are calculated in vertex shader, reducing CPU overhead.
Reduced size of packed image structure - UV type is stored in sign of UV field.
Reduced size of packed glyph structure - color stored per run, rect calculated from UVs.
Moved some utility types from tiling.rs to util.rs.
Remove clip cases from batch enum, store needs_clipping flag in batch key.

This change is

glennw · 2016-10-18T05:20:48Z

r? @pcwalton

I apologize for the size of this patch - if it's too difficult to review, let me know and I'll split it up tomorrow into smaller patches. I've run this locally against WPT and CSS tests successfully. I've also tested on my Linux laptop and Mac Mini. Some timings are pasted below.

glennw · 2016-10-18T05:21:03Z

Some representative timings from my laptop. Note that this change doesn't actually do any incremental updates yet, so there are some good sized wins still to come from follow ups to this patch.

https://en.wikipedia.org
Backend (ms): 4.5 -> 3.7
Compositor (ms): 3.3 -> 2.8
GPU (ms): 2.8 -> 2.5

https://news.ycombinator.com
Backend (ms): 2.6 -> 2.3
Compositor (ms): 1.8 -> 1.2
GPU (ms): 1.3 -> 0.8

https://reddit.com
Backend (ms): 3.6 -> 3.2
Compositor (ms): 4.5 -> 4.5
GPU (ms): 3.4 -> 3.2

https://github.com/servo/servo
Backend (ms): 5.0 -> 3.5
Compositor (ms): 3.0 -> 2.2
GPU (ms): 4.3 -> 3.3

kvark · 2016-10-18T16:48:49Z

I'd be happy to review this gigantic change as well.

glennw · 2016-10-18T18:52:18Z

@kvark @pcwalton Yep, feel free to both look at it - and apologies again for the size of this patch!

bors-servo · 2016-10-18T23:06:35Z

☔ The latest upstream changes (presumably #432) made this pull request unmergeable. Please resolve the merge conflicts.

The primary goals of this patch are: (1) Store primitives in SoA style. This makes some passes, such as culling primitives much faster due to better cache coherency. (2) Store primitives in flat arrays, for direct upload to GPU. This reduces the amount of redundant copying on the CPU during frame construction. (3) Decouple primitive information from geometry instances. This allows us to improve occlusion culling, by submitting segments of primitives without any extra CPU overhead of copying primitive data. (4) Allow incremental updates of the GPU SoA arrays during scrolling and zoom. This isn't implemented yet, but the framework is in place and will allow significant backend/compositor thread improvements during scrolling and zoom frames. There's also a number of other changes that have been rolled in to this patch since it made sense to fix them at the same time. In particular: * Blend/composite batches are stored as ints, avoiding CPU float packing overhead. * Border segment rectangles are calculated in the vertex shader rather than CPU. * Gradient colors use vertex shader interpolators for axis aligned gradients, reducing FS overhead. * Removed separate Text/TextRun shader types, resulting in better batching and fewer draw calls. * Angle gradients support arbitrary stop count in primitive data. * Removed some unused interpolators from the border shader. * Axis aligned gradient segments are calculated in vertex shader, reducing CPU overhead. * Reduced size of packed image structure - UV type is stored in sign of UV field. * Reduced size of packed glyph structure - color stored per run, rect calculated from UVs. * Moved some utility types from tiling.rs to util.rs. * Remove clip cases from batch enum, store needs_clipping flag in batch key.

kvark · 2016-10-19T14:53:03Z

Review notes:

the approach seems difficult to debug or reason about, given that all the data is in the blobs now, and lots of indices passed around. Quite a bit of redirection on the GPU side.
the code complexity is dramatically increased now
get_fetch_uv_X() functions are tightly coupled with one or more texelFetchOffset calls. X=1 means one fetch from sData16, X=2 means 2 fetches from sData32, and so on. We could have GpuData16 get_data_1(int index), GpuData32 get_data_2(int index), and so on instead, saving quite a bit of code.

Some thoughts about the change:

looks quite similar to ECS approach, just on GPU. The Primitive being an entity and having indices into various components stored in GPU arrays/textures.
this would look much better with HLSL structured buffers, too bad GL doesn't support typing of it load/store images...
all the gl_InstanceID-indexed lookups are wasted, given that these could just be instances vertex attributes (optimization opportunity). I only see them used in load_primitive().
why do we need #line 1 in several shaders?
no protection against out-of-bound access. Perhaps, we can have a special define flag to enable bound checks in the shaders?

Other observations through the code:

local_pos = clamp(local_pos, cp0, cp1); is being followed by another clamp with the local_clip_rect. Given that cp0 and cp1 are rounded coordinates of the same clip rectangle, it looks weird. Is this by design?
ps_border seems to have a lot of redundant code. We could vec4 mr0 = max(border.radii[0], border.widths); and vec4 mr1 = max(border.radii[1], border.widths);, followed by the use of mr0 and mr1. Probably a task for a separate PR :)
TextureCoordKind::Pixel should probably be TextureCoordKind::Texel?
the GRADIENT_VERTICAL path seems suspicious. Shouldn't it be using g0.offset.y and g1.offset.y?

I've yet to check tiling.rs changes, for Github not showing them, but don't wait for me if you need to merge this. I imagine the pain of maintaining this PR, probably needs to get in fast :)

glennw · 2016-10-19T19:46:19Z

@kvark Thanks for taking a look, comments below:

The complexity is increased by some amount, I think it's a worthwhile tradeoff for the performance gains (there's some much larger gains coming once I add incremental updates as a follow up).
Switching the get_fetch_uv() calls to get_data() calls seems like a good idea. Ideally we'd do this as a follow up in order to get this landed soon.
Agreed that changing gl_instanceid to a vertex attribute could be a good optimization - it's unrelated to this PR though, it can be done as a follow up.
Having #line1 just makes it easier to see where shader errors are, since several files are concatenated together before being compiled.
The two clamps are for the clip rect of the primitive, and the clip rect of its owning stacking context layer.
ps_border definitely needs a lot of optimization - we wanted to get it mostly feature complete as a first step, then we can worry about optimization after that.
TextureCoordKind rename sounds good - let's do that as a follow up since it's unrelated to this.
The offset field for gradients is OK. It's a vec4 for alignment purposes but the gradient offset is a scalar that's stored in the X component. We could probably simplify this by fetching the vec4 but only returning it as a float in the shader code.

Thanks for looking over the changes! :)

kvark · 2016-10-19T20:24:22Z

@glennw makes sense, thanks for the answers!

pcwalton

Partial review

pcwalton · 2016-10-20T00:47:12Z

webrender/res/ps_angle_gradient.vs.glsl

-        vOffsets[i] = gradient.offsets[i];
+    int stop_index = int(prim.user_data.x);
+
+    for (int i=0 ; i < vStopCount ; ++i) {


nit: weird indentation here.

I think that was the previous file?

pcwalton · 2016-10-20T00:47:59Z

webrender/res/ps_border.vs.glsl

+                                tr_inner.x - tl_inner.x,
+                                border.widths.y);
+            break;
+    }


We can eliminate this switch at some point long-term, I hope?

Yes - border shader needs a lot of optimization work :)

pcwalton · 2016-10-20T00:49:46Z

webrender/src/device.rs

+                                TextureId(0),
+                                TextureId(0),
+                                TextureId(0),
+                                TextureId(0),


Does [TextureId(0); 16] not work?

pcwalton · 2016-10-20T00:52:03Z

webrender/src/prim_store.rs

+}
+
+#[derive(Debug, Clone)]
+pub struct TextRunPrimitiveCpu {


What's the difference between this and TextRunPrimitiveGpu?

Structs get split SoA style into portions that get directly pushed into a vertex texture (the gpu version) and cpu-side only portions (typically used to look up resource lists etc).

pcwalton · 2016-10-20T00:52:24Z

webrender/src/prim_store.rs

+        Clip {
+            rect: ClipRect {
+                rect: rect,
+                padding: [0.0, 0.0, 0.0, 0.0],


This can probably just be [0.0; 4], right?

pcwalton · 2016-10-20T00:53:31Z

webrender/src/prim_store.rs

+                    self.gpu_data16.push(InstanceRect {
+                        rect: rect,
+                    });
+                }


nit: Could just be

self.gpu_data16.extend(instance_rects.into_iter().map(|rect| InstanceRect { rect: rect, }))

pcwalton · 2016-10-20T00:54:13Z

webrender/src/prim_store.rs

+                clip_data[1] = GpuBlock32::from(clip.top_left.clone());
+                clip_data[2] = GpuBlock32::from(clip.top_right.clone());
+                clip_data[3] = GpuBlock32::from(clip.bottom_left.clone());
+                clip_data[4] = GpuBlock32::from(clip.bottom_right.clone());


Do you need to clone the clip components?

pcwalton · 2016-10-20T00:54:46Z

webrender/src/prim_store.rs

+                    TextureCoordKind::Pixel => {
+                        uv1.x = -uv1.x;
+                    }
+                }


uber-nit: no need for the {} after the last =>

pcwalton · 2016-10-20T00:55:22Z

webrender/src/prim_store.rs

+                        *dest = GpuBlock32::from(GradientStop {
+                            offset: 1.0 - src.offset,
+                            color: src.color,
+                            padding: [0.0, 0.0, 0.0],


Can probably be just [0.0; 3]

pcwalton · 2016-10-20T00:56:53Z

webrender/src/util.rs

+            TransformedRectKind::Complex
+        };
+
+/*


I prefer not to have commented out code, but I understand why you did it here. I think it'd be good to add a FIXME at the top saying what the commented out code does and when to reenable it.

pcwalton · 2016-10-20T22:01:35Z

@@ -179,14 +344,17 @@ impl AlphaBatcher {
                     AlphaRenderItem::Primitive(sc_index, prim_index) => {
                         // See if this task fits into the tile UBO
                         let layer = &ctx.layer_store[sc_index.0];
-                        let prim = &ctx.prim_store[prim_index.0];
+                        let prim_metadata = ctx.prim_store.get_metadata(prim_index);
                         let transform_kind = layer.xf_rect.as_ref().unwrap().kind;
                         let needs_blending = transform_kind == TransformedRectKind::Complex ||
-                                             !prim.is_opaque(ctx.resource_cache, ctx.frame_id);
-                        let batch_kind = prim.batch_kind();
-                        let color_texture_id = prim.color_texture_id(ctx.resource_cache,
-                                                                     ctx.frame_id);
-                        let flags = AlphaBatchKeyFlags::new(transform_kind, needs_blending);
+                                             !prim_metadata.is_opaque ||
+                                             prim_metadata.clip_index.is_some();
+                        let needs_clipping = prim_metadata.clip_index.is_some();

nit: Instead of repeating prim_metadata.clip_index.is_some(), could you just say needs_blending = transform_kind == TransformedRectKind::Complex || needs_clipping;?

@@ -1796,8 +770,12 @@ impl AlphaBatchKey {
 struct AlphaBatchKeyFlags(u8);

Can we add a FIXME to change this into a bitflags! at some point?

pcwalton · 2016-10-20T22:01:45Z

This looks good to me with those nits addressed!

glennw · 2016-10-20T22:13:40Z

@bors-servo r=pcwalton

bors-servo · 2016-10-20T22:13:40Z

📌 Commit f112cf6 has been approved by pcwalton

bors-servo · 2016-10-20T22:13:42Z

⌛ Testing commit f112cf6 with merge 12af87a...

Switch primitives to be stored in SoA style in vertex textures. The primary goals of this patch are: (1) Store primitives in SoA style. This makes some passes, such as culling primitives much faster due to better cache coherency. (2) Store primitives in flat arrays, for direct upload to GPU. This reduces the amount of redundant copying on the CPU during frame construction. (3) Decouple primitive information from geometry instances. This allows us to improve occlusion culling, by submitting segments of primitives without any extra CPU overhead of copying primitive data. (4) Allow incremental updates of the GPU SoA arrays during scrolling and zoom. This isn't implemented yet, but the framework is in place and will allow significant backend/compositor thread improvements during scrolling and zoom frames. There's also a number of other changes that have been rolled in to this patch since it made sense to fix them at the same time. In particular: * Blend/composite batches are stored as ints, avoiding CPU float packing overhead. * Border segment rectangles are calculated in the vertex shader rather than CPU. * Gradient colors use vertex shader interpolators for axis aligned gradients, reducing FS overhead. * Removed separate Text/TextRun shader types, resulting in better batching and fewer draw calls. * Angle gradients support arbitrary stop count in primitive data. * Removed some unused interpolators from the border shader. * Axis aligned gradient segments are calculated in vertex shader, reducing CPU overhead. * Reduced size of packed image structure - UV type is stored in sign of UV field. * Reduced size of packed glyph structure - color stored per run, rect calculated from UVs. * Moved some utility types from tiling.rs to util.rs. * Remove clip cases from batch enum, store needs_clipping flag in batch key.  --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/webrender/448)

bors-servo · 2016-10-20T22:53:18Z

☀️ Test successful - status-travis

glennw force-pushed the prim-vt3 branch from 84b4f62 to e64ee71 Compare October 18, 2016 23:36

glennw force-pushed the prim-vt3 branch from e64ee71 to 4615f96 Compare October 18, 2016 23:38

pcwalton reviewed Oct 20, 2016

View reviewed changes

gw3583 added 2 commits October 20, 2016 12:02

Address some review comments.

f51ae50

Fix image_clip shader in b.html

5908f8c

Address review comments.

f112cf6

bors-servo merged commit f112cf6 into servo:master Oct 20, 2016

glennw deleted the prim-vt3 branch October 20, 2016 22:53

paulrouget mentioned this pull request Nov 21, 2016

macOS nightly shows black page on startup servo/servo#13953

Closed

Switch primitives to be stored in SoA style in vertex textures. #448

Switch primitives to be stored in SoA style in vertex textures. #448

Uh oh!

Conversation

glennw commented Oct 18, 2016 • edited by larsbergstrom Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

glennw commented Oct 18, 2016

Uh oh!

glennw commented Oct 18, 2016

Uh oh!

kvark commented Oct 18, 2016

Uh oh!

glennw commented Oct 18, 2016

Uh oh!

bors-servo commented Oct 18, 2016

Uh oh!

kvark commented Oct 19, 2016

Uh oh!

glennw commented Oct 19, 2016

Uh oh!

kvark commented Oct 19, 2016

Uh oh!

pcwalton left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pcwalton commented Oct 20, 2016

Uh oh!

pcwalton commented Oct 20, 2016

Uh oh!

glennw commented Oct 20, 2016

Uh oh!

bors-servo commented Oct 20, 2016

Uh oh!

bors-servo commented Oct 20, 2016

Uh oh!

bors-servo commented Oct 20, 2016

Uh oh!

Uh oh!

glennw commented Oct 18, 2016 •

edited by larsbergstrom

Loading