Skip to content

Switch primitives to be stored in SoA style in vertex textures. #448

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 20, 2016

Conversation

glennw
Copy link
Member

@glennw glennw commented Oct 18, 2016

The primary goals of this patch are:

(1) Store primitives in SoA style. This makes some passes, such as
culling primitives much faster due to better cache coherency.
(2) Store primitives in flat arrays, for direct upload to GPU. This
reduces the amount of redundant copying on the CPU during
frame construction.
(3) Decouple primitive information from geometry instances. This allows
us to improve occlusion culling, by submitting segments of primitives
without any extra CPU overhead of copying primitive data.
(4) Allow incremental updates of the GPU SoA arrays during scrolling and
zoom. This isn't implemented yet, but the framework is in place and
will allow significant backend/compositor thread improvements during
scrolling and zoom frames.

There's also a number of other changes that have been rolled in to this
patch since it made sense to fix them at the same time. In particular:

  • Blend/composite batches are stored as ints, avoiding CPU float packing overhead.
  • Border segment rectangles are calculated in the vertex shader rather than CPU.
  • Gradient colors use vertex shader interpolators for axis aligned gradients, reducing FS overhead.
  • Removed separate Text/TextRun shader types, resulting in better batching and fewer draw calls.
  • Angle gradients support arbitrary stop count in primitive data.
  • Removed some unused interpolators from the border shader.
  • Axis aligned gradient segments are calculated in vertex shader, reducing CPU overhead.
  • Reduced size of packed image structure - UV type is stored in sign of UV field.
  • Reduced size of packed glyph structure - color stored per run, rect calculated from UVs.
  • Moved some utility types from tiling.rs to util.rs.
  • Remove clip cases from batch enum, store needs_clipping flag in batch key.

This change is Reviewable

@glennw
Copy link
Member Author

glennw commented Oct 18, 2016

r? @pcwalton

I apologize for the size of this patch - if it's too difficult to review, let me know and I'll split it up tomorrow into smaller patches. I've run this locally against WPT and CSS tests successfully. I've also tested on my Linux laptop and Mac Mini. Some timings are pasted below.

@glennw
Copy link
Member Author

glennw commented Oct 18, 2016

Some representative timings from my laptop. Note that this change doesn't actually do any incremental updates yet, so there are some good sized wins still to come from follow ups to this patch.

https://en.wikipedia.org
Backend (ms): 4.5 -> 3.7
Compositor (ms): 3.3 -> 2.8
GPU (ms): 2.8 -> 2.5

https://news.ycombinator.com
Backend (ms): 2.6 -> 2.3
Compositor (ms): 1.8 -> 1.2
GPU (ms): 1.3 -> 0.8

https://reddit.com
Backend (ms): 3.6 -> 3.2
Compositor (ms): 4.5 -> 4.5
GPU (ms): 3.4 -> 3.2

https://github.com/servo/servo
Backend (ms): 5.0 -> 3.5
Compositor (ms): 3.0 -> 2.2
GPU (ms): 4.3 -> 3.3

@kvark
Copy link
Member

kvark commented Oct 18, 2016

I'd be happy to review this gigantic change as well.

@glennw
Copy link
Member Author

glennw commented Oct 18, 2016

@kvark @pcwalton Yep, feel free to both look at it - and apologies again for the size of this patch!

@bors-servo
Copy link
Contributor

☔ The latest upstream changes (presumably #432) made this pull request unmergeable. Please resolve the merge conflicts.

The primary goals of this patch are:

(1) Store primitives in SoA style. This makes some passes, such as
    culling primitives much faster due to better cache coherency.
(2) Store primitives in flat arrays, for direct upload to GPU. This
    reduces the amount of redundant copying on the CPU during
    frame construction.
(3) Decouple primitive information from geometry instances. This allows
    us to improve occlusion culling, by submitting segments of primitives
    without any extra CPU overhead of copying primitive data.
(4) Allow incremental updates of the GPU SoA arrays during scrolling and
    zoom. This isn't implemented yet, but the framework is in place and
    will allow significant backend/compositor thread improvements during
    scrolling and zoom frames.

There's also a number of other changes that have been rolled in to this
patch since it made sense to fix them at the same time. In particular:

* Blend/composite batches are stored as ints, avoiding CPU float packing overhead.
* Border segment rectangles are calculated in the vertex shader rather than CPU.
* Gradient colors use vertex shader interpolators for axis aligned gradients, reducing FS overhead.
* Removed separate Text/TextRun shader types, resulting in better batching and fewer draw calls.
* Angle gradients support arbitrary stop count in primitive data.
* Removed some unused interpolators from the border shader.
* Axis aligned gradient segments are calculated in vertex shader, reducing CPU overhead.
* Reduced size of packed image structure - UV type is stored in sign of UV field.
* Reduced size of packed glyph structure - color stored per run, rect calculated from UVs.
* Moved some utility types from tiling.rs to util.rs.
* Remove clip cases from batch enum, store needs_clipping flag in batch key.
@kvark
Copy link
Member

kvark commented Oct 19, 2016

Review notes:

  • the approach seems difficult to debug or reason about, given that all the data is in the blobs now, and lots of indices passed around. Quite a bit of redirection on the GPU side.
  • the code complexity is dramatically increased now
  • get_fetch_uv_X() functions are tightly coupled with one or more texelFetchOffset calls. X=1 means one fetch from sData16, X=2 means 2 fetches from sData32, and so on. We could have GpuData16 get_data_1(int index), GpuData32 get_data_2(int index), and so on instead, saving quite a bit of code.

Some thoughts about the change:

  • looks quite similar to ECS approach, just on GPU. The Primitive being an entity and having indices into various components stored in GPU arrays/textures.
  • this would look much better with HLSL structured buffers, too bad GL doesn't support typing of it load/store images...
  • all the gl_InstanceID-indexed lookups are wasted, given that these could just be instances vertex attributes (optimization opportunity). I only see them used in load_primitive().
  • why do we need #line 1 in several shaders?
  • no protection against out-of-bound access. Perhaps, we can have a special define flag to enable bound checks in the shaders?

Other observations through the code:

  • local_pos = clamp(local_pos, cp0, cp1); is being followed by another clamp with the local_clip_rect. Given that cp0 and cp1 are rounded coordinates of the same clip rectangle, it looks weird. Is this by design?
  • ps_border seems to have a lot of redundant code. We could vec4 mr0 = max(border.radii[0], border.widths); and vec4 mr1 = max(border.radii[1], border.widths);, followed by the use of mr0 and mr1. Probably a task for a separate PR :)
  • TextureCoordKind::Pixel should probably be TextureCoordKind::Texel?
  • the GRADIENT_VERTICAL path seems suspicious. Shouldn't it be using g0.offset.y and g1.offset.y?

I've yet to check tiling.rs changes, for Github not showing them, but don't wait for me if you need to merge this. I imagine the pain of maintaining this PR, probably needs to get in fast :)

@glennw
Copy link
Member Author

glennw commented Oct 19, 2016

@kvark Thanks for taking a look, comments below:

  • The complexity is increased by some amount, I think it's a worthwhile tradeoff for the performance gains (there's some much larger gains coming once I add incremental updates as a follow up).
  • Switching the get_fetch_uv() calls to get_data() calls seems like a good idea. Ideally we'd do this as a follow up in order to get this landed soon.
  • Agreed that changing gl_instanceid to a vertex attribute could be a good optimization - it's unrelated to this PR though, it can be done as a follow up.
  • Having #line1 just makes it easier to see where shader errors are, since several files are concatenated together before being compiled.
  • The two clamps are for the clip rect of the primitive, and the clip rect of its owning stacking context layer.
  • ps_border definitely needs a lot of optimization - we wanted to get it mostly feature complete as a first step, then we can worry about optimization after that.
  • TextureCoordKind rename sounds good - let's do that as a follow up since it's unrelated to this.
  • The offset field for gradients is OK. It's a vec4 for alignment purposes but the gradient offset is a scalar that's stored in the X component. We could probably simplify this by fetching the vec4 but only returning it as a float in the shader code.

Thanks for looking over the changes! :)

@kvark
Copy link
Member

kvark commented Oct 19, 2016

@glennw makes sense, thanks for the answers! :shipit:

Copy link
Contributor

@pcwalton pcwalton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review

vOffsets[i] = gradient.offsets[i];
int stop_index = int(prim.user_data.x);

for (int i=0 ; i < vStopCount ; ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: weird indentation here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was the previous file?

tr_inner.x - tl_inner.x,
border.widths.y);
break;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can eliminate this switch at some point long-term, I hope?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - border shader needs a lot of optimization work :)

TextureId(0),
TextureId(0),
TextureId(0),
TextureId(0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does [TextureId(0); 16] not work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}

#[derive(Debug, Clone)]
pub struct TextRunPrimitiveCpu {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this and TextRunPrimitiveGpu?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structs get split SoA style into portions that get directly pushed into a vertex texture (the gpu version) and cpu-side only portions (typically used to look up resource lists etc).

Clip {
rect: ClipRect {
rect: rect,
padding: [0.0, 0.0, 0.0, 0.0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably just be [0.0; 4], right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

self.gpu_data16.push(InstanceRect {
rect: rect,
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could just be

self.gpu_data16.extend(instance_rects.into_iter().map(|rect| InstanceRect {
    rect: rect,
}))

clip_data[1] = GpuBlock32::from(clip.top_left.clone());
clip_data[2] = GpuBlock32::from(clip.top_right.clone());
clip_data[3] = GpuBlock32::from(clip.bottom_left.clone());
clip_data[4] = GpuBlock32::from(clip.bottom_right.clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to clone the clip components?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

TextureCoordKind::Pixel => {
uv1.x = -uv1.x;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uber-nit: no need for the {} after the last =>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

*dest = GpuBlock32::from(GradientStop {
offset: 1.0 - src.offset,
color: src.color,
padding: [0.0, 0.0, 0.0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably be just [0.0; 3]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

TransformedRectKind::Complex
};

/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to have commented out code, but I understand why you did it here. I think it'd be good to add a FIXME at the top saying what the commented out code does and when to reenable it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@pcwalton
Copy link
Contributor

@@ -179,14 +344,17 @@ impl AlphaBatcher {
                     AlphaRenderItem::Primitive(sc_index, prim_index) => {
                         // See if this task fits into the tile UBO
                         let layer = &ctx.layer_store[sc_index.0];
-                        let prim = &ctx.prim_store[prim_index.0];
+                        let prim_metadata = ctx.prim_store.get_metadata(prim_index);
                         let transform_kind = layer.xf_rect.as_ref().unwrap().kind;
                         let needs_blending = transform_kind == TransformedRectKind::Complex ||
-                                             !prim.is_opaque(ctx.resource_cache, ctx.frame_id);
-                        let batch_kind = prim.batch_kind();
-                        let color_texture_id = prim.color_texture_id(ctx.resource_cache,
-                                                                     ctx.frame_id);
-                        let flags = AlphaBatchKeyFlags::new(transform_kind, needs_blending);
+                                             !prim_metadata.is_opaque ||
+                                             prim_metadata.clip_index.is_some();
+                        let needs_clipping = prim_metadata.clip_index.is_some();

nit: Instead of repeating prim_metadata.clip_index.is_some(), could you just say needs_blending = transform_kind == TransformedRectKind::Complex || needs_clipping;?

@@ -1796,8 +770,12 @@ impl AlphaBatchKey {
 struct AlphaBatchKeyFlags(u8);

Can we add a FIXME to change this into a bitflags! at some point?

@pcwalton
Copy link
Contributor

This looks good to me with those nits addressed!

@glennw
Copy link
Member Author

glennw commented Oct 20, 2016

@bors-servo r=pcwalton

@bors-servo
Copy link
Contributor

📌 Commit f112cf6 has been approved by pcwalton

@bors-servo
Copy link
Contributor

⌛ Testing commit f112cf6 with merge 12af87a...

bors-servo pushed a commit that referenced this pull request Oct 20, 2016
Switch primitives to be stored in SoA style in vertex textures.

The primary goals of this patch are:

(1) Store primitives in SoA style. This makes some passes, such as
    culling primitives much faster due to better cache coherency.
(2) Store primitives in flat arrays, for direct upload to GPU. This
    reduces the amount of redundant copying on the CPU during
    frame construction.
(3) Decouple primitive information from geometry instances. This allows
    us to improve occlusion culling, by submitting segments of primitives
    without any extra CPU overhead of copying primitive data.
(4) Allow incremental updates of the GPU SoA arrays during scrolling and
    zoom. This isn't implemented yet, but the framework is in place and
    will allow significant backend/compositor thread improvements during
    scrolling and zoom frames.

There's also a number of other changes that have been rolled in to this
patch since it made sense to fix them at the same time. In particular:

* Blend/composite batches are stored as ints, avoiding CPU float packing overhead.
* Border segment rectangles are calculated in the vertex shader rather than CPU.
* Gradient colors use vertex shader interpolators for axis aligned gradients, reducing FS overhead.
* Removed separate Text/TextRun shader types, resulting in better batching and fewer draw calls.
* Angle gradients support arbitrary stop count in primitive data.
* Removed some unused interpolators from the border shader.
* Axis aligned gradient segments are calculated in vertex shader, reducing CPU overhead.
* Reduced size of packed image structure - UV type is stored in sign of UV field.
* Reduced size of packed glyph structure - color stored per run, rect calculated from UVs.
* Moved some utility types from tiling.rs to util.rs.
* Remove clip cases from batch enum, store needs_clipping flag in batch key.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/webrender/448)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

☀️ Test successful - status-travis

@bors-servo bors-servo merged commit f112cf6 into servo:master Oct 20, 2016
@glennw glennw deleted the prim-vt3 branch October 20, 2016 22:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants