-
Notifications
You must be signed in to change notification settings - Fork 296
Switch primitives to be stored in SoA style in vertex textures. #448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
r? @pcwalton I apologize for the size of this patch - if it's too difficult to review, let me know and I'll split it up tomorrow into smaller patches. I've run this locally against WPT and CSS tests successfully. I've also tested on my Linux laptop and Mac Mini. Some timings are pasted below. |
Some representative timings from my laptop. Note that this change doesn't actually do any incremental updates yet, so there are some good sized wins still to come from follow ups to this patch. https://en.wikipedia.org https://news.ycombinator.com https://reddit.com https://github.com/servo/servo |
I'd be happy to review this gigantic change as well. |
☔ The latest upstream changes (presumably #432) made this pull request unmergeable. Please resolve the merge conflicts. |
The primary goals of this patch are: (1) Store primitives in SoA style. This makes some passes, such as culling primitives much faster due to better cache coherency. (2) Store primitives in flat arrays, for direct upload to GPU. This reduces the amount of redundant copying on the CPU during frame construction. (3) Decouple primitive information from geometry instances. This allows us to improve occlusion culling, by submitting segments of primitives without any extra CPU overhead of copying primitive data. (4) Allow incremental updates of the GPU SoA arrays during scrolling and zoom. This isn't implemented yet, but the framework is in place and will allow significant backend/compositor thread improvements during scrolling and zoom frames. There's also a number of other changes that have been rolled in to this patch since it made sense to fix them at the same time. In particular: * Blend/composite batches are stored as ints, avoiding CPU float packing overhead. * Border segment rectangles are calculated in the vertex shader rather than CPU. * Gradient colors use vertex shader interpolators for axis aligned gradients, reducing FS overhead. * Removed separate Text/TextRun shader types, resulting in better batching and fewer draw calls. * Angle gradients support arbitrary stop count in primitive data. * Removed some unused interpolators from the border shader. * Axis aligned gradient segments are calculated in vertex shader, reducing CPU overhead. * Reduced size of packed image structure - UV type is stored in sign of UV field. * Reduced size of packed glyph structure - color stored per run, rect calculated from UVs. * Moved some utility types from tiling.rs to util.rs. * Remove clip cases from batch enum, store needs_clipping flag in batch key.
Review notes:
Some thoughts about the change:
Other observations through the code:
I've yet to check |
@kvark Thanks for taking a look, comments below:
Thanks for looking over the changes! :) |
@glennw makes sense, thanks for the answers! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review
vOffsets[i] = gradient.offsets[i]; | ||
int stop_index = int(prim.user_data.x); | ||
|
||
for (int i=0 ; i < vStopCount ; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: weird indentation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that was the previous file?
tr_inner.x - tl_inner.x, | ||
border.widths.y); | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can eliminate this switch at some point long-term, I hope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - border shader needs a lot of optimization work :)
TextureId(0), | ||
TextureId(0), | ||
TextureId(0), | ||
TextureId(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does [TextureId(0); 16]
not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
} | ||
|
||
#[derive(Debug, Clone)] | ||
pub struct TextRunPrimitiveCpu { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between this and TextRunPrimitiveGpu
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Structs get split SoA style into portions that get directly pushed into a vertex texture (the gpu version) and cpu-side only portions (typically used to look up resource lists etc).
Clip { | ||
rect: ClipRect { | ||
rect: rect, | ||
padding: [0.0, 0.0, 0.0, 0.0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably just be [0.0; 4]
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
self.gpu_data16.push(InstanceRect { | ||
rect: rect, | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could just be
self.gpu_data16.extend(instance_rects.into_iter().map(|rect| InstanceRect {
rect: rect,
}))
clip_data[1] = GpuBlock32::from(clip.top_left.clone()); | ||
clip_data[2] = GpuBlock32::from(clip.top_right.clone()); | ||
clip_data[3] = GpuBlock32::from(clip.bottom_left.clone()); | ||
clip_data[4] = GpuBlock32::from(clip.bottom_right.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to clone the clip components?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
TextureCoordKind::Pixel => { | ||
uv1.x = -uv1.x; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uber-nit: no need for the {}
after the last =>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
*dest = GpuBlock32::from(GradientStop { | ||
offset: 1.0 - src.offset, | ||
color: src.color, | ||
padding: [0.0, 0.0, 0.0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably be just [0.0; 3]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
TransformedRectKind::Complex | ||
}; | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer not to have commented out code, but I understand why you did it here. I think it'd be good to add a FIXME
at the top saying what the commented out code does and when to reenable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
nit: Instead of repeating
Can we add a |
This looks good to me with those nits addressed! |
@bors-servo r=pcwalton |
📌 Commit f112cf6 has been approved by |
Switch primitives to be stored in SoA style in vertex textures. The primary goals of this patch are: (1) Store primitives in SoA style. This makes some passes, such as culling primitives much faster due to better cache coherency. (2) Store primitives in flat arrays, for direct upload to GPU. This reduces the amount of redundant copying on the CPU during frame construction. (3) Decouple primitive information from geometry instances. This allows us to improve occlusion culling, by submitting segments of primitives without any extra CPU overhead of copying primitive data. (4) Allow incremental updates of the GPU SoA arrays during scrolling and zoom. This isn't implemented yet, but the framework is in place and will allow significant backend/compositor thread improvements during scrolling and zoom frames. There's also a number of other changes that have been rolled in to this patch since it made sense to fix them at the same time. In particular: * Blend/composite batches are stored as ints, avoiding CPU float packing overhead. * Border segment rectangles are calculated in the vertex shader rather than CPU. * Gradient colors use vertex shader interpolators for axis aligned gradients, reducing FS overhead. * Removed separate Text/TextRun shader types, resulting in better batching and fewer draw calls. * Angle gradients support arbitrary stop count in primitive data. * Removed some unused interpolators from the border shader. * Axis aligned gradient segments are calculated in vertex shader, reducing CPU overhead. * Reduced size of packed image structure - UV type is stored in sign of UV field. * Reduced size of packed glyph structure - color stored per run, rect calculated from UVs. * Moved some utility types from tiling.rs to util.rs. * Remove clip cases from batch enum, store needs_clipping flag in batch key. <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/webrender/448) <!-- Reviewable:end -->
☀️ Test successful - status-travis |
The primary goals of this patch are:
(1) Store primitives in SoA style. This makes some passes, such as
culling primitives much faster due to better cache coherency.
(2) Store primitives in flat arrays, for direct upload to GPU. This
reduces the amount of redundant copying on the CPU during
frame construction.
(3) Decouple primitive information from geometry instances. This allows
us to improve occlusion culling, by submitting segments of primitives
without any extra CPU overhead of copying primitive data.
(4) Allow incremental updates of the GPU SoA arrays during scrolling and
zoom. This isn't implemented yet, but the framework is in place and
will allow significant backend/compositor thread improvements during
scrolling and zoom frames.
There's also a number of other changes that have been rolled in to this
patch since it made sense to fix them at the same time. In particular:
This change is