Skip to content

Clip stack support #581

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Clip stack support #581

wants to merge 2 commits into from

Conversation

kvark
Copy link
Member

@kvark kvark commented Nov 21, 2016

Closes #498

Removes StackingContext::overflow in favour of DisplayItem::clip.
Generates clip tasks for the layers that the primitives may depend on when generating their masks, copies over the affected sub-regions instead of clearing (new cs_clip_copy shader).

Current status: base tests are passing, new logic test cases are pending


This change is Reviewable

@bors-servo
Copy link
Contributor

☔ The latest upstream changes (presumably #568) made this pull request unmergeable. Please resolve the merge conflicts.

@glennw
Copy link
Member

glennw commented Nov 22, 2016

@kvark Some bugs that may have some relevant test cases:

These may not be 100% relevant to the current task, and they may be caused by other unrelated bugs, but I think this work should be a step towards solving these issues.

@kvark
Copy link
Member Author

kvark commented Nov 22, 2016

Thanks @glennw ! I'll check out these specifically after making sure the basic tests work as expected. I figured the easiest way to test now is to modify the wr-sample, so I found a few spots with missing logic and working on it now.

Copy link
Member

@glennw glennw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to run this branch locally to get a better understanding of how it fits together - once it's rebased tomorrow I'll pull it locally and have a more detailed look over it :)

let scrollbar_rect = Rect::new(Point2D::zero(), Size2D::new(10.0, 70.0));
context.builder.add_solid_rectangle(&scrollbar_rect,
&ClipRegion::simple(&scrollbar_rect),
context.builder.add_solid_rectangle(&stacking_context.bounds,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the change from a constant rect here deliberate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! I shouldn't have changed this one

@@ -1723,6 +1794,26 @@ impl ScreenTile {
alpha_task_stack.push(prev_task);
}
}

// Create a task for the layer mask, if needed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this do renders for axis aligned regions, that can be handled without a mask?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clip_cache_info would be None for stuff with just rectangle clips and no images/rounded borders. That's what "if needed" part implies here. I'll specify it in more detail.

@kvark
Copy link
Member Author

kvark commented Nov 22, 2016

I'm facing a problem here with cleared tiles. Now that I have clip regions assigned to stacking contexts, they don't draw anything (even their own background) to areas outside of the clip regions. So the pixels in otherwise affected tiles stay the clear color (black), and the pixels in tiles outside the clipping regions are filled up with white by Frame::clear_tiles. This is certainly not desirable, but it's not obvious to me how the semantics of the background color and/or clear tiles has to be adapted. @glennw ?

@kvark kvark changed the title Clip stack support [WIP] Clip stack support Nov 22, 2016
@kvark kvark changed the title [WIP] Clip stack support Clip stack support Nov 22, 2016
@glennw
Copy link
Member

glennw commented Nov 23, 2016

I didn't get a chance to take a detailed look at this today, sorry.

I did build it though and run a few tests - all the tests I've run so far do pass. Anecdotally it seems like sites are quite a bit slower with this patch than before - it could be worth testing a few sites locally with/without this patch and seeing if you see the same result.

I'm not sure I understand how the clip copies stuff works - doesn't it mean that you have the same texture bound for reading from at the same time as you are writing via FBO. If that's the case, although it often kind of works, it's undefined behaviour - but perhaps I just haven't read the code closely enough yet.

I'll take another look in the morning and we can discuss further.

@glennw
Copy link
Member

glennw commented Nov 23, 2016

Oh, disregard the above question about the FBO - I can see how it adds a dependent task now. But, like we discussed earlier, that means it's not available to the correct render pass, doesn't it? How does the current patch handle that case?

@kvark
Copy link
Member Author

kvark commented Nov 23, 2016

@glennw

Anecdotally it seems like sites are quite a bit slower with this patch than before

Weird, I don't expect the workload change with this PR, at least not until we start using the new API. One thing that it does differently is - pushing/popping a dummy context for the level zero's background rectangle. I'll try to measure the perf difference myself and see.

But, like we discussed earlier, that means it's not available to the correct render pass, doesn't it? How does the current patch handle that case?

The clip generation uses the task system API and specifies the dependencies correctly. I consider the extra work that may occur to be an implementation problem of the task system itself. Perhaps, we can relax the constraints of the tasks hashing a bit (i.e. allow tasks reusal from different passes under certain conditions) in order to optimize this and potentially other cases?

@glennw
Copy link
Member

glennw commented Nov 23, 2016

@kvark I've been thinking about a modification to this implementation a bit, which I've described below. Let's discuss over IRC / Vidyo tomorrow and see if it makes sense.

First, I need you to check my understanding of the current implementation, which might be wrong. I think that as it stands now, we will require the same number of render target passes as the depth of the worst case clip stack - is that right? If so, I think that could probably lead to some cases that require a lot of render target passes (which are esp. bad on mobile). I suspect it could also lead to us having an extra render target on a lot of pages (which is not ideal since each render target introduces an additional fixed overhead).

The other option we talked about briefly was having a shader that can loop over multiple clips and write them in one pass. This has some advantages, but also some disadvantages, as we've discussed.

But there's a third option - which sounds bad on the face of it - but hear me out. We render the clip stack for a primitive in a single render task, but multi-pass. So if a primitive has its own clip mask and also clip masks from two nested (rotated) stacking contexts, we would render them all into a single render task spot (so only one render target), but with multiple draw calls (that are batched where possible, of course).

So the downside of this is that the stacking context mask ends up getting drawn multiple times, instead of once in a different pass. But consider this (and disagree as you see fit):

  • Rendering a clip mask is (almost) as fast as the cs_copy shader anyway, so it's minimal performance difference in terms of the rendering of the masks.
  • We save extra render target allocations, clears and remove that fixed overhead, so it's a performance and memory win in that sense.
  • We can, at our choice, choose whether to render the mask as a special case shader that can render multiple clip masks in one pass, or do it multi-pass (which would be the default).
  • It fits in nicely to the existing task system :)

I think it's pretty important to minimize the number of render target switches, esp. on mobile / tiled GPU architecturues, and this seems to me like a compromise that gives us the ability to do the clips masks as either single / multi-pass.

If we were to do something like this, I'd imagine something like the following:

  • Primitives that have their own clip (e.g. rounded corners) use (PrimitiveIndex, TileId) as a key for the render task.
  • Primitives that don't need their own specific clip, but have clips from stacking contexts would use (StackingContextIndex, TileId) as a key for the task.
  • The render task data would include (equivalent to) a Vec<> of stacking contexts that they need to draw into this mask region.
  • Primitives without specific or stacking context clip get a normal dummy item and no render task.

Thoughts? Does that make any sense?

@kvark
Copy link
Member Author

kvark commented Nov 24, 2016

@glennw Thanks for thinking about it!

I think that as it stands now, we will require the same number of render target passes as the depth of the worst case clip stack - is that right?

I don't see it "requiring" this number of render targets. As I said in the previous comment, this is up to the task system implementation to decide. It may have N targets, or it may be bouncing between 2 targets, or even have a single target with texture barriers to allow sampling and rendering to it at the same time.

But there's a third option

I like the idea. I wish we could have it as an optional path when the clip stack is not too deep, but it sounds like it would be easier to just put it as the only option. I agree (yes, good points there!) with everything except these two points:

Rendering a clip mask is (almost) as fast as the cs_copy shader anyway, so it's minimal performance difference in terms of the rendering of the masks.

If a stacking context happens to accumulate N clips, we'll have to redo them for every child primitive in this 3rd approach, which is certainly incomparable with a simple cs_clip_copy call.

I think it's pretty important to minimize the number of render target switches, esp. on mobile / tiled GPU architectures

While true in general, we aren't talking about a hundred or even ten target switches that currently happen. It only adds up a few switches to what we already had, so I wouldn't worry about the RT switch cost in particular.

So, I like your 3rd approach because it closes the gap between the 1st (which is implemented) and the 2nd (uber-clip shader). There is also an optimization on the table that I mentioned - have non-blended clip instances that avoid issuing a cs_clip_clear prior to them. It could fix the performance of 90% of cases without modifying any major logic paths.

Here is what I propose. I'll try to get that non-blended clip version running and update this PR. The idea is that by the time you get in today we can merge this PR at last. And then I'll proceed to implementing your 3rd approach. The benefit of landing now comparing to just rewriting the code straight away to the 3rd is that we'll have more time to adopt the API changes in servo and maybe find some issues that are common between 1st and 3rd paths. How does this sound?

@glennw
Copy link
Member

glennw commented Nov 24, 2016

Sounds like a good plan, thanks! :)

@glennw
Copy link
Member

glennw commented Nov 24, 2016

@kvark I did some initial testing - the tests so far that I've run pass - but there seems to be a timing issue, which will expose those Servo intermittents again. Could you see if you can reproduce the results below locally? Hopefully I was just testing something incorrectly...

If you run:

time ./mach test-wpt --release --processes=8 --log-raw=wpt.log tests/wpt/mozilla/tests/css/

The results I get without this patch:

real	1m25.866s
user	8m48.928s
sys	0m58.704s

The results with this patch:

real	2m33.484s
user	17m6.348s
sys	0m57.728s

@glennw
Copy link
Member

glennw commented Nov 25, 2016

Reviewed 5 of 13 files at r1, 3 of 8 files at r2.
Review status: 8 of 14 files reviewed at latest revision, 5 unresolved discussions.


webrender/src/tiling.rs, line 158 at r2 (raw file):

                         device_pixel_ratio: f32) -> bool {
        let metadata = self.get_metadata(prim_index);

Why remove this?


webrender/src/tiling.rs, line 526 at r2 (raw file):

                        let prim_metadata = ctx.prim_store.get_metadata(prim_index);
                        let transform_kind = layer.xf_rect.as_ref().unwrap().kind;
                        let needs_clipping = prim_metadata.clip_cache_info.is_some() || true; //TODO

Ah, this will be causing the slowdown I'm seeing, I expect


webrender/src/tiling.rs, line 1042 at r2 (raw file):

    fn new_mask(actual_rect: DeviceIntRect,
                cache_info: &MaskCacheInfo,
                dependant: Option<&RenderTask>,

nit: spelling (I think)


Comments from Reviewable

@glennw
Copy link
Member

glennw commented Nov 25, 2016

Partial review only above - will do a more detailed look once we've got the test performance issue sorted.

@kvark
Copy link
Member Author

kvark commented Nov 25, 2016

Review status: 8 of 14 files reviewed at latest revision, 5 unresolved discussions.


webrender/src/tiling.rs, line 158 at r2 (raw file):

Previously, glennw (Glenn Watson) wrote… > Why remove this?
No longer necessary as the culling is now moved [higher up](https://github.com//pull/581/files#diff-a465a0e8f2a0c6ff2d8a8efcf2696279R2583) to `assign_prims_to_screen_tiles`

webrender/src/tiling.rs, line 526 at r2 (raw file):

Previously, glennw (Glenn Watson) wrote… > Ah, this will be causing the slowdown I'm seeing, I expect
Wow, that did slip by me. That was meant to be temporary. Thanks for spotting! I'll fix it and do a proper benchmark with rebasing on your non-clip rectangle patch.

webrender/src/tiling.rs, line 1042 at r2 (raw file):
Seems to be a difference between British and American English:

According to every British reference resource we checked, British English treats dependant as the noun and dependent as the adjective.


Comments from Reviewable

…gle.

Renamed prim_store::PrimitiveClipSource to mask_cache::ClipSource.
Creating dependent tasks for the layers.
clip copy shader.
Nested clipping and some text for the wr-sample.
…all.

Fixed the needs_clipping flag to avoid blending where not necessary.
@kvark
Copy link
Member Author

kvark commented Nov 25, 2016

Time before my PR, before your rectangle no-clip fix, with laptop on battery:

real	6m18.830s
user	18m52.097s
sys	1m5.355s

Time before my PR, with your fix, on power:

real	3m16.849s
user	9m20.058s
sys	1m7.691s

Time with my PR, after fixing the needs_clipping bug, and without your fix, with laptop on battery:

real	5m58.585s
user	18m50.510s
sys	1m2.438s

Time with my PR, after the clip fix, with your fix, on power:

real	3m17.273s
user	9m12.677s
sys	1m8.231s

This PR is now updated, looks like all the issues are resolved. @glennw please have a look.

@kvark kvark mentioned this pull request Nov 25, 2016
@bors-servo
Copy link
Contributor

☔ The latest upstream changes (presumably #592) made this pull request unmergeable. Please resolve the merge conflicts.

bors-servo pushed a commit that referenced this pull request Nov 28, 2016
Clip stack collapse path

Based on #581
Implements the 3rd option explained in #581 (comment) as an optional code path, controlled by `CLIP_TASK_COLLAPSE`, enabled by default.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/webrender/595)
<!-- Reviewable:end -->
@kvark
Copy link
Member Author

kvark commented Nov 28, 2016

Shadowed by #595

@kvark kvark closed this Nov 28, 2016
@kvark kvark deleted the clip_stack branch November 28, 2016 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants