Summary: | [WPE][GTK] Dispatch display refreshes earlier when skipping frames | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Alejandro G. Castro <alex> | ||||||||||||||||
Component: | WebKit Misc. | Assignee: | Chris Lord <clord> | ||||||||||||||||
Status: | NEW --- | ||||||||||||||||||
Severity: | Normal | CC: | achristensen, aperez, bugs-noreply, cdumez, cgarcia, clopez, clord, cmarcelo, crzwdjk, ews-watchlist, gyuyoung.kim, kdwkleung, luiz, mcatanzaro, pnormand, psaavedra, ryuan.choi, sergio, webkit-bug-importer, zan, zdobersek, zeno | ||||||||||||||||
Priority: | P2 | Keywords: | InRadar | ||||||||||||||||
Version: | WebKit Nightly Build | ||||||||||||||||||
Hardware: | Unspecified | ||||||||||||||||||
OS: | Unspecified | ||||||||||||||||||
Bug Depends on: | |||||||||||||||||||
Bug Blocks: | 238999 | ||||||||||||||||||
Attachments: |
|
Description
Alejandro G. Castro
2021-11-18 03:47:46 PST
Created attachment 444663 [details]
Patch
This forces all content (flushed from the layer tree or produced through rAF callbacks) to be presented on the second future vsync event, regardless of complexity. Only benefit is that rAF callbacks are dispatched more often. But that's already achievable by customizing the update cycle through the embedding application (i.e. taking it out of the alignment with vsync events). Thanks for the feedback! Some questions inlined. (In reply to Zan Dobersek from comment #2) > This forces all content (flushed from the layer tree or produced through rAF > callbacks) to be presented on the second future vsync event, regardless of > complexity. > Do you mean that current solution was checking complexity? Or that we could improve it checking the complexity of the frame? How we could check the complexity of the frame? > Only benefit is that rAF callbacks are dispatched more often. But that's > already achievable by customizing the update cycle through the embedding > application (i.e. taking it out of the alignment with vsync events). We understood we are still honoring the vsync, because in the end this new frame is going to render but it is not to be composited until the next frameDone. Maybe we are missing some of the conditions in the architecture, any new information is welcome! Thanks again. (In reply to Alejandro G. Castro from comment #3) > Thanks for the feedback! Some questions inlined. > > (In reply to Zan Dobersek from comment #2) > > This forces all content (flushed from the layer tree or produced through rAF > > callbacks) to be presented on the second future vsync event, regardless of > > complexity. > > > > Do you mean that current solution was checking complexity? Or that we could > improve it checking the complexity of the frame? How we could check the > complexity of the frame? > The current implementation dispatches any rAF callbacks only when the frame is completely finished. This means that content updates done this way will be incorporated into the following flush and composited in the next ThreadedCompositor run. When complexity of the content is simple enough, this is all managed inside the 16ms frame budget (at 60FPS). Which is great. When complexity grows beyond that budget, the update will be presented when it's done. This change introduces behavior where rAF dispatch is done in the scope of the next-immediate ThreadedCompositor run. This means that, upon frame completion, an immediate composition run will be done, and only at that point will the rAF dispatch. This means that any content update done under rAF will only be incorporated in the second composition run, and will be presented two whole frames after the corresponding frame-completion event. And that will be done regardless of whether the content is light enough for it to be handled in the time-space of a single frame. > > Only benefit is that rAF callbacks are dispatched more often. But that's > > already achievable by customizing the update cycle through the embedding > > application (i.e. taking it out of the alignment with vsync events). > > We understood we are still honoring the vsync, because in the end this new > frame is going to render but it is not to be composited until the next > frameDone. Maybe we are missing some of the conditions in the architecture, > any new information is welcome! > > Thanks again. So you described the problem here pretty well. (In reply to Zan Dobersek from comment #4) > (In reply to Alejandro G. Castro from comment #3) > > Thanks for the feedback! Some questions inlined. > > > > (In reply to Zan Dobersek from comment #2) > > > This forces all content (flushed from the layer tree or produced through rAF > > > callbacks) to be presented on the second future vsync event, regardless of > > > complexity. > > > > > > > Do you mean that current solution was checking complexity? Or that we could > > improve it checking the complexity of the frame? How we could check the > > complexity of the frame? > > > > The current implementation dispatches any rAF callbacks only when the frame > is completely finished. This means that content updates done this way will > be incorporated into the following flush and composited in the next > ThreadedCompositor run. > > When complexity of the content is simple enough, this is all managed inside > the 16ms frame budget (at 60FPS). Which is great. When complexity grows > beyond that budget, the update will be presented when it's done. > > This change introduces behavior where rAF dispatch is done in the scope of > the next-immediate ThreadedCompositor run. This means that, upon frame > completion, an immediate composition run will be done, and only at that > point will the rAF dispatch. This means that any content update done under > rAF will only be incorporated in the second composition run, and will be > presented two whole frames after the corresponding frame-completion event. > And that will be done regardless of whether the content is light enough for > it to be handled in the time-space of a single frame. Is this not desirable though? We're increasing throughput by dispatching rAF earlier here and essentially switching from a blocking double-buffer to a non-blocking double-buffer (with possibly increased latency). What you say is correct only if the work done on the composition side is trivial, otherwise we end up in the situation where the compositor is eating into the frame-time budget of the main thread (which is what this patch is trying to address). > > > Only benefit is that rAF callbacks are dispatched more often. But that's > > > already achievable by customizing the update cycle through the embedding > > > application (i.e. taking it out of the alignment with vsync events). > > > > We understood we are still honoring the vsync, because in the end this new > > frame is going to render but it is not to be composited until the next > > frameDone. Maybe we are missing some of the conditions in the architecture, > > any new information is welcome! > > > > Thanks again. > > So you described the problem here pretty well. I don't fully understand this - it seems the current scheme will always penalise the main thread by however long composition takes? (In reply to Chris Lord from comment #5) > (In reply to Zan Dobersek from comment #4) > > (In reply to Alejandro G. Castro from comment #3) > > > Thanks for the feedback! Some questions inlined. > > > > > > (In reply to Zan Dobersek from comment #2) > > > > This forces all content (flushed from the layer tree or produced through rAF > > > > callbacks) to be presented on the second future vsync event, regardless of > > > > complexity. > > > > > > > > > > Do you mean that current solution was checking complexity? Or that we could > > > improve it checking the complexity of the frame? How we could check the > > > complexity of the frame? > > > > > > > The current implementation dispatches any rAF callbacks only when the frame > > is completely finished. This means that content updates done this way will > > be incorporated into the following flush and composited in the next > > ThreadedCompositor run. > > > > When complexity of the content is simple enough, this is all managed inside > > the 16ms frame budget (at 60FPS). Which is great. When complexity grows > > beyond that budget, the update will be presented when it's done. > > > > This change introduces behavior where rAF dispatch is done in the scope of > > the next-immediate ThreadedCompositor run. This means that, upon frame > > completion, an immediate composition run will be done, and only at that > > point will the rAF dispatch. This means that any content update done under > > rAF will only be incorporated in the second composition run, and will be > > presented two whole frames after the corresponding frame-completion event. > > And that will be done regardless of whether the content is light enough for > > it to be handled in the time-space of a single frame. > > Is this not desirable though? We're increasing throughput by dispatching rAF > earlier here and essentially switching from a blocking double-buffer to a > non-blocking double-buffer (with possibly increased latency). > Dispatches of requestAnimationFrame callbacks are done with intention of producing the content that's to be presented in the immediately-following frame. The change in this patch breaks this as the changes done in those callbacks will only be presented two frames after. Also note that basic layer flushes are also scheduled through DisplayRefreshMonitor. So all the content at this point will be presented a frame late. > What you say is correct only if the work done on the composition side is > trivial, otherwise we end up in the situation where the compositor is eating > into the frame-time budget of the main thread (which is what this patch is > trying to address). > You've mainly evaluated this on painting benchmarks where the primary problem is exactly the amount of time taken under rAF callbacks. Once that time is prolonged, it will cause the frame update mechanism to miss the next vsync event, so subsequent updates (and rAF callbacks) are delayed, and the ultimate score is lower. I don't think composition plays a part here. If you want some benchmarking mode, it can be better instrumentalized in Cog. WPE exports can be disassociated from the Wayland-imposed vsync events, and the frame dispatches are done immediately after the newest export is received. This means that WebKit will internally unconditionally run at the maximum possible framerate at which the engine can process the content. In the meantime, the wl_surface in Cog will still have its contents updated according to the vsync signals, using whatever buffer was most recently exported. I can link you up with a branch for that. > > > > Only benefit is that rAF callbacks are dispatched more often. But that's > > > > already achievable by customizing the update cycle through the embedding > > > > application (i.e. taking it out of the alignment with vsync events). > > > > > > We understood we are still honoring the vsync, because in the end this new > > > frame is going to render but it is not to be composited until the next > > > frameDone. Maybe we are missing some of the conditions in the architecture, > > > any new information is welcome! > > > > > > Thanks again. > > > > So you described the problem here pretty well. > > I don't fully understand this - it seems the current scheme will always > penalise the main thread by however long composition takes? I think for the purposes of common understanding on this whole topic (from nomenclature to details of different procedures and operations), it would be great to construct some sort of visual graph of the update cycle and how it applies to the complete WK architecture. (In reply to Zan Dobersek from comment #7) > > [...] > > I think for the purposes of common understanding on this whole topic (from > nomenclature to details of different procedures and operations), it would be > great to construct some sort of visual graph of the update cycle and how it > applies to the complete WK architecture. I agree, we will try to create it to discuss the situation. I think I understand the problem you described if we generate a new rAF with the change we have to avoid it. But we have been checking the way the rAF is generated and when there is already a pending frame in the queue (not the one we are going to generate), that is the one we find in that condition it means we are already late and the frame should be rendered as soon as possible. Anyway, we will add the action point to generate the graph to discuss all the possibilities to generate rAF, rendering and compositing; it is a really complex graph and I just have perceptions but not clarity about it. Thanks again for the comments! I'm going to repurpose this bug slightly as I have a patch that addresses the issue. This does apply to both GTK and WPE, but only WPE actually implements the mechanisms correctly that make this an issue right now - I'll file separate bugs to fix GTK in this regard. This bug applies to WPE when using either the minibrowser or Cog with the Wayland backend (other backends don't implement the vsync mechanism correctly(/at all?), at the moment). Note that when using cog as the minibrowser, the gtk4 backend is the default, where this bug does not apply (specify "-P wl" at the end of the command line to use the Wayland backend). A problem we have currently is that if a frame goes over budget on the web process side (e.g. expensive canvas rendering or a very heavy page), we will miss a vsync and end up waiting for the next one. This is essentially synchronous vsynced double buffering - if we can't hit 60fps, we will hard drop to 30fps, even if we could render at an uneven cadence and get a significantly higher frame-rate. This is not the behaviour of Firefox or Chrome - Chrome appears to be triple-buffering (which provides a very smooth frame cadence in the presence of variable load, at the expense of increased latency and memory use) and Firefox seems to receive vsync asynchronously so that it doesn't miss them and can begin rendering at some point after the vsync is received, if the web process is blocked when it's received. Both browsers are capable of consistent frame-rates between 30fps and 60fps under a constant load that exceeds the 60fps frame budget. More info/patches to come... Created attachment 457051 [details]
Pre-patch rendering event timeline with no load
This diagram generated with custom tools/instrumentation shows the sequence of events for rendering when a frame is consistently within budget (which I'm going to call 'no load' as shorthand).
We receive a 'frameComplete' which triggers handleDisplayRefresh. renderNextFrame is called which synchronously flushes the layer tree, waiting for the web process to finish drawing. Before that's finished, we've requested another refresh callback. As renderNextFrame finishes before the next frameComplete, the sequence repeats at a regular 16.6ms cadence, as expected.
Created attachment 457052 [details]
Pre-patch rendering event timeline with 20ms load
This shows the sequence of events when rendering the frame takes >=20ms. Ideally, a 20ms frame time would mean a 50fps output. At 60Hz, this would result in the occasional skipped frame.
Here we can see that the waiting for the web process to finish rendering is delaying our request and we missed when we would have otherwise got a frameComplete call because we were blocking. This results in a consistent 30fps and under-utilisation of resources (shown by the blank space between renderNextFrame and frameComplete).
In this situation, it would be ideal if we could detect that we've missed a refresh and immediately start rendering the next frame. This is currently what happens on GTK and non-Wayland WPE because they rely on a timer to drive display refreshes and so events are postponed instead of skipped.
Created attachment 457054 [details]
Patch
Created attachment 457055 [details]
Post-patch rendering event timeline with 20ms load
This is the new sequence of events after this patch. When we detect that we've missed a display refresh and we have a refresh requested, we immediately schedule a rendering update.
You can see that renderNextFrame now does not wait for the next display refresh and we are receiving more frameComplete events, corresponding to more rendered frames. With this patch, we are able to achieve ~50fps with a 20ms load, instead of 30fps and better utilise resources.
This better matches the behaviour of other browsers and can result in significantly improved synthetic test scores on resource-constrained devices.
(In reply to Chris Lord from comment #13) > This better matches the behaviour of other browsers and can result in > significantly improved synthetic test scores on resource-constrained devices. Very superficially: Does it 'result in significantly improved synthetic test scores'? How do the memory and CPU consumption behave? And separately, why, on resource-constrained devices, would we want to put more constraint on resources? Comment on attachment 457054 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=457054&action=review > Source/WebKit/Shared/CoordinatedGraphics/threadedcompositor/ThreadedDisplayRefreshMonitor.cpp:42 > +constexpr Seconds FrameTime = 1_s / DefaultFramesPerSecond; What should happen when a 30FPS framerate is targeted by the embedder, either through how they manually manage the embedded view or how e.g. their display system is configured? (In reply to Zan Dobersek (Reviews) from comment #14) > (In reply to Chris Lord from comment #13) > > This better matches the behaviour of other browsers and can result in > > significantly improved synthetic test scores on resource-constrained devices. > > Very superficially: > > Does it 'result in significantly improved synthetic test scores'? How do the > memory and CPU consumption behave? I'll need to get specific numbers for memory/CPU - presumably CPU will go up in proportion to the decrease in idle rendering time. So if frames consistently took 17ms, we would have had about ~16ms of idle time per ~33ms and if CPU usage was somewhere around 40%, perhaps we could expect it to hit somewhere around 80%. With regards to numbers, without this and the patch in bug 238999 applied (which likely has very little, if any effect in this particular case), Alex tested MotionMark delivering 84.35 and with these patches applied, 289.54. > And separately, why, on resource-constrained devices, would we want to put > more constraint on resources? Assuming that a UI is targeting 60fps, if there's a particular rare case that causes the budget to be exceeded by an insignificant amount, getting the occasional frame drop is better than just hard dropping to half the refresh. There are better ways to throttle the frame-rate, this behaviour differs from other major browsers and I certainly find it unexpected. (In reply to Zan Dobersek (Reviews) from comment #15) > Comment on attachment 457054 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=457054&action=review > > > Source/WebKit/Shared/CoordinatedGraphics/threadedcompositor/ThreadedDisplayRefreshMonitor.cpp:42 > > +constexpr Seconds FrameTime = 1_s / DefaultFramesPerSecond; > > What should happen when a 30FPS framerate is targeted by the embedder, > either through how they manually manage the embedded view or how e.g. their > display system is configured? This would depend how they throttle their frame-rate. If they have a 30Hz display, there's already lots of code that assumes 60Hz in WebKit that needs fixing and I'd say this is a separate issue (existing comments in the tree back up this assertion). If they throttled to half refresh by ignoring every other rAF, this would still work as expected because it wouldn't count as a skipped frame as the web processes wouldn't be blocking during that time. (In reply to Zan Dobersek (Reviews) from comment #14) > (In reply to Chris Lord from comment #13) > > This better matches the behaviour of other browsers and can result in > > significantly improved synthetic test scores on resource-constrained devices. > > Very superficially: > > Does it 'result in significantly improved synthetic test scores'? How do the > memory and CPU consumption behave? > > And separately, why, on resource-constrained devices, would we want to put > more constraint on resources? Yes, we are going to use more memory and CPU for sure, we are generating more frames, but when an application requests more resources I don't think we should limit them to 30fps because we are not reaching 60fps. I think we have to give information to the application and developers need to decide about what to do in their case. But at least they will have to option to use all the resources at that point if that is what they want. Just some follow-up; running MotionMark 1.2 on my machine, my results mirror Alex's. Unpatched: 35.41 +/- 14.25% Patched: 233.68 +/- 2.4% That's a release build of master (f68175d94ef7ae6b2a5fe117fecb1e3253858f12) with only this patch and nothing else applied and running the default set of tests on https://browserbench.org/MotionMark1.2/ Your setups aren't resource-constrained devices. Comment on attachment 457054 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=457054&action=review >>> Source/WebKit/Shared/CoordinatedGraphics/threadedcompositor/ThreadedDisplayRefreshMonitor.cpp:42 >>> +constexpr Seconds FrameTime = 1_s / DefaultFramesPerSecond; >> >> What should happen when a 30FPS framerate is targeted by the embedder, either through how they manually manage the embedded view or how e.g. their display system is configured? > > This would depend how they throttle their frame-rate. If they have a 30Hz display, there's already lots of code that assumes 60Hz in WebKit that needs fixing and I'd say this is a separate issue (existing comments in the tree back up this assertion). > > If they throttled to half refresh by ignoring every other rAF, this would still work as expected because it wouldn't count as a skipped frame as the web processes wouldn't be blocking during that time. 30FPS framerate right now works just fine, if that's what the display system or the embedding framework is configured for and the content is capable of being processed at that rate. (In reply to Alejandro G. Castro from comment #18) > (In reply to Zan Dobersek (Reviews) from comment #14) > > (In reply to Chris Lord from comment #13) > > > This better matches the behaviour of other browsers and can result in > > > significantly improved synthetic test scores on resource-constrained devices. > > > > Very superficially: > > > > Does it 'result in significantly improved synthetic test scores'? How do the > > memory and CPU consumption behave? > > > > And separately, why, on resource-constrained devices, would we want to put > > more constraint on resources? > > Yes, we are going to use more memory and CPU for sure, we are generating > more frames, but when an application requests more resources I don't think > we should limit them to 30fps because we are not reaching 60fps. I think we > have to give information to the application and developers need to decide > about what to do in their case. But at least they will have to option to use > all the resources at that point if that is what they want. It's not necessarily the application requesting more resources, it's the underlying system not being able to provide them, for whatever reason -- either by being too weak to handle it, or intentionally limited, or there's additional set of work completely separate from the work of the Web engine. Yet what this approach does is put more strain on such a system, and nothing towards making the work more efficient. Primary purpose of this patch is to squeeze out better numbers on some benchmarks. The numbers are poor there right now because the work employed by these benchmarks is super expensive and wasteful. The proposed change achieves better numbers because it just enables more such work to be done, at the expense more CPU and memory consumption, and subsequently more drawn power and heating. This isn't a win on resource-constrained devices. The graphs coming from reduced test cases (not done on such devices) imply that if assuming a 20ms workload under the requestAnimationFrame handler, it should be possible to ideally target a 50FPS framerate. This of course isn't realistic. The rAF handler just produces the desired content, and right now it does that on CPU, with great waste. That's then taken to the composition stage inside the Web engine, which packs it all up into a GPU task, the results of which, when completed, are handled by the embedder, in whatever fashion. If just simple painting is done under rAF, you are increasing the CPU consumption. If there's WebGL content or GL-backed painting, you are constructing a GPU task right there under the rAF, another such task under the composition step, and then possibly additional tasks inside the embedder and the compositor -- all competing for GPU resources and now having to battle with yet another GPU task coming from the early rAF dispatch. I don't think the change in itself is necessarily wrong, but it's not the main painpoint and it erases the current natural throttle, and instead blows up any pretension of efficiency. (In reply to Zan Dobersek from comment #23) > Primary purpose of this patch is to squeeze out better numbers on some > benchmarks. The numbers are poor there right now because the work employed > by these benchmarks is super expensive and wasteful. The proposed change > achieves better numbers because it just enables more such work to be done, > at the expense more CPU and memory consumption, and subsequently more drawn > power and heating. > > This isn't a win on resource-constrained devices. The graphs coming from > reduced test cases (not done on such devices) imply that if assuming a 20ms > workload under the requestAnimationFrame handler, it should be possible to > ideally target a 50FPS framerate. This of course isn't realistic. The rAF > handler just produces the desired content, and right now it does that on > CPU, with great waste. That's then taken to the composition stage inside the > Web engine, which packs it all up into a GPU task, the results of which, > when completed, are handled by the embedder, in whatever fashion. > > If just simple painting is done under rAF, you are increasing the CPU > consumption. If there's WebGL content or GL-backed painting, you are > constructing a GPU task right there under the rAF, another such task under > the composition step, and then possibly additional tasks inside the embedder > and the compositor -- all competing for GPU resources and now having to > battle with yet another GPU task coming from the early rAF dispatch. > > I don't think the change in itself is necessarily wrong, but it's not the > main painpoint and it erases the current natural throttle, and instead blows > up any pretension of efficiency. I don't see how you can possibly argue that waiting idle for a vsync is an efficient use of resources. If an application is targeting 60fps and has a short period, let's say 2 seconds, where it goes over budget by a small amount (this is a very realistic situation), dropping to 30fps is not a desirable outcome. More importantly, this is also not the behaviour of any of the major browsers. You're right that this disproportionately affects benchmarks, but that doesn't make it the wrong thing to do. I would explicitly suggest that our current behaviour is actually bad behaviour and doesn't put us at any real advantage on any platform. That I'm presenting results recorded on my machine is purely out of convenience, I'm confident that these assertions hold under more constrained resources because they are very basic assertions. If it takes me running a battery of tests on a Pi to convince you, I can go about doing that, but it doesn't seem like a great use of time when the patch and the concept are so simple. Also, that 60fps static value is not added by this patch. There's even an accompanying FIXME. There is definitely code in this path that does not do the correct thing when the refresh rate isn't 60Hz. I suppose if you feel very strongly about that existing bug, I could take a look at fixing that also, but that patch is non-trivial and will likely involve external libraries, I would rather fix this very small thing that gives us an easy win in our most common situation first. Created attachment 459889 [details]
Patch
So this is reworked so that the current behaviour is maintained and new behaviour is controlled by the WEBKIT_MAX_FRAME_DELAY_MS environment variable. If a frame goes over budget by under this amount of milliseconds, a new render frame will be kicked off immediately. Setting it to 5, for example, will allow the browser to vary between ~50-60fps, but if frame goes over budget by more than amount, it will drop to 30fps like it does currently.
It took a while to get it to do something resembling what I wanted and I don't think the logic is quite right, but it at least works a bit closer to what I expect than the previous version of the patch, which seems to have a re-entrancy issue where as soon as a frame is pre-empted, all subsequent frames will be pre-empted until continuous rendering stops.
In theory, setting WEBKIT_MAX_FRAME_DELAY_MS to a very big number ought to achieve similar behaviour to the previous patch, but in practice that isn't the case - I expect this is because we're not accounting for the full frame time, so a frame going only slightly over budget isn't detected. This ought to be fairly easily fixed or worked around.
Created attachment 465771 [details]
Patch v2
This is Chris' patch updated to build on “main”; I am trying to look
now into this issue and one of the first things I wanted to do was
testing the suggested approach.
|