Bug 219586 - [Concurrent Display Lists] GPU process should not immediately sleep after reading all available display list items
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit2
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Wenson Hsieh
URL:
Keywords: InRadar
Depends on: 219091 219657 220110
Blocks:
Reported: 2020-12-06 15:41 PST by Wenson Hsieh
Modified: 2021-01-07 11:42 PST
CC: 8 users

See Also:


Attachments
Patch (10.69 KB, patch)
2020-12-06 20:29 PST, Wenson Hsieh
no flags
WIP v2 (25.62 KB, patch)
2020-12-07 14:02 PST, Wenson Hsieh
no flags
Patch (32.04 KB, patch)
2020-12-07 16:10 PST, Wenson Hsieh
no flags
Work in progress (33.57 KB, patch)
2020-12-28 16:36 PST, Wenson Hsieh
no flags
Work in progress (40.61 KB, patch)
2020-12-29 15:17 PST, Wenson Hsieh
no flags
Patch (40.56 KB, patch)
2021-01-05 08:33 PST, Wenson Hsieh
cdumez: review+
cdumez: commit-queue-
Address comments + rebase (40.71 KB, patch)
2021-01-06 11:06 PST, Wenson Hsieh
no flags

Description Wenson Hsieh 2020-12-06 15:41:40 PST
Currently, the GPU process goes to sleep immediately after reading all available display list items. The 512-item wakeup hysteresis partially mitigates this by giving the web process a head start, but on the Lines, Arcs, and Paths canvas benchmarks, I'm still seeing more GPUP sleeps/wakeups than necessary. This bug tracks experimenting with a small (~40-80 µs) sleep interval after hitting 0 unread bytes in the GPU process, so that we can start processing display list items again if the web process writes more item data.
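
A minimal sketch of the idea described above (DisplayListReader and its members are hypothetical stand-ins for illustration, not the actual WebKit classes):

    #include <chrono>
    #include <cstddef>
    #include <thread>

    // Hypothetical stand-in for the shared display list buffer.
    struct DisplayListReader {
        size_t unreadBytes() const { return 0; } // stubbed for illustration
        void consumeNextItem() { }
    };

    void processDisplayListItems(DisplayListReader& reader)
    {
        do {
            while (reader.unreadBytes())
                reader.consumeNextItem();
            // Instead of going back to sleep immediately, give the web process
            // a brief (~40-80 us) window to write more item data before we fall
            // back to waiting for an IPC wakeup message.
            std::this_thread::sleep_for(std::chrono::microseconds(60));
        } while (reader.unreadBytes());
    }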
Comment 1 Wenson Hsieh 2020-12-06 20:29:59 PST Comment hidden (obsolete)
Comment 2 Wenson Hsieh 2020-12-07 14:02:11 PST Comment hidden (obsolete)
Comment 3 Wenson Hsieh 2020-12-07 16:10:50 PST
Created attachment 415599 [details]
Patch
Comment 4 Geoffrey Garen 2020-12-09 11:52:40 PST
Do we know how this sleep behavior compares in performance to a traditional semaphore?
Comment 5 Wenson Hsieh 2020-12-09 12:53:25 PST
(In reply to Geoffrey Garen from comment #4)
> Do we know how this sleep behavior compares in performance to a traditional
> semaphore?

I don't have any data on this, but it would be interesting to try! I'll experiment with this.
Comment 6 Geoffrey Garen 2020-12-09 13:11:05 PST
It looks like pthread mutexes and condition variables can be sent over shared memory directly, if you initialize them using the "*_setpshared" APIs.

The reason I'm curious about a semaphore is that, if the system call to invoke a semaphore is <= 8us, then it is strictly better -- because it is not slower, and it is a more robust way to wait.
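
For reference, a sketch of that initialization (error handling omitted; unrelated processes would typically obtain the shared region via shm_open, while MAP_ANON | MAP_SHARED suffices across a fork()):

    #include <pthread.h>
    #include <sys/mman.h>

    struct SharedWaiter {
        pthread_mutex_t mutex;
        pthread_cond_t condition;
    };

    SharedWaiter* createSharedWaiter()
    {
        // Map a region that both sides can see.
        auto* waiter = static_cast<SharedWaiter*>(mmap(nullptr, sizeof(SharedWaiter),
            PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0));

        pthread_mutexattr_t mutexAttributes;
        pthread_mutexattr_init(&mutexAttributes);
        pthread_mutexattr_setpshared(&mutexAttributes, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&waiter->mutex, &mutexAttributes);
        pthread_mutexattr_destroy(&mutexAttributes);

        pthread_condattr_t conditionAttributes;
        pthread_condattr_init(&conditionAttributes);
        pthread_condattr_setpshared(&conditionAttributes, PTHREAD_PROCESS_SHARED);
        pthread_cond_init(&waiter->condition, &conditionAttributes);
        pthread_condattr_destroy(&conditionAttributes);

        return waiter;
    }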
Comment 7 Geoffrey Garen 2020-12-13 12:05:58 PST
I did a little micro-benchmarking this weekend. Here's an example result:

~/OpenSource/WebKitBuild/Release> ./Scratch
CPU count: 16
iteration count: 10000
-----
no synchronization: 46us
spin: 77us
yield: 75us
semaphore_t: 74us
sleep 2us: 83us
sleep 4us: 87us
sleep 8us: 94us

It looks like a mach semaphore is competitive with the fastest approaches, and slightly faster than just sleeping.

But I'm much more interested in another benefit: with a semaphore, we're not wasting any CPU time re-checking for more work, so we never have to fall back to the "go slow and wait for IPC" path. I think that might be an immediate speedup compared to sleeping, and also a more robust and predictable way to wait across a variety of devices and workloads. (A semaphore even knows how to recover from a crash, so we don't even need a timeout on the wait; but we can add one if we really want to.)
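
For context, a rough reconstruction of the kind of ping-pong micro-benchmark described above, showing only the semaphore_t case (this is not the actual Scratch tool):

    #include <mach/mach.h>
    #include <mach/semaphore.h>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main()
    {
        constexpr int iterations = 10000;
        semaphore_t toConsumer, toProducer;
        semaphore_create(mach_task_self(), &toConsumer, SYNC_POLICY_FIFO, 0);
        semaphore_create(mach_task_self(), &toProducer, SYNC_POLICY_FIFO, 0);

        std::thread consumer([&] {
            for (int i = 0; i < iterations; ++i) {
                semaphore_wait(toConsumer);   // Wait for "work" from the producer.
                semaphore_signal(toProducer); // Acknowledge it.
            }
        });

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            semaphore_signal(toConsumer);
            semaphore_wait(toProducer);
        }
        auto elapsed = std::chrono::steady_clock::now() - start;
        consumer.join();

        printf("semaphore_t: %lldus\n", static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count()));
        return 0;
    }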
Comment 8 Geoffrey Garen 2020-12-13 12:58:41 PST
For sending a semaphore over IPC, it looks like Darwin supports either sending a semaphore_t as a mach port / mach send right, or sending a temporary file name as a string and then using ftok + semget to turn it into a semaphore.

Seems like semaphore_t would be best, if the mach documentation doesn't make your eyes bleed. WTF::MachSendRight and IPC::MachPort might help. And this reference to mach_port_extract_right: https://stackoverflow.com/questions/27078699/share-unnamed-semaphore-between-processes-in-mac-os/27083937#27083937.
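
A sketch of the semaphore_t route along the lines of that reference (error handling omitted; the resulting mach_port_t is what would be wrapped for transport, e.g. in WTF::MachSendRight or IPC::MachPort):

    #include <mach/mach.h>
    #include <mach/semaphore.h>

    semaphore_t createSendableSemaphore(mach_port_t& sendRightOut)
    {
        semaphore_t semaphore = MACH_PORT_NULL;
        semaphore_create(mach_task_self(), &semaphore, SYNC_POLICY_FIFO, 0);

        // A semaphore_t is a mach port name, so we can extract a send right
        // for it; the receiving process can then semaphore_signal() it.
        mach_msg_type_name_t acquiredType;
        mach_port_extract_right(mach_task_self(), semaphore,
            MACH_MSG_TYPE_COPY_SEND, &sendRightOut, &acquiredType);
        return semaphore;
    }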
Comment 9 Radar WebKit Bug Importer 2020-12-13 15:42:16 PST
<rdar://problem/72275412>
Comment 10 Wenson Hsieh 2020-12-28 16:36:28 PST
Created attachment 416833 [details]
Work in progress
Comment 11 Wenson Hsieh 2020-12-28 16:53:57 PST
(In reply to Geoffrey Garen from comment #8)
> For sending a semaphore over IPC, it looks like Darwin supports either
> sending a semaphore_t as a mach port / mach send right, or sending a
> temporary file name as a string and then using ftok + semget to turn it into
> a semaphore.
> 
> Seems like semaphore_t would be best, if the mach documentation doesn't make
> your eyes bleed. WTF::MachSendRight and IPC::MachPort might help. And this
> reference to mach_port_extract_right:
> https://stackoverflow.com/questions/27078699/share-unnamed-semaphore-between-
> processes-in-mac-os/27083937#27083937.

After trying out a couple of versions of a patch that uses mach semaphores for fast wakeups, the strategy that yields the best performance gains seems to be one that only signals the semaphore in the web process if we know that the GPU process is definitely waiting for additional items (indicated by an enum flag).

This is because signaling the semaphore in the web process seems to be a more costly operation than I'd anticipated — earlier, I tried having the web process unconditionally `semaphore_signal_thread` when writing new items and then having the GPU process only resume if the timed wait succeeded (i.e. returned `KERN_SUCCESS`), but this led to significantly worse performance (system tracing revealed that the heaviest stack in the web process by far was the kernel trap underneath `semaphore_signal_thread`).

I briefly explored adding a second (smaller) item count hysteresis for semaphore wakeups, such that we'd wait for 64 items before attempting to signal the semaphore, but this either had no significant impact or resulted in very slightly worse performance. It's possible that a similar policy could still yield better results, but I'll need to carry out more experiments and spend more time tuning the numbers.

In the meantime, the WIP patch I uploaded reflects the current approach, which yields performance numbers on Canvas Lines that are potentially (if only slightly) better than trunk.
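
A sketch of that signaling strategy with hypothetical names (the actual patch keeps a similar flag in the shared display list header; the 30 µs timeout matches the value discussed later in this bug):

    #include <mach/mach.h>
    #include <mach/semaphore.h>
    #include <atomic>
    #include <cstdint>

    enum class WaitingStatus : uint32_t { NotWaiting, Waiting };

    struct SharedHeader {
        std::atomic<WaitingStatus> waitingStatus { WaitingStatus::NotWaiting };
    };

    // Web process: after appending new items, only pay for the kernel trap
    // if the GPU process has declared that it is waiting.
    void didAppendItems(SharedHeader& header, semaphore_t resumeSemaphore)
    {
        auto expected = WaitingStatus::Waiting;
        if (header.waitingStatus.compare_exchange_strong(expected, WaitingStatus::NotWaiting))
            semaphore_signal(resumeSemaphore);
    }

    // GPU process: declare that we're waiting, then block briefly; if the
    // timed wait expires, go back to sleep and rely on a normal IPC wakeup.
    void waitForMoreItems(SharedHeader& header, semaphore_t resumeSemaphore)
    {
        header.waitingStatus.store(WaitingStatus::Waiting);
        mach_timespec_t timeout { 0, 30 * 1000 }; // 30 microseconds
        semaphore_timedwait(resumeSemaphore, timeout);
        header.waitingStatus.store(WaitingStatus::NotWaiting);
    }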
Comment 12 Wenson Hsieh 2020-12-29 15:17:58 PST
Created attachment 416843 [details]
Work in progress
Comment 13 Wenson Hsieh 2021-01-05 08:33:53 PST
Created attachment 417008 [details]
Patch
Comment 14 Chris Dumez 2021-01-05 09:04:42 PST
Comment on attachment 417008 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=417008&action=review

r=me with a few changes.

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:234
> +                RELEASE_ASSERT_NOT_REACHED();

I don't think we should crash the GPU Process here, at least not in release. It is never OK for a WebProcess to send bad IPC to the GPU Process and cause it to crash (since the GPUProcess is shared with other processes).

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:241
> +                RELEASE_ASSERT_NOT_REACHED();

ditto.

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:249
> +                RELEASE_ASSERT_NOT_REACHED();

ditto.

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.h:45
> +#include <wtf/cocoa/MachSemaphore.h>

Seems like this could be forward-declared.

> Source/WebKit/Shared/SharedDisplayListHandle.h:59
> +        NotWaiting = 0,

= 0 seems redundant
Comment 15 Wenson Hsieh 2021-01-06 10:00:20 PST
Comment on attachment 417008 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=417008&action=review

Thanks for the review!

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:234
>> +                RELEASE_ASSERT_NOT_REACHED();
> 
> I don't think we should crash the GPU Process here, at least not in release. It is never OK for a WebProcess to send bad IPC to the GPU Process and cause it to crash (since the GPUProcess is shared with other processes).

Good point. I originally did this to make debugging perf tests a bit easier (by making the crashes more noticeable), with the intention of turning these into debug assertions once I added all the MESSAGE_CHECKs.

I'll just start these off as debug assertions instead.

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:241
>> +                RELEASE_ASSERT_NOT_REACHED();
> 
> ditto.

Fixed!

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:249
>> +                RELEASE_ASSERT_NOT_REACHED();
> 
> ditto.

Fixed!

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.h:45
>> +#include <wtf/cocoa/MachSemaphore.h>
> 
> Seems like this could be forward-declared.

Good catch — fixed.

>> Source/WebKit/Shared/SharedDisplayListHandle.h:59
>> +        NotWaiting = 0,
> 
> = 0 seems redundant

Removed the = 0.
Comment 16 Wenson Hsieh 2021-01-06 11:06:09 PST
Created attachment 417103 [details]
Address comments + rebase
Comment 17 EWS 2021-01-06 12:35:52 PST
Committed r271212: <https://trac.webkit.org/changeset/271212>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 417103 [details].
Comment 18 Chris Dumez 2021-01-07 11:34:32 PST
Comment on attachment 417103 [details]
Address comments + rebase

View in context: https://bugs.webkit.org/attachment.cgi?id=417103&action=review

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226
> +            m_resumeDisplayListSemaphore->waitFor(30_us);

Wait a second.. Isn't this running on the main thread of the GPUProcess? How is it OK to hang the main thread of the GPUProcess for this? The GPUProcess does many other things besides DOM rendering and hanging the main thread for DOM rendering does not seem acceptable.
Comment 19 Geoffrey Garen 2021-01-07 11:41:12 PST
> > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226
> > +            m_resumeDisplayListSemaphore->waitFor(30_us);
> 
> Wait a second.. Isn't this running on the main thread of the GPUProcess?

Yes.

> How
> is it OK to hang the main thread of the GPUProcess for this? The GPUProcess
> does many other things besides DOM rendering and hanging the main thread for
> DOM rendering does not seem acceptable.

I think the theory is that 30us is not a long delay. Just returning to the RunLoop takes a similar amount of time.

That said, what are some specific operations that we should consider that might be harmed by this delay?
Comment 20 Wenson Hsieh 2021-01-07 11:42:09 PST
(In reply to Chris Dumez from comment #18)
> Comment on attachment 417103 [details]
> Address comments + rebase
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=417103&action=review
> 
> > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226
> > +            m_resumeDisplayListSemaphore->waitFor(30_us);
> 
> Wait a second.. Isn't this running on the main thread of the GPUProcess? How
> is it OK to hang the main thread of the GPUProcess for this? The GPUProcess
> does many other things besides DOM rendering and hanging the main thread for
> DOM rendering does not seem acceptable.

The idea is that we'll only perform the wait here if we're receiving such a high rate of display list items that the next IPC message would probably otherwise just be a wakeup message for more DL item processing anyway.

I chose a tiny (30 microsecond) wait duration here, since that is going to be much lower (in practice) than the time needed to propagate a message via IPC. Essentially, this wait should only kick in when the GPUP would otherwise have just gone to sleep anyway.

I think in the future, if we could move DL processing off the main thread, then we'd be able to make this waiting policy much more generous.