RESOLVED FIXED Bug 219586
[Concurrent Display Lists] GPU process should not immediately sleep after reading all available display list items
https://bugs.webkit.org/show_bug.cgi?id=219586
Summary [Concurrent Display Lists] GPU process should not immediately sleep after rea...
Wenson Hsieh
Reported 2020-12-06 15:41:40 PST
Currently, the GPU process goes to sleep immediately after reading all available display list items. The 512-item wakeup hysteresis partially mitigates this by giving the web process a head start, but on Lines, Arcs, and Paths, I'm still seeing more GPUP sleeps/wakeups than necessary. This bug tracks experimenting with a small (~40-80 µs) sleep interval after hitting 0 unread bytes in the GPU process, such that we can start processing display list items again if the web process writes more item data.
Attachments
Patch (10.69 KB, patch)
2020-12-06 20:29 PST, Wenson Hsieh
no flags
WIP v2 (25.62 KB, patch)
2020-12-07 14:02 PST, Wenson Hsieh
no flags
Patch (32.04 KB, patch)
2020-12-07 16:10 PST, Wenson Hsieh
no flags
Work in progress (33.57 KB, patch)
2020-12-28 16:36 PST, Wenson Hsieh
no flags
Work in progress (40.61 KB, patch)
2020-12-29 15:17 PST, Wenson Hsieh
no flags
Patch (40.56 KB, patch)
2021-01-05 08:33 PST, Wenson Hsieh
cdumez: review+
cdumez: commit-queue-
Address comments + rebase (40.71 KB, patch)
2021-01-06 11:06 PST, Wenson Hsieh
no flags
Wenson Hsieh
Comment 1 2020-12-06 20:29:59 PST
Comment hidden (obsolete)
Wenson Hsieh
Comment 2 2020-12-07 14:02:11 PST
Comment hidden (obsolete)
Wenson Hsieh
Comment 3 2020-12-07 16:10:50 PST
Geoffrey Garen
Comment 4 2020-12-09 11:52:40 PST
Do we know how this sleep behavior compares in performance to a traditional semaphore?
Wenson Hsieh
Comment 5 2020-12-09 12:53:25 PST
(In reply to Geoffrey Garen from comment #4) > Do we know how this sleep behavior compares in performance to a traditional > semaphore? I don't have any data on this, but it would be interesting to try! I'll experiment with this.
Geoffrey Garen
Comment 6 2020-12-09 13:11:05 PST
It looks like pthread mutexes and condition variables can be sent over shared memory directly, if you initialize them using the "*_setpshared" APIs. The reason I'm curious about a semaphore is that, if the system call to invoke a semaphore is <= 8us, then it is strictly better -- because it is not slower, and it is a more robust way to wait.
Geoffrey Garen
Comment 7 2020-12-13 12:05:58 PST
I did a little micro-benchmarking this weekend. Here's an example result: ~/OpenSource/WebKitBuild/Release> ./Scratch CPU count: 16 iteration count: 10000 ----- no synchronization: 46us spin: 77us yield: 75us semaphore_t: 74us sleep 2us: 83us sleep 4us: 87us sleep 8us: 94us It looks like a mach semaphore is competitive with the fastest approaches, and slightly faster than just sleeping. But I'm much more interested in another benefit: With a semaphore, we're not waisting any CPU time re-checking for more work, so we never have to fall back to the "go slow and wait for IPC" path. I think that might be an immediate speedup compared to sleeping; and also a more robust and predictable way to wait across a variety of devices and workloads. (A semaphore even knows how to recover from a crash, so we don't even need a timeout on the wait; but we can add one if we really want to.)
Geoffrey Garen
Comment 8 2020-12-13 12:58:41 PST
For sending a semaphore over IPC, it looks like Darwin supports either sending a semaphore_t as a mach port / mach send right, or sending a temporary file name as a string and then using ftok + semget to turn it into a semaphore. Seems like semaphore_t would be best, if the mach documentation doesn't make your eyes bleed. WTF::MachSendRight and IPC::MachPort might help. And this reference to mach_port_extract_right: https://stackoverflow.com/questions/27078699/share-unnamed-semaphore-between-processes-in-mac-os/27083937#27083937.
Radar WebKit Bug Importer
Comment 9 2020-12-13 15:42:16 PST
Wenson Hsieh
Comment 10 2020-12-28 16:36:28 PST
Created attachment 416833 [details] Work in progress
Wenson Hsieh
Comment 11 2020-12-28 16:53:57 PST
(In reply to Geoffrey Garen from comment #8) > For sending a semaphore over IPC, it looks like Darwin supports either > sending a semaphore_t as a mach port / mach send right, or sending a > temporary file name as a string and then using ftok + semget to turn it into > a semaphore. > > Seems like semaphore_t would be best, if the mach documentation doesn't make > your eyes bleed. WTF::MachSendRight and IPC::MachPort might help. And this > reference to mach_port_extract_right: > https://stackoverflow.com/questions/27078699/share-unnamed-semaphore-between- > processes-in-mac-os/27083937#27083937. After trying out a couple versions of a patch to use mach semaphores for fast wakeups, the strategy that seems to yield the best performance gains seems to be one that only signals the semaphore in the web process if we know that the GPU process is definitely waiting for additional items (indicated by an enum flag). This is because signaling the semaphore in the web process seems to be a more costly operation than I'd anticipated — earlier, I tried having the web process unconditionally `semaphore_signal_thread` when writing new items and then having the GPU process only resume if the timed wait succeeded (i.e. returned `KERN_SUCCESS`), but this led to significantly worse performance (system tracing revealed that the heaviest stack in the web process by far was the kernel trap underneath `semaphore_signal_thread`). I briefly explored adding a second (smaller) item count hysteresis for semaphore wakeups, such that we'd wait for 64 items before attempting to signal the semaphore, but this either didn't seem to have any significant impact, or resulted in very slightly worse performance. It's possible that a similar policy could yield better results though, but I'll need to carry out some more experiments and spend some more time tuning the numbers. In the meantime, the WIP patch I uploaded is the current approach that seems to yield performance numbers on Canvas Lines that are potentially better than trunk (if only slightly).
Wenson Hsieh
Comment 12 2020-12-29 15:17:58 PST
Created attachment 416843 [details] Work in progress
Wenson Hsieh
Comment 13 2021-01-05 08:33:53 PST
Chris Dumez
Comment 14 2021-01-05 09:04:42 PST
Comment on attachment 417008 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=417008&action=review r=me with a few changes. > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:234 > + RELEASE_ASSERT_NOT_REACHED(); I don't think we should crash the GPU Process here, at least not in release. It is never OK for a WebProcess to send bad IPC to the GPU Process and cause it to crash (since the GPUProcess is shared with other processes). > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:241 > + RELEASE_ASSERT_NOT_REACHED(); ditto. > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:249 > + RELEASE_ASSERT_NOT_REACHED(); ditto. > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.h:45 > +#include <wtf/cocoa/MachSemaphore.h> Seems like this could be forward-declared. > Source/WebKit/Shared/SharedDisplayListHandle.h:59 > + NotWaiting = 0, = 0 seems redundant
Wenson Hsieh
Comment 15 2021-01-06 10:00:20 PST
Comment on attachment 417008 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=417008&action=review Thanks for the review! >> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:234 >> + RELEASE_ASSERT_NOT_REACHED(); > > I don't think we should crash the GPU Process here, at least not in release. It is never OK for a WebProcess to send bad IPC to the GPU Process and cause it to crash (since the GPUProcess is shared with other processes). Good point. I originally did this to make debugging perf tests a bit easier (by making the crashes more noticeable), with the intention of turning these into debug assertions once I added all the MESSAGE_CHECKs. I'll just start these off as debug assertions instead. >> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:241 >> + RELEASE_ASSERT_NOT_REACHED(); > > ditto. Fixed! >> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:249 >> + RELEASE_ASSERT_NOT_REACHED(); > > ditto. Fixed! >> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.h:45 >> +#include <wtf/cocoa/MachSemaphore.h> > > Seems like this could be forward-declared. Good catch — fixed. >> Source/WebKit/Shared/SharedDisplayListHandle.h:59 >> + NotWaiting = 0, > > = 0 seems redundant Removed the = 0.
Wenson Hsieh
Comment 16 2021-01-06 11:06:09 PST
Created attachment 417103 [details] Address comments + rebase
EWS
Comment 17 2021-01-06 12:35:52 PST
Committed r271212: <https://trac.webkit.org/changeset/271212> All reviewed patches have been landed. Closing bug and clearing flags on attachment 417103 [details].
Chris Dumez
Comment 18 2021-01-07 11:34:32 PST
Comment on attachment 417103 [details] Address comments + rebase View in context: https://bugs.webkit.org/attachment.cgi?id=417103&action=review > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226 > + m_resumeDisplayListSemaphore->waitFor(30_us); Wait a second.. Isn't this running on the main thread of the GPUProcess? How is it OK to hang the main thread of the GPUProcess for this? The GPUProcess does many other things besides DOM rendering and hanging the main thread for DOM rendering does not seem acceptable.
Geoffrey Garen
Comment 19 2021-01-07 11:41:12 PST
> > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226 > > + m_resumeDisplayListSemaphore->waitFor(30_us); > > Wait a second.. Isn't this running on the main thread of the GPUProcess? Yes. > How > is it OK to hang the main thread of the GPUProcess for this? The GPUProcess > does many other things besides DOM rendering and hanging the main thread for > DOM rendering does not seem acceptable. I think the theory is that 30us is not a long delay. Just returning to the RunLoop takes a similar amount of time. That said, what are some specific operations that we should consider, that might be harmed by this delay?
Wenson Hsieh
Comment 20 2021-01-07 11:42:09 PST
(In reply to Chris Dumez from comment #18) > Comment on attachment 417103 [details] > Address comments + rebase > > View in context: > https://bugs.webkit.org/attachment.cgi?id=417103&action=review > > > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226 > > + m_resumeDisplayListSemaphore->waitFor(30_us); > > Wait a second.. Isn't this running on the main thread of the GPUProcess? How > is it OK to hang the main thread of the GPUProcess for this? The GPUProcess > does many other things besides DOM rendering and hanging the main thread for > DOM rendering does not seem acceptable. The idea is that we'll only perform the wait here if we're receiving such a high rate of display list items, that the next IPC message would probably otherwise be a wakeup message for more DL item processing anyways. I chose a tiny (30 microsecond) wait duration here, since that is going to be much lower (in practice) than the time needed to propagate a message via IPC. Essentially, this wait should only kick in when the GPUP would have otherwise just gone to sleep anyways. I think in the future, if we could move DL processing off the main thread, then we'd be able to make this waiting policy much more generous.
Note You need to log in before you can comment on or make changes to this bug.