Bug 219586 - [Concurrent Display Lists] GPU process should not immediately sleep after reading all available display list items
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit2
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Wenson Hsieh
URL:
Keywords: InRadar
Depends on: 219091 219657 220110
Blocks:
Reported: 2020-12-06 15:41 PST by Wenson Hsieh
Modified: 2021-01-07 11:42 PST
CC: 8 users

See Also:


Attachments
Patch (10.69 KB, patch)
2020-12-06 20:29 PST, Wenson Hsieh
no flags
WIP v2 (25.62 KB, patch)
2020-12-07 14:02 PST, Wenson Hsieh
no flags
Patch (32.04 KB, patch)
2020-12-07 16:10 PST, Wenson Hsieh
no flags
Work in progress (33.57 KB, patch)
2020-12-28 16:36 PST, Wenson Hsieh
no flags
Work in progress (40.61 KB, patch)
2020-12-29 15:17 PST, Wenson Hsieh
no flags
Patch (40.56 KB, patch)
2021-01-05 08:33 PST, Wenson Hsieh
cdumez: review+
cdumez: commit-queue-
Address comments + rebase (40.71 KB, patch)
2021-01-06 11:06 PST, Wenson Hsieh
no flags

Description Wenson Hsieh 2020-12-06 15:41:40 PST
Currently, the GPU process goes to sleep immediately after reading all available display list items. The 512-item wakeup hysteresis partially mitigates this by giving the web process a head start, but on the Lines, Arcs, and Paths canvas benchmarks, I'm still seeing more GPUP sleeps/wakeups than necessary. This bug tracks experimenting with a small (~40-80 µs) sleep interval after hitting 0 unread bytes in the GPU process, so that we can start processing display list items again if the web process writes more item data.
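
A minimal sketch of the idea described above (DisplayListReader and its members are hypothetical stand-ins for illustration, not the actual WebKit classes):

    #include <chrono>
    #include <cstddef>
    #include <thread>

    // Hypothetical stand-in for the shared display list buffer.
    struct DisplayListReader {
        size_t unreadBytes() const { return 0; } // stubbed for illustration
        void consumeNextItem() { }
    };

    void processDisplayListItems(DisplayListReader& reader)
    {
        do {
            while (reader.unreadBytes())
                reader.consumeNextItem();
            // Instead of going back to sleep immediately, give the web process
            // a brief (~40-80 us) window to write more item data before we fall
            // back to waiting for an IPC wakeup message.
            std::this_thread::sleep_for(std::chrono::microseconds(60));
        } while (reader.unreadBytes());
    }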
Comment 1 Wenson Hsieh 2020-12-06 20:29:59 PST Comment hidden (obsolete)
Comment 2 Wenson Hsieh 2020-12-07 14:02:11 PST Comment hidden (obsolete)
Comment 3 Wenson Hsieh 2020-12-07 16:10:50 PST
Created attachment 415599 [details]
Patch
Comment 4 Geoffrey Garen 2020-12-09 11:52:40 PST
Do we know how this sleep behavior compares in performance to a traditional semaphore?
Comment 5 Wenson Hsieh 2020-12-09 12:53:25 PST
(In reply to Geoffrey Garen from comment #4)
> Do we know how this sleep behavior compares in performance to a traditional
> semaphore?

I don't have any data on this, but it would be interesting to try! I'll experiment with this.
Comment 6 Geoffrey Garen 2020-12-09 13:11:05 PST
It looks like pthread mutexes and condition variables can be sent over shared memory directly, if you initialize them using the "*_setpshared" APIs.

The reason I'm curious about a semaphore is that, if the system call to invoke a semaphore is <= 8us, then it is strictly better -- because it is not slower, and it is a more robust way to wait.
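
For reference, a sketch of that initialization (error handling omitted; unrelated processes would typically obtain the shared region via shm_open, while MAP_ANON | MAP_SHARED suffices across a fork()):

    #include <pthread.h>
    #include <sys/mman.h>

    struct SharedWaiter {
        pthread_mutex_t mutex;
        pthread_cond_t condition;
    };

    SharedWaiter* createSharedWaiter()
    {
        // Map a region that both sides can see.
        auto* waiter = static_cast<SharedWaiter*>(mmap(nullptr, sizeof(SharedWaiter),
            PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0));

        pthread_mutexattr_t mutexAttributes;
        pthread_mutexattr_init(&mutexAttributes);
        pthread_mutexattr_setpshared(&mutexAttributes, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&waiter->mutex, &mutexAttributes);
        pthread_mutexattr_destroy(&mutexAttributes);

        pthread_condattr_t conditionAttributes;
        pthread_condattr_init(&conditionAttributes);
        pthread_condattr_setpshared(&conditionAttributes, PTHREAD_PROCESS_SHARED);
        pthread_cond_init(&waiter->condition, &conditionAttributes);
        pthread_condattr_destroy(&conditionAttributes);

        return waiter;
    }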
Comment 7 Geoffrey Garen 2020-12-13 12:05:58 PST
I did a little micro-benchmarking this weekend. Here's an example result:

~/OpenSource/WebKitBuild/Release> ./Scratch
CPU count: 16
iteration count: 10000
-----
no synchronization: 46us
spin: 77us
yield: 75us
semaphore_t: 74us
sleep 2us: 83us
sleep 4us: 87us
sleep 8us: 94us

It looks like a mach semaphore is competitive with the fastest approaches, and slightly faster than just sleeping.

But I'm much more interested in another benefit: with a semaphore, we're not wasting any CPU time re-checking for more work, so we never have to fall back to the "go slow and wait for IPC" path. I think that might be an immediate speedup compared to sleeping, and also a more robust and predictable way to wait across a variety of devices and workloads. (A semaphore even knows how to recover from a crash, so we don't even need a timeout on the wait; but we can add one if we really want to.)
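
For context, a rough reconstruction of the kind of ping-pong micro-benchmark described above, showing only the semaphore_t case (this is not the actual Scratch tool):

    #include <mach/mach.h>
    #include <mach/semaphore.h>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main()
    {
        constexpr int iterations = 10000;
        semaphore_t toConsumer, toProducer;
        semaphore_create(mach_task_self(), &toConsumer, SYNC_POLICY_FIFO, 0);
        semaphore_create(mach_task_self(), &toProducer, SYNC_POLICY_FIFO, 0);

        std::thread consumer([&] {
            for (int i = 0; i < iterations; ++i) {
                semaphore_wait(toConsumer);   // Wait for "work" from the producer.
                semaphore_signal(toProducer); // Acknowledge it.
            }
        });

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            semaphore_signal(toConsumer);
            semaphore_wait(toProducer);
        }
        auto elapsed = std::chrono::steady_clock::now() - start;
        consumer.join();

        printf("semaphore_t: %lldus\n", static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count()));
        return 0;
    }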
Comment 8 Geoffrey Garen 2020-12-13 12:58:41 PST
For sending a semaphore over IPC, it looks like Darwin supports either sending a semaphore_t as a mach port / mach send right, or sending a temporary file name as a string and then using ftok + semget to turn it into a semaphore.

Seems like semaphore_t would be best, if the mach documentation doesn't make your eyes bleed. WTF::MachSendRight and IPC::MachPort might help. And this reference to mach_port_extract_right: https://stackoverflow.com/questions/27078699/share-unnamed-semaphore-between-processes-in-mac-os/27083937#27083937.
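
A sketch of the semaphore_t route along the lines of that reference (error handling omitted; the resulting mach_port_t is what would be wrapped for transport, e.g. in WTF::MachSendRight or IPC::MachPort):

    #include <mach/mach.h>
    #include <mach/semaphore.h>

    semaphore_t createSendableSemaphore(mach_port_t& sendRightOut)
    {
        semaphore_t semaphore = MACH_PORT_NULL;
        semaphore_create(mach_task_self(), &semaphore, SYNC_POLICY_FIFO, 0);

        // A semaphore_t is a mach port name, so we can extract a send right
        // for it; the receiving process can then semaphore_signal() it.
        mach_msg_type_name_t acquiredType;
        mach_port_extract_right(mach_task_self(), semaphore,
            MACH_MSG_TYPE_COPY_SEND, &sendRightOut, &acquiredType);
        return semaphore;
    }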
Comment 9 Radar WebKit Bug Importer 2020-12-13 15:42:16 PST
<rdar://problem/72275412>
Comment 10 Wenson Hsieh 2020-12-28 16:36:28 PST
Created attachment 416833 [details]
Work in progress
Comment 11 Wenson Hsieh 2020-12-28 16:53:57 PST
(In reply to Geoffrey Garen from comment #8)
> For sending a semaphore over IPC, it looks like Darwin supports either
> sending a semaphore_t as a mach port / mach send right, or sending a
> temporary file name as a string and then using ftok + semget to turn it into
> a semaphore.
> 
> Seems like semaphore_t would be best, if the mach documentation doesn't make
> your eyes bleed. WTF::MachSendRight and IPC::MachPort might help. And this
> reference to mach_port_extract_right:
> https://stackoverflow.com/questions/27078699/share-unnamed-semaphore-between-
> processes-in-mac-os/27083937#27083937.

After trying out a couple of versions of a patch that uses mach semaphores for fast wakeups, the strategy that yields the best performance gains seems to be one that only signals the semaphore in the web process if we know that the GPU process is definitely waiting for additional items (indicated by an enum flag).

This is because signaling the semaphore in the web process seems to be a more costly operation than I'd anticipated — earlier, I tried having the web process unconditionally `semaphore_signal_thread` when writing new items and then having the GPU process only resume if the timed wait succeeded (i.e. returned `KERN_SUCCESS`), but this led to significantly worse performance (system tracing revealed that the heaviest stack in the web process by far was the kernel trap underneath `semaphore_signal_thread`).

I briefly explored adding a second (smaller) item count hysteresis for semaphore wakeups, such that we'd wait for 64 items before attempting to signal the semaphore, but this either had no significant impact or resulted in very slightly worse performance. It's possible that a similar policy could still yield better results, but I'll need to carry out more experiments and spend more time tuning the numbers.

In the meantime, the WIP patch I uploaded reflects the current approach, which yields performance numbers on Canvas Lines that are potentially (if only slightly) better than trunk.
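
A sketch of that signaling strategy with hypothetical names (the actual patch keeps a similar flag in the shared display list header; the 30 µs timeout matches the value discussed later in this bug):

    #include <mach/mach.h>
    #include <mach/semaphore.h>
    #include <atomic>
    #include <cstdint>

    enum class WaitingStatus : uint32_t { NotWaiting, Waiting };

    struct SharedHeader {
        std::atomic<WaitingStatus> waitingStatus { WaitingStatus::NotWaiting };
    };

    // Web process: after appending new items, only pay for the kernel trap
    // if the GPU process has declared that it is waiting.
    void didAppendItems(SharedHeader& header, semaphore_t resumeSemaphore)
    {
        auto expected = WaitingStatus::Waiting;
        if (header.waitingStatus.compare_exchange_strong(expected, WaitingStatus::NotWaiting))
            semaphore_signal(resumeSemaphore);
    }

    // GPU process: declare that we're waiting, then block briefly; if the
    // timed wait expires, go back to sleep and rely on a normal IPC wakeup.
    void waitForMoreItems(SharedHeader& header, semaphore_t resumeSemaphore)
    {
        header.waitingStatus.store(WaitingStatus::Waiting);
        mach_timespec_t timeout { 0, 30 * 1000 }; // 30 microseconds
        semaphore_timedwait(resumeSemaphore, timeout);
        header.waitingStatus.store(WaitingStatus::NotWaiting);
    }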
Comment 12 Wenson Hsieh 2020-12-29 15:17:58 PST
Created attachment 416843 [details]
Work in progress
Comment 13 Wenson Hsieh 2021-01-05 08:33:53 PST
Created attachment 417008 [details]
Patch
Comment 14 Chris Dumez 2021-01-05 09:04:42 PST
Comment on attachment 417008 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=417008&action=review

r=me with a few changes.

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:234
> +                RELEASE_ASSERT_NOT_REACHED();

I don't think we should crash the GPU Process here, at least not in release. It is never OK for a WebProcess to send bad IPC to the GPU Process and cause it to crash (since the GPUProcess is shared with other processes).

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:241
> +                RELEASE_ASSERT_NOT_REACHED();

ditto.

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:249
> +                RELEASE_ASSERT_NOT_REACHED();

ditto.

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.h:45
> +#include <wtf/cocoa/MachSemaphore.h>

Seems like this could be forward-declared.

> Source/WebKit/Shared/SharedDisplayListHandle.h:59
> +        NotWaiting = 0,

= 0 seems redundant
Comment 15 Wenson Hsieh 2021-01-06 10:00:20 PST
Comment on attachment 417008 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=417008&action=review

Thanks for the review!

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:234
>> +                RELEASE_ASSERT_NOT_REACHED();
> 
> I don't think we should crash the GPU Process here, at least not in release. It is never OK for a WebProcess to send bad IPC to the GPU Process and cause it to crash (since the GPUProcess is shared with other processes).

Good point. I originally did this to make debugging perf tests a bit easier (by making the crashes more noticeable), with the intention of turning these into debug assertions once I added all the MESSAGE_CHECKs.

I'll just start these off as debug assertions instead.

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:241
>> +                RELEASE_ASSERT_NOT_REACHED();
> 
> ditto.

Fixed!

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:249
>> +                RELEASE_ASSERT_NOT_REACHED();
> 
> ditto.

Fixed!

>> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.h:45
>> +#include <wtf/cocoa/MachSemaphore.h>
> 
> Seems like this could be forward-declared.

Good catch — fixed.

>> Source/WebKit/Shared/SharedDisplayListHandle.h:59
>> +        NotWaiting = 0,
> 
> = 0 seems redundant

Removed the = 0.
Comment 16 Wenson Hsieh 2021-01-06 11:06:09 PST
Created attachment 417103 [details]
Address comments + rebase
Comment 17 EWS 2021-01-06 12:35:52 PST
Committed r271212: <https://trac.webkit.org/changeset/271212>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 417103 [details].
Comment 18 Chris Dumez 2021-01-07 11:34:32 PST
Comment on attachment 417103 [details]
Address comments + rebase

View in context: https://bugs.webkit.org/attachment.cgi?id=417103&action=review

> Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226
> +            m_resumeDisplayListSemaphore->waitFor(30_us);

Wait a second.. Isn't this running on the main thread of the GPUProcess? How is it OK to hang the main thread of the GPUProcess for this? The GPUProcess does many other things besides DOM rendering and hanging the main thread for DOM rendering does not seem acceptable.
Comment 19 Geoffrey Garen 2021-01-07 11:41:12 PST
> > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226
> > +            m_resumeDisplayListSemaphore->waitFor(30_us);
> 
> Wait a second.. Isn't this running on the main thread of the GPUProcess?

Yes.

> How
> is it OK to hang the main thread of the GPUProcess for this? The GPUProcess
> does many other things besides DOM rendering and hanging the main thread for
> DOM rendering does not seem acceptable.

I think the theory is that 30us is not a long delay. Just returning to the RunLoop takes a similar amount of time.

That said, what are some specific operations that we should consider that might be harmed by this delay?
Comment 20 Wenson Hsieh 2021-01-07 11:42:09 PST
(In reply to Chris Dumez from comment #18)
> Comment on attachment 417103 [details]
> Address comments + rebase
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=417103&action=review
> 
> > Source/WebKit/GPUProcess/graphics/RemoteRenderingBackend.cpp:226
> > +            m_resumeDisplayListSemaphore->waitFor(30_us);
> 
> Wait a second.. Isn't this running on the main thread of the GPUProcess? How
> is it OK to hang the main thread of the GPUProcess for this? The GPUProcess
> does many other things besides DOM rendering and hanging the main thread for
> DOM rendering does not seem acceptable.

The idea is that we'll only perform the wait here if we're receiving such a high rate of display list items that the next IPC message would probably otherwise just be a wakeup message for more DL item processing anyway.

I chose a tiny (30 microsecond) wait duration here, since that is going to be much lower (in practice) than the time needed to propagate a message via IPC. Essentially, this wait should only kick in when the GPUP would otherwise have just gone to sleep anyway.

I think in the future, if we could move DL processing off the main thread, then we'd be able to make this waiting policy much more generous.