Bug 202874

Summary: Use a Version 1 CFRunLoopSource for faster task dispatch
Product: WebKit Reporter: Sihui Liu <sihui_liu>
Component: New BugsAssignee: Geoffrey Garen <ggaren>
Status: REOPENED ---    
Severity: Normal CC: andersca, benjamin, cdumez, cmarcelo, commit-queue, dbates, ews-watchlist, ggaren, koivisto, nham, pvollan, rniwa, sam, sihui_liu, simon.fraser, thorton, tsavell, webkit-bug-importer, wenson_hsieh, youennf
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=203176
https://bugs.webkit.org/show_bug.cgi?id=203157
https://bugs.webkit.org/show_bug.cgi?id=203173
https://bugs.webkit.org/show_bug.cgi?id=203118
https://bugs.webkit.org/show_bug.cgi?id=219401
Bug Depends on: 213043, 219656, 219717, 219785    
Bug Blocks:    
Attachments:
Description Flags
Patch
none
patch - mach message
none
patch - mach message
none
patch - getUserMedia.html
none
patch - mach message
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch for landing
none
Patch
none
Patch
none
Patch
none
Patch - just the RunLoop parts
koivisto: review+, koivisto: commit-queue-
Patch
ews-feeder: commit-queue-
Patch
none
Patch
none
Patch
none
microbenchmark
none
Patch none

Description Sihui Liu 2019-10-11 19:10:30 PDT
...
Comment 1 Sihui Liu 2019-10-11 19:22:52 PDT
Created attachment 380807 [details]
Patch
Comment 2 Geoffrey Garen 2019-10-12 11:36:10 PDT
Comment on attachment 380807 [details]
Patch

It looks like some tests are timing out. My best guess is that some calls to RunLoop::dispatch() implicitly rely on ordering, and when we don’t use the explicit function queue, we break that ordering. 

I think it’s worth testing dispatch_async as the implementation of wakeUp. (We tested that once, but only with a broken patch that called isMain(), which is incorrect). That way, we preserve ordering. 

I’d change wakeUp to take an argument indicating whether this == s_mainRunLoop. Then, if true, the CF implementation should use dispatch_async to call the function that executes all tasks, rather than CFRunLoopWakeup. 

If that’s not fast enough, or it breaks things, then we should just add an inline function that calls dispatch_async or RunLoop::dispatch, depending on platform. And then call that just from IDB code.
Comment 3 Geoffrey Garen 2019-10-13 08:48:05 PDT
Created attachment 380845 [details]
patch - mach message

Out of curiosity, I disassembled libdispatch to see what method it uses to signal the RunLoop. It looks like libdispatch uses a "version 1" RunLoop source <https://developer.apple.com/documentation/corefoundation/cfrunloopsource?language=objc>, which triggers by mach message, rather than CFRunLoopWakeup(). I've duplicated that behavior here.

Can you test the performance of this approach?
Comment 4 Geoffrey Garen 2019-10-13 08:48:55 PDT
(In reply to Geoffrey Garen from comment #3)
(I verified that this patch passes the css3 layout tests.)
Comment 5 Geoffrey Garen 2019-10-13 08:52:03 PDT
Created attachment 380846 [details]
patch - mach message
Comment 6 Geoffrey Garen 2019-10-13 14:20:42 PDT
Created attachment 380851 [details]
patch - getUserMedia.html

It looks like the iOS API test failure for TestWebKitAPI.WebKit2.CaptureMute is real. The test is racy, and its timing has changed slightly. Here's a patch that should make the test more deterministic.
Comment 7 Sihui Liu 2019-10-13 23:35:16 PDT
(In reply to Geoffrey Garen from comment #3)
> Created attachment 380845 [details]
> patch - mach message
> 
> Out of curiosity, I disassembled libdispatch to see what method it uses to
> signal the RunLoop. It looks like libdispatch uses a "version 1" RunLoop
> source
> <https://developer.apple.com/documentation/corefoundation/
> cfrunloopsource?language=objc>, which triggers by mach message, rather than
> CFRunLoopWakeup(). I've duplicated that behavior here.
> 
> Can you test the performance of this approach?

It's about 15% speed up on the test PerformanceTests/IndexedDB/basic/objectstore-get.html.
Comment 8 Geoffrey Garen 2019-10-14 09:42:25 PDT
> > Can you test the performance of this approach?
> 
> It's about 15% speed up on the test
> PerformanceTests/IndexedDB/basic/objectstore-get.html.

So, still slower than dispatch_async?
Comment 9 Geoffrey Garen 2019-10-14 20:24:49 PDT
Oh, actually, it's expected that the patch above doesn't achieve all 20%, because it doesn't include the fix for callOnMainThread. I'll upload a patch that includes callOnMainThread and the regression test change all in one.
Comment 10 Geoffrey Garen 2019-10-14 20:26:46 PDT
Created attachment 380952 [details]
patch - mach message
Comment 11 Sihui Liu 2019-10-15 00:49:17 PDT
(In reply to Geoffrey Garen from comment #9)
> Oh, actually, it's expected that the patch above doesn't achieve all 20%,
> because it doesn't include the fix for callOnMainThread. I'll upload a patch
> that includes callOnMainThread and the regression test change all in one.

I previously made the change of the callOnMainThread locally and tested. I tried with the latest patch and it seemed it's still not as fast as using dispatch_async directly. It's close though.

Also I debugged the layout test failure of the dispatch_async patch. The cause is that we try to block main thread when waiting for reply of sync messages. Our implementation is:
1. Start a task on main thread
2. While not receiving reply, ask RunLoop to run. RunLoop can listen for input sources and processes data from sources(receiving and handling reply) at this time. (Thread is blocked.)
3. Continue to finish the task
It looks like when we do dispatch_async, the RunLoop doesn't think it is a source and thus time out at step2.
Comment 12 Geoffrey Garen 2019-10-15 10:37:57 PDT
(In reply to Sihui Liu from comment #11)
> (In reply to Geoffrey Garen from comment #9)
> > Oh, actually, it's expected that the patch above doesn't achieve all 20%,
> > because it doesn't include the fix for callOnMainThread. I'll upload a patch
> > that includes callOnMainThread and the regression test change all in one.
> 
> I previously made the change of the callOnMainThread locally and tested. I
> tried with the latest patch and it seemed it's still not as fast as using
> dispatch_async directly. It's close though.
> 
> Also I debugged the layout test failure of the dispatch_async patch. The
> cause is that we try to block main thread when waiting for reply of sync
> messages. Our implementation is:
> 1. Start a task on main thread
> 2. While not receiving reply, ask RunLoop to run. RunLoop can listen for
> input sources and processes data from sources(receiving and handling reply)
> at this time. (Thread is blocked.)
> 3. Continue to finish the task
> It looks like when we do dispatch_async, the RunLoop doesn't think it is a
> source and thus time out at step2.

Is this in Connection::waitForSyncReply? Or something else?

I suspect that dispatch_async is not willing to execute inside a nested RunLoop. I believe that's a fundamental difference between libdispatch and CFRunLoop. (libdispatch requires all tasks to dispatch in order, and that means prohibiting re-entrant dispatch. Relatedly, libdispatch explicitly documents that dispatch_sync(dispatch_get_current_queue(), ...) is always a deadlock.)

So, I guess we can't use dispatch_async.

Let's try to finish up the mach_message / version 1 RunLoopSource solution, since it delivers most of the speedup. Can you do that?

Outstanding issues:

1. inspector/css/pseudo-creation.html fails. It used to fail sometimes (e.g. https://build.webkit.org/builders/Apple%20High%20Sierra%20Debug%20WK1%20(Tests)/builds/11323), and now it fails all the time.

So, the code has gotten better (more deterministic), but we broke a test. It looks like something about InspectorProtocol.eventHandler is racy relative to promise resolution. We need to fix this, or perhaps disable the test.

2. TestWebKitAPI.WebKitLegacy.ScrollingDoesNotPauseMedia times out. It must either time out in its call to Util::run, or its call to callOnMainThreadAndWait. I think it's probably callOnMainThreadAndWait. If so, this test has a simple bug. It calls "callOnMainThreadAndWait" *from the main thread*. I'm not sure exactly why this ever worked, but I think you can fix it by just switching to callOnMainThread.
Comment 13 Sihui Liu 2019-10-16 13:12:41 PDT
Created attachment 381101 [details]
Patch
Comment 14 Sihui Liu 2019-10-16 13:27:40 PDT
Created attachment 381103 [details]
Patch
Comment 15 Sam Weinig 2019-10-16 13:35:05 PDT
Comment on attachment 381103 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=381103&action=review

> Source/WTF/ChangeLog:9
> +        We used CFRunLoopWakeUp to wake up runloop to process source, which seems to be slow according to profiling. To 
> +        avoid calling CFRunLoopWakeUp, use CFRunloopSource1 instead of CFRunloopSource0.

What is CFRunloopSource1? Why is this faster? If it's not the current default, what is the reason it's not and what potential downsides are there?
Comment 16 Geoffrey Garen 2019-10-16 13:36:11 PDT
Comment on attachment 381103 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=381103&action=review

Looks great, and thanks for fixing those tests. One comment on callOonMainThread.

> Source/WTF/wtf/cocoa/MainThreadCocoa.mm:136
> -    [staticMainThreadCaller performSelectorOnMainThread:@selector(call) withObject:nil waitUntilDone:NO];
> +    dispatch_async(dispatch_get_main_queue(), ^{
> +        [staticMainThreadCaller call];
> +    });

You can remove some indirection here by deleting staticMainThreadCaller and JSWTFMainThreadCaller and just calling WTF::dispatchFunctionsFromMainThread() directly. staticMainThreadCaller and JSWTFMainThreadCaller only existed for the sake of doing the async dispatch, which you're doing a different way now.

Side note: Is dispatch_async better than RunLoop::main().dispatch() in this case? Also, is it OK that callOnMainThread will not run in nested run loops?
Comment 17 Sihui Liu 2019-10-16 14:11:56 PDT
Created attachment 381106 [details]
Patch
Comment 18 Sihui Liu 2019-10-16 14:51:29 PDT
(In reply to Sam Weinig from comment #15)
> Comment on attachment 381103 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=381103&action=review
> 
> > Source/WTF/ChangeLog:9
> > +        We used CFRunLoopWakeUp to wake up runloop to process source, which seems to be slow according to profiling. To 
> > +        avoid calling CFRunLoopWakeUp, use CFRunloopSource1 instead of CFRunloopSource0.
> 
> What is CFRunloopSource1? Why is this faster? If it's not the current
> default, what is the reason it's not and what potential downsides are there?

It's version 1 source in CFRunloop: 
https://developer.apple.com/documentation/corefoundation/cfrunloopsource-rhr

According to our profiling on test like OpenSource/PerformanceTests/IndexedDB/basic/objectstore-get.html, we spent a lot of time on waking up the runloop when receiving IPC message. And we tried libdispatch, which uses version 1 source for dispatching to runloop, and is much faster. So we think we could probably adopt this for better performance.

I don't know why version 0 source is our default..
Comment 19 Sihui Liu 2019-10-16 15:06:02 PDT
(In reply to Geoffrey Garen from comment #16)
> Comment on attachment 381103 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=381103&action=review
> 
> Looks great, and thanks for fixing those tests. One comment on
> callOonMainThread.
> 
> > Source/WTF/wtf/cocoa/MainThreadCocoa.mm:136
> > -    [staticMainThreadCaller performSelectorOnMainThread:@selector(call) withObject:nil waitUntilDone:NO];
> > +    dispatch_async(dispatch_get_main_queue(), ^{
> > +        [staticMainThreadCaller call];
> > +    });
> 
> You can remove some indirection here by deleting staticMainThreadCaller and
> JSWTFMainThreadCaller and just calling
> WTF::dispatchFunctionsFromMainThread() directly. staticMainThreadCaller and
> JSWTFMainThreadCaller only existed for the sake of doing the async dispatch,
> which you're doing a different way now.
> 
staticMainThreadCaller is used in the #if USE(WEB_THREAD) case, which is a few lines above.
I would change this to WTF::dispatchFunctionsFromMainThread().

> Side note: Is dispatch_async better than RunLoop::main().dispatch() in this
> case? Also, is it OK that callOnMainThread will not run in nested run loops?

The reason why I didn't use RunLoop::main().dispatch() because RunLoop::main() needs to run after RunLoop::initializeMainRunLoop() and
callOnMainThread does not seem to care about whether we have initialized mainRunLoop or not.

Why can't callOnMainThread run in nested run loops?
Comment 20 Sihui Liu 2019-10-16 15:31:50 PDT
Created attachment 381124 [details]
Patch
Comment 21 Sam Weinig 2019-10-16 16:46:16 PDT
(In reply to Sihui Liu from comment #18)
> (In reply to Sam Weinig from comment #15)
> > Comment on attachment 381103 [details]
> > Patch
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=381103&action=review
> > 
> > > Source/WTF/ChangeLog:9
> > > +        We used CFRunLoopWakeUp to wake up runloop to process source, which seems to be slow according to profiling. To 
> > > +        avoid calling CFRunLoopWakeUp, use CFRunloopSource1 instead of CFRunloopSource0.
> > 
> > What is CFRunloopSource1? Why is this faster? If it's not the current
> > default, what is the reason it's not and what potential downsides are there?
> 
> It's version 1 source in CFRunloop: 
> https://developer.apple.com/documentation/corefoundation/cfrunloopsource-rhr
> 
> According to our profiling on test like
> OpenSource/PerformanceTests/IndexedDB/basic/objectstore-get.html, we spent a
> lot of time on waking up the runloop when receiving IPC message. And we
> tried libdispatch, which uses version 1 source for dispatching to runloop,
> and is much faster. So we think we could probably adopt this for better
> performance.

Does that mean this is not a speedup on it's own?
Comment 22 Sihui Liu 2019-10-16 17:22:19 PDT
(In reply to Sam Weinig from comment #21)
> (In reply to Sihui Liu from comment #18)
> > (In reply to Sam Weinig from comment #15)
> > > Comment on attachment 381103 [details]
> > > Patch
> > > 
> > > View in context:
> > > https://bugs.webkit.org/attachment.cgi?id=381103&action=review
> > > 
> > > > Source/WTF/ChangeLog:9
> > > > +        We used CFRunLoopWakeUp to wake up runloop to process source, which seems to be slow according to profiling. To 
> > > > +        avoid calling CFRunLoopWakeUp, use CFRunloopSource1 instead of CFRunloopSource0.
> > > 
> > > What is CFRunloopSource1? Why is this faster? If it's not the current
> > > default, what is the reason it's not and what potential downsides are there?
> > 
> > It's version 1 source in CFRunloop: 
> > https://developer.apple.com/documentation/corefoundation/cfrunloopsource-rhr
> > 
> > According to our profiling on test like
> > OpenSource/PerformanceTests/IndexedDB/basic/objectstore-get.html, we spent a
> > lot of time on waking up the runloop when receiving IPC message. And we
> > tried libdispatch, which uses version 1 source for dispatching to runloop,
> > and is much faster. So we think we could probably adopt this for better
> > performance.
> 
> Does that mean this is not a speedup on it's own?

It's a speedup on it's own. The patch gives us about 15% speedup on the test we profiled (PerformanceTests/IndexedDB/basic/objectstore-get.html). It saves time by eliminating CFRunLoopWakeUps from the trace.
Comment 23 Geoffrey Garen 2019-10-16 21:16:51 PDT
> staticMainThreadCaller is used in the #if USE(WEB_THREAD) case, which is a
> few lines above.

You can change that case to stop using staticMainThreadCaller too.

> > Side note: Is dispatch_async better than RunLoop::main().dispatch() in this
> > case? Also, is it OK that callOnMainThread will not run in nested run loops?
> 
> The reason why I didn't use RunLoop::main().dispatch() because
> RunLoop::main() needs to run after RunLoop::initializeMainRunLoop() and
> callOnMainThread does not seem to care about whether we have initialized
> mainRunLoop or not.

callOnMainThread depends on initializeMainThread() to initialize staticMainThreadCaller.

> Why can't callOnMainThread run in nested run loops?

Because dispatch_async doesn't run in nested run loops. (I confirmed this theory at https://www.thecave.com/2015/08/10/dispatch-async-to-main-queue-doesnt-work-with-modal-window-on-mac-os-x/.)

If there isn't a specific advantage to dispatch_async here, maybe RunLoop::main() is better, to avoid this minor change in behavior.
Comment 24 Sihui Liu 2019-10-17 01:30:06 PDT
Created attachment 381180 [details]
Patch
Comment 25 Sam Weinig 2019-10-17 07:59:00 PDT
(In reply to Sihui Liu from comment #22)
> (In reply to Sam Weinig from comment #21)
> > (In reply to Sihui Liu from comment #18)
> > > (In reply to Sam Weinig from comment #15)
> > > > Comment on attachment 381103 [details]
> > > > Patch
> > > > 
> > > > View in context:
> > > > https://bugs.webkit.org/attachment.cgi?id=381103&action=review
> > > > 
> > > > > Source/WTF/ChangeLog:9
> > > > > +        We used CFRunLoopWakeUp to wake up runloop to process source, which seems to be slow according to profiling. To 
> > > > > +        avoid calling CFRunLoopWakeUp, use CFRunloopSource1 instead of CFRunloopSource0.
> > > > 
> > > > What is CFRunloopSource1? Why is this faster? If it's not the current
> > > > default, what is the reason it's not and what potential downsides are there?
> > > 
> > > It's version 1 source in CFRunloop: 
> > > https://developer.apple.com/documentation/corefoundation/cfrunloopsource-rhr
> > > 
> > > According to our profiling on test like
> > > OpenSource/PerformanceTests/IndexedDB/basic/objectstore-get.html, we spent a
> > > lot of time on waking up the runloop when receiving IPC message. And we
> > > tried libdispatch, which uses version 1 source for dispatching to runloop,
> > > and is much faster. So we think we could probably adopt this for better
> > > performance.
> > 
> > Does that mean this is not a speedup on it's own?
> 
> It's a speedup on it's own. The patch gives us about 15% speedup on the test
> we profiled (PerformanceTests/IndexedDB/basic/objectstore-get.html). It
> saves time by eliminating CFRunLoopWakeUps from the trace.

Very nice!
Comment 26 Geoffrey Garen 2019-10-17 10:40:20 PDT
Comment on attachment 381180 [details]
Patch

r=me
Comment 27 Sihui Liu 2019-10-17 12:58:29 PDT
Created attachment 381217 [details]
Patch for landing
Comment 28 WebKit Commit Bot 2019-10-17 14:09:57 PDT
The commit-queue encountered the following flaky tests while processing attachment 381217 [details]:

inspector/console/webcore-logging.html bug 203118 (authors: drousso@apple.com and eric.carlson@apple.com)
The commit-queue is continuing to process your patch.
Comment 29 WebKit Commit Bot 2019-10-17 14:10:03 PDT
The commit-queue encountered the following flaky tests while processing attachment 381217 [details]:

imported/w3c/web-platform-tests/websockets/bufferedAmount-unchanged-by-sync-xhr.any.worker.html bug 202003 (author: youennf@gmail.com)
The commit-queue is continuing to process your patch.
Comment 30 WebKit Commit Bot 2019-10-17 15:05:30 PDT
Comment on attachment 381217 [details]
Patch for landing

Clearing flags on attachment: 381217

Committed r251261: <https://trac.webkit.org/changeset/251261>
Comment 31 WebKit Commit Bot 2019-10-17 15:05:33 PDT
All reviewed patches have been landed.  Closing bug.
Comment 32 Radar WebKit Bug Importer 2019-10-17 15:06:29 PDT
<rdar://problem/56387205>
Comment 33 Truitt Savell 2019-10-18 09:34:03 PDT
It looks like the changes in https://trac.webkit.org/changeset/251261/webkit 

broke inspector/console/webcore-logging.html

History:
https://results.webkit.org/?suite=layout-tests&test=inspector%2Fconsole%2Fwebcore-logging.html

Diff:
--- /Volumes/Data/slave/mojave-release-tests-wk1/build/layout-test-results/inspector/console/webcore-logging-expected.txt
+++ /Volumes/Data/slave/mojave-release-tests-wk1/build/layout-test-results/inspector/console/webcore-logging-actual.txt
@@ -27,7 +27,9 @@
 -- Running test case: Console.Logging.MediaLogging
 PASS: Media logging disabled.
 PASS: Media logging has been enabled.
-PASS: Media log message should have source 'media'.
+FAIL: Media log message should have source 'media'.
+    Expected: truthy
+    Actual: false
 
 -- Running test case: Console.Logging.LogAsJSONWithoutRepeat
 PASS: Media logging disabled.

This reproduces easily by running the test in iterations. It fails on 251261 but not 251260.
Comment 34 Truitt Savell 2019-10-18 09:39:55 PDT
This also broke inspector/runtime/getProperties.html

History
https://results.webkit.org/?suite=layout-tests&test=inspector%2Fruntime%2FgetProperties.html
Comment 35 Truitt Savell 2019-10-23 17:06:03 PDT
Reverted r251261 for reason:

This broke multiple tests

Committed r251514: <https://trac.webkit.org/changeset/251514>
Comment 36 Sihui Liu 2019-10-23 17:45:12 PDT
(In reply to Truitt Savell from comment #35)
> Reverted r251261 for reason:
> 
> This broke multiple tests
> 
> Committed r251514: <https://trac.webkit.org/changeset/251514>

inspector/console/webcore-logging.html was fixed by r251482 (and got reverted too), which fixed the test and made another meaningful bug fix.

inspector/runtime/getProperties.html is caused by that the test file doesn't handle error case correctly. (The diff is one error console message.)

scrollingcoordinator/ios/non-stable-viewport-scroll.html is flaky by its nature, as I commented in rdar://problem/56410921 and I replied in Slack.

Are there any other failure caused by this patch? If yes, what are they?
If no, what led this to be reverted?
Comment 37 Truitt Savell 2019-10-24 08:06:24 PDT
(In reply to Sihui Liu from comment #36)
> (In reply to Truitt Savell from comment #35)
> > Reverted r251261 for reason:
> > 
> > This broke multiple tests
> > 
> > Committed r251514: <https://trac.webkit.org/changeset/251514>
> 
> inspector/console/webcore-logging.html was fixed by r251482 (and got
> reverted too), which fixed the test and made another meaningful bug fix.
> 
> inspector/runtime/getProperties.html is caused by that the test file doesn't
> handle error case correctly. (The diff is one error console message.)
> 
> scrollingcoordinator/ios/non-stable-viewport-scroll.html is flaky by its
> nature, as I commented in rdar://problem/56410921 and I replied in Slack.
> 
> Are there any other failure caused by this patch? If yes, what are they?
> If no, what led this to be reverted?

Those are the tests that lead us to roll this patch out
Comment 38 Geoffrey Garen 2019-10-24 12:47:36 PDT
> inspector/console/webcore-logging.html was fixed by r251482 (and got
> reverted too), which fixed the test and made another meaningful bug fix.

Sounds great!

> inspector/runtime/getProperties.html is caused by that the test file doesn't
> handle error case correctly. (The diff is one error console message.)

Does this mean that inspector/runtime/getProperties.html does fail when this patch is applied? If so, I'd suggest updating the test to handle error or updating the expected result to include the console message when re-landing.

> scrollingcoordinator/ios/non-stable-viewport-scroll.html is flaky by its
> nature, as I commented in rdar://problem/56410921 and I replied in Slack.

Was this test failing sometimes before this patch landed? If so, seems OK to ignore. If not, we probably need to skip this test until it gets fixed.
Comment 39 Geoffrey Garen 2019-10-24 13:25:49 PDT
> inspector/runtime/getProperties.html is caused by that the test file doesn't handle error case correctly. (The diff is one error console message.)

For this case, let's use

if (window.internals)
    window.internals.settings.setUnhandledPromiseRejectionToConsoleEnabled(false);

and then update the test result not to expect a console message.
Comment 40 Geoffrey Garen 2019-10-24 13:46:06 PDT
> scrollingcoordinator/ios/non-stable-viewport-scroll.html is flaky by its
> nature, as I commented in rdar://problem/56410921 and I replied in Slack.

For this case, the test is marked as flaky, so I think it's OK to land without fixing it, since doing so won't make the bots red.

(We analyzed this test a bit, and fixing it will be non-trivial. More information from <rdar://problem/56410921>:

"Talked to Simon, the test is flaky without my patch. We found some graphics work is done after callback of uiController.doAfterPresentationUpdate and RenderLayerCompositor::layerTreeAsText, but layerTreeAsText should show us the final result. 

With my patch, some of those work (result is not reflected in the -expected.txt file) has bigger chance to finish sooner than layerTreeAsText, so the test has become more flaky."

)
Comment 41 Antti Koivisto 2019-10-25 01:53:33 PDT
Seems like this patch was a significant (~11%) StyleBench perf regression:

https://perf.webkit.org/v3/#/charts?since=1571129192465&paneList=((20-1649-36468600-null-(5-2.5-500)))

Rollout brought the perf back where it was. This should be looked into before relanding.
Comment 42 Sihui Liu 2019-10-25 06:22:34 PDT
(In reply to Antti Koivisto from comment #41)
> Seems like this patch was a significant (~11%) StyleBench perf regression:
> 
> https://perf.webkit.org/v3/#/charts?since=1571129192465&paneList=((20-1649-
> 36468600-null-(5-2.5-500)))
> 
> Rollout brought the perf back where it was. This should be looked into
> before relanding.

That's surprising! How can I run this test?
Comment 43 Geoffrey Garen 2019-10-25 08:14:19 PDT
(In reply to Sihui Liu from comment #42)
> (In reply to Antti Koivisto from comment #41)
> > Seems like this patch was a significant (~11%) StyleBench perf regression:
> > 
> > https://perf.webkit.org/v3/#/charts?since=1571129192465&paneList=((20-1649-
> > 36468600-null-(5-2.5-500)))
> > 
> > Rollout brought the perf back where it was. This should be looked into
> > before relanding.
> 
> That's surprising! How can I run this test?

OpenSource/PerformanceTests/StyleBench/index.html
Comment 44 Antti Koivisto 2019-10-25 09:07:53 PDT
cd PerformanceTests/StyleBench
python -m SimpleHTTPServer 8000
navigate to localhost:8000

Alternatively individual subtests can be run with web server by opening 

PerformanceTests/StyleBench/InteractiveRunner.html
Comment 45 Antti Koivisto 2019-10-25 09:08:32 PDT
> Alternatively individual subtests can be run with web server by opening 

*without web server
Comment 46 Ryosuke Niwa 2019-10-25 11:26:46 PDT
There is also Tools/Scripts/run-perf-tests <path>
Comment 47 Antti Koivisto 2020-02-13 13:43:54 PST
Are there any show stoppers for landing this? It might help with first paint performance (type 1 sources don't prevent runloop observers from firing and so don't delay layer flushes).

I don't think StyleBench regression is a show stopper, it is not a zero regression test. It sounded like it was a bug in how measurements are made we understood what was wrong.
Comment 48 Sihui Liu 2020-02-13 14:05:31 PST
(In reply to Antti Koivisto from comment #47)
> Are there any show stoppers for landing this? It might help with first paint
> performance (type 1 sources don't prevent runloop observers from firing and
> so don't delay layer flushes).
> 
> I don't think StyleBench regression is a show stopper, it is not a zero
> regression test. It sounded like it was a bug in how measurements are made
> we understood what was wrong.

No,the 

we could try landing this again.
Comment 49 Ben Nham 2020-02-14 11:42:32 PST
I still think we should try to commit this but I'm not sure if this will help first paint perf due to excessive batching before an observer fires (namely the one that causes layer flush to occur). As far as I can tell, it looks like servicing a type 0 source will still turn the entire run loop and fire observers. After a type 0 source is handled:

1. The GCD main queue port is polled with 0 timeout; if this succeeds, go to (5) skipping before/after waiting observers
2. Before waiting observers fired
3. wakeup/timer/source1 port set polled with a 0 timeout
4. after waiting observers fired
5. GCD main queue / wakeup / timer / source1 callback runs

This patch mainly seems to optimize things by using a source1 mach port to wake up the run loop instead of CFRunLoopWakeUp. This is more efficient because this patch just sends a mach message, while CFRunLoopWakeUp both takes a lock and sends a mach message. But both before and after this patch, we can end up handling a bunch of IPC messages in a row before a run loop observer fires because of the way WTF::RunLoop::performWork batches up work.
Comment 50 Simon Fraser (smfr) 2020-02-14 11:47:22 PST
(In reply to Ben Nham from comment #49)
> I still think we should try to commit this but I'm not sure if this will
> help first paint perf due to excessive batching before an observer fires
> (namely the one that causes layer flush to occur). As far as I can tell, it
> looks like servicing a type 0 source will still turn the entire run loop and
> fire observers.

It does not fire "before waiting" observers, which the CoreAnimation commit observer is:
https://github.com/opensource-apple/CF/blob/master/CFRunLoop.c#L2403
https://github.com/opensource-apple/CF/blob/master/CFRunLoop.c#L2425

After a type 0 source is handled:
> 
> 1. The GCD main queue port is polled with 0 timeout; if this succeeds, go to
> (5) skipping before/after waiting observers
> 2. Before waiting observers fired

No, because of "if (!poll".

> 3. wakeup/timer/source1 port set polled with a 0 timeout
> 4. after waiting observers fired
> 5. GCD main queue / wakeup / timer / source1 callback runs
> 
> This patch mainly seems to optimize things by using a source1 mach port to
> wake up the run loop instead of CFRunLoopWakeUp. This is more efficient
> because this patch just sends a mach message, while CFRunLoopWakeUp both
> takes a lock and sends a mach message. But both before and after this patch,
> we can end up handling a bunch of IPC messages in a row before a run loop
> observer fires because of the way WTF::RunLoop::performWork batches up work.
Comment 51 Ben Nham 2020-02-14 11:57:19 PST
I see, I missed that. So the hypothesis is basically that an IPC that causes a layout to occur in a source 0 callback, which then calls flushLayers, and then CA's auto-commit observer doesn't run as soon as it should because of this source 0 behavior?
Comment 52 Simon Fraser (smfr) 2020-02-14 13:19:49 PST
(In reply to Ben Nham from comment #51)
> I see, I missed that. So the hypothesis is basically that an IPC that causes
> a layout to occur in a source 0 callback, which then calls flushLayers, and
> then CA's auto-commit observer doesn't run as soon as it should because of
> this source 0 behavior?

More simply put, a flood of IPC messages can postpone rendering for many event loop cycles.
Comment 53 Sihui Liu 2020-02-17 09:59:53 PST
Created attachment 390927 [details]
Patch
Comment 54 Sihui Liu 2020-02-24 14:19:34 PST
Created attachment 391575 [details]
Patch
Comment 55 Sihui Liu 2020-02-25 17:10:17 PST
Created attachment 391705 [details]
Patch
Comment 56 Geoffrey Garen 2020-06-09 14:57:17 PDT
Created attachment 401478 [details]
Patch - just the RunLoop parts

Let's see if we can land the refactoring part of this patch without the behavior change, to shrink the diff and help move to just one function queue.
Comment 57 Sihui Liu 2020-06-09 21:53:56 PDT
Comment on attachment 401478 [details]
Patch - just the RunLoop parts

Looks green. Do we need a new bug for this change?
Comment 58 Antti Koivisto 2020-06-09 23:32:39 PDT
Comment on attachment 401478 [details]
Patch - just the RunLoop parts

View in context: https://bugs.webkit.org/attachment.cgi?id=401478&action=review

> Source/WTF/ChangeLog:3
> +        Use CFRunLoopSource1 for faster task dispatch

The tittle does not describe this patch.
Comment 59 Geoffrey Garen 2020-06-10 12:50:25 PDT
Comment on attachment 401478 [details]
Patch - just the RunLoop parts

Yeah, I'll obsolete this patch and post in a separate bug.
Comment 60 Geoffrey Garen 2020-06-10 12:55:09 PDT
Refactoring patch is now posted @ https://bugs.webkit.org/show_bug.cgi?id=213043.
Comment 61 Geoffrey Garen 2020-11-13 11:54:53 PST
Created attachment 414072 [details]
Patch
Comment 62 Geoffrey Garen 2020-11-13 11:56:19 PST
Comment on attachment 414072 [details]
Patch

Re-based patch.
Comment 63 Geoffrey Garen 2020-11-13 12:00:26 PST
Created attachment 414073 [details]
Patch
Comment 64 Sihui Liu 2020-11-13 16:05:59 PST
Note using version 1 source may cause issue in our test runner, details in <rdar://problem/59637925>.
Comment 65 Geoffrey Garen 2020-11-16 12:17:58 PST
Created attachment 414266 [details]
Patch
Comment 66 Geoffrey Garen 2020-11-16 20:40:28 PST
Created attachment 414303 [details]
Patch
Comment 67 Geoffrey Garen 2020-11-20 13:28:14 PST
Created attachment 414710 [details]
microbenchmark

Uploading a microbenchmark for posterity.

Interestingly, the fastest case for V0 and V1 is about equal.

But on average V0 is 2X slower; and in the worst case, it's 5X slower, causing a full 2ms of latency!
Comment 68 Geoffrey Garen 2020-11-20 13:29:37 PST
Comment on attachment 414303 [details]
Patch

Tests pass and microbenchmark confirms a speedup. So, I think this is ready for review.
Comment 69 EWS 2020-11-20 14:04:47 PST
Committed r270132: <https://trac.webkit.org/changeset/270132>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 414303 [details].
Comment 70 Ryan Haddad 2020-12-01 10:00:55 PST
Reverted r270132 for reason:

Caused at least one regression test failure (webkit.org/b/219401)

Committed r270311: <https://trac.webkit.org/changeset/270311>
Comment 72 Geoffrey Garen 2020-12-10 15:35:37 PST
Created attachment 415939 [details]
Patch
Comment 73 Geoffrey Garen 2020-12-10 15:36:28 PST
Previous test failures should be resolved by:

https://bugs.webkit.org/show_bug.cgi?id=219717
https://bugs.webkit.org/show_bug.cgi?id=219656
Comment 74 EWS 2020-12-10 16:17:48 PST
Committed r270661: <https://trac.webkit.org/changeset/270661>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 415939 [details].
Comment 78 Ryan Haddad 2020-12-12 21:44:20 PST
Reverted r270661 for reason:

Caused layout test failures and timeouts

Committed r270746: <https://trac.webkit.org/changeset/270746>
Comment 80 Per Arne Vollan 2020-12-17 03:30:51 PST
Comment on attachment 415939 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=415939&action=review

> Source/WTF/wtf/cf/RunLoopCF.cpp:76
> +    mach_msg_header_t header;
> +    header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
> +    header.msgh_size = sizeof(mach_msg_header_t);
> +    header.msgh_remote_port = CFMachPortGetPort(m_port.get());
> +    header.msgh_local_port = MACH_PORT_NULL;
> +    header.msgh_id = 0;
> +    mach_msg_return_t result = mach_msg(&header, MACH_SEND_MSG | MACH_SEND_TIMEOUT, header.msgh_size, 0, MACH_PORT_NULL, 0, MACH_PORT_NULL);
> +    RELEASE_ASSERT(result == MACH_MSG_SUCCESS || result == MACH_SEND_TIMED_OUT);
> +    if (result == MACH_SEND_TIMED_OUT)
> +        mach_msg_destroy(&header);

Could this be simplified by using CFMessagePort instead of CFMachPort?
Comment 81 Geoffrey Garen 2020-12-17 08:42:25 PST
> Could this be simplified by using CFMessagePort instead of CFMachPort?

Unfortunately, CFMessagePortSendRequest still locks a pthread_mutex_t before sending, and pthread_mutex_t is most of the overhead we are trying to avoid.
Comment 82 Per Arne Vollan 2021-01-20 14:14:12 PST
(In reply to Geoffrey Garen from comment #81)
> > Could this be simplified by using CFMessagePort instead of CFMachPort?
> 
> Unfortunately, CFMessagePortSendRequest still locks a pthread_mutex_t before
> sending, and pthread_mutex_t is most of the overhead we are trying to avoid.

Ah, got it, thanks!

If there is a little bit of flakiness associated with this patch, could this possibly be addressed by dispatching off a run loop observer for the main thread run loop, in order to make the ordering deterministic?
Comment 83 Geoffrey Garen 2021-01-21 15:27:31 PST
> If there is a little bit of flakiness associated with this patch, could this
> possibly be addressed by dispatching off a run loop observer for the main
> thread run loop, in order to make the ordering deterministic?

I had the same thought! :P I do think that approach *could* work. I spent a few days trying to make that work, but I ended up with too many test failures. It's possible that I did not succeed in making behavior more deterministic; or maybe I succeeded in making behavior more deterministic, but deterministically failing.

One gotcha is that run loop observers don't run at *exactly* the same point as run loop sources. Another gotcha is that the relative order of run loop observers is different in the case where you first call CFRunLoopRun() vs the case where you were already sleeping inside  CFRunLoopRun().

My next move is to A/B test to see if the change is worth it. If it is, I think I can make things more deterministic by standardizing how we schedule tasks. (I started this in https://trac.webkit.org/changeset/271493/webkit.)