| Summary: | REGRESSION (iOS 15): Massive variation in microbenchmark results when tab returns to foreground | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | WebKit | Reporter: | Simon Taylor <simontaylor1> | ||||||||
| Component: | JavaScriptCore | Assignee: | Nobody <webkit-unassigned> | ||||||||
| Status: | NEW --- | ||||||||||
| Severity: | Normal | CC: | cdumez, mark.lam, saam, sabouhallawa, simon.fraser, thorton, webkit-bug-importer, ysuzuki | ||||||||
| Priority: | P2 | Keywords: | InRadar | ||||||||
| Version: | Safari 15 | ||||||||||
| Hardware: | All | ||||||||||
| OS: | iOS 15 | ||||||||||
| See Also: | https://bugs.webkit.org/show_bug.cgi?id=234923 | ||||||||||
| Attachments: |
|
||||||||||
|
Description
Simon Taylor
2022-01-12 07:43:01 PST
Created attachment 448944 [details] Demo page I've attached the demo page. It's also hosted at: https://tango-bravo.net/webkit-bug-235114/index.html I've also successfully reproduced this on both Intel and M1 Pro MacBooks with Safari 15.2 - both have two clear operating points. Quitting Safari and "Reopen Last Closed Window" seems to do the trick for rolling the dice and seeing which one you end up with. Created attachment 448946 [details]
Screen recording from M1 Pro MBP
Screen recording showing this in action on an M1 MBP.
Shows me doing Quit / Restart loop a few times (Safari set to re-open previous windows) until the page seems to be in the slower operating point. Then Refresh eventually seems to get one at the faster point.
On Mac at least it seems refresh doesn't usually trigger the fast->slow transition, the Quit / Restart app is the more reliable way to observe that behaviour.
I can reproduce on macOS / Intel. I just keep relaunching Safari and open https://tango-bravo.net/webkit-bug-235114/index.html. Sometimes, the math.random calls number is ~7k, sometimes the number is ~17k. I am doing some initial investigation but this may be a JSC issue having to do with tiering up or not. The process priorities look similar when the benchmark is slow and fast: - Slow: 491 9 (FGSupport) 13 15MB 46MB 1000MB --- Y Y - 35711 com.apple.WebKit.WebContent policy: POLICY_TIMESHARE timeshare max priority: 63 timeshare base priority: 46 timeshare cur priority: 46 timeshare depressed: NO, prio -1 requested policy: req thread qos: THREAD_QOS_USER_INTERACTIVE, relprio: -1 req workqueue/pthread overides: req legacy qos override: THREAD_QOS_UNSPECIFIED req workqueue qos override: THREAD_QOS_UNSPECIFIED req kernel overides: req kevent overrides: THREAD_QOS_UNSPECIFIED req workloop servicer override: THREAD_QOS_UNSPECIFIED req turnstiles sync promotion qos: THREAD_QOS_UNSPECIFIED, user promotion base pri: 0 req latency qos: LATENCY_QOS_TIER_UNSPECIFIED req thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED req darwin BG: NO req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT) req other: effective policy: eff thread qos: THREAD_QOS_USER_INTERACTIVE eff thread qos relprio: -1 eff promotion qos: THREAD_QOS_USER_INTERACTIVE eff latency qos: LATENCY_QOS_TIER_0 eff thruput qos: THROUGHPUT_QOS_TIER_0 eff darwin BG: NO eff iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) eff other: - Fast: 493 9 (FGSupport) 13 16MB 48MB 1000MB --- Y Y - 35877 com.apple.WebKit.WebContent policy: POLICY_TIMESHARE timeshare max priority: 63 timeshare base priority: 46 timeshare cur priority: 46 timeshare depressed: NO, prio -1 requested policy: req thread qos: THREAD_QOS_USER_INTERACTIVE, relprio: -1 req workqueue/pthread overides: req legacy qos override: THREAD_QOS_UNSPECIFIED req workqueue qos override: THREAD_QOS_UNSPECIFIED req kernel overides: req kevent overrides: THREAD_QOS_UNSPECIFIED req workloop servicer override: THREAD_QOS_UNSPECIFIED req turnstiles sync promotion qos: THREAD_QOS_UNSPECIFIED, user promotion base pri: 0 req latency qos: LATENCY_QOS_TIER_UNSPECIFIED req thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED req darwin BG: NO req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT) req other: effective policy: eff thread qos: THREAD_QOS_USER_INTERACTIVE eff thread qos relprio: -1 eff promotion qos: THREAD_QOS_USER_INTERACTIVE eff latency qos: LATENCY_QOS_TIER_0 eff thruput qos: THROUGHPUT_QOS_TIER_0 eff darwin BG: NO eff iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) eff other: I don't think this is a process priority issue. I am curious what JSC engineers think about this. I guess I can try and compare profiles next. Looking at the traces: 1. The WebContent process has about as many samples 2. The WebContent process has the same priorities 3. The WebContent process has about the same number of syscalls 4. The WebContent process has about the same amount of scheduled time I'll keep looking a bit but I am starting to wonder if something is wrong with the benchmark itself? In the slow case, top samples are: Sample Count, Samples %, Normalized CPU %, Symbol 10167, 74.2%, 2.8%, vmEntryToJavaScript (in JavaScriptCore) 5205, 38.0%, 1.4%, ??? (in (null)) [0x53a6858040a4] 5107, 37.3%, 1.4%, operationGetFromScope (in JavaScriptCore) 2580, 18.8%, 0.7%, WebCore::JSDOMWindow::getOwnPropertySlot(JSC::JSObject*, JSC::JSGlobalObject*, JSC::PropertyName, JSC::PropertySlot&) (in WebCore) 1467, 10.7%, 0.4%, JSC::JSGlobalObject::getOwnPropertySlot(JSC::JSObject*, JSC::JSGlobalObject*, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore) 1136, 8.3%, 0.3%, JSC::JSObject::getOwnStaticPropertySlot(JSC::VM&, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore) 873, 6.4%, 0.2%, JSC::getStaticPropertySlotFromTable(JSC::VM&, JSC::ClassInfo const*, JSC::HashTable const&, JSC::JSObject*, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore) 510, 3.7%, 0.1%, WTF::equal(WTF::StringImpl const*, unsigned char const*) (in JavaScriptCore) In the fast case, top samples are: Sample Count, Samples %, Normalized CPU %, Symbol 10164, 72.1%, 2.8%, vmEntryToJavaScript (in JavaScriptCore) 3501, 24.9%, 1.0%, ??? (in (null)) [0x20370c40ae38] 3363, 23.9%, 0.9%, ??? (in (null)) [0x20370c4011d8] 3317, 23.5%, 0.9%, WebCore::jsPerformancePrototypeFunction_now(JSC::JSGlobalObject*, JSC::CallFrame*) (in WebCore) 2072, 14.7%, 0.6%, WTF::MonotonicTime::now() (in JavaScriptCore) 1195, 8.5%, 0.3%, mach_absolute_time (in libsystem_kernel.dylib) They do look very different and I suspect this is about JIT'ing vs not JIT'ing. Somebody from JSC should probably take a look (I attached the profiles to the radar). (In reply to Chris Dumez from comment #7) > I am starting to wonder if something is wrong with the benchmark itself? Certainly possible. In the WebGL one I added console.log() calls (which show up as syscalls if WebInspector is open) and verified that the section where the calls are counted was 1ms long in both fast and slow cases. The idea of accumulating to a global var was to stop the optimizer removing the calls entirely. Different JIT levels being used does feel most likely. I'm not able to reproduce very readily on M1 Pro MBP with Safari Tech Preview. There's a vague hint of a 28k / 32k difference (perhaps some non-determinism but at the same JIT tier?), but I haven't seen the ~11k case there. FWIW, this issue no longer manifests after r283102. Created attachment 450781 [details] Reproducing different operating points in STP 138 (In reply to Mark Lam from comment #10) > FWIW, this issue no longer manifests after r283102. I haven't tried to fully understand that commit, but does it replace the JS interpreter tier with a baseline JIT? So the issue of using less-optimised code on some loads of code may still be present, but with a less obvious gap between the operating points since that commit? I've updated the test page to try and tease this out. Now rather than showing the mean of the counts, I take the max (and median) of each set of 60 - these tend to be really quite consistent. I've also added a history of the "max" results to get more of a feel for the consistency of this over time. https://tango-bravo.net/webkit-bug-235114/index.html There do still seem to be distinct operating points in Safari Tech Preview 138 (the release notes mention unlinked baseline JIT being added in 134). I've attached a video showing what seem to be 3 different performance points from STP 138 on my M1 Pro MBP. 29000, 29600, 31950 - the gaps between these points are all significantly larger than the differences in samples in each of them, so I'm pretty confident there's still different code running in these different cases. I've also added a "manual" benchmark runner page that just takes one set of 200 samples and shows them all each time you click a button: https://tango-bravo.net/webkit-bug-235114/manual.html You can observe the CPU frequency ramping up as the early samples have lower counts than the later ones, and the later ones are still quite consistent. For whatever reason this bug isn't as obvious on that page though, even on iOS 15.2 devices, though a do still see a couple of clear operating points, it's not the 3x difference that is visible in the rAF-based test. So the rAF test page is probably the more useful test case but thought I might as well share this one too. This is readily reproducible on iOS 15.3.1 and not on 15.4 beta 3 (iPhone 12 Pro). Numbers are slightly more variable on iOS so it's not clear if there are still the slightly different operating points mentioned in Comment 11 in-play. |