Bug 235114

Summary:

REGRESSION (iOS 15): Massive variation in microbenchmark results when tab returns to foreground

Product:

WebKit

Reporter:

Simon Taylor <simontaylor1>

Component:

JavaScriptCore

Assignee:

Nobody <webkit-unassigned>

Status:

NEW

Severity:

Normal

CC:

cdumez, mark.lam, saam, sabouhallawa, simon.fraser, thorton, webkit-bug-importer, ysuzuki

Priority:

Keywords:

InRadar

Version:

Safari 15

Hardware:

All

OS:

iOS 15

See Also:

https://bugs.webkit.org/show_bug.cgi?id=234923

Attachments:

Description	Flags
Demo page	none
Screen recording from M1 Pro MBP	none
Reproducing different operating points in STP 138	none

Simon Taylor

Reported 2022-01-12 07:43:01 PST

This is a weird one... I've got a simple microbenchmark that counts how many times Math.random() can be called in a single performance.now() tick (1ms on current iOS Safari). The idea is this provides a metric for the the current JS thread CPU performance (which varies depending on if it is running on an efficiency or performance core, and clock frequency). This is scheduled in a requestAnimationFrame loop. The page also has an option to further busy-loop in the rAF callback so the whole thing takes 20ms - checking with System Trace in Instruments this seems to reliably trigger the content thread being moved up to a Performance core, and gives higher call counts in the microbenchmark. I believe this is resulting in consistently high clocks but Instruments does not expose this data so I can't be 100% sure. Without the busy-loop option enabled the counts drop massively as the thread is moved to an efficiency core and clock speeds can be reduced. Interacting with the page (pinch zoom etc) does give a boost in the numbers, reflecting frequency ramping up to deal with the interaction and keep everything smooth. So far, all is as expected. Here's the weird thing: it looks like there are two distinct "operating points" for the microbenchmark results, and which one you get varies with page reload, or with events that cause the tab to make a background / foreground transition (tab switch, app switch, lock screen, etc). On my iPhone 12 Pro, iOS 15.2 the "fast" operating point has ~4k calls without busy looping and ~30k calls with busy looping. The "slow" operating point is ~1.2k without busy looping and ~12k with busy looping. These numbers seem to remain pretty consistent as long as the tab stays on screen, and only change on a background / foreground or reload event. I also have an iPhone 12 Mini with iOS 14.8.1, and the numbers are very similar to the "fast" operating point mentioned above. So I believe this to be an iOS 15.x regression. I've also been able to reproduce on an iPod touch 7 on 15.2. The best guess I have to explain the observed behaviour here is that some bug in tab re-awakening means it ends up using code from a lower JIT tier? This came to light in my testing for Bug 235002 - note in either operating point for the microbenchmark, the webgl timings look consistent, so I do believe the clocks are consistent. One difference is on refresh the full webGL demo page seems to be in the "slow" mode for the benchmark but the simpler example seems to start out more often in "fast" mode. Some issue with tiering up in general that depends on the code size could be at play?

Attachments
Demo page (3.17 KB, text/html) 2022-01-12 07:50 PST, Simon Taylor	no flags	Details
Screen recording from M1 Pro MBP (6.28 MB, video/quicktime) 2022-01-12 08:15 PST, Simon Taylor	no flags	Details
Reproducing different operating points in STP 138 (1.76 MB, video/quicktime) 2022-02-03 09:23 PST, Simon Taylor	no flags	Details
View All Add attachment proposed patch, testcase, etc.

Simon Taylor

Comment 1 2022-01-12 07:50:39 PST

Created attachment 448944 [details] Demo page I've attached the demo page. It's also hosted at: https://tango-bravo.net/webkit-bug-235114/index.html

Simon Taylor

Comment 2 2022-01-12 07:55:07 PST

I've also successfully reproduced this on both Intel and M1 Pro MacBooks with Safari 15.2 - both have two clear operating points. Quitting Safari and "Reopen Last Closed Window" seems to do the trick for rolling the dice and seeing which one you end up with.

Simon Taylor

Comment 3 2022-01-12 08:15:58 PST

Created attachment 448946 [details] Screen recording from M1 Pro MBP Screen recording showing this in action on an M1 MBP. Shows me doing Quit / Restart loop a few times (Safari set to re-open previous windows) until the page seems to be in the slower operating point. Then Refresh eventually seems to get one at the faster point. On Mac at least it seems refresh doesn't usually trigger the fast->slow transition, the Quit / Restart app is the more reliable way to observe that behaviour.

Radar WebKit Bug Importer

Comment 4 2022-01-12 09:16:12 PST

<rdar://problem/87462931>

Chris Dumez

Comment 5 2022-01-12 09:31:44 PST

I can reproduce on macOS / Intel. I just keep relaunching Safari and open https://tango-bravo.net/webkit-bug-235114/index.html. Sometimes, the math.random calls number is ~7k, sometimes the number is ~17k. I am doing some initial investigation but this may be a JSC issue having to do with tiering up or not.

Chris Dumez

Comment 6 2022-01-12 09:42:54 PST

The process priorities look similar when the benchmark is slow and fast: - Slow: 491 9 (FGSupport) 13 15MB 46MB 1000MB --- Y Y - 35711 com.apple.WebKit.WebContent policy: POLICY_TIMESHARE timeshare max priority: 63 timeshare base priority: 46 timeshare cur priority: 46 timeshare depressed: NO, prio -1 requested policy: req thread qos: THREAD_QOS_USER_INTERACTIVE, relprio: -1 req workqueue/pthread overides: req legacy qos override: THREAD_QOS_UNSPECIFIED req workqueue qos override: THREAD_QOS_UNSPECIFIED req kernel overides: req kevent overrides: THREAD_QOS_UNSPECIFIED req workloop servicer override: THREAD_QOS_UNSPECIFIED req turnstiles sync promotion qos: THREAD_QOS_UNSPECIFIED, user promotion base pri: 0 req latency qos: LATENCY_QOS_TIER_UNSPECIFIED req thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED req darwin BG: NO req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT) req other: effective policy: eff thread qos: THREAD_QOS_USER_INTERACTIVE eff thread qos relprio: -1 eff promotion qos: THREAD_QOS_USER_INTERACTIVE eff latency qos: LATENCY_QOS_TIER_0 eff thruput qos: THROUGHPUT_QOS_TIER_0 eff darwin BG: NO eff iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) eff other: - Fast: 493 9 (FGSupport) 13 16MB 48MB 1000MB --- Y Y - 35877 com.apple.WebKit.WebContent policy: POLICY_TIMESHARE timeshare max priority: 63 timeshare base priority: 46 timeshare cur priority: 46 timeshare depressed: NO, prio -1 requested policy: req thread qos: THREAD_QOS_USER_INTERACTIVE, relprio: -1 req workqueue/pthread overides: req legacy qos override: THREAD_QOS_UNSPECIFIED req workqueue qos override: THREAD_QOS_UNSPECIFIED req kernel overides: req kevent overrides: THREAD_QOS_UNSPECIFIED req workloop servicer override: THREAD_QOS_UNSPECIFIED req turnstiles sync promotion qos: THREAD_QOS_UNSPECIFIED, user promotion base pri: 0 req latency qos: LATENCY_QOS_TIER_UNSPECIFIED req thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED req darwin BG: NO req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT) req other: effective policy: eff thread qos: THREAD_QOS_USER_INTERACTIVE eff thread qos relprio: -1 eff promotion qos: THREAD_QOS_USER_INTERACTIVE eff latency qos: LATENCY_QOS_TIER_0 eff thruput qos: THROUGHPUT_QOS_TIER_0 eff darwin BG: NO eff iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) eff other: I don't think this is a process priority issue. I am curious what JSC engineers think about this. I guess I can try and compare profiles next.

Chris Dumez

Comment 7 2022-01-12 09:55:43 PST

Looking at the traces: 1. The WebContent process has about as many samples 2. The WebContent process has the same priorities 3. The WebContent process has about the same number of syscalls 4. The WebContent process has about the same amount of scheduled time I'll keep looking a bit but I am starting to wonder if something is wrong with the benchmark itself?

Chris Dumez

Comment 8 2022-01-12 09:59:41 PST

In the slow case, top samples are: Sample Count, Samples %, Normalized CPU %, Symbol 10167, 74.2%, 2.8%, vmEntryToJavaScript (in JavaScriptCore) 5205, 38.0%, 1.4%, ??? (in (null)) [0x53a6858040a4] 5107, 37.3%, 1.4%, operationGetFromScope (in JavaScriptCore) 2580, 18.8%, 0.7%, WebCore::JSDOMWindow::getOwnPropertySlot(JSC::JSObject*, JSC::JSGlobalObject*, JSC::PropertyName, JSC::PropertySlot&) (in WebCore) 1467, 10.7%, 0.4%, JSC::JSGlobalObject::getOwnPropertySlot(JSC::JSObject*, JSC::JSGlobalObject*, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore) 1136, 8.3%, 0.3%, JSC::JSObject::getOwnStaticPropertySlot(JSC::VM&, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore) 873, 6.4%, 0.2%, JSC::getStaticPropertySlotFromTable(JSC::VM&, JSC::ClassInfo const*, JSC::HashTable const&, JSC::JSObject*, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore) 510, 3.7%, 0.1%, WTF::equal(WTF::StringImpl const*, unsigned char const*) (in JavaScriptCore) In the fast case, top samples are: Sample Count, Samples %, Normalized CPU %, Symbol 10164, 72.1%, 2.8%, vmEntryToJavaScript (in JavaScriptCore) 3501, 24.9%, 1.0%, ??? (in (null)) [0x20370c40ae38] 3363, 23.9%, 0.9%, ??? (in (null)) [0x20370c4011d8] 3317, 23.5%, 0.9%, WebCore::jsPerformancePrototypeFunction_now(JSC::JSGlobalObject*, JSC::CallFrame*) (in WebCore) 2072, 14.7%, 0.6%, WTF::MonotonicTime::now() (in JavaScriptCore) 1195, 8.5%, 0.3%, mach_absolute_time (in libsystem_kernel.dylib) They do look very different and I suspect this is about JIT'ing vs not JIT'ing. Somebody from JSC should probably take a look (I attached the profiles to the radar).

Simon Taylor

Comment 9 2022-01-12 10:09:58 PST

(In reply to Chris Dumez from comment #7) > I am starting to wonder if something is wrong with the benchmark itself? Certainly possible. In the WebGL one I added console.log() calls (which show up as syscalls if WebInspector is open) and verified that the section where the calls are counted was 1ms long in both fast and slow cases. The idea of accumulating to a global var was to stop the optimizer removing the calls entirely. Different JIT levels being used does feel most likely. I'm not able to reproduce very readily on M1 Pro MBP with Safari Tech Preview. There's a vague hint of a 28k / 32k difference (perhaps some non-determinism but at the same JIT tier?), but I haven't seen the ~11k case there.

Mark Lam

Comment 10 2022-01-19 15:30:58 PST

FWIW, this issue no longer manifests after r283102.

Simon Taylor

Comment 11 2022-02-03 09:23:08 PST

Created attachment 450781 [details] Reproducing different operating points in STP 138 (In reply to Mark Lam from comment #10) > FWIW, this issue no longer manifests after r283102. I haven't tried to fully understand that commit, but does it replace the JS interpreter tier with a baseline JIT? So the issue of using less-optimised code on some loads of code may still be present, but with a less obvious gap between the operating points since that commit? I've updated the test page to try and tease this out. Now rather than showing the mean of the counts, I take the max (and median) of each set of 60 - these tend to be really quite consistent. I've also added a history of the "max" results to get more of a feel for the consistency of this over time. https://tango-bravo.net/webkit-bug-235114/index.html There do still seem to be distinct operating points in Safari Tech Preview 138 (the release notes mention unlinked baseline JIT being added in 134). I've attached a video showing what seem to be 3 different performance points from STP 138 on my M1 Pro MBP. 29000, 29600, 31950 - the gaps between these points are all significantly larger than the differences in samples in each of them, so I'm pretty confident there's still different code running in these different cases.

Simon Taylor

Comment 12 2022-02-03 09:30:10 PST

I've also added a "manual" benchmark runner page that just takes one set of 200 samples and shows them all each time you click a button: https://tango-bravo.net/webkit-bug-235114/manual.html You can observe the CPU frequency ramping up as the early samples have lower counts than the later ones, and the later ones are still quite consistent. For whatever reason this bug isn't as obvious on that page though, even on iOS 15.2 devices, though a do still see a couple of clear operating points, it's not the 3x difference that is visible in the rAF-based test. So the rAF test page is probably the more useful test case but thought I might as well share this one too.

Simon Taylor

Comment 13 2022-02-16 03:58:46 PST

This is readily reproducible on iOS 15.3.1 and not on 15.4 beta 3 (iPhone 12 Pro). Numbers are slightly more variable on iOS so it's not clear if there are still the slightly different operating points mentioned in Comment 11 in-play.

Note You need to log in before you can comment on or make changes to this bug.