Bug 235114 - REGRESSION (iOS 15): Massive variation in microbenchmark results when tab returns to foreground
Summary: REGRESSION (iOS 15): Massive variation in microbenchmark results when tab ret...
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: Safari 15
Hardware: All iOS 15
: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2022-01-12 07:43 PST by Simon Taylor
Modified: 2022-02-16 03:58 PST (History)
8 users (show)

See Also:


Attachments
Demo page (3.17 KB, text/html)
2022-01-12 07:50 PST, Simon Taylor
no flags Details
Screen recording from M1 Pro MBP (6.28 MB, video/quicktime)
2022-01-12 08:15 PST, Simon Taylor
no flags Details
Reproducing different operating points in STP 138 (1.76 MB, video/quicktime)
2022-02-03 09:23 PST, Simon Taylor
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Simon Taylor 2022-01-12 07:43:01 PST
This is a weird one...

I've got a simple microbenchmark that counts how many times Math.random() can be called in a single performance.now() tick (1ms on current iOS Safari). The idea is this provides a metric for the the current JS thread CPU performance (which varies depending on if it is running on an efficiency or performance core, and clock frequency). This is scheduled in a requestAnimationFrame loop.

The page also has an option to further busy-loop in the rAF callback so the whole thing takes 20ms - checking with System Trace in Instruments this seems to reliably trigger the content thread being moved up to a Performance core, and gives higher call counts in the microbenchmark. I believe this is resulting in consistently high clocks but Instruments does not expose this data so I can't be 100% sure.

Without the busy-loop option enabled the counts drop massively as the thread is moved to an efficiency core and clock speeds can be reduced. Interacting with the page (pinch zoom etc) does give a boost in the numbers, reflecting frequency ramping up to deal with the interaction and keep everything smooth. So far, all is as expected.

Here's the weird thing: it looks like there are two distinct "operating points" for the microbenchmark results, and which one you get varies with page reload, or with events that cause the tab to make a background / foreground transition (tab switch, app switch, lock screen, etc).

On my iPhone 12 Pro, iOS 15.2 the "fast" operating point has ~4k calls without busy looping and ~30k calls with busy looping. The "slow" operating point is ~1.2k without busy looping and ~12k with busy looping. These numbers seem to remain pretty consistent as long as the tab stays on screen, and only change on a background / foreground or reload event.

I also have an iPhone 12 Mini with iOS 14.8.1, and the numbers are very similar to the "fast" operating point mentioned above. So I believe this to be an iOS 15.x regression. 
I've also been able to reproduce on an iPod touch 7 on 15.2.

The best guess I have to explain the observed behaviour here is that some bug in tab re-awakening means it ends up using code from a lower JIT tier?

This came to light in my testing for Bug 235002 - note in either operating point for the microbenchmark, the webgl timings look consistent, so I do believe the clocks are consistent. One difference is on refresh the full webGL demo page seems to be in the "slow" mode for the benchmark but the simpler example seems to start out more often in "fast" mode. Some issue with tiering up in general that depends on the code size could be at play?
Comment 1 Simon Taylor 2022-01-12 07:50:39 PST
Created attachment 448944 [details]
Demo page

I've attached the demo page.

It's also hosted at:
https://tango-bravo.net/webkit-bug-235114/index.html
Comment 2 Simon Taylor 2022-01-12 07:55:07 PST
I've also successfully reproduced this on both Intel and M1 Pro MacBooks with Safari 15.2 - both have two clear operating points.

Quitting Safari and "Reopen Last Closed Window" seems to do the trick for rolling the dice and seeing which one you end up with.
Comment 3 Simon Taylor 2022-01-12 08:15:58 PST
Created attachment 448946 [details]
Screen recording from M1 Pro MBP

Screen recording showing this in action on an M1 MBP.

Shows me doing Quit / Restart loop a few times (Safari set to re-open previous windows) until the page seems to be in the slower operating point. Then Refresh eventually seems to get one at the faster point.

On Mac at least it seems refresh doesn't usually trigger the fast->slow transition, the Quit / Restart app is the more reliable way to observe that behaviour.
Comment 4 Radar WebKit Bug Importer 2022-01-12 09:16:12 PST
<rdar://problem/87462931>
Comment 5 Chris Dumez 2022-01-12 09:31:44 PST
I can reproduce on macOS / Intel.

I just keep relaunching Safari and open https://tango-bravo.net/webkit-bug-235114/index.html. Sometimes, the math.random calls number is ~7k, sometimes the number is ~17k.

I am doing some initial investigation but this may be a JSC issue having to do with tiering up or not.
Comment 6 Chris Dumez 2022-01-12 09:42:54 PST
The process priorities look similar when the benchmark is slow and fast:

- Slow:
  491   9 (FGSupport)   13       15MB            46MB  1000MB                  ---    Y     Y     -    35711  com.apple.WebKit.WebContent

	policy: POLICY_TIMESHARE
		timeshare max  priority: 63
		timeshare base priority: 46
		timeshare cur  priority: 46
		timeshare depressed: NO, prio -1
	requested policy:
		req thread qos: THREAD_QOS_USER_INTERACTIVE, relprio: -1
		req workqueue/pthread overides:
			req legacy qos override: THREAD_QOS_UNSPECIFIED
			req workqueue qos override: THREAD_QOS_UNSPECIFIED
		req kernel overides:
			req kevent overrides: THREAD_QOS_UNSPECIFIED
			req workloop servicer override: THREAD_QOS_UNSPECIFIED
		req turnstiles sync promotion qos: THREAD_QOS_UNSPECIFIED, user promotion base pri: 0
		req latency qos: LATENCY_QOS_TIER_UNSPECIFIED
		req thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED
		req darwin BG: NO  
		req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT)
		req other: 
	effective policy:
		eff thread qos: THREAD_QOS_USER_INTERACTIVE
		eff thread qos relprio: -1
		eff promotion qos: THREAD_QOS_USER_INTERACTIVE
		eff latency qos: LATENCY_QOS_TIER_0
		eff thruput qos: THROUGHPUT_QOS_TIER_0
		eff darwin BG: NO
		eff iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT)
		eff other: 

- Fast:
  493   9 (FGSupport)   13       16MB            48MB  1000MB                  ---    Y     Y     -    35877  com.apple.WebKit.WebContent

policy: POLICY_TIMESHARE
		timeshare max  priority: 63
		timeshare base priority: 46
		timeshare cur  priority: 46
		timeshare depressed: NO, prio -1
	requested policy:
		req thread qos: THREAD_QOS_USER_INTERACTIVE, relprio: -1
		req workqueue/pthread overides:
			req legacy qos override: THREAD_QOS_UNSPECIFIED
			req workqueue qos override: THREAD_QOS_UNSPECIFIED
		req kernel overides:
			req kevent overrides: THREAD_QOS_UNSPECIFIED
			req workloop servicer override: THREAD_QOS_UNSPECIFIED
		req turnstiles sync promotion qos: THREAD_QOS_UNSPECIFIED, user promotion base pri: 0
		req latency qos: LATENCY_QOS_TIER_UNSPECIFIED
		req thruput qos: THROUGHPUT_QOS_TIER_UNSPECIFIED
		req darwin BG: NO  
		req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT)
		req other: 
	effective policy:
		eff thread qos: THREAD_QOS_USER_INTERACTIVE
		eff thread qos relprio: -1
		eff promotion qos: THREAD_QOS_USER_INTERACTIVE
		eff latency qos: LATENCY_QOS_TIER_0
		eff thruput qos: THROUGHPUT_QOS_TIER_0
		eff darwin BG: NO
		eff iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT)
		eff other: 

I don't think this is a process priority issue. I am curious what JSC engineers think about this. I guess I can try and compare profiles next.
Comment 7 Chris Dumez 2022-01-12 09:55:43 PST
Looking at the traces:
1. The WebContent process has about as many samples
2. The WebContent process has the same priorities
3. The WebContent process has about the same number of syscalls
4. The WebContent process has about the same amount of scheduled time

I'll keep looking a bit but I am starting to wonder if something is wrong with the benchmark itself?
Comment 8 Chris Dumez 2022-01-12 09:59:41 PST
In the slow case, top samples are:
Sample Count, Samples %, Normalized CPU %, Symbol
10167, 74.2%, 2.8%, vmEntryToJavaScript (in JavaScriptCore)
5205, 38.0%, 1.4%,     ??? (in (null)) [0x53a6858040a4]
5107, 37.3%, 1.4%,         operationGetFromScope (in JavaScriptCore)
2580, 18.8%, 0.7%,             WebCore::JSDOMWindow::getOwnPropertySlot(JSC::JSObject*, JSC::JSGlobalObject*, JSC::PropertyName, JSC::PropertySlot&) (in WebCore)
1467, 10.7%, 0.4%,                 JSC::JSGlobalObject::getOwnPropertySlot(JSC::JSObject*, JSC::JSGlobalObject*, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore)
1136, 8.3%, 0.3%,                     JSC::JSObject::getOwnStaticPropertySlot(JSC::VM&, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore)
873, 6.4%, 0.2%,                         JSC::getStaticPropertySlotFromTable(JSC::VM&, JSC::ClassInfo const*, JSC::HashTable const&, JSC::JSObject*, JSC::PropertyName, JSC::PropertySlot&) (in JavaScriptCore)
510, 3.7%, 0.1%,                             WTF::equal(WTF::StringImpl const*, unsigned char const*) (in JavaScriptCore)


In the fast case, top samples are:
Sample Count, Samples %, Normalized CPU %, Symbol
10164, 72.1%, 2.8%, vmEntryToJavaScript (in JavaScriptCore)
3501, 24.9%, 1.0%,     ??? (in (null)) [0x20370c40ae38]
3363, 23.9%, 0.9%,         ??? (in (null)) [0x20370c4011d8]
3317, 23.5%, 0.9%,             WebCore::jsPerformancePrototypeFunction_now(JSC::JSGlobalObject*, JSC::CallFrame*) (in WebCore)
2072, 14.7%, 0.6%,                 WTF::MonotonicTime::now() (in JavaScriptCore)
1195, 8.5%, 0.3%,                     mach_absolute_time (in libsystem_kernel.dylib)


They do look very different and I suspect this is about JIT'ing vs not JIT'ing. Somebody from JSC should probably take a look (I attached the profiles to the radar).
Comment 9 Simon Taylor 2022-01-12 10:09:58 PST
(In reply to Chris Dumez from comment #7)
> I am starting to wonder if something is wrong with the benchmark itself?

Certainly possible.

In the WebGL one I added console.log() calls (which show up as syscalls if WebInspector is open) and verified that the section where the calls are counted was 1ms long in both fast and slow cases. The idea of accumulating to a global var was to stop the optimizer removing the calls entirely. Different JIT levels being used does feel most likely.

I'm not able to reproduce very readily on M1 Pro MBP with Safari Tech Preview. There's a vague hint of a 28k / 32k difference (perhaps some non-determinism but at the same JIT tier?), but I haven't seen the ~11k case there.
Comment 10 Mark Lam 2022-01-19 15:30:58 PST
FWIW, this issue no longer manifests after r283102.
Comment 11 Simon Taylor 2022-02-03 09:23:08 PST
Created attachment 450781 [details]
Reproducing different operating points in STP 138

(In reply to Mark Lam from comment #10)
> FWIW, this issue no longer manifests after r283102.

I haven't tried to fully understand that commit, but does it replace the JS interpreter tier with a baseline JIT? So the issue of using less-optimised code on some loads of code may still be present, but with a less obvious gap between the operating points since that commit?

I've updated the test page to try and tease this out. Now rather than showing the mean of the counts, I take the max (and median) of each set of 60 - these tend to be really quite consistent. I've also added a history of the "max" results to get more of a feel for the consistency of this over time.
https://tango-bravo.net/webkit-bug-235114/index.html

There do still seem to be distinct operating points in Safari Tech Preview 138 (the release notes mention unlinked baseline JIT being added in 134).

I've attached a video showing what seem to be 3 different performance points from STP 138 on my M1 Pro MBP. 29000, 29600, 31950 - the gaps between these points are all significantly larger than the differences in samples in each of them, so I'm pretty confident there's still different code running in these different cases.
Comment 12 Simon Taylor 2022-02-03 09:30:10 PST
I've also added a "manual" benchmark runner page that just takes one set of 200 samples and shows them all each time you click a button:
https://tango-bravo.net/webkit-bug-235114/manual.html

You can observe the CPU frequency ramping up as the early samples have lower counts than the later ones, and the later ones are still quite consistent. For whatever reason this bug isn't as obvious on that page though, even on iOS 15.2 devices, though a do still see a couple of clear operating points, it's not the 3x difference that is visible in the rAF-based test. So the rAF test page is probably the more useful test case but thought I might as well share this one too.
Comment 13 Simon Taylor 2022-02-16 03:58:46 PST
This is readily reproducible on iOS 15.3.1 and not on 15.4 beta 3 (iPhone 12 Pro).

Numbers are slightly more variable on iOS so it's not clear if there are still the slightly different operating points mentioned in Comment 11 in-play.