WebKit Bugzilla
RESOLVED FIXED
Bug 158906
REGRESSION(concurrent baseline JIT): Kraken/ai-astar runs 20% slower
https://bugs.webkit.org/show_bug.cgi?id=158906
Filip Pizlo
Reported
2016-06-18 12:28:50 PDT
So far this looks like it's because the baseline compile of search# is delayed by >10ms. I'm still investigating why. I do know that it's only the decision to compile baseline code concurrently that causes the slow-down, and not any of the other changes in the patch.
Attachments
the patch
(7.25 KB, patch)
2016-06-18 13:50 PDT, Filip Pizlo
benjamin: review+
Filip Pizlo
Comment 1
2016-06-18 12:32:03 PDT
Looks like we enqueue the search# plan and then it takes ~10ms for the compiler thread to wake up!
Filip Pizlo
Comment 2
2016-06-18 12:43:18 PDT
This is fascinating! Here's what's happening: we enqueued a compile of <global>, which is a ginormous piece of global code, during the benchmark's unmeasured initialization phase. Then, at the time that we try to compile search#, the JIT thread is busy compiling <global>. That takes a long time to compile and it generates a disgusting amount of code: <global> takes 16ms to compile and generates 8MB!!

There are a bunch of solutions we could attempt:

1) If we want to compile something and the concurrent JIT thread is busy, then compile on the main thread instead.
2) Launch multiple JIT threads.
3) Delay the compilation of <global> based on size heuristics, the same way that we do for DFG and FTL tier-up. This might be hard because I think that <global> actually spends a significant amount of time in its silly loop.
4) Don't compile all of <global> with the baseline JIT. It tiers up because of a tiny loop at the end.

I think I'll attempt (1). That seems like the easiest way of recovering this regression. I think it would be great to consider the other solutions, too.
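As a rough illustration of option (1), here is a minimal sketch of falling back to a synchronous compile on the main thread when the concurrent helper is busy. The names Plan, Worklist, tryEnqueue, and requestBaselineCompile are illustrative stand-ins, not the actual JSC classes or worklist API.

#include <functional>
#include <mutex>
#include <queue>

// Hypothetical stand-ins for the baseline JIT's compile plan and worklist.
struct Plan {
    std::function<void()> compile; // the baseline compile work for one code block
};

class Worklist {
public:
    // Returns false if the helper thread is already busy compiling another
    // plan; in that case the caller should compile synchronously instead
    // (the gist of option (1) above).
    bool tryEnqueue(Plan plan)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        if (m_helperBusy)
            return false;
        m_queue.push(std::move(plan));
        return true;
    }

private:
    std::mutex m_lock;
    std::queue<Plan> m_queue;
    bool m_helperBusy { false }; // set by the helper thread while it compiles
};

// Caller side: prefer concurrent compilation, but fall back to compiling on
// the main thread when the helper is stuck on a long compile like <global>.
void requestBaselineCompile(Worklist& worklist, Plan plan)
{
    if (!worklist.tryEnqueue(plan))
        plan.compile(); // synchronous fallback keeps search# from waiting ~10ms
}

The point of the fallback is simply that a blocked helper thread should never add tens of milliseconds of latency to a tiny compile like search#.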
Filip Pizlo
Comment 3
2016-06-18 13:15:53 PDT
Attempt (1) seems to work for Kraken/ai-astar!

Trunk (including concurrent baseline JIT): 111.2 ms
With my fix: 92.6 ms
With concurrent baseline JIT totally disabled: 87.9 ms

I'll do more tests.
Filip Pizlo
Comment 4
2016-06-18 13:22:19 PDT
I believe that this resolves the Kraken issue. Here are the numbers now.

FROM = concurrent baseline JIT completely disabled.
TO = my change, which uses synchronous compilation if the thread is busy.

TEST                        COMPARISON            FROM                TO                  DETAILS
====================================================================================
** TOTAL **:                1.013x as fast        826.1ms +/- 0.6%    815.5ms +/- 1.1%    significant
====================================================================================
ai:                         *1.043x as slow*      85.3ms +/- 2.0%     89.0ms +/- 1.6%     significant
  astar:                    *1.043x as slow*      85.3ms +/- 2.0%     89.0ms +/- 1.6%     significant
audio:                      1.037x as fast        259.1ms +/- 1.1%    249.8ms +/- 2.9%    significant
  beat-detection:           ??                    58.1ms +/- 3.7%     59.1ms +/- 11.5%    might be *1.017x as slow*
  dft:                      1.100x as fast        103.1ms +/- 2.1%    93.7ms +/- 2.1%     significant
  fft:                      -                     43.9ms +/- 1.8%     42.9ms +/- 2.4%
  oscillator:               ??                    54.0ms +/- 1.2%     54.1ms +/- 2.1%     might be *1.002x as slow*
imaging:                    ??                    185.1ms +/- 1.0%    185.6ms +/- 2.5%    might be *1.003x as slow*
  gaussian-blur:            -                     64.2ms +/- 2.2%     62.2ms +/- 3.4%
  darkroom:                 ??                    72.9ms +/- 1.0%     73.5ms +/- 0.8%     might be *1.008x as slow*
  desaturate:               ??                    48.0ms +/- 5.1%     49.9ms +/- 6.5%     might be *1.040x as slow*
json:                       -                     58.7ms +/- 1.3%     57.8ms +/- 1.3%
  parse-financial:          -                     35.2ms +/- 1.6%     34.8ms +/- 1.6%
  stringify-tinderbox:      1.022x as fast        23.5ms +/- 1.6%     23.0ms +/- 1.5%     significant
stanford:                   1.020x as fast        237.9ms +/- 0.8%    233.3ms +/- 1.2%    significant
  crypto-aes:               1.045x as fast        51.5ms +/- 1.9%     49.3ms +/- 1.2%     significant
  crypto-ccm:               -                     48.4ms +/- 1.9%     48.3ms +/- 4.7%
  crypto-pbkdf2:            1.014x as fast        99.6ms +/- 1.3%     98.2ms +/- 0.6%     significant
  crypto-sha256-iterative:  1.024x as fast        38.4ms +/- 1.3%     37.5ms +/- 1.3%     significant
Filip Pizlo
Comment 5
2016-06-18 13:50:33 PDT
Created attachment 281618 [details]
the patch
Benjamin Poulain
Comment 6
2016-06-18 23:51:22 PDT
Comment on attachment 281618 [details]
the patch

A bit nasty but let's try anyway. Especially since ARM hurts a lot and we don't have more cores.
Filip Pizlo
Comment 7
2016-06-19 09:40:25 PDT
(In reply to comment #6)
> Comment on attachment 281618 [details]
> the patch
>
> A bit nasty but let's try anyway. Especially since ARM hurts a lot and we
> don't have more cores.
We had a similar thing arise with DFG vs FTL compilation, and we found that even on two cores, it's profitable to have DFG and FTL be in separate threads so that they interleave with each other in case the FTL takes a long time. I think that a similar thing applies here. Having additional baseline threads would likely help. It would be a generalization of what this patch does. But maybe what this patch does is even better, since most baseline compiles are fast enough that it's OK to just do them on the main thread when the helper is busy.
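For what it's worth, the "additional baseline threads" generalization is essentially a small compiler thread pool draining a shared queue; the following is a generic sketch under that assumption. CompilerPool and the plan-as-std::function shape are hypothetical names, not the JSC worklist API.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative sketch: with more than one helper thread, a short compile
// (search#) can proceed while another thread is stuck on a long one (<global>).
class CompilerPool {
public:
    explicit CompilerPool(unsigned threadCount)
    {
        for (unsigned i = 0; i < threadCount; ++i)
            m_threads.emplace_back([this] { runLoop(); });
    }

    ~CompilerPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_shuttingDown = true;
        }
        m_condition.notify_all();
        for (auto& thread : m_threads)
            thread.join();
    }

    void enqueue(std::function<void()> plan)
    {
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_queue.push(std::move(plan));
        }
        m_condition.notify_one();
    }

private:
    void runLoop()
    {
        for (;;) {
            std::function<void()> plan;
            {
                std::unique_lock<std::mutex> lock(m_lock);
                m_condition.wait(lock, [this] { return m_shuttingDown || !m_queue.empty(); });
                if (m_shuttingDown && m_queue.empty())
                    return;
                plan = std::move(m_queue.front());
                m_queue.pop();
            }
            plan(); // run the compile outside the lock
        }
    }

    std::mutex m_lock;
    std::condition_variable m_condition;
    std::queue<std::function<void()>> m_queue;
    std::vector<std::thread> m_threads;
    bool m_shuttingDown { false };
};

With two or more helper threads, a short compile can interleave with a long one instead of queueing behind it, which is the benefit described above for DFG vs. FTL.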
Filip Pizlo
Comment 8
2016-06-19 09:45:04 PDT
Landed in r202205. I think.
Ryosuke Niwa
Comment 9
2016-06-20 18:59:34 PDT
I think we're ~3% regressed on iPhone 6 Plus. Will email you details.