RESOLVED FIXED Bug 158906
REGRESSION(concurrent baseline JIT): Kraken/ai-astar runs 20% slower
https://bugs.webkit.org/show_bug.cgi?id=158906
Filip Pizlo
Reported 2016-06-18 12:28:50 PDT
So far this looks like it's because the baseline compile of search# is delayed by >10ms. I'm still investigating why. I do know that it's only the decision to compile baseline code concurrently that causes the slow-down, and not any of the other changes in the patch.
Attachments
the patch (7.25 KB, patch)
2016-06-18 13:50 PDT, Filip Pizlo
benjamin: review+
Filip Pizlo
Comment 1 2016-06-18 12:32:03 PDT
Looks like we enqueue the search# plan and then it takes ~10ms for the compiler thread to wake up!
Filip Pizlo
Comment 2 2016-06-18 12:43:18 PDT
This is fascinating! Here's what's happening: we enqueued a compile of <global>, which is a ginormous piece of global code, during the benchmark's unmeasured initialization phase. Then, at the time that we try to compile search#, the JIT thread is busy compiling <global>. That takes a long time to compile and it generates a disgusting amount of code: <global> takes 16ms to compile and generates 8MB!!

There are a bunch of solutions we could attempt:

1) If we want to compile something and the concurrent JIT thread is busy, then compile on the main thread instead.

2) Launch multiple JIT threads.

3) Delay the compilation of <global> based on size heuristics, the same way that we do for DFG and FTL tier-up. This might be hard because I think that <global> actually spends a significant amount of time in its silly loop.

4) Don't compile all of <global> with the baseline JIT. It tiers up because of a tiny loop at the end.

I think I'll attempt (1). That seems like the easiest way of recovering this regression. I think it would be great to consider the other solutions, too.
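For reference, a minimal sketch of what option (1) could look like. All names here (BaselineWorklist, compileSynchronously, enqueuePlan) are hypothetical; this is illustrative only, not the patch that actually landed in JSC:

```cpp
// Hypothetical sketch of option (1): if the single concurrent JIT thread is
// already occupied (e.g. chewing through <global>), compile on the main
// thread rather than waiting ~10ms in the queue.

#include <mutex>

struct CodeBlock { };                                  // stand-in for JSC's CodeBlock

static void compileSynchronously(CodeBlock*) { /* baseline-compile right here */ }
static void enqueuePlan(CodeBlock*) { /* hand the plan to the helper thread */ }

class BaselineWorklist {
public:
    void compile(CodeBlock* codeBlock)
    {
        {
            std::lock_guard<std::mutex> locker(m_lock);
            if (!m_threadIsBusy) {
                m_threadIsBusy = true;                 // cleared by the thread when the plan finishes
                enqueuePlan(codeBlock);
                return;
            }
        }
        // The helper thread is stuck on a long compile; baseline compiles are
        // usually cheap, so just do this one synchronously on the main thread.
        compileSynchronously(codeBlock);
    }

private:
    std::mutex m_lock;
    bool m_threadIsBusy { false };
};
```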
Filip Pizlo
Comment 3 2016-06-18 13:15:53 PDT
Attempt (1) seems to work for Kraken/ai-astar!

Trunk (including concurrent baseline JIT): 111.2 ms
With my fix: 92.6 ms
With concurrent baseline JIT totally disabled: 87.9 ms

I'll do more tests.
Filip Pizlo
Comment 4 2016-06-18 13:22:19 PDT
I believe that this resolves the Kraken issue. Here are the numbers now.

FROM = concurrent baseline JIT completely disabled.
TO = my change, which uses synchronous compilation if the thread is busy.

TEST                         COMPARISON            FROM                 TO               DETAILS
====================================================================================

** TOTAL **:                 1.013x as fast     826.1ms +/- 0.6%    815.5ms +/- 1.1%    significant

====================================================================================

  ai:                        *1.043x as slow*    85.3ms +/- 2.0%     89.0ms +/- 1.6%    significant
    astar:                   *1.043x as slow*    85.3ms +/- 2.0%     89.0ms +/- 1.6%    significant

  audio:                     1.037x as fast     259.1ms +/- 1.1%    249.8ms +/- 2.9%    significant
    beat-detection:          ??                  58.1ms +/- 3.7%     59.1ms +/- 11.5%   might be *1.017x as slow*
    dft:                     1.100x as fast     103.1ms +/- 2.1%     93.7ms +/- 2.1%    significant
    fft:                     -                   43.9ms +/- 1.8%     42.9ms +/- 2.4%
    oscillator:              ??                  54.0ms +/- 1.2%     54.1ms +/- 2.1%    might be *1.002x as slow*

  imaging:                   ??                 185.1ms +/- 1.0%    185.6ms +/- 2.5%    might be *1.003x as slow*
    gaussian-blur:           -                   64.2ms +/- 2.2%     62.2ms +/- 3.4%
    darkroom:                ??                  72.9ms +/- 1.0%     73.5ms +/- 0.8%    might be *1.008x as slow*
    desaturate:              ??                  48.0ms +/- 5.1%     49.9ms +/- 6.5%    might be *1.040x as slow*

  json:                      -                   58.7ms +/- 1.3%     57.8ms +/- 1.3%
    parse-financial:         -                   35.2ms +/- 1.6%     34.8ms +/- 1.6%
    stringify-tinderbox:     1.022x as fast      23.5ms +/- 1.6%     23.0ms +/- 1.5%    significant

  stanford:                  1.020x as fast     237.9ms +/- 0.8%    233.3ms +/- 1.2%    significant
    crypto-aes:              1.045x as fast      51.5ms +/- 1.9%     49.3ms +/- 1.2%    significant
    crypto-ccm:              -                   48.4ms +/- 1.9%     48.3ms +/- 4.7%
    crypto-pbkdf2:           1.014x as fast      99.6ms +/- 1.3%     98.2ms +/- 0.6%    significant
    crypto-sha256-iterative: 1.024x as fast      38.4ms +/- 1.3%     37.5ms +/- 1.3%    significant
Filip Pizlo
Comment 5 2016-06-18 13:50:33 PDT
Created attachment 281618 [details] the patch
Benjamin Poulain
Comment 6 2016-06-18 23:51:22 PDT
Comment on attachment 281618 [details]
the patch

A bit nasty, but let's try it anyway. Especially since ARM hurts a lot and we don't have more cores there.
Filip Pizlo
Comment 7 2016-06-19 09:40:25 PDT
(In reply to comment #6)
> Comment on attachment 281618 [details]
> the patch
>
> A bit nasty, but let's try it anyway. Especially since ARM hurts a lot and
> we don't have more cores there.

We had a similar thing arise with DFG vs FTL compilation, and we found that even on two cores, it's profitable to have DFG and FTL be in separate threads so that they interleave with each other in case the FTL takes a long time. I think that a similar thing applies here. Having additional baseline threads would likely help; it would be a generalization of what this patch does.

But maybe what this patch does is even better, since most baseline compiles are fast enough that it's OK to just do them on the main thread when the helper is busy.
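As a rough illustration of that "additional baseline threads" generalization (hypothetical names, not JSC's actual JITWorklist), a pool of compiler threads could be sketched like this; one slow plan such as <global> then only occupies a single worker while short plans keep flowing:

```cpp
// Illustrative multi-threaded compile worklist. Shutdown/joining is omitted
// for brevity; a real implementation would need a way to stop the workers.

#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct CodeBlock { };                              // stand-in for JSC's CodeBlock

static void compilePlan(CodeBlock*) { /* baseline-compile one plan */ }

class MultiThreadedWorklist {
public:
    explicit MultiThreadedWorklist(unsigned numberOfThreads)
    {
        for (unsigned i = 0; i < numberOfThreads; ++i)
            m_threads.emplace_back([this] { runWorker(); });
    }

    void enqueue(CodeBlock* codeBlock)
    {
        {
            std::lock_guard<std::mutex> locker(m_lock);
            m_queue.push_back(codeBlock);
        }
        m_condition.notify_one();
    }

private:
    void runWorker()
    {
        for (;;) {
            CodeBlock* codeBlock;
            {
                std::unique_lock<std::mutex> locker(m_lock);
                m_condition.wait(locker, [this] { return !m_queue.empty(); });
                codeBlock = m_queue.front();
                m_queue.pop_front();
            }
            compilePlan(codeBlock);                // a slow plan only ties up this worker
        }
    }

    std::mutex m_lock;
    std::condition_variable m_condition;
    std::deque<CodeBlock*> m_queue;
    std::vector<std::thread> m_threads;
};
```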
Filip Pizlo
Comment 8 2016-06-19 09:45:04 PDT
Landed in r202205. I think.
Ryosuke Niwa
Comment 9 2016-06-20 18:59:34 PDT
I think we're ~3% regressed on iPhone 6 Plus. Will email you details.