RESOLVED FIXED 74170
DFG's interpretation of rare case profiles should be frequency-based not count-based
https://bugs.webkit.org/show_bug.cgi?id=74170
Filip Pizlo
Reported 2011-12-08 23:44:28 PST
Consider the following code:

    function foo(...) {
        a = b / c; // b and c are integers, but b / c returns a fractional number
        for (var i = 0; i < 100000; ++i) {
            ...
        }
    }

This will get optimized while in the loop, so the rare case counter for b / c will be 1 - not enough to force b / c to be a double division. foo() will then fail speculation the next time it is called, but even then the rare case counter will only be at 2 - still not enough. The threshold is currently around 100 before the DFG would view b / c as a double division - but in this case that won't happen for a very long time, so the loop will never benefit from DFG optimization, all due to this one overzealous speculation.

The solution is to decide that b / c is a double division based on the frequency of slow path execution, rather than on the absolute count of how many times the slow path was taken. This requires knowing the total number of times the instruction executed, which in turn requires a bit more profiling in the old JIT.
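To make the contrast concrete, here is a minimal sketch of the count-based rule versus the frequency-based rule. This is hypothetical C++ with made-up names and thresholds, not the actual JavaScriptCore code:

    #include <cstdint>

    struct RareCaseProfile {
        uint32_t slowCaseCount;  // times the slow path was taken
        uint32_t executionCount; // times the instruction executed (the new profiling)
    };

    // Count-based rule: in the example above, b / c takes slow path once
    // per call to foo(), so this stays false for roughly 100 calls.
    bool isDoubleDivisionCountBased(const RareCaseProfile& p)
    {
        return p.slowCaseCount >= 100;
    }

    // Frequency-based rule: the same profile trips right away, because the
    // slow path runs on essentially every execution of the instruction.
    bool isDoubleDivisionFrequencyBased(const RareCaseProfile& p)
    {
        if (!p.executionCount)
            return false;
        return static_cast<double>(p.slowCaseCount) / p.executionCount >= 0.1;
    }

With the example above, slowCaseCount and executionCount both advance once per call to foo(), so the frequency-based test fires after the very first failure, while the count-based one needs on the order of 100 calls.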
Attachments
the patch (14.56 KB, patch)
2011-12-09 13:31 PST, Filip Pizlo
no flags
the patch (14.60 KB, patch)
2011-12-09 13:48 PST, Filip Pizlo
no flags
Filip Pizlo
Comment 1 2011-12-08 23:45:21 PST
The easiest way to get a rate rather than an absolute count is to count the number of times each basic block executes. This requires counters only at jump targets and immediately following branches (see the sketch after the benchmark report). I implemented this, and it's really cheap - I can't see any performance degradation.

Benchmark report for SunSpider, V8, and Kraken on oldmac.local (MacPro4,1).

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc (r102385)
"CountBB" at /Volumes/Data/pizlo/tertiary/OpenSource/WebKitBuild/Release/jsc (r102385)

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                    TipOfTree               CountBB

SunSpider:
   3d-cube                      8.8183+-0.0418  ?       8.8363+-0.0535  ?
   3d-morph                    10.1214+-0.0387  ?      10.1393+-0.0775  ?
   3d-raytrace                  9.3476+-0.1096          9.2083+-0.0700       might be 1.0151x faster
   access-binary-trees          1.9328+-0.0373  ?       1.9617+-0.0373  ?    might be 1.0150x slower
   access-fannkuch              8.9966+-0.0166  !       9.1016+-0.0566  !    definitely 1.0117x slower
   access-nbody                 4.7551+-0.0418          4.7248+-0.0083
   access-nsieve                3.7196+-0.0121  ?       3.7310+-0.0119  ?
   bitops-3bit-bits-in-byte     1.5102+-0.0225  ?       1.5232+-0.0294  ?
   bitops-bits-in-byte          5.9291+-0.0192  !       5.9748+-0.0233  !    definitely 1.0077x slower
   bitops-bitwise-and           3.9838+-0.0191  ?       4.0024+-0.0467  ?
   bitops-nsieve-bits           6.8696+-0.0404          6.8539+-0.0441
   controlflow-recursive        2.7657+-0.0195  ?       2.7738+-0.0122  ?
   crypto-aes                   8.7318+-0.0918          8.7155+-0.0921
   crypto-md5                   2.9179+-0.0189  ?       2.9627+-0.0286  ?    might be 1.0154x slower
   crypto-sha1                  2.5999+-0.0151  ?       2.6182+-0.0207  ?
   date-format-tofte           13.3252+-0.1260         13.1250+-0.1467       might be 1.0153x faster
   date-format-xparb           11.9514+-0.0594  ?      12.0433+-0.0878  ?
   math-cordic                  8.6429+-0.0445  ?       8.6963+-0.0691  ?
   math-partial-sums           12.6434+-0.0336         12.6069+-0.0228
   math-spectral-norm           3.1378+-0.0201  ?       3.1393+-0.0184  ?
   regexp-dna                  15.8619+-0.1243         15.7403+-0.0863
   string-base64                4.7747+-0.0419  ?       4.8000+-0.0489  ?
   string-fasta                 8.8949+-0.0507          8.8560+-0.0287
   string-tagcloud             15.0438+-0.0765  ?      15.0568+-0.0822  ?
   string-unpack-code          25.9398+-0.0298  !      26.5210+-0.2463  !    definitely 1.0224x slower
   string-validate-input        6.7550+-0.0779  ?       6.7989+-0.0832  ?

   <arithmetic> *               8.0758+-0.0236  ?       8.0966+-0.0343  ?    might be 1.0026x slower
   <geometric>                  6.4493+-0.0230  ?       6.4674+-0.0271  ?    might be 1.0028x slower
   <harmonic>                   5.0302+-0.0242  ?       5.0551+-0.0246  ?    might be 1.0050x slower

V8:
   crypto                      92.3103+-0.3728  ?      92.5307+-0.4690  ?
   deltablue                  204.9305+-1.6018        204.4549+-1.6621
   earley-boyer               120.8883+-1.4748  ?     121.7168+-1.2146  ?
   raytrace                    69.9284+-0.2411         69.8041+-0.2873
   regexp                     149.0791+-0.7718        148.3416+-0.5401
   richards                   171.1122+-1.6146  ^     167.8128+-1.0429  ^    definitely 1.0197x faster
   splay                      107.8526+-1.1289  ?     108.0241+-1.2968  ?

   <arithmetic>               130.8716+-0.5034        130.3836+-0.3249       might be 1.0037x faster
   <geometric> *              123.5641+-0.4652        123.2515+-0.2999       might be 1.0025x faster
   <harmonic>                 116.4428+-0.4195        116.2714+-0.2818       might be 1.0015x faster

Kraken:
   ai-astar                   900.4571+-1.3743  ?     900.4581+-0.8574  ?
   audio-beat-detection       247.8751+-0.5911  ?     248.3219+-0.8955  ?
   audio-dft                  320.1299+-3.1069        319.5416+-3.4951
   audio-fft                  160.9397+-0.4120        160.5048+-0.2567
   audio-oscillator           343.0422+-7.1926  ?     352.6219+-6.6597  ?    might be 1.0279x slower
   imaging-darkroom           403.1950+-5.5486  ?     404.6109+-5.9675  ?
   imaging-desaturate         288.5352+-0.3384  ?     288.7382+-0.3635  ?
   imaging-gaussian-blur      740.5361+-2.3800        738.8561+-0.5093
   json-parse-financial        86.2767+-0.2412  !      87.2189+-0.2078  !    definitely 1.0109x slower
   json-stringify-tinderbox    99.1519+-0.2664         99.1123+-0.2535
   stanford-crypto-aes        142.6808+-0.6472  ?     145.4456+-3.9557  ?    might be 1.0194x slower
   stanford-crypto-ccm        138.1201+-1.0647        137.1065+-1.0269
   stanford-crypto-pbkdf2     281.7826+-2.0743  ?     283.1335+-2.1555  ?
   stanford-crypto-sha256-iterative
                              114.8826+-0.2872  ?     115.0961+-0.5195  ?

   <arithmetic> *             304.8289+-0.5537  ?     305.7690+-0.7952  ?    might be 1.0031x slower
   <geometric>                237.8354+-0.5344  ?     238.7712+-0.7594  ?    might be 1.0039x slower
   <harmonic>                 192.1020+-0.4203  ?     192.8865+-0.6767  ?    might be 1.0041x slower

All benchmarks:
   <arithmetic>               114.7591+-0.2254  ?     114.9779+-0.2364  ?    might be 1.0019x slower
   <geometric>                 29.3225+-0.0838  ?      29.3910+-0.0855  ?    might be 1.0023x slower
   <harmonic>                   8.8650+-0.0420  ?       8.9081+-0.0426  ?    might be 1.0049x slower

Geomean of preferred means:
   <scaled-result>             67.2527+-0.1669  ?      67.3225+-0.1479  ?    might be 1.0010x slower
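On the counter placement described at the top of this comment, here is a toy interpreter sketch of the same idea. This is hypothetical code (the real change instruments the baseline JIT, not an interpreter); it shows that counters only at jump targets and immediately after branches - the only places where control flow diverges - still let you recover every block's execution count:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    enum class Op { Nop, DecAndBranchIfNonZero, Halt };

    struct Instruction {
        Op op;
        size_t target = 0; // jump target, used by the branch opcode
    };

    int main()
    {
        // pc 0: entry block; pc 1: loop header (a jump target); pc 2: exit.
        std::vector<Instruction> code = {
            { Op::Nop },                      // pc 0
            { Op::DecAndBranchIfNonZero, 1 }, // pc 1: jumps back to itself
            { Op::Halt },                     // pc 2
        };
        std::vector<uint32_t> blockCounters(code.size(), 0);

        int counter = 10; // loop trip count
        size_t pc = 0;
        ++blockCounters[0]; // entry block runs once per invocation
        for (;;) {
            switch (code[pc].op) {
            case Op::Nop:
                ++pc; // plain fall-through: no counter needed here
                break;
            case Op::DecAndBranchIfNonZero:
                if (--counter) {
                    pc = code[pc].target;
                    ++blockCounters[pc]; // counter at the jump target
                } else {
                    ++pc;
                    ++blockCounters[pc]; // counter immediately after the branch
                }
                break;
            case Op::Halt:
                // Block 1's true count is blockCounters[0] + blockCounters[1],
                // since block 0 always falls through into it: 1 + 9 = 10.
                printf("entry=%u back-edges=%u exit=%u\n",
                       (unsigned)blockCounters[0], (unsigned)blockCounters[1],
                       (unsigned)blockCounters[2]);
                return 0;
            }
        }
    }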
Filip Pizlo
Comment 2 2011-12-09 02:11:11 PST
Well, this is interesting. What I was trying to say is that we want a test like:

    rateOfFailure / rateOfExecution >= threshold

to determine whether we should assume the worst for this instruction. But that's not quite right. Consider the following code:

    for (var i = 0; i < 10000; ++i)
        doSomethingThatTakesSlowPathOneInEveryLoopIterations()

In this case, rateOfFailure = rateOfExecution / 10. So it seems like we can assume that it rarely fails and compile the code as such. But this would be wrong! Our goal is not to maximize the probability of a single speculation check succeeding; it is to maximize the probability that the function completes without any speculation failures. If any given iteration of the loop has a 1/10 chance of failing, and the loop executes 10,000 times, then the probability that we fail at some point during the execution of this loop is essentially 1. Formally, it is something like:

    1 - (1 - rateOfFailure / rateOfExecution) ^ 10000

or, more generally:

    1 - (1 - rateOfFailure / rateOfExecution) ^ (rateOfExecution / numberOfFunctionInvocations)

We have two choices: we can either perform the formal computation, or we can be clever. Suppose that we do not count the total number of executions of an operation (rateOfExecution), but we do count the number of times the function executes, and we test:

    rateOfFailure / numberOfFunctionExecutions >= threshold

Oddly, this gets us much closer to the general formulation, in terms of its behavior when you plot it for varying inputs. This is actually great news, because it means that we don't even have to track the execution rate of basic blocks - all we need is the total number of times that a code block executed.
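A quick numeric check of the formulas above. This is a standalone sketch; variable names follow the comment, and it uses the corrected one-in-ten slow path rate from the next comment:

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double rateOfFailure = 1000;    // slow paths per function invocation
        double rateOfExecution = 10000; // executions per function invocation
        double numberOfFunctionInvocations = 1;

        double perCheckFailure = rateOfFailure / rateOfExecution; // 0.1: looks benign
        double checksPerInvocation = rateOfExecution / numberOfFunctionInvocations;

        // Probability that an invocation hits at least one speculation failure:
        // effectively 1, since (1 - 0.1)^10000 underflows to 0.
        double pAnyFailure = 1 - std::pow(1 - perCheckFailure, checksPerInvocation);
        printf("P(at least one failure) = %.6f\n", pAnyFailure);

        // The cheap proxy tested instead: slow paths per function execution.
        // 1000 per invocation dwarfs any plausible threshold, so the proxy
        // correctly flags this operation where the per-check rate did not.
        printf("slow paths per invocation = %.0f\n",
               rateOfFailure / numberOfFunctionInvocations);
        return 0;
    }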
Filip Pizlo
Comment 3 2011-12-09 02:21:48 PST
> for (var i = 0; i < 10000; ++i)
>     doSomethingThatTakesSlowPathOneInEveryLoopIterations()

This was meant to be doSomethingThatTakesSlowPathOneInEveryTenLoopIterations(). As in s/OneInEveryLoop/OneInEveryTenLoop/.
Filip Pizlo
Comment 4 2011-12-09 13:24:56 PST
It turns out that this is necessary, but insufficient, to precisely catch operations that often take slow path. The early phase of a program's execution is often not representative of how the program will behave in steady state. Hence if we make an operation safe - say, by switching it to a double addition instead of an integer addition - just because 1/2 of its initial executions took slow path, we run the risk that in steady state that fraction is actually below 1/10. So we need a rule like the following. Make the operation safe only when:

(1) it has taken slow path some minimum number of times (probably close to 100), and
(2) it has taken slow path in more than some fraction of its total executions (probably 1/10),

or when:

(3) the speculation failures that triggered recompilation were caused by this operation not being safe.

In this bug, I'll only solve the first two parts (sketched below), and fix the last part in another bug.
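A sketch of the first two parts of that rule as a single predicate. The helper name is hypothetical; the actual patch expresses the thresholds as tunable settings in Heuristics.cpp:

    #include <cstdint>

    static const uint32_t minimumSlowCaseCount = 100; // "close to 100"
    static const double slowCaseFraction = 1.0 / 10;  // "probably 1/10"

    // Part (3) - tracing a recompilation back to the offending operation -
    // is deferred to a follow-up bug.
    bool likelyToTakeSlowCase(uint32_t slowCaseCount, uint32_t codeBlockExecutions)
    {
        // Require both an absolute floor (so early, unrepresentative
        // executions can't trigger the rule on noise) and a minimum
        // frequency relative to how often the code block has run.
        return slowCaseCount >= minimumSlowCaseCount
            && slowCaseCount >= codeBlockExecutions * slowCaseFraction;
    }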
Filip Pizlo
Comment 5 2011-12-09 13:31:52 PST
Created attachment 118623 [details] the patch
Filip Pizlo
Comment 6 2011-12-09 13:48:32 PST
Created attachment 118627 [details]
the patch

Fixed some #if's.
Geoffrey Garen
Comment 7 2011-12-09 13:49:51 PST
Comment on attachment 118627 [details]
the patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118627&action=review

r=me

> Source/JavaScriptCore/runtime/Heuristics.cpp:148
> +    SET(likelyToTakeSlowCaseMinimumCount, 100);

It seems like this 100 limit is still problematic in the first example you gave.
Filip Pizlo
Comment 8 2011-12-09 13:54:29 PST
(In reply to comment #7)
> (From update of attachment 118627 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=118627&action=review
>
> r=me
>
> > Source/JavaScriptCore/runtime/Heuristics.cpp:148
> > +    SET(likelyToTakeSlowCaseMinimumCount, 100);
>
> It seems like this 100 limit is still problematic in the first example you gave.

Yup. But making it lower would cause us to become overly pessimistic about code that takes many slow cases early in its execution. That's why we also need a way to recognize that recompilation was triggered by some slow case in the DFG. I'll do that in a follow-on patch, since this one is big enough already.
Filip Pizlo
Comment 9 2011-12-09 16:10:15 PST