RESOLVED FIXED Bug 112239
LLint should be able to use x87 instead of SSE for floating point
https://bugs.webkit.org/show_bug.cgi?id=112239
Summary LLint should be able to use x87 instead of SSE for floating point
Jan
Reported 2013-03-13 02:40:29 PDT
QtWebKit crashes with illegal instruction (I tested rekonq and arora). I'm using Arch Linux on a quite old system. Traces and /proc/cpuinfo attached.
Attachments
Trace from rekonq (10.57 KB, text/x-log)
2013-03-13 02:41 PDT, Jan
no flags
Trace from arora (3.91 KB, text/x-log)
2013-03-13 02:42 PDT, Jan
no flags
/proc/cpuinfo (523 bytes, application/octet-stream)
2013-03-13 02:42 PDT, Jan
no flags
WIP (15.15 KB, patch)
2013-04-04 01:51 PDT, Allan Sandfeld Jensen
no flags
Patch (14.93 KB, patch)
2013-04-04 02:24 PDT, Allan Sandfeld Jensen
no flags
LLIntAssembly.h (555.91 KB, application/octet-stream)
2013-04-05 01:34 PDT, Allan Sandfeld Jensen
no flags
Patch (3.42 KB, patch)
2013-04-05 09:04 PDT, Allan Sandfeld Jensen
no flags
Patch (21.18 KB, patch)
2013-04-08 09:17 PDT, Allan Sandfeld Jensen
no flags
Patch (24.60 KB, patch)
2013-04-10 03:25 PDT, Allan Sandfeld Jensen
no flags
my version (21.63 KB, patch)
2013-04-11 13:50 PDT, Filip Pizlo
no flags
Patch (21.36 KB, patch)
2013-04-12 05:26 PDT, Allan Sandfeld Jensen
no flags
Patch (21.68 KB, patch)
2013-04-19 05:41 PDT, Allan Sandfeld Jensen
no flags
Jan
Comment 1 2013-03-13 02:41:46 PDT
Created attachment 192888 [details] Trace from rekonq
Jan
Comment 2 2013-03-13 02:42:19 PDT
Created attachment 192889 [details] Trace from arora
Jan
Comment 3 2013-03-13 02:42:54 PDT
Created attachment 192890 [details] /proc/cpuinfo
Allan Sandfeld Jensen
Comment 4 2013-03-30 13:52:19 PDT
Is this QtWebKit 2.3, or the one from Qt 5? It looks like you have an x86 CPU without SSE2, which is used by default for math in QtWebKit on x86. In QtWebKit it can be disabled by using --no-force-sse2.
Jan
Comment 5 2013-04-01 11:19:27 PDT
This is QtWebKit 2.3 using Qt 4, but it also crashes with the one shipped with Qt 5. I (and the Arch package too) used --no-force-sse2 to create the debug build.
Allan Sandfeld Jensen
Comment 6 2013-04-01 13:35:16 PDT
Confirmed. X86.rb that picks X86 instructions uses SSE2 instructions in various places. You appear to hit mulsd in this case, but there are a few other examples of SSE2 instructions. I will see if this can be fixed, but I am afraid there will be resistance to implementing it the right way since there is a fallback called cloop. Unfortunately cloop is not that fast and also has a lot of bugs since it is only used for otherwise unsupported architectures.
Filip Pizlo
Comment 7 2013-04-01 13:40:18 PDT
(In reply to comment #6)
> Confirmed. X86.rb that picks X86 instructions uses SSE2 instructions in various places. You appear to hit mulsd in this case, but there are a few other examples of SSE2 instructions.
>
> I will see if this can be fixed, but I am afraid there will be resistance to implementing it the right way since there is a fallback called cloop. Unfortunately cloop is not that fast and also has a lot of bugs since it is only used for otherwise unsupported architectures.

It would be great to fix cloop.

For us, cloop appears to be pretty quick, in the tests we've done - so if it isn't on your platform, it might be easier to fix that problem than modifying x86.rb.

On the other hand, I don't want to get in the way of getting things working for you guys: if you'd rather add x87 support to x86.rb then I'd be happy to help. I believe it should be straightforward if you follow Intel's guidance on how to make x87 "look" like a flat register file.
Allan Sandfeld Jensen
Comment 8 2013-04-01 14:01:48 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > Confirmed. X86.rb that picks X86 instructions uses SSE2 instructions in various places. You appear to hit mulsd in this case, but there are a few other examples of SSE2 instructions.
> >
> > I will see if this can be fixed, but I am afraid there will be resistance to implementing it the right way since there is a fallback called cloop. Unfortunately cloop is not that fast and also has a lot of bugs since it is only used for otherwise unsupported architectures.
>
> It would be great to fix cloop.
>
> For us, cloop appears to be pretty quick, in the tests we've done - so if it isn't on your platform, it might be easier to fix that problem than modifying x86.rb.
>
> On the other hand, I don't want to get in the way of getting things working for you guys: if you'd rather add x87 support to x86.rb then I'd be happy to help. I believe it should be straightforward if you follow Intel's guidance on how to make x87 "look" like a flat register file.

I looked further and it seems the JIT detects when it needs SSE2 for floating point operations, bails out for blocks where it needs it, and leaves the job to LLint.

Would it be possible for LLint to detect the situation similarly to the JIT and fall back to CLoop for blocks that cannot be interpreted with native assembler?
Filip Pizlo
Comment 9 2013-04-01 14:05:34 PDT
(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > Confirmed. X86.rb that picks X86 instructions uses SSE2 instructions in various places. You appear to hit mulsd in this case, but there are a few other examples of SSE2 instructions.
> > >
> > > I will see if this can be fixed, but I am afraid there will be resistance to implementing it the right way since there is a fallback called cloop. Unfortunately cloop is not that fast and also has a lot of bugs since it is only used for otherwise unsupported architectures.
> >
> > It would be great to fix cloop.
> >
> > For us, cloop appears to be pretty quick, in the tests we've done - so if it isn't on your platform, it might be easier to fix that problem than modifying x86.rb.
> >
> > On the other hand, I don't want to get in the way of getting things working for you guys: if you'd rather add x87 support to x86.rb then I'd be happy to help. I believe it should be straightforward if you follow Intel's guidance on how to make x87 "look" like a flat register file.
>
> I looked further and it seems the JIT detects when it needs SSE2 for floating point operations, bails out for blocks where it needs it, and leaves the job to LLint.

I don't think that's true. The JIT will leave the job to its C code slow paths if SSE2 is not around. That's actually quite nasty - calling to C code for every double arithmetic op is really bad.

> Would it be possible for LLint to detect the situation similarly to the JIT and fall back to CLoop for blocks that cannot be interpreted with native assembler?

It would be hard to detect it at run-time, and would require a lot of changes to have _both_ the LLInt asm code and the LLInt cloop code in the same executable. I'm not sure any of us want to deal with the maintenance hassles of that approach! ;-)

So here are our options:

- Have an ability to build LLInt to use x87. I like this approach the best.
- Have the LLInt do a run-time check on each arithmetic op, and bail to its C slow path, like the JIT does. This is probably less elegant, and slower, than the previous option. But it could work.

Anyways, if you're looking for a quick fix I highly recommend you get the cloop working. The whole point of the cloop is to be portable; if it isn't then we should fix it.

If you want performance, then let's do it right. There's nothing fundamentally blocking x87 support in the LLInt.
Allan Sandfeld Jensen
Comment 10 2013-04-01 14:11:50 PDT
(In reply to comment #9)
> If you want performance, then let's do it right. There's nothing fundamentally blocking x87 support in the LLInt.

I don't really care about performance on these old machines. The problem is that on Linux many distributions have a policy of supporting architectures as far back as i686 (or even i486 in extreme cases). So if we use a solution that forces the switch at compile time (like cloop), it would force the distributions to compile all x86 builds with this switch and also slow down more modern x86 processors. Anything at run time would be fine as long as it only hurts the slow machines. Though it would be good if it is at least as fast as the old interpreter, which is the baseline we should avoid regressing against.
Filip Pizlo
Comment 11 2013-04-01 14:13:46 PDT
(In reply to comment #10)
> (In reply to comment #9)
> > If you want performance, then let's do it right. There's nothing fundamentally blocking x87 support in the LLInt.
>
> I don't really care about performance on these old machines. The problem is that on Linux many distributions have a policy of supporting architectures as far back as i686 (or even i486 in extreme cases). So if we use a solution that forces the switch at compile time (like cloop), it would force the distributions to compile all x86 builds with this switch and also slow down more modern x86 processors. Anything at run time would be fine as long as it only hurts the slow machines. Though it would be good if it is at least as fast as the old interpreter, which is the baseline we should avoid regressing against.

Aha! Got it.

I would recommend seeing if you can cache the hasSSE2() (or whatever it's called) result in JSGlobalData, and then, only when compiling in this configuration, have LLInt check that flag prior to doing SSE stuff and bail out if it's not available.

It'll be one more check on the double paths of the interpreter. You should benchmark how this affects performance but I'm guessing it won't be much.

Or just implement x87.
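The cached-flag idea in comment 11 can be sketched roughly as follows. This is only an illustration, not JavaScriptCore code: `VMFlags`, `detectSSE2`, `slowPathMul`, and `mulDouble` are hypothetical names, the struct merely stands in for the per-VM state (JSGlobalData), and the detection assumes a GCC or Clang x86 build where `__builtin_cpu_supports` is available.

```cpp
// Hypothetical sketch: detect SSE2 once, cache the answer, and consult the
// cached flag on each double operation instead of re-detecting every time.

struct VMFlags {
    bool hasSSE2; // stand-in for a flag cached in the per-VM state
};

inline bool detectSSE2()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();                  // harmless if already initialized
    return __builtin_cpu_supports("sse2"); // GCC/Clang x86 builtin
#else
    return false; // assumption: treat unknown toolchains/CPUs as "no SSE2"
#endif
}

inline VMFlags makeVMFlags()
{
    return VMFlags { detectSSE2() };
}

enum class Path { Fast, Slow };

struct MulResult {
    double value;
    Path path;
};

// Stand-in for the C slow path the interpreter would call into.
inline double slowPathMul(double a, double b)
{
    return a * b;
}

// One extra branch on the double path: bail to the slow path without SSE2.
inline MulResult mulDouble(const VMFlags& vm, double a, double b)
{
    if (!vm.hasSSE2)
        return { slowPathMul(a, b), Path::Slow };
    return { a * b, Path::Fast };
}
```

A caller would build the flags once with `makeVMFlags()` and pass them to every `mulDouble` call, so the cost per operation is a single predictable branch.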
Allan Sandfeld Jensen
Comment 12 2013-04-04 01:51:42 PDT
Created attachment 196454 [details] WIP First patch. Works if enabled on x64, but still has a few failing tests on x86 32bit
Allan Sandfeld Jensen
Comment 13 2013-04-04 02:24:04 PDT
Created attachment 196457 [details] Patch Now also working on x86-32
Filip Pizlo
Comment 14 2013-04-04 13:50:43 PDT
Comment on attachment 196457 [details] Patch I like this! Based on looking at the code, R=me. Could you do me a favor though: could you (a) upload your LLIntAssembly.h file so I could sanity check it, mostly for my own curiosity; and (b) make sure that you run full LayoutTests with JIT runtime disabled (put useJIT() = false into Options.cpp's Options::initialize()) and make sure all is cool?
Filip Pizlo
Comment 15 2013-04-04 13:53:18 PDT
Comment on attachment 196457 [details] Patch
View in context: https://bugs.webkit.org/attachment.cgi?id=196457&action=review

> Source/JavaScriptCore/offlineasm/x86.rb:40
> + when "X86"
> + true

Interesting choice. It's what I would have done, also - but it may be surprising to some that x86-32 loses SSE2 in LLInt. Can you make sure you make it clear in the ChangeLog that this is one of the effects of this change?

(An alternative would have been to have a Platform.h macro that selects whether to use x87, and then wire it into here somehow. I don't like that, since it's a lot of complexity for probably no measurable gain - I mean, SSE is faster than x87 but not by enough to have a noticeable effect on the interpreter.)
Allan Sandfeld Jensen
Comment 16 2013-04-05 01:34:18 PDT
Created attachment 196596 [details] LLIntAssembly.h
Allan Sandfeld Jensen
Comment 17 2013-04-05 02:03:33 PDT
Allan Sandfeld Jensen
Comment 18 2013-04-05 04:09:35 PDT
It seems this change caused 52 canvas tests to change subtly, but only on 32-bit (I have tested them with x87 enabled on x64). So I am reopening until I find out what happened. http://build.webkit.sed.hu/builders/x86-32 Linux Qt Release NRWT/builds/31428
Csaba Osztrogonác
Comment 19 2013-04-05 06:16:56 PDT
I think GTK is interested in this fix too, because they previously disabled LLINT because of this bug - http://trac.webkit.org/changeset/130076
Allan Sandfeld Jensen
Comment 20 2013-04-05 09:04:20 PDT
Allan Sandfeld Jensen
Comment 21 2013-04-05 09:06:22 PDT
(In reply to comment #20)
> Created an attachment (id=196640) [details]
> Patch

These are very minor fixups for the patch. They don't fix the failing canvas tests, though. I have also tried forcing rounding to 64-bit precision after every FP operation, and that doesn't help either. So I am currently out of ideas as to how this affects the canvas tests in 32-bit mode but not in 64-bit mode (with x87 enabled).
Filip Pizlo
Comment 22 2013-04-05 13:40:44 PDT
Rolled out in http://trac.webkit.org/changeset/147794

Sorry about this - it's breaking some internal builds. :-( The bug should be easy to fix though:

<inline asm>:1267:2: error: ambiguous instructions require an explicit suffix (could be 'ficomps', or 'ficompl')

Do you know how to fix it?

(If you have a speculative fix, feel free to reland your patch with it and I will watch our bots.)
Allan Sandfeld Jensen
Comment 23 2013-04-05 14:03:36 PDT
(In reply to comment #22)
> Rolled out in http://trac.webkit.org/changeset/147794
>
> Sorry about this - it's breaking some internal builds. :-( The bug should be easy to fix though:
>
> <inline asm>:1267:2: error: ambiguous instructions require an explicit suffix (could be 'ficomps', or 'ficompl')
>
> Do you know how to fix it?
>
> (If you have a speculative fix, feel free to reland your patch with it and I will watch our bots.)

That is already fixed in the fixup patch I attached above.
Filip Pizlo
Comment 24 2013-04-05 14:04:59 PDT
Comment on attachment 196640 [details] Patch r=me Feel free to reland your previous patch along with this one (preferably land them together, or in quick succession - your call).
Filip Pizlo
Comment 25 2013-04-05 14:05:42 PDT
(In reply to comment #23)
> (In reply to comment #22)
> > Rolled out in http://trac.webkit.org/changeset/147794
> >
> > Sorry about this - it's breaking some internal builds. :-( The bug should be easy to fix though:
> >
> > <inline asm>:1267:2: error: ambiguous instructions require an explicit suffix (could be 'ficomps', or 'ficompl')
> >
> > Do you know how to fix it?
> >
> > (If you have a speculative fix, feel free to reland your patch with it and I will watch our bots.)
>
> That is already fixed in the fixup patch I attached above.

I'm sorry!

I should have looked at that patch before rolling out.
Allan Sandfeld Jensen
Comment 26 2013-04-05 14:07:53 PDT
(In reply to comment #25)
> (In reply to comment #23)
> > (In reply to comment #22)
> > > Rolled out in http://trac.webkit.org/changeset/147794
> > >
> > > Sorry about this - it's breaking some internal builds. :-( The bug should be easy to fix though:
> > >
> > > <inline asm>:1267:2: error: ambiguous instructions require an explicit suffix (could be 'ficomps', or 'ficompl')
> > >
> > > Do you know how to fix it?
> > >
> > > (If you have a speculative fix, feel free to reland your patch with it and I will watch our bots.)
> >
> > That is already fixed in the fixup patch I attached above.
>
> I'm sorry!
>
> I should have looked at that patch before rolling out.

No problem. I will wait with landing again until I know what is going on with the last canvas tests in 32-bit, and land it all together.
Allan Sandfeld Jensen
Comment 27 2013-04-08 09:17:48 PDT
Allan Sandfeld Jensen
Comment 28 2013-04-09 02:23:41 PDT
(In reply to comment #27)
> Created an attachment (id=196862) [details]
> Patch

While this patch now works with no test regressions, calling finit before every c-call may cause performance regressions. I will try moving finit to before using fp-instructions, though it will require modifying more llint assembler.
Filip Pizlo
Comment 29 2013-04-09 07:16:36 PDT
(In reply to comment #28)
> (In reply to comment #27)
> > Created an attachment (id=196862) [details]
> > Patch
>
> While this patch now works with no test regressions, calling finit before every c-call may cause performance regressions. I will try moving finit to before using fp-instructions, though it will require modifying more llint assembler.

Why do you need to do a full finit? Would be good to explain what problem this solves. :-)
Allan Sandfeld Jensen
Comment 30 2013-04-09 07:47:42 PDT
(In reply to comment #29)
> (In reply to comment #28)
> > (In reply to comment #27)
> > > Created an attachment (id=196862) [details]
> > > Patch
> >
> > While this patch now works with no test regressions, calling finit before every c-call may cause performance regressions. I will try moving finit to before using fp-instructions, though it will require modifying more llint assembler.
>
> Why do you need to do a full finit? Would be good to explain what problem this solves. :-)

I traced the problem with the canvas tests to wrong values calculated in C++ functions that were called from llint (the input values provided by llint were correct, but the output from C++ was wrong). According to the calling convention details I could find, no assumptions should be made about FP registers, but it seems GCC somehow still expects the FPU to be clear when FP-using functions are called. So the finit is simply put there to ensure any mess we made of the FPU state is undone.

If it were possible, perhaps only calling ffree on the used registers would be enough.
Allan Sandfeld Jensen
Comment 31 2013-04-09 08:30:39 PDT
(In reply to comment #30)
> (In reply to comment #29)
> > (In reply to comment #28)
> > > (In reply to comment #27)
> > > > Created an attachment (id=196862) [details]
> > > > Patch
> > >
> > > While this patch now works with no test regressions, calling finit before every c-call may cause performance regressions. I will try moving finit to before using fp-instructions, though it will require modifying more llint assembler.
> >
> > Why do you need to do a full finit? Would be good to explain what problem this solves. :-)
>
> I traced the problem with the canvas tests to wrong values calculated in C++ functions that were called from llint (the input values provided by llint were correct, but the output from C++ was wrong). According to the calling convention details I could find, no assumptions should be made about FP registers, but it seems GCC somehow still expects the FPU to be clear when FP-using functions are called. So the finit is simply put there to ensure any mess we made of the FPU state is undone.
>
> If it were possible, perhaps only calling ffree on the used registers would be enough.

Yeah, found it. It wasn't mentioned in any of the common descriptions of the calling convention, but if you go to the source, the System V Intel 386 ABI, you find this:

%st(0)
    Floating-point return values appear on the top of the floating-point register stack; there is no difference in the representation of single- or double-precision values in floating-point registers. If the function does not return a floating-point value, then this register must be empty. This register must be empty before entry to a function.

%st(1) through %st(7)
    Floating-point scratch registers have no specified role in the standard calling sequence. These registers must be empty before entry and upon exit from a function.

The stuff about %st(0) being used to return values doesn't apply to Linux I think, but it does state the registers must be empty, which is what finit does (or ffree on registers not popped).
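A toy model may help show why the "must be empty before entry" rule matters. The sketch below is not real FPU code and is not taken from any patch on this bug: `ToyX87Stack` and `calleeSum` are hypothetical names, and the callee is deliberately exaggerated so that it needs all eight slots. It models only the tag bits and the top-of-stack pointer, where a push onto a slot that is still marked in use (a stack overflow) yields a NaN instead of the pushed value, and where `ffree` and `finit` differ in how much state they clear.

```cpp
#include <array>
#include <cmath>
#include <limits>

// Toy model of the x87 register stack: eight slots, a "used" tag per slot,
// and a top-of-stack index that moves down on push and up on pop.
class ToyX87Stack {
public:
    // Models fld: decrement TOP, then overflow if the new top slot is in use.
    void push(double value)
    {
        m_top = (m_top + 7) & 7;
        if (m_used[m_top]) {
            m_regs[m_top] = std::numeric_limits<double>::quiet_NaN();
            return;
        }
        m_regs[m_top] = value;
        m_used[m_top] = true;
    }

    // Models fstp: read the top slot, mark it empty, increment TOP.
    double pop()
    {
        double value = m_used[m_top]
            ? m_regs[m_top]
            : std::numeric_limits<double>::quiet_NaN(); // underflow
        m_used[m_top] = false;
        m_top = (m_top + 1) & 7;
        return value;
    }

    // Models ffree %st(i): mark one slot empty without moving TOP.
    void ffree(int i) { m_used[(m_top + i) & 7] = false; }

    // Models the part of finit that matters here: empty every slot, reset TOP.
    void finit()
    {
        m_used.fill(false);
        m_top = 0;
    }

private:
    std::array<double, 8> m_regs {};
    std::array<bool, 8> m_used {};
    int m_top { 0 };
};

// Exaggerated callee that assumes an empty stack and uses all eight slots.
inline double calleeSum(ToyX87Stack& fpu)
{
    for (int i = 1; i <= 8; ++i)
        fpu.push(static_cast<double>(i));
    double sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += fpu.pop();
    return sum;
}
```

In this model, a caller that leaves a single value behind makes `calleeSum` return NaN, while calling either `finit()` or `ffree(0)` first lets it return 36. Real compiled code rarely needs all eight slots at once, so the sketch only illustrates the mechanism, not the exact failure seen in the canvas tests.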
Allan Sandfeld Jensen
Comment 32 2013-04-10 03:25:59 PDT
Created attachment 197236 [details] Patch Replaced finit with releasing used registers
Filip Pizlo
Comment 33 2013-04-11 12:05:15 PDT
Comment on attachment 197236 [details] Patch
View in context: https://bugs.webkit.org/attachment.cgi?id=197236&action=review

> Source/JavaScriptCore/llint/LowLevelInterpreter32_64.asm:708
> + frelease ft0, ft1

I'm not sure how much I like this new op. This feels like it could get quite fragile - we probably will be adding more stuff to LLint that uses doubles, and it would be weird to have to remember to call this.

Are you sure it's a speedup over the finit approach?
Filip Pizlo
Comment 34 2013-04-11 12:07:04 PDT
I'm going to run some of our perf tests on the finit approach to see what the impact is.
Filip Pizlo
Comment 35 2013-04-11 12:29:54 PDT
Performance impact in jsc commandline versus ToT in 32-bit mode with and without the X87 patch that does finit, with all JITs enabled:

Benchmark report for SunSpider, V8Spider, Octane, Kraken, and JSRegress on bigmac (MacPro5,1).

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quartary/OpenSource/WebKitBuild/Release/jsc (r148221)
"X87" at /Volumes/Data/pizlo/secondary/OpenSource/WebKitBuild/Release/jsc (r148221)

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

Each row below lists: benchmark, TipOfTree, X87, verdict.

                                   TipOfTree                 X87

SunSpider:
   3d-cube   8.4942+-0.1890   ?   8.4965+-0.1056   ?
   3d-morph   11.5936+-0.1159   11.4602+-0.1025   might be 1.0116x faster
   3d-raytrace   11.0021+-0.2164   ?   11.2055+-0.2798   ? might be 1.0185x slower
   access-binary-trees   1.8959+-0.0312   ?   1.9282+-0.0331   ? might be 1.0171x slower
   access-fannkuch   9.1795+-0.1041   ?   9.2876+-0.1071   ? might be 1.0118x slower
   access-nbody   6.3186+-0.0641   ?   6.3783+-0.1399   ?
   access-nsieve   4.4097+-0.0756   ?   4.4678+-0.0572   ? might be 1.0132x slower
   bitops-3bit-bits-in-byte   1.5997+-0.0144   ?   1.6063+-0.0171   ?
   bitops-bits-in-byte   5.6814+-0.0627   5.6249+-0.0858   might be 1.0100x faster
   bitops-bitwise-and   1.8749+-0.0576   1.8669+-0.0424
   bitops-nsieve-bits   4.9566+-0.0523   4.9548+-0.0510
   controlflow-recursive   2.9820+-0.0057   ?   2.9860+-0.0078   ?
   crypto-aes   7.9668+-0.0931   ?   8.0028+-0.1023   ?
   crypto-md5   4.1935+-0.0919   ?   4.3964+-0.1250   ? might be 1.0484x slower
   crypto-sha1   3.2397+-0.0588   3.2360+-0.0570
   date-format-tofte   14.1429+-0.3104   ?   14.2059+-0.2927   ?
   date-format-xparb   9.2639+-0.2593   ?   9.3237+-0.2280   ?
   math-cordic   3.6682+-0.0479   ?   3.6949+-0.0203   ?
   math-partial-sums   11.7925+-0.1135   ?   11.8693+-0.0935   ?
   math-spectral-norm   2.8165+-0.0448   ?   2.8226+-0.0479   ?
   regexp-dna   8.8477+-0.1811   ?   8.9139+-0.1787   ?
   string-base64   4.4073+-0.0523   4.4069+-0.0552
   string-fasta   10.8259+-0.1055   ?   11.0169+-0.1527   ? might be 1.0176x slower
   string-tagcloud   13.4961+-0.1733   ?   13.8031+-0.2359   ? might be 1.0227x slower
   string-unpack-code   25.3899+-0.3293   25.2025+-0.2995
   string-validate-input   7.6714+-0.3432   ?   7.7163+-0.3442   ?

   <arithmetic> *   7.6042+-0.0696   ?   7.6490+-0.0755   ? might be 1.0059x slower
   <geometric>   6.0906+-0.0565   ?   6.1311+-0.0583   ? might be 1.0066x slower
   <harmonic>   4.7875+-0.0430   ?   4.8181+-0.0438   ? might be 1.0064x slower

                                   TipOfTree                 X87

V8Spider:
   crypto   91.6146+-0.5027   91.4137+-0.4699
   deltablue   120.6087+-1.7202   118.8778+-0.7180   might be 1.0146x faster
   earley-boyer   76.1154+-0.8634   75.8586+-0.5411
   raytrace   56.1912+-0.1729   ?   56.5184+-0.4265   ?
   regexp   89.7104+-0.5339   ?   90.5512+-0.3719   ?
   richards   121.3825+-0.7467   ?   122.2395+-1.4907   ?
   splay   53.0984+-0.5773   ?   53.3219+-0.3370   ?

   <arithmetic>   86.9602+-0.3707   ?   86.9687+-0.4621   ? might be 1.0001x slower
   <geometric> *   83.0763+-0.3135   ?   83.1551+-0.4236   ? might be 1.0009x slower
   <harmonic>   79.2106+-0.2817   ?   79.3559+-0.3977   ? might be 1.0018x slower

                                   TipOfTree                 X87

Octane and V8v7:
   encrypt   0.48348+-0.00065   ?   0.48353+-0.00053   ?
   decrypt   8.90966+-0.00596   8.89872+-0.00920
   deltablue x2   0.56748+-0.00494   ^   0.55906+-0.00186   ^ definitely 1.0151x faster
   earley   0.93068+-0.00425   0.92654+-0.00374
   boyer   13.20650+-0.04582   13.19327+-0.04078
   raytrace x2   4.38906+-0.01493   4.37664+-0.01765
   regexp x2   27.44398+-0.08363   ^   27.20890+-0.07996   ^ definitely 1.0086x faster
   richards x2   0.31996+-0.00264   0.31936+-0.00241
   splay x2   0.75715+-0.00909   0.74053+-0.01132   might be 1.0224x faster
   navier-stokes x2   9.38862+-0.00818   ?   9.45802+-0.07138   ?
   closure   0.31999+-0.00895   ?   0.32463+-0.00928   ? might be 1.0145x slower
   jquery   3.88385+-0.49710   ?   3.92851+-0.50244   ? might be 1.0115x slower
   gbemu x2   135.47291+-1.18772   135.38785+-0.68321
   box2d x2   32.42179+-0.11889   32.42034+-0.10271

V8v7:
   <arithmetic>   6.82893+-0.01033   ^   6.80169+-0.01670   ^ definitely 1.0040x faster
   <geometric> *   2.40812+-0.00569   ^   2.39418+-0.00655   ^ definitely 1.0058x faster
   <harmonic>   0.97024+-0.00397   0.96270+-0.00396   might be 1.0078x faster

Octane including V8v7:
   <arithmetic>   20.42073+-0.10691   20.39530+-0.05372   might be 1.0012x faster
   <geometric> *   4.09899+-0.03111   4.08625+-0.02954   might be 1.0031x faster
   <harmonic>   1.10206+-0.00837   1.09754+-0.00765   might be 1.0041x faster

                                   TipOfTree                 X87

Kraken:
   ai-astar   472.628+-0.471   ^   465.689+-4.594   ^ definitely 1.0149x faster
   audio-beat-detection   274.489+-1.197   ?   276.595+-1.695   ?
   audio-dft   381.043+-9.137   373.690+-1.184   might be 1.0197x faster
   audio-fft   137.559+-0.181   ?   137.815+-0.216   ?
   audio-oscillator   298.985+-0.725   ?   299.378+-0.733   ?
   imaging-darkroom   339.991+-0.727   ?   365.113+-30.737   ? might be 1.0739x slower
   imaging-desaturate   137.357+-0.622   136.682+-0.484
   imaging-gaussian-blur   418.365+-1.395   417.264+-0.265
   json-parse-financial   78.265+-0.249   ^   75.689+-0.343   ^ definitely 1.0340x faster
   json-stringify-tinderbox   106.516+-0.297   !   107.743+-0.322   ! definitely 1.0115x slower
   stanford-crypto-aes   101.634+-0.531   101.034+-0.338
   stanford-crypto-ccm   103.308+-1.699   102.352+-1.850
   stanford-crypto-pbkdf2   261.911+-1.525   261.581+-2.100
   stanford-crypto-sha256-iterative   110.083+-0.357   ?   110.181+-0.432   ?

   <arithmetic> *   230.152+-0.785   ?   230.772+-2.217   ? might be 1.0027x slower
   <geometric>   193.090+-0.575   193.036+-1.085   might be 1.0003x faster
   <harmonic>   162.403+-0.501   161.676+-0.605   might be 1.0045x faster

                                   TipOfTree                 X87

JSRegress:
   adapt-to-double-divide   19.0074+-0.1913   ^   18.6296+-0.1514   ^ definitely 1.0203x faster
   aliased-arguments-getbyval   0.8775+-0.0087   ?   0.9004+-0.0162   ? might be 1.0261x slower
   allocate-big-object   2.0838+-0.0328   ?   2.1084+-0.0401   ? might be 1.0118x slower
   arity-mismatch-inlining   0.6851+-0.0080   ?   0.7001+-0.0079   ? might be 1.0219x slower
   array-access-polymorphic-structure   6.7216+-0.1357   ?   6.7285+-0.1152   ?
   array-with-double-add   5.4884+-0.0229   5.4359+-0.0683
   array-with-double-increment   4.1975+-0.0137   4.1942+-0.0146
   array-with-double-mul-add   6.7421+-0.0724   ?   6.7514+-0.0711   ?
   array-with-double-sum   7.1602+-0.1074   ?   7.2963+-0.1622   ? might be 1.0190x slower
   array-with-int32-add-sub   11.7535+-0.0958   ?   11.8917+-0.0825   ? might be 1.0118x slower
   array-with-int32-or-double-sum   7.2823+-0.1007   7.2463+-0.0909
   big-int-mul   4.8611+-0.0549   ?   4.9735+-0.1023   ? might be 1.0231x slower
   boolean-test   3.9271+-0.0224   ?   3.9501+-0.0067   ?
   cast-int-to-double   20.1325+-0.1357   ?   20.1586+-0.1269   ?
   cell-argument   12.1223+-0.1192   ?   12.2541+-0.1462   ? might be 1.0109x slower
   cfg-simplify   2.8959+-0.0180   ?   2.8970+-0.0098   ?
   cmpeq-obj-to-obj-other   11.3564+-0.1334   ?   11.6300+-0.1946   ? might be 1.0241x slower
   constant-test   7.5822+-0.0770   ?   7.6811+-0.0850   ? might be 1.0131x slower
   direct-arguments-getbyval   0.8090+-0.0075   !   0.8241+-0.0066   ! definitely 1.0187x slower
   double-pollution-getbyval   8.9846+-0.1005   ?   8.9944+-0.1016   ?
   double-pollution-putbyoffset   6.3751+-0.1142   6.3006+-0.0794   might be 1.0118x faster
   empty-string-plus-int   10.1471+-0.1503   ?   10.2435+-0.2278   ?
   external-arguments-getbyval   2.1419+-0.0103   !   2.1730+-0.0146   ! definitely 1.0145x slower
   external-arguments-putbyval   5.0277+-0.1003   ?   5.0886+-0.0610   ? might be 1.0121x slower
   Float32Array-matrix-mult   12.3379+-0.1668   ?   12.5577+-0.1459   ? might be 1.0178x slower
   fold-double-to-int   23.0576+-0.3205   22.7423+-0.1152   might be 1.0139x faster
   function-dot-apply   2.8390+-0.0041   !   2.8574+-0.0037   ! definitely 1.0065x slower
   function-test   5.5198+-0.0918   ?   5.5437+-0.0469   ?
   get-by-id-chain-from-try-block   6.0694+-0.0337   6.0636+-0.0811
   HashMap-put-get-iterate-keys   96.0344+-0.8850   ?   96.0750+-1.1000   ?
   HashMap-put-get-iterate   97.4174+-0.7436   ?   97.5351+-0.6526   ?
   HashMap-string-put-get-iterate   71.9545+-0.8233   71.3582+-1.3494
   indexed-properties-in-objects   4.9761+-0.0455   ?   4.9909+-0.0647   ?
   inline-arguments-access   1.0701+-0.0150   ?   1.0779+-0.0079   ?
   inline-arguments-local-escape   23.9192+-0.3102   ?   24.1633+-0.3372   ? might be 1.0102x slower
   inline-get-scoped-var   7.8889+-0.1194   7.8187+-0.0998
   inlined-put-by-id-transition   13.3434+-0.1641   ?   13.3613+-0.1483   ?
   int-or-other-abs-then-get-by-val   7.6326+-0.1583   7.5191+-0.1008   might be 1.0151x faster
   int-or-other-abs-zero-then-get-by-val   40.3623+-1.0207   ?   40.4948+-0.5783   ?
   int-or-other-add-then-get-by-val   9.5777+-0.1319   9.4847+-0.0905
   int-or-other-add   9.6544+-0.1102   ?   9.7780+-0.0944   ? might be 1.0128x slower
   int-or-other-div-then-get-by-val   13.6582+-0.3412   ?   13.8374+-0.2928   ? might be 1.0131x slower
   int-or-other-max-then-get-by-val   8.6050+-0.3335   ?   8.8420+-0.2348   ? might be 1.0275x slower
   int-or-other-min-then-get-by-val   7.0259+-0.0817   ?   7.0536+-0.0841   ?
   int-or-other-mod-then-get-by-val   6.5811+-0.0837   ?   6.6192+-0.0770   ?
   int-or-other-mul-then-get-by-val   6.2142+-0.0797   6.1776+-0.0673
   int-or-other-neg-then-get-by-val   7.0530+-0.0928   ?   7.0781+-0.1463   ?
   int-or-other-neg-zero-then-get-by-val   40.0904+-0.9424   39.7395+-1.3277
   int-or-other-sub-then-get-by-val   9.7512+-0.1150   ?   9.8447+-0.0999   ?
   int-or-other-sub   7.5048+-0.1044   7.4675+-0.0922
   int-overflow-local   12.2717+-0.0939   ?   12.3912+-0.1467   ?
   Int16Array-bubble-sort   47.9568+-0.2409   47.8767+-0.2574
   Int16Array-load-int-mul   1.6440+-0.0055   !   1.6625+-0.0054   ! definitely 1.0112x slower
   Int8Array-load   4.1016+-0.0089   !   4.1242+-0.0093   ! definitely 1.0055x slower
   integer-divide   14.1260+-0.1495   13.9724+-0.1302   might be 1.0110x faster
   integer-modulo   2.2314+-0.0247   !   2.5744+-0.0699   ! definitely 1.1537x slower
   make-indexed-storage   3.8672+-0.0235   ?   3.8971+-0.0257   ?
   method-on-number   25.8172+-0.4572   25.5702+-0.2097
   nested-function-parsing-random   362.4207+-8.6741   ?   362.5110+-7.8273   ?
   nested-function-parsing   44.2154+-1.3448   43.8929+-1.3587
   new-array-buffer-dead   3.1131+-0.0344   3.1070+-0.0208
   new-array-buffer-push   8.9108+-0.1986   8.8308+-0.1789
   new-array-dead   23.6157+-0.0904   ?   23.7266+-0.1060   ?
   new-array-push   7.7680+-0.8281   7.3580+-0.8312   might be 1.0557x faster
   number-test   3.9562+-0.0067   !   3.9843+-0.0114   ! definitely 1.0071x slower
   object-closure-call   7.4360+-0.0873   ?   7.4516+-0.0926   ?
   object-test   5.3319+-0.0526   ?   5.3892+-0.0736   ? might be 1.0107x slower
   poly-stricteq   125.2238+-0.2312   ?   125.4714+-0.6003   ?
   polymorphic-structure   20.9966+-0.1812   20.9728+-0.0953
   polyvariant-monomorphic-get-by-id   10.6855+-0.1418   ?   11.8826+-2.2215   ? might be 1.1120x slower
   rare-osr-exit-on-local   17.4106+-0.1685   ?   17.4527+-0.1974   ?
   register-pressure-from-osr   39.8603+-0.1835   ?   39.8854+-0.1673   ?
   simple-activation-demo   32.8159+-0.1963   32.7628+-0.2312
   slow-array-profile-convergence   4.9639+-0.0184   4.9515+-0.0582
   slow-convergence   3.4908+-0.0091   !   3.5181+-0.0125   ! definitely 1.0078x slower
   sparse-conditional   1.0529+-0.0081   ?   1.0669+-0.0082   ? might be 1.0134x slower
   splice-to-remove   82.7474+-0.5393   82.2995+-0.4044
   string-concat-object   2.5940+-0.0951   ?   2.6412+-0.0814   ? might be 1.0182x slower
   string-concat-pair-object   1.7753+-0.0363   ?   1.8018+-0.0199   ? might be 1.0149x slower
   string-concat-pair-simple   9.8503+-0.1304   ?   10.0220+-0.1592   ? might be 1.0174x slower
   string-concat-simple   23.9797+-0.3282   ?   23.9944+-0.4071   ?
   string-cons-repeat   7.9639+-0.0929   ?   8.0451+-0.1028   ? might be 1.0102x slower
   string-cons-tower   7.8234+-0.1375   ?   7.8898+-0.1135   ?
   string-equality   115.8730+-2.2722   114.4837+-1.1632   might be 1.0121x faster
   string-hash   2.6324+-0.0090   !   2.6520+-0.0074   ! definitely 1.0074x slower
   string-repeat-arith   95.8650+-0.4895   95.4210+-0.2657
   string-sub   170.1931+-0.8318   ?   171.0446+-1.0772   ?
   string-test   4.0589+-0.0068   !   4.0987+-0.0053   ! definitely 1.0098x slower
   structure-hoist-over-transitions   2.6808+-0.0232   ?   2.7037+-0.0153   ?
   tear-off-arguments-simple   1.7446+-0.0084   !   1.7646+-0.0075   ! definitely 1.0115x slower
   tear-off-arguments   3.2545+-0.0078   !   3.2722+-0.0085   ! definitely 1.0054x slower
   temporal-structure   20.9941+-0.1439   20.9355+-0.1897
   to-int32-boolean   27.9474+-0.1637   27.9473+-0.1661
   undefined-test   4.1003+-0.0313   !   4.1557+-0.0195   ! definitely 1.0135x slower

   <arithmetic>   22.6581+-0.0684   ?   22.6658+-0.0613   ? might be 1.0003x slower
   <geometric> *   9.1816+-0.0388   ?   9.2349+-0.0446   ? might be 1.0058x slower
   <harmonic>   4.8062+-0.0292   !   4.8670+-0.0278   ! definitely 1.0127x slower

                                   TipOfTree                 X87

All benchmarks:
   <arithmetic>   40.5379+-0.1082   ?   40.5997+-0.2094   ? might be 1.0015x slower
   <geometric>   11.0130+-0.0549   ?   11.0570+-0.0580   ? might be 1.0040x slower
   <harmonic>   3.6120+-0.0217   ?   3.6278+-0.0201   ? might be 1.0044x slower

                                   TipOfTree                 X87

Geomean of preferred means:
   <scaled-result>   22.2650+-0.1099   ?   22.3189+-0.1192   ? might be 1.0024x slower
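For readers unfamiliar with this report format, the verdict column can be reproduced from the two mean+-interval pairs. The sketch below is a reading of the numbers in this comment, not the benchmark script's actual code: `Sample` and `verdict` are hypothetical names, and the rule "definitely when the two 95% intervals do not overlap, otherwise might be" is an assumption that happens to match the rows checked against it.

```cpp
#include <cstdio>
#include <string>

// A mean and the half-width of its 95% confidence interval, in milliseconds.
struct Sample {
    double mean;
    double halfWidth;
};

// Assumed rule: lower time is better; the ratio is always reported >= 1;
// "definitely" means the two confidence intervals are disjoint.
inline std::string verdict(const Sample& baseline, const Sample& candidate)
{
    bool overlap = baseline.mean - baseline.halfWidth <= candidate.mean + candidate.halfWidth
        && candidate.mean - candidate.halfWidth <= baseline.mean + baseline.halfWidth;
    bool faster = candidate.mean < baseline.mean;
    double ratio = faster ? baseline.mean / candidate.mean
                          : candidate.mean / baseline.mean;

    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%s %.4fx %s",
        overlap ? "might be" : "definitely", ratio, faster ? "faster" : "slower");
    return buffer;
}
```

For example, `verdict({11.5936, 0.1159}, {11.4602, 0.1025})` gives the 3d-morph row's "might be 1.0116x faster", and the integer-modulo pair gives "definitely 1.1537x slower". The sketch always prints a verdict, whereas the report appears to leave the column blank for small, overlapping differences on individual benchmarks.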
Allan Sandfeld Jensen
Comment 36 2013-04-11 12:42:24 PDT
(In reply to comment #33)
> (From update of attachment 197236 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=197236&action=review
>
> > Source/JavaScriptCore/llint/LowLevelInterpreter32_64.asm:708
> > + frelease ft0, ft1
>
> I'm not sure how much I like this new op. This feels like it could get quite fragile - we probably will be adding more stuff to LLint that uses doubles, and it would be weird to have to remember to call this.
>
> Are you sure it's a speedup over the finit approach?

From some of the timing info I got for finit it seemed very slow. Though I can't imagine why if the fpu is unused and finit has the effect of a nop.

One thing missing from the finit patch is that it only cleans up when calling deeper functions, not when returning from a function. So if an FP-using function called into llint, that would return with an unclean fpu state that could lead to wrong behavior in the llint caller. Could we add a finit instruction to exit thunks of llint?
Filip Pizlo
Comment 37 2013-04-11 12:46:34 PDT
(In reply to comment #36)
> (In reply to comment #33)
> > (From update of attachment 197236 [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=197236&action=review
> >
> > > Source/JavaScriptCore/llint/LowLevelInterpreter32_64.asm:708
> > > + frelease ft0, ft1
> >
> > I'm not sure how much I like this new op. This feels like it could get quite fragile - we probably will be adding more stuff to LLint that uses doubles, and it would be weird to have to remember to call this.
> >
> > Are you sure it's a speedup over the finit approach?
>
> From some of the timing info I got for finit it seemed very slow. Though I can't imagine why if the fpu is unused and finit has the effect of a nop.

Interesting. I'm doing full LLInt-only perf tests right now, and you're right, finit appears sooooper slow. It's quite shocking actually!

My tests are still running; I'll post results here shortly!

> One thing missing from the finit patch is that it only cleans up when calling deeper functions, not when returning from a function. So if an FP-using function called into llint, that would return with an unclean fpu state that could lead to wrong behavior in the llint caller. Could we add a finit instruction to exit thunks of llint?

I would add it to the ctiTrampoline, which the LLint uses.
Filip Pizlo
Comment 38 2013-04-11 13:32:05 PDT
Wow, that was pretty bad. This is the finit patch with JITs disabled.

Benchmark report for SunSpider, V8Spider, Octane, Kraken, and JSRegress on bigmac (MacPro5,1).

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quartary/OpenSource/WebKitBuild/Release/jsc (r148221)
"X87" at /Volumes/Data/pizlo/secondary/OpenSource/WebKitBuild/Release/jsc (r148221)

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

TipOfTree X87
SunSpider:
3d-cube 17.9707+-0.1456 ! 27.6701+-0.2054 ! definitely 1.5397x slower
3d-morph 24.8084+-0.6925 ! 28.5495+-0.3506 ! definitely 1.1508x slower
3d-raytrace 29.7102+-0.2377 ! 41.7799+-0.2842 ! definitely 1.4062x slower
access-binary-trees 13.0314+-0.1003 ! 16.5678+-0.1654 ! definitely 1.2714x slower
access-fannkuch 41.8859+-0.6660 ! 44.6337+-0.2818 ! definitely 1.0656x slower
access-nbody 18.7420+-0.2359 ! 31.4470+-0.2213 ! definitely 1.6779x slower
access-nsieve 9.4611+-0.1465 ? 9.6155+-0.0866 ? might be 1.0163x slower
bitops-3bit-bits-in-byte 16.5080+-0.1824 ! 17.8070+-0.1768 ! definitely 1.0787x slower
bitops-bits-in-byte 27.3810+-0.2920 ! 48.1828+-0.3189 ! definitely 1.7597x slower
bitops-bitwise-and 52.0208+-0.2235 ! 103.8744+-0.6643 ! definitely 1.9968x slower
bitops-nsieve-bits 36.8362+-1.6001 ! 58.9099+-0.3699 ! definitely 1.5992x slower
controlflow-recursive 18.7358+-0.1621 ! 25.4887+-0.1959 ! definitely 1.3604x slower
crypto-aes 18.7230+-0.1685 ! 21.4457+-0.1522 ! definitely 1.1454x slower
crypto-md5 18.4726+-0.1677 ! 21.1467+-0.1948 ! definitely 1.1448x slower
crypto-sha1 17.2736+-0.2514 ! 19.9465+-0.2236 ! definitely 1.1547x slower
date-format-tofte 24.4813+-0.1295 ! 34.6394+-0.1085 ! definitely 1.4149x slower
date-format-xparb 25.3811+-0.2380 ! 31.5430+-0.2089 ! definitely 1.2428x slower
math-cordic 37.6304+-0.4022 ! 58.6506+-0.1875 ! definitely 1.5586x slower
math-partial-sums 34.7646+-0.2600 ! 62.4406+-0.2836 ! definitely 1.7961x slower
math-spectral-norm 17.7924+-0.2257 ! 22.8651+-0.2856 ! definitely 1.2851x slower
regexp-dna 8.7932+-0.1061 ? 9.0205+-0.1783 ? might be 1.0259x slower
string-base64 26.7820+-0.2573 ! 36.0657+-0.2347 ! definitely 1.3466x slower
string-fasta 22.7089+-0.2262 ! 34.8408+-0.1085 ! definitely 1.5342x slower
string-tagcloud 21.8114+-0.2932 ! 27.2479+-0.1792 ! definitely 1.2493x slower
string-unpack-code 27.8768+-0.1707 ! 34.1804+-0.2007 ! definitely 1.2261x slower
string-validate-input 19.5878+-0.2980 ! 26.9678+-0.3583 ! definitely 1.3768x slower
<arithmetic> * 24.1989+-0.1311 ! 34.4434+-0.0844 ! definitely 1.4233x slower
<geometric> 22.3012+-0.1200 ! 29.8960+-0.0934 ! definitely 1.3406x slower
<harmonic> 20.4460+-0.1176 ! 25.7842+-0.1119 ! definitely 1.2611x slower

TipOfTree X87
V8Spider:
crypto 897.4223+-9.6310 ! 955.2835+-6.6001 ! definitely 1.0645x slower
deltablue 2434.8941+-23.6702 ! 2915.4419+-17.4317 ! definitely 1.1974x slower
earley-boyer 538.8258+-3.4744 ! 778.5211+-5.3388 ! definitely 1.4448x slower
raytrace 284.8305+-1.8339 ! 366.3276+-2.0507 ! definitely 1.2861x slower
regexp 121.5716+-0.3513 ! 133.2938+-0.4799 ! definitely 1.0964x slower
richards 2478.7269+-24.6796 ! 2872.7077+-19.3525 ! definitely 1.1589x slower
splay 205.5477+-0.6517 ! 274.4447+-1.2608 ! definitely 1.3352x slower
<arithmetic> 994.5456+-4.1183 ! 1185.1458+-3.3382 ! definitely 1.1916x slower
<geometric> * 574.9296+-1.3303 ! 701.3212+-0.8083 ! definitely 1.2198x slower
<harmonic> 343.3571+-0.5671 ! 414.2277+-0.7583 ! definitely 1.2064x slower

TipOfTree X87
Octane and V8v7:
encrypt 5.61052+-0.00662 ! 6.06751+-0.00675 ! definitely 1.0815x slower
decrypt 105.82162+-0.13131 ! 112.92338+-0.10450 ! definitely 1.0671x slower
deltablue x2 16.04073+-0.12198 ! 19.20199+-0.12105 ! definitely 1.1971x slower
earley 6.77426+-0.03344 ! 9.01893+-0.02794 ! definitely 1.3314x slower
boyer 130.22991+-0.38527 ! 186.68642+-1.14655 ! definitely 1.4335x slower
raytrace x2 47.32154+-0.08259 ! 61.00536+-0.10240 ! definitely 1.2892x slower
regexp x2 42.93033+-0.19771 ! 46.60571+-0.15538 ! definitely 1.0856x slower
richards x2 7.28965+-0.05698 ! 8.28408+-0.03999 ! definitely 1.1364x slower
splay x2 2.80364+-0.01558 ! 3.69599+-0.02228 ! definitely 1.3183x slower
navier-stokes x2 93.41132+-0.18337 ! 137.42017+-0.13622 ! definitely 1.4711x slower
closure 0.32450+-0.00902 ? 0.32824+-0.01026 ? might be 1.0115x slower
jquery 3.55439+-0.50068 ? 3.63746+-0.50302 ? might be 1.0234x slower
gbemu x2 576.01998+-2.52232 ! 811.78436+-0.90053 ! definitely 1.4093x slower
box2d x2 198.39429+-0.41165 ! 246.71447+-0.70488 ! definitely 1.2436x slower
V8v7:
<arithmetic> 41.75192+-0.04401 ! 54.19518+-0.04514 ! definitely 1.2980x slower
<geometric> * 21.46209+-0.05070 ! 26.54716+-0.02672 ! definitely 1.2369x slower
<harmonic> 10.21905+-0.04141 ! 12.62915+-0.03699 ! definitely 1.2358x slower
Octane including V8v7:
<arithmetic> 100.94264+-0.22823 ! 135.82210+-0.11418 ! definitely 1.3455x slower
<geometric> * 26.95841+-0.19774 ! 33.16708+-0.25886 ! definitely 1.2303x slower
<harmonic> 4.44173+-0.11147 ! 4.77373+-0.13823 ! definitely 1.0747x slower

TipOfTree X87
Kraken:
ai-astar 2915.956+-11.411 ! 4438.416+-11.309 ! definitely 1.5221x slower
audio-beat-detection 1355.453+-0.817 ! 1827.941+-11.880 ! definitely 1.3486x slower
audio-dft 1080.029+-1.664 ! 1398.931+-5.902 ! definitely 1.2953x slower
audio-fft 1254.061+-0.409 ! 1692.906+-1.763 ! definitely 1.3499x slower
audio-oscillator 1152.206+-16.154 ! 1604.303+-2.480 ! definitely 1.3924x slower
imaging-darkroom 2009.602+-9.660 ! 2894.022+-6.894 ! definitely 1.4401x slower
imaging-desaturate 3230.711+-21.289 ! 4436.863+-9.205 ! definitely 1.3733x slower
imaging-gaussian-blur 10705.198+-128.187 10691.383+-77.066
json-parse-financial 78.417+-0.332 ^ 75.490+-0.458 ^ definitely 1.0388x faster
json-stringify-tinderbox 106.198+-0.202 ! 107.703+-0.363 ! definitely 1.0142x slower
stanford-crypto-aes 684.877+-1.218 ! 901.412+-3.234 ! definitely 1.3162x slower
stanford-crypto-ccm 449.184+-0.355 ! 466.431+-1.235 ! definitely 1.0384x slower
stanford-crypto-pbkdf2 1855.865+-3.148 ! 2550.453+-1.299 ! definitely 1.3743x slower
stanford-crypto-sha256-iterative 647.911+-1.594 ! 868.874+-1.338 ! definitely 1.3410x slower
<arithmetic> * 1966.119+-8.148 ! 2425.366+-6.062 ! definitely 1.2336x slower
<geometric> 1020.027+-1.289 ! 1281.029+-1.316 ! definitely 1.2559x slower
<harmonic> 430.629+-0.870 ! 456.109+-1.530 ! definitely 1.0592x slower

TipOfTree X87
JSRegress:
adapt-to-double-divide 47.9787+-0.2660 ! 94.7094+-0.2438 ! definitely 1.9740x slower
aliased-arguments-getbyval 8.4379+-0.1323 ! 12.6069+-0.0897 ! definitely 1.4941x slower
allocate-big-object 27.6678+-0.3890 ! 41.2623+-0.3427 ! definitely 1.4913x slower
arity-mismatch-inlining 16.3782+-0.2144 ! 28.0538+-0.1212 ! definitely 1.7129x slower
array-access-polymorphic-structure 54.0342+-0.6186 ! 88.3635+-0.5378 ! definitely 1.6353x slower
array-with-double-add 33.7265+-0.2040 ! 46.6482+-0.2422 ! definitely 1.3831x slower
array-with-double-increment 35.6488+-0.1635 ! 69.5987+-0.2775 ! definitely 1.9523x slower
array-with-double-mul-add 61.1028+-0.3000 ! 83.7804+-0.1966 ! definitely 1.3711x slower
array-with-double-sum 22.5687+-0.0991 ! 31.2035+-0.1861 ! definitely 1.3826x slower
array-with-int32-add-sub 63.5340+-1.0317 63.3974+-0.8988
array-with-int32-or-double-sum 22.7125+-0.1767 ! 31.3637+-0.2571 ! definitely 1.3809x slower
big-int-mul 132.1791+-1.6020 ! 210.6236+-0.6975 ! definitely 1.5935x slower
boolean-test 44.8232+-0.5983 ! 80.7421+-1.0616 ! definitely 1.8013x slower
cast-int-to-double 268.5540+-1.4758 ! 526.1327+-2.1717 ! definitely 1.9591x slower
cell-argument 77.7083+-0.2848 ? 77.9413+-0.2473 ?
cfg-simplify 132.1752+-0.8643 ! 225.0296+-1.2993 ! definitely 1.7025x slower
cmpeq-obj-to-obj-other 119.7733+-1.0378 ! 193.0783+-0.8836 ! definitely 1.6120x slower
constant-test 283.3301+-0.6698 ! 545.8044+-2.4496 ! definitely 1.9264x slower
direct-arguments-getbyval 3.6786+-0.0274 ! 4.8977+-0.0603 ! definitely 1.3314x slower
double-pollution-getbyval 35.7608+-0.2526 ! 60.3137+-0.2196 ! definitely 1.6866x slower
double-pollution-putbyoffset 33.8859+-0.6200 ! 56.7275+-0.2107 ! definitely 1.6741x slower
empty-string-plus-int 25.9678+-0.3229 ! 40.2512+-0.2047 ! definitely 1.5500x slower
external-arguments-getbyval 9.2215+-0.0939 ! 13.5751+-0.1036 ! definitely 1.4721x slower
external-arguments-putbyval 15.4295+-0.1779 ! 22.3645+-0.2252 ! definitely 1.4495x slower
Float32Array-matrix-mult 67.8031+-0.6321 ! 102.6558+-0.7438 ! definitely 1.5140x slower
fold-double-to-int 632.2122+-5.5484 ! 1078.4247+-4.0731 ! definitely 1.7058x slower
function-dot-apply 82.6871+-0.6923 ! 128.9930+-0.8548 ! definitely 1.5600x slower
function-test 47.3091+-0.5687 ! 85.8363+-0.5746 ! definitely 1.8144x slower
get-by-id-chain-from-try-block 134.9253+-1.4403 ! 162.5439+-1.0927 ! definitely 1.2047x slower
HashMap-put-get-iterate-keys 464.1388+-4.4157 ! 682.3402+-3.6444 ! definitely 1.4701x slower
HashMap-put-get-iterate 427.8627+-1.8751 ! 628.6139+-2.5704 ! definitely 1.4692x slower
HashMap-string-put-get-iterate 305.2206+-1.8602 ! 416.6194+-1.4381 ! definitely 1.3650x slower
indexed-properties-in-objects 23.2174+-0.1440 ! 24.2929+-0.1557 ! definitely 1.0463x slower
inline-arguments-access 55.0692+-0.6571 ! 84.1111+-0.4045 ! definitely 1.5274x slower
inline-arguments-local-escape 104.0482+-0.4016 ! 158.8205+-0.7914 ! definitely 1.5264x slower
inline-get-scoped-var 96.8276+-13.0027 ? 99.7438+-14.3146 ? might be 1.0301x slower
inlined-put-by-id-transition 228.2877+-1.2493 ! 297.1468+-1.8769 ! definitely 1.3016x slower
int-or-other-abs-then-get-by-val 140.1633+-0.8379 ! 226.2548+-1.5490 ! definitely 1.6142x slower
int-or-other-abs-zero-then-get-by-val 282.8424+-1.8545 ! 468.3107+-2.5255 ! definitely 1.6557x slower
int-or-other-add-then-get-by-val 244.1959+-1.1042 ! 379.5522+-1.2374 ! definitely 1.5543x slower
int-or-other-add 226.3017+-1.5033 ! 370.0413+-1.4292 ! definitely 1.6352x slower
int-or-other-div-then-get-by-val 105.5875+-0.7421 ! 167.7434+-0.4612 ! definitely 1.5887x slower
int-or-other-max-then-get-by-val 139.9950+-0.4990 ! 219.6380+-0.7832 ! definitely 1.5689x slower
int-or-other-min-then-get-by-val 140.4771+-0.6001 ! 221.5659+-0.6894 ! definitely 1.5772x slower
int-or-other-mod-then-get-by-val 103.5357+-0.5079 ! 174.2193+-1.1815 ! definitely 1.6827x slower
int-or-other-mul-then-get-by-val 125.3206+-0.9712 ! 203.7862+-0.9880 ! definitely 1.6261x slower
int-or-other-neg-then-get-by-val 127.5894+-0.7397 ! 214.6598+-0.6223 ! definitely 1.6824x slower
int-or-other-neg-zero-then-get-by-val 284.2008+-1.2491 ! 475.4374+-1.7580 ! definitely 1.6729x slower
int-or-other-sub-then-get-by-val 243.5460+-0.8323 ! 377.8331+-1.1872 ! definitely 1.5514x slower
int-or-other-sub 228.1011+-1.5760 ! 364.1205+-1.6924 ! definitely 1.5963x slower
int-overflow-local 174.4003+-1.8327 ! 327.9483+-1.4215 ! definitely 1.8804x slower
Int16Array-bubble-sort 1092.0413+-16.5536 ! 1650.0448+-15.0169 ! definitely 1.5110x slower
Int16Array-load-int-mul 27.4004+-0.1134 ! 57.4130+-0.4792 ! definitely 2.0953x slower
Int8Array-load 36.5696+-0.4347 ! 63.2672+-0.3075 ! definitely 1.7300x slower
integer-divide 393.3169+-1.6278 ! 671.1779+-1.2598 ! definitely 1.7065x slower
integer-modulo 9.0946+-0.0884 ! 16.9393+-0.1593 ! definitely 1.8626x slower
make-indexed-storage 9.5870+-0.0822 ! 10.3789+-0.1035 ! definitely 1.0826x slower
method-on-number 40.0598+-0.5277 ! 54.9131+-1.0029 ! definitely 1.3708x slower
nested-function-parsing-random 393.6571+-8.4825 ! 417.0077+-7.8966 ! definitely 1.0593x slower
nested-function-parsing 47.6881+-1.3580 ! 50.5424+-1.2863 ! definitely 1.0599x slower
new-array-buffer-dead 1001.0281+-5.2642 ! 1377.5106+-6.0891 ! definitely 1.3761x slower
new-array-buffer-push 56.3387+-0.4913 ! 77.2520+-0.4672 ! definitely 1.3712x slower
new-array-dead 1001.6732+-14.3254 ! 1639.4984+-6.7574 ! definitely 1.6368x slower
new-array-push 38.8717+-0.7892 ! 60.2341+-0.9933 ! definitely 1.5496x slower
number-test 45.1636+-0.6120 ! 79.2752+-0.2523 ! definitely 1.7553x slower
object-closure-call 185.5009+-0.6381 ! 297.4842+-1.3989 ! definitely 1.6037x slower
object-test 46.7065+-0.5818 ! 84.2340+-0.2634 ! definitely 1.8035x slower
poly-stricteq 978.6510+-13.0603 ! 1606.3261+-13.4500 ! definitely 1.6414x slower
polymorphic-structure 1052.3085+-29.0701 ! 1752.2201+-36.4423 ! definitely 1.6651x slower
polyvariant-monomorphic-get-by-id 521.1988+-9.1206 ! 745.0362+-6.9739 ! definitely 1.4295x slower
rare-osr-exit-on-local 114.8795+-0.1844 ! 210.5250+-0.3408 ! definitely 1.8326x slower
register-pressure-from-osr 377.8423+-2.3739 ! 510.9493+-1.3184 ! definitely 1.3523x slower
simple-activation-demo 250.7149+-0.7627 ! 267.2369+-0.9739 ! definitely 1.0659x slower
slow-array-profile-convergence 18.7003+-0.1632 ! 21.9731+-0.1644 ! definitely 1.1750x slower
slow-convergence 15.0791+-0.1335 ! 22.0212+-0.4736 ! definitely 1.4604x slower
sparse-conditional 44.7012+-1.3341 ! 70.5484+-0.2370 ! definitely 1.5782x slower
splice-to-remove 108.0986+-0.7546 ! 122.5045+-0.4792 ! definitely 1.1333x slower
string-concat-object 42.6399+-0.7259 ! 63.1830+-0.8520 ! definitely 1.4818x slower
string-concat-pair-object 43.1884+-0.9060 ! 59.3781+-0.6565 ! definitely 1.3749x slower
string-concat-pair-simple 180.0175+-6.6789 ! 328.8233+-7.8179 ! definitely 1.8266x slower
string-concat-simple 198.0448+-7.2361 ! 344.0921+-7.1324 ! definitely 1.7374x slower
string-cons-repeat 210.2401+-4.6946 ! 357.2951+-4.2432 ! definitely 1.6995x slower
string-cons-tower 209.9568+-3.5678 ! 242.5502+-0.8849 ! definitely 1.1552x slower
string-equality 447.8517+-1.3682 ! 739.4193+-4.4193 ! definitely 1.6510x slower
string-hash 67.5674+-0.9066 ! 107.6550+-0.6354 ! definitely 1.5933x slower
string-repeat-arith 114.0440+-0.5667 ! 152.3804+-0.3029 ! definitely 1.3362x slower
string-sub 348.3055+-1.9988 ! 524.0287+-1.0700 ! definitely 1.5045x slower
string-test 45.1624+-0.6353 ! 79.7162+-0.3508 ! definitely 1.7651x slower
structure-hoist-over-transitions 33.4064+-0.7943 ! 46.0349+-0.3370 ! definitely 1.3780x slower
tear-off-arguments-simple 51.7674+-0.4439 ! 77.7820+-0.6882 ! definitely 1.5025x slower
tear-off-arguments 84.1647+-0.3593 ! 127.3961+-0.4406 ! definitely 1.5137x slower
temporal-structure 1063.6193+-33.7472 ! 1740.9303+-35.1564 ! definitely 1.6368x slower
to-int32-boolean 476.3257+-1.7403 ! 869.0256+-3.0583 ! definitely 1.8244x slower
undefined-test 44.5179+-0.5381 ! 80.0131+-0.5642 ! definitely 1.7973x slower
<arithmetic> 195.5302+-0.9561 ! 304.9568+-1.3299 ! definitely 1.5596x slower
<geometric> * 95.7849+-0.4214 ! 145.5062+-0.5262 ! definitely 1.5191x slower
<harmonic> 45.2604+-0.1408 ! 65.6766+-0.2329 ! definitely 1.4511x slower

TipOfTree X87
All benchmarks:
<arithmetic> 341.8235+-0.9171 ! 458.9000+-1.2699 ! definitely 1.3425x slower
<geometric> ERROR ERROR
<harmonic> 19.5051+-0.3127 ! 22.9632+-0.4543 ! definitely 1.1773x slower

TipOfTree X87
Geomean of preferred means:
<scaled-result> 147.8406+-0.4181 ! 195.1065+-0.4459 ! definitely 1.3197x slower
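The `<geometric> ERROR` entry under "All benchmarks" is attributed later in this thread to a suspected floating point bug in the benchmark harness. How the harness actually computes its means is not shown here, so the following is only a sketch of one common way a geometric mean can fail numerically: multiplying many values overflows a double, while summing logarithms does not. Both function names are hypothetical.

```python
import math

def geometric_mean_naive(values):
    # Hypothetical naive version: the running product can overflow to inf
    # when there are many values well above 1.
    product = 1.0
    for v in values:
        product *= v
    return product ** (1.0 / len(values))

def geometric_mean_log(values):
    # Numerically safer version: average the logarithms, then exponentiate.
    # math.fsum keeps the summation error small.
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))

# 200 timings of 1000 ms each: the product would be 1e600, beyond double range.
timings = [1000.0] * 200
print(geometric_mean_naive(timings))  # overflows
print(geometric_mean_log(timings))    # close to 1000
```

The log-based form stays in range because it never materializes the full product.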
Filip Pizlo
Comment 39 2013-04-11 13:32:42 PDT
(In reply to comment #37)
> (In reply to comment #36)
> > (In reply to comment #33)
> > > (From update of attachment 197236 [details])
> > > View in context: https://bugs.webkit.org/attachment.cgi?id=197236&action=review
> > >
> > > > Source/JavaScriptCore/llint/LowLevelInterpreter32_64.asm:708
> > > > + frelease ft0, ft1
> > >
> > > I'm not sure how much I like this new op. This feels like it could get quite fragile - we probably will be adding more stuff to LLint that uses doubles, and it would be weird to have to remember to call this.
> > >
> > > Are you sure it's a speedup over the finit approach?
> >
> > From some of the timing info I got for finit it seemed very slow. Though I can't imagine why if the fpu is unused and finit has the effect of a nop.
>
> Interesting. I'm doing full LLInt-only perf tests right now, and you're right, finit appears sooooper slow. It's quite shocking actually!
>
> My tests are still running; I'll post results here shortly!
>
> > One thing missing from the finit patch is that it only cleans up when calling deeper functions, not when returning from a function. So if an FP-using function called into llint, that would return with an unclean fpu state that could lead to wrong behavior in the llint caller. Could we add a finit instruction to exit thunks of llint?
>
> I would add it to the ctiTrampoline, which the LLint uses.

Actually, you could put it in the LLInt's prologue! That might be easier.
Filip Pizlo
Comment 40 2013-04-11 13:48:09 PDT
I hacked Allan's code and replaced finit with just ffree of st(0) and st(1), and limited the x86 backend's set of FP registers to just ft0 and ft1. This works fine for LLInt right now and probably will continue to work for the foreseeable future. I think this will be more robust than having to call frelease inline in all of the places where you use floating point.

Here's a quick benchmark run of LLInt-only with x87 instead of SSE, and my ffree hack:

Benchmark report for SunSpider on bigmac (MacPro5,1).

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quartary/OpenSource/WebKitBuild/Release/jsc (r148221)
"X87" at /Volumes/Data/pizlo/secondary/OpenSource/WebKitBuild/Release/jsc (r148221)

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

TipOfTree X87
3d-cube 17.9783+-0.1753 ! 26.3691+-0.1585 ! definitely 1.4667x slower
3d-morph 24.3248+-0.9822 ? 25.2635+-0.1758 ? might be 1.0386x slower
3d-raytrace 29.8836+-0.2624 ! 34.3155+-0.3130 ! definitely 1.1483x slower
access-binary-trees 13.0033+-0.1874 12.9120+-0.1259
access-fannkuch 41.3548+-0.2009 ! 46.6845+-0.2882 ! definitely 1.1289x slower
access-nbody 18.7540+-0.2405 ! 31.6527+-0.1826 ! definitely 1.6878x slower
access-nsieve 9.3333+-0.1018 ! 9.9349+-0.1245 ! definitely 1.0645x slower
bitops-3bit-bits-in-byte 16.4463+-0.2274 ! 17.8396+-0.1332 ! definitely 1.0847x slower
bitops-bits-in-byte 27.1933+-0.2444 ! 28.2384+-0.3981 ! definitely 1.0384x slower
bitops-bitwise-and 50.7059+-0.2201 ! 66.2652+-0.2938 ! definitely 1.3069x slower
bitops-nsieve-bits 40.6929+-1.5749 ^ 36.8273+-0.1227 ^ definitely 1.1050x faster
controlflow-recursive 18.7057+-0.1341 18.6117+-0.1532
crypto-aes 18.9586+-0.1842 ^ 18.4243+-0.1710 ^ definitely 1.0290x faster
crypto-md5 18.4564+-0.2320 ^ 17.6882+-0.2845 ^ definitely 1.0434x faster
crypto-sha1 17.2156+-0.1717 16.7456+-0.3562 might be 1.0281x faster
date-format-tofte 24.7140+-0.1669 ^ 24.0985+-0.2515 ^ definitely 1.0255x faster
date-format-xparb 25.1820+-0.2267 24.8168+-0.2614 might be 1.0147x faster
math-cordic 37.8164+-0.3621 ! 44.1493+-0.3623 ! definitely 1.1675x slower
math-partial-sums 35.1842+-0.3965 ! 39.5324+-0.3883 ! definitely 1.1236x slower
math-spectral-norm 17.8386+-0.1806 ? 18.3042+-0.4146 ? might be 1.0261x slower
regexp-dna 8.9110+-0.2140 ? 8.9316+-0.1848 ?
string-base64 26.8302+-0.2538 ? 27.2993+-0.3646 ? might be 1.0175x slower
string-fasta 22.7460+-0.1565 ! 25.8323+-0.3643 ! definitely 1.1357x slower
string-tagcloud 21.9842+-0.1848 21.8362+-0.1764
string-unpack-code 27.8313+-0.2588 ? 28.1235+-0.2181 ? might be 1.0105x slower
string-validate-input 19.6671+-0.3223 ? 20.0778+-0.4098 ? might be 1.0209x slower
<arithmetic> * 24.2966+-0.0980 ! 26.5683+-0.0817 ! definitely 1.0935x slower
<geometric> 22.3661+-0.1046 ! 24.0371+-0.0937 ! definitely 1.0747x slower
<harmonic> 20.4804+-0.1135 ! 21.6847+-0.1024 ! definitely 1.0588x slower

I think this is acceptable, particularly since it doesn't show up at all with all JITs enabled:

Benchmark report for SunSpider on bigmac (MacPro5,1).

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quartary/OpenSource/WebKitBuild/Release/jsc (r148221)
"X87" at /Volumes/Data/pizlo/secondary/OpenSource/WebKitBuild/Release/jsc (r148221)

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

TipOfTree X87
3d-cube 8.4606+-0.1155 ? 8.5137+-0.1353 ?
3d-morph 11.6839+-0.1243 11.6104+-0.1301
3d-raytrace 11.0022+-0.2307 10.9968+-0.2107
access-binary-trees 1.9070+-0.0387 1.8977+-0.0334
access-fannkuch 9.3917+-0.0751 9.2294+-0.1096 might be 1.0176x faster
access-nbody 6.2916+-0.0990 6.2757+-0.0806
access-nsieve 4.5179+-0.0715 4.4826+-0.0484
bitops-3bit-bits-in-byte 1.5943+-0.0121 ? 1.5974+-0.0122 ?
bitops-bits-in-byte 5.6738+-0.0634 ? 5.7138+-0.0460 ?
bitops-bitwise-and 1.8976+-0.0699 1.8559+-0.0504 might be 1.0225x faster
bitops-nsieve-bits 4.8670+-0.0680 ? 4.9450+-0.0705 ? might be 1.0160x slower
controlflow-recursive 2.9818+-0.0073 2.9806+-0.0127
crypto-aes 7.9970+-0.0659 7.9870+-0.0808
crypto-md5 4.1906+-0.0994 ? 4.2745+-0.0909 ? might be 1.0200x slower
crypto-sha1 3.2123+-0.0590 3.2038+-0.0671
date-format-tofte 14.1565+-0.2440 14.1345+-0.2333
date-format-xparb 9.1981+-0.1982 ? 9.2653+-0.1622 ?
math-cordic 3.6778+-0.0212 ? 3.6819+-0.0187 ?
math-partial-sums 11.8618+-0.1057 ? 11.9583+-0.1261 ?
math-spectral-norm 2.8211+-0.0566 2.8153+-0.0467
regexp-dna 8.6979+-0.1307 ? 8.9600+-0.1483 ? might be 1.0301x slower
string-base64 4.3886+-0.0219 ? 4.4054+-0.0541 ?
string-fasta 10.9107+-0.1414 ? 10.9863+-0.1428 ?
string-tagcloud 13.7753+-0.2281 13.6759+-0.2051
string-unpack-code 25.6134+-0.4697 25.5479+-0.4052
string-validate-input 7.6822+-0.3646 7.6761+-0.3357
<arithmetic> * 7.6328+-0.0693 ? 7.6412+-0.0711 ? might be 1.0011x slower
<geometric> 6.1039+-0.0556 ? 6.1107+-0.0539 ? might be 1.0011x slower
<harmonic> 4.7953+-0.0441 4.7926+-0.0410 might be 1.0006x faster

I will do more extensive benchmark runs now.
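For readers decoding the report rows, here is a small sketch of how a verdict such as "definitely 1.4667x slower" can be reproduced from a row's two means and confidence half-widths. The rule used (non-overlapping intervals give "definitely", overlapping ones give "might be") is inferred from the numbers in these reports, not taken from the harness source, and the function name `compare` is hypothetical. The real harness also omits the verdict for some small overlapping differences, which this sketch does not model.

```python
def compare(base_mean, base_ci, new_mean, new_ci):
    """Return a verdict string for one report row (inferred rule, see lead-in)."""
    slower = new_mean > base_mean
    # The ratio is always reported as >= 1, with the direction spelled out.
    ratio = new_mean / base_mean if slower else base_mean / new_mean
    # Two intervals [mean - ci, mean + ci] overlap if neither lies wholly
    # above the other.
    overlap = (base_mean - base_ci <= new_mean + new_ci and
               new_mean - new_ci <= base_mean + base_ci)
    certainty = "might be" if overlap else "definitely"
    return "%s %.4fx %s" % (certainty, ratio, "slower" if slower else "faster")

# Rows from the LLInt-only SunSpider run: 3d-cube, 3d-morph, bitops-nsieve-bits.
print(compare(17.9783, 0.1753, 26.3691, 0.1585))
print(compare(24.3248, 0.9822, 25.2635, 0.1758))
print(compare(40.6929, 1.5749, 36.8273, 0.1227))
```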
Filip Pizlo
Comment 41 2013-04-11 13:50:12 PDT
Created attachment 197667 [details] my version
Allan Sandfeld Jensen
Comment 42 2013-04-11 14:18:37 PDT
Comment on attachment 197667 [details] my version

View in context: https://bugs.webkit.org/attachment.cgi?id=197667&action=review

> Source/JavaScriptCore/offlineasm/x86.rb:965
> + $asm.puts "fld #{operands[0].x87Operand(0)}"
> + $asm.puts "frndint"
> + $asm.puts "fucomip #{operands[0].x87Operand(1)}"

I should probably have mentioned it. I changed these lines in the latest patch because frndint didn't behave as expected. To properly cast double to integer, you need to load the integer back from the stack.
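The "store as integer, load it back, compare" idea can be sketched outside of assembly. The following is only an illustration of the general round-trip check, not the patch's x87 sequence: the function name is hypothetical, and the int32 range and the rejection of -0.0 are assumptions about what a JavaScript engine typically needs from such a conversion.

```python
import math

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def double_to_int32_exact(d):
    """Return d as an int if it round-trips exactly through int32, else None."""
    if not math.isfinite(d):
        return None            # NaN and infinities have no integer form
    i = int(d)                 # truncating conversion (the "store as integer" step)
    if i < INT32_MIN or i > INT32_MAX:
        return None            # does not fit in 32 bits
    if float(i) != d:
        return None            # "load it back and compare": fraction was lost
    if i == 0 and math.copysign(1.0, d) < 0:
        return None            # assumption: -0.0 must stay a double
    return i

print(double_to_int32_exact(3.0))   # exact integer
print(double_to_int32_exact(2.5))   # fractional value is rejected
```

The comparison after the round trip is what detects a lossy conversion; rounding the value in place, by contrast, never produces an integer-typed result to compare against.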
Filip Pizlo
Comment 43 2013-04-11 14:27:22 PDT
(In reply to comment #42)
> (From update of attachment 197667 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=197667&action=review
>
> > Source/JavaScriptCore/offlineasm/x86.rb:965
> > + $asm.puts "fld #{operands[0].x87Operand(0)}"
> > + $asm.puts "frndint"
> > + $asm.puts "fucomip #{operands[0].x87Operand(1)}"
>
> I should probably have mentioned it. I changed these lines in the latest patch because frndint didn't behave as expected. To properly cast double to integer, you need to load the integer back from the stack.

Ah, OK! I will make this change.
Filip Pizlo
Comment 44 2013-04-11 14:27:42 PDT
Comment on attachment 197667 [details] my version

Clearing r? because I need to integrate Allan's latest change.
Allan Sandfeld Jensen
Comment 45 2013-04-11 14:40:06 PDT
(In reply to comment #43)
> (In reply to comment #42)
> > (From update of attachment 197667 [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=197667&action=review
> >
> > > Source/JavaScriptCore/offlineasm/x86.rb:965
> > > + $asm.puts "fld #{operands[0].x87Operand(0)}"
> > > + $asm.puts "frndint"
> > > + $asm.puts "fucomip #{operands[0].x87Operand(1)}"
> >
> > I should probably have mentioned it. I changed these lines in the latest patch because frndint didn't behave as expected. To properly cast double to integer, you need to load the integer back from the stack.
>
> Ah, OK! I will make this change.

I also removed the changes to the stack pointer to save instructions. It does make valgrind complain, and I guess it is bad practice to access below the stack pointer, but it should be safe, right?
Filip Pizlo
Comment 46 2013-04-11 14:42:41 PDT
(In reply to comment #45)
> (In reply to comment #43)
> > (In reply to comment #42)
> > > (From update of attachment 197667 [details])
> > > View in context: https://bugs.webkit.org/attachment.cgi?id=197667&action=review
> > >
> > > > Source/JavaScriptCore/offlineasm/x86.rb:965
> > > > + $asm.puts "fld #{operands[0].x87Operand(0)}"
> > > > + $asm.puts "frndint"
> > > > + $asm.puts "fucomip #{operands[0].x87Operand(1)}"
> > >
> > > I should probably have mentioned it. I changed these lines in the latest patch because frndint didn't behave as expected. To properly cast double to integer, you need to load the integer back from the stack.
> >
> > Ah, OK! I will make this change.
>
> I also removed the changes to the stack pointer to save instructions. It does make valgrind complain, and I guess it is bad practice to access below the stack pointer, but it should be safe, right?

It should be safe, but it depends on how big your red zone is. The only problem with accessing below the stack pointer is that if a signal fires, it will push onto the stack and possibly clobber things. But signal handling logic already respects a "red zone" of stack locations beneath the stack pointer that user code is allowed to play with. The size of it varies by platform and calling convention. I kind of prefer not to use the red zone just because the risk is often not worth it, but I think you're totally safe here - the red zone ought to always be at least 8 bytes (it's usually much more than that!) and we only have to worry about what happens on X86 platforms. I'm fine with it.
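The clobbering risk described above can be pictured with a toy model. This is purely illustrative: the stack is a byte array, the signal frame is a block of bytes written just below whatever red zone is assumed, and every size used is a hypothetical parameter rather than a statement about any real platform or ABI.

```python
def scratch_survives_signal(red_zone_size, scratch_size=8, frame_size=32):
    """Toy model: does data stored below sp survive a simulated signal frame?"""
    stack = bytearray(1024)
    sp = 512  # the stack grows toward lower indices

    # User code stores a scratch value just below sp without moving sp.
    scratch_lo = sp - scratch_size
    stack[scratch_lo:sp] = b"\xAA" * scratch_size

    # A signal arrives: in this model the frame is pushed below the red zone.
    frame_hi = sp - red_zone_size
    frame_lo = frame_hi - frame_size
    stack[frame_lo:frame_hi] = b"\xFF" * frame_size

    # The scratch value is intact only if the frame did not overlap it.
    return stack[scratch_lo:sp] == b"\xAA" * scratch_size

print(scratch_survives_signal(red_zone_size=0))    # no red zone: clobbered
print(scratch_survives_signal(red_zone_size=8))    # 8 bytes: just enough
print(scratch_survives_signal(red_zone_size=128))  # larger zone: safe
```

In this model an 8-byte scratch slot is safe exactly when the assumed red zone is at least 8 bytes, which is the condition the comment relies on.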
Filip Pizlo
Comment 47 2013-04-11 14:46:57 PDT
Here are the full no-JIT results with the patch. Note that the "all benchmarks <geometric>" thingy is reporting "ERROR" because I think there's a floating point bug in my harness - lol numerical stability is hard. ;-) I will run JIT-enabled benchmarks next.

Benchmark report for SunSpider, V8Spider, Octane, Kraken, and JSRegress on bigmac (MacPro5,1).

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quartary/OpenSource/WebKitBuild/Release/jsc (r148221)
"X87" at /Volumes/Data/pizlo/secondary/OpenSource/WebKitBuild/Release/jsc (r148221)

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

TipOfTree X87
SunSpider:
3d-cube 17.9544+-0.1384 ! 26.2953+-0.2174 ! definitely 1.4646x slower
3d-morph 23.2337+-0.6940 ! 25.3911+-0.2052 ! definitely 1.0929x slower
3d-raytrace 29.8920+-0.2649 ! 34.0675+-0.1459 ! definitely 1.1397x slower
access-binary-trees 12.9582+-0.0859 ? 13.0112+-0.1347 ?
access-fannkuch 41.4482+-0.1920 ! 46.7403+-0.5399 ! definitely 1.1277x slower
access-nbody 18.5611+-0.1642 ! 31.6303+-0.1455 ! definitely 1.7041x slower
access-nsieve 9.4760+-0.1078 ! 9.8115+-0.1013 ! definitely 1.0354x slower
bitops-3bit-bits-in-byte 16.3962+-0.1308 ! 17.8102+-0.1786 ! definitely 1.0862x slower
bitops-bits-in-byte 26.9323+-0.1926 ? 27.6365+-0.6107 ? might be 1.0261x slower
bitops-bitwise-and 52.1668+-0.3800 ! 60.3746+-2.1944 ! definitely 1.1573x slower
bitops-nsieve-bits 42.2204+-1.2434 ^ 37.5594+-0.3900 ^ definitely 1.1241x faster
controlflow-recursive 18.8285+-0.1660 ? 18.8898+-0.1996 ?
crypto-aes 18.9806+-0.2549 ^ 18.2652+-0.1559 ^ definitely 1.0392x faster
crypto-md5 18.5364+-0.1919 ^ 17.7638+-0.4439 ^ definitely 1.0435x faster
crypto-sha1 17.2322+-0.1944 16.8773+-0.3320 might be 1.0210x faster
date-format-tofte 24.9981+-0.2550 ^ 24.2617+-0.2743 ^ definitely 1.0304x faster
date-format-xparb 25.4501+-0.1946 ^ 25.0034+-0.1719 ^ definitely 1.0179x faster
math-cordic 37.6483+-0.3665 ! 44.2332+-0.3411 ! definitely 1.1749x slower
math-partial-sums 34.9296+-0.2630 ! 40.1234+-0.5674 ! definitely 1.1487x slower
math-spectral-norm 17.6544+-0.1390 ! 19.4503+-0.4917 ! definitely 1.1017x slower
regexp-dna 8.8638+-0.1221 8.8510+-0.1542
string-base64 27.1159+-0.4925 ? 27.2110+-0.2304 ?
string-fasta 22.6352+-0.1540 ! 25.4636+-0.2456 ! definitely 1.1250x slower
string-tagcloud 21.7558+-0.1759 21.6285+-0.1623
string-unpack-code 27.8736+-0.2633 ? 28.3440+-0.2486 ? might be 1.0169x slower
string-validate-input 19.5678+-0.3874 ? 19.8040+-0.3555 ? might be 1.0121x slower
<arithmetic> * 24.3581+-0.1014 ! 26.4038+-0.1161 ! definitely 1.0840x slower
<geometric> 22.3753+-0.0960 ! 24.0000+-0.1127 ! definitely 1.0726x slower
<harmonic> 20.4727+-0.0906 ! 21.6854+-0.1226 ! definitely 1.0592x slower

TipOfTree X87
V8Spider:
crypto 901.2628+-10.5638 ! 937.3151+-11.2742 ! definitely 1.0400x slower
deltablue 2434.4545+-21.2540 2425.3902+-26.9994
earley-boyer 537.9689+-2.6866 ^ 530.5799+-3.1642 ^ definitely 1.0139x faster
raytrace 284.2341+-1.9883 283.2019+-1.4540
regexp 121.2950+-0.2894 ! 122.0735+-0.4668 ! definitely 1.0064x slower
richards 2512.2443+-17.9037 ^ 2427.0873+-19.0544 ^ definitely 1.0351x faster
splay 207.0846+-0.6688 ! 211.0563+-1.5495 ! definitely 1.0192x slower
<arithmetic> 999.7920+-4.3096 ^ 990.9577+-4.3002 ^ definitely 1.0089x faster
<geometric> * 576.4975+-1.2882 ? 577.2247+-1.4423 ? might be 1.0013x slower
<harmonic> 343.6464+-0.6157 ! 345.8782+-0.7725 ! definitely 1.0065x slower

TipOfTree X87
Octane and V8v7:
encrypt 5.62788+-0.02001 ! 5.85796+-0.01966 ! definitely 1.0409x slower
decrypt 106.24871+-0.38143 ! 110.18448+-0.62942 ! definitely 1.0370x slower
deltablue x2 16.10377+-0.11289 ^ 15.78929+-0.04329 ^ definitely 1.0199x faster
earley 6.75416+-0.01812 ^ 6.63992+-0.03241 ^ definitely 1.0172x faster
boyer 130.16360+-0.45605 ^ 127.45871+-0.73964 ^ definitely 1.0212x faster
raytrace x2 47.67922+-0.09124 47.40098+-0.20901
regexp x2 42.79441+-0.18373 ? 43.11715+-0.18228 ?
richards x2 7.29126+-0.08431 ^ 7.01501+-0.06343 ^ definitely 1.0394x faster
splay x2 2.80017+-0.01659 ? 2.80727+-0.01369 ?
navier-stokes x2 93.59975+-0.11213 ! 134.18157+-0.08814 ! definitely 1.4336x slower
closure 0.32426+-0.00951 0.32384+-0.00924
jquery 3.55382+-0.49951 ? 3.57177+-0.50344 ?
gbemu x2 576.86359+-1.51066 ! 584.87711+-1.80031 ! definitely 1.0139x slower
box2d x2 198.84708+-0.49096 ! 219.52346+-0.72270 ! definitely 1.1040x slower
V8v7:
<arithmetic> 41.83322+-0.03653 ! 46.92272+-0.09009 ! definitely 1.1217x slower
<geometric> * 21.49163+-0.04201 ! 22.38354+-0.05644 ! definitely 1.0415x slower
<harmonic> 10.21874+-0.03439 10.21714+-0.03402 might be 1.0002x faster
Octane including V8v7:
<arithmetic> 101.11959+-0.12112 ! 107.43001+-0.12070 ! definitely 1.0624x slower
<geometric> * 26.99365+-0.20971 ! 28.09455+-0.20618 ! definitely 1.0408x slower
<harmonic> 4.43981+-0.11752 4.43780+-0.11243 might be 1.0005x faster

TipOfTree X87
Kraken:
ai-astar 2926.207+-13.127 ! 2963.124+-12.578 ! definitely 1.0126x slower
audio-beat-detection 1354.818+-0.347 ! 1751.951+-0.461 ! definitely 1.2931x slower
audio-dft 1094.455+-23.305 ! 1396.675+-5.148 ! definitely 1.2761x slower
audio-fft 1255.999+-2.040 ! 1655.995+-1.407 ! definitely 1.3185x slower
audio-oscillator 1164.374+-14.177 ! 1327.359+-4.681 ! definitely 1.1400x slower
imaging-darkroom 2029.454+-6.323 ! 2665.679+-3.510 ! definitely 1.3135x slower
imaging-desaturate 3255.786+-47.958 ! 3580.350+-3.315 ! definitely 1.0997x slower
imaging-gaussian-blur 10626.032+-144.452 ? 10745.350+-75.979 ? might be 1.0112x slower
json-parse-financial 78.443+-0.425 ^ 75.241+-0.452 ^ definitely 1.0426x faster
json-stringify-tinderbox 106.984+-0.351 ^ 105.948+-0.242 ^ definitely 1.0098x faster
stanford-crypto-aes 683.796+-1.363 ^ 679.112+-1.086 ^ definitely 1.0069x faster
stanford-crypto-ccm 449.706+-0.656 ^ 418.933+-0.729 ^ definitely 1.0735x faster
stanford-crypto-pbkdf2 1857.815+-5.765 ! 1942.681+-2.657 ! definitely 1.0457x slower
stanford-crypto-sha256-iterative 649.638+-2.537 ! 663.995+-0.468 ! definitely 1.0221x slower
<arithmetic> * 1966.679+-10.376 ! 2140.885+-5.390 ! definitely 1.0886x slower
<geometric> 1023.590+-1.845 ! 1118.242+-1.159 ! definitely 1.0925x slower
<harmonic> 432.051+-1.079 ? 432.909+-1.201 ? might be 1.0020x slower

TipOfTree X87
JSRegress:
adapt-to-double-divide 47.9152+-0.2307 ! 95.5433+-1.0917 ! definitely 1.9940x slower
aliased-arguments-getbyval 8.5251+-0.1143 8.2975+-0.1161 might be 1.0274x faster
allocate-big-object 27.9000+-0.2032 27.6419+-0.2438
arity-mismatch-inlining 16.4505+-0.2063 ? 16.6662+-0.2167 ? might be 1.0131x slower
array-access-polymorphic-structure 54.3291+-0.5280 ? 54.4588+-0.3840 ?
array-with-double-add 33.7276+-0.1719 ! 45.1450+-0.2714 ! definitely 1.3385x slower
array-with-double-increment 35.4592+-0.1703 ! 40.1362+-0.1328 ! definitely 1.1319x slower
array-with-double-mul-add 61.8230+-1.0954 ! 81.9887+-0.3190 ! definitely 1.3262x slower
array-with-double-sum 22.5105+-0.1474 ! 30.6861+-0.1444 ! definitely 1.3632x slower
array-with-int32-add-sub 63.6002+-0.9658 63.5210+-0.9274
array-with-int32-or-double-sum 22.7085+-0.1855 ! 30.8559+-0.2031 ! definitely 1.3588x slower
big-int-mul 130.4622+-0.7350 ? 132.3166+-1.7035 ? might be 1.0142x slower
boolean-test 44.7516+-0.6647 ? 45.0121+-0.8531 ?
cast-int-to-double 269.4681+-1.3547 ! 375.0991+-1.8633 ! definitely 1.3920x slower
cell-argument 77.7230+-0.2840 ? 78.6507+-0.7465 ? might be 1.0119x slower
cfg-simplify 133.1694+-1.8531 ? 134.0724+-2.7341 ?
cmpeq-obj-to-obj-other 120.8431+-1.1685 ^ 116.8870+-1.1027 ^ definitely 1.0338x faster constant-test 283.4232+-0.6060 ! 360.5058+-1.3039 ! definitely 1.2720x slower direct-arguments-getbyval 3.6782+-0.0424 3.5864+-0.0499 might be 1.0256x faster double-pollution-getbyval 35.5837+-0.1561 ! 59.4998+-0.2744 ! definitely 1.6721x slower double-pollution-putbyoffset 33.9927+-0.4477 ! 39.0707+-0.3272 ! definitely 1.1494x slower empty-string-plus-int 26.3457+-0.6217 26.2105+-0.1600 external-arguments-getbyval 9.0183+-0.1179 8.9002+-0.0899 might be 1.0133x faster external-arguments-putbyval 15.5025+-0.2450 ? 15.7071+-0.1708 ? might be 1.0132x slower Float32Array-matrix-mult 68.1563+-0.6462 ? 69.1521+-0.5622 ? might be 1.0146x slower fold-double-to-int 632.7413+-7.1740 ! 681.0175+-4.3578 ! definitely 1.0763x slower function-dot-apply 83.6074+-0.7094 82.5575+-0.5637 might be 1.0127x faster function-test 46.8864+-0.5865 ? 47.3746+-0.4820 ? might be 1.0104x slower get-by-id-chain-from-try-block 135.0482+-1.0815 ^ 131.9623+-1.3240 ^ definitely 1.0234x faster HashMap-put-get-iterate-keys 463.8817+-2.3252 456.6323+-5.2464 might be 1.0159x faster HashMap-put-get-iterate 432.5620+-3.3130 ^ 423.0020+-2.3462 ^ definitely 1.0226x faster HashMap-string-put-get-iterate 305.0262+-1.7646 ^ 301.4069+-1.4414 ^ definitely 1.0120x faster indexed-properties-in-objects 23.1630+-0.1067 ? 23.2080+-0.1854 ? inline-arguments-access 55.4069+-0.7658 ^ 50.7680+-0.6078 ^ definitely 1.0914x faster inline-arguments-local-escape 106.0205+-2.1873 ^ 102.7597+-0.7441 ^ definitely 1.0317x faster inline-get-scoped-var 100.7289+-15.2184 99.3675+-14.3224 might be 1.0137x faster inlined-put-by-id-transition 229.6349+-1.2355 229.4402+-1.5024 int-or-other-abs-then-get-by-val 139.3494+-0.6893 ? 140.5268+-0.6068 ? int-or-other-abs-zero-then-get-by-val 284.1493+-2.4926 282.4749+-1.4470 int-or-other-add-then-get-by-val 242.9923+-1.4659 ! 249.6290+-3.2146 ! 
definitely 1.0273x slower int-or-other-add 226.2495+-1.6327 224.9208+-1.8598 int-or-other-div-then-get-by-val 105.2304+-0.6893 105.1252+-0.4331 int-or-other-max-then-get-by-val 140.2353+-0.6879 ? 141.5437+-0.6356 ? int-or-other-min-then-get-by-val 140.2266+-0.8224 ! 144.6763+-2.4675 ! definitely 1.0317x slower int-or-other-mod-then-get-by-val 103.8230+-0.6266 ? 104.7759+-0.4527 ? int-or-other-mul-then-get-by-val 124.4648+-0.8262 ? 125.0835+-0.9148 ? int-or-other-neg-then-get-by-val 127.2243+-1.0448 ? 128.6683+-0.8059 ? might be 1.0113x slower int-or-other-neg-zero-then-get-by-val 283.6595+-1.1308 ? 284.3401+-1.9289 ? int-or-other-sub-then-get-by-val 241.7250+-1.4257 ! 248.6575+-2.5706 ! definitely 1.0287x slower int-or-other-sub 226.2594+-1.3067 226.2140+-1.7022 int-overflow-local 175.9691+-2.6634 175.3229+-0.8638 Int16Array-bubble-sort 1067.7213+-14.2553 ! 1091.3273+-2.5565 ! definitely 1.0221x slower Int16Array-load-int-mul 27.3937+-0.1835 ! 28.1407+-0.1787 ! definitely 1.0273x slower Int8Array-load 37.1140+-0.7281 36.1678+-0.5502 might be 1.0262x faster integer-divide 391.2293+-0.9023 ! 424.9037+-0.9518 ! definitely 1.0861x slower integer-modulo 9.0282+-0.0743 ! 9.4365+-0.1568 ! definitely 1.0452x slower make-indexed-storage 9.5931+-0.1229 ? 9.7242+-0.1220 ? might be 1.0137x slower method-on-number 39.8947+-0.8139 ? 40.8484+-0.2741 ? might be 1.0239x slower nested-function-parsing-random 393.8756+-8.0650 ? 397.6181+-8.0893 ? nested-function-parsing 47.6917+-1.3551 47.6117+-1.2893 new-array-buffer-dead 997.3167+-3.5530 ! 1013.4809+-4.7945 ! definitely 1.0162x slower new-array-buffer-push 55.7689+-0.2826 ! 57.0448+-0.2980 ! definitely 1.0229x slower new-array-dead 1001.8330+-14.7759 ? 1008.6254+-11.1651 ? 
new-array-push 39.5733+-1.1274 38.8385+-0.8491 might be 1.0189x faster number-test 44.7356+-0.6304 43.9065+-0.3294 might be 1.0189x faster object-closure-call 185.3457+-0.5947 ^ 181.3033+-0.7721 ^ definitely 1.0223x faster object-test 46.5967+-0.4557 46.2241+-0.1728 poly-stricteq 966.2038+-4.5872 ^ 958.9021+-2.1751 ^ definitely 1.0076x faster polymorphic-structure 1079.8636+-27.9441 1062.1190+-29.5844 might be 1.0167x faster polyvariant-monomorphic-get-by-id 508.4998+-2.9552 ? 511.9921+-3.0516 ? rare-osr-exit-on-local 114.8582+-0.1350 ^ 113.4716+-0.1760 ^ definitely 1.0122x faster register-pressure-from-osr 378.7682+-1.7215 ? 378.8308+-1.3232 ? simple-activation-demo 250.3638+-0.7172 ? 250.7049+-0.7741 ? slow-array-profile-convergence 18.5058+-0.1898 18.1641+-0.1891 might be 1.0188x faster slow-convergence 15.0500+-0.1545 ? 15.1581+-0.1617 ? sparse-conditional 45.0809+-1.3548 43.3927+-0.3652 might be 1.0389x faster splice-to-remove 108.3011+-0.3777 ! 109.2319+-0.3910 ! definitely 1.0086x slower string-concat-object 43.3260+-0.8530 41.6709+-1.1433 might be 1.0397x faster string-concat-pair-object 42.8049+-0.9768 41.7154+-1.0172 might be 1.0261x faster string-concat-pair-simple 181.4127+-7.2532 ? 181.5131+-8.0114 ? string-concat-simple 195.2060+-7.2766 ? 195.7453+-7.3555 ? string-cons-repeat 211.2515+-5.9333 ? 214.0855+-6.8960 ? might be 1.0134x slower string-cons-tower 210.6660+-3.0246 210.2556+-1.4762 string-equality 444.1645+-2.8023 439.4118+-2.4630 might be 1.0108x faster string-hash 66.8630+-0.7461 ! 68.3695+-0.6929 ! definitely 1.0225x slower string-repeat-arith 114.7803+-0.5278 ? 115.6372+-0.8842 ? string-sub 346.0125+-1.0248 ! 351.4472+-3.6150 ! 
definitely 1.0157x slower string-test 44.6894+-0.4789 44.1954+-0.2385 might be 1.0112x faster structure-hoist-over-transitions 33.2790+-0.5486 ^ 31.5402+-0.6329 ^ definitely 1.0551x faster tear-off-arguments-simple 51.9610+-0.7462 50.7186+-0.9921 might be 1.0245x faster tear-off-arguments 85.1029+-0.7896 ^ 82.2770+-1.0577 ^ definitely 1.0343x faster temporal-structure 1062.1951+-37.4462 ? 1077.2291+-29.3359 ? might be 1.0142x slower to-int32-boolean 476.8701+-2.1493 475.9137+-3.6322 undefined-test 44.7773+-0.7267 44.7186+-0.5410 <arithmetic> 195.3075+-1.2274 ! 199.7306+-1.1282 ! definitely 1.0226x slower <geometric> * 95.8224+-0.4718 ! 99.0967+-0.4824 ! definitely 1.0342x slower <harmonic> 45.2477+-0.1644 ! 46.4847+-0.1733 ! definitely 1.0273x slower TipOfTree X87 All benchmarks: <arithmetic> 342.0177+-0.6858 ! 360.3297+-0.9751 ! definitely 1.0535x slower <geometric> ERROR ERROR <harmonic> 19.5023+-0.3298 ? 19.7968+-0.3185 ? might be 1.0151x slower TipOfTree X87 Geomean of preferred means: <scaled-result> 148.1736+-0.4090 ! 155.4717+-0.4907 ! definitely 1.0493x slower
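A possible illustration of the "numerical stability is hard" remark about the `<geometric>` ERROR row: the harness code is not shown in this thread, so the following is only a sketch of one common way a geometric mean over many benchmark times can go wrong, not a diagnosis of the actual bug. The function names and the sample data are hypothetical.

```python
import math

def geomean_naive(values):
    """Geometric mean via a running product; the product can overflow."""
    product = 1.0
    for v in values:
        product *= v  # silently becomes inf once the double range is exceeded
    return product ** (1.0 / len(values))

def geomean_log(values):
    """Geometric mean via the mean of logarithms; stays within range."""
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))

# Hypothetical data: 200 benchmarks taking 1000 ms each (not from the report).
times = [1000.0] * 200
naive = geomean_naive(times)
stable = geomean_log(times)
print(naive, stable)
```

Summing logarithms keeps every intermediate value small, which is why it is usually preferred when the number of benchmarks is large.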
Allan Sandfeld Jensen
Comment 48 2013-04-12 03:19:46 PDT
(In reply to comment #46) > I kind of prefer to not use the red zone just because the risk is often not worth it, but I think you're totally safe here - the red zone ought to always be at least 8 bytes (it's usually much more than that!) and we only have to worry about what happens on X86 platforms. I'm fine with it. An alternative could be to have 8 bytes allocated somewhere and use that as the memory temporary, but it would probably need to be thread-local, which would make it complicated.
Allan Sandfeld Jensen
Comment 49 2013-04-12 05:26:05 PDT
Created attachment 197748 [details] Patch Newest patch with ffree only at exit and function calls
Allan Sandfeld Jensen
Comment 50 2013-04-12 05:26:58 PDT
(In reply to comment #48) > (In reply to comment #46) > > I kind of prefer to not use the red zone just because the risk is often not worth it, but I think you're totally safe here - the red zone ought to always be at least 8 bytes (it's usually much more than that!) and we only have to worry about what happens on X86 platforms. I'm fine with it. > > An alternative could be to have 8 bytes allocated somewhere and use that as the memory temporary, but it would probably need to be thread-local, which would make it complicated. I actually would prefer this solution, but I haven't figured out how best to do that in llint. Maybe you could change that in a later patch?
Filip Pizlo
Comment 51 2013-04-12 10:30:05 PDT
Performance when all of the JITs are enabled: Benchmark report for SunSpider, V8Spider, Octane, Kraken, and JSRegress on bigmac (MacPro5,1). VMs tested: "TipOfTree" at /Volumes/Data/pizlo/quartary/OpenSource/WebKitBuild/Release/jsc (r148221) "X87" at /Volumes/Data/pizlo/secondary/OpenSource/WebKitBuild/Release/jsc (r148221) Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree X87 SunSpider: 3d-cube 8.4713+-0.1406 ? 8.5584+-0.1313 ? might be 1.0103x slower 3d-morph 11.5874+-0.0879 ? 11.6877+-0.1872 ? 3d-raytrace 10.9103+-0.2066 ? 11.2307+-0.2053 ? might be 1.0294x slower access-binary-trees 1.9041+-0.0309 1.8956+-0.0330 access-fannkuch 9.2455+-0.0914 9.2301+-0.1090 access-nbody 6.2713+-0.0893 ? 6.3165+-0.0863 ? access-nsieve 4.4406+-0.0506 ? 4.4999+-0.0658 ? might be 1.0133x slower bitops-3bit-bits-in-byte 1.5981+-0.0122 1.5972+-0.0125 bitops-bits-in-byte 5.7304+-0.0618 5.6920+-0.0716 bitops-bitwise-and 1.8656+-0.0575 1.8476+-0.0357 bitops-nsieve-bits 4.9271+-0.0586 ? 4.9553+-0.0548 ? controlflow-recursive 2.9815+-0.0083 ? 2.9883+-0.0070 ? crypto-aes 7.9819+-0.0873 ? 8.0065+-0.0879 ? crypto-md5 4.1950+-0.0987 ? 4.2794+-0.0897 ? might be 1.0201x slower crypto-sha1 3.2086+-0.0632 ? 3.2259+-0.0615 ? date-format-tofte 14.1293+-0.2135 14.1269+-0.2324 date-format-xparb 9.3211+-0.1630 9.1489+-0.1612 might be 1.0188x faster math-cordic 3.6830+-0.0249 3.6790+-0.0237 math-partial-sums 11.8966+-0.1070 11.8268+-0.1286 math-spectral-norm 2.8458+-0.0626 2.8215+-0.0532 regexp-dna 8.8680+-0.1724 ? 9.0054+-0.1928 ? might be 1.0155x slower string-base64 4.3851+-0.0627 4.3769+-0.0672 string-fasta 11.0923+-0.1067 11.0117+-0.1260 string-tagcloud 13.6231+-0.1474 ? 13.6900+-0.1933 ? 
string-unpack-code 25.2536+-0.1884 25.2060+-0.2024 string-validate-input 7.7005+-0.2790 7.6735+-0.3608 <arithmetic> * 7.6199+-0.0559 ? 7.6376+-0.0644 ? might be 1.0023x slower <geometric> 6.1017+-0.0483 ? 6.1140+-0.0541 ? might be 1.0020x slower <harmonic> 4.7920+-0.0367 ? 4.7948+-0.0405 ? might be 1.0006x slower TipOfTree X87 V8Spider: crypto 91.5816+-0.4602 91.4182+-0.4677 deltablue 118.4197+-0.5212 ? 118.8000+-0.5584 ? earley-boyer 75.8704+-0.4073 75.8014+-0.4896 raytrace 56.3275+-0.2092 56.1597+-0.2468 regexp 90.5604+-0.4672 90.2170+-0.4606 richards 121.4483+-0.9724 121.0089+-0.6860 splay 53.0720+-0.3530 ? 53.8372+-0.6719 ? might be 1.0144x slower <arithmetic> 86.7543+-0.3187 86.7489+-0.3281 might be 1.0001x faster <geometric> * 82.9621+-0.3022 ? 83.0132+-0.3201 ? might be 1.0006x slower <harmonic> 79.1680+-0.2917 ? 79.2885+-0.3291 ? might be 1.0015x slower TipOfTree X87 Octane and V8v7: encrypt 0.48404+-0.00108 0.48315+-0.00049 decrypt 8.89867+-0.00585 ? 8.90637+-0.01130 ? deltablue x2 0.56227+-0.00159 ? 0.56709+-0.00567 ? earley 0.92620+-0.00472 0.91757+-0.00493 boyer 13.18712+-0.04556 ? 13.23834+-0.05002 ? raytrace x2 4.37337+-0.00453 4.36113+-0.01516 regexp x2 27.35892+-0.07119 27.35433+-0.09016 richards x2 0.31841+-0.00105 0.31696+-0.00086 splay x2 0.74958+-0.02239 0.74427+-0.01496 navier-stokes x2 9.40497+-0.01926 9.38817+-0.00985 closure 0.32011+-0.00899 ? 0.32026+-0.00915 ? jquery 3.89066+-0.50684 3.87249+-0.49892 gbemu x2 135.58479+-0.72822 134.98093+-0.68585 box2d x2 32.55236+-0.05705 ? 32.71983+-0.38062 ? 
V8v7: <arithmetic> 6.81444+-0.01240 6.81309+-0.01427 might be 1.0002x faster <geometric> * 2.39825+-0.00861 2.39503+-0.00717 might be 1.0013x faster <harmonic> 0.96455+-0.00413 0.96277+-0.00413 might be 1.0019x faster Octane including V8v7: <arithmetic> 20.43255+-0.06356 20.39107+-0.04657 might be 1.0020x faster <geometric> * 4.08882+-0.03090 4.08436+-0.03230 might be 1.0011x faster <harmonic> 1.09676+-0.00755 1.09514+-0.00822 might be 1.0015x faster TipOfTree X87 Kraken: ai-astar 469.424+-3.803 465.934+-4.717 audio-beat-detection 273.807+-0.886 ? 275.288+-0.958 ? audio-dft 375.629+-0.863 ^ 372.211+-1.117 ^ definitely 1.0092x faster audio-fft 137.547+-0.156 ? 137.770+-0.457 ? audio-oscillator 299.070+-0.835 ? 299.699+-0.754 ? imaging-darkroom 339.435+-0.786 ? 339.473+-1.241 ? imaging-desaturate 137.537+-0.814 136.756+-0.308 imaging-gaussian-blur 417.133+-0.117 ? 417.337+-0.319 ? json-parse-financial 78.675+-0.389 ^ 75.126+-0.273 ^ definitely 1.0472x faster json-stringify-tinderbox 106.582+-0.297 106.288+-0.535 stanford-crypto-aes 101.409+-0.327 ^ 100.226+-0.320 ^ definitely 1.0118x faster stanford-crypto-ccm 102.356+-1.681 100.923+-1.711 might be 1.0142x faster stanford-crypto-pbkdf2 264.290+-3.688 262.732+-2.170 stanford-crypto-sha256-iterative 110.168+-0.377 110.023+-0.254 <arithmetic> * 229.504+-0.574 228.556+-0.443 might be 1.0041x faster <geometric> 192.785+-0.518 ^ 191.488+-0.392 ^ definitely 1.0068x faster <harmonic> 162.290+-0.506 ^ 160.508+-0.452 ^ definitely 1.0111x faster TipOfTree X87 JSRegress: adapt-to-double-divide 18.6620+-0.1528 ? 18.9026+-0.2134 ? might be 1.0129x slower aliased-arguments-getbyval 0.8870+-0.0085 0.8825+-0.0082 allocate-big-object 2.0904+-0.0372 2.0904+-0.0343 arity-mismatch-inlining 0.6860+-0.0082 0.6826+-0.0083 array-access-polymorphic-structure 6.6670+-0.1333 ? 6.6884+-0.1328 ? array-with-double-add 5.5231+-0.0562 5.4432+-0.0625 might be 1.0147x faster array-with-double-increment 4.1827+-0.0372 ? 4.1924+-0.0190 ? 
array-with-double-mul-add 6.8065+-0.0964 6.7474+-0.0651 array-with-double-sum 7.1997+-0.0942 ? 7.3575+-0.0834 ? might be 1.0219x slower array-with-int32-add-sub 11.7241+-0.0699 ? 11.8291+-0.1340 ? array-with-int32-or-double-sum 7.3780+-0.0417 7.2711+-0.0818 might be 1.0147x faster big-int-mul 4.9054+-0.0568 4.8555+-0.0631 might be 1.0103x faster boolean-test 3.9151+-0.0052 ! 3.9461+-0.0170 ! definitely 1.0079x slower cast-int-to-double 20.0697+-0.1025 ? 20.2041+-0.1600 ? cell-argument 12.1218+-0.1381 ? 12.1944+-0.1302 ? cfg-simplify 2.8944+-0.0162 ? 2.8947+-0.0115 ? cmpeq-obj-to-obj-other 11.3485+-0.3009 ? 11.5544+-0.1935 ? might be 1.0181x slower constant-test 7.5989+-0.0821 ? 7.6972+-0.1676 ? might be 1.0129x slower direct-arguments-getbyval 0.8200+-0.0112 0.8105+-0.0076 might be 1.0117x faster double-pollution-getbyval 9.0295+-0.1284 ? 9.0451+-0.1121 ? double-pollution-putbyoffset 6.3907+-0.1157 6.3683+-0.0735 empty-string-plus-int 10.2419+-0.2578 10.1390+-0.1791 might be 1.0101x faster external-arguments-getbyval 2.1421+-0.0110 2.1319+-0.0120 external-arguments-putbyval 5.1082+-0.0681 5.0393+-0.0451 might be 1.0137x faster Float32Array-matrix-mult 12.3590+-0.1681 ? 12.4051+-0.2283 ? fold-double-to-int 23.1428+-0.2870 22.8783+-0.2126 might be 1.0116x faster function-dot-apply 2.8377+-0.0032 ? 2.8384+-0.0041 ? function-test 5.5094+-0.1015 ? 5.5402+-0.0929 ? get-by-id-chain-from-try-block 6.1146+-0.0897 6.0519+-0.0668 might be 1.0104x faster HashMap-put-get-iterate-keys 95.8641+-1.2456 95.8528+-1.0248 HashMap-put-get-iterate 97.4573+-0.9539 97.0483+-0.6366 HashMap-string-put-get-iterate 73.3810+-1.5786 71.4995+-0.7505 might be 1.0263x faster indexed-properties-in-objects 4.9999+-0.0537 4.9789+-0.0661 inline-arguments-access 1.0655+-0.0090 ? 1.0659+-0.0076 ? inline-arguments-local-escape 23.9774+-0.1634 23.8001+-0.2629 inline-get-scoped-var 7.7726+-0.0888 ? 7.7958+-0.0971 ? 
inlined-put-by-id-transition 13.7203+-0.1477 13.3768+-0.2447 might be 1.0257x faster int-or-other-abs-then-get-by-val 7.6878+-0.1108 7.6678+-0.0494 int-or-other-abs-zero-then-get-by-val 40.3163+-1.0903 39.5318+-0.6394 might be 1.0198x faster int-or-other-add-then-get-by-val 9.5297+-0.0970 ? 9.5403+-0.1115 ? int-or-other-add 9.6070+-0.0956 ? 9.6808+-0.1389 ? int-or-other-div-then-get-by-val 13.7068+-0.2952 13.6847+-0.2809 int-or-other-max-then-get-by-val 8.9716+-0.2515 8.8679+-0.2627 might be 1.0117x faster int-or-other-min-then-get-by-val 6.9859+-0.0964 6.9514+-0.0800 int-or-other-mod-then-get-by-val 6.5939+-0.0817 6.5403+-0.0918 int-or-other-mul-then-get-by-val 6.1754+-0.0673 ? 6.2010+-0.0725 ? int-or-other-neg-then-get-by-val 7.0743+-0.0616 ? 7.0794+-0.0886 ? int-or-other-neg-zero-then-get-by-val 40.1960+-0.8963 40.1210+-0.8703 int-or-other-sub-then-get-by-val 9.7823+-0.1034 ? 9.8527+-0.1058 ? int-or-other-sub 7.4765+-0.0990 7.4220+-0.0832 int-overflow-local 12.3823+-0.1342 12.2641+-0.1071 Int16Array-bubble-sort 47.9325+-0.1673 ? 48.0858+-0.4122 ? Int16Array-load-int-mul 1.6472+-0.0049 1.6449+-0.0057 Int8Array-load 4.0870+-0.0338 ? 4.1175+-0.0344 ? integer-divide 14.1285+-0.1962 13.9381+-0.0705 might be 1.0137x faster integer-modulo 2.2381+-0.0346 2.2331+-0.0256 make-indexed-storage 3.9175+-0.0597 3.8699+-0.0355 might be 1.0123x faster method-on-number 26.2468+-0.3698 25.7263+-0.6429 might be 1.0202x faster nested-function-parsing-random 360.3527+-7.5504 ? 364.3197+-7.7778 ? might be 1.0110x slower nested-function-parsing 43.8675+-1.3831 43.8332+-1.3596 new-array-buffer-dead 3.1047+-0.0234 3.0978+-0.0216 new-array-buffer-push 8.9484+-0.2113 8.9037+-0.0935 new-array-dead 23.6217+-0.1106 23.5961+-0.0979 new-array-push 8.1822+-0.8248 7.2123+-0.8327 might be 1.1345x faster number-test 3.9190+-0.0495 ? 3.9740+-0.0245 ? might be 1.0140x slower object-closure-call 7.4267+-0.0850 7.4054+-0.0933 object-test 5.2685+-0.0856 ? 5.3046+-0.0544 ? 
poly-stricteq 125.0701+-0.2014 ? 125.0846+-0.6873 ? polymorphic-structure 20.9994+-0.1814 ? 21.1096+-0.1608 ? polyvariant-monomorphic-get-by-id 10.8186+-0.6052 ? 11.3207+-1.0194 ? might be 1.0464x slower rare-osr-exit-on-local 17.4148+-0.1382 17.3323+-0.1217 register-pressure-from-osr 39.7328+-0.2049 ? 39.8345+-0.2256 ? simple-activation-demo 32.7903+-0.1577 ? 32.8377+-0.1830 ? slow-array-profile-convergence 4.9651+-0.0312 4.9423+-0.0428 slow-convergence 3.4965+-0.0096 ? 3.5097+-0.0091 ? sparse-conditional 1.0508+-0.0086 1.0503+-0.0082 splice-to-remove 81.1577+-0.6335 ? 81.3371+-0.7318 ? string-concat-object 2.6037+-0.0571 ? 2.6748+-0.0537 ? might be 1.0273x slower string-concat-pair-object 1.8310+-0.0784 1.7893+-0.0310 might be 1.0233x faster string-concat-pair-simple 9.8376+-0.1170 9.8214+-0.0899 string-concat-simple 24.0914+-0.3094 23.8376+-0.1993 might be 1.0106x faster string-cons-repeat 8.1876+-0.1630 8.0398+-0.1312 might be 1.0184x faster string-cons-tower 7.7395+-0.0931 ? 7.7585+-0.0989 ? string-equality 115.1079+-1.8495 ? 116.4246+-1.3613 ? might be 1.0114x slower string-hash 2.6314+-0.0075 ? 2.6347+-0.0099 ? string-repeat-arith 95.7668+-0.5033 95.1092+-0.4644 string-sub 170.1028+-0.5544 ? 170.1274+-0.8092 ? string-test 4.0592+-0.0092 ? 4.0662+-0.0061 ? structure-hoist-over-transitions 2.6915+-0.0275 ? 2.7018+-0.0319 ? tear-off-arguments-simple 1.7485+-0.0082 1.7456+-0.0074 tear-off-arguments 3.2541+-0.0079 3.2530+-0.0092 temporal-structure 20.8301+-0.1698 20.8226+-0.1213 to-int32-boolean 27.8484+-0.1490 ? 27.9343+-0.1371 ? undefined-test 4.0781+-0.0411 4.0312+-0.0597 might be 1.0116x faster <arithmetic> 22.6370+-0.0655 ? 22.6372+-0.0562 ? 
might be 1.0000x slower <geometric> * 9.2036+-0.0370 9.1799+-0.0368 might be 1.0026x faster <harmonic> 4.8231+-0.0272 4.8092+-0.0270 might be 1.0029x faster TipOfTree X87 All benchmarks: <arithmetic> 40.4653+-0.0889 40.3810+-0.0754 might be 1.0021x faster <geometric> 11.0256+-0.0478 11.0050+-0.0504 might be 1.0019x faster <harmonic> 3.6101+-0.0166 3.6034+-0.0198 might be 1.0019x faster TipOfTree X87 Geomean of preferred means: <scaled-result> 22.2551+-0.0976 22.2334+-0.1039 might be 1.0010x faster
Allan Sandfeld Jensen
Comment 52 2013-04-14 03:15:31 PDT
(In reply to comment #51) > Performance when all of the JITs are enabled: > That looks very reasonable. One or two percent up or down.
Allan Sandfeld Jensen
Comment 53 2013-04-16 04:55:08 PDT
Does your old r+ still stand, or do you want to test the new patch some more?
Filip Pizlo
Comment 54 2013-04-18 17:53:41 PDT
(In reply to comment #53) > Does your old r+ still stand, or do you want to test the new patch some more? Sorry for the delay! I was away since Friday. I just ran tests on your patch and see these new failures: sputnik/Conformance/08_Types/8.5_The_Number_Type/S8.5_A2.1.html [ Failure ] sputnik/Conformance/08_Types/8.5_The_Number_Type/S8.5_A2.2.html [ Failure ] Do you see these failures also, or are they Mac-only? I don't get these failures on trunk, but I do see them when I apply your patch.
Filip Pizlo
Comment 55 2013-04-18 17:55:23 PDT
(In reply to comment #54) > (In reply to comment #53) > > Does your old r+ still stand, or do you want to test the new patch some more? > > Sorry for the delay! I was away since Friday. I just ran tests on your patch and see these new failures: > > sputnik/Conformance/08_Types/8.5_The_Number_Type/S8.5_A2.1.html [ Failure ] > sputnik/Conformance/08_Types/8.5_The_Number_Type/S8.5_A2.2.html [ Failure ] > > Do you see these failures also, or are they Mac-only? I don't get these failures on trunk, but I do see them when I apply your patch. For example: [pizlo@bigmac OpenSource] DYLD_FRAMEWORK_PATH=WebKitBuild/Debug/ WebKitBuild/Debug/DumpRenderTree LayoutTests/sputnik/Conformance/08_Types/8.5_The_Number_Type/S8.5_A2.1.html Content-Type: text/plain DumpMalloc: 0 S8.5_A2.1 FAIL SputnikError: #1: var x = 9007199254740994.0; var y = 1.0 - 1/65536.0; var z = x + y; var d = z - x; d === 0. Actual: 2 TEST COMPLETE #EOF (i.e. the test was expecting 'd' to be zero but it ended up being 2.)
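The "Actual: 2" result in S8.5_A2.1 is consistent with double rounding: the sum is first rounded to the x87 extended format (64-bit significand) and then rounded again when stored as a double (53-bit significand). The sketch below simulates both paths with exact rational arithmetic; the helper `round_to_significand` is hypothetical and written only for this illustration, it is not part of WebKit or the test suite.

```python
from fractions import Fraction

def round_to_significand(value, bits):
    """Round a positive rational to `bits` significand bits, ties to even."""
    value = Fraction(value)
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    # Find e such that 2**e <= value < 2**(e + 1).
    e = value.numerator.bit_length() - value.denominator.bit_length()
    if Fraction(2) ** e > value:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)
    # round() on a Fraction rounds exact halves to the even integer.
    return round(value / ulp) * ulp

x = Fraction(9007199254740994)   # 2**53 + 2, as in the test
y = 1 - Fraction(1, 65536)       # 1.0 - 1/65536.0, exactly representable
exact_sum = x + y

# One rounding straight to double precision (what the test expects).
z_double = round_to_significand(exact_sum, 53)
# Rounding to extended precision first, then to double (x87 at full precision).
z_x87 = round_to_significand(round_to_significand(exact_sum, 64), 53)

d_double = z_double - x
d_x87 = z_x87 - x
print(d_double, d_x87)
```

In the extended format the sum rounds to 2**53 + 3, which lies exactly halfway between the two neighbouring doubles, so the second rounding goes to the even neighbour 2**53 + 4 instead of 2**53 + 2.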
Filip Pizlo
Comment 56 2013-04-18 17:56:01 PDT
Comment on attachment 197748 [details] Patch I think this needs a bit more love. We should figure out why it causes Sputnik regressions.
Allan Sandfeld Jensen
Comment 57 2013-04-19 05:39:48 PDT
(In reply to comment #56) > (From update of attachment 197748 [details]) > I think this needs a bit more love. We should figure out why it causes Sputnik regressions. It is due to increased precision. Those two tests test that we lose precision.
Allan Sandfeld Jensen
Comment 58 2013-04-19 05:41:57 PDT
Created attachment 198846 [details] Patch Set x87 precision to 64bit, since Linux defaults it to 80bit
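Here "64bit" refers to the double format (53-bit significand) and "80bit" to the x87 extended format (64-bit significand). The patch is not reproduced in this thread, so the following is only a sketch of the bit manipulation involved in selecting double precision in the x87 control word; the constant and function names are hypothetical, and actually loading the word into the FPU (for example with `fldcw`) is outside the scope of this sketch.

```python
# Precision-control (PC) field of the x87 control word: bits 8 and 9.
PC_MASK = 0x0300
PC_SINGLE = 0x0000    # 24-bit significand
PC_DOUBLE = 0x0200    # 53-bit significand
PC_EXTENDED = 0x0300  # 64-bit significand

SIGNIFICAND_BITS = {PC_SINGLE: 24, PC_DOUBLE: 53, PC_EXTENDED: 64}

def with_precision(control_word, pc):
    """Return the control word with its PC field replaced by `pc`."""
    return (control_word & ~PC_MASK) | pc

def significand_bits(control_word):
    """Decode the PC field; returns None for the reserved encoding."""
    return SIGNIFICAND_BITS.get(control_word & PC_MASK)

default_cw = 0x037F  # control word after FINIT: extended precision
double_cw = with_precision(default_cw, PC_DOUBLE)
print(hex(default_cw), significand_bits(default_cw))
print(hex(double_cw), significand_bits(double_cw))
```

Note that the PC field narrows only the significand; the exponent range of values held in x87 registers stays that of the extended format.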
Filip Pizlo
Comment 59 2013-04-19 13:24:22 PDT
Comment on attachment 198846 [details] Patch r=me! This fixes Sputnik.
WebKit Commit Bot
Comment 60 2013-04-20 03:26:19 PDT
Comment on attachment 198846 [details] Patch Clearing flags on attachment: 198846 Committed r148790: <http://trac.webkit.org/changeset/148790>
WebKit Commit Bot
Comment 61 2013-04-20 03:26:25 PDT
All reviewed patches have been landed. Closing bug.
Csaba Osztrogonác
Comment 62 2013-04-21 03:33:46 PDT
(In reply to comment #60) > (From update of attachment 198846 [details]) > Clearing flags on attachment: 198846 > > Committed r148790: <http://trac.webkit.org/changeset/148790> FYI, it broke some tests on 32 bit: http://build.webkit.sed.hu/builders/x86-32%20Linux%20Qt%20Release%20NRWT/builds/32135 Could you check and fix it, please?
Allan Sandfeld Jensen
Comment 63 2013-04-21 05:40:40 PDT
(In reply to comment #62) > (In reply to comment #60) > > (From update of attachment 198846 [details]) > > Clearing flags on attachment: 198846 > > > > Committed r148790: <http://trac.webkit.org/changeset/148790> > > FYI, it broke some tests on 32 bit: http://build.webkit.sed.hu/builders/x86-32%20Linux%20Qt%20Release%20NRWT/builds/32135 > > Could you check and fix it, please? Of course. For reference it appears to be 5 canvas tests and two sputnik tests: fast/canvas/canvas-arc-360-winding.html [ Failure ] fast/canvas/canvas-fillPath-alpha-shadow.html [ Failure ] fast/canvas/canvas-fillPath-gradient-shadow.html [ Failure ] fast/canvas/canvas-fillPath-pattern-shadow.html [ Failure ] fast/canvas/canvas-strokePath-alpha-shadow.html [ Failure ] sputnik/Conformance/15_Native_Objects/15.8_Math/15.8.2/15.8.2.13_pow/S15.8.2.13_A24.html [ Failure ] sputnik/Conformance/15_Native_Objects/15.8_Math/15.8.2/15.8.2.8_exp/S15.8.2.8_A6.html [ Failure ]
Allan Sandfeld Jensen
Comment 64 2013-04-21 06:27:04 PDT
(In reply to comment #63) > (In reply to comment #62) > > (In reply to comment #60) > > > (From update of attachment 198846 [details]) > > > Clearing flags on attachment: 198846 > > > > > > Committed r148790: <http://trac.webkit.org/changeset/148790> > > > > FYI, it broke some tests on 32 bit: http://build.webkit.sed.hu/builders/x86-32%20Linux%20Qt%20Release%20NRWT/builds/32135 > > > > Could you check and fix it, please? > > Of course. For reference it appears to be 5 canvas tests and two sputnik tests: > fast/canvas/canvas-arc-360-winding.html [ Failure ] > fast/canvas/canvas-fillPath-alpha-shadow.html [ Failure ] > fast/canvas/canvas-fillPath-gradient-shadow.html [ Failure ] > fast/canvas/canvas-fillPath-pattern-shadow.html [ Failure ] > fast/canvas/canvas-strokePath-alpha-shadow.html [ Failure ] > sputnik/Conformance/15_Native_Objects/15.8_Math/15.8.2/15.8.2.13_pow/S15.8.2.13_A24.html [ Failure ] > sputnik/Conformance/15_Native_Objects/15.8_Math/15.8.2/15.8.2.8_exp/S15.8.2.8_A6.html [ Failure ] I have solved these before. When hand-merging the last patch, I forgot to include one more change. The bcd2i instruction has to use the fcomip comparison instead of the fucomip comparison, since fucomip specifically does not report an invalid comparison when either value is +/- infinity, which means infinity may get wrongly converted to an integer.
Allan Sandfeld Jensen
Comment 65 2013-04-21 06:52:11 PDT
(In reply to comment #64) > I have solved these before. When hand-merging the last patch, I forgot to include one more change. The bcd2i instruction has to use the fcomip comparison instead of the fucomip comparison, since fucomip specifically does not report an invalid comparison when either value is +/- infinity, which means infinity may get wrongly converted to an integer. Never mind. That makes no sense. On Monday I will figure out which change in my debug tree makes the difference.
Jan
Comment 66 2013-04-23 04:36:51 PDT
The fix did land in QtWebKit 2.3.1, right? Unfortunately, I still get illegal instructions, although it seems to be working for most sites now. I can reproduce it by running http://octane-benchmark.googlecode.com/svn/latest/index.html. It crashes at "Splay - Memory & GC". It could take me some time to get a debug build if one is needed to find the culprit for this one.
Allan Sandfeld Jensen
Comment 67 2013-04-23 04:47:47 PDT
(In reply to comment #66) > The fix did land in QtWebKit 2.3.1, right? Unfortunately, I still get illegal instructions, although it seems to be working for most sites now. I can reproduce it by running http://octane-benchmark.googlecode.com/svn/latest/index.html. It crashes at "Splay - Memory & GC". It could take me some time to get a debug build if one is needed to find the culprit for this one. This patch is not in QtWebKit 2.3.1. In 2.3.1 I just disabled LLInt if GCC does not define __SSE2__.
Jan
Comment 68 2013-04-23 04:55:23 PDT
(In reply to comment #67) > This patch is not in QtWebKit 2.3.1. I just disabled LLInt in 2.3.1 if GCC does not find the __SSE2__ define. So, I'm hitting another bug for which I should file a new bug?
Allan Sandfeld Jensen
Comment 69 2013-04-26 08:02:20 PDT
Reclose
Allan Sandfeld Jensen
Comment 70 2013-06-07 08:26:04 PDT
There is a remaining case where the stack was not reset right. Filip, have you had a chance to take a look at bug 148790 ?
Allan Sandfeld Jensen
Comment 71 2013-06-07 08:26:51 PDT
(In reply to comment #70) > There is a remaining case where the stack was not reset right. Filip, have you had a chance to take a look at bug 148790 ? Sorry bug 114913 of course
Gauvain Pocentek
Comment 72 2013-07-26 06:42:13 PDT
Hello, I'm experiencing this problem with QtWebKit 2.3.2. The build is based on the Ubuntu packaging, with these build options: ./Tools/Scripts/build-webkit --qt DEFINES+=ENABLE_JIT=0 DEFINES+=ENABLE_YARR_JIT=0 DEFINES+=ENABLE_ASSEMBLER=0 --no-force-sse2 The crash happens on an AMD Geode CPU, with an "Illegal Instruction". The backtrace is similar to the arora trace provided in comment #1. I've tested with a personal application and with arora. Maybe the build options are not good, but it looks like the bug is still there in this configuration. Let me know if you need more information.
Allan Sandfeld Jensen
Comment 73 2013-07-26 06:58:02 PDT
(In reply to comment #72) > Hello, > > I'm experiencing this problem with QtWebKit 2.3.2. The build is based on the Ubuntu packaging, with these build options: > ./Tools/Scripts/build-webkit --qt DEFINES+=ENABLE_JIT=0 DEFINES+=ENABLE_YARR_JIT=0 DEFINES+=ENABLE_ASSEMBLER=0 --no-force-sse2 > > The crash happens on an AMD Geode CPU, with an "Illegal Instruction". The backtrace is similar to the arora trace provided in comment #1. > > I've tested with a personal application and with arora. > > Maybe the build options are not good, but it looks like the bug is still there in this configuration. > > Let me know if you need more information. That sounds like a different bug. This particular codepath in llint does not get used if you disable JIT.
Gauvain Pocentek
Comment 74 2013-07-26 07:09:40 PDT
OK. I'm going to test a build with JIT enabled and see if it helps. I'll report another bug if needed. Thanks.