Bug 120047

Summary: Relatively slow performance on emscripten-compiled memops benchmark
Product: WebKit
Component: JavaScriptCore
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Status: RESOLVED DUPLICATE
Severity: Normal
Priority: P2
Keywords: InRadar
Reporter: Alon Zakai <alonzakai>
Assignee: Filip Pizlo <fpizlo>
CC: fpizlo, ggaren, oliver
Attachments:
  memops benchmark (flags: none)
  memops htmlified (flags: none)

Description Alon Zakai 2013-08-19 18:27:22 PDT
Created attachment 209146 [details]
memops benchmark

Attached is the "memops" benchmark from emscripten. The numbers on my 64-bit linux machine are

jsc  5.532 seconds
sm   1.551
v8   1.488

It is possible to run the benchmark for a longer time period if that is useful, by giving it an argument of 4 or 5 (-- 4 for jsc or v8, just 4 for sm). The results are overall similar.
Comment 1 Geoffrey Garen 2013-08-19 23:21:30 PDT
<rdar://problem/14781573>
Comment 2 Filip Pizlo 2013-08-20 11:24:31 PDT
(In reply to comment #0)
> Created an attachment (id=209146) [details]
> memops benchmark
> 
> Attached is the "memops" benchmark from emscripten. The numbers on my 64-bit linux machine are
> 
> jsc  5.532 seconds
> sm   1.551
> v8   1.488
> 
> It is possible to run the benchmark for a longer time period if that is useful, by giving it an argument of 4 or 5 (-- 4 for jsc or v8, just 4 for sm). The results are overall similar.

Interesting, will look.
Comment 3 Filip Pizlo 2013-08-20 11:42:02 PDT
(In reply to comment #2)
> (In reply to comment #0)
> > Created an attachment (id=209146) [details]
> > memops benchmark
> > 
> > Attached is the "memops" benchmark from emscripten. The numbers on my 64-bit linux machine are
> > 
> > jsc  5.532 seconds
> > sm   1.551
> > v8   1.488
> > 
> > It is possible to run the benchmark for a longer time period if that is useful, by giving it an argument of 4 or 5 (-- 4 for jsc or v8, just 4 for sm). The results are overall similar.
> 
> Interesting, will look.

OK - I think that something is terribly wrong on Linux, or V8 and SM got a lot faster.  On my 64-bit Mac, I see:

Chrome 28: 3.162 sec
A recent WebKit: 5.382 sec

So, clearly we're slower and we should fix that, but we're not *that* much slower.

Can you check, for sanity, if you see a difference between Chrome 28 (the latest Chrome on the release channel) and the V8 you have?

For posterity I post the HTML-ified version of the benchmark I used to run in browser.
Comment 4 Filip Pizlo 2013-08-20 11:42:32 PDT
Created attachment 209212 [details]
memops htmlified
Comment 5 Filip Pizlo 2013-08-20 11:49:18 PDT
Oh - never mind.  Chrome Canary scores really well: 1696 ms.

So it's a real bug, and it's not Linux-only.
Comment 6 Filip Pizlo 2013-08-20 11:58:35 PDT
(In reply to comment #5)
> Oh - never mind.  Chrome Canary scores really well: 1696 ms.
> 
> So it's a real bug, and it's not Linux-only.

Here's basically what's going wrong:

- the HEAP8/HEAP32 variables are closure variables, which makes sense.  But our closure variable support still requires *way too many loads*.

- since the DFG cannot reason about GetClosureVar in a non-block-local way, the hottest basic block of the hottest loop of the memops benchmark ends up repeatedly loading the scope, loading the registers, loading the closure variable, type checking the thing it loaded to see that it's a typed array, and then loading the typed array's vector, and then loading from that vector.  omgwtfbbq.
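
For reference, a minimal sketch of the shape being described (illustrative, not taken from the attached benchmark): HEAP8/HEAP32 are closure variables of the module function, so every access in the hot loop compiles to a GetClosureVar plus the typed-array checks listed above.

```javascript
// Hypothetical reduction of the emscripten module pattern; the names
// (asmModule, memset8, peek8) are illustrative, not from the attachment.
var asmModule = (function (stdlib, buffer) {
  // Closure variables: with only block-local reasoning, each access
  // below reloads the scope, reloads the variable, re-checks that the
  // value is a typed array, and reloads the array's vector.
  var HEAP8 = new stdlib.Int8Array(buffer);
  var HEAP32 = new stdlib.Int32Array(buffer); // unused here; shown for shape

  function memset8(ptr, value, num) {
    for (var i = 0; i < num; i++) {
      HEAP8[ptr + i] = value; // one full load chain per store
    }
  }
  function peek8(ptr) { return HEAP8[ptr]; }

  return { memset8: memset8, peek8: peek8 };
})(globalThis, new ArrayBuffer(64));

asmModule.memset8(0, 7, 16);
```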

There are four paths for solving this:

- The obvious: FTL JIT.  https://bugs.webkit.org/show_bug.cgi?id=112840.  This higher-tier JIT will give us LICM on all of those things, resulting in inter-block reasoning about closure variables and everything else.

- The frustrating: reduce the number of loads needed to get to a closure variable.  Reduce the number of loads needed to get to a value in a typed array.  Reduce the number of loads needed to check that something is a typed array.  And so on.

- The hilarious: constant inference for closure variables, which will make us compile the code with HEAP8/HEAP32/etc being cell constants.  Look, ma, no type checks!  I don't know if this will make any sense, though.

- The outrageous: recognize the emscripten idioms.  I don't want to do this.

I'm going to add this benchmark to our repertoire and I'll use it for FTL testing.  Before I can do that, though, I'll need to finally add typed arrays to the FTL.
Comment 7 Filip Pizlo 2013-08-20 16:49:47 PDT
Added the memops benchmark to JSRegress in http://trac.webkit.org/changeset/154376.
Comment 8 Alon Zakai 2013-08-21 17:25:05 PDT
Btw, is there something emscripten could emit differently that would be easier for JSC to optimize?
Comment 9 Filip Pizlo 2013-08-21 20:49:38 PDT
(In reply to comment #8)
> Btw, is there something emscripten could emit differently that would be easier for JSC to optimize?

Good question.  I think you're emitting optimal code, and we're just doing a poor job right now - and we will fix that.

Question though:  is the 'asm = (function ...' thing really meant to be executed once, or multiple times?  If it's meant to be executed once, then what you're doing right now is going to be optimal in the long run and we just have to get that right: closure variables that arise in run-once functions, that are initialized once and never reassigned, should be optimized down to constants in our engine.  They aren't yet but they will be eventually.  It's such an obvious trick that I wouldn't be surprised if other engines did this already.  So, I wouldn't recommend changing this aspect of your code generation even though it's a pain point for us right now.
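
A sketch of the run-once pattern under discussion (illustrative, not the actual emscripten output): the outer function is an IIFE, so its closure variables are initialized exactly once and never reassigned, which is what would let an engine fold them to cell constants.

```javascript
// Illustrative run-once module: the IIFE executes exactly once, so
// HEAP32 is assigned exactly once and never reassigned.  An engine that
// proves this can compile sum() with HEAP32 as a constant -- no scope
// loads, no type checks.
var asm = (function (stdlib, buffer) {
  var HEAP32 = new stdlib.Int32Array(buffer); // written exactly once

  function store(i, v) { HEAP32[i] = v | 0; }
  function sum(n) {
    var total = 0;
    for (var i = 0; i < n; i++) total = (total + HEAP32[i]) | 0;
    return total;
  }
  return { store: store, sum: sum };
})(globalThis, new ArrayBuffer(256));

asm.store(0, 1);
asm.store(1, 2);
asm.store(2, 3);
```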
Comment 10 Alon Zakai 2013-08-21 21:03:18 PDT
(In reply to comment #9)
> Question though:  is the 'asm = (function ...' thing really meant to be executed once, or multiple times?

Exactly once. The function is not named and immediately executed, so there is in fact no way to execute it more than once. And as you say, the important locals inside it like HEAP8 are assigned to exactly once, so there is a guarantee that it is ok to optimize accesses of them.

> If it's meant to be executed once, then what you're doing right now is going to be optimal in the long run and we just have to get that right: closure variables that arise in run-once functions, that are initialized once and never reassigned, should be optimized down to constants in our engine.  They aren't yet but they will be eventually.  It's such an obvious trick that I wouldn't be surprised if other engines did this already.

Yes, sm does something along those lines. Not sure what v8 does.
Comment 11 Filip Pizlo 2014-07-02 15:31:56 PDT
I believe that this is no longer a problem.

*** This bug has been marked as a duplicate of bug 112840 ***