Bug 120047

Summary: Relatively slow performance on emscripten-compiled memops benchmark
Product: WebKit
Component: JavaScriptCore
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Status: RESOLVED DUPLICATE
Severity: Normal
Priority: P2
Keywords: InRadar
Reporter: Alon Zakai <alonzakai>
Assignee: Filip Pizlo <fpizlo>
CC: fpizlo, ggaren, oliver
Attachments:
  memops benchmark (flags: none)
  memops htmlified (flags: none)

Description Alon Zakai 2013-08-19 18:27:22 PDT
Created attachment 209146 [details]
memops benchmark

Attached is the "memops" benchmark from emscripten. The numbers on my 64-bit linux machine are

jsc  5.532 seconds
sm   1.551
v8   1.488

It is possible to run the benchmark for a longer time period if that is useful, by giving it an argument of 4 or 5 (-- 4 for jsc or v8, just 4 for sm). The results are overall similar.
Comment 1 Geoffrey Garen 2013-08-19 23:21:30 PDT
<rdar://problem/14781573>
Comment 2 Filip Pizlo 2013-08-20 11:24:31 PDT
(In reply to comment #0)
> Created an attachment (id=209146) [details]
> memops benchmark
> 
> Attached is the "memops" benchmark from emscripten. The numbers on my 64-bit linux machine are
> 
> jsc  5.532 seconds
> sm   1.551
> v8   1.488
> 
> It is possible to run the benchmark for a longer time period if that is useful, by giving it an argument of 4 or 5 (-- 4 for jsc or v8, just 4 for sm). The results are overall similar.

Interesting, will look.
Comment 3 Filip Pizlo 2013-08-20 11:42:02 PDT
(In reply to comment #2)
> (In reply to comment #0)
> > Created an attachment (id=209146) [details]
> > memops benchmark
> > 
> > Attached is the "memops" benchmark from emscripten. The numbers on my 64-bit linux machine are
> > 
> > jsc  5.532 seconds
> > sm   1.551
> > v8   1.488
> > 
> > It is possible to run the benchmark for a longer time period if that is useful, by giving it an argument of 4 or 5 (-- 4 for jsc or v8, just 4 for sm). The results are overall similar.
> 
> Interesting, will look.

OK - I think that something is terribly wrong on Linux, or V8 and SM got a lot faster.  On my 64-bit Mac, I see:

Chrome 28: 3.162 sec
A recent WebKit: 5.382 sec

So, clearly we're slower and we should fix that, but we're not *that* much slower.

Can you check, for sanity, if you see a difference between Chrome 28 (the latest Chrome on the release channel) and the V8 you have?

For posterity I post the HTML-ified version of the benchmark I used to run in browser.
Comment 4 Filip Pizlo 2013-08-20 11:42:32 PDT
Created attachment 209212 [details]
memops htmlified
Comment 5 Filip Pizlo 2013-08-20 11:49:18 PDT
Oh - never mind.  Chrome Canary scores really well: 1696 ms.

So it's a real bug, and it's not Linux-only.
Comment 6 Filip Pizlo 2013-08-20 11:58:35 PDT
(In reply to comment #5)
> Oh - never mind.  Chrome Canary scores really well: 1696 ms.
> 
> So it's a real bug, and it's not Linux-only.

Here's basically what's going wrong:

- the HEAP8/HEAP32 variables are closure variables, which makes sense.  But our closure variable support still requires *way too many loads*.

- since the DFG cannot reason about GetClosureVar in a non-block-local way, the hottest basic block of the hottest loop of the memops benchmark ends up repeatedly loading the scope, loading the registers, loading the closure variable, type checking the thing it loaded to see that it's a typed array, and then loading the typed array's vector, and then loading from that vector.  omgwtfbbq.
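
For reference, a minimal sketch of the shape being described (illustrative, not taken from the attached benchmark): HEAP8/HEAP32 are closure variables of the module function, so every access in the hot loop compiles to a GetClosureVar plus the typed-array checks listed above.

```javascript
// Hypothetical reduction of the emscripten module pattern; the names
// (asmModule, memset8, peek8) are illustrative, not from the attachment.
var asmModule = (function (stdlib, buffer) {
  // Closure variables: with only block-local reasoning, each access
  // below reloads the scope, reloads the variable, re-checks that the
  // value is a typed array, and reloads the array's vector.
  var HEAP8 = new stdlib.Int8Array(buffer);
  var HEAP32 = new stdlib.Int32Array(buffer); // unused here; shown for shape

  function memset8(ptr, value, num) {
    for (var i = 0; i < num; i++) {
      HEAP8[ptr + i] = value; // one full load chain per store
    }
  }
  function peek8(ptr) { return HEAP8[ptr]; }

  return { memset8: memset8, peek8: peek8 };
})(globalThis, new ArrayBuffer(64));

asmModule.memset8(0, 7, 16);
```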

There are four paths for solving this:

- The obvious: FTL JIT.  https://bugs.webkit.org/show_bug.cgi?id=112840.  This higher-tier JIT will give us LICM on all of those things, resulting in inter-block reasoning about closure variables and everything else.

- The frustrating: reduce the number of loads needed to get to a closure variable.  Reduce the number of loads needed to get to a value in a typed array.  Reduce the number of loads needed to check that something is a typed array.  And so on.

- The hilarious: constant inference for closure variables, which will make us compile the code with HEAP8/HEAP32/etc being cell constants.  Look, ma, no type checks!  I don't know if this will make any sense, though.

- The outrageous: recognize the emscripten idioms.  I don't want to do this.

I'm going to add this benchmark to our repertoire and I'll use it for FTL testing.  Before I can do that, though, I'll need to finally add typed arrays to the FTL.
Comment 7 Filip Pizlo 2013-08-20 16:49:47 PDT
Added the memops benchmark to JSRegress in http://trac.webkit.org/changeset/154376.
Comment 8 Alon Zakai 2013-08-21 17:25:05 PDT
Btw, is there something emscripten could emit differently that would be easier for JSC to optimize?
Comment 9 Filip Pizlo 2013-08-21 20:49:38 PDT
(In reply to comment #8)
> Btw, is there something emscripten could emit differently that would be easier for JSC to optimize?

Good question.  I think you're emitting optimal code, and we're just doing a poor job right now - and we will fix that.

Question though:  is the 'asm = (function ...' thing really meant to be executed once, or multiple times?  If it's meant to be executed once, then what you're doing right now is going to be optimal in the long run and we just have to get that right: closure variables that arise in run-once functions, that are initialized once and never reassigned, should be optimized down to constants in our engine.  They aren't yet but they will be eventually.  It's such an obvious trick that I wouldn't be surprised if other engines did this already.  So, I wouldn't recommend changing this aspect of your code generation even though it's a pain point for us right now.
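
A sketch of the run-once pattern under discussion (illustrative, not the actual emscripten output): the outer function is an IIFE, so its closure variables are initialized exactly once and never reassigned, which is what would let an engine fold them to cell constants.

```javascript
// Illustrative run-once module: the IIFE executes exactly once, so
// HEAP32 is assigned exactly once and never reassigned.  An engine that
// proves this can compile sum() with HEAP32 as a constant -- no scope
// loads, no type checks.
var asm = (function (stdlib, buffer) {
  var HEAP32 = new stdlib.Int32Array(buffer); // written exactly once

  function store(i, v) { HEAP32[i] = v | 0; }
  function sum(n) {
    var total = 0;
    for (var i = 0; i < n; i++) total = (total + HEAP32[i]) | 0;
    return total;
  }
  return { store: store, sum: sum };
})(globalThis, new ArrayBuffer(256));

asm.store(0, 1);
asm.store(1, 2);
asm.store(2, 3);
```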
Comment 10 Alon Zakai 2013-08-21 21:03:18 PDT
(In reply to comment #9)
> Question though:  is the 'asm = (function ...' thing really meant to be executed once, or multiple times?

Exactly once. The function is not named and immediately executed, so there is in fact no way to execute it more than once. And as you say, the important locals inside it like HEAP8 are assigned to exactly once, so there is a guarantee that it is ok to optimize accesses of them.

> If it's meant to be executed once, then what you're doing right now is going to be optimal in the long run and we just have to get that right: closure variables that arise in run-once functions, that are initialized once and never reassigned, should be optimized down to constants in our engine.  They aren't yet but they will be eventually.  It's such an obvious trick that I wouldn't be surprised if other engines did this already.

Yes, sm does something along those lines. Not sure what v8 does.
Comment 11 Filip Pizlo 2014-07-02 15:31:56 PDT
I believe that this is no longer a problem.

*** This bug has been marked as a duplicate of bug 112840 ***