Bug 150777

Summary: Consider still matching an address expression even if B3 has already assigned a Tmp to it
Product: WebKit
Reporter: Filip Pizlo <fpizlo>
Component: JavaScriptCore
Assignee: Filip Pizlo <fpizlo>
Status: RESOLVED FIXED
Severity: Normal
CC: barraclough, benjamin, commit-queue, ggaren, keith_miller, mark.lam, mhahnenb, msaboff, nrotem, oliver, saam, sam
Priority: P2    
Version: WebKit Nightly Build   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 152106, 150279    
Attachments:
Description Flags
the patch ggaren: review+

Description Filip Pizlo 2015-11-01 11:40:45 PST
Probably if we want to do hoisting of address expressions, we should have some better heuristics for it.
Comment 1 Filip Pizlo 2015-12-10 09:16:24 PST
Created attachment 267111 [details]
the patch
Comment 2 Geoffrey Garen 2015-12-10 15:34:51 PST
Comment on attachment 267111 [details]
the patch

I would have guessed 2.
Comment 3 Filip Pizlo 2015-12-10 18:44:29 PST
(In reply to comment #2)
> Comment on attachment 267111 [details]
> the patch
> 
> I would have guessed 2.

Depends on the architecture and a lot of other things.

On Intel, any address expression takes 1 cycle, regardless of its complexity.  So if you do this:

mov (%rax), %rax

then you will spend 1 cycle deducing that the address is simply the thing in %rax.  And if you do this:

mov 42(%rcx,%rdx,4), %rax

then you will also spend 1 cycle deducing what the address is.  Therefore, if you had to pick between the following two snippets, you would pick the one with fewer instructions even though it computes the same address twice:

This is faster:
    mov 42(%rcx,%rdx,4), %rax
    mov 46(%rcx,%rdx,4), %rdi

than this:
    lea 42(%rcx,%rdx,4), %rsi
    mov (%rsi), %rax
    mov 4(%rsi), %rdi

This would still be true even if the address expressions were identical rather than the second one being offset by +4. The first version is faster because even though we recompute the same seemingly complex address expression, it's actually free to do so.

Hence, setting the threshold to 2 is probably not a good thing.  It's far too low.  The last time I wrote a compiler, I set the threshold to +Inf.  Later on, I learned through whispers in the wind that you want to have some kind of upper limit, though I never learned the reason.  I suspect that the reason is just registers.  It's possible that in the "faster" example above, keeping %rcx and %rdx alive causes too much register pressure.  You don't know if that's an issue at the time that you do instruction selection, and usually it won't be an issue.  But you can see how, if you really had a lot of uses of the same address, the second form may be better because each memory access then needs only one register (%rsi in the example) to stay alive in order to form its address.

So, 2 is probably too low because it adds instructions without reducing the amount of work, but +Inf is probably too high because at some point the register pressure of keeping all of the inputs to the address computation alive is a bigger issue than the cost of the "lea" instruction.
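
For illustration only, the trade-off described in this comment can be sketched as a tiny cost model. This is not B3's code: the names AddressCost, foldIntoEachUse, hoistWithLea, and shouldFold are hypothetical, the model assumes a base + index*scale + offset address with two input registers, and the direction of the threshold comparison is an assumption.

```cpp
#include <cstddef>

// Hypothetical cost model; none of these names come from B3.
struct AddressCost {
    std::size_t instructions;    // instructions emitted for all the accesses
    std::size_t liveAddressRegs; // registers kept alive just to form the address
};

// Fold the full address expression into every memory access,
// e.g. mov 42(%rcx,%rdx,4), %rax repeated once per use.
constexpr AddressCost foldIntoEachUse(std::size_t uses)
{
    return { uses, 2 }; // one mov per use; base and index both stay live
}

// Compute the address once with lea, then access memory through the result,
// e.g. lea 42(%rcx,%rdx,4), %rsi followed by mov (%rsi), %rax per use.
constexpr AddressCost hoistWithLea(std::size_t uses)
{
    return { uses + 1, 1 }; // one extra lea; only its result stays live
}

// Assumed policy: keep folding while the use count is below the threshold.
constexpr bool shouldFold(std::size_t uses, std::size_t threshold)
{
    return uses < threshold;
}
```

Under this model, two uses cost 2 instructions when folded and 3 when hoisted, which matches the two snippets above. With a threshold of 2, shouldFold(2, 2) is false, so the extra lea is emitted even though it removes no work. The only benefit of hoisting in the model is one fewer live register, which matters only once register pressure becomes the bottleneck.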
Comment 4 Filip Pizlo 2015-12-10 19:42:31 PST
Landed in http://trac.webkit.org/changeset/193941