Bug 228781

Summary: [ARM64] Add pattern matching for Load/Store Pair
Product: WebKit Reporter: Yijia Huang <yijia_huang>
Component: JavaScriptCoreAssignee: Yijia Huang <yijia_huang>
Status: NEW ---    
Severity: Normal CC: ews-watchlist, keith_miller, mark.lam, msaboff, saam, tzagallo, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
ews-feeder: commit-queue-
Patch
none
Patch
none
Patch saam: review-

Description Yijia Huang 2021-08-04 10:14:54 PDT
...
Comment 1 Yijia Huang 2021-08-08 02:58:45 PDT
Created attachment 435146 [details]
Patch
Comment 2 Yijia Huang 2021-08-08 02:59:43 PDT
Created attachment 435147 [details]
Patch
Comment 3 Yijia Huang 2021-08-08 13:25:33 PDT
Created attachment 435153 [details]
Patch
Comment 4 Yijia Huang 2021-08-08 14:09:19 PDT
Created attachment 435155 [details]
Patch
Comment 5 Yijia Huang 2021-08-08 20:58:19 PDT
Created attachment 435165 [details]
Patch
Comment 6 Yijia Huang 2021-08-09 00:06:26 PDT
Created attachment 435170 [details]
Patch
Comment 7 Radar WebKit Bug Importer 2021-08-11 10:15:18 PDT
<rdar://problem/81800583>
Comment 8 Saam Barati 2021-08-16 20:15:35 PDT
Comment on attachment 435170 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=435170&action=review

Would be good to add some tests for the bugs I found.

> Source/JavaScriptCore/ChangeLog:21
> +            memory2 = Load(base, offset + bytes)
> +        Or:
> +            memory1 = Load(base, offset + bytes)

Worth specifying what "bytes" is here. I'm assuming 4/8 for 32/64 bit memory ops

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:60
> +    IndexSet<Value*> ignoredValues;

name nit: ignoredValues => alreadyHandledValues

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:65
> +    auto tryAddLoadStorePairCandidates = [&] (Value* value) {

I like to write code like this by having the lambda be in the loop it's called from. I think you can also make index local to the loop then, too, instead of defining it above.

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:70
> +            //     ***

I like "..." over "***"

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:77
> +            MemoryValue* memory2 = value->as<MemoryValue>();
> +            Value* base2 = memory2->lastChild();

name nit: I'd name this "memory" and "base", and in the loop below in tryAddLoadPairCandidates, call the other variable "memory2"

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:114
> +                if (value2->type() != Int32 && value2->type() != Int64)
> +                    return;
> +                if (!index)
> +                    return;
> +                Value* before = block->at(index - 1);
> +                if (before->opcode() != Store)
> +                    return;

See comment below. I feel like this should be doing code motion to move stores to be adjacent when possible.

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:165
> +        if (ignoredValues.contains(memory1) || ignoredValues.contains(memory2) || !controlEquivalent(memory1, memory2))
> +            continue;

This isn't a sufficient test for soundness. Imagine you have:

load(p)
store(p + 8)
load(p + 8)

This program will be turned into:
load(p)
load(p + 8)
store(p + 8)
which is wrong.

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:208
> +    for (auto pair : storePairCandidates) {
> +        // ***                                                  newMemory2 = Store(value2, base, offset)
> +        // memory1 = Store(value1, base, offset + bytes) --->   memory1 = Store(value1, base, offset + bytes)
> +        // memory2 = Store(value2, base, offset)                memory2 = Identity(newMemory2)
> +        MemoryValue* memory1 = pair.key;
> +        MemoryValue* memory2 = pair.value;
> +        Value* value2 = memory2->child(0);
> +        Value* base2 = memory2->child(1);
> +        MemoryValue* newMemory2 = insertionSet.insert<MemoryValue>(valueIndexInBasicBlock(memory1), Store, memory1->origin(), value2, base2);
> +        newMemory2->setOffset(memory2->offset());
> +        memory2->replaceWithIdentity(newMemory2);
> +
> +        insertionSet.execute(memory1->owner);
> +    }

I don't understand how this works at all. Stores don't produce values.

Why isn't store just doing the same thing as loads, and just doing code motion on the stores so they're next to each other? I don't get the point of just looking for adjacent values. Shouldn't the purpose of this canonicalization be to produce adjacent stores when sound?

> Source/JavaScriptCore/b3/B3LowerToAir.cpp:2639
> +                if (m_locked.contains(memory1) || m_locked.contains(base1))

I don't think base1 can be locked, right? Since base1 == base2, and if we're generating code for memory2, base2 has to not be locked.

> Source/JavaScriptCore/b3/B3LowerToAir.cpp:3505
> +            auto tryAppendStorePair = [&] () -> bool {

some of this detection is similar to the load case. Can we have a helper that detects if it's legal to turn into load/store pair?

> Source/JavaScriptCore/b3/B3LowerToAir.cpp:3523
> +                MemoryValue* memory2 = m_value->as<MemoryValue>();
> +                Value* value2 = memory2->child(0);
> +                Air::Opcode opcode = tryOpcodeForType(StorePair32, StorePair64, value2->type());
> +                if (!isValidForm(opcode, Arg::Tmp, Arg::Tmp, Arg::Addr) || !m_index)
> +                    return false;
> +                Value* before = m_block->at(m_index - 1);
> +                if (before->opcode() != Store)
> +                    return false;
> +
> +                MemoryValue* memory1 = before->as<MemoryValue>();
> +                Value* value1 = memory1->child(0);
> +                Value* base1 = memory1->child(1);
> +                Value* base2 = memory2->child(1);
> +                Value::OffsetType offset1 = memory1->offset();
> +                Value::OffsetType offset2 = memory2->offset();
> +                Value::OffsetType bytes = opcode == StorePair32 ? 4 : 8;
> +                if (value1->type() != value2->type() || base1 != base2 || offset2 - offset1 != bytes)
> +                    return false;

Everything up to here seems essentially the same, maybe except for checking the type of the values, but you can easily add a branch for that.

> Source/JavaScriptCore/b3/B3LowerToAir.cpp:3524
> +                if (m_locked.contains(memory1) || m_locked.contains(value1) || m_locked.contains(value2) || m_locked.contains(base1))

similar comment here about checking about the locked base
Comment 9 Yijia Huang 2021-08-26 04:31:19 PDT
Created attachment 436496 [details]
Patch
Comment 10 Yijia Huang 2021-08-26 04:35:27 PDT
Created attachment 436497 [details]
Patch
Comment 11 Yijia Huang 2021-08-26 04:41:10 PDT
Created attachment 436498 [details]
Patch
Comment 12 Yijia Huang 2021-08-26 04:56:23 PDT
Created attachment 436499 [details]
Patch
Comment 13 Yijia Huang 2021-08-26 08:56:35 PDT
Created attachment 436516 [details]
Patch
Comment 14 Yijia Huang 2021-08-26 13:58:14 PDT
Created attachment 436562 [details]
Patch
Comment 15 Yijia Huang 2021-08-26 14:19:46 PDT
Created attachment 436565 [details]
Patch
Comment 16 Yijia Huang 2021-08-26 14:57:43 PDT
Created attachment 436574 [details]
Patch
Comment 17 Yijia Huang 2021-08-26 15:25:57 PDT
Created attachment 436577 [details]
Patch
Comment 18 Saam Barati 2021-08-26 19:06:06 PDT
Comment on attachment 436577 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=436577&action=review

> Source/JavaScriptCore/ChangeLog:88
> +        The equivalent Pattern is:
> +        Store Pair Pattern 1:
> +            memory1 = Store(value1, base, offset + bytes)
> +            ...
> +            memory2 = Store(value2, base, offset)
> +        Store Pair Pattern 2:
> +            memory1 = Store(value1, base, offset)
> +            ...
> +            memory2 = Store(value2, base, offset + bytes)
> +
> +        Note: The bytes means 4/8 for 32/64 bit memory ops.
> +
> +        First, we need to convert to the canonical form, which is subject to control equivalence. The
> +        conversion is shown in below:
> +
> +        Convert Store Pair Pattern 1 to the Canonical Form:
> +            memory1 = Store(value1, base, offset + bytes)            memory1 = Nop
> +            ...                                                      ...
> +            memory2 = Store(value2, base, offset)            -->     memory2 = Store(value2, base, offset)
> +            ...                                                      newMemory1 = Store(value1, base, offset + bytes)
> +
> +        Convert Store Pair Pattern 2 to the Canonical Form:
> +            memory1 = Store(value1, base, offset)                    memory1 = Nop
> +            ...                                                      ...
> +            ...                                                      newMemory1 = Store(value1, base, offset)
> +            memory2 = Store(value2, base, offset + bytes)    -->     memory2 = Store(value2, base, offset + bytes)
> +
> +        Then, lower the canonical form to Air:
> +            memory1 = Store(value1, base, offset)
> +            memory2 = Store(value2, base, offset + bytes)    -->     StorePair %value1, %value2, (%base, offset)

I think you can just say something like "Store Pair canonicalization works like Load pair, except we don't need to worry about users of the store"

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:62
> +static void allPathsUtil(BasicBlock* start, BasicBlock* end, bool* visited, unsigned* path, unsigned pathIndex, HashSet<unsigned>& blocks)
> +{
> +    if (start == end) {
> +        for (unsigned i = 1; i < pathIndex; ++i)
> +            blocks.add(path[i]);
> +        return;
> +    }
> +
> +    unsigned startBlockIndex = start->index();
> +    visited[startBlockIndex] = true;
> +    path[pathIndex] = startBlockIndex;
> +    for (BasicBlock* next : start->successorBlocks()) {
> +        if (!visited[next->index()])
> +            allPathsUtil(next, end, visited, path, pathIndex + 1, blocks);
> +    }
> +    visited[startBlockIndex] = false;
> +}

This algorithm is overkill.

We can cheat, since we know that start dominates end (we can assert it), we don't need to do anything crazy to backtrack if we don't reach end. We'll know each path leaving start will reach end. So we can just iterate successors iteratively, checking if they're visited, to add to a set of basic blocks.

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:77
> +    HeapRange range = m1->opcode() == Load ? m1->effects().reads : m1->effects().writes;

this isn't quite right for fenced operations. I also don't think we can move anything that's fenced around anything else that's fenced. So we need to be careful w.r.t that.

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:78
> +    for (unsigned i : blocks) {

Let's tabulate this information on a per-block basis (maybe we can do this lazily when first needed), by taking the union of all effects in an entire block. So we don't have to repeatedly iterate all values in a block. We can just our cache for the summation of a block's effects.

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:81
> +            if (value->effects().reads.overlaps(range)
> +                || value->effects().writes.overlaps(range))

let's call effects() once, and use its result.

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:142
> +                    // we should double check in case any collisions appears in base offset hashing

why would there be a collision?

> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:240
> +        for (Value* child : value->children()) {
> +            auto iter = loadUses.find(child);
> +            if (iter != loadUses.end())
> +                iter->value.append(value);
> +        }

I think this algorithm would be easier if you just moved memory2 to memory1, since memory1 dominates memory2, you can always move memory2 to memory1's program location without violating SSA. (There are other constraints to be able to move it, of course.) Doing that would allow you to drop this loadUses stuff.

> Source/JavaScriptCore/b3/B3LowerToAir.cpp:3480
> +                if (!isValidForm(opcode, Arg::PreIndex, Arg::Tmp) || !m_index || memory->accessBank() != GP)

add a test?


I also think this code isn't considering if the memory value has a fence, and doing the wrong thing there. Same for load with increment.

> Source/JavaScriptCore/b3/B3LowerToAir.cpp:3523
> +                if (!isValidForm(opcode, Arg::Tmp, Arg::Tmp, Arg::Addr) || !m_index || memory2->accessBank() != GP)

do we have a test for accessBank() here?

> Source/JavaScriptCore/b3/air/AirOpcode.opcodes:673
> +arm64: StorePair64 U:G:64, U:G:64, D:G:64

this is sorta weird, since it's defining 128 bits of stuff.

> Source/JavaScriptCore/b3/air/AirOpcode.opcodes:676
> +arm64: LoadPair64 U:G:64, D:G:64, D:G:64

ditto, bur reading 128 bits

> Source/JavaScriptCore/b3/air/AirOpcode.opcodes:695
> +arm64: StorePair32 U:G:32, U:G:32, ZD:G:32

ditto, this is sorta weird since it's defining 64bits of stuff

> Source/JavaScriptCore/b3/air/AirOpcode.opcodes:698
> +arm64: LoadPair32 U:G:32, ZD:G:32, ZD:G:32

ditto, but reading.
Comment 19 Yijia Huang 2021-08-27 06:51:21 PDT
Comment on attachment 436577 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=436577&action=review

Thanks for reviewing. :)

>> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:78
>> +    for (unsigned i : blocks) {
> 
> Let's tabulate this information on a per-block basis (maybe we can do this lazily when first needed), by taking the union of all effects in an entire block. So we don't have to repeatedly iterate all values in a block. We can just our cache for the summation of a block's effects.

Yes, I agree. I was thinking to merge read-heap-range intervals and write-heap-range intervals for each block in the first preorder traversal. Then, merge and cache the read-heap-range intervals and write-heap-range intervals for each pair of <start-memory-value, end-memory-value> for further loop ups.

However, this is only good for large CFGs.

>> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:142
>> +                    // we should double check in case any collisions appears in base offset hashing
> 
> why would there be a collision?

If you still remember, I was talked with Keith about this on Slack (https://a1391192.slack.com/archives/GM9EK0UGL/p1629819911340600?thread_ts=1629757843.326800&cid=GM9EK0UGL). There is no perfect solution to hash two integers, even using pairIntHash(). This is just a double-check.

>> Source/JavaScriptCore/b3/B3LowerToAir.cpp:3480
>> +                if (!isValidForm(opcode, Arg::PreIndex, Arg::Tmp) || !m_index || memory->accessBank() != GP)
> 
> add a test?
> 
> 
> I also think this code isn't considering if the memory value has a fence, and doing the wrong thing there. Same for load with increment.

This one is just moving the Add value not the Memory value. Would the fence effect still bother the Memory value?
Comment 20 Yijia Huang 2021-08-27 17:10:28 PDT
The initial version of this work is done and it passes all tests. However, the patch still contains the optimizations and some potential violations (e.g. fence check for loads and stores).