RESOLVED DUPLICATE of bug 278035 228781
[ARM64] Add pattern matching for Load/Store Pair
https://bugs.webkit.org/show_bug.cgi?id=228781
Summary [ARM64] Add pattern matching for Load/Store Pair
Yijia Huang
Reported 2021-08-04 10:14:54 PDT
...
Attachments
Patch (27.91 KB, patch)
2021-08-08 02:58 PDT, Yijia Huang
no flags
Patch (27.72 KB, patch)
2021-08-08 02:59 PDT, Yijia Huang
no flags
Patch (27.72 KB, patch)
2021-08-08 13:25 PDT, Yijia Huang
no flags
Patch (31.37 KB, patch)
2021-08-08 14:09 PDT, Yijia Huang
no flags
Patch (42.88 KB, patch)
2021-08-08 20:58 PDT, Yijia Huang
no flags
Patch (42.81 KB, patch)
2021-08-09 00:06 PDT, Yijia Huang
no flags
Patch (54.68 KB, patch)
2021-08-26 04:31 PDT, Yijia Huang
no flags
Patch (54.66 KB, patch)
2021-08-26 04:35 PDT, Yijia Huang
no flags
Patch (55.08 KB, patch)
2021-08-26 04:41 PDT, Yijia Huang
no flags
Patch (55.21 KB, patch)
2021-08-26 04:56 PDT, Yijia Huang
no flags
Patch (55.52 KB, patch)
2021-08-26 08:56 PDT, Yijia Huang
no flags
Patch (56.58 KB, patch)
2021-08-26 13:58 PDT, Yijia Huang
ews-feeder: commit-queue-
Patch (57.21 KB, patch)
2021-08-26 14:19 PDT, Yijia Huang
no flags
Patch (57.42 KB, patch)
2021-08-26 14:57 PDT, Yijia Huang
no flags
Patch (56.93 KB, patch)
2021-08-26 15:25 PDT, Yijia Huang
saam: review-
Yijia Huang
Comment 1 2021-08-08 02:58:45 PDT
Yijia Huang
Comment 2 2021-08-08 02:59:43 PDT
Yijia Huang
Comment 3 2021-08-08 13:25:33 PDT
Yijia Huang
Comment 4 2021-08-08 14:09:19 PDT
Yijia Huang
Comment 5 2021-08-08 20:58:19 PDT
Yijia Huang
Comment 6 2021-08-09 00:06:26 PDT
Radar WebKit Bug Importer
Comment 7 2021-08-11 10:15:18 PDT
Saam Barati
Comment 8 2021-08-16 20:15:35 PDT
Comment on attachment 435170 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=435170&action=review Would be good to add some tests for the bugs I found. > Source/JavaScriptCore/ChangeLog:21 > + memory2 = Load(base, offset + bytes) > + Or: > + memory1 = Load(base, offset + bytes) Worth specifying what "bytes" is here. I'm assuming 4/8 for 32/64 bit memory ops > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:60 > + IndexSet<Value*> ignoredValues; name nit: ignoredValues => alreadyHandledValues > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:65 > + auto tryAddLoadStorePairCandidates = [&] (Value* value) { I like to write code like this by having the lambda be in the loop it's called from. I think you can also make index local to the loop then, too, instead of defining it above. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:70 > + // *** I like "..." over "***" > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:77 > + MemoryValue* memory2 = value->as<MemoryValue>(); > + Value* base2 = memory2->lastChild(); name nit: I'd name this "memory" and "base", and in the loop below in tryAddLoadPairCandidates, call the other variable "memory2" > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:114 > + if (value2->type() != Int32 && value2->type() != Int64) > + return; > + if (!index) > + return; > + Value* before = block->at(index - 1); > + if (before->opcode() != Store) > + return; See comment below. I feel like this should be doing code motion to move stores to be adjacent when possible. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:165 > + if (ignoredValues.contains(memory1) || ignoredValues.contains(memory2) || !controlEquivalent(memory1, memory2)) > + continue; This isn't a sufficient test for soundness. Imagine you have: load(p) store(p + 8) load(p + 8) This program will be turned into: load(p) load(p + 8) store(p + 8) which is wrong. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:208 > + for (auto pair : storePairCandidates) { > + // *** newMemory2 = Store(value2, base, offset) > + // memory1 = Store(value1, base, offset + bytes) ---> memory1 = Store(value1, base, offset + bytes) > + // memory2 = Store(value2, base, offset) memory2 = Identity(newMemory2) > + MemoryValue* memory1 = pair.key; > + MemoryValue* memory2 = pair.value; > + Value* value2 = memory2->child(0); > + Value* base2 = memory2->child(1); > + MemoryValue* newMemory2 = insertionSet.insert<MemoryValue>(valueIndexInBasicBlock(memory1), Store, memory1->origin(), value2, base2); > + newMemory2->setOffset(memory2->offset()); > + memory2->replaceWithIdentity(newMemory2); > + > + insertionSet.execute(memory1->owner); > + } I don't understand how this works at all. Stores don't produce values. Why isn't store just doing the same thing as loads, and just doing code motion on the stores so they're next to each other? I don't get the point of just looking for adjacent values. Shouldn't the purpose of this canonicalization be to produce adjacent stores when sound? > Source/JavaScriptCore/b3/B3LowerToAir.cpp:2639 > + if (m_locked.contains(memory1) || m_locked.contains(base1)) I don't think base1 can be locked, right? Since base1 == base2, and if we're generating code for memory2, base2 has to not be locked. > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3505 > + auto tryAppendStorePair = [&] () -> bool { some of this detection is similar to the load case. Can we have a helper that detects if it's legal to turn into load/store pair? > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3523 > + MemoryValue* memory2 = m_value->as<MemoryValue>(); > + Value* value2 = memory2->child(0); > + Air::Opcode opcode = tryOpcodeForType(StorePair32, StorePair64, value2->type()); > + if (!isValidForm(opcode, Arg::Tmp, Arg::Tmp, Arg::Addr) || !m_index) > + return false; > + Value* before = m_block->at(m_index - 1); > + if (before->opcode() != Store) > + return false; > + > + MemoryValue* memory1 = before->as<MemoryValue>(); > + Value* value1 = memory1->child(0); > + Value* base1 = memory1->child(1); > + Value* base2 = memory2->child(1); > + Value::OffsetType offset1 = memory1->offset(); > + Value::OffsetType offset2 = memory2->offset(); > + Value::OffsetType bytes = opcode == StorePair32 ? 4 : 8; > + if (value1->type() != value2->type() || base1 != base2 || offset2 - offset1 != bytes) > + return false; Everything up to here seems essentially the same, maybe except for checking the type of the values, but you can easily add a branch for that. > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3524 > + if (m_locked.contains(memory1) || m_locked.contains(value1) || m_locked.contains(value2) || m_locked.contains(base1)) similar comment here about checking about the locked base
Yijia Huang
Comment 9 2021-08-26 04:31:19 PDT
Yijia Huang
Comment 10 2021-08-26 04:35:27 PDT
Yijia Huang
Comment 11 2021-08-26 04:41:10 PDT
Yijia Huang
Comment 12 2021-08-26 04:56:23 PDT
Yijia Huang
Comment 13 2021-08-26 08:56:35 PDT
Yijia Huang
Comment 14 2021-08-26 13:58:14 PDT
Yijia Huang
Comment 15 2021-08-26 14:19:46 PDT
Yijia Huang
Comment 16 2021-08-26 14:57:43 PDT
Yijia Huang
Comment 17 2021-08-26 15:25:57 PDT
Saam Barati
Comment 18 2021-08-26 19:06:06 PDT
Comment on attachment 436577 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436577&action=review > Source/JavaScriptCore/ChangeLog:88 > + The equivalent Pattern is: > + Store Pair Pattern 1: > + memory1 = Store(value1, base, offset + bytes) > + ... > + memory2 = Store(value2, base, offset) > + Store Pair Pattern 2: > + memory1 = Store(value1, base, offset) > + ... > + memory2 = Store(value2, base, offset + bytes) > + > + Note: The bytes means 4/8 for 32/64 bit memory ops. > + > + First, we need to convert to the canonical form, which is subject to control equivalence. The > + conversion is shown in below: > + > + Convert Store Pair Pattern 1 to the Canonical Form: > + memory1 = Store(value1, base, offset + bytes) memory1 = Nop > + ... ... > + memory2 = Store(value2, base, offset) --> memory2 = Store(value2, base, offset) > + ... newMemory1 = Store(value1, base, offset + bytes) > + > + Convert Store Pair Pattern 2 to the Canonical Form: > + memory1 = Store(value1, base, offset) memory1 = Nop > + ... ... > + ... newMemory1 = Store(value1, base, offset) > + memory2 = Store(value2, base, offset + bytes) --> memory2 = Store(value2, base, offset + bytes) > + > + Then, lower the canonical form to Air: > + memory1 = Store(value1, base, offset) > + memory2 = Store(value2, base, offset + bytes) --> StorePair %value1, %value2, (%base, offset) I think you can just say something like "Store Pair canonicalization works like Load pair, except we don't need to worry about users of the store" > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:62 > +static void allPathsUtil(BasicBlock* start, BasicBlock* end, bool* visited, unsigned* path, unsigned pathIndex, HashSet<unsigned>& blocks) > +{ > + if (start == end) { > + for (unsigned i = 1; i < pathIndex; ++i) > + blocks.add(path[i]); > + return; > + } > + > + unsigned startBlockIndex = start->index(); > + visited[startBlockIndex] = true; > + path[pathIndex] = startBlockIndex; > + for (BasicBlock* next : start->successorBlocks()) { > + if (!visited[next->index()]) > + allPathsUtil(next, end, visited, path, pathIndex + 1, blocks); > + } > + visited[startBlockIndex] = false; > +} This algorithm is overkill. We can cheat, since we know that start dominates end (we can assert it), we don't need to do anything crazy to backtrack if we don't reach end. We'll know each path leaving start will reach end. So we can just iterate successors iteratively, checking if they're visited, to add to a set of basic blocks. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:77 > + HeapRange range = m1->opcode() == Load ? m1->effects().reads : m1->effects().writes; this isn't quite right for fenced operations. I also don't think we can move anything that's fenced around anything else that's fenced. So we need to be careful w.r.t that. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:78 > + for (unsigned i : blocks) { Let's tabulate this information on a per-block basis (maybe we can do this lazily when first needed), by taking the union of all effects in an entire block. So we don't have to repeatedly iterate all values in a block. We can just our cache for the summation of a block's effects. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:81 > + if (value->effects().reads.overlaps(range) > + || value->effects().writes.overlaps(range)) let's call effects() once, and use its result. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:142 > + // we should double check in case any collisions appears in base offset hashing why would there be a collision? > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:240 > + for (Value* child : value->children()) { > + auto iter = loadUses.find(child); > + if (iter != loadUses.end()) > + iter->value.append(value); > + } I think this algorithm would be easier if you just moved memory2 to memory1, since memory1 dominates memory2, you can always move memory2 to memory1's program location without violating SSA. (There are other constraints to be able to move it, of course.) Doing that would allow you to drop this loadUses stuff. > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3480 > + if (!isValidForm(opcode, Arg::PreIndex, Arg::Tmp) || !m_index || memory->accessBank() != GP) add a test? I also think this code isn't considering if the memory value has a fence, and doing the wrong thing there. Same for load with increment. > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3523 > + if (!isValidForm(opcode, Arg::Tmp, Arg::Tmp, Arg::Addr) || !m_index || memory2->accessBank() != GP) do we have a test for accessBank() here? > Source/JavaScriptCore/b3/air/AirOpcode.opcodes:673 > +arm64: StorePair64 U:G:64, U:G:64, D:G:64 this is sorta weird, since it's defining 128 bits of stuff. > Source/JavaScriptCore/b3/air/AirOpcode.opcodes:676 > +arm64: LoadPair64 U:G:64, D:G:64, D:G:64 ditto, bur reading 128 bits > Source/JavaScriptCore/b3/air/AirOpcode.opcodes:695 > +arm64: StorePair32 U:G:32, U:G:32, ZD:G:32 ditto, this is sorta weird since it's defining 64bits of stuff > Source/JavaScriptCore/b3/air/AirOpcode.opcodes:698 > +arm64: LoadPair32 U:G:32, ZD:G:32, ZD:G:32 ditto, but reading.
Yijia Huang
Comment 19 2021-08-27 06:51:21 PDT
Comment on attachment 436577 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436577&action=review Thanks for reviewing. :) >> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:78 >> + for (unsigned i : blocks) { > > Let's tabulate this information on a per-block basis (maybe we can do this lazily when first needed), by taking the union of all effects in an entire block. So we don't have to repeatedly iterate all values in a block. We can just our cache for the summation of a block's effects. Yes, I agree. I was thinking to merge read-heap-range intervals and write-heap-range intervals for each block in the first preorder traversal. Then, merge and cache the read-heap-range intervals and write-heap-range intervals for each pair of <start-memory-value, end-memory-value> for further loop ups. However, this is only good for large CFGs. >> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:142 >> + // we should double check in case any collisions appears in base offset hashing > > why would there be a collision? If you still remember, I was talked with Keith about this on Slack (https://a1391192.slack.com/archives/GM9EK0UGL/p1629819911340600?thread_ts=1629757843.326800&cid=GM9EK0UGL). There is no perfect solution to hash two integers, even using pairIntHash(). This is just a double-check. >> Source/JavaScriptCore/b3/B3LowerToAir.cpp:3480 >> + if (!isValidForm(opcode, Arg::PreIndex, Arg::Tmp) || !m_index || memory->accessBank() != GP) > > add a test? > > > I also think this code isn't considering if the memory value has a fence, and doing the wrong thing there. Same for load with increment. This one is just moving the Add value not the Memory value. Would the fence effect still bother the Memory value?
Yijia Huang
Comment 20 2021-08-27 17:10:28 PDT
The initial version of this work is done and it passes all tests. However, the patch still contains the optimizations and some potential violations (e.g. fence check for loads and stores).
Yijia Huang
Comment 21 2024-08-13 13:02:00 PDT
*** This bug has been marked as a duplicate of bug 278035 ***
Note You need to log in before you can comment on or make changes to this bug.