Bug 228781

Summary:

[ARM64] Add pattern matching for Load/Store Pair

Product:

WebKit

Reporter:

Yijia Huang <yijia_huang>

Component:

JavaScriptCore

Assignee:

Yijia Huang <yijia_huang>

Status:

RESOLVED DUPLICATE

Severity:

Normal

CC:

ews-watchlist, keith_miller, mark.lam, msaboff, saam, tzagallo, webkit-bug-importer

Priority:

Keywords:

InRadar

Version:

WebKit Nightly Build

Hardware:

Unspecified

OS:

Unspecified

Attachments:

Description	Flags
Patch	none
Patch	none
Patch	none
Patch	none
Patch	none
Patch	none
Patch	none
Patch	none
Patch	none
Patch	none
Patch	none
Patch	ews-feeder: commit-queue-
Patch	none
Patch	none
Patch	saam: review-

Yijia Huang

Reported 2021-08-04 10:14:54 PDT

...

Attachments
Patch (27.91 KB, patch) 2021-08-08 02:58 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (27.72 KB, patch) 2021-08-08 02:59 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (27.72 KB, patch) 2021-08-08 13:25 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (31.37 KB, patch) 2021-08-08 14:09 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (42.88 KB, patch) 2021-08-08 20:58 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (42.81 KB, patch) 2021-08-09 00:06 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (54.68 KB, patch) 2021-08-26 04:31 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (54.66 KB, patch) 2021-08-26 04:35 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (55.08 KB, patch) 2021-08-26 04:41 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (55.21 KB, patch) 2021-08-26 04:56 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (55.52 KB, patch) 2021-08-26 08:56 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (56.58 KB, patch) 2021-08-26 13:58 PDT, Yijia Huang	ews-feeder: commit-queue-	Details Formatted Diff Diff
Patch (57.21 KB, patch) 2021-08-26 14:19 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (57.42 KB, patch) 2021-08-26 14:57 PDT, Yijia Huang	no flags	Details Formatted Diff Diff
Patch (56.93 KB, patch) 2021-08-26 15:25 PDT, Yijia Huang	saam: review-	Details Formatted Diff Diff
Show Obsolete (14) View All Add attachment proposed patch, testcase, etc.

Yijia Huang

Comment 1 2021-08-08 02:58:45 PDT

Created attachment 435146 [details] Patch

Yijia Huang

Comment 2 2021-08-08 02:59:43 PDT

Created attachment 435147 [details] Patch

Yijia Huang

Comment 3 2021-08-08 13:25:33 PDT

Created attachment 435153 [details] Patch

Yijia Huang

Comment 4 2021-08-08 14:09:19 PDT

Created attachment 435155 [details] Patch

Yijia Huang

Comment 5 2021-08-08 20:58:19 PDT

Created attachment 435165 [details] Patch

Yijia Huang

Comment 6 2021-08-09 00:06:26 PDT

Created attachment 435170 [details] Patch

Radar WebKit Bug Importer

Comment 7 2021-08-11 10:15:18 PDT

<rdar://problem/81800583>

Saam Barati

Comment 8 2021-08-16 20:15:35 PDT

Comment on attachment 435170 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=435170&action=review Would be good to add some tests for the bugs I found. > Source/JavaScriptCore/ChangeLog:21 > + memory2 = Load(base, offset + bytes) > + Or: > + memory1 = Load(base, offset + bytes) Worth specifying what "bytes" is here. I'm assuming 4/8 for 32/64 bit memory ops > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:60 > + IndexSet<Value*> ignoredValues; name nit: ignoredValues => alreadyHandledValues > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:65 > + auto tryAddLoadStorePairCandidates = [&] (Value* value) { I like to write code like this by having the lambda be in the loop it's called from. I think you can also make index local to the loop then, too, instead of defining it above. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:70 > + // *** I like "..." over "***" > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:77 > + MemoryValue* memory2 = value->as<MemoryValue>(); > + Value* base2 = memory2->lastChild(); name nit: I'd name this "memory" and "base", and in the loop below in tryAddLoadPairCandidates, call the other variable "memory2" > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:114 > + if (value2->type() != Int32 && value2->type() != Int64) > + return; > + if (!index) > + return; > + Value* before = block->at(index - 1); > + if (before->opcode() != Store) > + return; See comment below. I feel like this should be doing code motion to move stores to be adjacent when possible. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:165 > + if (ignoredValues.contains(memory1) || ignoredValues.contains(memory2) || !controlEquivalent(memory1, memory2)) > + continue; This isn't a sufficient test for soundness. Imagine you have: load(p) store(p + 8) load(p + 8) This program will be turned into: load(p) load(p + 8) store(p + 8) which is wrong. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:208 > + for (auto pair : storePairCandidates) { > + // *** newMemory2 = Store(value2, base, offset) > + // memory1 = Store(value1, base, offset + bytes) ---> memory1 = Store(value1, base, offset + bytes) > + // memory2 = Store(value2, base, offset) memory2 = Identity(newMemory2) > + MemoryValue* memory1 = pair.key; > + MemoryValue* memory2 = pair.value; > + Value* value2 = memory2->child(0); > + Value* base2 = memory2->child(1); > + MemoryValue* newMemory2 = insertionSet.insert<MemoryValue>(valueIndexInBasicBlock(memory1), Store, memory1->origin(), value2, base2); > + newMemory2->setOffset(memory2->offset()); > + memory2->replaceWithIdentity(newMemory2); > + > + insertionSet.execute(memory1->owner); > + } I don't understand how this works at all. Stores don't produce values. Why isn't store just doing the same thing as loads, and just doing code motion on the stores so they're next to each other? I don't get the point of just looking for adjacent values. Shouldn't the purpose of this canonicalization be to produce adjacent stores when sound? > Source/JavaScriptCore/b3/B3LowerToAir.cpp:2639 > + if (m_locked.contains(memory1) || m_locked.contains(base1)) I don't think base1 can be locked, right? Since base1 == base2, and if we're generating code for memory2, base2 has to not be locked. > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3505 > + auto tryAppendStorePair = [&] () -> bool { some of this detection is similar to the load case. Can we have a helper that detects if it's legal to turn into load/store pair? > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3523 > + MemoryValue* memory2 = m_value->as<MemoryValue>(); > + Value* value2 = memory2->child(0); > + Air::Opcode opcode = tryOpcodeForType(StorePair32, StorePair64, value2->type()); > + if (!isValidForm(opcode, Arg::Tmp, Arg::Tmp, Arg::Addr) || !m_index) > + return false; > + Value* before = m_block->at(m_index - 1); > + if (before->opcode() != Store) > + return false; > + > + MemoryValue* memory1 = before->as<MemoryValue>(); > + Value* value1 = memory1->child(0); > + Value* base1 = memory1->child(1); > + Value* base2 = memory2->child(1); > + Value::OffsetType offset1 = memory1->offset(); > + Value::OffsetType offset2 = memory2->offset(); > + Value::OffsetType bytes = opcode == StorePair32 ? 4 : 8; > + if (value1->type() != value2->type() || base1 != base2 || offset2 - offset1 != bytes) > + return false; Everything up to here seems essentially the same, maybe except for checking the type of the values, but you can easily add a branch for that. > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3524 > + if (m_locked.contains(memory1) || m_locked.contains(value1) || m_locked.contains(value2) || m_locked.contains(base1)) similar comment here about checking about the locked base

Yijia Huang

Comment 9 2021-08-26 04:31:19 PDT

Created attachment 436496 [details] Patch

Yijia Huang

Comment 10 2021-08-26 04:35:27 PDT

Created attachment 436497 [details] Patch

Yijia Huang

Comment 11 2021-08-26 04:41:10 PDT

Created attachment 436498 [details] Patch

Yijia Huang

Comment 12 2021-08-26 04:56:23 PDT

Created attachment 436499 [details] Patch

Yijia Huang

Comment 13 2021-08-26 08:56:35 PDT

Created attachment 436516 [details] Patch

Yijia Huang

Comment 14 2021-08-26 13:58:14 PDT

Created attachment 436562 [details] Patch

Yijia Huang

Comment 15 2021-08-26 14:19:46 PDT

Created attachment 436565 [details] Patch

Yijia Huang

Comment 16 2021-08-26 14:57:43 PDT

Created attachment 436574 [details] Patch

Yijia Huang

Comment 17 2021-08-26 15:25:57 PDT

Created attachment 436577 [details] Patch

Saam Barati

Comment 18 2021-08-26 19:06:06 PDT

Comment on attachment 436577 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436577&action=review > Source/JavaScriptCore/ChangeLog:88 > + The equivalent Pattern is: > + Store Pair Pattern 1: > + memory1 = Store(value1, base, offset + bytes) > + ... > + memory2 = Store(value2, base, offset) > + Store Pair Pattern 2: > + memory1 = Store(value1, base, offset) > + ... > + memory2 = Store(value2, base, offset + bytes) > + > + Note: The bytes means 4/8 for 32/64 bit memory ops. > + > + First, we need to convert to the canonical form, which is subject to control equivalence. The > + conversion is shown in below: > + > + Convert Store Pair Pattern 1 to the Canonical Form: > + memory1 = Store(value1, base, offset + bytes) memory1 = Nop > + ... ... > + memory2 = Store(value2, base, offset) --> memory2 = Store(value2, base, offset) > + ... newMemory1 = Store(value1, base, offset + bytes) > + > + Convert Store Pair Pattern 2 to the Canonical Form: > + memory1 = Store(value1, base, offset) memory1 = Nop > + ... ... > + ... newMemory1 = Store(value1, base, offset) > + memory2 = Store(value2, base, offset + bytes) --> memory2 = Store(value2, base, offset + bytes) > + > + Then, lower the canonical form to Air: > + memory1 = Store(value1, base, offset) > + memory2 = Store(value2, base, offset + bytes) --> StorePair %value1, %value2, (%base, offset) I think you can just say something like "Store Pair canonicalization works like Load pair, except we don't need to worry about users of the store" > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:62 > +static void allPathsUtil(BasicBlock* start, BasicBlock* end, bool* visited, unsigned* path, unsigned pathIndex, HashSet<unsigned>& blocks) > +{ > + if (start == end) { > + for (unsigned i = 1; i < pathIndex; ++i) > + blocks.add(path[i]); > + return; > + } > + > + unsigned startBlockIndex = start->index(); > + visited[startBlockIndex] = true; > + path[pathIndex] = startBlockIndex; > + for (BasicBlock* next : start->successorBlocks()) { > + if (!visited[next->index()]) > + allPathsUtil(next, end, visited, path, pathIndex + 1, blocks); > + } > + visited[startBlockIndex] = false; > +} This algorithm is overkill. We can cheat, since we know that start dominates end (we can assert it), we don't need to do anything crazy to backtrack if we don't reach end. We'll know each path leaving start will reach end. So we can just iterate successors iteratively, checking if they're visited, to add to a set of basic blocks. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:77 > + HeapRange range = m1->opcode() == Load ? m1->effects().reads : m1->effects().writes; this isn't quite right for fenced operations. I also don't think we can move anything that's fenced around anything else that's fenced. So we need to be careful w.r.t that. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:78 > + for (unsigned i : blocks) { Let's tabulate this information on a per-block basis (maybe we can do this lazily when first needed), by taking the union of all effects in an entire block. So we don't have to repeatedly iterate all values in a block. We can just our cache for the summation of a block's effects. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:81 > + if (value->effects().reads.overlaps(range) > + || value->effects().writes.overlaps(range)) let's call effects() once, and use its result. > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:142 > + // we should double check in case any collisions appears in base offset hashing why would there be a collision? > Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:240 > + for (Value* child : value->children()) { > + auto iter = loadUses.find(child); > + if (iter != loadUses.end()) > + iter->value.append(value); > + } I think this algorithm would be easier if you just moved memory2 to memory1, since memory1 dominates memory2, you can always move memory2 to memory1's program location without violating SSA. (There are other constraints to be able to move it, of course.) Doing that would allow you to drop this loadUses stuff. > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3480 > + if (!isValidForm(opcode, Arg::PreIndex, Arg::Tmp) || !m_index || memory->accessBank() != GP) add a test? I also think this code isn't considering if the memory value has a fence, and doing the wrong thing there. Same for load with increment. > Source/JavaScriptCore/b3/B3LowerToAir.cpp:3523 > + if (!isValidForm(opcode, Arg::Tmp, Arg::Tmp, Arg::Addr) || !m_index || memory2->accessBank() != GP) do we have a test for accessBank() here? > Source/JavaScriptCore/b3/air/AirOpcode.opcodes:673 > +arm64: StorePair64 U:G:64, U:G:64, D:G:64 this is sorta weird, since it's defining 128 bits of stuff. > Source/JavaScriptCore/b3/air/AirOpcode.opcodes:676 > +arm64: LoadPair64 U:G:64, D:G:64, D:G:64 ditto, bur reading 128 bits > Source/JavaScriptCore/b3/air/AirOpcode.opcodes:695 > +arm64: StorePair32 U:G:32, U:G:32, ZD:G:32 ditto, this is sorta weird since it's defining 64bits of stuff > Source/JavaScriptCore/b3/air/AirOpcode.opcodes:698 > +arm64: LoadPair32 U:G:32, ZD:G:32, ZD:G:32 ditto, but reading.

Yijia Huang

Comment 19 2021-08-27 06:51:21 PDT

Comment on attachment 436577 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436577&action=review Thanks for reviewing. :) >> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:78 >> + for (unsigned i : blocks) { > > Let's tabulate this information on a per-block basis (maybe we can do this lazily when first needed), by taking the union of all effects in an entire block. So we don't have to repeatedly iterate all values in a block. We can just our cache for the summation of a block's effects. Yes, I agree. I was thinking to merge read-heap-range intervals and write-heap-range intervals for each block in the first preorder traversal. Then, merge and cache the read-heap-range intervals and write-heap-range intervals for each pair of <start-memory-value, end-memory-value> for further loop ups. However, this is only good for large CFGs. >> Source/JavaScriptCore/b3/B3CanonicalizeLoadStorePair.cpp:142 >> + // we should double check in case any collisions appears in base offset hashing > > why would there be a collision? If you still remember, I was talked with Keith about this on Slack (https://a1391192.slack.com/archives/GM9EK0UGL/p1629819911340600?thread_ts=1629757843.326800&cid=GM9EK0UGL). There is no perfect solution to hash two integers, even using pairIntHash(). This is just a double-check. >> Source/JavaScriptCore/b3/B3LowerToAir.cpp:3480 >> + if (!isValidForm(opcode, Arg::PreIndex, Arg::Tmp) || !m_index || memory->accessBank() != GP) > > add a test? > > > I also think this code isn't considering if the memory value has a fence, and doing the wrong thing there. Same for load with increment. This one is just moving the Add value not the Memory value. Would the fence effect still bother the Memory value?

Yijia Huang

Comment 20 2021-08-27 17:10:28 PDT

The initial version of this work is done and it passes all tests. However, the patch still contains the optimizations and some potential violations (e.g. fence check for loads and stores).

Yijia Huang

Comment 21 2024-08-13 13:02:00 PDT

*** This bug has been marked as a duplicate of bug 278035 ***

Note You need to log in before you can comment on or make changes to this bug.