Summary: | Node flags should be an OptionSet | ||||||
---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Ryosuke Niwa <rniwa> | ||||
Component: | DOM | Assignee: | Ryosuke Niwa <rniwa> | ||||
Status: | RESOLVED FIXED | ||||||
Severity: | Normal | CC: | benjamin, cdumez, cmarcelo, darin, esprehn+autocc, ews-watchlist, kangil.han, keith_miller, koivisto, sabouhallawa, simon.fraser, webkit-bug-importer, ysuzuki, zalan | ||||
Priority: | P2 | Keywords: | InRadar | ||||
Version: | WebKit Nightly Build | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
See Also: | https://bugs.webkit.org/show_bug.cgi?id=216335 | ||||||
Bug Depends on: | 216264 | ||||||
Bug Blocks: | |||||||
Attachments: |
|
Description
Ryosuke Niwa
2020-09-08 23:58:30 PDT
Created attachment 408317 [details]
Patch
Comment on attachment 408317 [details]
Patch
Nice!
(In reply to Antti Koivisto from comment #2) > Comment on attachment 408317 [details] > Patch > > Nice! Indeed! SO MUCH CLEANER! Committed r266776: <https://trac.webkit.org/changeset/266776> Comment on attachment 408317 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=408317&action=review > Source/WTF/wtf/OptionSet.h:198 > + m_storage = (m_storage & ~optionSet.m_storage) | (-static_cast<StorageType>(value) & optionSet.m_storage); I think this can be written also like this: value ? add(optionSet) : remove(optionSet); This will require one COMPARISON and either OR and ASSIGNMENT (m_storage |= optionSet.m_storage) or NEGATION, AND and ASSIGNMENT (m_storage &= ~optionSet.m_storage). So the maximum is 4 operations But the solution of this patch would require: NEGATION, AND, CASTING, SUBTRACT, AND, OR and ASSIGNMENT. So the the maximum is 7 operations. Besides it is harder to read. Comment on attachment 408317 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=408317&action=review >> Source/WTF/wtf/OptionSet.h:198 >> + m_storage = (m_storage & ~optionSet.m_storage) | (-static_cast<StorageType>(value) & optionSet.m_storage); > > I think this can be written also like this: > > value ? add(optionSet) : remove(optionSet); > > This will require one COMPARISON and either OR and ASSIGNMENT (m_storage |= optionSet.m_storage) or NEGATION, AND and ASSIGNMENT (m_storage &= ~optionSet.m_storage). So the maximum is 4 operations > > But the solution of this patch would require: NEGATION, AND, CASTING, SUBTRACT, AND, OR and ASSIGNMENT. So the the maximum is 7 operations. Besides it is harder to read. We can prove whether it’s more efficient or not by testing. (On some architectures, branching is much more expensive than logical operations.) (In reply to Said Abou-Hallawa from comment #6) > Comment on attachment 408317 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=408317&action=review > > > Source/WTF/wtf/OptionSet.h:198 > > + m_storage = (m_storage & ~optionSet.m_storage) | (-static_cast<StorageType>(value) & optionSet.m_storage); > > I think this can be written also like this: > > value ? add(optionSet) : remove(optionSet); > > This will require one COMPARISON and either OR and ASSIGNMENT (m_storage |= > optionSet.m_storage) or NEGATION, AND and ASSIGNMENT (m_storage &= > ~optionSet.m_storage). So the maximum is 4 operations > > But the solution of this patch would require: NEGATION, AND, CASTING, > SUBTRACT, AND, OR and ASSIGNMENT. So the the maximum is 7 operations. > Besides it is harder to read. Counting the number of operations like that is hardly useful in modern CPU. The issue here is branching. Branchless code utilizes CPU pipeline better & doesn't evict other things in the branch predictor. Having said that, it looks like clang 10.0 can optimize both of these code pretty well with cmovne. Here's output with the branch: https://godbolt.org/z/qqeb5K Here's output with the copy I moved from Node.h: https://godbolt.org/z/jvsoE7 Comment on attachment 408317 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=408317&action=review >>>> Source/WTF/wtf/OptionSet.h:198 >>>> + m_storage = (m_storage & ~optionSet.m_storage) | (-static_cast<StorageType>(value) & optionSet.m_storage); >>> >>> I think this can be written also like this: >>> >>> value ? add(optionSet) : remove(optionSet); >>> >>> This will require one COMPARISON and either OR and ASSIGNMENT (m_storage |= optionSet.m_storage) or NEGATION, AND and ASSIGNMENT (m_storage &= ~optionSet.m_storage). So the maximum is 4 operations >>> >>> But the solution of this patch would require: NEGATION, AND, CASTING, SUBTRACT, AND, OR and ASSIGNMENT. So the the maximum is 7 operations. Besides it is harder to read. >> >> We can prove whether it’s more efficient or not by testing. (On some architectures, branching is much more expensive than logical operations.) > > Counting the number of operations like that is hardly useful in modern CPU. The issue here is branching. Branchless code utilizes CPU pipeline better & doesn't evict other things in the branch predictor. > > Having said that, it looks like clang 10.0 can optimize both of these code pretty well with cmovne. > > Here's output with the branch: https://godbolt.org/z/qqeb5K > Here's output with the copy I moved from Node.h: https://godbolt.org/z/jvsoE7 Branches that are predictable are cheaper than more instructions in my experience. Generally branches are cheaper than a conditional move because conditional moves are not speculated by the CPU. That said if you know the result of the cmov is not feeding into a branch then it might be better than filling the branch predictor. Hm... ARM outputs are interesting. I couldn't get clang to work so I only have gcc output but x86 output was basically same between clang & gcc so I'd imagine this is going to be similar. Here, the one which uses branch ends up generating 3 instructions: https://godbolt.org/z/8b5c1s Whereas the one which avoids the branch results in 4 instructions: https://godbolt.org/z/GfPjox Alright, this is confusing so I'm generating ARM64 Node::insertedIntoAncestor and Node::removedFromAncestor. This is what we get with the current code: __ZN7WebCore4Node20insertedIntoAncestorENS0_13InsertionTypeERNS_13ContainerNodeE: 0000000001031c00 stp x29, x30, [sp, #-0x10]! 0000000001031c04 mov x29, sp 0000000001031c08 tbnz w1, #0x0, 0x1031c28 0000000001031c0c ldr w8, [x0, #0x1c] 0000000001031c10 ldrb w9, [x2, #0x1d] 0000000001031c14 tbnz w9, #0x3, 0x1031c40 0000000001031c18 tbnz w8, #0xa, 0x1031c4c 0000000001031c1c mov w0, #0x0 0000000001031c20 ldp x29, x30, [sp], #0x10 0000000001031c24 ret 0000000001031c28 ldr w8, [x0, #0x1c] 0000000001031c2c orr w8, w8, #0x400 0000000001031c30 str w8, [x0, #0x1c] 0000000001031c34 mov w8, w8 0000000001031c38 ldrb w9, [x2, #0x1d] 0000000001031c3c tbz w9, #0x3, 0x1031c18 0000000001031c40 orr w8, w8, #0x800 0000000001031c44 str w8, [x0, #0x1c] 0000000001031c48 tbz w8, #0xa, 0x1031c1c 0000000001031c4c ldr x8, [x0, #0x28] 0000000001031c50 ldr x8, [x8, #0x8] 0000000001031c54 ldr x9, [x8, #0x5d0] 0000000001031c58 cbz x9, 0x1031c1c 0000000001031c5c ldrb w9, [x8, #0x96a] 0000000001031c60 cbnz w9, 0x1031c1c 0000000001031c64 ldrb w8, [x8, #0x960] 0000000001031c68 cbnz w8, 0x1031c1c 0000000001031c6c ldr x8, [x0, #0x40] 0000000001031c70 mvn x9, x8 0000000001031c74 tst x9, #0x3000000000000 0000000001031c78 b.eq 0x1031c84 0000000001031c7c orr x8, x8, #0x3000000000000 0000000001031c80 str x8, [x0, #0x40] 0000000001031c84 bl 0x102fc54 0000000001031c88 mov w0, #0x0 0000000001031c8c ldp x29, x30, [sp], #0x10 0000000001031c90 ret __ZN7WebCore4Node19removedFromAncestorENS0_11RemovalTypeERNS_13ContainerNodeE: 0000000001031c94 stp x20, x19, [sp, #-0x20]! 0000000001031c98 stp x29, x30, [sp, #0x10] 0000000001031c9c add x29, sp, #0x10 0000000001031ca0 mov x19, x0 0000000001031ca4 ldr w8, [x0, #0x1c] 0000000001031ca8 tbz w1, #0x0, 0x1031cb4 0000000001031cac and w8, w8, #0xfffffbff 0000000001031cb0 str w8, [x19, #0x1c] 0000000001031cb4 tbz w8, #0xb, 0x1031cd0 0000000001031cb8 ldr x9, [x19, #0x28] 0000000001031cbc ldr x9, [x9] 0000000001031cc0 ldrb w9, [x9, #0x1d] 0000000001031cc4 tbnz w9, #0x1, 0x1031cd0 0000000001031cc8 and w8, w8, #0xfffff7ff 0000000001031ccc str w8, [x19, #0x1c] 0000000001031cd0 and x8, x1, #0x1 0000000001031cd4 adrp x9, 5090 ; 0x2413000 0000000001031cd8 add x9, x9, #0xaf8 0000000001031cdc ldrb w9, [x9] 0000000001031ce0 cmp w9, #0x0 0000000001031ce4 ccmp x8, #0x0, #0x4, ne 0000000001031ce8 b.eq 0x1031d0c 0000000001031cec ldr x8, [x2, #0x28] 0000000001031cf0 ldr x0, [x8, #0x8] 0000000001031cf4 bl 0xfb60a8 0000000001031cf8 cbz x0, 0x1031d0c 0000000001031cfc mov x1, x19 0000000001031d00 ldp x29, x30, [sp, #0x10] 0000000001031d04 ldp x20, x19, [sp], #0x20 0000000001031d08 b 0xcd69f0 0000000001031d0c ldp x29, x30, [sp, #0x10] 0000000001031d10 ldp x20, x19, [sp], #0x20 0000000001031d14 ret Looks like we generate the identical code from the conditional version: __ZN7WebCore4Node20insertedIntoAncestorENS0_13InsertionTypeERNS_13ContainerNodeE: 0000000001031be8 stp x29, x30, [sp, #-0x10]! 0000000001031bec mov x29, sp 0000000001031bf0 tbnz w1, #0x0, 0x1031c10 0000000001031bf4 ldr w8, [x0, #0x1c] 0000000001031bf8 ldrb w9, [x2, #0x1d] 0000000001031bfc tbnz w9, #0x3, 0x1031c28 0000000001031c00 tbnz w8, #0xa, 0x1031c34 0000000001031c04 mov w0, #0x0 0000000001031c08 ldp x29, x30, [sp], #0x10 0000000001031c0c ret 0000000001031c10 ldr w8, [x0, #0x1c] 0000000001031c14 orr w8, w8, #0x400 0000000001031c18 str w8, [x0, #0x1c] 0000000001031c1c mov w8, w8 0000000001031c20 ldrb w9, [x2, #0x1d] 0000000001031c24 tbz w9, #0x3, 0x1031c00 0000000001031c28 orr w8, w8, #0x800 0000000001031c2c str w8, [x0, #0x1c] 0000000001031c30 tbz w8, #0xa, 0x1031c04 0000000001031c34 ldr x8, [x0, #0x28] 0000000001031c38 ldr x8, [x8, #0x8] 0000000001031c3c ldr x9, [x8, #0x5d0] 0000000001031c40 cbz x9, 0x1031c04 0000000001031c44 ldrb w9, [x8, #0x96a] 0000000001031c48 cbnz w9, 0x1031c04 0000000001031c4c ldrb w8, [x8, #0x960] 0000000001031c50 cbnz w8, 0x1031c04 0000000001031c54 ldr x8, [x0, #0x40] 0000000001031c58 mvn x9, x8 0000000001031c5c tst x9, #0x3000000000000 0000000001031c60 b.eq 0x1031c6c 0000000001031c64 orr x8, x8, #0x3000000000000 0000000001031c68 str x8, [x0, #0x40] 0000000001031c6c bl 0x102fc3c 0000000001031c70 mov w0, #0x0 0000000001031c74 ldp x29, x30, [sp], #0x10 0000000001031c78 ret __ZN7WebCore4Node19removedFromAncestorENS0_11RemovalTypeERNS_13ContainerNodeE: 0000000001031c7c stp x20, x19, [sp, #-0x20]! 0000000001031c80 stp x29, x30, [sp, #0x10] 0000000001031c84 add x29, sp, #0x10 0000000001031c88 mov x19, x0 0000000001031c8c ldr w8, [x0, #0x1c] 0000000001031c90 tbz w1, #0x0, 0x1031c9c 0000000001031c94 and w8, w8, #0xfffffbff 0000000001031c98 str w8, [x19, #0x1c] 0000000001031c9c tbz w8, #0xb, 0x1031cb8 0000000001031ca0 ldr x9, [x19, #0x28] 0000000001031ca4 ldr x9, [x9] 0000000001031ca8 ldrb w9, [x9, #0x1d] 0000000001031cac tbnz w9, #0x1, 0x1031cb8 0000000001031cb0 and w8, w8, #0xfffff7ff 0000000001031cb4 str w8, [x19, #0x1c] 0000000001031cb8 and x8, x1, #0x1 0000000001031cbc adrp x9, 5090 ; 0x2413000 0000000001031cc0 add x9, x9, #0xaf8 0000000001031cc4 ldrb w9, [x9] 0000000001031cc8 cmp w9, #0x0 0000000001031ccc ccmp x8, #0x0, #0x4, ne 0000000001031cd0 b.eq 0x1031cf4 0000000001031cd4 ldr x8, [x2, #0x28] 0000000001031cd8 ldr x0, [x8, #0x8] 0000000001031cdc bl 0xfb6098 0000000001031ce0 cbz x0, 0x1031cf4 0000000001031ce4 mov x1, x19 0000000001031ce8 ldp x29, x30, [sp, #0x10] 0000000001031cec ldp x20, x19, [sp], #0x20 0000000001031cf0 b 0xcd69e0 0000000001031cf4 ldp x29, x30, [sp, #0x10] 0000000001031cf8 ldp x20, x19, [sp], #0x20 0000000001031cfc ret Tracking the simplification in https://bugs.webkit.org/show_bug.cgi?id=216335. Generating identical code for both of those makes me feel good about compilers! That should be their goal in a case like this. |