Bug 216305

Summary: Node flags should be an OptionSet
Product: WebKit Reporter: Ryosuke Niwa <rniwa>
Component: DOMAssignee: Ryosuke Niwa <rniwa>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, cdumez, cmarcelo, darin, esprehn+autocc, ews-watchlist, kangil.han, keith_miller, koivisto, sabouhallawa, simon.fraser, webkit-bug-importer, ysuzuki, zalan
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=216335
Bug Depends on: 216264    
Bug Blocks:    
Attachments:
Description Flags
Patch koivisto: review+

Description Ryosuke Niwa 2020-09-08 23:58:30 PDT
Node::m_nodeFlags is currently uint32_t. We should turn it into an OptionSet
since there is no states stored there after https://trac.webkit.org/r266769.
Comment 1 Ryosuke Niwa 2020-09-09 00:15:18 PDT
Created attachment 408317 [details]
Patch
Comment 2 Antti Koivisto 2020-09-09 02:01:56 PDT
Comment on attachment 408317 [details]
Patch

Nice!
Comment 3 Ryosuke Niwa 2020-09-09 02:02:40 PDT
(In reply to Antti Koivisto from comment #2)
> Comment on attachment 408317 [details]
> Patch
> 
> Nice!

Indeed! SO MUCH CLEANER!
Comment 4 Ryosuke Niwa 2020-09-09 02:03:50 PDT
Committed r266776: <https://trac.webkit.org/changeset/266776>
Comment 5 Radar WebKit Bug Importer 2020-09-09 02:04:20 PDT
<rdar://problem/68559434>
Comment 6 Said Abou-Hallawa 2020-09-09 15:18:47 PDT
Comment on attachment 408317 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=408317&action=review

> Source/WTF/wtf/OptionSet.h:198
> +        m_storage = (m_storage & ~optionSet.m_storage) | (-static_cast<StorageType>(value) & optionSet.m_storage);

I think this can be written also like this:

    value ? add(optionSet) : remove(optionSet);

This will require one COMPARISON and either OR and ASSIGNMENT (m_storage |= optionSet.m_storage) or NEGATION, AND and ASSIGNMENT (m_storage &= ~optionSet.m_storage). So the maximum is 4 operations

But the solution of this patch would require: NEGATION, AND, CASTING, SUBTRACT, AND, OR and ASSIGNMENT. So the  the maximum is 7 operations. Besides it is harder to read.
Comment 7 Darin Adler 2020-09-09 15:54:28 PDT
Comment on attachment 408317 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=408317&action=review

>> Source/WTF/wtf/OptionSet.h:198
>> +        m_storage = (m_storage & ~optionSet.m_storage) | (-static_cast<StorageType>(value) & optionSet.m_storage);
> 
> I think this can be written also like this:
> 
>     value ? add(optionSet) : remove(optionSet);
> 
> This will require one COMPARISON and either OR and ASSIGNMENT (m_storage |= optionSet.m_storage) or NEGATION, AND and ASSIGNMENT (m_storage &= ~optionSet.m_storage). So the maximum is 4 operations
> 
> But the solution of this patch would require: NEGATION, AND, CASTING, SUBTRACT, AND, OR and ASSIGNMENT. So the  the maximum is 7 operations. Besides it is harder to read.

We can prove whether it’s more efficient or not by testing. (On some architectures, branching is much more expensive than logical operations.)
Comment 8 Ryosuke Niwa 2020-09-09 16:00:23 PDT
(In reply to Said Abou-Hallawa from comment #6)
> Comment on attachment 408317 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=408317&action=review
> 
> > Source/WTF/wtf/OptionSet.h:198
> > +        m_storage = (m_storage & ~optionSet.m_storage) | (-static_cast<StorageType>(value) & optionSet.m_storage);
> 
> I think this can be written also like this:
> 
>     value ? add(optionSet) : remove(optionSet);
> 
> This will require one COMPARISON and either OR and ASSIGNMENT (m_storage |=
> optionSet.m_storage) or NEGATION, AND and ASSIGNMENT (m_storage &=
> ~optionSet.m_storage). So the maximum is 4 operations
> 
> But the solution of this patch would require: NEGATION, AND, CASTING,
> SUBTRACT, AND, OR and ASSIGNMENT. So the  the maximum is 7 operations.
> Besides it is harder to read.

Counting the number of operations like that is hardly useful in modern CPU. The issue here is branching. Branchless code utilizes CPU pipeline better & doesn't evict other things in the branch predictor.

Having said that, it looks like clang 10.0 can optimize both of these code pretty well with cmovne.

Here's output with the branch: https://godbolt.org/z/qqeb5K 
Here's output with the copy I moved from Node.h: https://godbolt.org/z/jvsoE7
Comment 9 Keith Miller 2020-09-09 16:08:42 PDT
Comment on attachment 408317 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=408317&action=review

>>>> Source/WTF/wtf/OptionSet.h:198
>>>> +        m_storage = (m_storage & ~optionSet.m_storage) | (-static_cast<StorageType>(value) & optionSet.m_storage);
>>> 
>>> I think this can be written also like this:
>>> 
>>>     value ? add(optionSet) : remove(optionSet);
>>> 
>>> This will require one COMPARISON and either OR and ASSIGNMENT (m_storage |= optionSet.m_storage) or NEGATION, AND and ASSIGNMENT (m_storage &= ~optionSet.m_storage). So the maximum is 4 operations
>>> 
>>> But the solution of this patch would require: NEGATION, AND, CASTING, SUBTRACT, AND, OR and ASSIGNMENT. So the  the maximum is 7 operations. Besides it is harder to read.
>> 
>> We can prove whether it’s more efficient or not by testing. (On some architectures, branching is much more expensive than logical operations.)
> 
> Counting the number of operations like that is hardly useful in modern CPU. The issue here is branching. Branchless code utilizes CPU pipeline better & doesn't evict other things in the branch predictor.
> 
> Having said that, it looks like clang 10.0 can optimize both of these code pretty well with cmovne.
> 
> Here's output with the branch: https://godbolt.org/z/qqeb5K 
> Here's output with the copy I moved from Node.h: https://godbolt.org/z/jvsoE7

Branches that are predictable are cheaper than more instructions in my experience. Generally branches are cheaper than a conditional move because conditional moves are not speculated by the CPU. That said if you know the result of the cmov is not feeding into a branch then it might be better than filling the branch predictor.
Comment 10 Ryosuke Niwa 2020-09-09 16:10:24 PDT
Hm... ARM outputs are interesting. I couldn't get clang to work so I only have gcc output but x86 output was basically same between clang & gcc so I'd imagine this is going to be similar.

Here, the one which uses branch ends up generating 3 instructions: https://godbolt.org/z/8b5c1s
Whereas the one which avoids the branch results in 4 instructions:
https://godbolt.org/z/GfPjox
Comment 11 Ryosuke Niwa 2020-09-09 17:35:03 PDT
Alright, this is confusing so I'm generating ARM64 Node::insertedIntoAncestor and Node::removedFromAncestor. This is what we get with the current code:

__ZN7WebCore4Node20insertedIntoAncestorENS0_13InsertionTypeERNS_13ContainerNodeE:
0000000001031c00	stp	x29, x30, [sp, #-0x10]!
0000000001031c04	mov	x29, sp
0000000001031c08	tbnz	w1, #0x0, 0x1031c28
0000000001031c0c	ldr	w8, [x0, #0x1c]
0000000001031c10	ldrb	w9, [x2, #0x1d]
0000000001031c14	tbnz	w9, #0x3, 0x1031c40
0000000001031c18	tbnz	w8, #0xa, 0x1031c4c
0000000001031c1c	mov	w0, #0x0
0000000001031c20	ldp	x29, x30, [sp], #0x10
0000000001031c24	ret
0000000001031c28	ldr	w8, [x0, #0x1c]
0000000001031c2c	orr	w8, w8, #0x400
0000000001031c30	str	w8, [x0, #0x1c]
0000000001031c34	mov	w8, w8
0000000001031c38	ldrb	w9, [x2, #0x1d]
0000000001031c3c	tbz	w9, #0x3, 0x1031c18
0000000001031c40	orr	w8, w8, #0x800
0000000001031c44	str	w8, [x0, #0x1c]
0000000001031c48	tbz	w8, #0xa, 0x1031c1c
0000000001031c4c	ldr	x8, [x0, #0x28]
0000000001031c50	ldr	x8, [x8, #0x8]
0000000001031c54	ldr	x9, [x8, #0x5d0]
0000000001031c58	cbz	x9, 0x1031c1c
0000000001031c5c	ldrb	w9, [x8, #0x96a]
0000000001031c60	cbnz	w9, 0x1031c1c
0000000001031c64	ldrb	w8, [x8, #0x960]
0000000001031c68	cbnz	w8, 0x1031c1c
0000000001031c6c	ldr	x8, [x0, #0x40]
0000000001031c70	mvn	x9, x8
0000000001031c74	tst	x9, #0x3000000000000
0000000001031c78	b.eq	0x1031c84
0000000001031c7c	orr	x8, x8, #0x3000000000000
0000000001031c80	str	x8, [x0, #0x40]
0000000001031c84	bl	0x102fc54
0000000001031c88	mov	w0, #0x0
0000000001031c8c	ldp	x29, x30, [sp], #0x10
0000000001031c90	ret
__ZN7WebCore4Node19removedFromAncestorENS0_11RemovalTypeERNS_13ContainerNodeE:
0000000001031c94	stp	x20, x19, [sp, #-0x20]!
0000000001031c98	stp	x29, x30, [sp, #0x10]
0000000001031c9c	add	x29, sp, #0x10
0000000001031ca0	mov	x19, x0
0000000001031ca4	ldr	w8, [x0, #0x1c]
0000000001031ca8	tbz	w1, #0x0, 0x1031cb4
0000000001031cac	and	w8, w8, #0xfffffbff
0000000001031cb0	str	w8, [x19, #0x1c]
0000000001031cb4	tbz	w8, #0xb, 0x1031cd0
0000000001031cb8	ldr	x9, [x19, #0x28]
0000000001031cbc	ldr	x9, [x9]
0000000001031cc0	ldrb	w9, [x9, #0x1d]
0000000001031cc4	tbnz	w9, #0x1, 0x1031cd0
0000000001031cc8	and	w8, w8, #0xfffff7ff
0000000001031ccc	str	w8, [x19, #0x1c]
0000000001031cd0	and	x8, x1, #0x1
0000000001031cd4	adrp	x9, 5090 ; 0x2413000
0000000001031cd8	add	x9, x9, #0xaf8
0000000001031cdc	ldrb	w9, [x9]
0000000001031ce0	cmp	w9, #0x0
0000000001031ce4	ccmp	x8, #0x0, #0x4, ne
0000000001031ce8	b.eq	0x1031d0c
0000000001031cec	ldr	x8, [x2, #0x28]
0000000001031cf0	ldr	x0, [x8, #0x8]
0000000001031cf4	bl	0xfb60a8
0000000001031cf8	cbz	x0, 0x1031d0c
0000000001031cfc	mov	x1, x19
0000000001031d00	ldp	x29, x30, [sp, #0x10]
0000000001031d04	ldp	x20, x19, [sp], #0x20
0000000001031d08	b	0xcd69f0
0000000001031d0c	ldp	x29, x30, [sp, #0x10]
0000000001031d10	ldp	x20, x19, [sp], #0x20
0000000001031d14	ret

Looks like we generate the identical code from the conditional version:
__ZN7WebCore4Node20insertedIntoAncestorENS0_13InsertionTypeERNS_13ContainerNodeE:
0000000001031be8	stp	x29, x30, [sp, #-0x10]!
0000000001031bec	mov	x29, sp
0000000001031bf0	tbnz	w1, #0x0, 0x1031c10
0000000001031bf4	ldr	w8, [x0, #0x1c]
0000000001031bf8	ldrb	w9, [x2, #0x1d]
0000000001031bfc	tbnz	w9, #0x3, 0x1031c28
0000000001031c00	tbnz	w8, #0xa, 0x1031c34
0000000001031c04	mov	w0, #0x0
0000000001031c08	ldp	x29, x30, [sp], #0x10
0000000001031c0c	ret
0000000001031c10	ldr	w8, [x0, #0x1c]
0000000001031c14	orr	w8, w8, #0x400
0000000001031c18	str	w8, [x0, #0x1c]
0000000001031c1c	mov	w8, w8
0000000001031c20	ldrb	w9, [x2, #0x1d]
0000000001031c24	tbz	w9, #0x3, 0x1031c00
0000000001031c28	orr	w8, w8, #0x800
0000000001031c2c	str	w8, [x0, #0x1c]
0000000001031c30	tbz	w8, #0xa, 0x1031c04
0000000001031c34	ldr	x8, [x0, #0x28]
0000000001031c38	ldr	x8, [x8, #0x8]
0000000001031c3c	ldr	x9, [x8, #0x5d0]
0000000001031c40	cbz	x9, 0x1031c04
0000000001031c44	ldrb	w9, [x8, #0x96a]
0000000001031c48	cbnz	w9, 0x1031c04
0000000001031c4c	ldrb	w8, [x8, #0x960]
0000000001031c50	cbnz	w8, 0x1031c04
0000000001031c54	ldr	x8, [x0, #0x40]
0000000001031c58	mvn	x9, x8
0000000001031c5c	tst	x9, #0x3000000000000
0000000001031c60	b.eq	0x1031c6c
0000000001031c64	orr	x8, x8, #0x3000000000000
0000000001031c68	str	x8, [x0, #0x40]
0000000001031c6c	bl	0x102fc3c
0000000001031c70	mov	w0, #0x0
0000000001031c74	ldp	x29, x30, [sp], #0x10
0000000001031c78	ret
__ZN7WebCore4Node19removedFromAncestorENS0_11RemovalTypeERNS_13ContainerNodeE:
0000000001031c7c	stp	x20, x19, [sp, #-0x20]!
0000000001031c80	stp	x29, x30, [sp, #0x10]
0000000001031c84	add	x29, sp, #0x10
0000000001031c88	mov	x19, x0
0000000001031c8c	ldr	w8, [x0, #0x1c]
0000000001031c90	tbz	w1, #0x0, 0x1031c9c
0000000001031c94	and	w8, w8, #0xfffffbff
0000000001031c98	str	w8, [x19, #0x1c]
0000000001031c9c	tbz	w8, #0xb, 0x1031cb8
0000000001031ca0	ldr	x9, [x19, #0x28]
0000000001031ca4	ldr	x9, [x9]
0000000001031ca8	ldrb	w9, [x9, #0x1d]
0000000001031cac	tbnz	w9, #0x1, 0x1031cb8
0000000001031cb0	and	w8, w8, #0xfffff7ff
0000000001031cb4	str	w8, [x19, #0x1c]
0000000001031cb8	and	x8, x1, #0x1
0000000001031cbc	adrp	x9, 5090 ; 0x2413000
0000000001031cc0	add	x9, x9, #0xaf8
0000000001031cc4	ldrb	w9, [x9]
0000000001031cc8	cmp	w9, #0x0
0000000001031ccc	ccmp	x8, #0x0, #0x4, ne
0000000001031cd0	b.eq	0x1031cf4
0000000001031cd4	ldr	x8, [x2, #0x28]
0000000001031cd8	ldr	x0, [x8, #0x8]
0000000001031cdc	bl	0xfb6098
0000000001031ce0	cbz	x0, 0x1031cf4
0000000001031ce4	mov	x1, x19
0000000001031ce8	ldp	x29, x30, [sp, #0x10]
0000000001031cec	ldp	x20, x19, [sp], #0x20
0000000001031cf0	b	0xcd69e0
0000000001031cf4	ldp	x29, x30, [sp, #0x10]
0000000001031cf8	ldp	x20, x19, [sp], #0x20
0000000001031cfc	ret
Comment 12 Ryosuke Niwa 2020-09-09 17:39:18 PDT
Tracking the simplification in https://bugs.webkit.org/show_bug.cgi?id=216335.
Comment 13 Darin Adler 2020-09-09 17:48:16 PDT
Generating identical code for both of those makes me feel good about compilers! That should be their goal in a case like this.