RESOLVED FIXED 237988
Stop returning NodeVector from functions
https://bugs.webkit.org/show_bug.cgi?id=237988
Summary Stop returning NodeVector from functions
Chris Dumez
Reported 2022-03-16 16:19:55 PDT
Stop returning NodeVector from functions and use a out-parameter instead. While this doesn't look as modern, this is actually more efficient. This is because NodeVector has a fairly large inline buffer and is thus not that cheap to "move". This was causing functions like `ContainerNode::parserAppendChild(Node&)` to spend unnecessary time under `VectorBuffer<Ref<Node, RawPtrTraits<Node>>, 11, FastMalloc>::swap(VectorBuffer<Ref<Node, RawPtrTraits<Node>>, 11, FastMalloc>&, unsigned long, unsigned long)`
Attachments
Patch (13.56 KB, patch)
2022-03-16 16:21 PDT, Chris Dumez
no flags
Patch (13.65 KB, patch)
2022-03-16 17:51 PDT, Chris Dumez
ews-feeder: commit-queue-
Chris Dumez
Comment 1 2022-03-16 16:21:36 PDT
Darin Adler
Comment 2 2022-03-16 16:27:13 PDT
Comment on attachment 454908 [details] Patch This is definitely a measurable speedup? The return-value optimization doesn’t kick in?
Chris Dumez
Comment 3 2022-03-16 16:40:40 PDT
(In reply to Darin Adler from comment #2) > Comment on attachment 454908 [details] > Patch > > This is definitely a measurable speedup? The return-value optimization > doesn’t kick in? I am still waiting on A/B testing but I can clearly see the cost in traces when running Speedometer.
Chris Dumez
Comment 4 2022-03-16 16:41:37 PDT
(In reply to Chris Dumez from comment #3) > (In reply to Darin Adler from comment #2) > > Comment on attachment 454908 [details] > > Patch > > > > This is definitely a measurable speedup? The return-value optimization > > doesn’t kick in? > > I am still waiting on A/B testing but I can clearly see the cost in traces > when running Speedometer. Sample Count, Samples %, Normalized CPU %, Symbol 41, 0.0%, 0.0%, WTF::VectorBuffer<WTF::Ref<WebCore::Node, WTF::RawPtrTraits<WebCore::Node> >, 11ul, WTF::FastMalloc>::swap(WTF::VectorBuffer<WTF::Ref<WebCore::Node, WTF::RawPtrTraits<WebCore::Node> >, 11ul, WTF::FastMalloc>&, unsigned long, unsigned long) (in WebCore) 33, 0.0%, 0.0%, WebCore::ContainerNode::parserAppendChild(WebCore::Node&) (in WebCore) 4, 0.0%, 0.0%, WebCore::ContainerNode::appendChildWithoutPreInsertionValidityCheck(WebCore::Node&) (in WebCore) 1, 0.0%, 0.0%, WebCore::executeTask(WebCore::HTMLConstructionSiteTask&) (in WebCore) 1, 0.0%, 0.0%, WebCore::Node::insertBefore(WebCore::Node&, WebCore::Node*) (in WebCore) 1, 0.0%, 0.0%, WebCore::ContainerNode::appendChild(WebCore::Node&) (in WebCore) 1, 0.0%, 0.0%, WebCore::ContainerNode::replaceAll(WebCore::Node*) (in WebCore)
Darin Adler
Comment 5 2022-03-16 16:42:39 PDT
I’m really surprised that this involves a call to swap. Maybe we can fix that some other way. The return-value optimization should not rely on move semantics; it predates move semantics and maybe the another solution is to carefully write these functions so they get return-value optimization.
Chris Dumez
Comment 6 2022-03-16 16:44:50 PDT
(In reply to Darin Adler from comment #5) > I’m really surprised that this involves a call to swap. Maybe we can fix > that some other way. The return-value optimization should not rely on move > semantics; it predates move semantics and maybe the another solution is to > carefully write these functions so they get return-value optimization. I believe it involves swap() because that's have the Vector move constructor is implemented: ``` template<typename T, size_t inlineCapacity, typename OverflowHandler, size_t minCapacity, typename Malloc> inline Vector<T, inlineCapacity, OverflowHandler, minCapacity, Malloc>::Vector(Vector<T, inlineCapacity, OverflowHandler, minCapacity, Malloc>&& other) { swap(other); } ``` I can find dig deeper though. It is true that one would hope to get RVO.
Darin Adler
Comment 7 2022-03-16 16:44:55 PDT
Comment on attachment 454908 [details] Patch You’re welcome to fix it this way. I won’t stand in the way. But return-value optimization should have prevented this swap code from running; I want to understand why it didn’t kick in.
Darin Adler
Comment 8 2022-03-16 16:46:59 PDT
Separately, we should check and see if the move constructor can be written more efficiently than swap; seems likely that it could be.
Chris Dumez
Comment 9 2022-03-16 16:48:14 PDT
Comment on attachment 454908 [details] Patch Hmm, in some cases at least, it seems we call swap() explicitly: e.g. `nodesForInsertion.swap(removedChildNodes);` in `ContainerNode::removeSelfOrChildNodesForInsertion()`. Maybe I am wrong about the cause for the swap(), I'll double check the other instances.
Darin Adler
Comment 10 2022-03-16 16:50:29 PDT
Please also keep in mind that those functions likely use swap because the code predates move semantics; they could move instead if that helped.
Chris Dumez
Comment 11 2022-03-16 17:51:49 PDT
Chris Dumez
Comment 12 2022-03-16 18:26:11 PDT
With this patch (using an out-parameter for NodeVector), I have verified in traces that the CPU time spent in Vector::swap() is completely gone under this Node functions. However, using returning a NodeVector (avoid avoiding the explicit swap in ContainerNode::removeSelfOrChildNodesForInsertion(), I still see the CPU time under Vector::swap: Sample Count, Samples %, Normalized CPU %, Symbol 16, 0.0%, 0.0%, WTF::VectorBuffer<WTF::Ref<WebCore::Node, WTF::RawPtrTraits<WebCore::Node> >, 11ul, WTF::FastMalloc>::swap(WTF::VectorBuffer<WTF::Ref<WebCore::Node, WTF::RawPtrTraits<WebCore::Node> >, 11ul, WTF::FastMalloc>&, unsigned long, unsigned long) (in WebCore) 10, 0.0%, 0.0%, WebCore::ContainerNode::parserAppendChild(WebCore::Node&) (in WebCore) 5, 0.0%, 0.0%, WebCore::ContainerNode::appendChildWithoutPreInsertionValidityCheck(WebCore::Node&) (in WebCore) 1, 0.0%, 0.0%, WebCore::ContainerNode::removeSelfOrChildNodesForInsertion(WebCore::Node&, std::__1::optional<WebCore::Exception>&) (in WebCore) Seems to confirm my suspicion that returning a NodeVector does call the Vector's move constructor (which is slow in the NodeVector case due to the large inline buffer). I haven't been able to get rid of that CPU time while returning NodeVector in functions yet. Note that many of the functions were already using an out-parameter for NodeVector, only the couple that I fixed were using a return value.
Chris Dumez
Comment 13 2022-03-16 19:49:30 PDT
From some quick googling it seems that copy elision is only guaranteed in C++17 for prvalues. Therefore, there is no guarantee for named variables. In those functions, we are definitely returning named variables (and sometimes from multiple code paths, not a single return statement). In such cases, NRVO is apparently not guaranteed, even in C++17. There might be a way to refactor the code to get NRVO but since there is no guarantee, it sounds like it would be fragile. Given this, I think using an out-parameter is probably the way to go? It definitely gets rid of the move constructor / Vector::swap() according to the profiler (still waiting on A/B testing to see if the win is actually measurable on Speedometer).
Darin Adler
Comment 14 2022-03-17 07:28:03 PDT
OK
Darin Adler
Comment 15 2022-03-17 07:28:40 PDT
Still think Vector’s move constructor can do better than swap.
Chris Dumez
Comment 16 2022-03-17 07:36:18 PDT
(In reply to Darin Adler from comment #15) > Still think Vector’s move constructor can do better than swap. Yes, I believe so as well. However, I think this can be addressed separately? Even if the Vector's move constructor could be better than a swap, it would still be costly for vectors with an inline buffer, compared to vectors that use an out-of-line buffer (that we can simply adopt). The A/B bots have not even started processing this patch :/ From local testing I see a 0.5-0.6% progression on Speedometer 2 (iMac Pro / Intel). I also see from the profiler that the CPU time under NodeVector::swap() is completely gone.
Darin Adler
Comment 17 2022-03-17 08:01:27 PDT
(In reply to Chris Dumez from comment #16) > (In reply to Darin Adler from comment #15) > > Still think Vector’s move constructor can do better than swap. > > Yes, I believe so as well. However, I think this can be addressed separately? Yes. > Even if the Vector's move constructor could be better than a swap, it would > still be costly for vectors with an inline buffer, compared to vectors that > use an out-of-line buffer (that we can simply adopt). Agreed.
Darin Adler
Comment 18 2022-03-17 08:02:13 PDT
Sorry, thought I did review+ yesterday
EWS
Comment 19 2022-03-17 09:55:25 PDT
Committed r291416 (248543@main): <https://commits.webkit.org/248543@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 454922 [details].
Radar WebKit Bug Importer
Comment 20 2022-03-17 09:56:19 PDT
Note You need to log in before you can comment on or make changes to this bug.