http://www.robohornet.org/tests/addcol.html spends 15% of its time in HTMLTableRowsCollection::rowAfter. This also accounts for around 10% of all samples in "RoboHornetPro" (Microsoft's fork).

Ryosuke suggests that we're invalidating the collection too eagerly (on every td addition) even though it should only need to be invalidated when we add a row. He mentions that HTMLCollection already knows how to pay attention only to direct children:

    static bool shouldOnlyIncludeDirectChildren(CollectionType type)

and that the invalidation code just needs to be taught about that method.
I believe what Ryosuke was suggesting is making invalidateNodeListCachesInAncestors smarter: possibly passing an (optional) Node* along to Document::invalidateNodeListCaches, and then, I assume, checking that node pointer in HTMLTableRowsCollection, or perhaps some base class which is aware of the fact that HTMLTableRowsCollection only cares about changes to its direct children (and in this case, only to the addition or removal of HTMLTableRow elements). I would guess the behavior would be: if Node* == 0, we always invalidate everything, and if a Node* is given, we only invalidate if the Node*'s parentNode is the root? Presumably we may have to pass the root pointer along too? I suspect rniwa has the design better sorted out in his head.
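Very roughly, I'm imagining something like this (all names and signatures below are my guesses for illustration, not existing code):

    // Hypothetical sketch: Document::invalidateNodeListCaches grows an
    // optional changedNode parameter (0 means "invalidate everything").
    // m_collectionCaches is an invented member; assume HTMLCollection's
    // shouldOnlyIncludeDirectChildren is made accessible here.
    void Document::invalidateNodeListCaches(Node* changedNode)
    {
        HashSet<HTMLCollection*>::iterator end = m_collectionCaches.end();
        for (HashSet<HTMLCollection*>::iterator it = m_collectionCaches.begin(); it != end; ++it) {
            HTMLCollection* collection = *it;
            // Collections that only include direct children (e.g. table.rows)
            // can keep their cache unless the change happened directly under
            // their root node.
            if (changedNode && shouldOnlyIncludeDirectChildren(collection->type())
                && changedNode->parentNode() != collection->base())
                continue;
            collection->invalidateCache();
        }
    }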
I should note: rniwa and I looked at this further and found that all of our tag-based collections (document.images, document.objects, table.rows, etc.) invalidate on any element addition/removal. Fixing this would be a rather large endeavor, but would be a huge win here, and likely a reasonable win for many pages.
This is by far the worst remaining bug. It accounts for about 25% of remaining RoboHornetPro samples, now that bug 98800 is resolved.
Created attachment 173759 [details]
first attempt, will not work
The approach I uploaded cannot work. It was too dumb, even for the addcol.html test in question. :)

I brainstormed some with Adam K, Ojan, and Tony this afternoon and we've come up with a plausible approach.

Two constraints:
- Node lists need to know about all DOM mutations in order to be properly invalidated.
- DOM mutation code is very hot, and can't be slowed down by checking every node list.

Approach:
- Add a fixed-size queue for DOM mutations onto Document (or some sort of Document rare data).
- Every appendChild/removeChild adds a record into this mutation queue, if there is still room.
- Each entry in an added/removed subtree would occupy a slot in this queue.
- If the queue is ever full, an "all lists are invalid" bit is set on Document.
- On any access to a node list/collection, we first have to process any pending invalidations from our mutation queue, including invalidating all node lists if the "all lists are invalid" bit is set.

This should possibly make appendChild/removeChild faster, as we could presumably avoid our current ancestor walk for invalidating node lists. It should make cases where authors are iterating over a node list and mutating the DOM much faster, when their mutations are smaller than the number of entries in our mutation queue. This would shift the entire burden of invalidation to access of properties on the node lists, instead of during appendChild/removeChild.

I'm told that rniwa attempted an "all lists are invalid" bit approach before (as a way of speeding up appendChild) and it was a large regression for sites using .childNodes, etc. I suspect that this O(1) mutation queue in front of the "all lists are invalid" bit might work, however. My next step is to find rniwa's old patch. :)
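Here's a very rough sketch of the shape I have in mind (every name below is invented; none of this is existing WebKit code):

    // Sketch only: a bounded log of child additions/removals hung off
    // Document, replayed lazily on node-list access.
    struct QueuedMutation {
        enum Type { ChildAdded, ChildRemoved };
        QueuedMutation(Type type, PassRefPtr<Node> node) : type(type), node(node) { }
        Type type;
        RefPtr<Node> node; // keeps the node alive until the queue is replayed
    };

    static const size_t maxQueuedMutations = 100;

    void Document::recordChildMutation(QueuedMutation::Type type, Node* node)
    {
        if (m_allListsAreInvalid)
            return; // already in the fallback state; nothing to record
        if (m_mutationQueue.size() >= maxQueuedMutations) {
            // Overflow: fall back to dumb "invalidate everything" behavior.
            m_mutationQueue.clear();
            m_allListsAreInvalid = true;
            return;
        }
        m_mutationQueue.append(QueuedMutation(type, node));
    }

    // Called at the top of every node list / collection accessor.
    void Document::flushPendingListInvalidations()
    {
        if (m_allListsAreInvalid) {
            invalidateAllNodeListCaches(); // invented name
            m_allListsAreInvalid = false;
            return;
        }
        // Replay the queued mutations using the same "invalidate your
        // ancestors" logic we run inline today.
        for (size_t i = 0; i < m_mutationQueue.size(); ++i)
            invalidateNodeListCachesInAncestors(m_mutationQueue[i].node.get());
        m_mutationQueue.clear();
    }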
(In reply to comment #5)
> - If the queue is ever full, an "all lists are invalid" bit is set on Document.

We can’t do this. I’ve attempted this in the past and broke a bunch of websites. e.g. jQuery’s clone creates a document fragment to which it adds a whole bunch of nodes from the original DOM subtree. These additions will fill up your queue pretty quickly. Since jQuery uses getElementsByTagName("*") to iterate over all the nodes it’s trying to clone, invalidating node lists in the original document will regress jQuery performance. And there are millions of examples like this where invalidating node lists more often leads to pathological performance.

> - On any access to a node list/collection, we first have to process any pending invalidations from our mutation queue, including invalidating all node lists if the "all lists are invalid" bit is set.

We can’t do this either. Accessing a node list is also really hot, especially for things like childNodes. Adding that kind of structural check prohibitively slows down node lists / HTML collections.

> I'm told that rniwa attempted an "all lists are invalid" bit approach before (as a way of speeding up appendChild) and it was a large regression for sites using .childNodes, etc. I suspect that this O(1) mutation queue in front of the "all lists are invalid" bit might work, however.

No. We can’t lazily invalidate node lists & HTML collections. I spent 6 months trying to do this. Nothing worked.
Also, now that Elliot landed his patch to get rid of the rare node data map, we don’t have the performance issue we had in removeChild & appendChild/insertBefore anymore. Even when I got rid of the node list invalidation from those functions, the performance improvement we got was very tangible (1~3%). We’re much better off invalidating node lists / HTML collections less, so that we can keep the cache in more cases.
I'd be interested in discussing this with you further. For context, bug 73853 is where rniwa previously attempted this document-global invalidation.
(In reply to comment #6)
> (In reply to comment #5)
> > - If the queue is ever full, an "all lists are invalid" bit is set on Document.
> 
> We can’t do this. I’ve attempted this in the past and broke a bunch of websites.

By "broke", I assume you mean "slowed down"?

> e.g. jQuery’s clone creates a document fragment to which it adds a whole bunch of nodes from the original DOM subtree. These additions will fill up your queue pretty quickly. Since jQuery uses getElementsByTagName("*") to iterate over all the nodes it’s trying to clone, invalidating node lists in the original document will regress jQuery performance.

Could you be more specific? The idea is that every time we iterate to the next node in the list, that access of the next node would empty the queue.

> And there are millions of examples like this where invalidating node lists more often leads to pathological performance.

I'm interested in seeing more examples. You clearly have a lot of context here, which I'm eager to learn from. The goal of this proposed approach is to invalidate *much* less often.

> > - On any access to a node list/collection, we first have to process any pending invalidations from our mutation queue, including invalidating all node lists if the "all lists are invalid" bit is set.
> 
> We can’t do this either. Accessing a node list is also really hot, especially for things like childNodes. Adding that kind of structural check prohibitively slows down node lists / HTML collections.

I find it hard to believe that a single branch in every node-list function, to check the invalidation state of the list and the global bit, would be prohibitive, but again, you have more context here and I would like to learn from your time spent on this.
(In reply to comment #9)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > - If the queue is ever full, an "all lists are invalid" bit is set on Document.
> > 
> > We can’t do this. I’ve attempted this in the past and broke a bunch of websites.
> 
> By "broke", I assume you mean "slowed down"?
> 
> > e.g. jQuery’s clone creates a document fragment to which it adds a whole bunch of nodes from the original DOM subtree. These additions will fill up your queue pretty quickly. Since jQuery uses getElementsByTagName("*") to iterate over all the nodes it’s trying to clone, invalidating node lists in the original document will regress jQuery performance.
> 
> Could you be more specific? The idea is that every time we iterate to the next node in the list, that access of the next node would empty the queue.

How does that work? If accessing some node list clears the queue, then how do node lists differentiate the case where we didn’t have any DOM mutations from the case where we did have DOM mutations but some other node list cleared the queue? More importantly, it implies that this optimization doesn’t work if you access multiple node lists in interchanging order while manipulating the DOM.

> > We can’t do this either. Accessing a node list is also really hot, especially for things like childNodes. Adding that kind of structural check prohibitively slows down node lists / HTML collections.
> 
> I find it hard to believe that a single branch in every node-list function to check the invalidation state of the list and the global bit would be prohibitive, but again, you have more context here and I would like to learn from your time spent on this.

Say you know that a node X has been added to a container Y, and this is in the proposed queue. How does a node list figure out whether it needs to clear its cache or not? For starters, the node list needs to know whether Y is in the subtree it cares about or not. That’s an O(n) operation.

In general, invalidating lazily is not the way to go.
(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #6)
> > > e.g. jQuery’s clone creates a document fragment to which it adds a whole bunch of nodes from the original DOM subtree. These additions will fill up your queue pretty quickly. Since jQuery uses getElementsByTagName("*") to iterate over all the nodes it’s trying to clone, invalidating node lists in the original document will regress jQuery performance.
> > 
> > Could you be more specific? The idea is that every time we iterate to the next node in the list, that access of the next node would empty the queue.
> 
> How does that work? If accessing some node list clears the queue, then how do node lists differentiate the case where we didn’t have any DOM mutations from the case where we did have DOM mutations but some other node list cleared the queue? More importantly, it implies that this optimization doesn’t work if you access multiple node lists in interchanging order while manipulating the DOM.

Sorry, I wasn't clear. The queue is "cleared" on any node-list property access, but as a result of "processing" every add/remove/modify/whatever in the queue. The idea is that we do the same "invalidate your ancestors" style logic; we just do it when the node lists are accessed, instead of inline as part of appendChild/removeChild.

This of course only works up to a point. We can't bother saving away arbitrarily many modifications, so we limit ourselves to, say, 100 DOM modifications in our fixed-length queue, and when we exceed that number, we just mark ourselves as having overflowed and mark all lists as needing invalidation.

Marking all lists as needing invalidation is of course bad (as you discovered with your previous attempts!). But I expect that when we do more than 100 (or whatever N we choose) modifications per loop iteration, the cost of the mallocs, etc. for those modifications (creating new nodes, destroying old ones, etc.) far outweighs any list invalidation costs we have.

As you know, currently *any* element addition/removal invalidates all lists in your ancestor chain (including those on document). We could leave the current attribute optimizations inline in appendChild/removeChild, or we could also move attribute modifications to this same delayed processing if it were shown successful for the element case.

> > > We can’t do this either. Accessing a node list is also really hot, especially for things like childNodes. Adding that kind of structural check prohibitively slows down node lists / HTML collections.
> > 
> > I find it hard to believe that a single branch in every node-list function to check the invalidation state of the list and the global bit would be prohibitive, but again, you have more context here and I would like to learn from your time spent on this.
> 
> Say you know that a node X has been added to a container Y, and this is in the proposed queue. How does a node list figure out whether it needs to clear its cache or not? For starters, the node list needs to know whether Y is in the subtree it cares about or not. That’s an O(n) operation.
> 
> In general, invalidating lazily is not the way to go.

Sure. It does the same logic as today, once for each item in that queue. It just does it at list-access time, instead of at appendChild/removeChild time.
(In reply to comment #11)
> This of course only works up to a point. We can't bother saving away arbitrarily many modifications, so we limit ourselves to, say, 100 DOM modifications in our fixed-length queue, and when we exceed that number, we just mark ourselves as having overflowed and mark all lists as needing invalidation.

I should be clear that the "invalidate all lists" step is also handled at list-access time, instead of during appendChild/removeChild; this is what the "all lists are invalid" bit on Document is for. I believe your previous patch had similar logic, but I would need to read it more closely.

Again, I am very interested in discussing this with you further, as you have *way* more context in this area than I do.
(In reply to comment #11)
> Sorry, I wasn't clear. The queue is "cleared" on any node-list property access, but as a result of "processing" every add/remove/modify/whatever in the queue. The idea is that we do the same "invalidate your ancestors" style logic; we just do it when the node lists are accessed, instead of inline as part of appendChild/removeChild.

What's the point of doing it at a later time? We have less information then.

> This of course only works up to a point. We can't bother saving away arbitrarily many modifications, so we limit ourselves to, say, 100 DOM modifications in our fixed-length queue, and when we exceed that number, we just mark ourselves as having overflowed and mark all lists as needing invalidation.

But this invalidation only saves you one node list, unless I'm misunderstanding you. Consider a DOM tree with node lists A and B. You access A[0] and B[0], then modify the DOM, and then access A[1]. At this point, the queue is cleared. Whether A is invalidated or not is not important, but the fact that the queue is emptied is. Now you access B[1]. Oops, we've already lost the queue! What do we do in this case?

> As you know, currently *any* element addition/removal invalidates all lists in your ancestor chain (including those on document). We could leave the current attribute optimizations inline in appendChild/removeChild, or we could also move attribute modifications to this same delayed processing if it were shown successful for the element case.

But what's the benefit of delaying the invalidation per node removal/addition? We're just postponing the work, and we'll have less information when we try to decide whether we have to invalidate node lists or not.

> > Say you know that a node X has been added to a container Y, and this is in the proposed queue. How does a node list figure out whether it needs to clear its cache or not? For starters, the node list needs to know whether Y is in the subtree it cares about or not. That’s an O(n) operation.
> > 
> > In general, invalidating lazily is not the way to go.
> 
> Sure. It does the same logic as today, once for each item in that queue. It just does it at list-access time, instead of at appendChild/removeChild time.

But why do we want this? This doesn't reduce the number of node lists needing to be invalidated. We're just delaying the time at which we invalidate them. Sure, we don't invalidate node lists right away, but that's not the performance bottleneck in that particular test; the fact that the node list is invalidated at all is. If we can come up with a clever way to avoid invalidating the said HTML collection in this particular test, then let's implement that logic. But let's invalidate the node list right away when we're adding or removing a node.
Comment on attachment 173759 [details]
first attempt, will not work

View in context: https://bugs.webkit.org/attachment.cgi?id=173759&action=review

> Source/WebCore/dom/ElementRareData.h:87
> + collection->invalidateCache();

This patch didn't work because you didn't delete the code that invalidates the cache below. I bet it'll work if you delete lines 93 through 100.
Comment on attachment 173759 [details]
first attempt, will not work

View in context: https://bugs.webkit.org/attachment.cgi?id=173759&action=review

> Source/WebCore/dom/ContainerNode.cpp:738
> + if (Node* newChild = addedNode(this, beforeChild, afterChild, childCountDelta)) {

We also need to take care of the case where a node is removed. While the test itself doesn't measure this time, reset() still uses it, so that's probably why it's still showing up in your profiler.
(In reply to comment #14)
> (From update of attachment 173759 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=173759&action=review
> 
> > Source/WebCore/dom/ElementRareData.h:87
> > + collection->invalidateCache();
> 
> This patch didn't work because you didn't delete the code that invalidates the cache below. I bet it'll work if you delete lines 93 through 100.

My (bogus) patch can't work, as it doesn't handle insertion of subtrees, and robohornet/addcol.html tests exactly that. :)
I believe we can resolve this bug if we polish the patch you've posted. But there are a few other optimizations we can implement for node lists (a rough sketch follows the list):

1. Cache the entire node list when it's small. It's not atypical for node lists to contain a very small number of items. In that case, we can cache the entire thing in the node list, so that random access, and iterating over it multiple times, is fast. When the node list is large, we can either avoid caching it altogether or cache only the last 10-20 items in the list.

2. Detect on which side of the cached item a node is added or removed. When a new node is added to, or an existing node is removed from, the subtree a node list is interested in, we can detect whether the added or removed node appears before or after the cached node in preorder. e.g. we can cache the index of the child of m_rootNode whose subtree contains the cached item, and as we walk up the ancestor tree we can check whether the invalidation lands before or after the cached item. There are many clever tricks we can use to make this not regress removeChild/appendChild.

3. Cache the first item & the last item. It's very common for author scripts to access the first or the last item of a node list / HTML collection. We can cache these items as we first encounter them.
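For (1) and (3), roughly (every name below is invented for illustration; this is not existing WebKit code):

    // Sketch: a node list that fully caches itself when small and remembers
    // its first item. Invalidation resets all of it.
    static const unsigned maxFullyCachedItems = 16;

    Node* ExampleNodeList::item(unsigned index)
    {
        if (m_fullyCached) // (1): small lists are cached wholesale
            return index < m_cachedItems.size() ? m_cachedItems[index].get() : 0;
        if (!index && m_cachedFirstItem)
            return m_cachedFirstItem.get(); // (3): cheap list[0] access
        return traverseToItem(index); // hypothetical slow path; may populate the caches above
    }

    void ExampleNodeList::invalidateCache()
    {
        m_cachedItems.clear();
        m_fullyCached = false;
        m_cachedFirstItem = 0;
        m_cachedLastItem = 0;
    }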
(In reply to comment #16)
> My (bogus) patch can't work, as it doesn't handle insertion of subtrees, and robohornet/addcol.html tests exactly that. :)

We can traverse the subtree we're about to insert/remove and collect all the elements. We can set some threshold (3~4 maybe?) and stop if the subtree has more than that many nodes.
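Something along these lines (a hypothetical helper; the names, the threshold, and the use of Node::traverseNextNode are my assumptions):

    // Collect up to maxElements elements from the subtree rooted at root.
    // Returns false on overflow, in which case the caller should fall back
    // to unconditional invalidation.
    static bool collectElementsInSubtree(Node* root, Vector<Element*, 4>& elements, size_t maxElements)
    {
        for (Node* node = root; node; node = node->traverseNextNode(root)) {
            if (!node->isElementNode())
                continue;
            if (elements.size() >= maxElements)
                return false; // subtree too big; give up
            elements.append(static_cast<Element*>(node));
        }
        return true;
    }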
(In reply to comment #13)
> (In reply to comment #11)
> > Sorry, I wasn't clear. The queue is "cleared" on any node-list property access, but as a result of "processing" every add/remove/modify/whatever in the queue. The idea is that we do the same "invalidate your ancestors" style logic; we just do it when the node lists are accessed, instead of inline as part of appendChild/removeChild.
> 
> What's the point of doing it at a later time? We have less information then.

I'm not sure which information you're referring to. Currently we always invalidate every node list in your ancestor chain when an element is added to or removed from the tree. This happens from childrenChanged, which knows the previous and next siblings (if relevant) as well as the childCountDelta. I had planned to store a RefPtr to each added node in each entry of this queue, along with the type of the modification.

The point of doing it at a later time is to remove this logic from the hot appendChild/removeChild code paths.

We don't currently handle "smart" invalidation of node lists/collections based on what types of elements are added or removed (even though HTMLTableRowsCollection only cares about tr's, for instance). I could (and will need to) add this smart invalidation as part of this work. You are correct to suggest that HTMLCollection could be taught about this smart invalidation completely separately from the "queue" concept discussed above. The queue concept is all about getting work out of the hot appendChild/removeChild path. It's not clear to me whether we can do subtree-insertion/removal-triggered surgical invalidation of node lists efficiently enough to put it in this hot path, but it's reasonable to suggest that we should try doing it there before adding the complexity of a queue/deferred-invalidation system.

> > This of course only works up to a point. We can't bother saving away arbitrarily many modifications, so we limit ourselves to, say, 100 DOM modifications in our fixed-length queue, and when we exceed that number, we just mark ourselves as having overflowed and mark all lists as needing invalidation.
> 
> But this invalidation only saves you one node list, unless I'm misunderstanding you. Consider a DOM tree with node lists A and B. You access A[0] and B[0], then modify the DOM, and then access A[1]. At this point, the queue is cleared. Whether A is invalidated or not is not important, but the fact that the queue is emptied is. Now you access B[1]. Oops, we've already lost the queue! What do we do in this case?

The queue is a queue of DOM modifications. We replay it in a LIFO manner, invalidating the appropriate node lists for each entry in the queue.

The intention of the queue design is not to save node list invalidations, but rather to allow possibly more expensive (and simpler) invalidation code, by moving it out of the hot appendChild/removeChild paths. I could certainly be convinced that the whole queue design is premature optimization for the feature in question here (smarter invalidation based on adds/removes).

> > As you know, currently *any* element addition/removal invalidates all lists in your ancestor chain (including those on document). We could leave the current attribute optimizations inline in appendChild/removeChild, or we could also move attribute modifications to this same delayed processing if it were shown successful for the element case.
> 
> But what's the benefit of delaying the invalidation per node removal/addition? We're just postponing the work, and we'll have less information when we try to decide whether we have to invalidate node lists or not.

It's not clear to me which information we're losing. The reasoning for deferring the invalidation work is discussed in an earlier paragraph of this comment.

> > > Say you know that a node X has been added to a container Y, and this is in the proposed queue. How does a node list figure out whether it needs to clear its cache or not? For starters, the node list needs to know whether Y is in the subtree it cares about or not. That’s an O(n) operation.
> > > 
> > > In general, invalidating lazily is not the way to go.
> > 
> > Sure. It does the same logic as today, once for each item in that queue. It just does it at list-access time, instead of at appendChild/removeChild time.
> 
> But why do we want this? This doesn't reduce the number of node lists needing to be invalidated. We're just delaying the time at which we invalidate them.

Correct. I should have been clearer earlier that the queue is only about allowing more expensive invalidation work by deferring it out of the critical add/remove path. The smarter invalidation work is what's necessary to solve this original bug.

> Sure, we don't invalidate node lists right away, but that's not the performance bottleneck in that particular test.

Correct! My understanding from speaking with you before about this issue is that the invalidation logic is extremely perf-sensitive code, however (because of its inclusion in the super-hot appendChild/removeChild code).

> The fact that the node list is invalidated at all is. If we can come up with a clever way to avoid invalidating the said HTML collection in this particular test, then let's implement that logic. But let's invalidate the node list right away when we're adding or removing a node.

Sure. And that's a very reasonable first step for this bug. Based on previous discussions with you, I feared that the necessary logic (walking the inserted/removed subtree) during appendChild/removeChild would be too expensive.

My take-away from our discussion is that I should post the patches for smarter invalidation, add them to the critical path (demonstrating that the bug is fixed), and then, if that's shown to be too slow to use, we can discuss the queue approach (or other similar approaches) at greater length.
(In reply to comment #18)
> (In reply to comment #16)
> > My (bogus) patch can't work, as it doesn't handle insertion of subtrees, and robohornet/addcol.html tests exactly that. :)
> 
> We can traverse the subtree we're about to insert/remove and collect all the elements. We can set some threshold (3~4 maybe?) and stop if the subtree has more than that many nodes.

Yes, this is similar to (and is part of what prompted) the queue design originally mentioned. What you've described is queuing up all the elements in a stack array instead of a document-global array, and processing the elements inline in add/remove (with the same kind of "overflow causes fallback to dumber invalidation" logic as discussed for the queue).
(In reply to comment #19)
> The point of doing it at a later time is to remove this logic from the hot appendChild/removeChild code paths.
> ...
> The queue is a queue of DOM modifications. We replay it in a LIFO manner, invalidating the appropriate node lists for each entry in the queue.
> 
> The intention of the queue design is not to save node list invalidations, but rather to allow possibly more expensive (and simpler) invalidation code, by moving it out of the hot appendChild/removeChild paths.

I don't think there is a point in doing that. If this invalidation step is too slow to run inside appendChild/removeChild, then it's certainly too slow to run when a node list item is accessed. A node list's getItem is much hotter than appendChild/removeChild.

> Correct! My understanding from speaking with you before about this issue is that the invalidation logic is extremely perf-sensitive code, however (because of its inclusion in the super-hot appendChild/removeChild code).

Right. We certainly want to limit the extent to which we explore the subtree. However, we already traverse the nodes in the subtree we're inserting or removing, to update inDocument flags and tree scopes (TreeScopeAdopter), so we might be able to add some sort of context object there to collect the information we need, to maximize cache locality.

> My take-away from our discussion is that I should post the patches for smarter invalidation, add them to the critical path (demonstrating that the bug is fixed), and then, if that's shown to be too slow to use, we can discuss the queue approach (or other similar approaches) at greater length.

Sounds good to me.
What we can also do is create a tag-name-to-bit-field map from the existing node lists, and use that for an O(1) comparison per node.
(In reply to comment #22)
> What we can also do is create a tag-name-to-bit-field map from the existing node lists, and use that for an O(1) comparison per node.

I think that's an interesting thought. That could allow us to condense the relevant knowledge from a subtree walk into a single 32/64-bit type. Each node list could have a similar mask of the changes to elements/attributes it cares about, and then the per-node-list comparison is cheap. That saves having to walk the vector of changed nodes for each node list you need to consider invalidating.
(In reply to comment #23)
> I think that's an interesting thought. That could allow us to condense the relevant knowledge from a subtree walk into a single 32/64-bit type. Each node list could have a similar mask of the changes to elements/attributes it cares about, and then the per-node-list comparison is cheap. That saves having to walk the vector of changed nodes for each node list you need to consider invalidating.

Right. And when there are more than 32/64 tag names we care about, we can fall back to invalidating everything.
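Roughly (all names below are invented; this is only a sketch of the idea):

    // Assign each tag name that some live node list cares about to one bit
    // of a 64-bit mask. Once we run out of bits, mark ourselves overflowed
    // and fall back to invalidating everything.
    class TagNameBits {
    public:
        TagNameBits() : m_overflowed(false) { }

        uint64_t maskForTagName(const AtomicString& tagName)
        {
            if (uint64_t existing = m_bits.get(tagName))
                return existing;
            if (m_overflowed || m_bits.size() >= 64) {
                m_overflowed = true;
                return ~static_cast<uint64_t>(0); // matches every mask
            }
            uint64_t bit = static_cast<uint64_t>(1) << m_bits.size();
            m_bits.set(tagName, bit);
            return bit;
        }

        bool overflowed() const { return m_overflowed; }

    private:
        HashMap<AtomicString, uint64_t> m_bits;
        bool m_overflowed;
    };

    // Per changed node, a list with mask listMask needs invalidating only if:
    //     bits.overflowed() || (listMask & bits.maskForTagName(element->localName()))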
One optimization I always wanted to implement is to have document-level information about which kinds of node lists currently have a valid cached item. Such information would allow us to avoid iterating over all node lists when only some of them have a valid cache.
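For instance (all names below are invented; just a sketch of the bookkeeping):

    // Document keeps a count, per kind of node list, of lists that currently
    // hold a valid cache. Invalidation can then bail out early in the common
    // "nothing cached" case.
    enum NodeListKind { ChildNodeListKind, TagNodeListKind, ClassNodeListKind, NumNodeListKinds };

    class ExampleDocument {
    public:
        ExampleDocument() : m_validCacheCounts() { } // zero-initializes the array

        // Called by a node list when its cache becomes valid / is invalidated.
        void didValidateCache(NodeListKind kind) { ++m_validCacheCounts[kind]; }
        void didInvalidateCache(NodeListKind kind) { --m_validCacheCounts[kind]; }

        void invalidateNodeListsOfKind(NodeListKind kind)
        {
            if (!m_validCacheCounts[kind])
                return; // no list of this kind has a valid cache; skip the walk
            // ... walk only the registered lists of this kind ...
        }

    private:
        unsigned m_validCacheCounts[NumNodeListKinds];
    };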