Bug 80490

Summary: [Microdata] Implement cache mechanism for HTMLPropertiesCollection.
Product: WebKit Reporter: Arko Saha <arko>
Component: DOMAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, rniwa, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on:    
Bug Blocks: 68609    
Attachments:
Description Flags
Patch
rniwa: review-
Updated patch
rniwa: review-
Updated patch
none
Updated patch none

Description Arko Saha 2012-03-06 22:13:39 PST
Currently HTMLPropertiesCollection item, length does not use cache. Its recomputing item, length every time. Need to implement caching mechanism for HTMLPropertiesCollection.
Comment 1 Arko Saha 2012-03-08 06:32:53 PST
Created attachment 130817 [details]
Patch
Comment 2 Adam Barth 2012-03-08 14:19:45 PST
I'm going to defer to rniwa in reviewing this patch.
Comment 3 Ryosuke Niwa 2012-03-09 08:54:12 PST
As I discussee with arko on IRC, we should push back on the spec so that property item lists are in the document order instead of being sorted.
Comment 4 Ryosuke Niwa 2012-03-09 11:30:38 PST
Wait... the spec says it's in the tree order :(

http://dev.w3.org/html5/md/#htmlpropertiescollection
collection(index)
Returns the element with index index from the collection. The items are sorted in tree order.

We shouldn't need to do any of these sorting. DOM is in the tree order by default!
Comment 5 Arko Saha 2012-03-12 00:22:50 PDT
(In reply to comment #4)
> Wait... the spec says it's in the tree order :(
> 
> http://dev.w3.org/html5/md/#htmlpropertiescollection
> collection(index)
> Returns the element with index index from the collection. The items are sorted in tree order.
> 
> We shouldn't need to do any of these sorting. DOM is in the tree order by default!

Let me try to explain the situation with below example:

<div id="id1">
<p itemprop="foo" id="a"></p>
</div>

<section itemscope itemref="id1 id2">
<p itemprop="foo" id="b"></p>
</section>

<div id="id2">
<p itemprop="foo" id="c"></p>
</div>

Here <section> has itemscope attribute, its an microdata item.

if we do item = document.getItems()[0], it will return the <section> element. first item in the document. 

item.properties.namedItem(foo) must return a NodeList object containing three <p> elements with id- a, b, c respectively. it should be in the tree order. 

With itemref attribute we can specify the additional properties of an item. Those properties will not be children of the item as they are added through itmref attribute.

Now to find the properties of an item we do the below steps as per spec :
1. Find all children of the item with itemprop attribute.
2. Find the properties which are added through the itemref attribute. In the example: find element with id 'id1', 'id2'. Then find the properties of id1, id2
3. Sort the result in the tree order.

Please note that the traversal is started from base element i.e <section> in this case and not from the document root so the node list we get will not be in tree order(without step 3). It will be b, a, c.

So without sorting how to get the properties in tree order (a, b, c)??
Comment 6 Ryosuke Niwa 2012-03-12 00:42:19 PDT
Thanks for the clarification.

(In reply to comment #5)
> With itemref attribute we can specify the additional properties of an item. Those properties will not be children of the item as they are added through itmref attribute.
> 
> Now to find the properties of an item we do the below steps as per spec :
> 1. Find all children of the item with itemprop attribute.

If we're currently in one of these children, then we can simply look for the next sibling.

> 2. Find the properties which are added through the itemref attribute. In the example: find element with id 'id1', 'id2'. Then find the properties of id1, id2

We can probably cache a sorted (in tree order) list of these itemref ids. We can then use it to find the next element with the said id in the tree order.
Comment 7 Arko Saha 2012-03-14 08:17:34 PDT
I have modified the patch so that it caches only item and itemref elements. And sort happens only on this list. Chlidren of item and itemref elements are exculded from sorting as they are already in tree order.
Comment 8 Arko Saha 2012-03-14 08:19:13 PDT
Created attachment 131853 [details]
Updated patch
Comment 9 Arko Saha 2012-03-20 07:07:27 PDT
Hi Ryosuke, can you please review the updated patch. Thanks.
Comment 10 Ryosuke Niwa 2012-03-21 23:31:27 PDT
Comment on attachment 131853 [details]
Updated patch

View in context: https://bugs.webkit.org/attachment.cgi?id=131853&action=review

r- because you probably want to address some of my comments.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:67
> +static Node* nextNodeOrSibling(Node* base, Node* node)

Could you give this function more descriptive name? It does certainly traverse next node or sibling but it doesn't explain why it does.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:69
> +    return (node == base) || (node->isHTMLElement() && !toHTMLElement(node)->fastHasAttribute(itemscopeAttr))

No need for parenthesis around node == base.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:79
> +    Node* current;
> +    if (!previous)
> +        current = base;
> +    else
> +        current = nextNodeOrSibling(base, previous);

I think you should use the ternary operator here.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:84
> +        if (!current->isHTMLElement())
> +            continue;
> +        Element* element = toHTMLElement(current);

It seems odd to check if current is a HTML element then store it to Element*.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:98
> +    if (!base()->isHTMLElement() || !toHTMLElement(base())->fastHasAttribute(itemscopeAttr))

Can base() ever return non-Element node? Maybe we want to change the type there so that we don't have to check whether it's not an element or not.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:103
> +    m_cache.itemRefElements.append(baseElement);
> +    m_cache.hasItemRefElements = true;

We probably want to add a bunch of helper functions in m_cache so that we don't have to update these member variables manually like this.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:125
> +    for (unsigned i = 0; i < m_cache.itemRefElements.size(); ++i)
> +        for (Element* element = itemAfter(m_cache.itemRefElements[i], 0); element; element = itemAfter(m_cache.itemRefElements[i], element))
> +            ++length;

You need curly brackets for the outer for loop.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:158
> +        for (unsigned i = 0; i < m_cache.itemRefElements.size(); ++i)

You need curly brackets for this for loop since the statement is more than one line.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:159
> +            if ((m_cache.current = itemAfter(m_cache.itemRefElements[i], 0))) {

No extra parenthesis needed.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:165
> +                m_cache.itemRefElementPosition = i;
> +                break;
> +            }
> +        m_cache.position = 0;
> +        if (!m_cache.current)
> +            return 0;

It seems like this entire block is redundant and can be merged into the for loop below.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:168
> +    Element* e = m_cache.current;

Please don't use one-letter variable name like e.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:174
> +            e = itemAfter(m_cache.itemRefElements[i], e);

We need to initialize e before each inner loop here (i.e. before for (unsigned pos = currentPosition;...), right?

> Source/WebCore/html/HTMLPropertiesCollection.cpp:179
> +            if (e)
> +                currentPosition++;
> +            else
> +                break;

It's probably better to bail out early as in:
if (!e)
    break;
currentPosition++;

> Source/WebCore/html/HTMLPropertiesCollection.cpp:204
> +            if (m_cache.propertyNames->isEmpty() || !m_cache.propertyNames->contains(propertyName))

I think you can just check !m_cache.propertyNames->contains(propertyName). If m_cache.propertyNames->isEmpty(), then it certainly doesn't contain anything.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:233
> +    invalidateCacheIfNeeded();
> +    updateNameCache();

It seems like updateNameCache can call invalidateCacheIfNeeded.
Comment 11 Arko Saha 2012-03-27 06:31:34 PDT
(In reply to comment #10)
> It seems like this entire block is redundant and can be merged into the for loop below.
> 

The algorithm of HTMLPropertiesCollection::item() method is as below:

In if (!m_cache.current || m_cache.position > index) {.. }

Here we reset the position to 0. set current to the 0th property in the collection. set the itemRefElementPosition to the position in itemRefElements where we find the 0th property.

The first block is responsible for finding the 0th property in the itemRefElements list. Then searching required property in the collection starts in the second block. 
So we require this block to get started with the searching in the second block.

In the second block i.e, for (unsigned i = m_cache.itemRefElementPosition..) :

1. The outer for loop picks an element from itemRefElements.
2. The inner for loop is responsible to find a property in that element i.e, itemRefElements[i]. using itemAfter() method.
itemAfter() returns the 0th property if the previous element is null else returns the next property in the itemRefElements[i]. It will return null if it already processed all properties in itemRefElements[i].
3. increment the currentPosition appropriately. 
4. If the currentPosition is equal to index then return the property. else find the next property.
5. If the next property is null that means itemRefElements[i] has no more property to process. then control will come to the outer for loop and picks the next element in the itemRefElements. And executes above steps (repeat from step 1). 

So I think if we merge two blocks here then the algorithm would become little unclear and complex.

In the following block,
if (!m_cache.current || m_cache.position > index) {m_cache.position = 0; ... }
we are resetting the position to 0, but if we do not set the m_cache.current to the first property in the collection, it might require some special handling for the 0th property to start the search with.. which might complicate the algo.
Comment 12 Ryosuke Niwa 2012-03-27 22:59:30 PDT
Thanks for the clarification.

(In reply to comment #11)
> The first block is responsible for finding the 0th property in the itemRefElements list. Then searching required property in the collection starts in the second block. 
> So we require this block to get started with the searching in the second block.
> 
> In the second block i.e, for (unsigned i = m_cache.itemRefElementPosition..) :
> 
> 1. The outer for loop picks an element from itemRefElements.
> 2. The inner for loop is responsible to find a property in that element i.e, itemRefElements[i]. using itemAfter() method.
> itemAfter() returns the 0th property if the previous element is null else returns the next property in the itemRefElements[i]. It will return null if it already processed all properties in itemRefElements[i].
> 3. increment the currentPosition appropriately. 
> 4. If the currentPosition is equal to index then return the property. else find the next property.
> 5. If the next property is null that means itemRefElements[i] has no more property to process. then control will come to the outer for loop and picks the next element in the itemRefElements. And executes above steps (repeat from step 1). 
> 
> So I think if we merge two blocks here then the algorithm would become little unclear and complex.

In that case, can extract either or both parts as a helper function(s) with some descriptive name? That'll make the semantics much clearer :)
Comment 13 Arko Saha 2012-03-29 07:03:52 PDT
Created attachment 134572 [details]
Updated patch

Incorporated review comments.
Comment 14 Ryosuke Niwa 2012-03-29 15:23:33 PDT
Comment on attachment 134572 [details]
Updated patch

View in context: https://bugs.webkit.org/attachment.cgi?id=134572&action=review

Okay, the patch looks much better. Maybe we want one more iteration before we land this.

> Source/WebCore/html/HTMLCollection.h:92
> +#if ENABLE(MICRODATA)
> +        NodeCacheMap propertyCache;
> +        Vector<Element*> itemRefElements;
> +        RefPtr<DOMStringList> propertyNames;
> +        unsigned itemRefElementPosition;
> +        bool hasItemRefElements;
> +#endif

On my second thought, it might be better to store this in HTMLPropertiesCollection or an inner class or HTMLPropertiesCollection
because the change as is will increase the size of all HTMLCollection even though those collections would never use these variables.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:67
> +static Node* nextNodeToProcess(Node* base, Node* node)

Maybe it's better to call this function nextNodeWithProperty? nextNodeToProcess sounds too generic.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:69
> +    // An Microdata item may contain properties which in turn an item itself. Properties can also

"which in turn" what? "which in turn are items themselves"?

> Source/WebCore/html/HTMLPropertiesCollection.cpp:110
> +    std::sort(itemRefElements.begin(), itemRefElements.end(), compareTreeOrder);

This is very inefficient. I think it's probably better to walk the entire document and collect nodes in the document order
since sorting nodes will end up doing the same work anyway.

> Source/WebCore/html/HTMLPropertiesCollection.cpp:184
> +        for (unsigned pos = currentPosition; pos < index; ++pos) {

What's the point of pos local variable here? Can't we just increment currentPosition?
Comment 15 Arko Saha 2012-04-06 13:40:16 PDT
Created attachment 136059 [details]
Updated patch

Incorporated review comments.
Comment 16 WebKit Review Bot 2012-04-11 08:20:23 PDT
Comment on attachment 136059 [details]
Updated patch

Clearing flags on attachment: 136059

Committed r113862: <http://trac.webkit.org/changeset/113862>
Comment 17 WebKit Review Bot 2012-04-11 08:20:28 PDT
All reviewed patches have been landed.  Closing bug.