Bug 74828 - The HTML parser doesn't enforce the "Noah's Ark condition" from the HTML5 spec
Summary: The HTML parser doesn't enforce the "Noah's Ark condition" from the HTML5 spec
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-12-18 17:55 PST by Adam Barth
Modified: 2011-12-19 00:28 PST (History)
2 users (show)

See Also:


Attachments
Patch (259.40 KB, patch)
2011-12-18 18:02 PST, Adam Barth
no flags Details | Formatted Diff | Diff
incorporated review comments (259.21 KB, patch)
2011-12-18 23:15 PST, Adam Barth
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Barth 2011-12-18 17:55:50 PST
The HTML parser doesn't enforce the "Noah's Ark condition" from the HTML5 spec
Comment 1 Adam Barth 2011-12-18 18:02:47 PST
Created attachment 119792 [details]
Patch
Comment 2 Darin Adler 2011-12-18 18:22:19 PST
Comment on attachment 119792 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=119792&action=review

> Source/WebCore/ChangeLog:10
> +        elements that can be in the list of active formating elements.  I'm not

Typo: formating instead of formatting

> Source/WebCore/ChangeLog:11
> +        entirely sure that enforcing this condition is worth the complexity,

Besides complexity, there is the performance cost. Did you measure that? I hope it’s negligible with the optimized code path you added here.

> Source/WebCore/html/parser/HTMLFormattingElementList.cpp:43
> +inline size_t attributeCount(Element* element)

Since this is in a cpp file it should have static so it gets internal linkage.

> Source/WebCore/html/parser/HTMLFormattingElementList.cpp:134
> +void HTMLFormattingElementList::tryToEnsureNoahsArkConditionQuickly(Element* newElement, Vector<Element*>& remainingCandiates)

Typo: remainingCandiates rather than remainingCandidates

> Source/WebCore/html/parser/HTMLFormattingElementList.cpp:138
> +    // Initially, use a vector with inline capacity to avoid a malloc in the

The use of the word “initially” here is confusing. I know it refers to this case vs. the non-quickly function, but it is not obvious.

> Source/WebCore/html/parser/HTMLFormattingElementList.cpp:142
> +    size_t newElementAttributeCount = attributeCount(newElement);

Is it possibly worth checking m_entries.size() for zero first, so we don’t have to fetch the attribute count at all in that case?

> Source/WebCore/html/parser/HTMLFormattingElementList.cpp:145
> +    for (size_t i = 0; i < m_entries.size(); ++i) {
> +        Entry& entry = m_entries[m_entries.size() - i - 1];

I would write the loop like this:

    for (size_t i = m_entries.size(); i; ) {
        --i;
        Entry& entry = m_entries[i];

It’s my new favorite way to write backwards loops. I think it’s easier to read this way.

> Source/WebCore/html/parser/HTMLFormattingElementList.cpp:152
> +        if (!newElement->hasTagName(candidate->tagQName()))
> +            continue;

I believe hasTagName here is incorrect and the correct check is even faster and stricter, element->tagQName() != candidate->tagQName(). Is that right?

> Source/WebCore/html/parser/HTMLFormattingElementList.cpp:192
> +            // FIXME: Technically we shouldn't read this information back from
> +            // the DOM. Instead, the parser should keep a copy of the information.

Could we be more specific about why reading back from the DOM is an issue? Maybe the reason is that a script that changes the DOM should have no effect on the Noah's Ark algorithm. Or maybe there’s some other reason that you know of and I do not.

> Source/WebCore/html/parser/HTMLFormattingElementList.cpp:206
> +    // formatting lement list gets permuted.

Typo: lement list

> Source/WebCore/html/parser/HTMLFormattingElementList.h:130
> +    // http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#list-of-active-formatting-elements

I think a human-readable description of what this is for could be valuable too. My comment would say something like this:

    // The "Noah's Ark" algorithm, which removes redundant mis-nested elements.
Comment 3 Adam Barth 2011-12-18 23:15:10 PST
Created attachment 119823 [details]
incorporated review comments
Comment 4 Adam Barth 2011-12-18 23:31:26 PST
> Besides complexity, there is the performance cost. Did you measure that? I hope it’s negligible with the optimized code path you added here.

Looks like about 1.5% on the HTMLParser benchmark.  That's slightly more than I would have expected but still pretty small.
Comment 5 WebKit Review Bot 2011-12-19 00:28:27 PST
Comment on attachment 119823 [details]
incorporated review comments

Clearing flags on attachment: 119823

Committed r103223: <http://trac.webkit.org/changeset/103223>
Comment 6 WebKit Review Bot 2011-12-19 00:28:31 PST
All reviewed patches have been landed.  Closing bug.