Bug 40961

Summary: REGRESSION (HTML5 Parser): Pages broken due to <tag<tag> parsing changes
Product: WebKit Reporter: William Siegrist <wsiegrist>
Component: DOMAssignee: Adam Barth <abarth>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, aestes, ap, commit-queue, CorniMac, darin, dglazkov, eric, gustavo, mike, mitz, webkit.review.bot, xan.lopez
Priority: P1 Keywords: InRadar, Regression
Version: 528+ (Nightly build)   
Hardware: Mac   
OS: OS X 10.6   
URL: http://www.macruby.org/
Bug Depends on:    
Bug Blocks: 41115, 46134    
Attachments:
Description Flags
screenshot showing bad layout
none
reduced test case
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch none

Description William Siegrist 2010-06-21 17:30:11 PDT
Created attachment 59318 [details]
screenshot showing bad layout

The MacRuby website does not layout correctly on r61502. The main body of the website is shifted to the right and lines up with the red boxes on the right instead of the left edge of the black navigation bar.  See attached screenshot.

The website looks correct on stock 10.6.4 WebKit.
Comment 1 Alexey Proskuryakov 2010-06-22 11:10:43 PDT
Regressed in r61285.
Comment 2 Alexey Proskuryakov 2010-06-22 11:25:31 PDT
Created attachment 59397 [details]
reduced test case

This test passes in Safari 5 and in Firefox 3.6.3, but not in IE8. The site is also happily broken in IE.
Comment 3 Adam Barth 2010-06-22 11:48:38 PDT
Thanks.  I'll look into this today.
Comment 4 Adam Barth 2010-06-22 13:44:29 PDT
We seem to be doing what the spec says to do here (and we match Minefield).  I've sent an email to the HTML working group.  This issue is on our "top five" list of behavioral differences like to cause compatibility problems:

https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en

Let's give the working group a couple days to respond before taking action.
Comment 5 Adam Barth 2010-06-22 13:48:19 PDT
> This test passes in Safari 5 and in Firefox 3.6.3, but not in IE8. The site is also happily broken in IE.

The test also fails in Opera 10.54.
Comment 6 William Siegrist 2010-06-22 13:58:14 PDT
I filed a MacRuby bug with the fix for them:

https://www.macruby.org/trac/ticket/759
Comment 7 Adam Barth 2010-06-22 14:53:52 PDT
See http://www.w3.org/Bugs/Public/show_bug.cgi?id=9985
Comment 8 Adam Barth 2010-07-09 11:16:35 PDT
*** Bug 41069 has been marked as a duplicate of this bug. ***
Comment 9 mitz 2010-07-19 11:34:20 PDT
Applications linked against the shipping version of WebKit rely on the old parser behavior. There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac.
Comment 10 Adam Barth 2010-07-19 11:38:01 PDT
> Applications linked against the shipping version of WebKit rely on the old parser behavior.

Perhaps you'd be interested in <http://trac.webkit.org/browser/trunk/WebKit/mac/WebView/WebPreferenceKeysPrivate.h#L96>?

> There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac.

Is there a sunset for how long the Mac port is required to support the old parser?  It's quite complex, and it's going to bitrot.
Comment 11 mitz 2010-07-19 11:44:38 PDT
(In reply to comment #10)
> > Applications linked against the shipping version of WebKit rely on the old parser behavior.
> 
> Perhaps you'd be interested in <http://trac.webkit.org/browser/trunk/WebKit/mac/WebView/WebPreferenceKeysPrivate.h#L96>?

I am aware of it, but keeping the legacy parser around is not desirable for obvious reasons.

> > There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac.
> 
> Is there a sunset for how long the Mac port is required to support the old parser?  It's quite complex, and it's going to bitrot.

The Mac port is not required to support the old parser, and in fact I just said that was un undesirble way to address the issue.

Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk.
Comment 12 Adam Barth 2010-07-19 11:52:34 PDT
> Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk.

Here's a document describing the parser changes that we're worried might case compatibility problems for WebKit-only content:

https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en

Do you think it would be better to have one setting that enabled all the quirks we need, or should we have a separate setting for each one?

Same question: do we have a plan for sunsetting these quirks?  For example, we could make them available only to applications linked before a certain date and then remove the setting them when that date is sufficiently far in the past.

Also, we'll need to test the performance impact of adding these branches to the tokenizer.  In general, you can see many of the branches in the benchmark score, so we might need to do something clever.
Comment 13 Alexey Proskuryakov 2010-07-19 11:59:24 PDT
> Same question: do we have a plan for sunsetting these quirks?  
> For example, we could make them available only to applications
> linked before a certain date and then remove the setting them when
> that date is sufficiently far in the past.

I don't think we have an equivalent mechanism for Dashboard widgets, so the quirk will likely have to be permanent.
Comment 14 mitz 2010-07-19 12:06:41 PDT
(In reply to comment #12)
> > Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk.
> 
> Here's a document describing the parser changes that we're worried might case compatibility problems for WebKit-only content:
> 
> https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en
> 
> Do you think it would be better to have one setting that enabled all the quirks we need, or should we have a separate setting for each one?

I don’t have a strong opinion on this. It might depend on the context in which we’d want to enable the quirks. For now, I can’t think of a reason to separate them.

> Same question: do we have a plan for sunsetting these quirks?  For example, we could make them available only to applications linked before a certain date and then remove the setting them when that date is sufficiently far in the past.

For application compatibility, the quirks should only be enabled for select application bundle IDs (currently the set contains only AIM) and only if they were not linked against WebKit-534.1 or later. There is currently no plan to ever break AIM.

For website compatibilty, if site-specific hacks are deployed, they can be revisited at any time. If specific websites are targeted, and those websites are updated to not require the quirk anymore, then it can be removed. If the quirk targets something different (for example, a specific version of some JavaScript library), then it’s harder to tell when to remove it.

> Also, we'll need to test the performance impact of adding these branches to the tokenizer.  In general, you can see many of the branches in the benchmark score, so we might need to do something clever.

Of course. In the worst case, and if we opt for a single “quirks” switch, then templatizing a sufficiently large portion of the tokenizer should eliminate branches on any hot paths.
Comment 15 Adam Barth 2010-07-19 13:02:05 PDT
Sounds like a plan.  I might need some help with the patch because I'm not overly familiar with the quirks infrastructure.
Comment 16 Mark Rowe (bdash) 2010-08-25 17:28:27 PDT
We’ve seen several more cases internal to Apple where this change in behavior has impacted existing content.  They’ve primarily been web sites, but also email messages generated by reporting systems.
Comment 17 Adam Barth 2010-08-26 23:17:34 PDT
(In reply to comment #16)
> We’ve seen several more cases internal to Apple where this change in behavior has impacted existing content.  They’ve primarily been web sites, but also email messages generated by reporting systems.

That's not overly surprising given that internal Apple content is more likely to be consumed with WebKit than with IE.  Presumably these pieces of content are also broken in IE and Firefox 4.
Comment 18 Andy Estes 2010-09-10 01:56:06 PDT
<rdar://problem/8154379>
Comment 19 Adam Barth 2010-09-13 01:43:39 PDT
I offered to write a patch for this bug, but Maciej thought it would be better if someone else wrote the patch to get some experience with how this code works.

One approach to fixing this bug is to add a Setting that can be used by the embedder for app-specific compatibility.  The setting instructs the HTMLTokenizer to recognize '<' in the TagNameState and to re-consume it in the data state.  One subtly is that you'll probably need to emit the current token before re-consuming the '<' character.
Comment 20 Andy Estes 2010-09-13 20:38:19 PDT
(In reply to comment #19)
> I offered to write a patch for this bug, but Maciej thought it would be better if someone else wrote the patch to get some experience with how this code works.
> 
> One approach to fixing this bug is to add a Setting that can be used by the embedder for app-specific compatibility.  The setting instructs the HTMLTokenizer to recognize '<' in the TagNameState and to re-consume it in the data state.  One subtly is that you'll probably need to emit the current token before re-consuming the '<' character.

Adam, thanks for the pointer in the right direction. In addition to TagNameState, there were a few other states that needed this quirk to handle cases such as <tag attr<tag> and <tag <tag>, but it is essentially the same change in each state (for <tag attr<tag>, endAttributeName() needs to be called before reconsuming the < in DataState).

I have the WebCore part of this patch written, but I still need to actually enable the quirk in WebKit and also write a layout test, which probably means also modifying DRT so that the quirk can be enabled from the test.

I'll upload the WebCore part now if you want to take a look at it.
Comment 21 Andy Estes 2010-09-13 20:49:04 PDT
(In reply to comment #20)
> I have the WebCore part of this patch written, but I still need to actually enable the quirk in WebKit and also write a layout test, which probably means also modifying DRT so that the quirk can be enabled from the test.

On second thought, enabling this quirk is probably outside the scope of this bug. I'll just focus on adding the quirk capability and open other bugs for enabling it in specific cases.
Comment 22 Eric Seidel (no email) 2010-09-13 20:51:51 PDT
Quirks like this should be easy to test using some iframes and layoutTestController.overridePreference
Comment 23 Andy Estes 2010-09-13 21:02:03 PDT
Created attachment 67516 [details]
Patch
Comment 24 Adam Barth 2010-09-13 21:12:16 PDT
Comment on attachment 67516 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch

Overall, looks spot on.  Thanks for working on this patch.

> WebCore/html/parser/HTMLDocumentParser.cpp:95
> -    , m_tokenizer(HTMLTokenizer::create())
> +    , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings()))
Be sure to add a test case that uses fragment parsing to make sure the setting properly propagates to the fragment parsing code.

> WebCore/html/parser/HTMLTokenizer.cpp:106
> -HTMLTokenizer::HTMLTokenizer()
> +HTMLTokenizer::HTMLTokenizer(Settings* settings)
Do you think passing in the settings object is better than passing in the Document?  It's slightly unclear to me which is better.

> WebCore/html/parser/HTMLTokenizer.cpp:108
> +    , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks())
useLegacyTokenizerQuirks?  I guess it depends on what granularity we want to be able to control the quirks...

Are you sure that frameless documents don't need this quirk?

> WebCore/html/parser/HTMLTokenizer.cpp:440
> +        else if (cc == '<' && m_useLegacyParserQuirks)
I bet you should put m_useLegacyParserQuirks first.  That branch will be very well predicted.

> WebCore/html/parser/HTMLTokenizer.cpp:919
> +        } else if (cc == '<' && m_useLegacyParserQuirks) {
> +            m_token->endAttributeName(source.numberOfCharactersConsumed());
> +            return emitAndReconsumeIn(source, DataState);
It would be nice if we could skip this one.  The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary).  By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.
Comment 25 Eric Seidel (no email) 2010-09-13 21:20:27 PDT
Comment on attachment 67516 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch

> WebCore/html/parser/HTMLTokenizer.cpp:108
> +    , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks())
I disagree.  One flag is scary enough.  Having N different parser modes is bad news bears.   Andy is right to just have one flag IMO.
Comment 26 Andy Estes 2010-09-13 21:31:30 PDT
(In reply to comment #24)
> (From update of attachment 67516 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch
> 
> Overall, looks spot on.  Thanks for working on this patch.
> 
> > WebCore/html/parser/HTMLDocumentParser.cpp:95
> > -    , m_tokenizer(HTMLTokenizer::create())
> > +    , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings()))
> Be sure to add a test case that uses fragment parsing to make sure the setting properly propagates to the fragment parsing code.

Will do.

> 
> > WebCore/html/parser/HTMLTokenizer.cpp:106
> > -HTMLTokenizer::HTMLTokenizer()
> > +HTMLTokenizer::HTMLTokenizer(Settings* settings)
> Do you think passing in the settings object is better than passing in the Document?  It's slightly unclear to me which is better.

My thought was that Settings was more specific to what HTMLTokenizer needs than Document.

> 
> > WebCore/html/parser/HTMLTokenizer.cpp:108
> > +    , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks())
> useLegacyTokenizerQuirks?  I guess it depends on what granularity we want to be able to control the quirks...

I'd like to use this same quirk to handle the <script /> case, and I believe that will need to be patched in HTMLTreeBuilder. So, using the work tokenizer might make it too specific.

> 
> Are you sure that frameless documents don't need this quirk?

Ahh, so since document()->settings() is really document()->m_frame->settings(), frameless documents would return a NULL Settings. I didn't consider this, so I'll think about the right thing to do. I don't know a lot about frameless documents, but I don't see why they wouldn't want the quirk.

> 
> > WebCore/html/parser/HTMLTokenizer.cpp:440
> > +        else if (cc == '<' && m_useLegacyParserQuirks)
> I bet you should put m_useLegacyParserQuirks first.  That branch will be very well predicted.

Will do.

> 
> > WebCore/html/parser/HTMLTokenizer.cpp:919
> > +        } else if (cc == '<' && m_useLegacyParserQuirks) {
> > +            m_token->endAttributeName(source.numberOfCharactersConsumed());
> > +            return emitAndReconsumeIn(source, DataState);
> It would be nice if we could skip this one.  The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary).  By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.

This can probably be skipped for now, but I wouldn't be surprised if we find this case in the wild. The markup used by AIM comes close to hitting this case, but has whitespace after the end of the attribute (the attribute value is also quoted). 

Thanks for the feedback!
Comment 27 Andy Estes 2010-09-13 21:38:58 PDT
(In reply to comment #24)
> > WebCore/html/parser/HTMLTokenizer.cpp:919
> > +        } else if (cc == '<' && m_useLegacyParserQuirks) {
> > +            m_token->endAttributeName(source.numberOfCharactersConsumed());
> > +            return emitAndReconsumeIn(source, DataState);
> It would be nice if we could skip this one.  The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary).  By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.

If this turns out to be necessary, maybe it could be restricted to quoted attributes.
Comment 28 Adam Barth 2010-09-13 21:42:13 PDT
(In reply to comment #27)
> (In reply to comment #24)
> > > WebCore/html/parser/HTMLTokenizer.cpp:919
> > > +        } else if (cc == '<' && m_useLegacyParserQuirks) {
> > > +            m_token->endAttributeName(source.numberOfCharactersConsumed());
> > > +            return emitAndReconsumeIn(source, DataState);
> > It would be nice if we could skip this one.  The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary).  By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.
> 
> If this turns out to be necessary, maybe it could be restricted to quoted attributes.

I'm confused.  Do you mean this case:

<div foo="hello<script>alert(1)</script>there">

or this case:

<div foo="hello"<script>alert(1)</script>

?
Comment 29 Alexey Proskuryakov 2010-09-13 22:11:25 PDT
> m_useLegacyParserQuirks

If this is going to be a single switch, it still needs a more specific name (maybe PreHTML5?). We won't want to add HTML5 behaviors here when HTML5v2 comes out, but this name will still be suggesting that we should.
Comment 30 Andy Estes 2010-09-14 21:55:16 PDT
(In reply to comment #28)
> (In reply to comment #27)
> > (In reply to comment #24)
> > > > WebCore/html/parser/HTMLTokenizer.cpp:919
> > > > +        } else if (cc == '<' && m_useLegacyParserQuirks) {
> > > > +            m_token->endAttributeName(source.numberOfCharactersConsumed());
> > > > +            return emitAndReconsumeIn(source, DataState);
> > > It would be nice if we could skip this one.  The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary).  By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.
> > 
> > If this turns out to be necessary, maybe it could be restricted to quoted attributes.
> 
> I'm confused.  Do you mean this case:
> 
> <div foo="hello<script>alert(1)</script>there">
> 
> or this case:
> 
> <div foo="hello"<script>alert(1)</script>
> 
> ?

The second case. I don't believe the legacy parser emitted a script token in the first case.
Comment 31 Adam Barth 2010-09-14 21:58:55 PDT
> The second case. I don't believe the legacy parser emitted a script token in the first case.

Ok.  :)
Comment 32 Andy Estes 2010-09-17 03:10:33 PDT
Created attachment 67897 [details]
Patch
Comment 33 Andy Estes 2010-09-17 03:14:34 PDT
(In reply to comment #32)
> Created an attachment (id=67897) [details]
> Patch

A few notes:

- I couldn't decide how to handle the case of frameless document so I didn't. I'm not aware of a case where this will cause a regression, but now that I think of it I should probably add a FIXME.
- I made my best effort to add the new WebPreference to all the ports so they can run the layout tests, but I can't be sure they're all going to compile. I'll try to land this during a quiet period so I can clean up any breakage without causing too much trouble.
Comment 34 WebKit Review Bot 2010-09-17 03:47:08 PDT
Attachment 67897 [details] did not build on gtk:
Build output: http://queues.webkit.org/results/4012057
Comment 35 Andy Estes 2010-09-17 04:00:55 PDT
Created attachment 67899 [details]
Patch
Comment 36 WebKit Review Bot 2010-09-17 04:29:48 PDT
Attachment 67899 [details] did not build on gtk:
Build output: http://queues.webkit.org/results/4043039
Comment 37 Andy Estes 2010-09-17 04:33:53 PDT
Created attachment 67900 [details]
Patch
Comment 38 WebKit Review Bot 2010-09-17 04:55:43 PDT
Attachment 67897 [details] did not build on win:
Build output: http://queues.webkit.org/results/4029046
Comment 39 Andy Estes 2010-09-17 05:34:51 PDT
Created attachment 67902 [details]
Patch
Comment 40 WebKit Review Bot 2010-09-17 06:46:08 PDT
Attachment 67900 [details] did not build on win:
Build output: http://queues.webkit.org/results/4063037
Comment 41 Darin Adler 2010-09-17 10:07:48 PDT
Comment on attachment 67902 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch

> WebCore/html/parser/HTMLTokenizer.h:122
> -    static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); }
> +    static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); }

It’s quite inelegant to tie the tokenizer and parser to document settings in this fashion.

It would be quite a bit better to have the parser and tokenizer publish some sort of object or flags word that expresses the quirks to be used, then we could have the code that creates the parser convert the settings into that form. Having the tokenizer read the settings directly is inelegant and I think too big a change to the class.

> WebCore/html/parser/HTMLTokenizer.h:268
> +    inline bool usePreHTML5ParserQuirks();

Not sure why the existing code includes these “inline” keywords. They don’t do any good that I can see.

I don’t think that all pre-HTML5 parser quirks should be grouped in a single flag. I think we should have an explicit flag for each quirk.
Comment 42 Adam Barth 2010-09-17 10:15:57 PDT
Comment on attachment 67902 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch

> WebCore/html/parser/HTMLTokenizer.cpp:1085
> +        else if (usePreHTML5ParserQuirks() && cc == '<')
> +            return emitAndReconsumeIn(source, DataState);

It's slightly unclear to me why you chose these states for this quirk.  I would have expected it in these states:

TagNameState
BeforeAttributeNameState
AttributeNameState *
AfterAttributeNameState *
AfterAttributeValueQuotedState

I've marked the ones you didn't add with a *.  You can test those as follows:

<foo bar<qux>
<foo bar <qux>

> WebCore/html/parser/HTMLTokenizer.cpp:1679
> +inline bool HTMLTokenizer::usePreHTML5ParserQuirks()
> +{
> +    return (m_settings && m_settings->usePreHTML5ParserQuirks());
> +}

Have you run the HTML parsing benchmark on this code?  You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html.  I suspect you'll get a measurable slowdown from this patch.  It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time.
Comment 43 Darin Adler 2010-09-17 10:19:13 PDT
(In reply to comment #42)
> Have you run the HTML parsing benchmark on this code?  You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html.  I suspect you'll get a measurable slowdown from this patch.  It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time.

Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design.
Comment 44 Adam Barth 2010-09-17 10:21:31 PDT
> Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design.

Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch.  However, instead of speculating, we can just run the benchmark.
Comment 45 Andy Estes 2010-09-17 12:35:15 PDT
(In reply to comment #41)
> (From update of attachment 67902 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
> 
> > WebCore/html/parser/HTMLTokenizer.h:122
> > -    static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); }
> > +    static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); }
> 
> It’s quite inelegant to tie the tokenizer and parser to document settings in this fashion.
> 
> It would be quite a bit better to have the parser and tokenizer publish some sort of object or flags word that expresses the quirks to be used, then we could have the code that creates the parser convert the settings into that form. Having the tokenizer read the settings directly is inelegant and I think too big a change to the class.

This was what I did in my first patch, but the issue was that layout tests enable the quirk after the parser is created, hence the current design. I probably shouldn't have designed for the pathological case, and now that I understand better how WebPreferences work, I think I can write a test that sets the preference then does its work in an iframe to ensure the right ordering.

> 
> > WebCore/html/parser/HTMLTokenizer.h:268
> > +    inline bool usePreHTML5ParserQuirks();
> 
> Not sure why the existing code includes these “inline” keywords. They don’t do any good that I can see.
> 
> I don’t think that all pre-HTML5 parser quirks should be grouped in a single flag. I think we should have an explicit flag for each quirk.

There seemed to be a consensus that a single quirk was a better approach. Would there be a case where we'd want to enable the quirk for <tag<tag> but not for <tag <tag>? To me, the three changes to the tokenizer handle variants of a single issue so a single quirk settings makes sense.
Comment 46 Andy Estes 2010-09-17 12:39:53 PDT
(In reply to comment #42)
> (From update of attachment 67902 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
> 
> > WebCore/html/parser/HTMLTokenizer.cpp:1085
> > +        else if (usePreHTML5ParserQuirks() && cc == '<')
> > +            return emitAndReconsumeIn(source, DataState);
> 
> It's slightly unclear to me why you chose these states for this quirk.  I would have expected it in these states:
> 
> TagNameState
> BeforeAttributeNameState
> AttributeNameState *
> AfterAttributeNameState *
> AfterAttributeValueQuotedState
> 
> I've marked the ones you didn't add with a *.  You can test those as follows:
> 
> <foo bar<qux>

I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values?

> <foo bar <qux>

You're right, I missed this one.

> 
> > WebCore/html/parser/HTMLTokenizer.cpp:1679
> > +inline bool HTMLTokenizer::usePreHTML5ParserQuirks()
> > +{
> > +    return (m_settings && m_settings->usePreHTML5ParserQuirks());
> > +}
> 
> Have you run the HTML parsing benchmark on this code?  You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html.  I suspect you'll get a measurable slowdown from this patch.  It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time.

That's the design I'm going back to.
Comment 47 Adam Barth 2010-09-17 13:09:54 PDT
Comment on attachment 67902 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch

>>> WebCore/html/parser/HTMLTokenizer.cpp:1085
>>> +            return emitAndReconsumeIn(source, DataState);
>> 
>> It's slightly unclear to me why you chose these states for this quirk.  I would have expected it in these states:
>> 
>> TagNameState
>> BeforeAttributeNameState
>> AttributeNameState *
>> AfterAttributeNameState *
>> AfterAttributeValueQuotedState
>> 
>> I've marked the ones you didn't add with a *.  You can test those as follows:
>> 
>> <foo bar<qux>
>> <foo bar <qux>
> 
> I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values?

Yes.  Unquoted attribute values are the problematic ones from a security point of view.
Comment 48 Andy Estes 2010-09-17 13:29:57 PDT
(In reply to comment #47)
> (From update of attachment 67902 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
> 
> >>> WebCore/html/parser/HTMLTokenizer.cpp:1085
> >>> +            return emitAndReconsumeIn(source, DataState);
> >> 
> >> It's slightly unclear to me why you chose these states for this quirk.  I would have expected it in these states:
> >> 
> >> TagNameState
> >> BeforeAttributeNameState
> >> AttributeNameState *
> >> AfterAttributeNameState *
> >> AfterAttributeValueQuotedState
> >> 
> >> I've marked the ones you didn't add with a *.  You can test those as follows:
> >> 
> >> <foo bar<qux>
> >> <foo bar <qux>
> > 
> > I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values?
> 
> Yes.  Unquoted attribute values are the problematic ones from a security point of view.

I guess there's no such thing as a quoted attribute name :) Nevermind...fixed both cases.
Comment 49 Andy Estes 2010-09-17 18:01:24 PDT
Created attachment 67984 [details]
Patch
Comment 50 Andy Estes 2010-09-17 18:03:57 PDT
(In reply to comment #44)
> > Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design.
> 
> Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch.  However, instead of speculating, we can just run the benchmark.

I'll run the benchmark against the latest patch and report what I find.
Comment 51 Andy Estes 2010-09-17 23:34:57 PDT
(In reply to comment #50)
> (In reply to comment #44)
> > > Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design.
> > 
> > Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch.  However, instead of speculating, we can just run the benchmark.
> 
> I'll run the benchmark against the latest patch and report what I find.

I ran the benchmark a total of 6 times in a 64-bit production build of Safari, once with my patch applied and once without. Here is what I found:

With patch:

avg: 2741.5
stdev: 20.5

avg: 2739.2
stdev: 20.4

avg: 2753.2
stdev: 45.8

Without patch:

avg: 2773.3
stdev: 20

avg: 2775.9
stdev: 21.4

avg: 2770.2
stdev: 20.5
Comment 52 Adam Barth 2010-09-18 13:46:20 PDT
Looks like you made it faster by a hair.  :)
Comment 53 Darin Adler 2010-09-18 16:21:23 PDT
Comment on attachment 67984 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch

> WebCore/html/parser/HTMLTokenizer.h:123
> -    static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); }
> +    static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); }

This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all.

It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded.

I suppose we can live with this, but I don’t like it.
Comment 54 Adam Barth 2010-09-18 16:28:27 PDT
Comment on attachment 67984 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch

>> WebCore/html/parser/HTMLTokenizer.h:123
>> +    static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); }
> 
> This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all.
> 
> It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded.
> 
> I suppose we can live with this, but I don’t like it.

I'll post a patch to remove Frame from updateStateFor.
Comment 55 Adam Barth 2010-09-18 19:07:52 PDT
> I'll post a patch to remove Frame from updateStateFor.

I looked at this, but it seems more error prone / verbose.  :(
Comment 56 WebKit Review Bot 2010-09-18 23:11:34 PDT
Attachment 67984 [details] did not build on win:
Build output: http://queues.webkit.org/results/4009066
Comment 57 Darin Adler 2010-09-19 18:58:24 PDT
(In reply to comment #55)
> > I'll post a patch to remove Frame from updateStateFor.
> 
> I looked at this, but it seems more error prone / verbose.  :(

Yes, I noticed the same thing. Frame doesn’t express the set of tags allowed in a way that can easily be passed in, so this convenience function is only convenient with Frame*. I still think we might some day figure out a way to fix that, though. Or maybe may desire for “layering” here is misguided.
Comment 58 Andy Estes 2010-09-19 23:58:25 PDT
Created attachment 68055 [details]
Patch
Comment 59 Andy Estes 2010-09-20 00:08:30 PDT
(In reply to comment #53)
> (From update of attachment 67984 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch
> 
> > WebCore/html/parser/HTMLTokenizer.h:123
> > -    static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); }
> > +    static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); }
> 
> This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all.
> 
> It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded.
> 
> I suppose we can live with this, but I don’t like it.

The latest patch moves the Settings* dependency up to HTMLDocumentParser.
Comment 60 WebKit Review Bot 2010-09-20 00:33:31 PDT
Attachment 68055 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/4092005
Comment 61 Eric Seidel (no email) 2010-09-20 00:34:19 PDT
Comment on attachment 68055 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=68055&action=review

You'll want Abarth to review the tokenizer changes.  He's more familiar with that part of the parser than I am.

> WebCore/html/parser/HTMLDocumentParser.cpp:83
> +    , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks()))

Seems this should be a helper function which takes a Document*.  We used this pattern when the HTML5 parser was optional too.

> WebCore/html/parser/HTMLDocumentParser.cpp:96
> +    , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings() && fragment->document()->settings()->usePreHTML5ParserQuirks()))

Would be re-used here.

> WebCore/html/parser/HTMLPreloadScanner.cpp:128
> +    m_tokenizer = HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks());

And here.  This seems wrong.  We don't need to move the initializer out of the normal c++ pattern.  The ASSERT only helps in debug mode.  ASSERTing after we've hosed ourselves is just as good as before.

> WebCore/html/parser/HTMLViewSourceParser.cpp:37
> +    , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks()))

You want your inline again here. :)  Although it's possible you won't make another copy of the inline in this file since it's only used once.  When we had the HTML5 parser optional, we had a:

static inline useLegacyHTMLParser(Document*) function which did similar to this check and handled the null frame case.  WE had two copies.  one in the treebuilder and one in the DocumentParser.  Seems you need to do the check more often than that, so maybe it needs to be a class method/static on DocumentParser or HTMLDocumentParser?

> WebKit/gtk/webkit/webkitwebsettings.cpp:111
> +    gboolean enable_pre_html5_parser_quirks;

I'm not sure any other port besides Mac wants this.  I think this could be a mac-only quirk with mac-only layout tests.  We really want this quirk to die ASAP (yes, I know that could be years).
Comment 62 WebKit Review Bot 2010-09-20 01:20:00 PDT
Attachment 68055 [details] did not build on win:
Build output: http://queues.webkit.org/results/4082033
Comment 63 Andy Estes 2010-09-20 12:07:28 PDT
(In reply to comment #61)
> (From update of attachment 68055 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=68055&action=review
> 
> You'll want Abarth to review the tokenizer changes.  He's more familiar with that part of the parser than I am.
> 
> > WebCore/html/parser/HTMLDocumentParser.cpp:83
> > +    , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks()))
> 
> Seems this should be a helper function which takes a Document*.  We used this pattern when the HTML5 parser was optional too.

Done.

> 
> > WebKit/gtk/webkit/webkitwebsettings.cpp:111
> > +    gboolean enable_pre_html5_parser_quirks;
> 
> I'm not sure any other port besides Mac wants this.  I think this could be a mac-only quirk with mac-only layout tests.  We really want this quirk to die ASAP (yes, I know that could be years).

I guess skipping it on non-Mac ports makes sense since only the Mac port will enable it, although at the time I wrote it I didn't see how having all platforms run the test could hurt. Now that various iterations of this patch are breaking on the EWS bots, I see it as a maintenance burden.
Comment 64 Andy Estes 2010-09-20 14:18:27 PDT
Created attachment 68134 [details]
Patch
Comment 65 Adam Barth 2010-09-20 14:26:16 PDT
Comment on attachment 68134 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=68134&action=review

> WebCore/html/parser/HTMLDocumentParser.h:145
> +bool usePreHTML5ParserQuirks(Document*);

Generally, we prefer to have these sorts of functions be static methods of a class:

HTMLDocumentParser::usePreHTML5Quirks(...)

That makes it easier to find the implementation of the function when you see it called in a random file.

> LayoutTests/fast/parser/resources/pre-html5-parser-quirk-after-attribute-value-quoted-state.html:4
> +        <p></p attr="value"</div>

Can you add a test for <p attr=value</  ?
Comment 66 Andy Estes 2010-09-20 16:01:32 PDT
Committed r67890: <http://trac.webkit.org/changeset/67890>
Comment 67 Andy Estes 2010-09-21 00:34:20 PDT
Created attachment 68198 [details]
Patch
Comment 68 Andy Estes 2010-09-21 00:35:04 PDT
Reopening to add additional test coverage.
Comment 69 Adam Barth 2010-09-21 00:50:09 PDT
Comment on attachment 68198 [details]
Patch

Thanks.
Comment 70 WebKit Commit Bot 2010-09-21 03:21:32 PDT
Comment on attachment 68198 [details]
Patch

Clearing flags on attachment: 68198

Committed r67936: <http://trac.webkit.org/changeset/67936>
Comment 71 WebKit Commit Bot 2010-09-21 03:21:44 PDT
All reviewed patches have been landed.  Closing bug.