40961 – REGRESSION (HTML5 Parser): Pages broken due to <tag<tag> parsing changes

William Siegrist

Reported 2010-06-21 17:30:11 PDT

Created attachment 59318 [details] screenshot showing bad layout The MacRuby website does not layout correctly on r61502. The main body of the website is shifted to the right and lines up with the red boxes on the right instead of the left edge of the black navigation bar. See attached screenshot. The website looks correct on stock 10.6.4 WebKit.

Alexey Proskuryakov

Comment 1 2010-06-22 11:10:43 PDT

Regressed in r61285.

Alexey Proskuryakov

Comment 2 2010-06-22 11:25:31 PDT

Created attachment 59397 [details] reduced test case This test passes in Safari 5 and in Firefox 3.6.3, but not in IE8. The site is also happily broken in IE.

Adam Barth

Comment 3 2010-06-22 11:48:38 PDT

Thanks. I'll look into this today.

Adam Barth

Comment 4 2010-06-22 13:44:29 PDT

We seem to be doing what the spec says to do here (and we match Minefield). I've sent an email to the HTML working group. This issue is on our "top five" list of behavioral differences like to cause compatibility problems: https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en Let's give the working group a couple days to respond before taking action.

Adam Barth

Comment 5 2010-06-22 13:48:19 PDT

> This test passes in Safari 5 and in Firefox 3.6.3, but not in IE8. The site is also happily broken in IE. The test also fails in Opera 10.54.

William Siegrist

Comment 6 2010-06-22 13:58:14 PDT

I filed a MacRuby bug with the fix for them: https://www.macruby.org/trac/ticket/759

Adam Barth

Comment 7 2010-06-22 14:53:52 PDT

See http://www.w3.org/Bugs/Public/show_bug.cgi?id=9985

Adam Barth

Comment 8 2010-07-09 11:16:35 PDT

*** Bug 41069 has been marked as a duplicate of this bug. ***

mitz

Comment 9 2010-07-19 11:34:20 PDT

Applications linked against the shipping version of WebKit rely on the old parser behavior. There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac.

Adam Barth

Comment 10 2010-07-19 11:38:01 PDT

> Applications linked against the shipping version of WebKit rely on the old parser behavior. Perhaps you'd be interested in <http://trac.webkit.org/browser/trunk/WebKit/mac/WebView/WebPreferenceKeysPrivate.h#L96>? > There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac. Is there a sunset for how long the Mac port is required to support the old parser? It's quite complex, and it's going to bitrot.

mitz

Comment 11 2010-07-19 11:44:38 PDT

(In reply to comment #10) > > Applications linked against the shipping version of WebKit rely on the old parser behavior. > > Perhaps you'd be interested in <http://trac.webkit.org/browser/trunk/WebKit/mac/WebView/WebPreferenceKeysPrivate.h#L96>? I am aware of it, but keeping the legacy parser around is not desirable for obvious reasons. > > There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac. > > Is there a sunset for how long the Mac port is required to support the old parser? It's quite complex, and it's going to bitrot. The Mac port is not required to support the old parser, and in fact I just said that was un undesirble way to address the issue. Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk.

Adam Barth

Comment 12 2010-07-19 11:52:34 PDT

> Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk. Here's a document describing the parser changes that we're worried might case compatibility problems for WebKit-only content: https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en Do you think it would be better to have one setting that enabled all the quirks we need, or should we have a separate setting for each one? Same question: do we have a plan for sunsetting these quirks? For example, we could make them available only to applications linked before a certain date and then remove the setting them when that date is sufficiently far in the past. Also, we'll need to test the performance impact of adding these branches to the tokenizer. In general, you can see many of the branches in the benchmark score, so we might need to do something clever.

Alexey Proskuryakov

Comment 13 2010-07-19 11:59:24 PDT

> Same question: do we have a plan for sunsetting these quirks? > For example, we could make them available only to applications > linked before a certain date and then remove the setting them when > that date is sufficiently far in the past. I don't think we have an equivalent mechanism for Dashboard widgets, so the quirk will likely have to be permanent.

mitz

Comment 14 2010-07-19 12:06:41 PDT

(In reply to comment #12) > > Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk. > > Here's a document describing the parser changes that we're worried might case compatibility problems for WebKit-only content: > > https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en > > Do you think it would be better to have one setting that enabled all the quirks we need, or should we have a separate setting for each one? I don’t have a strong opinion on this. It might depend on the context in which we’d want to enable the quirks. For now, I can’t think of a reason to separate them. > Same question: do we have a plan for sunsetting these quirks? For example, we could make them available only to applications linked before a certain date and then remove the setting them when that date is sufficiently far in the past. For application compatibility, the quirks should only be enabled for select application bundle IDs (currently the set contains only AIM) and only if they were not linked against WebKit-534.1 or later. There is currently no plan to ever break AIM. For website compatibilty, if site-specific hacks are deployed, they can be revisited at any time. If specific websites are targeted, and those websites are updated to not require the quirk anymore, then it can be removed. If the quirk targets something different (for example, a specific version of some JavaScript library), then it’s harder to tell when to remove it. > Also, we'll need to test the performance impact of adding these branches to the tokenizer. In general, you can see many of the branches in the benchmark score, so we might need to do something clever. Of course. In the worst case, and if we opt for a single “quirks” switch, then templatizing a sufficiently large portion of the tokenizer should eliminate branches on any hot paths.

Adam Barth

Comment 15 2010-07-19 13:02:05 PDT

Sounds like a plan. I might need some help with the patch because I'm not overly familiar with the quirks infrastructure.

Mark Rowe (bdash)

Comment 16 2010-08-25 17:28:27 PDT

We’ve seen several more cases internal to Apple where this change in behavior has impacted existing content. They’ve primarily been web sites, but also email messages generated by reporting systems.

Adam Barth

Comment 17 2010-08-26 23:17:34 PDT

(In reply to comment #16) > We’ve seen several more cases internal to Apple where this change in behavior has impacted existing content. They’ve primarily been web sites, but also email messages generated by reporting systems. That's not overly surprising given that internal Apple content is more likely to be consumed with WebKit than with IE. Presumably these pieces of content are also broken in IE and Firefox 4.

Andy Estes

Comment 18 2010-09-10 01:56:06 PDT

<rdar://problem/8154379>

Adam Barth

Comment 19 2010-09-13 01:43:39 PDT

I offered to write a patch for this bug, but Maciej thought it would be better if someone else wrote the patch to get some experience with how this code works. One approach to fixing this bug is to add a Setting that can be used by the embedder for app-specific compatibility. The setting instructs the HTMLTokenizer to recognize '<' in the TagNameState and to re-consume it in the data state. One subtly is that you'll probably need to emit the current token before re-consuming the '<' character.

Andy Estes

Comment 20 2010-09-13 20:38:19 PDT

(In reply to comment #19) > I offered to write a patch for this bug, but Maciej thought it would be better if someone else wrote the patch to get some experience with how this code works. > > One approach to fixing this bug is to add a Setting that can be used by the embedder for app-specific compatibility. The setting instructs the HTMLTokenizer to recognize '<' in the TagNameState and to re-consume it in the data state. One subtly is that you'll probably need to emit the current token before re-consuming the '<' character. Adam, thanks for the pointer in the right direction. In addition to TagNameState, there were a few other states that needed this quirk to handle cases such as <tag attr<tag> and <tag <tag>, but it is essentially the same change in each state (for <tag attr<tag>, endAttributeName() needs to be called before reconsuming the < in DataState). I have the WebCore part of this patch written, but I still need to actually enable the quirk in WebKit and also write a layout test, which probably means also modifying DRT so that the quirk can be enabled from the test. I'll upload the WebCore part now if you want to take a look at it.

Andy Estes

Comment 21 2010-09-13 20:49:04 PDT

(In reply to comment #20) > I have the WebCore part of this patch written, but I still need to actually enable the quirk in WebKit and also write a layout test, which probably means also modifying DRT so that the quirk can be enabled from the test. On second thought, enabling this quirk is probably outside the scope of this bug. I'll just focus on adding the quirk capability and open other bugs for enabling it in specific cases.

Eric Seidel (no email)

Comment 22 2010-09-13 20:51:51 PDT

Quirks like this should be easy to test using some iframes and layoutTestController.overridePreference

Andy Estes

Comment 23 2010-09-13 21:02:03 PDT

Created attachment 67516 [details] Patch

Adam Barth

Comment 24 2010-09-13 21:12:16 PDT

Comment on attachment 67516 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch Overall, looks spot on. Thanks for working on this patch. > WebCore/html/parser/HTMLDocumentParser.cpp:95 > - , m_tokenizer(HTMLTokenizer::create()) > + , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings())) Be sure to add a test case that uses fragment parsing to make sure the setting properly propagates to the fragment parsing code. > WebCore/html/parser/HTMLTokenizer.cpp:106 > -HTMLTokenizer::HTMLTokenizer() > +HTMLTokenizer::HTMLTokenizer(Settings* settings) Do you think passing in the settings object is better than passing in the Document? It's slightly unclear to me which is better. > WebCore/html/parser/HTMLTokenizer.cpp:108 > + , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks()) useLegacyTokenizerQuirks? I guess it depends on what granularity we want to be able to control the quirks... Are you sure that frameless documents don't need this quirk? > WebCore/html/parser/HTMLTokenizer.cpp:440 > + else if (cc == '<' && m_useLegacyParserQuirks) I bet you should put m_useLegacyParserQuirks first. That branch will be very well predicted. > WebCore/html/parser/HTMLTokenizer.cpp:919 > + } else if (cc == '<' && m_useLegacyParserQuirks) { > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > + return emitAndReconsumeIn(source, DataState); It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.

Eric Seidel (no email)

Comment 25 2010-09-13 21:20:27 PDT

Comment on attachment 67516 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch > WebCore/html/parser/HTMLTokenizer.cpp:108 > + , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks()) I disagree. One flag is scary enough. Having N different parser modes is bad news bears. Andy is right to just have one flag IMO.

Andy Estes

Comment 26 2010-09-13 21:31:30 PDT

(In reply to comment #24) > (From update of attachment 67516 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch > > Overall, looks spot on. Thanks for working on this patch. > > > WebCore/html/parser/HTMLDocumentParser.cpp:95 > > - , m_tokenizer(HTMLTokenizer::create()) > > + , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings())) > Be sure to add a test case that uses fragment parsing to make sure the setting properly propagates to the fragment parsing code. Will do. > > > WebCore/html/parser/HTMLTokenizer.cpp:106 > > -HTMLTokenizer::HTMLTokenizer() > > +HTMLTokenizer::HTMLTokenizer(Settings* settings) > Do you think passing in the settings object is better than passing in the Document? It's slightly unclear to me which is better. My thought was that Settings was more specific to what HTMLTokenizer needs than Document. > > > WebCore/html/parser/HTMLTokenizer.cpp:108 > > + , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks()) > useLegacyTokenizerQuirks? I guess it depends on what granularity we want to be able to control the quirks... I'd like to use this same quirk to handle the <script /> case, and I believe that will need to be patched in HTMLTreeBuilder. So, using the work tokenizer might make it too specific. > > Are you sure that frameless documents don't need this quirk? Ahh, so since document()->settings() is really document()->m_frame->settings(), frameless documents would return a NULL Settings. I didn't consider this, so I'll think about the right thing to do. I don't know a lot about frameless documents, but I don't see why they wouldn't want the quirk. > > > WebCore/html/parser/HTMLTokenizer.cpp:440 > > + else if (cc == '<' && m_useLegacyParserQuirks) > I bet you should put m_useLegacyParserQuirks first. That branch will be very well predicted. Will do. > > > WebCore/html/parser/HTMLTokenizer.cpp:919 > > + } else if (cc == '<' && m_useLegacyParserQuirks) { > > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > > + return emitAndReconsumeIn(source, DataState); > It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be. This can probably be skipped for now, but I wouldn't be surprised if we find this case in the wild. The markup used by AIM comes close to hitting this case, but has whitespace after the end of the attribute (the attribute value is also quoted). Thanks for the feedback!

Andy Estes

Comment 27 2010-09-13 21:38:58 PDT

(In reply to comment #24) > > WebCore/html/parser/HTMLTokenizer.cpp:919 > > + } else if (cc == '<' && m_useLegacyParserQuirks) { > > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > > + return emitAndReconsumeIn(source, DataState); > It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be. If this turns out to be necessary, maybe it could be restricted to quoted attributes.

Adam Barth

Comment 28 2010-09-13 21:42:13 PDT

(In reply to comment #27) > (In reply to comment #24) > > > WebCore/html/parser/HTMLTokenizer.cpp:919 > > > + } else if (cc == '<' && m_useLegacyParserQuirks) { > > > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > > > + return emitAndReconsumeIn(source, DataState); > > It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be. > > If this turns out to be necessary, maybe it could be restricted to quoted attributes. I'm confused. Do you mean this case: <div foo="hello<script>alert(1)</script>there"> or this case: <div foo="hello"<script>alert(1)</script> ?

Alexey Proskuryakov

Comment 29 2010-09-13 22:11:25 PDT

> m_useLegacyParserQuirks If this is going to be a single switch, it still needs a more specific name (maybe PreHTML5?). We won't want to add HTML5 behaviors here when HTML5v2 comes out, but this name will still be suggesting that we should.

Andy Estes

Comment 30 2010-09-14 21:55:16 PDT

(In reply to comment #28) > (In reply to comment #27) > > (In reply to comment #24) > > > > WebCore/html/parser/HTMLTokenizer.cpp:919 > > > > + } else if (cc == '<' && m_useLegacyParserQuirks) { > > > > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > > > > + return emitAndReconsumeIn(source, DataState); > > > It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be. > > > > If this turns out to be necessary, maybe it could be restricted to quoted attributes. > > I'm confused. Do you mean this case: > > <div foo="hello<script>alert(1)</script>there"> > > or this case: > > <div foo="hello"<script>alert(1)</script> > > ? The second case. I don't believe the legacy parser emitted a script token in the first case.

Adam Barth

Comment 31 2010-09-14 21:58:55 PDT

> The second case. I don't believe the legacy parser emitted a script token in the first case. Ok. :)

Andy Estes

Comment 32 2010-09-17 03:10:33 PDT

Created attachment 67897 [details] Patch

Andy Estes

Comment 33 2010-09-17 03:14:34 PDT

(In reply to comment #32) > Created an attachment (id=67897) [details] > Patch A few notes: - I couldn't decide how to handle the case of frameless document so I didn't. I'm not aware of a case where this will cause a regression, but now that I think of it I should probably add a FIXME. - I made my best effort to add the new WebPreference to all the ports so they can run the layout tests, but I can't be sure they're all going to compile. I'll try to land this during a quiet period so I can clean up any breakage without causing too much trouble.

WebKit Review Bot

Comment 34 2010-09-17 03:47:08 PDT

Attachment 67897 [details] did not build on gtk: Build output: http://queues.webkit.org/results/4012057

Andy Estes

Comment 35 2010-09-17 04:00:55 PDT

Created attachment 67899 [details] Patch

WebKit Review Bot

Comment 36 2010-09-17 04:29:48 PDT

Attachment 67899 [details] did not build on gtk: Build output: http://queues.webkit.org/results/4043039

Andy Estes

Comment 37 2010-09-17 04:33:53 PDT

Created attachment 67900 [details] Patch

WebKit Review Bot

Comment 38 2010-09-17 04:55:43 PDT

Attachment 67897 [details] did not build on win: Build output: http://queues.webkit.org/results/4029046

Andy Estes

Comment 39 2010-09-17 05:34:51 PDT

Created attachment 67902 [details] Patch

WebKit Review Bot

Comment 40 2010-09-17 06:46:08 PDT

Attachment 67900 [details] did not build on win: Build output: http://queues.webkit.org/results/4063037

Darin Adler

Comment 41 2010-09-17 10:07:48 PDT

Comment on attachment 67902 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch > WebCore/html/parser/HTMLTokenizer.h:122 > - static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); } > + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); } It’s quite inelegant to tie the tokenizer and parser to document settings in this fashion. It would be quite a bit better to have the parser and tokenizer publish some sort of object or flags word that expresses the quirks to be used, then we could have the code that creates the parser convert the settings into that form. Having the tokenizer read the settings directly is inelegant and I think too big a change to the class. > WebCore/html/parser/HTMLTokenizer.h:268 > + inline bool usePreHTML5ParserQuirks(); Not sure why the existing code includes these “inline” keywords. They don’t do any good that I can see. I don’t think that all pre-HTML5 parser quirks should be grouped in a single flag. I think we should have an explicit flag for each quirk.

Adam Barth

Comment 42 2010-09-17 10:15:57 PDT

Comment on attachment 67902 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch > WebCore/html/parser/HTMLTokenizer.cpp:1085 > + else if (usePreHTML5ParserQuirks() && cc == '<') > + return emitAndReconsumeIn(source, DataState); It's slightly unclear to me why you chose these states for this quirk. I would have expected it in these states: TagNameState BeforeAttributeNameState AttributeNameState * AfterAttributeNameState * AfterAttributeValueQuotedState I've marked the ones you didn't add with a *. You can test those as follows: <foo bar<qux> <foo bar <qux> > WebCore/html/parser/HTMLTokenizer.cpp:1679 > +inline bool HTMLTokenizer::usePreHTML5ParserQuirks() > +{ > + return (m_settings && m_settings->usePreHTML5ParserQuirks()); > +} Have you run the HTML parsing benchmark on this code? You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html. I suspect you'll get a measurable slowdown from this patch. It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time.

Darin Adler

Comment 43 2010-09-17 10:19:13 PDT

(In reply to comment #42) > Have you run the HTML parsing benchmark on this code? You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html. I suspect you'll get a measurable slowdown from this patch. It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time. Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design.

Adam Barth

Comment 44 2010-09-17 10:21:31 PDT

> Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design. Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch. However, instead of speculating, we can just run the benchmark.

Andy Estes

Comment 45 2010-09-17 12:35:15 PDT

(In reply to comment #41) > (From update of attachment 67902 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch > > > WebCore/html/parser/HTMLTokenizer.h:122 > > - static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); } > > + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); } > > It’s quite inelegant to tie the tokenizer and parser to document settings in this fashion. > > It would be quite a bit better to have the parser and tokenizer publish some sort of object or flags word that expresses the quirks to be used, then we could have the code that creates the parser convert the settings into that form. Having the tokenizer read the settings directly is inelegant and I think too big a change to the class. This was what I did in my first patch, but the issue was that layout tests enable the quirk after the parser is created, hence the current design. I probably shouldn't have designed for the pathological case, and now that I understand better how WebPreferences work, I think I can write a test that sets the preference then does its work in an iframe to ensure the right ordering. > > > WebCore/html/parser/HTMLTokenizer.h:268 > > + inline bool usePreHTML5ParserQuirks(); > > Not sure why the existing code includes these “inline” keywords. They don’t do any good that I can see. > > I don’t think that all pre-HTML5 parser quirks should be grouped in a single flag. I think we should have an explicit flag for each quirk. There seemed to be a consensus that a single quirk was a better approach. Would there be a case where we'd want to enable the quirk for <tag<tag> but not for <tag <tag>? To me, the three changes to the tokenizer handle variants of a single issue so a single quirk settings makes sense.

Andy Estes

Comment 46 2010-09-17 12:39:53 PDT

(In reply to comment #42) > (From update of attachment 67902 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch > > > WebCore/html/parser/HTMLTokenizer.cpp:1085 > > + else if (usePreHTML5ParserQuirks() && cc == '<') > > + return emitAndReconsumeIn(source, DataState); > > It's slightly unclear to me why you chose these states for this quirk. I would have expected it in these states: > > TagNameState > BeforeAttributeNameState > AttributeNameState * > AfterAttributeNameState * > AfterAttributeValueQuotedState > > I've marked the ones you didn't add with a *. You can test those as follows: > > <foo bar<qux> I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values? > <foo bar <qux> You're right, I missed this one. > > > WebCore/html/parser/HTMLTokenizer.cpp:1679 > > +inline bool HTMLTokenizer::usePreHTML5ParserQuirks() > > +{ > > + return (m_settings && m_settings->usePreHTML5ParserQuirks()); > > +} > > Have you run the HTML parsing benchmark on this code? You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html. I suspect you'll get a measurable slowdown from this patch. It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time. That's the design I'm going back to.

Adam Barth

Comment 47 2010-09-17 13:09:54 PDT

Comment on attachment 67902 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch >>> WebCore/html/parser/HTMLTokenizer.cpp:1085 >>> + return emitAndReconsumeIn(source, DataState); >> >> It's slightly unclear to me why you chose these states for this quirk. I would have expected it in these states: >> >> TagNameState >> BeforeAttributeNameState >> AttributeNameState * >> AfterAttributeNameState * >> AfterAttributeValueQuotedState >> >> I've marked the ones you didn't add with a *. You can test those as follows: >> >> <foo bar<qux> >> <foo bar <qux> > > I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values? Yes. Unquoted attribute values are the problematic ones from a security point of view.

Andy Estes

Comment 48 2010-09-17 13:29:57 PDT

(In reply to comment #47) > (From update of attachment 67902 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch > > >>> WebCore/html/parser/HTMLTokenizer.cpp:1085 > >>> + return emitAndReconsumeIn(source, DataState); > >> > >> It's slightly unclear to me why you chose these states for this quirk. I would have expected it in these states: > >> > >> TagNameState > >> BeforeAttributeNameState > >> AttributeNameState * > >> AfterAttributeNameState * > >> AfterAttributeValueQuotedState > >> > >> I've marked the ones you didn't add with a *. You can test those as follows: > >> > >> <foo bar<qux> > >> <foo bar <qux> > > > > I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values? > > Yes. Unquoted attribute values are the problematic ones from a security point of view. I guess there's no such thing as a quoted attribute name :) Nevermind...fixed both cases.

Andy Estes

Comment 49 2010-09-17 18:01:24 PDT

Created attachment 67984 [details] Patch

Andy Estes

Comment 50 2010-09-17 18:03:57 PDT

(In reply to comment #44) > > Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design. > > Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch. However, instead of speculating, we can just run the benchmark. I'll run the benchmark against the latest patch and report what I find.

Andy Estes

Comment 51 2010-09-17 23:34:57 PDT

(In reply to comment #50) > (In reply to comment #44) > > > Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design. > > > > Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch. However, instead of speculating, we can just run the benchmark. > > I'll run the benchmark against the latest patch and report what I find. I ran the benchmark a total of 6 times in a 64-bit production build of Safari, once with my patch applied and once without. Here is what I found: With patch: avg: 2741.5 stdev: 20.5 avg: 2739.2 stdev: 20.4 avg: 2753.2 stdev: 45.8 Without patch: avg: 2773.3 stdev: 20 avg: 2775.9 stdev: 21.4 avg: 2770.2 stdev: 20.5

Adam Barth

Comment 52 2010-09-18 13:46:20 PDT

Looks like you made it faster by a hair. :)

Darin Adler

Comment 53 2010-09-18 16:21:23 PDT

Comment on attachment 67984 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch > WebCore/html/parser/HTMLTokenizer.h:123 > - static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); } > + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); } This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all. It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded. I suppose we can live with this, but I don’t like it.

Adam Barth

Comment 54 2010-09-18 16:28:27 PDT

Comment on attachment 67984 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch >> WebCore/html/parser/HTMLTokenizer.h:123 >> + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); } > > This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all. > > It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded. > > I suppose we can live with this, but I don’t like it. I'll post a patch to remove Frame from updateStateFor.

Adam Barth

Comment 55 2010-09-18 19:07:52 PDT

> I'll post a patch to remove Frame from updateStateFor. I looked at this, but it seems more error prone / verbose. :(

WebKit Review Bot

Comment 56 2010-09-18 23:11:34 PDT

Attachment 67984 [details] did not build on win: Build output: http://queues.webkit.org/results/4009066

Darin Adler

Comment 57 2010-09-19 18:58:24 PDT

(In reply to comment #55) > > I'll post a patch to remove Frame from updateStateFor. > > I looked at this, but it seems more error prone / verbose. :( Yes, I noticed the same thing. Frame doesn’t express the set of tags allowed in a way that can easily be passed in, so this convenience function is only convenient with Frame*. I still think we might some day figure out a way to fix that, though. Or maybe may desire for “layering” here is misguided.

Andy Estes

Comment 58 2010-09-19 23:58:25 PDT

Created attachment 68055 [details] Patch

Andy Estes

Comment 59 2010-09-20 00:08:30 PDT

(In reply to comment #53) > (From update of attachment 67984 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch > > > WebCore/html/parser/HTMLTokenizer.h:123 > > - static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); } > > + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); } > > This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all. > > It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded. > > I suppose we can live with this, but I don’t like it. The latest patch moves the Settings* dependency up to HTMLDocumentParser.

WebKit Review Bot

Comment 60 2010-09-20 00:33:31 PDT

Attachment 68055 [details] did not build on chromium: Build output: http://queues.webkit.org/results/4092005

Eric Seidel (no email)

Comment 61 2010-09-20 00:34:19 PDT

Comment on attachment 68055 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=68055&action=review You'll want Abarth to review the tokenizer changes. He's more familiar with that part of the parser than I am. > WebCore/html/parser/HTMLDocumentParser.cpp:83 > + , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks())) Seems this should be a helper function which takes a Document*. We used this pattern when the HTML5 parser was optional too. > WebCore/html/parser/HTMLDocumentParser.cpp:96 > + , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings() && fragment->document()->settings()->usePreHTML5ParserQuirks())) Would be re-used here. > WebCore/html/parser/HTMLPreloadScanner.cpp:128 > + m_tokenizer = HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks()); And here. This seems wrong. We don't need to move the initializer out of the normal c++ pattern. The ASSERT only helps in debug mode. ASSERTing after we've hosed ourselves is just as good as before. > WebCore/html/parser/HTMLViewSourceParser.cpp:37 > + , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks())) You want your inline again here. :) Although it's possible you won't make another copy of the inline in this file since it's only used once. When we had the HTML5 parser optional, we had a: static inline useLegacyHTMLParser(Document*) function which did similar to this check and handled the null frame case. WE had two copies. one in the treebuilder and one in the DocumentParser. Seems you need to do the check more often than that, so maybe it needs to be a class method/static on DocumentParser or HTMLDocumentParser? > WebKit/gtk/webkit/webkitwebsettings.cpp:111 > + gboolean enable_pre_html5_parser_quirks; I'm not sure any other port besides Mac wants this. I think this could be a mac-only quirk with mac-only layout tests. We really want this quirk to die ASAP (yes, I know that could be years).

WebKit Review Bot

Comment 62 2010-09-20 01:20:00 PDT

Attachment 68055 [details] did not build on win: Build output: http://queues.webkit.org/results/4082033

Andy Estes

Comment 63 2010-09-20 12:07:28 PDT

(In reply to comment #61) > (From update of attachment 68055 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=68055&action=review > > You'll want Abarth to review the tokenizer changes. He's more familiar with that part of the parser than I am. > > > WebCore/html/parser/HTMLDocumentParser.cpp:83 > > + , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks())) > > Seems this should be a helper function which takes a Document*. We used this pattern when the HTML5 parser was optional too. Done. > > > WebKit/gtk/webkit/webkitwebsettings.cpp:111 > > + gboolean enable_pre_html5_parser_quirks; > > I'm not sure any other port besides Mac wants this. I think this could be a mac-only quirk with mac-only layout tests. We really want this quirk to die ASAP (yes, I know that could be years). I guess skipping it on non-Mac ports makes sense since only the Mac port will enable it, although at the time I wrote it I didn't see how having all platforms run the test could hurt. Now that various iterations of this patch are breaking on the EWS bots, I see it as a maintenance burden.

Andy Estes

Comment 64 2010-09-20 14:18:27 PDT

Created attachment 68134 [details] Patch

Adam Barth

Comment 65 2010-09-20 14:26:16 PDT

Comment on attachment 68134 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=68134&action=review > WebCore/html/parser/HTMLDocumentParser.h:145 > +bool usePreHTML5ParserQuirks(Document*); Generally, we prefer to have these sorts of functions be static methods of a class: HTMLDocumentParser::usePreHTML5Quirks(...) That makes it easier to find the implementation of the function when you see it called in a random file. > LayoutTests/fast/parser/resources/pre-html5-parser-quirk-after-attribute-value-quoted-state.html:4 > + <p></p attr="value"</div> Can you add a test for <p attr=value</ ?

Andy Estes

Comment 66 2010-09-20 16:01:32 PDT

Committed r67890: <http://trac.webkit.org/changeset/67890>

Andy Estes

Comment 67 2010-09-21 00:34:20 PDT

Created attachment 68198 [details] Patch

Andy Estes

Comment 68 2010-09-21 00:35:04 PDT

Reopening to add additional test coverage.

Adam Barth

Comment 69 2010-09-21 00:50:09 PDT

Comment on attachment 68198 [details] Patch Thanks.

WebKit Commit Bot

Comment 70 2010-09-21 03:21:32 PDT

Comment on attachment 68198 [details] Patch Clearing flags on attachment: 68198 Committed r67936: <http://trac.webkit.org/changeset/67936>

WebKit Commit Bot

Comment 71 2010-09-21 03:21:44 PDT

All reviewed patches have been landed. Closing bug.

Attachments
screenshot showing bad layout (141.23 KB, image/png) 2010-06-21 17:30 PDT, William Siegrist	no flags	Details
reduced test case (90 bytes, text/html) 2010-06-22 11:25 PDT, Alexey Proskuryakov	no flags	Details
Patch (9.03 KB, patch) 2010-09-13 21:02 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Patch (38.94 KB, patch) 2010-09-17 03:10 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Patch (40.60 KB, patch) 2010-09-17 04:00 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Patch (40.09 KB, patch) 2010-09-17 04:33 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Patch (38.77 KB, patch) 2010-09-17 05:34 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Patch (38.26 KB, patch) 2010-09-17 18:01 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Patch (38.68 KB, patch) 2010-09-19 23:58 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Patch (26.33 KB, patch) 2010-09-20 14:18 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Patch (7.48 KB, patch) 2010-09-21 00:34 PDT, Andy Estes	no flags	Details Formatted Diff Diff
Show Obsolete (8) View All Add attachment proposed patch, testcase, etc.