WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
40961
REGRESSION (HTML5 Parser): Pages broken due to <tag<tag> parsing changes
https://bugs.webkit.org/show_bug.cgi?id=40961
Summary
REGRESSION (HTML5 Parser): Pages broken due to <tag<tag> parsing changes
William Siegrist
Reported
2010-06-21 17:30:11 PDT
Created
attachment 59318
[details]
screenshot showing bad layout The MacRuby website does not layout correctly on
r61502
. The main body of the website is shifted to the right and lines up with the red boxes on the right instead of the left edge of the black navigation bar. See attached screenshot. The website looks correct on stock 10.6.4 WebKit.
Attachments
screenshot showing bad layout
(141.23 KB, image/png)
2010-06-21 17:30 PDT
,
William Siegrist
no flags
Details
reduced test case
(90 bytes, text/html)
2010-06-22 11:25 PDT
,
Alexey Proskuryakov
no flags
Details
Patch
(9.03 KB, patch)
2010-09-13 21:02 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Patch
(38.94 KB, patch)
2010-09-17 03:10 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Patch
(40.60 KB, patch)
2010-09-17 04:00 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Patch
(40.09 KB, patch)
2010-09-17 04:33 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Patch
(38.77 KB, patch)
2010-09-17 05:34 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Patch
(38.26 KB, patch)
2010-09-17 18:01 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Patch
(38.68 KB, patch)
2010-09-19 23:58 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Patch
(26.33 KB, patch)
2010-09-20 14:18 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Patch
(7.48 KB, patch)
2010-09-21 00:34 PDT
,
Andy Estes
no flags
Details
Formatted Diff
Diff
Show Obsolete
(8)
View All
Add attachment
proposed patch, testcase, etc.
Alexey Proskuryakov
Comment 1
2010-06-22 11:10:43 PDT
Regressed in
r61285
.
Alexey Proskuryakov
Comment 2
2010-06-22 11:25:31 PDT
Created
attachment 59397
[details]
reduced test case This test passes in Safari 5 and in Firefox 3.6.3, but not in IE8. The site is also happily broken in IE.
Adam Barth
Comment 3
2010-06-22 11:48:38 PDT
Thanks. I'll look into this today.
Adam Barth
Comment 4
2010-06-22 13:44:29 PDT
We seem to be doing what the spec says to do here (and we match Minefield). I've sent an email to the HTML working group. This issue is on our "top five" list of behavioral differences like to cause compatibility problems:
https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en
Let's give the working group a couple days to respond before taking action.
Adam Barth
Comment 5
2010-06-22 13:48:19 PDT
> This test passes in Safari 5 and in Firefox 3.6.3, but not in IE8. The site is also happily broken in IE.
The test also fails in Opera 10.54.
William Siegrist
Comment 6
2010-06-22 13:58:14 PDT
I filed a MacRuby bug with the fix for them:
https://www.macruby.org/trac/ticket/759
Adam Barth
Comment 7
2010-06-22 14:53:52 PDT
See
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9985
Adam Barth
Comment 8
2010-07-09 11:16:35 PDT
***
Bug 41069
has been marked as a duplicate of this bug. ***
mitz
Comment 9
2010-07-19 11:34:20 PDT
Applications linked against the shipping version of WebKit rely on the old parser behavior. There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac.
Adam Barth
Comment 10
2010-07-19 11:38:01 PDT
> Applications linked against the shipping version of WebKit rely on the old parser behavior.
Perhaps you'd be interested in <
http://trac.webkit.org/browser/trunk/WebKit/mac/WebView/WebPreferenceKeysPrivate.h#L96
>?
> There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac.
Is there a sunset for how long the Mac port is required to support the old parser? It's quite complex, and it's going to bitrot.
mitz
Comment 11
2010-07-19 11:44:38 PDT
(In reply to
comment #10
)
> > Applications linked against the shipping version of WebKit rely on the old parser behavior. > > Perhaps you'd be interested in <
http://trac.webkit.org/browser/trunk/WebKit/mac/WebView/WebPreferenceKeysPrivate.h#L96
>?
I am aware of it, but keeping the legacy parser around is not desirable for obvious reasons.
> > There needs to be a way to give that behavior to applications (and perhaps websites) that rely on it. One such application is AIM for Mac. > > Is there a sunset for how long the Mac port is required to support the old parser? It's quite complex, and it's going to bitrot.
The Mac port is not required to support the old parser, and in fact I just said that was un undesirble way to address the issue. Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk.
Adam Barth
Comment 12
2010-07-19 11:52:34 PDT
> Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk.
Here's a document describing the parser changes that we're worried might case compatibility problems for WebKit-only content:
https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en
Do you think it would be better to have one setting that enabled all the quirks we need, or should we have a separate setting for each one? Same question: do we have a plan for sunsetting these quirks? For example, we could make them available only to applications linked before a certain date and then remove the setting them when that date is sufficiently far in the past. Also, we'll need to test the performance impact of adding these branches to the tokenizer. In general, you can see many of the branches in the benchmark score, so we might need to do something clever.
Alexey Proskuryakov
Comment 13
2010-07-19 11:59:24 PDT
> Same question: do we have a plan for sunsetting these quirks? > For example, we could make them available only to applications > linked before a certain date and then remove the setting them when > that date is sufficiently far in the past.
I don't think we have an equivalent mechanism for Dashboard widgets, so the quirk will likely have to be permanent.
mitz
Comment 14
2010-07-19 12:06:41 PDT
(In reply to
comment #12
)
> > Perhaps what caused the confusion was that I used the phrase “the old parser behavior”. I was referring specifically to the one aspect of the old parser that this bug is about. I think a good solution would be to add this quirk to the parser and a setting for enabling the quirk. > > Here's a document describing the parser changes that we're worried might case compatibility problems for WebKit-only content: > >
https://docs.google.com/document/edit?id=1as5xYjyMSCph4960iz0-Kb7hZKf_L6f2vts57NMcVBI&hl=en
> > Do you think it would be better to have one setting that enabled all the quirks we need, or should we have a separate setting for each one?
I don’t have a strong opinion on this. It might depend on the context in which we’d want to enable the quirks. For now, I can’t think of a reason to separate them.
> Same question: do we have a plan for sunsetting these quirks? For example, we could make them available only to applications linked before a certain date and then remove the setting them when that date is sufficiently far in the past.
For application compatibility, the quirks should only be enabled for select application bundle IDs (currently the set contains only AIM) and only if they were not linked against WebKit-534.1 or later. There is currently no plan to ever break AIM. For website compatibilty, if site-specific hacks are deployed, they can be revisited at any time. If specific websites are targeted, and those websites are updated to not require the quirk anymore, then it can be removed. If the quirk targets something different (for example, a specific version of some JavaScript library), then it’s harder to tell when to remove it.
> Also, we'll need to test the performance impact of adding these branches to the tokenizer. In general, you can see many of the branches in the benchmark score, so we might need to do something clever.
Of course. In the worst case, and if we opt for a single “quirks” switch, then templatizing a sufficiently large portion of the tokenizer should eliminate branches on any hot paths.
Adam Barth
Comment 15
2010-07-19 13:02:05 PDT
Sounds like a plan. I might need some help with the patch because I'm not overly familiar with the quirks infrastructure.
Mark Rowe (bdash)
Comment 16
2010-08-25 17:28:27 PDT
We’ve seen several more cases internal to Apple where this change in behavior has impacted existing content. They’ve primarily been web sites, but also email messages generated by reporting systems.
Adam Barth
Comment 17
2010-08-26 23:17:34 PDT
(In reply to
comment #16
)
> We’ve seen several more cases internal to Apple where this change in behavior has impacted existing content. They’ve primarily been web sites, but also email messages generated by reporting systems.
That's not overly surprising given that internal Apple content is more likely to be consumed with WebKit than with IE. Presumably these pieces of content are also broken in IE and Firefox 4.
Andy Estes
Comment 18
2010-09-10 01:56:06 PDT
<
rdar://problem/8154379
>
Adam Barth
Comment 19
2010-09-13 01:43:39 PDT
I offered to write a patch for this bug, but Maciej thought it would be better if someone else wrote the patch to get some experience with how this code works. One approach to fixing this bug is to add a Setting that can be used by the embedder for app-specific compatibility. The setting instructs the HTMLTokenizer to recognize '<' in the TagNameState and to re-consume it in the data state. One subtly is that you'll probably need to emit the current token before re-consuming the '<' character.
Andy Estes
Comment 20
2010-09-13 20:38:19 PDT
(In reply to
comment #19
)
> I offered to write a patch for this bug, but Maciej thought it would be better if someone else wrote the patch to get some experience with how this code works. > > One approach to fixing this bug is to add a Setting that can be used by the embedder for app-specific compatibility. The setting instructs the HTMLTokenizer to recognize '<' in the TagNameState and to re-consume it in the data state. One subtly is that you'll probably need to emit the current token before re-consuming the '<' character.
Adam, thanks for the pointer in the right direction. In addition to TagNameState, there were a few other states that needed this quirk to handle cases such as <tag attr<tag> and <tag <tag>, but it is essentially the same change in each state (for <tag attr<tag>, endAttributeName() needs to be called before reconsuming the < in DataState). I have the WebCore part of this patch written, but I still need to actually enable the quirk in WebKit and also write a layout test, which probably means also modifying DRT so that the quirk can be enabled from the test. I'll upload the WebCore part now if you want to take a look at it.
Andy Estes
Comment 21
2010-09-13 20:49:04 PDT
(In reply to
comment #20
)
> I have the WebCore part of this patch written, but I still need to actually enable the quirk in WebKit and also write a layout test, which probably means also modifying DRT so that the quirk can be enabled from the test.
On second thought, enabling this quirk is probably outside the scope of this bug. I'll just focus on adding the quirk capability and open other bugs for enabling it in specific cases.
Eric Seidel (no email)
Comment 22
2010-09-13 20:51:51 PDT
Quirks like this should be easy to test using some iframes and layoutTestController.overridePreference
Andy Estes
Comment 23
2010-09-13 21:02:03 PDT
Created
attachment 67516
[details]
Patch
Adam Barth
Comment 24
2010-09-13 21:12:16 PDT
Comment on
attachment 67516
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch
Overall, looks spot on. Thanks for working on this patch.
> WebCore/html/parser/HTMLDocumentParser.cpp:95 > - , m_tokenizer(HTMLTokenizer::create()) > + , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings()))
Be sure to add a test case that uses fragment parsing to make sure the setting properly propagates to the fragment parsing code.
> WebCore/html/parser/HTMLTokenizer.cpp:106 > -HTMLTokenizer::HTMLTokenizer() > +HTMLTokenizer::HTMLTokenizer(Settings* settings)
Do you think passing in the settings object is better than passing in the Document? It's slightly unclear to me which is better.
> WebCore/html/parser/HTMLTokenizer.cpp:108 > + , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks())
useLegacyTokenizerQuirks? I guess it depends on what granularity we want to be able to control the quirks... Are you sure that frameless documents don't need this quirk?
> WebCore/html/parser/HTMLTokenizer.cpp:440 > + else if (cc == '<' && m_useLegacyParserQuirks)
I bet you should put m_useLegacyParserQuirks first. That branch will be very well predicted.
> WebCore/html/parser/HTMLTokenizer.cpp:919 > + } else if (cc == '<' && m_useLegacyParserQuirks) { > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > + return emitAndReconsumeIn(source, DataState);
It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.
Eric Seidel (no email)
Comment 25
2010-09-13 21:20:27 PDT
Comment on
attachment 67516
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch
> WebCore/html/parser/HTMLTokenizer.cpp:108 > + , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks())
I disagree. One flag is scary enough. Having N different parser modes is bad news bears. Andy is right to just have one flag IMO.
Andy Estes
Comment 26
2010-09-13 21:31:30 PDT
(In reply to
comment #24
)
> (From update of
attachment 67516
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=67516&action=prettypatch
> > Overall, looks spot on. Thanks for working on this patch. > > > WebCore/html/parser/HTMLDocumentParser.cpp:95 > > - , m_tokenizer(HTMLTokenizer::create()) > > + , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings())) > Be sure to add a test case that uses fragment parsing to make sure the setting properly propagates to the fragment parsing code.
Will do.
> > > WebCore/html/parser/HTMLTokenizer.cpp:106 > > -HTMLTokenizer::HTMLTokenizer() > > +HTMLTokenizer::HTMLTokenizer(Settings* settings) > Do you think passing in the settings object is better than passing in the Document? It's slightly unclear to me which is better.
My thought was that Settings was more specific to what HTMLTokenizer needs than Document.
> > > WebCore/html/parser/HTMLTokenizer.cpp:108 > > + , m_useLegacyParserQuirks(settings && settings->useLegacyParserQuirks()) > useLegacyTokenizerQuirks? I guess it depends on what granularity we want to be able to control the quirks...
I'd like to use this same quirk to handle the <script /> case, and I believe that will need to be patched in HTMLTreeBuilder. So, using the work tokenizer might make it too specific.
> > Are you sure that frameless documents don't need this quirk?
Ahh, so since document()->settings() is really document()->m_frame->settings(), frameless documents would return a NULL Settings. I didn't consider this, so I'll think about the right thing to do. I don't know a lot about frameless documents, but I don't see why they wouldn't want the quirk.
> > > WebCore/html/parser/HTMLTokenizer.cpp:440 > > + else if (cc == '<' && m_useLegacyParserQuirks) > I bet you should put m_useLegacyParserQuirks first. That branch will be very well predicted.
Will do.
> > > WebCore/html/parser/HTMLTokenizer.cpp:919 > > + } else if (cc == '<' && m_useLegacyParserQuirks) { > > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > > + return emitAndReconsumeIn(source, DataState); > It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.
This can probably be skipped for now, but I wouldn't be surprised if we find this case in the wild. The markup used by AIM comes close to hitting this case, but has whitespace after the end of the attribute (the attribute value is also quoted). Thanks for the feedback!
Andy Estes
Comment 27
2010-09-13 21:38:58 PDT
(In reply to
comment #24
)
> > WebCore/html/parser/HTMLTokenizer.cpp:919 > > + } else if (cc == '<' && m_useLegacyParserQuirks) { > > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > > + return emitAndReconsumeIn(source, DataState); > It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be.
If this turns out to be necessary, maybe it could be restricted to quoted attributes.
Adam Barth
Comment 28
2010-09-13 21:42:13 PDT
(In reply to
comment #27
)
> (In reply to
comment #24
) > > > WebCore/html/parser/HTMLTokenizer.cpp:919 > > > + } else if (cc == '<' && m_useLegacyParserQuirks) { > > > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > > > + return emitAndReconsumeIn(source, DataState); > > It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be. > > If this turns out to be necessary, maybe it could be restricted to quoted attributes.
I'm confused. Do you mean this case: <div foo="hello<script>alert(1)</script>there"> or this case: <div foo="hello"<script>alert(1)</script> ?
Alexey Proskuryakov
Comment 29
2010-09-13 22:11:25 PDT
> m_useLegacyParserQuirks
If this is going to be a single switch, it still needs a more specific name (maybe PreHTML5?). We won't want to add HTML5 behaviors here when HTML5v2 comes out, but this name will still be suggesting that we should.
Andy Estes
Comment 30
2010-09-14 21:55:16 PDT
(In reply to
comment #28
)
> (In reply to
comment #27
) > > (In reply to
comment #24
) > > > > WebCore/html/parser/HTMLTokenizer.cpp:919 > > > > + } else if (cc == '<' && m_useLegacyParserQuirks) { > > > > + m_token->endAttributeName(source.numberOfCharactersConsumed()); > > > > + return emitAndReconsumeIn(source, DataState); > > > It would be nice if we could skip this one. The characters used to terminate unquoted attributes are often security sensitive because folks sometimes allow untrusted content in unquoted attributes (even thought that's super scary). By adding "<" as a terminator character, "foo<script>alert(1)</script>" becomes an XSS vector in an unquoted attribute when it otherwise wouldn't be. > > > > If this turns out to be necessary, maybe it could be restricted to quoted attributes. > > I'm confused. Do you mean this case: > > <div foo="hello<script>alert(1)</script>there"> > > or this case: > > <div foo="hello"<script>alert(1)</script> > > ?
The second case. I don't believe the legacy parser emitted a script token in the first case.
Adam Barth
Comment 31
2010-09-14 21:58:55 PDT
> The second case. I don't believe the legacy parser emitted a script token in the first case.
Ok. :)
Andy Estes
Comment 32
2010-09-17 03:10:33 PDT
Created
attachment 67897
[details]
Patch
Andy Estes
Comment 33
2010-09-17 03:14:34 PDT
(In reply to
comment #32
)
> Created an attachment (id=67897) [details] > Patch
A few notes: - I couldn't decide how to handle the case of frameless document so I didn't. I'm not aware of a case where this will cause a regression, but now that I think of it I should probably add a FIXME. - I made my best effort to add the new WebPreference to all the ports so they can run the layout tests, but I can't be sure they're all going to compile. I'll try to land this during a quiet period so I can clean up any breakage without causing too much trouble.
WebKit Review Bot
Comment 34
2010-09-17 03:47:08 PDT
Attachment 67897
[details]
did not build on gtk: Build output:
http://queues.webkit.org/results/4012057
Andy Estes
Comment 35
2010-09-17 04:00:55 PDT
Created
attachment 67899
[details]
Patch
WebKit Review Bot
Comment 36
2010-09-17 04:29:48 PDT
Attachment 67899
[details]
did not build on gtk: Build output:
http://queues.webkit.org/results/4043039
Andy Estes
Comment 37
2010-09-17 04:33:53 PDT
Created
attachment 67900
[details]
Patch
WebKit Review Bot
Comment 38
2010-09-17 04:55:43 PDT
Attachment 67897
[details]
did not build on win: Build output:
http://queues.webkit.org/results/4029046
Andy Estes
Comment 39
2010-09-17 05:34:51 PDT
Created
attachment 67902
[details]
Patch
WebKit Review Bot
Comment 40
2010-09-17 06:46:08 PDT
Attachment 67900
[details]
did not build on win: Build output:
http://queues.webkit.org/results/4063037
Darin Adler
Comment 41
2010-09-17 10:07:48 PDT
Comment on
attachment 67902
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
> WebCore/html/parser/HTMLTokenizer.h:122 > - static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); } > + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); }
It’s quite inelegant to tie the tokenizer and parser to document settings in this fashion. It would be quite a bit better to have the parser and tokenizer publish some sort of object or flags word that expresses the quirks to be used, then we could have the code that creates the parser convert the settings into that form. Having the tokenizer read the settings directly is inelegant and I think too big a change to the class.
> WebCore/html/parser/HTMLTokenizer.h:268 > + inline bool usePreHTML5ParserQuirks();
Not sure why the existing code includes these “inline” keywords. They don’t do any good that I can see. I don’t think that all pre-HTML5 parser quirks should be grouped in a single flag. I think we should have an explicit flag for each quirk.
Adam Barth
Comment 42
2010-09-17 10:15:57 PDT
Comment on
attachment 67902
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
> WebCore/html/parser/HTMLTokenizer.cpp:1085 > + else if (usePreHTML5ParserQuirks() && cc == '<') > + return emitAndReconsumeIn(source, DataState);
It's slightly unclear to me why you chose these states for this quirk. I would have expected it in these states: TagNameState BeforeAttributeNameState AttributeNameState * AfterAttributeNameState * AfterAttributeValueQuotedState I've marked the ones you didn't add with a *. You can test those as follows: <foo bar<qux> <foo bar <qux>
> WebCore/html/parser/HTMLTokenizer.cpp:1679 > +inline bool HTMLTokenizer::usePreHTML5ParserQuirks() > +{ > + return (m_settings && m_settings->usePreHTML5ParserQuirks()); > +}
Have you run the HTML parsing benchmark on this code? You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html. I suspect you'll get a measurable slowdown from this patch. It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time.
Darin Adler
Comment 43
2010-09-17 10:19:13 PDT
(In reply to
comment #42
)
> Have you run the HTML parsing benchmark on this code? You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html. I suspect you'll get a measurable slowdown from this patch. It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time.
Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design.
Adam Barth
Comment 44
2010-09-17 10:21:31 PDT
> Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design.
Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch. However, instead of speculating, we can just run the benchmark.
Andy Estes
Comment 45
2010-09-17 12:35:15 PDT
(In reply to
comment #41
)
> (From update of
attachment 67902
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
> > > WebCore/html/parser/HTMLTokenizer.h:122 > > - static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); } > > + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); } > > It’s quite inelegant to tie the tokenizer and parser to document settings in this fashion. > > It would be quite a bit better to have the parser and tokenizer publish some sort of object or flags word that expresses the quirks to be used, then we could have the code that creates the parser convert the settings into that form. Having the tokenizer read the settings directly is inelegant and I think too big a change to the class.
This was what I did in my first patch, but the issue was that layout tests enable the quirk after the parser is created, hence the current design. I probably shouldn't have designed for the pathological case, and now that I understand better how WebPreferences work, I think I can write a test that sets the preference then does its work in an iframe to ensure the right ordering.
> > > WebCore/html/parser/HTMLTokenizer.h:268 > > + inline bool usePreHTML5ParserQuirks(); > > Not sure why the existing code includes these “inline” keywords. They don’t do any good that I can see. > > I don’t think that all pre-HTML5 parser quirks should be grouped in a single flag. I think we should have an explicit flag for each quirk.
There seemed to be a consensus that a single quirk was a better approach. Would there be a case where we'd want to enable the quirk for <tag<tag> but not for <tag <tag>? To me, the three changes to the tokenizer handle variants of a single issue so a single quirk settings makes sense.
Andy Estes
Comment 46
2010-09-17 12:39:53 PDT
(In reply to
comment #42
)
> (From update of
attachment 67902
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
> > > WebCore/html/parser/HTMLTokenizer.cpp:1085 > > + else if (usePreHTML5ParserQuirks() && cc == '<') > > + return emitAndReconsumeIn(source, DataState); > > It's slightly unclear to me why you chose these states for this quirk. I would have expected it in these states: > > TagNameState > BeforeAttributeNameState > AttributeNameState * > AfterAttributeNameState * > AfterAttributeValueQuotedState > > I've marked the ones you didn't add with a *. You can test those as follows: > > <foo bar<qux>
I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values?
> <foo bar <qux>
You're right, I missed this one.
> > > WebCore/html/parser/HTMLTokenizer.cpp:1679 > > +inline bool HTMLTokenizer::usePreHTML5ParserQuirks() > > +{ > > + return (m_settings && m_settings->usePreHTML5ParserQuirks()); > > +} > > Have you run the HTML parsing benchmark on this code? You can run the benchmark by opening the file WebCore/benchmarks/parser/html-parser.html. I suspect you'll get a measurable slowdown from this patch. It's probably faster to read the setting at construction time and put it into a member variable than it is to take the extra m_settings null check all the time.
That's the design I'm going back to.
Adam Barth
Comment 47
2010-09-17 13:09:54 PDT
Comment on
attachment 67902
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
>>> WebCore/html/parser/HTMLTokenizer.cpp:1085 >>> + return emitAndReconsumeIn(source, DataState); >> >> It's slightly unclear to me why you chose these states for this quirk. I would have expected it in these states: >> >> TagNameState >> BeforeAttributeNameState >> AttributeNameState * >> AfterAttributeNameState * >> AfterAttributeValueQuotedState >> >> I've marked the ones you didn't add with a *. You can test those as follows: >> >> <foo bar<qux> >> <foo bar <qux> > > I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values?
Yes. Unquoted attribute values are the problematic ones from a security point of view.
Andy Estes
Comment 48
2010-09-17 13:29:57 PDT
(In reply to
comment #47
)
> (From update of
attachment 67902
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=67902&action=prettypatch
> > >>> WebCore/html/parser/HTMLTokenizer.cpp:1085 > >>> + return emitAndReconsumeIn(source, DataState); > >> > >> It's slightly unclear to me why you chose these states for this quirk. I would have expected it in these states: > >> > >> TagNameState > >> BeforeAttributeNameState > >> AttributeNameState * > >> AfterAttributeNameState * > >> AfterAttributeValueQuotedState > >> > >> I've marked the ones you didn't add with a *. You can test those as follows: > >> > >> <foo bar<qux> > >> <foo bar <qux> > > > > I thought you cited a security concern with unquoted attributes? Were you referring specifically to unquoted attribute values? > > Yes. Unquoted attribute values are the problematic ones from a security point of view.
I guess there's no such thing as a quoted attribute name :) Nevermind...fixed both cases.
Andy Estes
Comment 49
2010-09-17 18:01:24 PDT
Created
attachment 67984
[details]
Patch
Andy Estes
Comment 50
2010-09-17 18:03:57 PDT
(In reply to
comment #44
)
> > Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design. > > Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch. However, instead of speculating, we can just run the benchmark.
I'll run the benchmark against the latest patch and report what I find.
Andy Estes
Comment 51
2010-09-17 23:34:57 PDT
(In reply to
comment #50
)
> (In reply to
comment #44
) > > > Reversing the check so the '<' test comes first before checking the setting would probably eliminate the slowdown. But I agree that putting the value of the setting into a data member is probably the right design. > > > > Our earlier thinking was that putting the quirk check first would be better since it should be a very well-predicted branch. However, instead of speculating, we can just run the benchmark. > > I'll run the benchmark against the latest patch and report what I find.
I ran the benchmark a total of 6 times in a 64-bit production build of Safari, once with my patch applied and once without. Here is what I found: With patch: avg: 2741.5 stdev: 20.5 avg: 2739.2 stdev: 20.4 avg: 2753.2 stdev: 45.8 Without patch: avg: 2773.3 stdev: 20 avg: 2775.9 stdev: 21.4 avg: 2770.2 stdev: 20.5
Adam Barth
Comment 52
2010-09-18 13:46:20 PDT
Looks like you made it faster by a hair. :)
Darin Adler
Comment 53
2010-09-18 16:21:23 PDT
Comment on
attachment 67984
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch
> WebCore/html/parser/HTMLTokenizer.h:123 > - static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); } > + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); }
This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all. It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded. I suppose we can live with this, but I don’t like it.
Adam Barth
Comment 54
2010-09-18 16:28:27 PDT
Comment on
attachment 67984
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch
>> WebCore/html/parser/HTMLTokenizer.h:123 >> + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); } > > This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all. > > It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded. > > I suppose we can live with this, but I don’t like it.
I'll post a patch to remove Frame from updateStateFor.
Adam Barth
Comment 55
2010-09-18 19:07:52 PDT
> I'll post a patch to remove Frame from updateStateFor.
I looked at this, but it seems more error prone / verbose. :(
WebKit Review Bot
Comment 56
2010-09-18 23:11:34 PDT
Attachment 67984
[details]
did not build on win: Build output:
http://queues.webkit.org/results/4009066
Darin Adler
Comment 57
2010-09-19 18:58:24 PDT
(In reply to
comment #55
)
> > I'll post a patch to remove Frame from updateStateFor. > > I looked at this, but it seems more error prone / verbose. :(
Yes, I noticed the same thing. Frame doesn’t express the set of tags allowed in a way that can easily be passed in, so this convenience function is only convenient with Frame*. I still think we might some day figure out a way to fix that, though. Or maybe may desire for “layering” here is misguided.
Andy Estes
Comment 58
2010-09-19 23:58:25 PDT
Created
attachment 68055
[details]
Patch
Andy Estes
Comment 59
2010-09-20 00:08:30 PDT
(In reply to
comment #53
)
> (From update of
attachment 67984
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=67984&action=prettypatch
> > > WebCore/html/parser/HTMLTokenizer.h:123 > > - static PassOwnPtr<HTMLTokenizer> create() { return adoptPtr(new HTMLTokenizer); } > > + static PassOwnPtr<HTMLTokenizer> create(Settings* settings) { return adoptPtr(new HTMLTokenizer(settings)); } > > This should take a usePreHTML5ParserQuirks boolean, not a Settings*. I don’t think it’s good layer for HTMLTokenizer to have a dependency on Settings at all. > > It’s the same kind of layering mistake as taking a Frame* in the updateStateFor function. And the forward declaration of class Element here is unneeded. > > I suppose we can live with this, but I don’t like it.
The latest patch moves the Settings* dependency up to HTMLDocumentParser.
WebKit Review Bot
Comment 60
2010-09-20 00:33:31 PDT
Attachment 68055
[details]
did not build on chromium: Build output:
http://queues.webkit.org/results/4092005
Eric Seidel (no email)
Comment 61
2010-09-20 00:34:19 PDT
Comment on
attachment 68055
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=68055&action=review
You'll want Abarth to review the tokenizer changes. He's more familiar with that part of the parser than I am.
> WebCore/html/parser/HTMLDocumentParser.cpp:83 > + , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks()))
Seems this should be a helper function which takes a Document*. We used this pattern when the HTML5 parser was optional too.
> WebCore/html/parser/HTMLDocumentParser.cpp:96 > + , m_tokenizer(HTMLTokenizer::create(fragment->document()->settings() && fragment->document()->settings()->usePreHTML5ParserQuirks()))
Would be re-used here.
> WebCore/html/parser/HTMLPreloadScanner.cpp:128 > + m_tokenizer = HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks());
And here. This seems wrong. We don't need to move the initializer out of the normal c++ pattern. The ASSERT only helps in debug mode. ASSERTing after we've hosed ourselves is just as good as before.
> WebCore/html/parser/HTMLViewSourceParser.cpp:37 > + , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks()))
You want your inline again here. :) Although it's possible you won't make another copy of the inline in this file since it's only used once. When we had the HTML5 parser optional, we had a: static inline useLegacyHTMLParser(Document*) function which did similar to this check and handled the null frame case. WE had two copies. one in the treebuilder and one in the DocumentParser. Seems you need to do the check more often than that, so maybe it needs to be a class method/static on DocumentParser or HTMLDocumentParser?
> WebKit/gtk/webkit/webkitwebsettings.cpp:111 > + gboolean enable_pre_html5_parser_quirks;
I'm not sure any other port besides Mac wants this. I think this could be a mac-only quirk with mac-only layout tests. We really want this quirk to die ASAP (yes, I know that could be years).
WebKit Review Bot
Comment 62
2010-09-20 01:20:00 PDT
Attachment 68055
[details]
did not build on win: Build output:
http://queues.webkit.org/results/4082033
Andy Estes
Comment 63
2010-09-20 12:07:28 PDT
(In reply to
comment #61
)
> (From update of
attachment 68055
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=68055&action=review
> > You'll want Abarth to review the tokenizer changes. He's more familiar with that part of the parser than I am. > > > WebCore/html/parser/HTMLDocumentParser.cpp:83 > > + , m_tokenizer(HTMLTokenizer::create(document->settings() && document->settings()->usePreHTML5ParserQuirks())) > > Seems this should be a helper function which takes a Document*. We used this pattern when the HTML5 parser was optional too.
Done.
> > > WebKit/gtk/webkit/webkitwebsettings.cpp:111 > > + gboolean enable_pre_html5_parser_quirks; > > I'm not sure any other port besides Mac wants this. I think this could be a mac-only quirk with mac-only layout tests. We really want this quirk to die ASAP (yes, I know that could be years).
I guess skipping it on non-Mac ports makes sense since only the Mac port will enable it, although at the time I wrote it I didn't see how having all platforms run the test could hurt. Now that various iterations of this patch are breaking on the EWS bots, I see it as a maintenance burden.
Andy Estes
Comment 64
2010-09-20 14:18:27 PDT
Created
attachment 68134
[details]
Patch
Adam Barth
Comment 65
2010-09-20 14:26:16 PDT
Comment on
attachment 68134
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=68134&action=review
> WebCore/html/parser/HTMLDocumentParser.h:145 > +bool usePreHTML5ParserQuirks(Document*);
Generally, we prefer to have these sorts of functions be static methods of a class: HTMLDocumentParser::usePreHTML5Quirks(...) That makes it easier to find the implementation of the function when you see it called in a random file.
> LayoutTests/fast/parser/resources/pre-html5-parser-quirk-after-attribute-value-quoted-state.html:4 > + <p></p attr="value"</div>
Can you add a test for <p attr=value</ ?
Andy Estes
Comment 66
2010-09-20 16:01:32 PDT
Committed
r67890
: <
http://trac.webkit.org/changeset/67890
>
Andy Estes
Comment 67
2010-09-21 00:34:20 PDT
Created
attachment 68198
[details]
Patch
Andy Estes
Comment 68
2010-09-21 00:35:04 PDT
Reopening to add additional test coverage.
Adam Barth
Comment 69
2010-09-21 00:50:09 PDT
Comment on
attachment 68198
[details]
Patch Thanks.
WebKit Commit Bot
Comment 70
2010-09-21 03:21:32 PDT
Comment on
attachment 68198
[details]
Patch Clearing flags on attachment: 68198 Committed
r67936
: <
http://trac.webkit.org/changeset/67936
>
WebKit Commit Bot
Comment 71
2010-09-21 03:21:44 PDT
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug