Bug 21945

Summary: Implement HTML5 comment parsing
Product: WebKit Reporter: Masahiro Yamada <masa141421356>
Component: DOMAssignee: Julien Chaffraix <jchaffraix>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, chinmaya, eric, hamaji, mike, mjs, webkit.review.bot
Priority: P2 Keywords: InRadar
Version: 525.x (Safari 3.1)   
Hardware: All   
OS: All   
URL: http://www.ne.jp/asahi/minazuki/bakera/html/sgml/comtest
Attachments:
Description Flags
Proposed fix: Allow a white space (per XML definition) in comment ending declaration
jchaffraix: commit-queue-
Safari failing to render this page properly
none
Improved proposed fix: address Darin's comments
ap: review-, jchaffraix: commit-queue-
Proposed fix: Move comment parsing to HTML5 (matches FF) and adds more test cases for broken comments
jchaffraix: commit-queue-
Updated HTML5 parsing - added the 2 missing states, cleaned some more old code and this time with the test cases ap: review+, jchaffraix: commit-queue-

Description Masahiro Yamada 2008-10-29 03:31:40 PDT
Using: Safari 3.1.2 for Windows

According to HTML Spacification,
"<!-- comment> -- >" is valid comment.
But Safari can not detect end of comment.

Testcase
<html>
<head>test</head>
<body>
ABC<!-- DEF> GHI -- >JKL
</body>

Browser should render it as
"ABCJKL"

If you can read japanese please see
 http://www.ne.jp/asahi/minazuki/bakera/html/sgml/comtest
Comment 1 Masahiro Yamada 2008-10-29 04:13:55 PDT
See also:
http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.4

It sais:
White space is not permitted between the markup declaration open delimiter("<!") and the comment open delimiter ("--"), but is permitted between the comment close delimiter ("--") and the markup declaration close delimiter (">").

Webkit ignores:
 but is permitted between the comment close delimiter ("--") and the markup declaration close delimiter (">").

Comment 2 Alexey Proskuryakov 2009-08-11 22:21:31 PDT
And this bug itself gets rendered incorrectly in Safari (no problem in Firefox)
Comment 3 Alexey Proskuryakov 2009-08-11 22:22:32 PDT
<rdar://problem/7136026>
Comment 4 Julien Chaffraix 2010-03-30 21:29:23 PDT
Created attachment 52125 [details]
Proposed fix: Allow a white space (per XML definition) in comment ending declaration
Comment 5 Alexey Proskuryakov 2010-03-30 21:39:50 PDT
What does HTML5 say?
Comment 6 Julien Chaffraix 2010-03-31 05:58:58 PDT
(In reply to comment #5)
> What does HTML5 say?

HTML5 closes the comment but also emits a parsing error. --!> has the same processing.

See section 10.2.4.48 to 10.2.4.52 for reference.
Comment 7 Darin Adler 2010-03-31 12:29:24 PDT
Comment on attachment 52125 [details]
Proposed fix: Allow a white space (per XML definition) in comment ending declaration

> +static inline bool isWhiteSpace(const UChar& c)

Should just be UChar, not const UChar&.

I'm concerned about another function named isWhiteSpace.

We already have isWhitespace in CompositeEditCommand.cpp, isWhitespace in PreloadScanner.cpp, isWhitespace in SVGParserUtilities.h, isWhitespace in XPathFunctions.cpp, and isWhiteSpace in Lexer.h.

We also have isCSSWhitespace in CSSParser.cpp, isCollapsibleWhitespace in TextIterator.h, and RenderStyle::isCollapsibleWhiteSpace in RenderStyle.h, isStrWhiteSpace in JSGlobalObjectFunctions.h, isTrimWhitespace in StringPrototype.cpp, isASCIISpace in ASCIICType.h, and isSpaceOrNewline in StringImpl.h.

It would be best if we gave any new functions more specific names or shared code with existing functions.
Comment 8 Alexey Proskuryakov 2010-03-31 13:43:06 PDT
> See section 10.2.4.48 to 10.2.4.52 for reference.

Thanks. And I forgot to ask another boilerplate question - what do other browsers do?
Comment 9 Julien Chaffraix 2010-04-01 08:47:11 PDT
> Thanks. And I forgot to ask another boilerplate question - what do other
> browsers do?

IE 7 & 8: do not consider -- > to be a valid comment end declaration
Opera: ditto
Firefox: consider -- > to be a valid comment declaration. It seems also to go a little bit further and accepts: --[\r\n\t ]*>. I haven't tested the new HTML5 parser though.
Comment 10 Julien Chaffraix 2010-04-01 08:58:17 PDT
> Should just be UChar, not const UChar&.

OK, I will wait for Alexey's comments about the patch and integrate that.

> We already have isWhitespace in CompositeEditCommand.cpp, isWhitespace in
> PreloadScanner.cpp, isWhitespace in SVGParserUtilities.h, isWhitespace in
> XPathFunctions.cpp, and isWhiteSpace in Lexer.h.
> 
> We also have isCSSWhitespace in CSSParser.cpp, isCollapsibleWhitespace in
> TextIterator.h, and RenderStyle::isCollapsibleWhiteSpace in RenderStyle.h,
> isStrWhiteSpace in JSGlobalObjectFunctions.h, isTrimWhitespace in
> StringPrototype.cpp, isASCIISpace in ASCIICType.h, and isSpaceOrNewline in
> StringImpl.h.

That's a lot of is*WhiteSpace methods! I agree with you that it would be good to hoist those methods somewhere (platform/text or wtf).

> It would be best if we gave any new functions more specific names or shared
> code with existing functions.

Indeed, I would vote for isXMLWhiteSpace as it matches XML white space definition.
Comment 11 Eric Seidel (no email) 2010-04-01 16:51:55 PDT
This bug renders poorly in WebKit.  We should probably change the Title.
Comment 12 Eric Seidel (no email) 2010-04-01 16:55:28 PDT
Created attachment 52354 [details]
Safari failing to render this page properly
Comment 13 Julien Chaffraix 2010-04-15 07:28:04 PDT
Created attachment 53436 [details]
Improved proposed fix: address Darin's comments

I got Alexey's unofficial OK about this patch.

I don't feel comfortable enough to share the isXMLWhiteSpace implementation as the meaning is fine for the HTML tokenizer but may not be so clear for other part of the code (the editing code for example). I will file a bug about this as it is a good refactoring task.
Comment 14 Alexey Proskuryakov 2010-04-19 10:19:54 PDT
-       * <pre>elements too</pre>, <pre>elements <!-- comment -- > too</pre> have a different
+       * <pre>elements too</pre>, <pre>elements &gt;!-- comment --&lt; too</pre> have a different

I don't understand this change. Currently, Firefox and Safari display the same thing for "data:text/html,<pre>elements <!-- comment -- > too</pre>" - will our behavior change incompatibly?

+// Returns true if c is an XML white spaces (XML specification - section 2.3)
+static inline bool isXMLWhiteSpace(UChar c)

If it's XML whitespace by definition, why is this function in HTMLParser? The name looks confusing. HTML5 doesn't seem to defer to XML for whitespace definition.
Comment 15 Julien Chaffraix 2010-04-19 18:29:39 PDT
(In reply to comment #14)
> -       * <pre>elements too</pre>, <pre>elements <!-- comment -- > too</pre>
> have a different
> +       * <pre>elements too</pre>, <pre>elements &gt;!-- comment --&lt;
> too</pre> have a different
> 
> I don't understand this change. 

What you may not see in the change is that we are inside a comment, thus the "-- >" would end the comment and we would badly fail the test. I could have added more spaces but I thought that entities would be better.

>Currently, Firefox and Safari display the same
> thing for "data:text/html,<pre>elements <!-- comment -- > too</pre>" - will our
> behavior change incompatibly?

I think this is unrelated. For some reason the parser closes the comment on the '>' in your example. The following will lead to the same result as the code above in Safari and FF:

data:text/html,<pre>elements <!-- comment> too</pre>

I think the change should have no impact but I will need to test it to really say.

> +// Returns true if c is an XML white spaces (XML specification - section 2.3)
> +static inline bool isXMLWhiteSpace(UChar c)
> 
> If it's XML whitespace by definition, why is this function in HTMLParser? The
> name looks confusing. HTML5 doesn't seem to defer to XML for whitespace
> definition.

HTML definition of space character[1] matches XML white space. I had a look at XML when making this change because this is an SGML property and thus sided with XML. Would isHTMLSpaceCharacter be OK with you?

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#space-character
Comment 16 Alexey Proskuryakov 2010-04-20 10:05:03 PDT
Comment on attachment 53436 [details]
Improved proposed fix: address Darin's comments

> What you may not see in the change is that we are inside a comment, thus the
> "-- >" would end the comment and we would badly fail the test. I could have
> added more spaces but I thought that entities would be better.

Indeed, I didn't see that. 

Entities are fine with me, but here are a few more ideas:
1. "- ->" instead of "-- >".
2. [#comment].
3. Change the whole block to a <script>, and use "//" comments for text.

> HTML definition of space character[1] matches XML white space. I had a look at
> XML when making this change because this is an SGML property and thus sided
> with XML. 

HTML comments are not really SGML, see <http://ln.hixie.ch/?start=1137799947&count=1> for some amusing history.

> Would isHTMLSpaceCharacter be OK with you?

This name sounds good to me. But HTML5 includes U+000C FORM FEED (FF) in the set of space characters, which you don't have here:

+    return c == '\r' || c == '\n' || c == '\t' || c == ' ';

Please add a test for the form feed character.

+                       && m_scriptCode[m_scriptCodeSize - 3] == '-' && isXMLWhiteSpace(m_scriptCode[m_scriptCodeSize - 2])) {
+                // HTML4 states that: White space (...) is permitted between the comment close delimiter ("--")
+                // and the markup declaration close delimiter (">"). See https://bugs.webkit.org/show_bug.cgi?id=21945

HTML4 should not be a reason for us to make changes. Other browsers' behavior and HTML5 draft (as long as it correctly captures browser behavior) would be better references for this comment.

I'm very surprised that we will be only allowing a single space character between "--" and ">". This doesn't seem to match any specification or any browser. Am I missing something?

+                // FIXME: HTML5 indicates that we need to emit a parsing error in this case.

I believe this just means logging an error to developer console. Would that be hard to do?

Marking r-, as this needs some changes.
Comment 17 Julien Chaffraix 2010-05-04 08:29:32 PDT
> Entities are fine with me, but here are a few more ideas:
> 1. "- ->" instead of "-- >".

This one looks like the least confusing. I will switch to that.

> > HTML definition of space character[1] matches XML white space. I had a look at
> > XML when making this change because this is an SGML property and thus sided
> > with XML. 
> 
> HTML comments are not really SGML, see
> <http://ln.hixie.ch/?start=1137799947&count=1> for some amusing history.

Ok, we have to be extra careful with such changes.

> > Would isHTMLSpaceCharacter be OK with you?
> 
> This name sounds good to me. But HTML5 includes U+000C FORM FEED (FF) in the
> set of space characters, which you don't have here:
> 
> +    return c == '\r' || c == '\n' || c == '\t' || c == ' ';
> 
> Please add a test for the form feed character.

Actually the HTML parsing algorithm "white space" definition does not match the definition of a space character (I wonder why). I have dropped this method and inlined at the 2 call sites to avoid confusing name.

> +                       && m_scriptCode[m_scriptCodeSize - 3] == '-' &&
> isXMLWhiteSpace(m_scriptCode[m_scriptCodeSize - 2])) {
> +                // HTML4 states that: White space (...) is permitted between
> the comment close delimiter ("--")
> +                // and the markup declaration close delimiter (">"). See
> https://bugs.webkit.org/show_bug.cgi?id=21945
> 
> HTML4 should not be a reason for us to make changes. Other browsers' behavior
> and HTML5 draft (as long as it correctly captures browser behavior) would be
> better references for this comment.

Sure.

> I'm very surprised that we will be only allowing a single space character
> between "--" and ">". This doesn't seem to match any specification or any
> browser. Am I missing something?

No, you are right. This change was made to correct this peculiar bug and to avoid having a huge change. I can see why it would be unfortunate so I will make us abide by the HTML5 standard for comment parsing. I have the change mostly done with more extensive testing.

> +                // FIXME: HTML5 indicates that we need to emit a parsing error
> in this case.
> 
> I believe this just means logging an error to developer console. Would that be
> hard to do?

This needs to change the logic behind reporting errors: currently only the HTMLParser can report errors. We just need to create a common method so that we can use it from the HTMLTokenizer too.

All in all, there is much changes involved. I will start filing bugs to handle some of the side issues.
Comment 18 Julien Chaffraix 2010-05-10 09:11:29 PDT
> All in all, there is much changes involved. I will start filing bugs to handle some of the side issues.

Filed https://bugs.webkit.org/show_bug.cgi?id=38848 about the tokenizer access to the console.

Filed https://bugs.webkit.org/show_bug.cgi?id=38850 about Darin's concern over isWhiteSpace duplication.
Comment 19 Julien Chaffraix 2010-05-10 09:14:47 PDT
Created attachment 55560 [details]
Proposed fix: Move comment parsing to HTML5 (matches FF) and adds more test cases for broken comments

This patch contains more than a dozen test cases. If needed they can be pushed to another bug to ease review (the patch got quite lengthy).
Comment 20 WebKit Review Bot 2010-05-10 09:19:09 PDT
Attachment 55560 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
WebCore/html/HTMLTokenizer.cpp:646:  An else if statement should be written as an if statement when the prior "if" concludes with a return, break, continue or goto statement.  [readability/control_flow] [4]
WebCore/html/HTMLTokenizer.cpp:689:  One line control clauses should not use braces.  [whitespace/braces] [4]
WebCore/html/HTMLTokenizer.cpp:702:  Place brace on its own line for function definitions.  [whitespace/braces] [4]
Total errors found: 3 in 8 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 21 Alexey Proskuryakov 2010-05-10 15:47:28 PDT
Comment on attachment 55560 [details]
Proposed fix: Move comment parsing to HTML5 (matches FF) and adds more test cases for broken comments

+        Implemented the HTML comment parsing algorithm so that we match HTML5 and
+        FF when parsing comments. Missing from this patch is

fast/parser/comments.html says "Output of this test should match WinIE". Is this still true?

+        the parser errors, which will be added in a follow up patch.
+.
+
+        Added tests cases for broken comments.

Extra line with a dot here.

+    // FIXME: This code should not be called for these states!

Is "this code" the emitCommentToken() function? Are "these states" the ones just below the comment? But a comment in the else clause tells us that this code is "needed to properly parse broken comments", which doesn't seem like a bad thing.

I don't understand what this FIXME is trying to tell me. It seems to be related to a FIXME in HTMLTokenizer::parseComment(), but with a strange twist about broken comments in else clause.

+    // HTML5 Comment state (see section 10.2.4.48 to 10.2.4.52).

The section numbers change all the time.

+    // FIXME: We are missing CommentStartState and CommentStartDashState.

Why does this need to be fixed, what is the observable effect? Would it be better to add these states, and add FIXMEs at the points where we need to enter and handle these?

r=me
Comment 22 Julien Chaffraix 2010-05-11 08:31:10 PDT
(In reply to comment #21)
> (From update of attachment 55560 [details])
> +        Implemented the HTML comment parsing algorithm so that we match HTML5 and
> +        FF when parsing comments. Missing from this patch is
> 
> fast/parser/comments.html says "Output of this test should match WinIE". Is this still true?

Not anymore indeed.

> +        the parser errors, which will be added in a follow up patch.
> +.
> +
> +        Added tests cases for broken comments.
> 
> Extra line with a dot here.

Ok, I will remove it.

> +    // FIXME: This code should not be called for these states!
> 
> Is "this code" the emitCommentToken() function? Are "these states" the ones just below the comment? But a comment in the else clause tells us that this code is "needed to properly parse broken comments", which doesn't seem like a bad thing.

This comment is about the states just below.

> I don't understand what this FIXME is trying to tell me. It seems to be related to a FIXME in HTMLTokenizer::parseComment(), but with a strange twist about broken comments in else clause.

The whole parseComment logic should not be called for these states as we are not expected HTML in those states (that's the FIXME you are mentioning). The FIXME I had is quite redundant with the one in parseComment.
The comment in the else clause just clarifies why we are not closing the comment: so that the code (mostly parseNonHTMLContent) can recover from broken comments. The right fix would be not to parse them as HTML comments in the first place.

> +    // HTML5 Comment state (see section 10.2.4.48 to 10.2.4.52).
> 
> The section numbers change all the time.

OK, I will just remove the sections numbers then.

> +    // FIXME: We are missing CommentStartState and CommentStartDashState.
> 
> Why does this need to be fixed, what is the observable effect?

It was breaking some test cases IIRC. I will investigate this a bit more before replying.

> Would it be better to add these states, and add FIXMEs at the points where we need to enter and handle these?

I am fine with both.
Comment 23 Alexey Proskuryakov 2010-05-11 09:45:41 PDT
> The right fix would be not to parse them as HTML comments in the first place.

That sounds like a good thing to say in a FIXME comment.
Comment 24 Julien Chaffraix 2010-05-11 22:00:03 PDT
> > +    // FIXME: We are missing CommentStartState and CommentStartDashState.
> > 
> > Why does this need to be fixed, what is the observable effect?
> 
> It was breaking some test cases IIRC. I will investigate this a bit more before replying.

I checked and it would indeed break a lot of test cases. We consume the 2 '-' in the comment start (<!--) which the HTML5 algorithm assumes are present. I have added the 2 states with this comment.
Comment 25 Alexey Proskuryakov 2010-05-11 22:19:53 PDT
Does HTML5 need to be corrected?
Comment 26 Julien Chaffraix 2010-05-12 08:38:08 PDT
(In reply to comment #25)
> Does HTML5 need to be corrected?

It does not. You can just discard my previous comment about those states not being reached and the dashes being consumed. After sleeping on that, we need to implement the 2 states to close such comments: <!--> or <!--->. I will add some testing for those and post the updated patch soon.
Comment 27 Julien Chaffraix 2010-05-12 09:24:54 PDT
Created attachment 55857 [details]
Updated HTML5 parsing - added the 2 missing states, cleaned some more old code and this time with the test cases
Comment 28 Alexey Proskuryakov 2010-05-12 09:26:29 PDT
Renaming to match the scope of work in this fix.

I'd appreciate it if you could post an overview of where we'll stand compatibility-wise. My understanding is that we'll match HTML5, but what the new differences with IE and with Firefox will be?
Comment 29 Julien Chaffraix 2010-05-14 07:39:23 PDT
> I'd appreciate it if you could post an overview of where we'll stand compatibility-wise. My understanding is that we'll match HTML5, but what the new differences with IE and with Firefox will be?

Your understanding is right: we match HTML5 and will match the next release of Gecko on most cases.

- Compared to IE:
  * we accept white space in the comment end (which was what this bug is about)

- Compared to Grand Paradiso (FF nightly as it does not make sense to compare to FF without the HTML5 parsing model):
  * FF is more lenient as to when closing the comment (--ouch> does close a comment)
  * FF has a bug where the comment end delimiter is not properly consumed (filed https://bugzilla.mozilla.org/show_bug.cgi?id=565896 about that)
  * looking at the code, they are a bit more lenient in their parsing. One example is that they are accepting '\r' in the comment end delimiter.

- Common:
  *  we emit comments without an end delimiter (as tested in the new tests in the patch) whereas IE and FF just don't recognize them as comments. Not sure if we should amend HTML5 or file a Mozilla bug about that.
Comment 30 Julien Chaffraix 2010-05-14 09:07:18 PDT
I have to amend FF results as I checked the wrong nightly (3.0.19pre vs 3.7):

> - Compared to Grand Paradiso (FF nightly as it does not make sense to compare to FF without the HTML5 parsing model):
>   * FF is more lenient as to when closing the comment (--ouch> does close a comment)
>   * FF has a bug where the comment end delimiter is not properly consumed (filed https://bugzilla.mozilla.org/show_bug.cgi?id=565896 about that)

Those have been solved.

>   * looking at the code, they are a bit more lenient in their parsing. One example is that they are accepting '\r' in the comment end delimiter.

This still holds as it was code inspection.

New difference:

  * FF does consider parse comments in <script>, <style> and <title> as if it was HTML. 

This is a difference with our parsing model.

> - Common:
>   *  we emit comments without an end delimiter (as tested in the new tests in the patch) whereas IE and FF just don't recognize them as comments. Not sure if we should amend HTML5 or file a Mozilla bug about that.

We match FF on this one so they emit comment if they reach EOF.
Comment 31 Alexey Proskuryakov 2010-05-17 13:03:25 PDT
>This is a difference with our parsing model.

Are they also differences with HTML5? It would be good to file bugs against Firefox then.
Comment 32 Julien Chaffraix 2010-05-18 11:24:57 PDT
(In reply to comment #31)
> >This is a difference with our parsing model.
> 
> Are they also differences with HTML5? It would be good to file bugs against Firefox then.

It is a difference on our side from HTML5. I did not want to correct that as it will lead to big compatibility issues (we would recognize comment starting or ending inside a <script> or a <style> which we do not currently).
Comment 33 Alexey Proskuryakov 2010-05-18 12:42:32 PDT
Comment on attachment 55857 [details]
Updated HTML5 parsing - added the 2 missing states, cleaned some more old code and this time with the test cases

> It is a difference on our side from HTML5. I did not want to correct that as it will lead
> to big compatibility issues (we would recognize comment starting or ending inside a
> <script> or a <style> which we do not currently).

Sounds like this should be considered in a separate bug then. In any case, the goal is to have an implementation that matches HTML5, and it can be achieved by modifying either.

-Output of this test should match WinIE (no strict SGML comment parsing).
+Output of this test should match HTML5 (no strict SGML comment parsing).

Did you test with html5lib (e.g. <http://james.html5.org/parsetree.html>)?

r=me
Comment 34 Adam Barth 2010-05-23 11:15:52 PDT
BTW, there are a handful of tests of HTML5 comment parsing in the patch on this bug:

https://bugs.webkit.org/show_bug.cgi?id=39537

That patch is in the middle of a moderate size patch stream, so it might take some creativity to run them against your patch.
Comment 35 Alexey Proskuryakov 2010-06-02 12:42:06 PDT
Julien, do you intend to land this patch? I think that iteratively improving the existing parser is still a good thing to do, even though a larger scale effort is underway. If nothing else, we'll get more testing by making this area HTML5 conformant sooner rather than later.
Comment 36 Adam Barth 2010-06-02 13:22:37 PDT
Yeah, I think incremental improvements to the existing parser are quite valuable as well.  In fact, I've been hoping for this patch to land for a while.  :)
Comment 37 Julien Chaffraix 2010-06-04 08:05:51 PDT
(In reply to comment #35)
> Julien, do you intend to land this patch? I think that iteratively improving the existing parser is still a good thing to do, even though a larger scale effort is underway. If nothing else, we'll get more testing by making this area HTML5 conformant sooner rather than later.

I want to land it too, it is just that I have just been really busy IRL and I have not been able to come around the pace of the new tests added with the new HTML5 parser. My patch made us progressed some of them but I need to investigate the regression I see. This week-end should be calm enough so that I can squeeze some hours to do so.
Comment 38 Adam Barth 2010-06-04 09:37:38 PDT
If this patch is correct, it should be a progression on all the new parsing tests because they have their expectations set to the HTML5 behavior.  Feel free to ping me if you have questions.
Comment 39 WebKit Review Bot 2010-06-07 08:09:49 PDT
http://trac.webkit.org/changeset/60781 might have broken Qt Linux Release
Comment 40 Julien Chaffraix 2010-06-07 08:28:04 PDT
Thanks Adam for the help but I managed to find the issue. We were not recovering properly comments in <style>, <script>... (there is a lot of legacy code because of that).

I tried to land the patch with the previous tweak in r60781 but I had to revert it in 60786 as it broke one Gtk and Qt test in non trivial ways. 

See:
http://build.webkit.org/results/Qt%20Linux%20Release/r60781%20%2812912%29/fast/encoding/japanese-encoding-mix-pretty-diff.html

http://build.webkit.org/results/GTK%20Linux%2032-bit%20Release/r60783%20%2813854%29/fast/multicol/span/span-as-nested-columns-child-pretty-diff.html
Comment 41 Adam Barth 2010-06-09 11:52:09 PDT
What's the status of this patch?  I'd like to get this landed.  :)
Comment 42 Julien Chaffraix 2010-06-09 18:16:38 PDT
(In reply to comment #41)
> What's the status of this patch?  I'd like to get this landed.  :)

I have landed the full patch but it broke 2 platforms in non-trivial ways (3 if you include the Chromium canaries). I have been able to reproduce the Qt failure locally but I am still investigating the origin of it.

If you really want to have it landed (ie it is blocking some of your work on the new HTML5 parser), I could skip the tests I mentioned while I investigate why we have such strange regressions but I would rather not do that.

I have set aside some time tomorrow to look into this.
Comment 43 Adam Barth 2010-06-09 22:53:46 PDT
> If you really want to have it landed (ie it is blocking some of your work on the new HTML5 parser), I could skip the tests I mentioned while I investigate why we have such strange regressions but I would rather not do that.
> 
> I have set aside some time tomorrow to look into this.

Oh, I don't mean to rush you.  I didn't realize the issues were non-trivial.
Comment 44 Julien Chaffraix 2010-06-15 07:59:32 PDT
Finally got around those 2 bugs:
> 
> See:
> http://build.webkit.org/results/Qt%20Linux%20Release/r60781%20%2812912%29/fast/encoding/japanese-encoding-mix-pretty-diff.html

There is a bug in the way Qt handle the page's text. The parser does not see the first dash in the comment end tag and thus does not close the tag.
This behaviour is unchanged by this bug's patch but the old parser used to recover from such broken comments. I will open a bug and add this test to Qt's Skipped list as it is a platform specific issue.

> http://build.webkit.org/results/GTK%20Linux%2032-bit%20Release/r60783%20%2813854%29/fast/multicol/span/span-as-nested-columns-child-pretty-diff.html

This test contains a long comment that is making Gtk fail. I think this is the same underlying cause as the other one: we used to recover from such broken comments but are not. The comment is not part of the test so moving it to a parsing test would be the best.
I will defer this to another bug and skip the test for now.
Comment 45 Eric Seidel (no email) 2010-06-21 23:25:49 PDT
I'm going to open a bug to land these test cases now that we've turned on the HTML5 parser.  I think the rest of this bug can just be closed though.
Comment 46 Eric Seidel (no email) 2010-06-22 00:13:28 PDT
Committed r61605: <http://trac.webkit.org/changeset/61605>
Comment 47 Eric Seidel (no email) 2010-06-22 00:14:14 PDT
I landed the LayoutTests portion of this change.  It all passed perfectly with the new HTML5 lexer. :)
Comment 48 Alexey Proskuryakov 2010-06-22 10:16:01 PDT
I didn't review this part of the landed change, and it looks wrong:


--- trunk/LayoutTests/platform/mac/fast/parser/broken-comments-vs-parsing-mode-expected.txt	2010-06-22 07:04:34 UTC (rev 61604)
+++ trunk/LayoutTests/platform/mac/fast/parser/broken-comments-vs-parsing-mode-expected.txt	2010-06-22 07:13:09 UTC (rev 61605)
@@ -2,4 +2,14 @@
   RenderView at (0,0) size 800x600
 layer at (0,0) size 800x600
   RenderBlock {HTML} at (0,0) size 800x600
-    RenderBody {BODY} at (8,8) size 784x584
+    RenderBody {BODY} at (8,8) size 784x576
+      RenderBlock {P} at (0,0) size 784x18
+        RenderInline {A} at (0,0) size 60x18 [color=#0000EE]
+          RenderText {#text} at (0,0) size 60x18
+            text run at (0,0) width 60: "bug 8626"
+        RenderText {#text} at (60,0) size 363x18
+          text run at (60,0) width 8: ": "
+          text run at (68,0) width 355: "Strict mode erroneously triggered by a broken comment."
+      RenderBlock {P} at (0,34) size 784x18 [color=#00FF00]
+        RenderText {#text} at (0,0) size 642x18
+          text run at (0,0) width 642: "This text should be green, not black (CSS color values not beginning with '#' are OK in quirks mode)."
Comment 49 Eric Seidel (no email) 2010-06-22 12:02:01 PDT
Thanks.  I'll look into what I did wrong today.
Comment 50 Eric Seidel (no email) 2010-06-22 14:24:32 PDT
(In reply to comment #49)
> Thanks.  I'll look into what I did wrong today.

I'm not sure I understand. You're right, it looks like Julien failed to include that diff in his original patch, so claiming you as the reviewer for that part of the change was definitely an error.

However, the results look correct.  Julien wrapped the comment in a <title> so it would be not treated as a comment and not eat the rest of the document (as a unclosed comment would).  So you reviewed the test chagne, you just never saw the updated results.

The updated results look "right", although I'm not sure the test is really still testing what it original set out to test.  The test is invalid in HTML5 as an unclosed comment eats the rest of the documetn in HTML5.  Wrapping it in a <title> just made it ignore (not treated as a comment).

I'm happy to leave it as is, or to back out Julien's change to that test, or to update the ChangeLog to say I reviewed that part, or all three.  Let me know what you'd like.
Comment 51 Alexey Proskuryakov 2010-06-22 15:33:48 PDT
We certainly lost coverage here, and I should have noticed that before. But it somehow came to me when I saw the new results. Since the comment is no longer treated as a comment, we don't test how an unclosed comment affects parsing mode.

I think that the test should be changed to end with the broken comment (plus some insignificant whitespace or elements), so that it wouldn't matter whether the comment eats everything after it or not.
Comment 52 Eric Seidel (no email) 2010-06-22 16:16:30 PDT
Would you like me to revert the change to fast/parser/broken-comments-vs-parsing-mode.html?  I'm happy to do so.
Comment 53 Alexey Proskuryakov 2010-06-22 16:26:15 PDT
No need to revert. I'd like someone (or myself?) to change the test as described above in a follow-up bug though.

This bug should probably be closed now.
Comment 54 Julien Chaffraix 2010-06-22 19:13:31 PDT
(In reply to comment #53)
> No need to revert. I'd like someone (or myself?) to change the test as described above in a follow-up bug though.
> 
> This bug should probably be closed now.

There is still the part about the HTMLTokenizer (now HTMLDocumentParser) which I intend to land. I have rebased it but did not test it enough to be comfortable with landing it.
If you prefer that we do not land that part and stick with the new HTML5 parser, then feel free to close this bug.
Comment 55 Alexey Proskuryakov 2010-06-23 09:48:50 PDT
I don't see the benefit in landing WebCore changes at this point, now that this code is disabled by default. It seems unlikely that we'll need to disable HTML5 parser - but if we do, we'll possibly want to restore old comment parsing behavior, too.
Comment 56 Alexey Proskuryakov 2010-06-23 11:23:32 PDT
> No need to revert. I'd like someone (or myself?) to change the test as described above in a follow-up bug though.

Filed bug 41083.

> If you prefer that we do not land that part and stick with the new HTML5 parser, then feel free to close this bug.

I'm not sure if my preference should be the deciding factor, but closing (as fixed, since the original problem is fixed in ToT, and covered by test cases).