The </title> end-tag is required by the html specification, but if it is missing, the page will be blank.
Other browsers display the content of the page. WebKit should close the tag at </head> or other tags
which cannot be inside the <title> tag (e.g. <meta>).
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
Created attachment 2867 [details]
File with the above html contents AKA test-case.
Confirmed, i guess </head> should close an open <title> as well..
Should be pretty simple to fix in the html parser.
(In reply to comment #3)
> Should be pretty simple to fix in the html parser.
Looks like it's currently handled by the tokenizer, so I'm cc:ing Adele per http://wiki.opendarwin.org/
(In reply to comment #2)
> Confirmed, i guess </head> should close an open <title> as well..
No, No!, I think that <title>this is a title bug report</title> is the right way. Dont let those pesky non
conforming webmasters do any non-standard bllllllllugh!
We should stick to the spec!
Though i agree with the previous poster in that we should stick with the spec. However, our main concern
is the user and not the webmaster. If we can display a page while not creating any other problems, we
HTML5 has specific rules for handling un-closed </title>.
Basically, what happens is that you parse the contents of <title> in a special tokeniser mode (treating < characters as characters, not tag starts), and then if you get to the end of the document without seeing a </title>, then you go all the way back to the <title>, and parse it again, as if it was a normal element with no special tokenisation, and the next start tag or end tag is where you close the <title> element.
(In reply to comment #7)
> Basically, what happens is that you parse the contents of <title> in a special
> tokeniser mode (treating < characters as characters, not tag starts), and then
> if you get to the end of the document without seeing a </title>, then you go
> all the way back to the <title>, and parse it again, as if it was a normal
> element with no special tokenisation, and the next start tag or end tag is
> where you close the <title> element.
The above method is the way Firefox 18.104.22.168 handles this case. MSIE 6 currently handles it like Safari does now: it reads the rest of the document as if it were the title #text, never finding the </title> tag, then just stops parsing.
I have a patch for this that works. I just need to make sure the state of the tokenizer is reset properly when backing up and reparsing after initially consuming the rest of the document.
Created attachment 6730 [details]
This is a gzip-compressed torture test. WARNING: It will expand to over 1 MB in size when uncompressed!
Basically, this is a 1 MB file with a missing </title> tag that causes the tokenizer to consume the entire document.
With my current fix (patch coming soon), the page will render in about 12 seconds on a 1.5 GHz PB G4 with 1 GB RAM. It spends 2-3 seconds consuming the document looking for </title> (spinner pauses during this time), then backs up and reparses with no special handling of the title.
Firefox 22.214.171.124 on the same PB G4 takes a few MINUTES to parse the file, but eventually finishes and inserts the missing </title> tag. MSIE 6 on a 1.7 GHz IBM ThinkPad T42 with 2 GB RAM takes at least twice as long as Firefox, and just stops parsing after it reaches the end of the document with no </title> tag.
Created attachment 6732 [details]
Patch v1 includes a layout test and code fix.
The code fix is to determine when HTMLTokenizer::parseSpecial() has consumed the rest of the document looking for </title> from within HTMLTokenizer::parseTag(), then reset the state of the tokenizer and retokenize with no special handling for the title. The HTML parser will then take care of adding the missing </title> tag.
NOTE: I'm not sure whether the state of the tokenizer has been reset sufficiently in this patch! Please verify this when reviewing the patch. I reset 'src' and 'lineno', but didn't really look at any other state variables. There may be other tokenizer state variables that need to be reset!
I did run all of the layout tests and they all passed (except fast/js/date-constructor.html which has known issues).
Conversation from IRC:
[11:45am] mitzpettel: ddkilzer: i'm wondering if you can't detect <title> badness before reaching the very end of the document
[11:46am] ddkilzer: mitzpettel: how?
[11:46am] mitzpettel: ddkilzer: e.g. as early as you see something that looks like a close tag?
[11:47am] mitzpettel: ddkilzer: (i'm sure you've given it more thought than i have so you probably can't do it that way)
[11:47am] ddkilzer: mitzpettel: What about this title: <title>Using </junk></title>
[11:47am] ddkilzer: mitzpettel: Not much more thought...I was looking for an easy solution that didn't impact normal parsing much.
[11:48am] ddkilzer: mitzpettel: To do what you're saying I'd have to add more state variables or code in other parts of the tokenizer.
[11:48am] ddkilzer: s/parsing/tokenizing/
[11:48am] ddkilzer: (I think)
[11:49am] ddkilzer: mitzpettel: This solution does a relatively quick check (state.inTitle() && src.isEmpty()) exactly after the tokenizer just ate the whole document, and is able to recover if needed. This is also an exceptional case, so I don't really think the solution needs to be efficient. I'm also just doing what Hixie will be suggesting for HTML 5.
[11:50am] mitzpettel: ddkilzer: i see
[11:50am] ddkilzer: In the normal case, the check will fall through after only state.inTitle() (which is a bit check), and not impact performance.
[11:50am] ddkilzer: mitzpettel: Of course, reviewers may see it differently, but I understand what you're saying.
[11:51am] mitzpettel: ddkilzer: i guess the easy solution is the correct one in this case
[11:51am] ddkilzer: I had no idea title-tokenizing was much like tokenizing <script> stuff where "anything goes" until the </title> tag appears.
The bug lies in the HTML tokenizer, not the DOM. Moving bug to "WebKit Misc." component.
Created attachment 6733 [details]
*PAIN* There was a curly brace that wasn't lined up properly in Patch v1. Fixed in Patch v2.
All of the previous questions still apply, though. :)
Created attachment 6734 [details]
Test open <title> tag in <body>
Created attachment 6741 [details]
Changes since Patch v2:
- Added a test for an unclosed <title> tag in the <body>, which failed with Patch v1 and v2.
- Added code to HTMLParser::parseToken() to handle condition when no </title> tag exists. Patch v1 and v2 handled this implicitly (probably in HTMLParser::handleError() when the </head> tag was processed) for the first case (missing </title> in <head>), but not for the second case (in the <body>).
Note that Firefox 126.96.36.199 and MSIE 6 handle an unclosed <title> tag in the body the same way they handle an unclosed <title> tag in the head.
Finally, note that all layout tests passed (no pixel tests) with Patch v3.
Comment on attachment 6741 [details]
It's not good that this code calls SegmentedString::toString() before parsing any <title>. This undoes the SegmentedString optimization, by building a giant string out of all the separate pieces. And it's going to happen often since many documents have a <title> tag.
Instead, we should simply assign savedSrc to src, without calling toString. We will presumably have to write our own copy constructor because the m_currentChar field might be pointing to m_pushedChar1 or m_pushedChar2.
This code also relies on the assignment operator of SegmentedString working. Since we're using the compiler-generated one in that case we could have the same trouble, except that we're in the special case of a SegmentedString that was just created so we know that m_currentChar is not pointing to m_pushedChar1 or 2. So I think we need to implement operator= as well for SegmentedString.
This copy constructor and assignment operator should not be too terribly difficult, since m_currentChar is the only case where I think we need to do anything special.
Created attachment 6753 [details]
Changes since Patch v3:
- After further review the copy constructor and assignment operator mentioned in Comment #18 were not needed. It was sufficient to save the state simply by using the default assignment operator in HTMLTokenizer::parseTag().
- Moved the code in HTMLParser::parseToken() into HTMLParser::handleError() [part 2] since that's where all the other cleanup code lives.
Comment on attachment 6753 [details]
In htmlParser.cpp, I don't know why the check n->isHTMLElement() is appropriate; it seems too strict. What about a non-HTML element after a <title> like a custom tag? It seems that the code could be simpler, more like the handling of addressTag, fontTag, dlTag, and dtTag.
I think it's OK to land the patch like this, but probably even better to simplify it more.
Comment on attachment 6753 [details]
I will simplify the patch and repost.
Created attachment 6769 [details]
Changes since Patch v4:
- Removed unnecessary code from HTMLParser::handleError() per Comment #20. If execution reaches this block of code, we know that the 'current' element is <title> and that the next element to be inserted is not </title> or a #text node (since we're in handleError()) so we can always pop <title> off the block.
Comment on attachment 6769 [details]
We talked about it a little more and decided that we really do need to write a copy constructor and assignment operator for SegmentedString.
Created attachment 6773 [details]
Updates from Patch v5:
- Removed verbose comments from htmlparser.cpp patch.
- Added copy constructor and assignment operator to SegmentedString class.
- In the SegmentedString class, I can't find any code where "m_currentChar = &m_pushedChar2", so I'm not sure the middle else-if logic is really necessary in the copy constructor and assignment operator.
- I did not put the operator=() prototype next to the other operator prototypes in SegmentedString.h. I thought it best to leave it by the copy constructor. Let me know if this needs to be moved.
- All layout tests passed with this new patch.
Comment on attachment 6773 [details]
I don't understand why we save lineno + src.lineCount() but then restore it into lineno; it seems like this will make lineno higher than it should be. I'd like to see a test that checks that the line numbers are correct. To do that you'll have to figure out what the line numbers are used for.
I don't think the SegmentedString copy constructor should be in the header file and the assignment operator not. Either both should be inlined or neither. I'd lean toward neither.
Because of the line number issue, I'm going to mark this review-, but this looks nearly perfect!
(In reply to comment #24)
> - In the SegmentedString class, I can't find any code where "m_currentChar =
> &m_pushedChar2", so I'm not sure the middle else-if logic is really necessary
> in the copy constructor and assignment operator.
Good point. Probably OK to leave it out.
> - I did not put the operator=() prototype next to the other operator prototypes
> in SegmentedString.h. I thought it best to leave it by the copy constructor.
> Let me know if this needs to be moved.
I think it's fine where it is.
(In reply to comment #25)
> (From update of attachment 6773 [details] )
> I don't understand why we save lineno + src.lineCount() but then restore it
> into lineno; it seems like this will make lineno higher than it should be. I'd
> like to see a test that checks that the line numbers are correct. To do that
> you'll have to figure out what the line numbers are used for.
The reason I used "lineno + src.lineCount()" was because the line numbers were off (confirmed via the debugger originally) when I was using "SegmentedString savedSrc = SegmentedString(src.toString())" instead of the copy constructor or assignment operator. I'll verify that "savedLineno = lineno" works via the debugger again.
(In reply to comment #27)
> a line number for the script source in it,
Yes, that's the way to do it.
> but that may only be relative to the script source, not the HTML source.
It should be relative to the HTML source. In fact, that's the only reason the tokenizer has a line number.
Created attachment 6820 [details]
Test case (bad DOM in <script>)
While writing a line-number test, I found a test document that did not parse correctly. The <script> tag in the <head> has the entire contents of the file after "<title>" as its contents. My guess is that some internal state of the tokenizer (or parser?) needs to be saved and restored, or the cleanup code in HTMLParser::handleError() isn't quite right.
Created attachment 6873 [details]
Updates from Patch v6:
- Added fast/js/exception-linenums-in-html-1.html and fast/js/exception-linenums-in-html-2.html tests to verify correct line numbers in HTML per Comment #25.
- Added fast/js/missing-title-end-tag-js.html to cover issue found in Comment #29 (Attachment #6820 [details]).
- Moved SegmentedString copy constructor to .cpp file from .h file per Comment #25.
- In HTMLTokenizer::parseTag() when </title> is missing:
* Save and restore the whole State (not just the TagState).
* Save and restore "lineno", not "lineno + src.lineCount()", per Comment #25.
* Reset scriptCodeSize to zero when resetting state (due to code in an unconditional else clause at the end of the while loop of parseSpecial() which accumulates every character in the document in QChar* scriptCode while looking for "</title"). This fixes the issue in Comment #29 (Attachment #6820 [details]).
- I tried fixing the issue in Comment #29 (Attachment #6820 [details]) by adding conditions to the unconditional else clause near the end of the while loop in parseSpecial(), but everything I tried caused Safari to simply hang when tokenizing/parsing a document. ("Everything" was "if (!state.inTitle())" and "if (state.inScript())".) It seems like a waste to me to update scriptCode unnecessarily if it's never used later (since it grows automatically), but I'm not sure how to fix it.
Comment on attachment 6873 [details]
The part here that seems tricky is knowing how much of the tokenizer state we need to save and restore. In theory, all the state of both the tokenizer and the parser might have to be saved.
I'm sure for individual fields we can prove why they don't need to be saved. But I'm concerned that there are *so* many that we aren't saving and restoring.
Just to pick a couple of example, what about "scriptCodeResync", "pendingSrc", and "currentPrependingSrc"?
I'm going to mark this review+, but I'm concerned there are some new edge cases we're introducing here. Maybe more testing is the right thing to do.
(In reply to comment #31)
> The part here that seems tricky is knowing how much of the tokenizer state we
> need to save and restore. In theory, all the state of both the tokenizer and
> the parser might have to be saved.
> I'm sure for individual fields we can prove why they don't need to be saved.
> But I'm concerned that there are *so* many that we aren't saving and restoring.
> Just to pick a couple of example, what about "scriptCodeResync", "pendingSrc",
> and "currentPrependingSrc"?
Within all of htmltokenizer.cpp, the only time that "scriptCodeResync" is set to a non-zero is in parseSpecial() when the end tag is found in the while() loop. Since we know that </title> is never found in the special case, we don't have to worry about resetting "scriptCodeResync".
Also, "pendingSrc" and "currentPrependingSrc" (along with similarly-named variables) are used primarily in HTMLTokenizer::scriptHandler() and scriptExecution(). Since we're searching for a </title> tag, this code is never run from parseSpecial() since state.inScript() is never true for this case.
I feel reasonably confident that we're not missing any state information, although "saveState()" and "restoreState()" methods on the tokenizer would have been nice to have for this patch. :)
Verified fixed as of r13202 (locally built).
This fix needs to be backed out Bug 6314, Comment #19.
(In reply to comment #34)
> This fix needs to be backed out Bug 6314, Comment #19.
While I did back out the fix for bug 6314, since this one doesn't have affects on major sites I think we could probably just fix this a new way without the intermediate step of backing this one's out.
Reassigning this bug back to webkit-unassigned in case anyone else is interested in fixing it. I plan to look at it in a couple weeks.
Before this is fixed, an http test needs to be created that reliably reproduces the latest problem found (see Comment #34).
Firefox does show a blank page and HTML5 currently requires that. Anything else would require some type of reparsing which is potentially dangerous in case of <title> iirc.
I believe this is INVALID. Reparsing is a security risk and HTML5 says not to. Other browsers are converging on that behaviour too.