Bug 3905 - Missing </title> makes page blank
: Missing </title> makes page blank
Status: RESOLVED INVALID
: WebKit
HTML DOM
: 420+
: Macintosh Mac OS X 10.4
: P2 Minor
Assigned To:
:
: HasReduction
:
: 6314
  Show dependency treegraph
 
Reported: 2005-07-08 07:20 PST by
Modified: 2007-12-27 23:22 PST (History)


Attachments
File with the above html contents AKA test-case. (271 bytes, text/html)
2005-07-08 07:21 PST, Peter Speck
no flags Details
Torture test (3.59 KB, application/x-gzip)
2006-02-25 08:17 PST, David Kilzer (:ddkilzer)
no flags Details
Patch v1 (5.47 KB, patch)
2006-02-25 09:38 PST, David Kilzer (:ddkilzer)
no flags Review Patch | Details | Formatted Diff | Diff
Patch v2 (5.41 KB, patch)
2006-02-25 10:16 PST, David Kilzer (:ddkilzer)
no flags Review Patch | Details | Formatted Diff | Diff
Test open <title> tag in <body> (363 bytes, text/html)
2006-02-25 11:11 PST, David Kilzer (:ddkilzer)
no flags Details
Patch v3 (9.57 KB, patch)
2006-02-25 18:54 PST, David Kilzer (:ddkilzer)
darin: review-
Review Patch | Details | Formatted Diff | Diff
Patch v4 (9.63 KB, patch)
2006-02-26 22:08 PST, David Kilzer (:ddkilzer)
no flags Review Patch | Details | Formatted Diff | Diff
Patch v5 (9.55 KB, patch)
2006-02-27 21:05 PST, David Kilzer (:ddkilzer)
darin: review-
Review Patch | Details | Formatted Diff | Diff
Patch v6 (11.95 KB, patch)
2006-02-28 06:56 PST, David Kilzer (:ddkilzer)
darin: review-
Review Patch | Details | Formatted Diff | Diff
Test case (bad DOM in <script>) (82 bytes, text/html)
2006-03-02 21:08 PST, David Kilzer (:ddkilzer)
no flags Details
Patch v7 (21.95 KB, patch)
2006-03-05 11:03 PST, David Kilzer (:ddkilzer)
no flags Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2005-07-08 07:20:39 PST
The </title> end-tag is required by the html specification, but if it is missing, the page will be blank. 
Other browsers display the content of the page. WebKit should close the tag at </head> or other tags 
which cannot be inside the <title> tag (e.g. <meta>).

Example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
<title>My title
</head>
<body>
<h1>Heading !</h1>
<p>para</p>
</body>
</html>
------- Comment #1 From 2005-07-08 07:21:50 PST -------
Created an attachment (id=2867) [details]
File with the above html contents AKA test-case.
------- Comment #2 From 2005-07-08 10:51:56 PST -------
Confirmed, i guess </head> should close an open <title> as well..
------- Comment #3 From 2005-12-27 14:29:18 PST -------
Should be pretty simple to fix in the html parser.
------- Comment #4 From 2006-01-05 11:00:22 PST -------
(In reply to comment #3)
> Should be pretty simple to fix in the html parser.

Looks like it's currently handled by the tokenizer, so I'm cc:ing Adele per http://wiki.opendarwin.org/
index.php/WebKit:WebKit_Team
------- Comment #5 From 2006-01-19 09:24:49 PST -------
(In reply to comment #2)
> Confirmed, i guess </head> should close an open <title> as well..

No, No!, I think that <title>this is a title bug report</title> is the right way. Dont let those pesky non 
conforming webmasters do any non-standard bllllllllugh!

We should stick to the spec!
------- Comment #6 From 2006-01-19 09:56:54 PST -------
Though i agree with the previous poster in that we should stick with the spec. However, our main concern 
is the user and not the webmaster. If we can display a page while not creating any other problems, we 
should.
------- Comment #7 From 2006-02-24 15:58:28 PST -------
HTML5 has specific rules for handling un-closed </title>.

Basically, what happens is that you parse the contents of <title> in a special tokeniser mode (treating < characters as characters, not tag starts), and then if you get to the end of the document without seeing a </title>, then you go all the way back to the <title>, and parse it again, as if it was a normal element with no special tokenisation, and the next start tag or end tag is where you close the <title> element.
------- Comment #8 From 2006-02-25 08:07:05 PST -------
(In reply to comment #7)
> Basically, what happens is that you parse the contents of <title> in a special
> tokeniser mode (treating < characters as characters, not tag starts), and then
> if you get to the end of the document without seeing a </title>, then you go
> all the way back to the <title>, and parse it again, as if it was a normal
> element with no special tokenisation, and the next start tag or end tag is
> where you close the <title> element.

The above method is the way Firefox 1.5.0.1 handles this case.  MSIE 6 currently handles it like Safari does now:  it reads the rest of the document as if it were the title #text, never finding the </title> tag, then just stops parsing.

I have a patch for this that works.  I just need to make sure the state of the tokenizer is reset properly when backing up and reparsing after initially consuming the rest of the document.
------- Comment #9 From 2006-02-25 08:17:30 PST -------
Created an attachment (id=6730) [details]
Torture test

This is a gzip-compressed torture test.  WARNING: It will expand to over 1 MB in size when uncompressed!

Basically, this is a 1 MB file with a missing </title> tag that causes the tokenizer to consume the entire document.

With my current fix (patch coming soon), the page will render in about 12 seconds on a 1.5 GHz PB G4 with 1 GB RAM.  It spends 2-3 seconds consuming the document looking for </title> (spinner pauses during this time), then backs up and reparses with no special handling of the title.

Firefox 1.5.0.1 on the same PB G4 takes a few MINUTES to parse the file, but eventually finishes and inserts the missing </title> tag.  MSIE 6 on a 1.7 GHz IBM ThinkPad T42 with 2 GB RAM takes at least twice as long as Firefox, and just stops parsing after it reaches the end of the document with no </title> tag.
------- Comment #10 From 2006-02-25 09:38:56 PST -------
Created an attachment (id=6732) [details]
Patch v1

Patch v1 includes a layout test and code fix.

The code fix is to determine when HTMLTokenizer::parseSpecial() has consumed the rest of the document looking for </title> from within HTMLTokenizer::parseTag(), then reset the state of the tokenizer and retokenize with no special handling for the title.  The HTML parser will then take care of adding the missing </title> tag.

NOTE: I'm not sure whether the state of the tokenizer has been reset sufficiently in this patch!  Please verify this when reviewing the patch.  I reset 'src' and 'lineno', but didn't really look at any other state variables.  There may be other tokenizer state variables that need to be reset!

I did run all of the layout tests and they all passed (except fast/js/date-constructor.html which has known issues).
------- Comment #11 From 2006-02-25 09:58:30 PST -------
Conversation from IRC:

[11:45am] mitzpettel: ddkilzer: i'm wondering if you can't detect <title> badness before reaching the very end of the document
[11:46am] ddkilzer: mitzpettel: how? 
[11:46am] mitzpettel: ddkilzer: e.g. as early as you see something that looks like a close tag?
[11:47am] mitzpettel: ddkilzer: (i'm sure you've given it more thought than i have so you probably can't do it that way)
[11:47am] ddkilzer: mitzpettel: What about this title:    <title>Using </junk></title>
[11:47am] ddkilzer: mitzpettel: Not much more thought...I was looking for an easy solution that didn't impact normal parsing much.
[11:48am] ddkilzer: mitzpettel: To do what you're saying I'd have to add more state variables or code in other parts of the tokenizer.
[11:48am] ddkilzer: s/parsing/tokenizing/
[11:48am] ddkilzer: (I think)
[11:49am] ddkilzer: mitzpettel: This solution does a relatively quick check (state.inTitle() && src.isEmpty()) exactly after the tokenizer just ate the whole document, and is able to recover if needed.  This is also an exceptional case, so I don't really think the solution needs to be efficient.    I'm also just doing what Hixie will be suggesting for HTML 5.
[11:50am] mitzpettel: ddkilzer: i see 
[11:50am] ddkilzer: In the normal case, the check will fall through after only state.inTitle() (which is a bit check), and not impact performance.  
[11:50am] ddkilzer: mitzpettel: Of course, reviewers may see it differently, but I understand what you're saying.
[11:51am] mitzpettel: ddkilzer: i guess the easy solution is the correct one in this case
[11:51am] ddkilzer: I had no idea title-tokenizing was much like tokenizing <script> stuff where "anything goes" until the </title> tag appears.
------- Comment #12 From 2006-02-25 10:05:21 PST -------
The bug lies in the HTML tokenizer, not the DOM.  Moving bug to "WebKit Misc." component.
------- Comment #13 From 2006-02-25 10:16:26 PST -------
Created an attachment (id=6733) [details]
Patch v2

*PAIN*  There was a curly brace that wasn't lined up properly in Patch v1.  Fixed in Patch v2.

All of the previous questions still apply, though. :)
------- Comment #14 From 2006-02-25 11:11:32 PST -------
Created an attachment (id=6734) [details]
Test open <title> tag in <body>
------- Comment #15 From 2006-02-25 18:54:42 PST -------
Created an attachment (id=6741) [details]
Patch v3

Changes since Patch v2:

- Added a test for an unclosed <title> tag in the <body>, which failed with Patch v1 and v2.
- Added code to HTMLParser::parseToken() to handle condition when no </title> tag exists.  Patch v1 and v2 handled this implicitly (probably in HTMLParser::handleError() when the </head> tag was processed) for the first case (missing </title> in <head>), but not for the second case (in the <body>).
------- Comment #16 From 2006-02-25 18:57:05 PST -------
Note that Firefox 1.5.0.1 and MSIE 6 handle an unclosed <title> tag in the body the same way they handle an unclosed <title> tag in the head.
------- Comment #17 From 2006-02-25 18:58:37 PST -------
Finally, note that all layout tests passed (no pixel tests) with Patch v3.
------- Comment #18 From 2006-02-25 20:03:53 PST -------
(From update of attachment 6741 [details])
It's not good that this code calls SegmentedString::toString() before parsing any <title>. This undoes the SegmentedString optimization, by building a giant string out of all the separate pieces. And it's going to happen often since many documents have a <title> tag.

Instead, we should simply assign savedSrc to src, without calling toString. We will presumably have to write our own copy constructor because the m_currentChar field might be pointing to m_pushedChar1 or m_pushedChar2.

This code also relies on the assignment operator of SegmentedString working. Since we're using the compiler-generated one in that case we could have the same trouble, except that we're in the special case of a SegmentedString that was just created so we know that m_currentChar is not pointing to m_pushedChar1 or 2. So I think we need to implement operator= as well for SegmentedString.

This copy constructor and assignment operator should not be too terribly difficult, since m_currentChar is the only case where I think we need to do anything special.
------- Comment #19 From 2006-02-26 22:08:31 PST -------
Created an attachment (id=6753) [details]
Patch v4

Changes since Patch v3:

- After further review the copy constructor and assignment operator mentioned in Comment #18 were not needed.  It was sufficient to save the state simply by using the default assignment operator in HTMLTokenizer::parseTag().
- Moved the code in HTMLParser::parseToken() into HTMLParser::handleError() [part 2] since that's where all the other cleanup code lives.
------- Comment #20 From 2006-02-27 08:26:37 PST -------
(From update of attachment 6753 [details])
In htmlParser.cpp, I don't know why the check n->isHTMLElement() is appropriate; it seems too strict. What about a non-HTML element after a <title> like a custom tag? It seems that the code could be simpler, more like the handling of addressTag, fontTag, dlTag, and dtTag.

I think it's OK to land the patch like this, but probably even better to simplify it more.

r=me anyway
------- Comment #21 From 2006-02-27 18:10:21 PST -------
(From update of attachment 6753 [details])
I will simplify the patch and repost.
------- Comment #22 From 2006-02-27 21:05:11 PST -------
Created an attachment (id=6769) [details]
Patch v5

Changes since Patch v4:

- Removed unnecessary code from HTMLParser::handleError() per Comment #20.  If execution reaches this block of code, we know that the 'current' element is <title> and that the next element to be inserted is not </title> or a #text node (since we're in handleError()) so we can always pop <title> off the block.
------- Comment #23 From 2006-02-27 22:35:30 PST -------
(From update of attachment 6769 [details])
We talked about it a little more and decided that we really do need to write a copy constructor and assignment operator for SegmentedString.
------- Comment #24 From 2006-02-28 06:56:37 PST -------
Created an attachment (id=6773) [details]
Patch v6

Updates from Patch v5:

- Removed verbose comments from htmlparser.cpp patch.
- Added copy constructor and assignment operator to SegmentedString class.

Notes:

- In the SegmentedString class, I can't find any code where "m_currentChar = &m_pushedChar2", so I'm not sure the middle else-if logic is really necessary in the copy constructor and assignment operator.
- I did not put the operator=() prototype next to the other operator prototypes in SegmentedString.h.  I thought it best to leave it by the copy constructor.  Let me know if this needs to be moved.
- All layout tests passed with this new patch.
------- Comment #25 From 2006-02-28 09:38:37 PST -------
(From update of attachment 6773 [details])
I don't understand why we save lineno + src.lineCount() but then restore it into lineno; it seems like this will make lineno higher than it should be. I'd like to see a test that checks that the line numbers are correct. To do that you'll have to figure out what the line numbers are used for.

I don't think the SegmentedString copy constructor should be in the header file and the assignment operator not. Either both should be inlined or neither. I'd lean toward neither.

Because of the line number issue, I'm going to mark this review-, but this looks nearly perfect!
------- Comment #26 From 2006-02-28 10:53:29 PST -------
(In reply to comment #24)
> - In the SegmentedString class, I can't find any code where "m_currentChar =
> &m_pushedChar2", so I'm not sure the middle else-if logic is really necessary
> in the copy constructor and assignment operator.

Good point. Probably OK to leave it out.

> - I did not put the operator=() prototype next to the other operator prototypes
> in SegmentedString.h.  I thought it best to leave it by the copy constructor. 
> Let me know if this needs to be moved.

I think it's fine where it is.
------- Comment #27 From 2006-02-28 12:42:09 PST -------
(In reply to comment #25)
> (From update of attachment 6773 [details] [edit])
> I don't understand why we save lineno + src.lineCount() but then restore it
> into lineno; it seems like this will make lineno higher than it should be. I'd
> like to see a test that checks that the line numbers are correct. To do that
> you'll have to figure out what the line numbers are used for.

The reason I used "lineno + src.lineCount()" was because the line numbers were off (confirmed via the debugger originally) when I was using "SegmentedString savedSrc = SegmentedString(src.toString())" instead of the copy constructor or assignment operator.  I'll verify that "savedLineno = lineno" works via the debugger again.

I'll also try to write a test for checking line numbers in the file, but I kind of doubt that the DOM or JavaScript has the concept of "__LINE__" for the file that's loaded, does it?  Hmm...maybe I can catch a JavaScript exception and get a line number for the script source in it, but that may only be relative to the script source, not the HTML source.
------- Comment #28 From 2006-03-01 10:37:43 PST -------
(In reply to comment #27)
> Hmm...maybe I can catch a JavaScript exception and get
> a line number for the script source in it,

Yes, that's the way to do it.

> but that may only be relative to the script source, not the HTML source.

It should be relative to the HTML source. In fact, that's the only reason the tokenizer has a line number.
------- Comment #29 From 2006-03-02 21:08:32 PST -------
Created an attachment (id=6820) [details]
Test case (bad DOM in <script>)

While writing a line-number test, I found a test document that did not parse correctly.  The <script> tag in the <head> has the entire contents of the file after "<title>" as its contents.  My guess is that some internal state of the tokenizer (or parser?) needs to be saved and restored, or the cleanup code in HTMLParser::handleError() isn't quite right.
------- Comment #30 From 2006-03-05 11:03:34 PST -------
Created an attachment (id=6873) [details]
Patch v7

Updates from Patch v6:

- Added fast/js/exception-linenums-in-html-1.html and fast/js/exception-linenums-in-html-2.html tests to verify correct line numbers in HTML per Comment #25.
- Added fast/js/missing-title-end-tag-js.html to cover issue found in Comment #29 (Attachment #6820 [details]).
- Moved SegmentedString copy constructor to .cpp file from .h file per Comment #25.
- In HTMLTokenizer::parseTag() when </title> is missing:
  * Save and restore the whole State (not just the TagState).
  * Save and restore "lineno", not "lineno + src.lineCount()", per Comment #25.
  * Reset scriptCodeSize to zero when resetting state (due to code in an unconditional else clause at the end of the while loop of parseSpecial() which accumulates every character in the document in QChar* scriptCode while looking for "</title").  This fixes the issue in Comment #29 (Attachment #6820 [details]).

Notes:

- I tried fixing the issue in Comment #29 (Attachment #6820 [details]) by adding conditions to the unconditional else clause near the end of the while loop in parseSpecial(), but everything I tried caused Safari to simply hang when tokenizing/parsing a document.  ("Everything" was "if (!state.inTitle())" and "if (state.inScript())".)  It seems like a waste to me to update scriptCode unnecessarily if it's never used later (since it grows automatically), but I'm not sure how to fix it.
------- Comment #31 From 2006-03-05 12:05:35 PST -------
(From update of attachment 6873 [details])
Nice tests!

The part here that seems tricky is knowing how much of the tokenizer state we need to save and restore. In theory, all the state of both the tokenizer and the parser might have to be saved.

I'm sure for individual fields we can prove why they don't need to be saved. But I'm concerned that there are *so* many that we aren't saving and restoring.

Just to pick a couple of example, what about "scriptCodeResync", "pendingSrc", and "currentPrependingSrc"?

I'm going to mark this review+, but I'm concerned there are some new edge cases we're introducing here. Maybe more testing is the right thing to do.
------- Comment #32 From 2006-03-05 16:13:04 PST -------
(In reply to comment #31)
> The part here that seems tricky is knowing how much of the tokenizer state we
> need to save and restore. In theory, all the state of both the tokenizer and
> the parser might have to be saved.
> 
> I'm sure for individual fields we can prove why they don't need to be saved.
> But I'm concerned that there are *so* many that we aren't saving and restoring.
> 
> Just to pick a couple of example, what about "scriptCodeResync", "pendingSrc",
> and "currentPrependingSrc"?

Within all of htmltokenizer.cpp, the only time that "scriptCodeResync" is set to a non-zero is in parseSpecial() when the end tag is found in the while() loop.  Since we know that </title> is never found in the special case, we don't have to worry about resetting "scriptCodeResync".

Also, "pendingSrc" and "currentPrependingSrc" (along with similarly-named variables) are used primarily in HTMLTokenizer::scriptHandler() and scriptExecution().  Since we're searching for a </title> tag, this code is never run from parseSpecial() since state.inScript() is never true for this case.

I feel reasonably confident that we're not missing any state information, although "saveState()" and "restoreState()" methods on the tokenizer would have been nice to have for this patch.  :)
------- Comment #33 From 2006-03-07 20:28:26 PST -------
Verified fixed as of r13202 (locally built).
------- Comment #34 From 2006-03-19 10:48:47 PST -------
This fix needs to be backed out Bug 6314, Comment #19.
------- Comment #35 From 2006-03-19 12:00:24 PST -------
(In reply to comment #34)
> This fix needs to be backed out Bug 6314, Comment #19.

While I did back out the fix for bug 6314, since this one doesn't have affects on major sites I think we could probably just fix this a new way without the intermediate step of backing this one's out.
------- Comment #36 From 2006-06-12 08:32:03 PST -------
Reassigning this bug back to webkit-unassigned in case anyone else is interested in fixing it.  I plan to look at it in a couple weeks.

Before this is fixed, an http test needs to be created that reliably reproduces the latest problem found (see Comment #34).
------- Comment #37 From 2007-06-27 13:57:58 PST -------
Firefox does show a blank page and HTML5 currently requires that. Anything else would require some type of reparsing which is potentially dangerous in case of <title> iirc.
------- Comment #38 From 2007-12-27 23:22:38 PST -------
I believe this is INVALID. Reparsing is a security risk and HTML5 says not to. Other browsers are converging on that behaviour too.