Bug 7346 - Parsing DOCTYPE with missing end quote includes part of the HTML document in the systemId
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Misc.
Version: 417.x
Hardware: Mac OS X 10.4
Importance: P2 Normal
Assignee: Nobody
URL: http://jtds.sourceforge.net/doc/
Keywords: HasReduction
Depends on:
Blocks:
 
Reported: 2006-02-18 16:18 PST by David Kilzer (:ddkilzer)
Modified: 2008-07-01 11:12 PDT

Attachments
Reduced test case (745 bytes, text/html)
2006-07-08 04:12 PDT, David Kilzer (:ddkilzer)
Reduced test case saved as webarchive (1.41 KB, application/x-webarchive)
2006-07-08 04:18 PDT, David Kilzer (:ddkilzer)

Description David Kilzer (:ddkilzer) 2006-02-18 16:18:34 PST
Summary:

When saving a web page containing framesets in .webarchive format, WebKit outputs invalid HTML for the WebResourceData data (the HTML source for the web page being archived) in the WebMainResource key.

Steps to reproduce:

1. Open Safari.
2. Go to a web page with a frameset, like this one:  http://jtds.sourceforge.net/doc/
3. Save the page in WebArchive format, e.g., as jtds.webarchive.
4. Open the jtds.webarchive file just saved in Safari.

Expected results:

The jtds.webarchive should look identical to the original web page.

Actual results:

The jtds.webarchive is missing content because the HTML page containing the frameset is invalid when written to the archive.

Regression:

This fails on Safari 2.0.3 (417.8) on Mac OS X 10.4.5 the same way it fails on WebKit ToT (r12884).
Comment 1 David Kilzer (:ddkilzer) 2006-02-21 15:16:22 PST
One thing I noticed about the .webarchive format is that WebKit appears to walk the DOM and output HTML based on the DOM rather than saving the originally loaded source that would be available via Cmd-Opt-U.  (Note that I HAVE NOT looked at the actual source yet--I'm basing this on observed behavior!)

This shows up as all tag names becoming capitalized (<HTML>, <HEAD>, etc.), and missing tags in the output that aren't included in the DOM (like <noframes> in Safari 2.0.3 before recent fixes in ToT WebKit).

If WebKit were to write out the original source to the .webarchive files, this would fix this bug.

Note that the DOM would still have to be walked to gather all of the various external resources (scripts, images, stylesheets, frames, etc.), but it might be walked at a much faster rate since only those tags with external resources would need to be processed.
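
As a rough illustration of the "only process tags with external resources" idea, here is a minimal Python sketch. It is not WebKit code: it scans HTML source with the standard-library tokenizer instead of walking a real DOM, and the RESOURCE_ATTRS table and the collect_resources name are hypothetical choices made for this example only.

```python
from html.parser import HTMLParser

# Hypothetical table: tag name -> attribute that points at an external resource.
# Note that every <link href> is collected, not only stylesheets.
RESOURCE_ATTRS = {
    "script": "src",
    "img": "src",
    "frame": "src",
    "iframe": "src",
    "link": "href",
}


class ResourceCollector(HTMLParser):
    """Collects (tag, url) pairs and ignores every other tag."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return  # Tag cannot reference an external resource; skip it.
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append((tag, value))


def collect_resources(html_source):
    """Return the external resource references found in html_source."""
    collector = ResourceCollector()
    collector.feed(html_source)
    collector.close()
    return collector.resources
```

The archived main resource would then be the original source bytes, with this list used only to decide which subresources to fetch and store alongside it.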
Comment 2 Joost de Valk (AlthA) 2006-07-08 03:06:49 PDT
David, what would you like to see reduced here? :)
Comment 3 David Kilzer (:ddkilzer) 2006-07-08 04:12:46 PDT
Created attachment 9263 [details]
Reduced test case

(In reply to comment #2)
> David, what would you like to see reduced here? :)

I was looking for something like this attachment.  To see the problem, follow these steps:

1. Open the reduced test case.
2. Save the reduced test case as a webarchive file.  (I just tried Safari 2.0.4 (419.3) on Mac OS X 10.4.7 (8J135/PowerPC).)
3. Open the reduced test case in the browser again.
4. View HTML source.
5. Note the mangled HTML.

It appears that there is a bug in the webarchive code when writing framesets.
Comment 4 David Kilzer (:ddkilzer) 2006-07-08 04:14:56 PDT
(In reply to comment #3)
> 2. Save the reduced test case as a webarchive file.  (I just tried Safari 2.0.4
> (419.3) on Mac OS X 10.4.7 (8J135/PowerPC).)

With a locally-built WebKit r15227.

Comment 5 David Kilzer (:ddkilzer) 2006-07-08 04:17:17 PDT
(In reply to comment #3)
> 4. View HTML source.
> 5. Note the mangled HTML.

Using an extraction tool I wrote in Perl (see Bug 7241), I have confirmed that the HTML seen when viewing source is the HTML saved by the webarchive code.  Yes, it's that ugly.
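
For anyone who wants to repeat this check without the Perl tool, a minimal Python sketch follows. It assumes the .webarchive file is a property list whose WebMainResource dictionary holds the page source under WebResourceData, as described in the original report. The WebResourceTextEncodingName key and the extract_main_resource_html function name are assumptions of this sketch, not something taken from the Perl tool.

```python
import plistlib


def extract_main_resource_html(archive_bytes):
    """Return the main resource's HTML source from webarchive bytes."""
    # plistlib.loads detects binary vs. XML property lists automatically.
    archive = plistlib.loads(archive_bytes)
    main = archive["WebMainResource"]
    data = main["WebResourceData"]
    # Assumption: the archive may record the text encoding under this key;
    # fall back to UTF-8 when it is absent.
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    return data.decode(encoding, errors="replace")
```

Reading the file with open(path, "rb").read() and passing the bytes to this function should print the same HTML that View Source shows for the archive.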

Comment 6 David Kilzer (:ddkilzer) 2006-07-08 04:18:00 PDT
Created attachment 9264 [details]
Reduced test case saved as webarchive
Comment 7 David Kilzer (:ddkilzer) 2007-03-07 03:12:57 PST
The initial problem is that the DOCTYPE tag is missing an ending quote on its systemId.  (It also is missing a space between the publicId and the systemId, but that seems to be handled properly.)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN""http://www.w3.org/TR/REC-html40/loose.dtd>

The result is that instead of the systemId being this URL:

http://www.w3.org/TR/REC-html40/loose.dtd

It's now everything from that URL to the first double quote in the HTML document:

http://www.w3.org/TR/REC-html40/loose.dtd>
<!--NewPage-->
<HTML>
<HEAD>
<!-- Generated by javadoc on Tue Nov 08 14:06:10 EET 2005-->
<TITLE>
jTDS API
</TITLE>
</HEAD>
<FRAMESET cols=

Firefox manages to parse the systemId correctly (as seen when loading the example URL and then saving it as "Web Page, Complete").  WebKit should be able to recognize the end of the DOCTYPE tag by the ">" character, rather than consuming the rest of the HTML document until it finds a closing double quote.

I have confirmed that adding the missing double quote to the DOCTYPE tag causes the webarchive file to be saved properly.
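
To make the suggested behavior concrete, here is a minimal Python sketch of a DOCTYPE scan that ends a quoted identifier at either the matching quote or the ">" character, whichever comes first. It is an illustration only, not WebKit's tokenizer: the read_quoted and parse_doctype names are hypothetical, and the sketch assumes the PUBLIC form, where the first quoted string is the publicId and the second is the systemId.

```python
def read_quoted(text, pos):
    """Read a quoted identifier whose opening quote is at text[pos].

    Returns (value, next_pos). The scan stops at the matching quote or at
    '>', so a missing end quote cannot swallow the rest of the document.
    """
    quote = text[pos]
    i = pos + 1
    while i < len(text) and text[i] not in (quote, ">"):
        i += 1
    value = text[pos + 1:i]
    terminated = i < len(text) and text[i] == quote
    # Step past the closing quote, but leave '>' for the caller to see.
    next_pos = i + 1 if terminated else i
    return value, next_pos


def parse_doctype(text):
    """Return (public_id, system_id) from a leading <!DOCTYPE ...> tag."""
    if not text.upper().startswith("<!DOCTYPE"):
        return None, None
    ids = []
    i = len("<!DOCTYPE")
    while i < len(text) and text[i] != ">":
        if text[i] in "\"'":
            value, i = read_quoted(text, i)
            ids.append(value)
        else:
            i += 1
    public_id = ids[0] if ids else None
    system_id = ids[1] if len(ids) > 1 else None
    return public_id, system_id
```

With the DOCTYPE from the example URL, this scan yields the loose.dtd URL alone as the systemId, and it also copes with the missing space between the two identifiers.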
Comment 8 David Kilzer (:ddkilzer) 2008-07-01 11:12:37 PDT
The bisect-builds script reports the progression (fix) occurred between:

Fails: r30377  Works: r30459

Note that ToT WebKit doesn't output the broken <!DOCTYPE> tag in this case, which fixes this bug.

However, if the <!DOCTYPE> tag is valid, TWO <!DOCTYPE> tags are written to the WebArchive file.  See Bug 15290.