Bug 35954 - Processing instructions inside DOCTYPE internal subset are parsed incorrectly (by libxml2?)
Status: NEW
Product: WebKit
Classification: Unclassified
Component: XML DOM
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Major
Assigned To: Nobody
Depends on:
Blocks:
Reported: 2010-03-09 20:17 PST by Leif Halvard Silli
Modified: 2011-10-06 01:01 PDT (History)

Attachments
test (192 bytes, application/xhtml+xml)
2010-11-18 09:53 PST, Alexey Proskuryakov
test (239 bytes, application/xhtml+xml)
2010-11-18 09:57 PST, Alexey Proskuryakov
Shows that Webkit *does* follow XML PI-syntax (294 bytes, application/xhtml+xml)
2011-10-05 18:51 PDT, Leif Halvard Silli
Shows that Webkit accepts a closed comment inside the PI (408 bytes, application/xhtml+xml)
2011-10-05 18:58 PDT, Leif Halvard Silli
Reduction of the problem: Webkit doesn't accept an "unclosed" comment inside the PI (366 bytes, application/xhtml+xml)
2011-10-05 19:02 PDT, Leif Halvard Silli
Workaround: Shows how to circumvent the problem - perhaps point at a solution? (1.76 KB, application/xhtml+xml)
2011-10-05 19:28 PDT, Leif Halvard Silli
Workaround 2: Here the ]> appears right after the processing instruction has started (1.92 KB, application/xhtml+xml)
2011-10-06 00:14 PDT, Leif Halvard Silli
Workaround 3: Add a comment inside the DTD, after the PI (1.54 KB, application/xhtml+xml)
2011-10-06 00:59 PDT, Leif Halvard Silli
Workaround: Shows that an "HTML5 comment" - a "short comment" (<!-->) can be used as workaround (2.04 KB, application/xhtml+xml)
2011-10-06 01:01 PDT, Leif Halvard Silli

Description Leif Halvard Silli 2010-03-09 20:17:42 PST
First of all: this bug concerns the XML parsing of XHTML documents (not text/html parsing!). However, it is also related to text/html issues, which I explain along the way.

How to reproduce the bug:

(1) Add this DOCTYPE to an XHTML document. The internal DTD subset inside the DOCTYPE applies a hack, in the form of an XHTML processing instruction, to keep text/html parsers from displaying a "]>" inside the body. The whole hack is explained in an e-mail message to the W3C validator's mailing list: http://lists.w3.org/Archives/Public/www-validator/2010Mar/0026.html
This is the code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" 
[ 
<!ATTLIST html class CDATA #IMPLIED>
<?parser-hack ><!--?>
]> 
<!--><?!-->

(2) If you wish, try to load the page as text/html. However, the point in this bug is XML, so load the page as "application/xhtml+xml". 

(3) Results in Firefox, Konqueror and Opera: works 100%

(4) Result in Webkit: a "yellow screen of death" in the form of the following message:
     "This page contains the following errors: error on line 3 at column 1: Extra content at the end of the document"
      In short: nothing is displayed.

(5) Remove the "<?parser-hack ><!--?>" and reload the page - voila, it works in Webkit as well.

(6) Place the "<?parser-hack ><!--?>" inside the body of the XHTML page. Reload. No problems.


        CONCLUSION ABOUT THE PROBLEM: 
       ======================

 Apparently, when a PI is placed inside the internal subset of an XHTML DOCTYPE, Webkit parses the XHTML PI as if it were an HTML4 PI, meaning that it thinks the PI ends at the first ">". Thus Webkit also sees the HTML comment "start tag", the "<!--".

In text/html mode, the point of this hack is exactly that the browser thinks the PI ends with the ">" and that it also sees the "<!--".

However, this is XHTML/XML mode, and thus Webkit should parse the DOCTYPE, including PIs, according to XHTML/XML rules. Hence a ">" is permitted inside the PI, and a "<!--" should not affect the parsing.
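
For comparison, here is a minimal sketch of how another XML parser treats the same prolog. It uses Python's standard expat bindings (xml.parsers.expat), not WebKit or libxml2, so it only illustrates the XML rule that a PI runs until "?>". The root element appended after the prolog is an assumption, added only to make the document complete.

```python
import xml.parsers.expat

# The prolog from step (1), followed by a minimal root element
# (the root element is an assumption, not part of the report).
DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
<!ATTLIST html class CDATA #IMPLIED>
<?parser-hack ><!--?>
]>
<!--><?!-->
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>"""


def scan(doc):
    """Return the (target, data) PIs and the comments that expat reports."""
    pis, comments = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = lambda target, data: pis.append((target, data))
    parser.CommentHandler = comments.append
    parser.Parse(doc, True)  # raises ExpatError if the document is not well-formed
    return pis, comments


pis, comments = scan(DOC)
print(pis)       # PIs seen, including the one inside the internal subset
print(comments)  # comments seen after the DOCTYPE
```

If the parse succeeds, the ">" and the "<!--" show up as plain PI data, which is the behaviour this report expects from an XML parser.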

I tested in the latest Webkit nightly, version 4.0.4 (5531.21.10, r55610), and also in iCab and in Safari for Mac (Intel and PPC) and for Windows.
Comment 1 Leif Halvard Silli 2010-03-09 20:20:34 PST
I will once again stress that this bug is about application/xhtml+xml parsing.
Comment 2 Alexey Proskuryakov 2010-11-18 09:53:06 PST
Created attachment 74246 [details]
test

Same test as an attachment
Comment 3 Alexey Proskuryakov 2010-11-18 09:57:21 PST
Created attachment 74247 [details]
test

Modified to pass in Firefox.
Comment 4 Alexey Proskuryakov 2010-11-18 11:35:26 PST
This is weird - the only callbacks we get from libxml2 are startDocumentHandler, internalSubsetHandler and then normalErrorHandler, so this looks almost like a libxml2 bug. Note that internalSubsetHandler only carries name, externalID, systemID - we certainly aren't handling DTD itself in WebKit.

But command line xmllint doesn't seem to have a problem with this file.
Comment 5 Leif Halvard Silli 2011-10-05 18:51:48 PDT
Created attachment 109901 [details]
Shows that Webkit *does* follow XML PI-syntax

My diagnosis was wrong: the attached XHTML file includes an HTML PI inside the DTD, and Webkit then correctly reports that the PI never ends (because there is no "?>" to end it).
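
A minimal sketch of the same check with Python's standard expat bindings, assuming a hypothetical reduction (not the attached file itself): an HTML-style PI that ends with ">" only is never closed in XML terms, so the parse fails, while the "?>" variant parses.

```python
import xml.parsers.expat

# Hypothetical reduction: an HTML-style PI (closed with ">" only)
# inside the internal subset.
UNCLOSED = """<!DOCTYPE html [
<?parser-hack >
]>
<html xmlns="http://www.w3.org/1999/xhtml"/>"""

# Same document with the PI closed the XML way.
CLOSED = UNCLOSED.replace(" >", " ?>")


def is_well_formed(doc):
    """Return True if expat parses the document without error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(doc, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


print(is_well_formed(UNCLOSED))
print(is_well_formed(CLOSED))
```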
Comment 6 Leif Halvard Silli 2011-10-05 18:58:16 PDT
Created attachment 109902 [details]
Shows that Webkit accepts a closed comment inside the PI

An XML comment inside an XML processing instruction is not an XML comment. But Webkit apparently sees it as one. And as long as it perceives it as a well-formed comment, it accepts it, as the demo shows.
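
As an illustration of that rule, here is a small sketch using Python's standard expat bindings. The two documents are hypothetical reductions, not the attached files. In both cases the comment-like text should be reported as PI data and the comment handler should never fire, whether or not the "comment" is closed.

```python
import xml.parsers.expat

# Hypothetical reductions: a closed and an unclosed "comment" inside a PI.
CLOSED_COMMENT = """<!DOCTYPE html [
<?pi <!-- closed comment -->?>
]>
<html xmlns="http://www.w3.org/1999/xhtml"/>"""

UNCLOSED_COMMENT = """<!DOCTYPE html [
<?pi <!-- never closed?>
]>
<html xmlns="http://www.w3.org/1999/xhtml"/>"""


def scan(doc):
    """Return the (target, data) PIs and the comments that expat reports."""
    pis, comments = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = lambda target, data: pis.append((target, data))
    parser.CommentHandler = comments.append
    parser.Parse(doc, True)  # raises ExpatError if the document is not well-formed
    return pis, comments


print(scan(CLOSED_COMMENT))
print(scan(UNCLOSED_COMMENT))
```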
Comment 7 Leif Halvard Silli 2011-10-05 19:02:01 PDT
Created attachment 109903 [details]
Reduction of the problem: Webkit doesn't accept an "unclosed" comment inside the PI

Added a minimal demo to show what Webkit doesn't accept.
Comment 8 Leif Halvard Silli 2011-10-05 19:28:03 PDT
Created attachment 109906 [details]
Workaround: Shows how to circumvent the problem - perhaps point at a solution?

This test file shows how to work around the problem. Please read the comments in the test file.
Comment 9 Leif Halvard Silli 2011-10-06 00:14:59 PDT
Created attachment 109928 [details]
Workaround 2: Here the ]> appears right after the processing instruction has started

In this new attachment, the ]> comes right after the processing instruction has begun:

<!DOCTYPE html SYSTEM "about:legacy">
<?pi ]>
   <whatever><!--goes here
?>
]>

So, seemingly, as long as
  a) Webkit is able to find 2 occurrences of the string ']>', and
  b) the string occurs immediately after the PI has begun or
      inside (!) a comment right after the DTD has ended,
then Webkit allows any content inside the processing instruction.

(For more comments and speculation, see the attachment.)
Comment 10 Leif Halvard Silli 2011-10-06 00:59:35 PDT
Created attachment 109929 [details]
Workaround 3: Add a comment inside the DTD, after the PI
Comment 11 Leif Halvard Silli 2011-10-06 01:01:29 PDT
Created attachment 109930 [details]
Workaround: Shows that an "HTML5 comment" - a "short comment" (<!-->) can be used as workaround