Bug 35954 - Processing instructions inside DOCTYPE internal subset are parsed incorrectly (by libxml2?)
Status: NEW
Product: WebKit
Component: XML DOM
Version: 528+ (Nightly build)
Platform: All All
Importance: P2 Major
Reported: 2010-03-09 20:17 PST
Modified: 2011-10-06 01:01 PST


Attachments
test (192 bytes, application/xhtml+xml)
2010-11-18 09:53 PST, Alexey Proskuryakov
test (239 bytes, application/xhtml+xml)
2010-11-18 09:57 PST, Alexey Proskuryakov
Shows that Webkit *does* follow XML PI-syntax (294 bytes, application/xhtml+xml)
2011-10-05 18:51 PST, Leif Halvard Silli
Shows that Webkit accepts a closed comment inside the PI (408 bytes, application/xhtml+xml)
2011-10-05 18:58 PST, Leif Halvard Silli
Reduction of the problem: Webkit doesn't accept an "unclosed" comment inside the PI (366 bytes, application/xhtml+xml)
2011-10-05 19:02 PST, Leif Halvard Silli
Workaround: Shows how to circumvent the problem - perhaps point at a solution? (1.76 KB, application/xhtml+xml)
2011-10-05 19:28 PST, Leif Halvard Silli
Workaround 2: Here the ]> appears right after the processing instruction has started (1.92 KB, application/xhtml+xml)
2011-10-06 00:14 PST, Leif Halvard Silli
Workaround 3: Add a comment inside the DTD, after the PI (1.54 KB, application/xhtml+xml)
2011-10-06 00:59 PST, Leif Halvard Silli
Workaround: Shows that a "HTML5 comment" - a "short comment" (<!-->) can be used as workaround (2.04 KB, application/xhtml+xml)
2011-10-06 01:01 PST, Leif Halvard Silli



Description From 2010-03-09 20:17:42 PST
First of all: this bug concerns the XML parsing of XHTML documents (not text/html parsing!). However, it is also related to text/html issues, which I explain along the way.

How to reproduce the bug:

(1) Add this DOCTYPE to an XHTML document. The internal DTD subset inside the DOCTYPE applies a hack in the form of an XHTML processing instruction, to keep text/html parsers from displaying a "]>" inside the body. The whole hack is explained in an e-mail message to the W3C validator's mailing list: http://lists.w3.org/Archives/Public/www-validator/2010Mar/0026.html
This is the code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" 
[ 
<!ATTLIST html class CDATA #IMPLIED>
<?parser-hack ><!--?>
]> 
<!--><?!-->

(2) If you wish, try to load the page as text/html. However, the point in this bug is XML, so load the page as "application/xhtml+xml". 

(3) Results in Firefox, Konqueror and Opera: works 100%

(4) Result in Webkit: "yellow screen of death" in the form of the following message:
     "This page contains the following errors: error on line 3 at column 1: Extra content at the end of the document"
      In short: Nothing is displayed.

(5) Remove the "<?parser-hack ><!--?>" and reload the page - voila, it works in Webkit as well.

(6) Place the "<?parser-hack ><!--?>" inside the body of the XHTML page. Reload. No problems
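
As a cross-check of step (3), the test document can be fed to another non-validating XML parser. The sketch below uses Python's bundled expat bindings, not libxml2, so it says nothing about where the Webkit failure comes from; it only illustrates how a conforming parser splits the prolog. The root element appended after the DOCTYPE is an assumption, since the report does not show the rest of the page.

```python
import xml.parsers.expat

# DOCTYPE and trailing comment from step (1). The <html> element is a
# hypothetical minimal body added so the document has a root element.
DOC = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
<!ATTLIST html class CDATA #IMPLIED>
<?parser-hack ><!--?>
]>
<!--><?!-->
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>
'''


def collect_events(text):
    """Parse text with expat and return the PIs and comments it reports."""
    pis, comments = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = lambda target, data: pis.append((target, data))
    parser.CommentHandler = comments.append
    parser.Parse(text, True)  # raises ExpatError if the text is not well-formed
    return pis, comments


pis, comments = collect_events(DOC)
print(pis)       # the PI inside the internal subset
print(comments)  # the comment that follows the DOCTYPE
```

Under XML rules the PI should carry "><!--" as its data, and the "<!--><?!-->" after the DOCTYPE should be one ordinary comment.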


        CONCLUSION ABOUT THE PROBLEM: 
       ======================

 Apparently, when a PI is placed inside the internal subset of an XHTML DOCTYPE, Webkit parses the XHTML PI as if it were an HTML4 PI. That is, it thinks the PI ends at the first ">", and thus it also sees the HTML comment opener, the "<!--".

In text/html mode, the point of this hack is exactly that the browser thinks the PI ends with the ">" and that it also sees the "<!--".

However, this is XHTML/XML mode, and thus Webkit should parse the DOCTYPE, including PIs, according to XHTML/XML rules. Hence a ">" is permitted inside the PI, and a "<!--" should not affect the parsing.
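
To make the two termination rules concrete, here is a small sketch that applies each one to the PI from the test case. The helper names pi_end_xml and pi_end_html are made up for this illustration, and the functions are plain string searches, not a model of what Webkit or libxml2 actually do.

```python
PI = '<?parser-hack ><!--?>'


def pi_end_xml(text, start=0):
    """Index just past the PI under the XML rule: it ends at the first '?>'."""
    end = text.find('?>', start + 2)
    return -1 if end < 0 else end + 2


def pi_end_html(text, start=0):
    """Index just past the PI under the HTML4-style rule: it ends at the first '>'."""
    end = text.find('>', start + 2)
    return -1 if end < 0 else end + 1


# What is left over after the PI, depending on the rule applied.
xml_rest = PI[pi_end_xml(PI):]
html_rest = PI[pi_end_html(PI):]
print(repr(xml_rest))   # ''        the whole string is one PI
print(repr(html_rest))  # '<!--?>'  the leftover opens a comment
```

Under the HTML4-style rule the leftover "<!--?>" begins a comment, which is what the hack relies on in text/html mode; under the XML rule nothing is left over.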

I tested in the latest Webkit nightly, version 4.0.4 (5531.21.10, r55610), and also in iCab and in Safari for Mac (Intel and PPC) and for Windows.
------- Comment #1 From 2010-03-09 20:20:34 PST -------
I will once again stress that this bug is about application/xhtml+xml parsing.
------- Comment #2 From 2010-11-18 09:53:06 PST -------
Created an attachment (id=74246) [details]
test

Same test as an attachment
------- Comment #3 From 2010-11-18 09:57:21 PST -------
Created an attachment (id=74247) [details]
test

Modified to pass in Firefox.
------- Comment #4 From 2010-11-18 11:35:26 PST -------
This is weird - the only callbacks we get from libxml2 are startDocumentHandler, internalSubsetHandler and then normalErrorHandler, so this looks almost like a libxml2 bug. Note that internalSubsetHandler only carries name, externalID, systemID - we certainly aren't handling DTD itself in WebKit.

But command line xmllint doesn't seem to have a problem with this file.
------- Comment #5 From 2011-10-05 18:51:48 PST -------
Created an attachment (id=109901) [details]
Shows that Webkit *does* follow XML PI-syntax

My diagnosis was wrong: the attached XHTML file includes an HTML PI inside the DTD, and Webkit then correctly reports that the PI never ends (because there is no "?>" to end it).
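
For comparison, a sketch of the same situation using Python's bundled expat bindings rather than Webkit: an HTML-style PI (closed with ">" only) inside the internal subset is rejected, because the parser never finds a "?>". The document below is a hypothetical reconstruction, not the attachment itself.

```python
import xml.parsers.expat

# Hypothetical stand-in for the attachment: the PI is closed HTML-style,
# with '>' only, so under XML rules it never ends.
UNTERMINATED = '''<!DOCTYPE html [
<?parser-hack >
]>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>
'''


def is_well_formed(text):
    """Return True if expat parses text without a well-formedness error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


print(is_well_formed(UNTERMINATED))
# Closing the PI XML-style makes the same document parse.
print(is_well_formed(UNTERMINATED.replace('<?parser-hack >', '<?parser-hack ?>')))
```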
------- Comment #6 From 2011-10-05 18:58:16 PST -------
Created an attachment (id=109902) [details]
Shows that Webkit accepts a closed comment inside the PI

An XML comment inside an XML processing instruction is not an XML comment. But Webkit apparently sees it as one. And as long as it perceives it as a well-formed comment, it accepts it - as the demo shows.
------- Comment #7 From 2011-10-05 19:02:01 PST -------
Created an attachment (id=109903) [details]
Reduction of the problem: Webkit doesn't accept an "unclosed" comment inside the PI

Added a minimal demo to show what Webkit doesn't accept.
------- Comment #8 From 2011-10-05 19:28:03 PST -------
Created an attachment (id=109906) [details]
Workaround: Shows how to circumvent the problem - perhaps point at a solution?

This test file shows how to work around the problem. Please read the comments in the test file.
------- Comment #9 From 2011-10-06 00:14:59 PST -------
Created an attachment (id=109928) [details]
Workaround 2: Here the ]> appears right after the processing instruction has started

In this new attachment, the ]> comes right after the processing instruction has begun:

<!DOCTYPE html SYSTEM "about:legacy">
<?pi ]>
   <whatever><!--goes here
?>
]>

So, seemingly, as long as Webkit is able to 
  a) find 2 occurrences of the string ']>', and
  b) the string occurs immediately after the PI has begun or 
      inside (!) a comment right after the DTD has ended
then Webkit allows any content inside the processing instruction.

(For more comments and speculation, see the attachment.)
------- Comment #10 From 2011-10-06 00:59:35 PST -------
Created an attachment (id=109929) [details]
Workaround 3: Add a comment inside the DTD, after the PI
------- Comment #11 From 2011-10-06 01:01:29 PST -------
Created an attachment (id=109930) [details]
Workaround: Shows that a "HTML5 comment" - a "short comment" (<!-->) can be used as workaround