Bug 35954

Summary: Processing instructions inside DOCTYPE internal subset are parsed incorrectly (by libxml2?)
Product: WebKit
Reporter: Leif Halvard Silli <xn--mlform-iua>
Component: XML DOM
Assignee: Nobody <webkit-unassigned>
Status: NEW ---    
Severity: Major    
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description | Flags
test | none
test | none
Shows that Webkit *does* follow XML PI-syntax | none
Shows that Webkit accepts a closed comment inside the PI | none
Reduction of the problem: Webkit doesn't accept an "unclosed" comment inside the PI | none
Workaround: Shows how to circumvent the problem - perhaps point at a solution? | none
Workaround 2: Here the ]> appears right after the processing instruction has started | none
Workaround 3: Add a comment inside the DTD, after the PI | none
Workaround: Shows that an "HTML5 comment" - a "short comment" (<!-->) can be used as workaround | none

Description Leif Halvard Silli 2010-03-09 20:17:42 PST
First of all: this bug relates to the XML parsing of XHTML documents (not text/html parsing!). However, it is also related to text/html issues, which I explain along the way.

How to reproduce the bug:

(1) Add this DOCTYPE to an XHTML document. The internal DTD subset inside the DOCTYPE applies a hack in the form of an XHTML processing instruction, to keep text/html parsers from displaying a "]>" inside the body. The whole hack is explained in an e-mail message to the W3C validator's mailing list: http://lists.w3.org/Archives/Public/www-validator/2010Mar/0026.html
This is the code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" 
[ 
<!ATTLIST html class CDATA #IMPLIED>
<?parser-hack ><!--?>
]> 
<!--><?!-->

(2) If you wish, try to load the page as text/html. However, the point of this bug is XML, so load the page as "application/xhtml+xml".

(3) Results in Firefox, Konqueror and Opera: works 100%

(4) Result in Webkit: "yellow screen of death" in the form of the following message:
     "This page contains the following errors: error on line 3 at column 1: Extra content at the end of the document"
      In short: Nothing is displayed.

(5) Remove the "<?parser-hack ><!--?>" and reload the page - voila, it works in Webkit as well.

(6) Place the "<?parser-hack ><!--?>" inside the body of the XHTML page. Reload. No problems


        CONCLUSION ABOUT THE PROBLEM: 
       ======================

 Apparently, when a PI is placed inside the internal subset of an XHTML DOCTYPE, Webkit parses the XHTML PI as if it were an HTML4 PI, meaning that it thinks the PI ends at the first ">". And thus Webkit also sees the HTML comment "start tag", the "<!--".

In text/html mode, the point of this hack is exactly that the browser thinks the PI ends with the ">" and that it also sees the "<!--".

However, this is XHTML/XML mode, and thus Webkit should parse the DOCTYPE, including PIs, according to XHTML/XML rules. Hence a ">" is permitted inside the PI, and a "<!--" should not affect the parsing.
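
To make the two termination rules concrete, here is a minimal Python sketch. It is not how any browser implements this; the function names are made up for illustration. It only contrasts "the PI ends at the first '>'" (the HTML4-style reading) with "the PI ends at the first '?>'" (the XML reading), applied to the hack string from step (1).

```python
def html_legacy_pi_end(text, start):
    """Index just past a PI that ends at the first '>' (HTML4-style reading)."""
    close = text.find(">", start)
    return -1 if close == -1 else close + 1


def xml_pi_end(text, start):
    """Index just past a PI that ends at the first '?>' (XML reading)."""
    close = text.find("?>", start)
    return -1 if close == -1 else close + 2


HACK = "<?parser-hack ><!--?>"

# Start scanning after the opening "<?" (2 characters).
html_end = html_legacy_pi_end(HACK, 2)
xml_end = xml_pi_end(HACK, 2)

print("HTML4-style PI:", HACK[:html_end])   # the PI as a text/html parser sees it
print("left over     :", HACK[html_end:])   # what then looks like a comment start
print("XML PI        :", HACK[:xml_end])    # the whole string is one PI
```

Under the XML reading nothing is left over after the PI, so the "<!--" never starts a comment.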

I tested in the latest Webkit nightly, version 4.0.4 (5531.21.10, r55610), and also in iCab and in Safari for Mac (Intel and PPC) and for Windows.
Comment 1 Leif Halvard Silli 2010-03-09 20:20:34 PST
I will once again stress that this bug is about application/xhtml+xml parsing.
Comment 2 Alexey Proskuryakov 2010-11-18 09:53:06 PST
Created attachment 74246 [details]
test

Same test as an attachment
Comment 3 Alexey Proskuryakov 2010-11-18 09:57:21 PST
Created attachment 74247 [details]
test

Modified to pass in Firefox.
Comment 4 Alexey Proskuryakov 2010-11-18 11:35:26 PST
This is weird - the only callbacks we get from libxml2 are startDocumentHandler, internalSubsetHandler and then normalErrorHandler, so this looks almost like a libxml2 bug. Note that internalSubsetHandler only carries name, externalID, systemID - we certainly aren't handling DTD itself in WebKit.

But command line xmllint doesn't seem to have a problem with this file.
Comment 5 Leif Halvard Silli 2011-10-05 18:51:48 PDT
Created attachment 109901 [details]
Shows that Webkit *does* follow XML PI-syntax

My diagnosis was wrong: the attached XHTML file includes an HTML PI inside the DTD, and Webkit then correctly reports that the PI never ends (because there is no "?>" to end it).
Comment 6 Leif Halvard Silli 2011-10-05 18:58:16 PDT
Created attachment 109902 [details]
Shows that Webkit accepts a closed comment inside the PI

An XML comment inside an XML processing instruction is not an XML comment. But Webkit apparently sees it as one. And as long as it perceives it as a well-formed comment, it accepts it, as the demo shows.
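
For comparison, a small sketch of how another XML parser treats this. It uses Python's xml.parsers.expat (so expat, not libxml2), and the document below is a hypothetical stand-in, not the attached test file. The assumption being checked: comment-like text inside a PI in the internal subset is reported as PI data, and the comment handler is never called.

```python
import xml.parsers.expat

# Hypothetical stand-in document: a closed, comment-like string inside a PI
# that sits in the DOCTYPE's internal subset.
DOC = """<!DOCTYPE html [
<?parser-hack <!-- not a comment --> ?>
]>
<html/>"""


def parse_events(doc):
    """Return the PI and comment events expat reports for doc."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = (
        lambda target, data: events.append(("pi", target, data)))
    parser.CommentHandler = lambda data: events.append(("comment", data))
    parser.Parse(doc, True)
    return events


events = parse_events(DOC)
for event in events:
    print(event)
```

If Webkit's behaviour really comes from treating the string as a comment, that differs from what expat does here.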
Comment 7 Leif Halvard Silli 2011-10-05 19:02:01 PDT
Created attachment 109903 [details]
Reduction of the problem: Webkit doesn't accept an "unclosed" comment inside the PI

Adds a minimal demo to show what Webkit doesn't accept.
Comment 8 Leif Halvard Silli 2011-10-05 19:28:03 PDT
Created attachment 109906 [details]
Workaround: Shows how to circumvent the problem - perhaps point at a solution?

This test file shows how to work around the problem. Please read the comments in the test file.
Comment 9 Leif Halvard Silli 2011-10-06 00:14:59 PDT
Created attachment 109928 [details]
Workaround 2: Here the ]> appears right after the processing instruction has started

In this new attachment, the ]> comes right after the processing instruction has begun:

<!DOCTYPE html SYSTEM "about:legacy">
<?pi ]>
   <whatever><!--goes here
?>
]>

So, seemingly, as long as
  a) Webkit is able to find 2 occurrences of the string ']>', and
  b) the string occurs immediately after the PI has begun, or
      inside (!) a comment right after the DTD has ended,
then Webkit allows any content inside the processing instruction.

(For more comments and speculation, see the attachment.)
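
As a cross-check with a different parser, here is a sketch using Python's xml.parsers.expat. Two assumptions: the internal subset is opened with "[" after the system identifier (the snippet above does not show it), and a bare <html/> root element follows the DOCTYPE. With those, expat reports the first ']>' as plain PI data and still reaches the root element, so it sees no "extra content".

```python
import xml.parsers.expat

# Assumption: "[" opens the internal subset; <html/> is a placeholder root.
DOC = """<!DOCTYPE html SYSTEM "about:legacy" [
<?pi ]>
   <whatever><!--goes here
?>
]>
<html/>"""


def parse(doc):
    """Return (processing instructions, element names) reported by expat."""
    pis, elements = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = (
        lambda target, data: pis.append((target, data)))
    parser.StartElementHandler = lambda name, attrs: elements.append(name)
    parser.Parse(doc, True)
    return pis, elements


pis, elements = parse(DOC)
print(pis)
print(elements)
```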
Comment 10 Leif Halvard Silli 2011-10-06 00:59:35 PDT
Created attachment 109929 [details]
Workaround 3: Add a comment inside the DTD, after the PI
Comment 11 Leif Halvard Silli 2011-10-06 01:01:29 PDT
Created attachment 109930 [details]
Workaround: Shows that an "HTML5 comment" - a "short comment" (<!-->) can be used as workaround