Bug 30508

Summary: [libxml2] parsing broken with libxml2 >= 2.7.5
Product: WebKit Reporter: Philippe Normand <pnormand>
Component: XMLAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, a.renevier, commit-queue, ddkilzer, eric, ggray, jmalonzo, krit, leo.yang, mrobinson, mrowe, phajdan.jr, une.belette, vincent-webkit, webkit.review.bot, xan.lopez, yong.li.webkit
Priority: P2 Keywords: NeedsReduction
Version: 528+ (Nightly build)   
Hardware: PC   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 33027    
Attachments:
Description Flags
fix xml parsing with recent versions of libxml2
ap: review-
test case: switching encoding to utf-16 breaks parsing
none
updated patch
none
patch
none
patch none

Description Philippe Normand 2009-10-19 03:53:05 PDT
Many layout tests are failing with recent versions of libxml2. Example: fast/xsl/xslt-doc-enc.xml. After some investigation for that specific test it turns out that some hacks are present in our libxml2 wrapper, like resetting the encoding to utf-16 before parsing each chunk of data. Why is that? If i remove that hack (copypasted in 2 places of the code), that test passes fine but breaks other tests like fast/xpath/document-order.html. That test now fails because we try to parse an xml document that has no preamble (<? xml version="1.0" ?>). 

Note: the tests pass fine with libxml2 == 2.7.3
Comment 1 Philippe Normand 2009-10-19 07:32:13 PDT
Created attachment 41416 [details]
fix xml parsing with recent versions of libxml2

This patch fixes most of the fast/xsl (excepted one related to locale sorting), fast/js and fast/xsl tests. Also tested with libxml2 2.7.3 with success. Please try it :)
Comment 2 Philippe Normand 2009-10-19 07:33:14 PDT
(In reply to comment #1)
> Created an attachment (id=41416) [details]
> fix xml parsing with recent versions of libxml2
> 
> This patch fixes most of the fast/xsl (excepted one related to locale sorting),
> fast/js and fast/xsl tests. Also tested with libxml2 2.7.3 with success. Please
> try it :)

Forgot to mention this is with the Gtk+ port
Comment 3 Alexey Proskuryakov 2009-10-19 10:20:50 PDT
Comment on attachment 41416 [details]
fix xml parsing with recent versions of libxml2

> +	    String toParse = "<?xml version=\"1.0\" ?>" + parseString;

doWrite is called several times, once per chunk of data arriving from network. We certainly can't insert the XML declaration into the middle of XML source.

If no test broke because of this change, we should add one that will (possibly an http one, to make sending data in chunks easier).

Also, there are tabs in the patch, which should be spaces.
Comment 4 Mark Rowe (bdash) 2009-10-19 11:35:15 PDT
We also can’t make changes that would break parsing with older versions of libxml2.  The oldest version that I’m aware of a need to support is 2.6.16, which ships on Tiger.

Given that our code works correctly with recent versions of libxml2 (2.7.3)… are we sure this isn’t a regression in libxml2 that should be fixed on their end?
Comment 5 Xan Lopez 2009-10-20 01:50:43 PDT
(In reply to comment #4)
> Given that our code works correctly with recent versions of libxml2 (2.7.3)…
> are we sure this isn’t a regression in libxml2 that should be fixed on their
> end?

Were this to be a regression in libxml2 we'd still need to apply some kind of workaround in WebKit, since we can't possibly expect everyone using versions >= than 2.7.3 to instantaneously upgrade when a fixed version is out. I agree it somehow has to keep the code working with older releases, of course.
Comment 6 Mark Rowe (bdash) 2009-10-20 01:53:21 PDT
(In reply to comment #5)
> (In reply to comment #4)
> > Given that our code works correctly with recent versions of libxml2 (2.7.3)…
> > are we sure this isn’t a regression in libxml2 that should be fixed on their
> > end?
> 
> Were this to be a regression in libxml2 we'd still need to apply some kind of
> workaround in WebKit, since we can't possibly expect everyone using versions >=
> than 2.7.3 to instantaneously upgrade when a fixed version is out.

Of course, but we also want any upstream bugs to be fixed ASAP to prevent us from having to keep such a workaround forever.
Comment 7 Mark Rowe (bdash) 2009-10-20 01:59:33 PDT
It’d help to have a description of what the problem is and how the attached patch fixes it.
Comment 8 Philippe Normand 2009-10-20 02:21:37 PDT
(In reply to comment #7)
> It’d help to have a description of what the problem is and how the attached
> patch fixes it.

As I mentionned in the bug description there are 2 issues:

- the hacks we have in place to switch encoding to utf-16 before parsing each chunk are breaking the parsing with recent libxml2 (>= 2.7.3, afaics). There is no link to libxml2 bugtracker along with those hacks.
- parsing a chunk will fail if no xml preamble is present. In doWrite I can check the value of  m_sawFirstElement and insert the preamble if needed, would that be a better solution?
Comment 9 Mark Rowe (bdash) 2009-10-20 11:15:57 PDT
(In reply to comment #8)
> (In reply to comment #7)
> > It’d help to have a description of what the problem is and how the attached
> > patch fixes it.
> 
> As I mentionned in the bug description there are 2 issues:
> 
> - the hacks we have in place to switch encoding to utf-16 before parsing each
> chunk are breaking the parsing with recent libxml2 (>= 2.7.3, afaics). There is
> no link to libxml2 bugtracker along with those hacks.

What do you mean by “breaking the parsing"?   I’m not sure why you would expect there to be a link to the libxml2 bugtracker here.

> - parsing a chunk will fail if no xml preamble is present. In doWrite I can
> check the value of  m_sawFirstElement and insert the preamble if needed, would
> that be a better solution?

A document is not required to start with an XML declaration so this doesn’t make much sense to me.
Comment 10 Philippe Normand 2009-10-21 01:44:03 PDT
Created attachment 41555 [details]
test case: switching encoding to utf-16 breaks parsing

Sample output:

encoding switched to UTF-16
SAX.startDocument()
SAX.endDocument()
error 4: Document is empty

----------
using default encoding
SAX.startDocument()
successful parse
----------
Comment 11 Mark Rowe (bdash) 2009-10-21 01:47:27 PDT
(In reply to comment #10)
> Created an attachment (id=41555) [details]
> test case: switching encoding to utf-16 breaks parsing
> 
> Sample output:
> 
> encoding switched to UTF-16
> SAX.startDocument()
> SAX.endDocument()
> error 4: Document is empty
> 
> ----------
> using default encoding
> SAX.startDocument()
> successful parse
> ----------

This example has little relation to what WebKit does.  In your example you’re passing ASCII data to libxml2 to parse.  It’s not surprising that it would be unhappy to attempt to parse that as UTF-16.  WebKit passes in UTF-16 data from a WebCore::String.
Comment 12 Mark Rowe (bdash) 2009-10-21 02:37:47 PDT
        // Hack around libxml2's lack of encoding overide support by manually
        // resetting the encoding to UTF-16 before every chunk.  Otherwise libxml
        // will detect <?xml version="1.0" encoding="<encoding name>"?> blocks 
        // and switch encodings, causing the parse to fail.
        const UChar BOM = 0xFEFF;
        const unsigned char BOMHighByte = *reinterpret_cast<const unsigned char*>(&BOM);
        xmlSwitchEncoding(m_context, BOMHighByte == 0xFF ? XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE);

It seems like what is happening is that our hack to work around libxml2's lack of encoding override support is just not working with the newer version of libxml2.
Comment 13 Mark Rowe (bdash) 2009-10-21 02:53:28 PDT
The libxml2 change that broke WebCore parsing was <http://git.gnome.org/cgit/libxml2/commit/?id=a6c76a>.  This occurred between libxml2 2.7.3 and 2.7.4.
Comment 14 Mark Rowe (bdash) 2009-10-21 03:31:01 PDT
After staring at the libxml2 for a while I’ve worked out what is going on.  The change in question causes libxml2 to decode the first chunk of data we pass it in two goes.  It first decodes enough data to try and parse the XML declaration.  Once it has parsed this it then decodes the rest of the chunk and continues parsing.  Since it is decoded in two parts the encoding switch that happens as a result of parsing the XML declaration’s encoding name leads libxml2 to decode the remaining data in the chunk with the wrong encoding.
Comment 15 Philippe Normand 2009-10-21 09:31:25 PDT
Created attachment 41570 [details]
updated patch

Not setting review flag, this patch is still work-in-progress, 1 xslt layout test remains to be fixed.
Comment 16 Dirk Schulze 2009-10-27 01:09:59 PDT
Is the new libxml2 version resposible for the braekage of some SVG images like http://apike.ca/media/svg/exampleMask.svg ? Or is it because of the latest changes to the parsing-code?
Note that this SVG example has a Doctype that shouldn't be used on SVG but is still allowed in the SVG 1.0/1.1 specification.
Comment 17 Dirk Schulze 2009-10-27 01:10:53 PDT
(In reply to comment #16)
> Is the new libxml2 version resposible for the braekage of some SVG images like
> http://apike.ca/media/svg/exampleMask.svg ?
Got parsing error:

This page contains the following errors:

error on line 2 at column 29: Unfinished System or Public ID " or ' expected
Below is a rendering of the page up to the first error.
Comment 18 Philippe Normand 2009-10-30 03:02:29 PDT
(In reply to comment #17)
> (In reply to comment #16)
> > Is the new libxml2 version resposible for the braekage of some SVG images like
> > http://apike.ca/media/svg/exampleMask.svg ?
> Got parsing error:
> 
> This page contains the following errors:
> 
> error on line 2 at column 29: Unfinished System or Public ID " or ' expected
> Below is a rendering of the page up to the first error.

That SVG loads fine here with HEAD of WebKitGTK and latest libxml2. What version of libxml2 do you use exactly?
Comment 19 Philippe Normand 2009-10-30 03:16:57 PDT
(In reply to comment #18)
> (In reply to comment #17)
> > (In reply to comment #16)
> > > Is the new libxml2 version resposible for the braekage of some SVG images like
> > > http://apike.ca/media/svg/exampleMask.svg ?
> > Got parsing error:
> > 
> > This page contains the following errors:
> > 
> > error on line 2 at column 29: Unfinished System or Public ID " or ' expected
> > Below is a rendering of the page up to the first error.
> 
> That SVG loads fine here with HEAD of WebKitGTK and latest libxml2. What
> version of libxml2 do you use exactly?

Sorry ignore that previous comment of mine, I wasn't using the latest libxml2, now I have the same error as you.
Comment 20 Dirk Schulze 2009-11-18 01:27:29 PST
This problem also appears on lixml2 2.7.5.
Comment 21 Dirk Schulze 2009-11-20 23:04:02 PST
The last patch solves the problem with SVG's. I hope we get a solution for this problem. 1/3 of the SVG's in the wild are broken :-(
Comment 22 Eric Seidel (no email) 2010-01-28 13:14:40 PST
I had an email thread about this with DV back when I added that hack.  He did not seem interested at the time in supporting WebCore's usecase of determining the encoding of the string ourselves.  He suggested that libxml's supported usecase was as a stand-alone library, not as a support library for parsing as we're using it.  I can dig up the emails if necessary (assuming they weren't lost in the depths of my @apple account).  So in the end, we came up with this hack to work around the issue.  I'm not surprised it broke, but I'm sad that it did. :(
Comment 23 Jean Louis 2010-03-31 09:12:29 PDT
Hi WebKit Team,

I have also this bug, see my history with midori on 

http://www.twotoasts.de/bugs/index.php?do=details&task_id=793

and my (closed) bug report on libxml2 on

https://bugzilla.gnome.org/show_bug.cgi?id=614333

The bug occurs is between these two versions : 2.7.3-r1.5 and 2.7.7-r6.0.5.

Waiting for news, bye, Jean Louis.
Comment 24 Jean Louis 2010-03-31 10:47:25 PDT
Hi,

Can you read the new interesting comment of Daniel for your bug fix ?

https://bugzilla.gnome.org/show_bug.cgi?id=614333#c6

Thanks.
Comment 25 Jean Louis 2010-04-04 03:33:07 PDT
Hi,

Thanks Alexey for your post on libxml2 but do you think you can implement recommendation of the XML spec appendix F like Daniel say on https://bugzilla.gnome.org/show_bug.cgi?id=614333#c4 
without break the xml parsing of webkit ?

Thanks.
Comment 26 Jean Louis 2010-04-29 03:07:26 PDT
Hi,

Can I have hope that this bug is fixed one day ?

Thanks for any reply.
Comment 27 Gabriel Gray 2010-07-16 13:13:50 PDT
Hi,once more i'm confirming this issue with libxml2-python-2.7.4.win32-py2.6.exe and pywebkitgtk.

anyone knows about any patch or workaround for this issue ??????
Comment 28 Jean Louis 2010-07-17 00:16:47 PDT
(In reply to comment #27)
> Hi,once more i'm confirming this issue with libxml2-python-2.7.4.win32-py2.6.exe and pywebkitgtk.
> 
> anyone knows about any patch or workaround for this issue ??????

Hi,

After check comments, it seems there is a non compliance with libxml and the team prefer keep an old (bad) workaround instead of correct with new version of libxml2.

Personally, I no longer believe.
Comment 29 Paweł Hajdan, Jr. 2010-08-26 15:48:37 PDT
Created attachment 65635 [details]
patch
Comment 30 WebKit Review Bot 2010-08-26 15:50:35 PDT
Attachment 65635 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style']" exit_code: 1
WebCore/dom/XMLDocumentParserLibxml2.cpp:366:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:367:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:368:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:369:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:370:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:371:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:372:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
Total errors found: 7 in 2 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 31 Paweł Hajdan, Jr. 2010-08-26 15:53:29 PDT
Created attachment 65637 [details]
patch
Comment 32 Eric Seidel (no email) 2010-08-26 15:54:31 PDT
Comment on attachment 65635 [details]
patch

Why is the 3rd switch needed?  CAn we write a test case for this?
Comment 33 Mark Rowe (bdash) 2010-08-26 15:58:30 PDT
(In reply to comment #32)
> (From update of attachment 65635 [details])
> Why is the 3rd switch needed?  CAn we write a test case for this?

I would assume it’s needed to fix the bug.  Per my analysis in comment 14, libxml2 is now switching the encoding in the middle of a single chunk after it parses the XML declaration.  That point corresponds to the startDocumentHandler callback.  This patch switches the encoding back to UTF-16 at that point.  The ChangeLog should really explain all of this though.

Has this patch been tested with older versions of libxml2?
Comment 34 Paweł Hajdan, Jr. 2010-08-26 16:19:14 PDT
(In reply to comment #32)
> (From update of attachment 65635 [details])
> Why is the 3rd switch needed?

Without it http://code.google.com/p/chromium/issues/detail?id=29333 is not fixed.

> CAn we write a test case for this?

Please take a look at https://bugs.webkit.org/show_bug.cgi?id=33027

(In reply to comment #33)
> Has this patch been tested with older versions of libxml2?

I will test it with libxml2-2.7.3 and report here.
Comment 35 Mark Rowe (bdash) 2010-08-26 16:23:56 PDT
(In reply to comment #34)
> (In reply to comment #33)
> > Has this patch been tested with older versions of libxml2?
> 
> I will test it with libxml2-2.7.3 and report here.

We need to make sure that this works as far back as libxml2-2.6.16, the version that Tiger and Leopard use.
Comment 36 Paweł Hajdan, Jr. 2010-08-27 07:40:10 PDT
(In reply to comment #35)
> We need to make sure that this works as far back as libxml2-2.6.16,
> the version that Tiger and Leopard use.

I have tested with the following versions of libxml2: 2.7.7, 2.7.3, and 2.6.16. The patch seems to work fine for all of those versions.
Comment 37 WebKit Commit Bot 2010-08-29 13:46:33 PDT
Comment on attachment 65637 [details]
patch

Clearing flags on attachment: 65637

Committed r66336: <http://trac.webkit.org/changeset/66336>
Comment 38 WebKit Commit Bot 2010-08-29 13:46:40 PDT
All reviewed patches have been landed.  Closing bug.
Comment 39 Vincent Lefevre 2010-08-30 08:09:19 PDT
*** Bug 34063 has been marked as a duplicate of this bug. ***
Comment 40 Martin Robinson 2010-09-16 10:34:23 PDT
This seems to have fixed a large number of the tests that were failing because of libxml issues, but not all them. For instance, see: http/tests/xmlhttprequest/broken-xml-encoding.html or just search for "libxml" in the GTK+ skipped list.