RESOLVED FIXED 30508
[libxml2] parsing broken with libxml2 >= 2.7.5
https://bugs.webkit.org/show_bug.cgi?id=30508
Summary [libxml2] parsing broken with libxml2 >= 2.7.5
Philippe Normand
Reported 2009-10-19 03:53:05 PDT
Many layout tests are failing with recent versions of libxml2. Example: fast/xsl/xslt-doc-enc.xml. After some investigation for that specific test it turns out that some hacks are present in our libxml2 wrapper, like resetting the encoding to utf-16 before parsing each chunk of data. Why is that? If i remove that hack (copypasted in 2 places of the code), that test passes fine but breaks other tests like fast/xpath/document-order.html. That test now fails because we try to parse an xml document that has no preamble (<? xml version="1.0" ?>). Note: the tests pass fine with libxml2 == 2.7.3
Attachments
fix xml parsing with recent versions of libxml2 (3.17 KB, patch)
2009-10-19 07:32 PDT, Philippe Normand
ap: review-
test case: switching encoding to utf-16 breaks parsing (2.46 KB, text/plain)
2009-10-21 01:44 PDT, Philippe Normand
no flags
updated patch (3.77 KB, patch)
2009-10-21 09:31 PDT, Philippe Normand
no flags
patch (3.65 KB, patch)
2010-08-26 15:48 PDT, Paweł Hajdan, Jr.
no flags
patch (3.66 KB, patch)
2010-08-26 15:53 PDT, Paweł Hajdan, Jr.
no flags
Philippe Normand
Comment 1 2009-10-19 07:32:13 PDT
Created attachment 41416 [details] fix xml parsing with recent versions of libxml2 This patch fixes most of the fast/xsl (excepted one related to locale sorting), fast/js and fast/xsl tests. Also tested with libxml2 2.7.3 with success. Please try it :)
Philippe Normand
Comment 2 2009-10-19 07:33:14 PDT
(In reply to comment #1) > Created an attachment (id=41416) [details] > fix xml parsing with recent versions of libxml2 > > This patch fixes most of the fast/xsl (excepted one related to locale sorting), > fast/js and fast/xsl tests. Also tested with libxml2 2.7.3 with success. Please > try it :) Forgot to mention this is with the Gtk+ port
Alexey Proskuryakov
Comment 3 2009-10-19 10:20:50 PDT
Comment on attachment 41416 [details] fix xml parsing with recent versions of libxml2 > + String toParse = "<?xml version=\"1.0\" ?>" + parseString; doWrite is called several times, once per chunk of data arriving from network. We certainly can't insert the XML declaration into the middle of XML source. If no test broke because of this change, we should add one that will (possibly an http one, to make sending data in chunks easier). Also, there are tabs in the patch, which should be spaces.
Mark Rowe (bdash)
Comment 4 2009-10-19 11:35:15 PDT
We also can’t make changes that would break parsing with older versions of libxml2. The oldest version that I’m aware of a need to support is 2.6.16, which ships on Tiger. Given that our code works correctly with recent versions of libxml2 (2.7.3)… are we sure this isn’t a regression in libxml2 that should be fixed on their end?
Xan Lopez
Comment 5 2009-10-20 01:50:43 PDT
(In reply to comment #4) > Given that our code works correctly with recent versions of libxml2 (2.7.3)… > are we sure this isn’t a regression in libxml2 that should be fixed on their > end? Were this to be a regression in libxml2 we'd still need to apply some kind of workaround in WebKit, since we can't possibly expect everyone using versions >= than 2.7.3 to instantaneously upgrade when a fixed version is out. I agree it somehow has to keep the code working with older releases, of course.
Mark Rowe (bdash)
Comment 6 2009-10-20 01:53:21 PDT
(In reply to comment #5) > (In reply to comment #4) > > Given that our code works correctly with recent versions of libxml2 (2.7.3)… > > are we sure this isn’t a regression in libxml2 that should be fixed on their > > end? > > Were this to be a regression in libxml2 we'd still need to apply some kind of > workaround in WebKit, since we can't possibly expect everyone using versions >= > than 2.7.3 to instantaneously upgrade when a fixed version is out. Of course, but we also want any upstream bugs to be fixed ASAP to prevent us from having to keep such a workaround forever.
Mark Rowe (bdash)
Comment 7 2009-10-20 01:59:33 PDT
It’d help to have a description of what the problem is and how the attached patch fixes it.
Philippe Normand
Comment 8 2009-10-20 02:21:37 PDT
(In reply to comment #7) > It’d help to have a description of what the problem is and how the attached > patch fixes it. As I mentionned in the bug description there are 2 issues: - the hacks we have in place to switch encoding to utf-16 before parsing each chunk are breaking the parsing with recent libxml2 (>= 2.7.3, afaics). There is no link to libxml2 bugtracker along with those hacks. - parsing a chunk will fail if no xml preamble is present. In doWrite I can check the value of m_sawFirstElement and insert the preamble if needed, would that be a better solution?
Mark Rowe (bdash)
Comment 9 2009-10-20 11:15:57 PDT
(In reply to comment #8) > (In reply to comment #7) > > It’d help to have a description of what the problem is and how the attached > > patch fixes it. > > As I mentionned in the bug description there are 2 issues: > > - the hacks we have in place to switch encoding to utf-16 before parsing each > chunk are breaking the parsing with recent libxml2 (>= 2.7.3, afaics). There is > no link to libxml2 bugtracker along with those hacks. What do you mean by “breaking the parsing"? I’m not sure why you would expect there to be a link to the libxml2 bugtracker here. > - parsing a chunk will fail if no xml preamble is present. In doWrite I can > check the value of m_sawFirstElement and insert the preamble if needed, would > that be a better solution? A document is not required to start with an XML declaration so this doesn’t make much sense to me.
Philippe Normand
Comment 10 2009-10-21 01:44:03 PDT
Created attachment 41555 [details] test case: switching encoding to utf-16 breaks parsing Sample output: encoding switched to UTF-16 SAX.startDocument() SAX.endDocument() error 4: Document is empty ---------- using default encoding SAX.startDocument() successful parse ----------
Mark Rowe (bdash)
Comment 11 2009-10-21 01:47:27 PDT
(In reply to comment #10) > Created an attachment (id=41555) [details] > test case: switching encoding to utf-16 breaks parsing > > Sample output: > > encoding switched to UTF-16 > SAX.startDocument() > SAX.endDocument() > error 4: Document is empty > > ---------- > using default encoding > SAX.startDocument() > successful parse > ---------- This example has little relation to what WebKit does. In your example you’re passing ASCII data to libxml2 to parse. It’s not surprising that it would be unhappy to attempt to parse that as UTF-16. WebKit passes in UTF-16 data from a WebCore::String.
Mark Rowe (bdash)
Comment 12 2009-10-21 02:37:47 PDT
// Hack around libxml2's lack of encoding overide support by manually // resetting the encoding to UTF-16 before every chunk. Otherwise libxml // will detect <?xml version="1.0" encoding="<encoding name>"?> blocks // and switch encodings, causing the parse to fail. const UChar BOM = 0xFEFF; const unsigned char BOMHighByte = *reinterpret_cast<const unsigned char*>(&BOM); xmlSwitchEncoding(m_context, BOMHighByte == 0xFF ? XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE); It seems like what is happening is that our hack to work around libxml2's lack of encoding override support is just not working with the newer version of libxml2.
Mark Rowe (bdash)
Comment 13 2009-10-21 02:53:28 PDT
The libxml2 change that broke WebCore parsing was <http://git.gnome.org/cgit/libxml2/commit/?id=a6c76a>. This occurred between libxml2 2.7.3 and 2.7.4.
Mark Rowe (bdash)
Comment 14 2009-10-21 03:31:01 PDT
After staring at the libxml2 for a while I’ve worked out what is going on. The change in question causes libxml2 to decode the first chunk of data we pass it in two goes. It first decodes enough data to try and parse the XML declaration. Once it has parsed this it then decodes the rest of the chunk and continues parsing. Since it is decoded in two parts the encoding switch that happens as a result of parsing the XML declaration’s encoding name leads libxml2 to decode the remaining data in the chunk with the wrong encoding.
Philippe Normand
Comment 15 2009-10-21 09:31:25 PDT
Created attachment 41570 [details] updated patch Not setting review flag, this patch is still work-in-progress, 1 xslt layout test remains to be fixed.
Dirk Schulze
Comment 16 2009-10-27 01:09:59 PDT
Is the new libxml2 version resposible for the braekage of some SVG images like http://apike.ca/media/svg/exampleMask.svg ? Or is it because of the latest changes to the parsing-code? Note that this SVG example has a Doctype that shouldn't be used on SVG but is still allowed in the SVG 1.0/1.1 specification.
Dirk Schulze
Comment 17 2009-10-27 01:10:53 PDT
(In reply to comment #16) > Is the new libxml2 version resposible for the braekage of some SVG images like > http://apike.ca/media/svg/exampleMask.svg ? Got parsing error: This page contains the following errors: error on line 2 at column 29: Unfinished System or Public ID " or ' expected Below is a rendering of the page up to the first error.
Philippe Normand
Comment 18 2009-10-30 03:02:29 PDT
(In reply to comment #17) > (In reply to comment #16) > > Is the new libxml2 version resposible for the braekage of some SVG images like > > http://apike.ca/media/svg/exampleMask.svg ? > Got parsing error: > > This page contains the following errors: > > error on line 2 at column 29: Unfinished System or Public ID " or ' expected > Below is a rendering of the page up to the first error. That SVG loads fine here with HEAD of WebKitGTK and latest libxml2. What version of libxml2 do you use exactly?
Philippe Normand
Comment 19 2009-10-30 03:16:57 PDT
(In reply to comment #18) > (In reply to comment #17) > > (In reply to comment #16) > > > Is the new libxml2 version resposible for the braekage of some SVG images like > > > http://apike.ca/media/svg/exampleMask.svg ? > > Got parsing error: > > > > This page contains the following errors: > > > > error on line 2 at column 29: Unfinished System or Public ID " or ' expected > > Below is a rendering of the page up to the first error. > > That SVG loads fine here with HEAD of WebKitGTK and latest libxml2. What > version of libxml2 do you use exactly? Sorry ignore that previous comment of mine, I wasn't using the latest libxml2, now I have the same error as you.
Dirk Schulze
Comment 20 2009-11-18 01:27:29 PST
This problem also appears on lixml2 2.7.5.
Dirk Schulze
Comment 21 2009-11-20 23:04:02 PST
The last patch solves the problem with SVG's. I hope we get a solution for this problem. 1/3 of the SVG's in the wild are broken :-(
Eric Seidel (no email)
Comment 22 2010-01-28 13:14:40 PST
I had an email thread about this with DV back when I added that hack. He did not seem interested at the time in supporting WebCore's usecase of determining the encoding of the string ourselves. He suggested that libxml's supported usecase was as a stand-alone library, not as a support library for parsing as we're using it. I can dig up the emails if necessary (assuming they weren't lost in the depths of my @apple account). So in the end, we came up with this hack to work around the issue. I'm not surprised it broke, but I'm sad that it did. :(
Jean Louis
Comment 23 2010-03-31 09:12:29 PDT
Hi WebKit Team, I have also this bug, see my history with midori on http://www.twotoasts.de/bugs/index.php?do=details&task_id=793 and my (closed) bug report on libxml2 on https://bugzilla.gnome.org/show_bug.cgi?id=614333 The bug occurs is between these two versions : 2.7.3-r1.5 and 2.7.7-r6.0.5. Waiting for news, bye, Jean Louis.
Jean Louis
Comment 24 2010-03-31 10:47:25 PDT
Hi, Can you read the new interesting comment of Daniel for your bug fix ? https://bugzilla.gnome.org/show_bug.cgi?id=614333#c6 Thanks.
Jean Louis
Comment 25 2010-04-04 03:33:07 PDT
Hi, Thanks Alexey for your post on libxml2 but do you think you can implement recommendation of the XML spec appendix F like Daniel say on https://bugzilla.gnome.org/show_bug.cgi?id=614333#c4 without break the xml parsing of webkit ? Thanks.
Jean Louis
Comment 26 2010-04-29 03:07:26 PDT
Hi, Can I have hope that this bug is fixed one day ? Thanks for any reply.
Gabriel Gray
Comment 27 2010-07-16 13:13:50 PDT
Hi,once more i'm confirming this issue with libxml2-python-2.7.4.win32-py2.6.exe and pywebkitgtk. anyone knows about any patch or workaround for this issue ??????
Jean Louis
Comment 28 2010-07-17 00:16:47 PDT
(In reply to comment #27) > Hi,once more i'm confirming this issue with libxml2-python-2.7.4.win32-py2.6.exe and pywebkitgtk. > > anyone knows about any patch or workaround for this issue ?????? Hi, After check comments, it seems there is a non compliance with libxml and the team prefer keep an old (bad) workaround instead of correct with new version of libxml2. Personally, I no longer believe.
Paweł Hajdan, Jr.
Comment 29 2010-08-26 15:48:37 PDT
WebKit Review Bot
Comment 30 2010-08-26 15:50:35 PDT
Attachment 65635 [details] did not pass style-queue: Failed to run "['WebKitTools/Scripts/check-webkit-style']" exit_code: 1 WebCore/dom/XMLDocumentParserLibxml2.cpp:366: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3] WebCore/dom/XMLDocumentParserLibxml2.cpp:367: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3] WebCore/dom/XMLDocumentParserLibxml2.cpp:368: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3] WebCore/dom/XMLDocumentParserLibxml2.cpp:369: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3] WebCore/dom/XMLDocumentParserLibxml2.cpp:370: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3] WebCore/dom/XMLDocumentParserLibxml2.cpp:371: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3] WebCore/dom/XMLDocumentParserLibxml2.cpp:372: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3] Total errors found: 7 in 2 files If any of these errors are false positives, please file a bug against check-webkit-style.
Paweł Hajdan, Jr.
Comment 31 2010-08-26 15:53:29 PDT
Eric Seidel (no email)
Comment 32 2010-08-26 15:54:31 PDT
Comment on attachment 65635 [details] patch Why is the 3rd switch needed? CAn we write a test case for this?
Mark Rowe (bdash)
Comment 33 2010-08-26 15:58:30 PDT
(In reply to comment #32) > (From update of attachment 65635 [details]) > Why is the 3rd switch needed? CAn we write a test case for this? I would assume it’s needed to fix the bug. Per my analysis in comment 14, libxml2 is now switching the encoding in the middle of a single chunk after it parses the XML declaration. That point corresponds to the startDocumentHandler callback. This patch switches the encoding back to UTF-16 at that point. The ChangeLog should really explain all of this though. Has this patch been tested with older versions of libxml2?
Paweł Hajdan, Jr.
Comment 34 2010-08-26 16:19:14 PDT
(In reply to comment #32) > (From update of attachment 65635 [details]) > Why is the 3rd switch needed? Without it http://code.google.com/p/chromium/issues/detail?id=29333 is not fixed. > CAn we write a test case for this? Please take a look at https://bugs.webkit.org/show_bug.cgi?id=33027 (In reply to comment #33) > Has this patch been tested with older versions of libxml2? I will test it with libxml2-2.7.3 and report here.
Mark Rowe (bdash)
Comment 35 2010-08-26 16:23:56 PDT
(In reply to comment #34) > (In reply to comment #33) > > Has this patch been tested with older versions of libxml2? > > I will test it with libxml2-2.7.3 and report here. We need to make sure that this works as far back as libxml2-2.6.16, the version that Tiger and Leopard use.
Paweł Hajdan, Jr.
Comment 36 2010-08-27 07:40:10 PDT
(In reply to comment #35) > We need to make sure that this works as far back as libxml2-2.6.16, > the version that Tiger and Leopard use. I have tested with the following versions of libxml2: 2.7.7, 2.7.3, and 2.6.16. The patch seems to work fine for all of those versions.
WebKit Commit Bot
Comment 37 2010-08-29 13:46:33 PDT
Comment on attachment 65637 [details] patch Clearing flags on attachment: 65637 Committed r66336: <http://trac.webkit.org/changeset/66336>
WebKit Commit Bot
Comment 38 2010-08-29 13:46:40 PDT
All reviewed patches have been landed. Closing bug.
Vincent Lefevre
Comment 39 2010-08-30 08:09:19 PDT
*** Bug 34063 has been marked as a duplicate of this bug. ***
Martin Robinson
Comment 40 2010-09-16 10:34:23 PDT
This seems to have fixed a large number of the tests that were failing because of libxml issues, but not all them. For instance, see: http/tests/xmlhttprequest/broken-xml-encoding.html or just search for "libxml" in the GTK+ skipped list.
Note You need to log in before you can comment on or make changes to this bug.