Summary: | [libxml2] parsing broken with libxml2 >= 2.7.5 | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Philippe Normand <pnormand> | ||||||||||||
Component: | XML | Assignee: | Nobody <webkit-unassigned> | ||||||||||||
Status: | RESOLVED FIXED | ||||||||||||||
Severity: | Normal | CC: | ap, a.renevier, commit-queue, ddkilzer, eric, ggray, jmalonzo, krit, leo.yang, mrobinson, mrowe, phajdan.jr, une.belette, vincent-webkit, webkit.review.bot, xan.lopez, yong.li.webkit | ||||||||||||
Priority: | P2 | Keywords: | NeedsReduction | ||||||||||||
Version: | 528+ (Nightly build) | ||||||||||||||
Hardware: | PC | ||||||||||||||
OS: | OS X 10.5 | ||||||||||||||
Bug Depends on: | |||||||||||||||
Bug Blocks: | 33027 | ||||||||||||||
Attachments: |
|
Description
Philippe Normand
2009-10-19 03:53:05 PDT
Created attachment 41416 [details]
fix xml parsing with recent versions of libxml2
This patch fixes most of the fast/xsl (excepted one related to locale sorting), fast/js and fast/xsl tests. Also tested with libxml2 2.7.3 with success. Please try it :)
(In reply to comment #1) > Created an attachment (id=41416) [details] > fix xml parsing with recent versions of libxml2 > > This patch fixes most of the fast/xsl (excepted one related to locale sorting), > fast/js and fast/xsl tests. Also tested with libxml2 2.7.3 with success. Please > try it :) Forgot to mention this is with the Gtk+ port Comment on attachment 41416 [details] fix xml parsing with recent versions of libxml2 > + String toParse = "<?xml version=\"1.0\" ?>" + parseString; doWrite is called several times, once per chunk of data arriving from network. We certainly can't insert the XML declaration into the middle of XML source. If no test broke because of this change, we should add one that will (possibly an http one, to make sending data in chunks easier). Also, there are tabs in the patch, which should be spaces. We also can’t make changes that would break parsing with older versions of libxml2. The oldest version that I’m aware of a need to support is 2.6.16, which ships on Tiger. Given that our code works correctly with recent versions of libxml2 (2.7.3)… are we sure this isn’t a regression in libxml2 that should be fixed on their end? (In reply to comment #4) > Given that our code works correctly with recent versions of libxml2 (2.7.3)… > are we sure this isn’t a regression in libxml2 that should be fixed on their > end? Were this to be a regression in libxml2 we'd still need to apply some kind of workaround in WebKit, since we can't possibly expect everyone using versions >= than 2.7.3 to instantaneously upgrade when a fixed version is out. I agree it somehow has to keep the code working with older releases, of course. (In reply to comment #5) > (In reply to comment #4) > > Given that our code works correctly with recent versions of libxml2 (2.7.3)… > > are we sure this isn’t a regression in libxml2 that should be fixed on their > > end? > > Were this to be a regression in libxml2 we'd still need to apply some kind of > workaround in WebKit, since we can't possibly expect everyone using versions >= > than 2.7.3 to instantaneously upgrade when a fixed version is out. Of course, but we also want any upstream bugs to be fixed ASAP to prevent us from having to keep such a workaround forever. It’d help to have a description of what the problem is and how the attached patch fixes it. (In reply to comment #7) > It’d help to have a description of what the problem is and how the attached > patch fixes it. As I mentionned in the bug description there are 2 issues: - the hacks we have in place to switch encoding to utf-16 before parsing each chunk are breaking the parsing with recent libxml2 (>= 2.7.3, afaics). There is no link to libxml2 bugtracker along with those hacks. - parsing a chunk will fail if no xml preamble is present. In doWrite I can check the value of m_sawFirstElement and insert the preamble if needed, would that be a better solution? (In reply to comment #8) > (In reply to comment #7) > > It’d help to have a description of what the problem is and how the attached > > patch fixes it. > > As I mentionned in the bug description there are 2 issues: > > - the hacks we have in place to switch encoding to utf-16 before parsing each > chunk are breaking the parsing with recent libxml2 (>= 2.7.3, afaics). There is > no link to libxml2 bugtracker along with those hacks. What do you mean by “breaking the parsing"? I’m not sure why you would expect there to be a link to the libxml2 bugtracker here. > - parsing a chunk will fail if no xml preamble is present. In doWrite I can > check the value of m_sawFirstElement and insert the preamble if needed, would > that be a better solution? A document is not required to start with an XML declaration so this doesn’t make much sense to me. Created attachment 41555 [details]
test case: switching encoding to utf-16 breaks parsing
Sample output:
encoding switched to UTF-16
SAX.startDocument()
SAX.endDocument()
error 4: Document is empty
----------
using default encoding
SAX.startDocument()
successful parse
----------
(In reply to comment #10) > Created an attachment (id=41555) [details] > test case: switching encoding to utf-16 breaks parsing > > Sample output: > > encoding switched to UTF-16 > SAX.startDocument() > SAX.endDocument() > error 4: Document is empty > > ---------- > using default encoding > SAX.startDocument() > successful parse > ---------- This example has little relation to what WebKit does. In your example you’re passing ASCII data to libxml2 to parse. It’s not surprising that it would be unhappy to attempt to parse that as UTF-16. WebKit passes in UTF-16 data from a WebCore::String. // Hack around libxml2's lack of encoding overide support by manually // resetting the encoding to UTF-16 before every chunk. Otherwise libxml // will detect <?xml version="1.0" encoding="<encoding name>"?> blocks // and switch encodings, causing the parse to fail. const UChar BOM = 0xFEFF; const unsigned char BOMHighByte = *reinterpret_cast<const unsigned char*>(&BOM); xmlSwitchEncoding(m_context, BOMHighByte == 0xFF ? XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE); It seems like what is happening is that our hack to work around libxml2's lack of encoding override support is just not working with the newer version of libxml2. The libxml2 change that broke WebCore parsing was <http://git.gnome.org/cgit/libxml2/commit/?id=a6c76a>. This occurred between libxml2 2.7.3 and 2.7.4. After staring at the libxml2 for a while I’ve worked out what is going on. The change in question causes libxml2 to decode the first chunk of data we pass it in two goes. It first decodes enough data to try and parse the XML declaration. Once it has parsed this it then decodes the rest of the chunk and continues parsing. Since it is decoded in two parts the encoding switch that happens as a result of parsing the XML declaration’s encoding name leads libxml2 to decode the remaining data in the chunk with the wrong encoding. Created attachment 41570 [details]
updated patch
Not setting review flag, this patch is still work-in-progress, 1 xslt layout test remains to be fixed.
Is the new libxml2 version resposible for the braekage of some SVG images like http://apike.ca/media/svg/exampleMask.svg ? Or is it because of the latest changes to the parsing-code? Note that this SVG example has a Doctype that shouldn't be used on SVG but is still allowed in the SVG 1.0/1.1 specification. (In reply to comment #16) > Is the new libxml2 version resposible for the braekage of some SVG images like > http://apike.ca/media/svg/exampleMask.svg ? Got parsing error: This page contains the following errors: error on line 2 at column 29: Unfinished System or Public ID " or ' expected Below is a rendering of the page up to the first error. (In reply to comment #17) > (In reply to comment #16) > > Is the new libxml2 version resposible for the braekage of some SVG images like > > http://apike.ca/media/svg/exampleMask.svg ? > Got parsing error: > > This page contains the following errors: > > error on line 2 at column 29: Unfinished System or Public ID " or ' expected > Below is a rendering of the page up to the first error. That SVG loads fine here with HEAD of WebKitGTK and latest libxml2. What version of libxml2 do you use exactly? (In reply to comment #18) > (In reply to comment #17) > > (In reply to comment #16) > > > Is the new libxml2 version resposible for the braekage of some SVG images like > > > http://apike.ca/media/svg/exampleMask.svg ? > > Got parsing error: > > > > This page contains the following errors: > > > > error on line 2 at column 29: Unfinished System or Public ID " or ' expected > > Below is a rendering of the page up to the first error. > > That SVG loads fine here with HEAD of WebKitGTK and latest libxml2. What > version of libxml2 do you use exactly? Sorry ignore that previous comment of mine, I wasn't using the latest libxml2, now I have the same error as you. This problem also appears on lixml2 2.7.5. The last patch solves the problem with SVG's. I hope we get a solution for this problem. 1/3 of the SVG's in the wild are broken :-( I had an email thread about this with DV back when I added that hack. He did not seem interested at the time in supporting WebCore's usecase of determining the encoding of the string ourselves. He suggested that libxml's supported usecase was as a stand-alone library, not as a support library for parsing as we're using it. I can dig up the emails if necessary (assuming they weren't lost in the depths of my @apple account). So in the end, we came up with this hack to work around the issue. I'm not surprised it broke, but I'm sad that it did. :( Hi WebKit Team, I have also this bug, see my history with midori on http://www.twotoasts.de/bugs/index.php?do=details&task_id=793 and my (closed) bug report on libxml2 on https://bugzilla.gnome.org/show_bug.cgi?id=614333 The bug occurs is between these two versions : 2.7.3-r1.5 and 2.7.7-r6.0.5. Waiting for news, bye, Jean Louis. Hi, Can you read the new interesting comment of Daniel for your bug fix ? https://bugzilla.gnome.org/show_bug.cgi?id=614333#c6 Thanks. Hi, Thanks Alexey for your post on libxml2 but do you think you can implement recommendation of the XML spec appendix F like Daniel say on https://bugzilla.gnome.org/show_bug.cgi?id=614333#c4 without break the xml parsing of webkit ? Thanks. Hi, Can I have hope that this bug is fixed one day ? Thanks for any reply. Hi,once more i'm confirming this issue with libxml2-python-2.7.4.win32-py2.6.exe and pywebkitgtk. anyone knows about any patch or workaround for this issue ?????? (In reply to comment #27) > Hi,once more i'm confirming this issue with libxml2-python-2.7.4.win32-py2.6.exe and pywebkitgtk. > > anyone knows about any patch or workaround for this issue ?????? Hi, After check comments, it seems there is a non compliance with libxml and the team prefer keep an old (bad) workaround instead of correct with new version of libxml2. Personally, I no longer believe. Created attachment 65635 [details]
patch
Attachment 65635 [details] did not pass style-queue:
Failed to run "['WebKitTools/Scripts/check-webkit-style']" exit_code: 1
WebCore/dom/XMLDocumentParserLibxml2.cpp:366: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:367: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:368: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:369: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:370: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:371: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3]
WebCore/dom/XMLDocumentParserLibxml2.cpp:372: Weird number of spaces at line-start. Are you using a 4-space indent? [whitespace/indent] [3]
Total errors found: 7 in 2 files
If any of these errors are false positives, please file a bug against check-webkit-style.
Created attachment 65637 [details]
patch
Comment on attachment 65635 [details]
patch
Why is the 3rd switch needed? CAn we write a test case for this?
(In reply to comment #32) > (From update of attachment 65635 [details]) > Why is the 3rd switch needed? CAn we write a test case for this? I would assume it’s needed to fix the bug. Per my analysis in comment 14, libxml2 is now switching the encoding in the middle of a single chunk after it parses the XML declaration. That point corresponds to the startDocumentHandler callback. This patch switches the encoding back to UTF-16 at that point. The ChangeLog should really explain all of this though. Has this patch been tested with older versions of libxml2? (In reply to comment #32) > (From update of attachment 65635 [details]) > Why is the 3rd switch needed? Without it http://code.google.com/p/chromium/issues/detail?id=29333 is not fixed. > CAn we write a test case for this? Please take a look at https://bugs.webkit.org/show_bug.cgi?id=33027 (In reply to comment #33) > Has this patch been tested with older versions of libxml2? I will test it with libxml2-2.7.3 and report here. (In reply to comment #34) > (In reply to comment #33) > > Has this patch been tested with older versions of libxml2? > > I will test it with libxml2-2.7.3 and report here. We need to make sure that this works as far back as libxml2-2.6.16, the version that Tiger and Leopard use. (In reply to comment #35) > We need to make sure that this works as far back as libxml2-2.6.16, > the version that Tiger and Leopard use. I have tested with the following versions of libxml2: 2.7.7, 2.7.3, and 2.6.16. The patch seems to work fine for all of those versions. Comment on attachment 65637 [details] patch Clearing flags on attachment: 65637 Committed r66336: <http://trac.webkit.org/changeset/66336> All reviewed patches have been landed. Closing bug. *** Bug 34063 has been marked as a duplicate of this bug. *** This seems to have fixed a large number of the tests that were failing because of libxml issues, but not all them. For instance, see: http/tests/xmlhttprequest/broken-xml-encoding.html or just search for "libxml" in the GTK+ skipped list. |