Bug 50708 - Parse error during XSLT transformation
Summary: Parse error during XSLT transformation
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: XML (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Major
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2010-12-08 13:05 PST by Jasmin Lapalme
Modified: 2010-12-09 20:06 PST (History)
5 users (show)

See Also:


Attachments
Archive containing test case and patch (14.18 KB, application/zip)
2010-12-08 13:05 PST, Jasmin Lapalme
no flags Details
patch to fix UTF-8 conversion error (1.44 KB, patch)
2010-12-08 13:07 PST, Jasmin Lapalme
no flags Details | Formatted Diff | Diff
Patch to fix the issue (28.28 KB, patch)
2010-12-08 18:26 PST, Jasmin Lapalme
ap: review+
ap: commit-queue-
Details | Formatted Diff | Diff
revised patch to fix the issue (36.39 KB, patch)
2010-12-09 07:34 PST, Jasmin Lapalme
no flags Details | Formatted Diff | Diff
2nd revised patch to fix the issue (36.41 KB, patch)
2010-12-09 09:03 PST, Jasmin Lapalme
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jasmin Lapalme 2010-12-08 13:05:44 PST
Created attachment 75951 [details]
Archive containing test case and patch

There is a bug in WebCore where Safari is unable to render an HTML web page created by an XSLT transformation from an XML file and an XSLT stylesheet.

To reproduce, open test.xml (in the zip archive in attachment) in Safari (Mac OS X 10.6 or Safari Mobile on iPad or iPhone). There will be a message like this :
---
This page contains the following errors:

error on line 1 at column 1: Extra content at the end of the document
Below is a rendering of the page up to the first error.

This document was created as the result of an XSL transformation. The line and column numbers given are from the transformed result.
---

By examining the source code, I found that it is caused by faulty conversion from UTF-8 to UTF-16 in WebCore. The conversion convert chunks of UTF-8 data, but does not treat the case where the end of the chunk is in the middle of a multibyte UTF-8 character.

I am attaching a patch to fix this case (utf8_conversion.patch)
Comment 1 Jasmin Lapalme 2010-12-08 13:07:03 PST
Created attachment 75953 [details]
patch to fix UTF-8 conversion error
Comment 2 David Kilzer (:ddkilzer) 2010-12-08 13:21:17 PST
<rdar://problem/8728066>
Comment 3 Alexey Proskuryakov 2010-12-08 13:22:22 PST
Thank you for tackling this serious issue! Would you be willing to formally submit a patch for review, as described in <http://webkit.org/coding/contributing.html>?

We can't land a fix without a ChangeLog and a regression test.

For Apple: <rdar://problem/8728066>/<rdar://problem/8697596>.
Comment 4 David Kilzer (:ddkilzer) 2010-12-08 17:57:13 PST
(In reply to comment #3)
> We can't land a fix without a ChangeLog and a regression test.

The regression test should contain a reduced (if possible) and non-copyrighted test case.
Comment 5 Jasmin Lapalme 2010-12-08 18:26:48 PST
Created attachment 75999 [details]
Patch to fix the issue
Comment 6 Jasmin Lapalme 2010-12-08 18:30:22 PST
(In reply to comment #3)
> We can't land a fix without a ChangeLog and a regression test.
  I've attached the complete patch with a regression test and ChangeLog. I hope it's all OK.
Comment 7 Jasmin Lapalme 2010-12-08 18:33:14 PST
(In reply to comment #4)
> The regression test should contain a reduced (if possible) and non-copyrighted test case.
  I've reduced a lot the test and made it non-copyrighted. The xml file cannot be reduced more than that since it's with big files the bug will show up.
Comment 8 Darin Adler 2010-12-08 18:56:36 PST
Comment on attachment 75999 [details]
Patch to fix the issue

What’s wrong with String::fromUTF8?
Comment 9 Alexey Proskuryakov 2010-12-08 18:59:36 PST
Comment on attachment 75999 [details]
Patch to fix the issue

View in context: https://bugs.webkit.org/attachment.cgi?id=75999&action=review

This looks great, thanks! I've had a number of trivial nitpicks, at least a few of which need to be fixed before landing.

I'm marking this r+, so a committer can fix these before landing. Or ideally, you could upload an updated patch, and then we'll land through commit-queue.

> WebCore/ChangeLog:11
> +        * xml/XSLTProcessorLibxslt.cpp:
> +        (WebCore::writeToVector):

This auto-generated list is intended to be completed with a brief per-function description of changes. Many active contributors don't do that, but it's a useful practice.

> WebCore/xml/XSLTProcessorLibxslt.cpp:50
>  #include <wtf/text/CString.h>
>  #include <wtf/Vector.h>
> +#include <wtf/unicode/UTF8.h>

A pre-existing issue: we sort headers in ASCII order (same as UNIX sort command or Xcode Sort Selection script). So upper case comes first.

> WebCore/xml/XSLTProcessorLibxslt.cpp:166
> +    if (len == 0)

WebKit style is "if (!len)".

> WebCore/xml/XSLTProcessorLibxslt.cpp:171
> +    UChar* bufferUChar;
> +    String stringBuffer(StringImpl::createUninitialized(len, bufferUChar));
> +    UChar* bufferUCharEnd = bufferUChar + len;

I think that StringBuffer would look better here. String::fromUTF8() code that you copied this from probably predates StringBuffer.

> WebCore/xml/XSLTProcessorLibxslt.cpp:173
> +    // Try converting into the buffer.

Why try? Can converting really fail?

Simply converting into a buffer doesn't seem worth a comment to me.

> WebCore/xml/XSLTProcessorLibxslt.cpp:176
> +    WTF::Unicode::ConversionResult result =
> +        WTF::Unicode::convertUTF8ToUTF16(&stringCurrent, buffer + len, &bufferUChar, bufferUCharEnd);

No need to wrap this line, it's short enough.

> WebCore/xml/XSLTProcessorLibxslt.cpp:177
> +    if (result == WTF::Unicode::conversionOK || result == WTF::Unicode::sourceExhausted) {

Common WebKit style it to have early return on error case, not to nest "if"s for success cases.

> WebCore/xml/XSLTProcessorLibxslt.cpp:183
>      return len;

Is this correct error handling? Per <http://xmlsoft.org/html/libxml-xmlIO.html#xmlOutputWriteCallback>, -1 should be returned.

Also, looks like we don't really ever expect any errors here, so ASSERT_NOT_REACHED() would be good, with or without release build handling.

> LayoutTests/ChangeLog:10
> +        * fast/xsl/xslt-bad-encoding-expected.txt: Added.
> +        * fast/xsl/xslt-bad-encoding.xml: Added.
> +        * fast/xsl/xslt-bad-encoding.xsl: Added.

This is not a good name for the test, as the source files are all properly encoded. I'd suggest something like "utf8-chunks.xml".

> LayoutTests/fast/xsl/xslt-bad-encoding.xml:30
> +ééééééééééééééééééééééééééééééééééééééééééééééééé

Is this enough to reliably trigger the bug under different versions of libxml2 and other XML back-ends? Maybe make this section longer, just to be safe?

> LayoutTests/fast/xsl/xslt-bad-encoding.xsl:3
> +				xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

Our subversion pre-commit hooks will complain about tabs in files. Please convert to spaces.
Comment 10 Alexey Proskuryakov 2010-12-08 19:01:17 PST
> What’s wrong with String::fromUTF8?

We need to know how many input bytes were consumed, in case a UTF-8 sequence is not all in the provided chunk.
Comment 11 Jasmin Lapalme 2010-12-09 07:34:18 PST
Created attachment 76063 [details]
revised patch to fix the issue
Comment 12 Jasmin Lapalme 2010-12-09 07:38:50 PST
(In reply to comment #9)
> This looks great, thanks! I've had a number of trivial nitpicks, at least a few of which need to be fixed before landing.
 I've attached a revised patch following your comments.
Comment 13 Alexey Proskuryakov 2010-12-09 08:32:35 PST
Comment on attachment 76063 [details]
revised patch to fix the issue

View in context: https://bugs.webkit.org/attachment.cgi?id=76063&action=review

Looks good. Did you decide to not add ASSERT_NOT_REACHED because the error case can actually be seen?

> WebCore/ChangeLog:8
> +        Test: fast/xsl/xslt-bad-encoding.xml

Could you please also fix this?

> LayoutTests/ChangeLog:10
> +        * fast/xsl/xslt-bad-encoding-expected.txt: Added.
> +        * fast/xsl/xslt-bad-encoding.xml: Added.
> +        * fast/xsl/xslt-bad-encoding.xsl: Added.

And this?
Comment 14 Jasmin Lapalme 2010-12-09 09:03:22 PST
Created attachment 76073 [details]
2nd revised patch to fix the issue
Comment 15 Jasmin Lapalme 2010-12-09 09:05:55 PST
(In reply to comment #13)
> Did you decide to not add ASSERT_NOT_REACHED because the error case can actually be seen?
  You're right. I don't think it will happen. I add an ASSERT_NOT_REACHED.
Comment 16 Alexey Proskuryakov 2010-12-09 09:20:46 PST
Comment on attachment 76073 [details]
2nd revised patch to fix the issue

Thank you!
Comment 17 WebKit Commit Bot 2010-12-09 20:06:14 PST
Comment on attachment 76073 [details]
2nd revised patch to fix the issue

Clearing flags on attachment: 76073

Committed r73679: <http://trac.webkit.org/changeset/73679>
Comment 18 WebKit Commit Bot 2010-12-09 20:06:20 PST
All reviewed patches have been landed.  Closing bug.