Bug 17079

Summary:

XML document sent with wrong encoding does not fail as expected (affects Acid3 test 70)

Product:

WebKit

Reporter:

Eric Seidel (no email) <eric>

Component:

DOM

Assignee:

Eric Seidel (no email) <eric>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

ap, cdumez, gavin.sharp, ian, webkit, webkit-bugs

Priority:

Version:

528+ (Nightly build)

Hardware:

Mac

OS:

OS X 10.5

URL:

http://acid3.acidtests.org/

Bug Depends on:

Bug Blocks:

17064, 268263

Attachments:

Description	Flags
Fix, makes us pass Test 70. Needs test cases!	none
Fix, makes us pass Test 70, now includes minimal testing	darin: review+

Eric Seidel (no email)

Reported 2008-01-29 14:59:51 PST

XML document sent with wrong encoding does not fail as expected (Acid3 bug) function () { // test 71: XML encoding test // the third child in kungFuDeathGrip is an ISO-8859-1 document sent as UTF-8. // q.v. XML 1.0, section 4.3.3 Character Encoding in Entities // this only tests one of a large number of conditions that should cause fatal errors var doc = kungFuDeathGrip.childNodes[2].contentDocument; if (!doc) return 5; if (!doc.documentElement.tagName == "root") return 5; if (!doc.documentElement.getElementsByTagName('test')) return 5; fail("UTF-8 encoded XML document with invalid character did not have a well-formedness error"); },

Attachments
Fix, makes us pass Test 70. Needs test cases! (21.29 KB, patch) 2008-03-26 01:04 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Fix, makes us pass Test 70, now includes minimal testing (25.12 KB, patch) 2008-03-26 09:48 PDT, Eric Seidel (no email)	darin: review+	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Eric Seidel (no email)

Comment 1 2008-01-30 01:32:28 PST

Bleh. More XML strictness bugs. If FF or Opera are not strict about this, then this isn't a bug. :(

Alexey Proskuryakov

Comment 2 2008-01-30 01:45:01 PST

We match Firefox here, and changing this may be a compatibility problem. Also, I am not convinced that the XML spec has any say in how text decoding should work - it is a layering violation for it to impose any restrictions on UTF-8, for example. What if the XML spec imposed restrictions on Ethernet MTU, for example? This restriction was probably reasonable when XML 1.0 was first published, because the Unicode standard didn't deal with these issues well, but now it does. Could having this test in Acid3 be reconsidered?

Ian 'Hixie' Hickson

Comment 3 2008-01-30 16:20:22 PST

If you can get the XML WG to change this requirement, I will immediately change the test. However, the spec is hard to misinterpret right now, and this is a major interop danger with XML. I believe testing this in Acid3 is the right thing to do.

Alexey Proskuryakov

Comment 4 2008-01-31 02:37:50 PST

I suggest that we don't try to be the first to "fix" this - exactly because it's a huge interop danger.

Alexey Proskuryakov

Comment 5 2008-01-31 03:14:50 PST

To reiterate, current shipping or preview versions of Firefox, Opera and Safari do not fail to parse <http://hixie.ch/tests/evil/acid/003/empty.xml>. Changing this would not be an interoperability enhancement, nor would it make the Web a better place in some other way I can see.

Ian 'Hixie' Hickson

Comment 6 2008-01-31 11:28:50 PST

Ironically, IE7 does show an error when you go to empty.xml directly. So...

Alexey Proskuryakov

Comment 7 2008-01-31 13:02:18 PST

I think that the most important case for interoperability is what happens for XHTML. IE is not really a reference point in that case. Surely, we don't want to the first to start failing on pages that work in all other browsers (as IE would just get text/html MIME type from the server).

Maciej Stachowiak

Comment 8 2008-02-20 18:29:43 PST

Actually, I think the most important interoperability test case for this is XMLHttpRequest - that's likely the most common use of XML on the web. Hixie points out that IE passes <http://www.hixie.ch/tests/adhoc/dom/web-apps/XMLHttpRequest/020.html>. XHTML with an XML MIME type is pretty rare on the web and generally only used by people who care more about strictness than their site not breaking. So I think we should fix this.

Rosyna

Comment 9 2008-03-18 19:00:28 PDT

Sadly, this is not a layering violation any more C10 of the Unicode 5.0 specification Chapter 3 (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf) says: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters. " It may have been a layering violation when the XML part of this was originally written. But it isn't as of Unicode 5.0, since it explicitly says "error condition". And these XML specs are free to specify exactly what happens in an "error condition". (Many Text based apps just replace the invalids with U+FFFD (http://www.fileformat.info/info/unicode/char/fffd/index.htm) as it is far worse for these applications to bail when encountering invalid codepoints)

Eric Seidel (no email)

Comment 10 2008-03-25 00:32:44 PDT

So the changes here need to happen to the various TextDecoder classes. I think for TextDecoderICU we can implement a "strict" mode using error callbacks which are covered by this documentation: http://www.icu-project.org/apiref/icu4c/ucnv__err_8h.html#1e9f87a69d75288c0f93bd77a6f2c9db currently we don't use any error callbacks. We would then create the TextDecoder in a "strict" mode (or set it in a strict mode once we've determined we're handling XML) and then check if it had an error in the write() function in the XMLTokenizer. If it had an error we would stop tokenizing and display the error (using the existing error reporting functions in XMLTokenizer).

Eric Seidel (no email)

Comment 11 2008-03-25 16:43:16 PDT

It looks like we will need to change TextResourceDecoder to so that when it detects an XML content type it should go into a "strict" mode and set a flag on error that the XMLTokenizer can then detect and react to. I expect this patch will be a lot of plumbing.

Eric Seidel (no email)

Comment 12 2008-03-25 18:51:10 PDT

I'm testing a patch now.

Eric Seidel (no email)

Comment 13 2008-03-26 01:04:24 PDT

Created attachment 20077 [details] Fix, makes us pass Test 70. Needs test cases! WebCore/WebCore.base.exp | 2 +- WebCore/dom/XMLTokenizer.cpp | 5 ++ WebCore/loader/CachedFont.cpp | 4 ++ WebCore/loader/TextResourceDecoder.cpp | 7 ++- WebCore/loader/TextResourceDecoder.h | 3 + WebCore/platform/text/TextCodec.h | 13 +++-- WebCore/platform/text/TextCodecICU.cpp | 65 +++++++++++++++++++----- WebCore/platform/text/TextCodecICU.h | 5 ++- WebCore/platform/text/TextCodecLatin1.cpp | 2 +- WebCore/platform/text/TextCodecLatin1.h | 2 +- WebCore/platform/text/TextCodecUTF16.cpp | 2 +- WebCore/platform/text/TextCodecUTF16.h | 2 +- WebCore/platform/text/TextCodecUserDefined.cpp | 2 +- WebCore/platform/text/TextCodecUserDefined.h | 2 +- WebCore/platform/text/TextDecoder.cpp | 11 +++-- WebCore/platform/text/TextDecoder.h | 8 ++-- WebCore/platform/text/TextEncoding.cpp | 4 +- WebCore/platform/text/TextEncoding.h | 7 ++- WebCore/platform/text/mac/TextCodecMac.cpp | 14 +++-- WebCore/platform/text/mac/TextCodecMac.h | 3 +- 20 files changed, 116 insertions(+), 47 deletions(-)

Eric Seidel (no email)

Comment 14 2008-03-26 01:14:07 PDT

(In reply to comment #13) > Created an attachment (id=20077) [edit] > Fix, makes us pass Test 70. Needs test cases! Basically I need to test every single encoding we support (to make sure I test all of the various decoders). Anyone know how to easily create an invalid char in any encoding? Maybe some encodings don't have invalid codepoints?

Alexey Proskuryakov

Comment 15 2008-03-26 02:05:42 PDT

Latin1 and user-defined can always be decoded; UTF is pretty easy (just make an unpaired surrogate), and the general case is probably easy to test with Shift-JIS (I don't have an example handy, but some random noise should suffice, I think).

Eric Seidel (no email)

Comment 16 2008-03-26 08:56:46 PDT

(In reply to comment #15) > Latin1 and user-defined can always be decoded; UTF is pretty easy (just make an > unpaired surrogate), and the general case is probably easy to test with > Shift-JIS (I don't have an example handy, but some random noise should suffice, > I think). > Hum... Now I just need a way to produce an invalid char in an encoding which goes through TextDecoderMac. x-mac-hebrew is the example used in the layout test attached to bug 4195... Perhaps you have a suggestion of an encoding and an invalid code-point I could use for testing?

Adam Roben (:aroben)

Comment 17 2008-03-26 09:29:30 PDT

Comment on attachment 20077 [details] Fix, makes us pass Test 70. Needs test cases! It looks like this patch will only fix platforms which use ICU. I think the ErrorCallbackSetter constructor/destructor should return early when m_shouldStopOnEncodingErrorsis false. Other than that this looks good (though of course we still need tests).

Eric Seidel (no email)

Comment 18 2008-03-26 09:48:12 PDT

Created attachment 20083 [details] Fix, makes us pass Test 70, now includes minimal testing LayoutTests/fast/encoding/invalid-xml.html | 12 ++++ .../encoding/resources/invalid-xml-shift-jis.xml | 5 ++ .../fast/encoding/resources/invalid-xml-utf16.xml | Bin 0 -> 444 bytes .../fast/encoding/resources/invalid-xml-utf8.xml | 5 ++ .../encoding/resources/invalid-xml-x-mac-thai.xml | 5 ++ LayoutTests/fast/encoding/resources/invalid-xml.js | 43 +++++++++++++ WebCore/WebCore.base.exp | 2 +- WebCore/dom/XMLTokenizer.cpp | 5 ++ WebCore/loader/CachedFont.cpp | 4 + WebCore/loader/TextResourceDecoder.cpp | 7 +- WebCore/loader/TextResourceDecoder.h | 3 + WebCore/platform/text/TextCodec.h | 13 +++- WebCore/platform/text/TextCodecICU.cpp | 65 ++++++++++++++++---- WebCore/platform/text/TextCodecICU.h | 5 +- WebCore/platform/text/TextCodecLatin1.cpp | 2 +- WebCore/platform/text/TextCodecLatin1.h | 2 +- WebCore/platform/text/TextCodecUTF16.cpp | 2 +- WebCore/platform/text/TextCodecUTF16.h | 2 +- WebCore/platform/text/TextCodecUserDefined.cpp | 2 +- WebCore/platform/text/TextCodecUserDefined.h | 2 +- WebCore/platform/text/TextDecoder.cpp | 11 ++- WebCore/platform/text/TextDecoder.h | 8 +- WebCore/platform/text/TextEncoding.cpp | 4 +- WebCore/platform/text/TextEncoding.h | 7 ++- WebCore/platform/text/mac/TextCodecMac.cpp | 14 +++-- WebCore/platform/text/mac/TextCodecMac.h | 3 +- 26 files changed, 186 insertions(+), 47 deletions(-)

Eric Seidel (no email)

Comment 19 2008-03-26 10:03:33 PDT

Comment on attachment 20083 [details] Fix, makes us pass Test 70, now includes minimal testing I think this might actually be ready for review. All test cases pass. I may need to break out the mac-encoding tests into a separate file, since I'm not sure this test would pass on other platforms.

Darin Adler

Comment 20 2008-03-26 10:08:36 PDT

Comment on attachment 20083 [details] Fix, makes us pass Test 70, now includes minimal testing 672 if (m_doc->decoder() && m_doc->decoder()->sawError()) 673 // If the decoder saw an error, report it as fatal (stops parsing) 674 handleError(fatal, "Encoding error", lineNumber(), columnNumber()); Needs braces. We normally put braces around multiple lines even if one is a comment. r=me

Eric Seidel (no email)

Comment 21 2008-03-26 10:21:30 PDT

Thanks. Oops. I think I forgot to fix the if, will do. r31316

Alexey Proskuryakov

Comment 22 2009-03-05 23:38:59 PST

Follow-up in <http://trac.webkit.org/changeset/41476> - don't be strict with XMLHttpRequest responses. Note that IE has quite a different model - it decodes responseXML and responseText separately, so they can even use different encodings.

Lucas Forschler

Comment 23 2019-02-06 09:03:30 PST

Mass moving XML DOM bugs to the "DOM" Component.

Note You need to log in before you can comment on or make changes to this bug.