Bug 17079

Summary: XML document sent with wrong encoding does not fail as expected (affects Acid3 test 70)
Product: WebKit Reporter: Eric Seidel (no email) <eric>
Component: DOMAssignee: Eric Seidel (no email) <eric>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, cdumez, gavin.sharp, ian, webkit, webkit-bugs
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Mac   
OS: OS X 10.5   
URL: http://acid3.acidtests.org/
Bug Depends on:    
Bug Blocks: 17064, 268263    
Attachments:
Description Flags
Fix, makes us pass Test 70. Needs test cases!
none
Fix, makes us pass Test 70, now includes minimal testing darin: review+

Eric Seidel (no email)
Reported 2008-01-29 14:59:51 PST
XML document sent with wrong encoding does not fail as expected (Acid3 bug) function () { // test 71: XML encoding test // the third child in kungFuDeathGrip is an ISO-8859-1 document sent as UTF-8. // q.v. XML 1.0, section 4.3.3 Character Encoding in Entities // this only tests one of a large number of conditions that should cause fatal errors var doc = kungFuDeathGrip.childNodes[2].contentDocument; if (!doc) return 5; if (!doc.documentElement.tagName == "root") return 5; if (!doc.documentElement.getElementsByTagName('test')) return 5; fail("UTF-8 encoded XML document with invalid character did not have a well-formedness error"); },
Attachments
Fix, makes us pass Test 70. Needs test cases! (21.29 KB, patch)
2008-03-26 01:04 PDT, Eric Seidel (no email)
no flags
Fix, makes us pass Test 70, now includes minimal testing (25.12 KB, patch)
2008-03-26 09:48 PDT, Eric Seidel (no email)
darin: review+
Eric Seidel (no email)
Comment 1 2008-01-30 01:32:28 PST
Bleh. More XML strictness bugs. If FF or Opera are not strict about this, then this isn't a bug. :(
Alexey Proskuryakov
Comment 2 2008-01-30 01:45:01 PST
We match Firefox here, and changing this may be a compatibility problem. Also, I am not convinced that the XML spec has any say in how text decoding should work - it is a layering violation for it to impose any restrictions on UTF-8, for example. What if the XML spec imposed restrictions on Ethernet MTU, for example? This restriction was probably reasonable when XML 1.0 was first published, because the Unicode standard didn't deal with these issues well, but now it does. Could having this test in Acid3 be reconsidered?
Ian 'Hixie' Hickson
Comment 3 2008-01-30 16:20:22 PST
If you can get the XML WG to change this requirement, I will immediately change the test. However, the spec is hard to misinterpret right now, and this is a major interop danger with XML. I believe testing this in Acid3 is the right thing to do.
Alexey Proskuryakov
Comment 4 2008-01-31 02:37:50 PST
I suggest that we don't try to be the first to "fix" this - exactly because it's a huge interop danger.
Alexey Proskuryakov
Comment 5 2008-01-31 03:14:50 PST
To reiterate, current shipping or preview versions of Firefox, Opera and Safari do not fail to parse <http://hixie.ch/tests/evil/acid/003/empty.xml>. Changing this would not be an interoperability enhancement, nor would it make the Web a better place in some other way I can see.
Ian 'Hixie' Hickson
Comment 6 2008-01-31 11:28:50 PST
Ironically, IE7 does show an error when you go to empty.xml directly. So...
Alexey Proskuryakov
Comment 7 2008-01-31 13:02:18 PST
I think that the most important case for interoperability is what happens for XHTML. IE is not really a reference point in that case. Surely, we don't want to the first to start failing on pages that work in all other browsers (as IE would just get text/html MIME type from the server).
Maciej Stachowiak
Comment 8 2008-02-20 18:29:43 PST
Actually, I think the most important interoperability test case for this is XMLHttpRequest - that's likely the most common use of XML on the web. Hixie points out that IE passes <http://www.hixie.ch/tests/adhoc/dom/web-apps/XMLHttpRequest/020.html>. XHTML with an XML MIME type is pretty rare on the web and generally only used by people who care more about strictness than their site not breaking. So I think we should fix this.
Rosyna
Comment 9 2008-03-18 19:00:28 PDT
Sadly, this is not a layering violation any more C10 of the Unicode 5.0 specification Chapter 3 (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf) says: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters. " It may have been a layering violation when the XML part of this was originally written. But it isn't as of Unicode 5.0, since it explicitly says "error condition". And these XML specs are free to specify exactly what happens in an "error condition". (Many Text based apps just replace the invalids with U+FFFD (http://www.fileformat.info/info/unicode/char/fffd/index.htm) as it is far worse for these applications to bail when encountering invalid codepoints)
Eric Seidel (no email)
Comment 10 2008-03-25 00:32:44 PDT
So the changes here need to happen to the various TextDecoder classes. I think for TextDecoderICU we can implement a "strict" mode using error callbacks which are covered by this documentation: http://www.icu-project.org/apiref/icu4c/ucnv__err_8h.html#1e9f87a69d75288c0f93bd77a6f2c9db currently we don't use any error callbacks. We would then create the TextDecoder in a "strict" mode (or set it in a strict mode once we've determined we're handling XML) and then check if it had an error in the write() function in the XMLTokenizer. If it had an error we would stop tokenizing and display the error (using the existing error reporting functions in XMLTokenizer).
Eric Seidel (no email)
Comment 11 2008-03-25 16:43:16 PDT
It looks like we will need to change TextResourceDecoder to so that when it detects an XML content type it should go into a "strict" mode and set a flag on error that the XMLTokenizer can then detect and react to. I expect this patch will be a lot of plumbing.
Eric Seidel (no email)
Comment 12 2008-03-25 18:51:10 PDT
I'm testing a patch now.
Eric Seidel (no email)
Comment 13 2008-03-26 01:04:24 PDT
Created attachment 20077 [details] Fix, makes us pass Test 70. Needs test cases! WebCore/WebCore.base.exp | 2 +- WebCore/dom/XMLTokenizer.cpp | 5 ++ WebCore/loader/CachedFont.cpp | 4 ++ WebCore/loader/TextResourceDecoder.cpp | 7 ++- WebCore/loader/TextResourceDecoder.h | 3 + WebCore/platform/text/TextCodec.h | 13 +++-- WebCore/platform/text/TextCodecICU.cpp | 65 +++++++++++++++++++----- WebCore/platform/text/TextCodecICU.h | 5 ++- WebCore/platform/text/TextCodecLatin1.cpp | 2 +- WebCore/platform/text/TextCodecLatin1.h | 2 +- WebCore/platform/text/TextCodecUTF16.cpp | 2 +- WebCore/platform/text/TextCodecUTF16.h | 2 +- WebCore/platform/text/TextCodecUserDefined.cpp | 2 +- WebCore/platform/text/TextCodecUserDefined.h | 2 +- WebCore/platform/text/TextDecoder.cpp | 11 +++-- WebCore/platform/text/TextDecoder.h | 8 ++-- WebCore/platform/text/TextEncoding.cpp | 4 +- WebCore/platform/text/TextEncoding.h | 7 ++- WebCore/platform/text/mac/TextCodecMac.cpp | 14 +++-- WebCore/platform/text/mac/TextCodecMac.h | 3 +- 20 files changed, 116 insertions(+), 47 deletions(-)
Eric Seidel (no email)
Comment 14 2008-03-26 01:14:07 PDT
(In reply to comment #13) > Created an attachment (id=20077) [edit] > Fix, makes us pass Test 70. Needs test cases! Basically I need to test every single encoding we support (to make sure I test all of the various decoders). Anyone know how to easily create an invalid char in any encoding? Maybe some encodings don't have invalid codepoints?
Alexey Proskuryakov
Comment 15 2008-03-26 02:05:42 PDT
Latin1 and user-defined can always be decoded; UTF is pretty easy (just make an unpaired surrogate), and the general case is probably easy to test with Shift-JIS (I don't have an example handy, but some random noise should suffice, I think).
Eric Seidel (no email)
Comment 16 2008-03-26 08:56:46 PDT
(In reply to comment #15) > Latin1 and user-defined can always be decoded; UTF is pretty easy (just make an > unpaired surrogate), and the general case is probably easy to test with > Shift-JIS (I don't have an example handy, but some random noise should suffice, > I think). > Hum... Now I just need a way to produce an invalid char in an encoding which goes through TextDecoderMac. x-mac-hebrew is the example used in the layout test attached to bug 4195... Perhaps you have a suggestion of an encoding and an invalid code-point I could use for testing?
Adam Roben (:aroben)
Comment 17 2008-03-26 09:29:30 PDT
Comment on attachment 20077 [details] Fix, makes us pass Test 70. Needs test cases! It looks like this patch will only fix platforms which use ICU. I think the ErrorCallbackSetter constructor/destructor should return early when m_shouldStopOnEncodingErrorsis false. Other than that this looks good (though of course we still need tests).
Eric Seidel (no email)
Comment 18 2008-03-26 09:48:12 PDT
Created attachment 20083 [details] Fix, makes us pass Test 70, now includes minimal testing LayoutTests/fast/encoding/invalid-xml.html | 12 ++++ .../encoding/resources/invalid-xml-shift-jis.xml | 5 ++ .../fast/encoding/resources/invalid-xml-utf16.xml | Bin 0 -> 444 bytes .../fast/encoding/resources/invalid-xml-utf8.xml | 5 ++ .../encoding/resources/invalid-xml-x-mac-thai.xml | 5 ++ LayoutTests/fast/encoding/resources/invalid-xml.js | 43 +++++++++++++ WebCore/WebCore.base.exp | 2 +- WebCore/dom/XMLTokenizer.cpp | 5 ++ WebCore/loader/CachedFont.cpp | 4 + WebCore/loader/TextResourceDecoder.cpp | 7 +- WebCore/loader/TextResourceDecoder.h | 3 + WebCore/platform/text/TextCodec.h | 13 +++- WebCore/platform/text/TextCodecICU.cpp | 65 ++++++++++++++++---- WebCore/platform/text/TextCodecICU.h | 5 +- WebCore/platform/text/TextCodecLatin1.cpp | 2 +- WebCore/platform/text/TextCodecLatin1.h | 2 +- WebCore/platform/text/TextCodecUTF16.cpp | 2 +- WebCore/platform/text/TextCodecUTF16.h | 2 +- WebCore/platform/text/TextCodecUserDefined.cpp | 2 +- WebCore/platform/text/TextCodecUserDefined.h | 2 +- WebCore/platform/text/TextDecoder.cpp | 11 ++- WebCore/platform/text/TextDecoder.h | 8 +- WebCore/platform/text/TextEncoding.cpp | 4 +- WebCore/platform/text/TextEncoding.h | 7 ++- WebCore/platform/text/mac/TextCodecMac.cpp | 14 +++-- WebCore/platform/text/mac/TextCodecMac.h | 3 +- 26 files changed, 186 insertions(+), 47 deletions(-)
Eric Seidel (no email)
Comment 19 2008-03-26 10:03:33 PDT
Comment on attachment 20083 [details] Fix, makes us pass Test 70, now includes minimal testing I think this might actually be ready for review. All test cases pass. I may need to break out the mac-encoding tests into a separate file, since I'm not sure this test would pass on other platforms.
Darin Adler
Comment 20 2008-03-26 10:08:36 PDT
Comment on attachment 20083 [details] Fix, makes us pass Test 70, now includes minimal testing 672 if (m_doc->decoder() && m_doc->decoder()->sawError()) 673 // If the decoder saw an error, report it as fatal (stops parsing) 674 handleError(fatal, "Encoding error", lineNumber(), columnNumber()); Needs braces. We normally put braces around multiple lines even if one is a comment. r=me
Eric Seidel (no email)
Comment 21 2008-03-26 10:21:30 PDT
Thanks. Oops. I think I forgot to fix the if, will do. r31316
Alexey Proskuryakov
Comment 22 2009-03-05 23:38:59 PST
Follow-up in <http://trac.webkit.org/changeset/41476> - don't be strict with XMLHttpRequest responses. Note that IE has quite a different model - it decodes responseXML and responseText separately, so they can even use different encodings.
Lucas Forschler
Comment 23 2019-02-06 09:03:30 PST
Mass moving XML DOM bugs to the "DOM" Component.
Note You need to log in before you can comment on or make changes to this bug.