Summary: | XML document sent with wrong encoding does not fail as expected (affects Acid3 test 70) | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Eric Seidel (no email) <eric> | ||||||
Component: | DOM | Assignee: | Eric Seidel (no email) <eric> | ||||||
Status: | RESOLVED FIXED | ||||||||
Severity: | Normal | CC: | ap, cdumez, gavin.sharp, ian, webkit, webkit-bugs | ||||||
Priority: | P2 | ||||||||
Version: | 528+ (Nightly build) | ||||||||
Hardware: | Mac | ||||||||
OS: | OS X 10.5 | ||||||||
URL: | http://acid3.acidtests.org/ | ||||||||
Bug Depends on: | |||||||||
Bug Blocks: | 17064, 268263 | ||||||||
Attachments: |
|
Description
Eric Seidel (no email)
2008-01-29 14:59:51 PST
Bleh. More XML strictness bugs. If FF or Opera are not strict about this, then this isn't a bug. :( We match Firefox here, and changing this may be a compatibility problem. Also, I am not convinced that the XML spec has any say in how text decoding should work - it is a layering violation for it to impose any restrictions on UTF-8, for example. What if the XML spec imposed restrictions on Ethernet MTU, for example? This restriction was probably reasonable when XML 1.0 was first published, because the Unicode standard didn't deal with these issues well, but now it does. Could having this test in Acid3 be reconsidered? If you can get the XML WG to change this requirement, I will immediately change the test. However, the spec is hard to misinterpret right now, and this is a major interop danger with XML. I believe testing this in Acid3 is the right thing to do. I suggest that we don't try to be the first to "fix" this - exactly because it's a huge interop danger. To reiterate, current shipping or preview versions of Firefox, Opera and Safari do not fail to parse <http://hixie.ch/tests/evil/acid/003/empty.xml>. Changing this would not be an interoperability enhancement, nor would it make the Web a better place in some other way I can see. Ironically, IE7 does show an error when you go to empty.xml directly. So... I think that the most important case for interoperability is what happens for XHTML. IE is not really a reference point in that case. Surely, we don't want to the first to start failing on pages that work in all other browsers (as IE would just get text/html MIME type from the server). Actually, I think the most important interoperability test case for this is XMLHttpRequest - that's likely the most common use of XML on the web. Hixie points out that IE passes <http://www.hixie.ch/tests/adhoc/dom/web-apps/XMLHttpRequest/020.html>. XHTML with an XML MIME type is pretty rare on the web and generally only used by people who care more about strictness than their site not breaking. So I think we should fix this. Sadly, this is not a layering violation any more C10 of the Unicode 5.0 specification Chapter 3 (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf) says: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters. " It may have been a layering violation when the XML part of this was originally written. But it isn't as of Unicode 5.0, since it explicitly says "error condition". And these XML specs are free to specify exactly what happens in an "error condition". (Many Text based apps just replace the invalids with U+FFFD (http://www.fileformat.info/info/unicode/char/fffd/index.htm) as it is far worse for these applications to bail when encountering invalid codepoints) So the changes here need to happen to the various TextDecoder classes. I think for TextDecoderICU we can implement a "strict" mode using error callbacks which are covered by this documentation: http://www.icu-project.org/apiref/icu4c/ucnv__err_8h.html#1e9f87a69d75288c0f93bd77a6f2c9db currently we don't use any error callbacks. We would then create the TextDecoder in a "strict" mode (or set it in a strict mode once we've determined we're handling XML) and then check if it had an error in the write() function in the XMLTokenizer. If it had an error we would stop tokenizing and display the error (using the existing error reporting functions in XMLTokenizer). It looks like we will need to change TextResourceDecoder to so that when it detects an XML content type it should go into a "strict" mode and set a flag on error that the XMLTokenizer can then detect and react to. I expect this patch will be a lot of plumbing. I'm testing a patch now. Created attachment 20077 [details]
Fix, makes us pass Test 70. Needs test cases!
WebCore/WebCore.base.exp | 2 +-
WebCore/dom/XMLTokenizer.cpp | 5 ++
WebCore/loader/CachedFont.cpp | 4 ++
WebCore/loader/TextResourceDecoder.cpp | 7 ++-
WebCore/loader/TextResourceDecoder.h | 3 +
WebCore/platform/text/TextCodec.h | 13 +++--
WebCore/platform/text/TextCodecICU.cpp | 65 +++++++++++++++++++-----
WebCore/platform/text/TextCodecICU.h | 5 ++-
WebCore/platform/text/TextCodecLatin1.cpp | 2 +-
WebCore/platform/text/TextCodecLatin1.h | 2 +-
WebCore/platform/text/TextCodecUTF16.cpp | 2 +-
WebCore/platform/text/TextCodecUTF16.h | 2 +-
WebCore/platform/text/TextCodecUserDefined.cpp | 2 +-
WebCore/platform/text/TextCodecUserDefined.h | 2 +-
WebCore/platform/text/TextDecoder.cpp | 11 +++--
WebCore/platform/text/TextDecoder.h | 8 ++--
WebCore/platform/text/TextEncoding.cpp | 4 +-
WebCore/platform/text/TextEncoding.h | 7 ++-
WebCore/platform/text/mac/TextCodecMac.cpp | 14 +++--
WebCore/platform/text/mac/TextCodecMac.h | 3 +-
20 files changed, 116 insertions(+), 47 deletions(-)
(In reply to comment #13) > Created an attachment (id=20077) [edit] > Fix, makes us pass Test 70. Needs test cases! Basically I need to test every single encoding we support (to make sure I test all of the various decoders). Anyone know how to easily create an invalid char in any encoding? Maybe some encodings don't have invalid codepoints? Latin1 and user-defined can always be decoded; UTF is pretty easy (just make an unpaired surrogate), and the general case is probably easy to test with Shift-JIS (I don't have an example handy, but some random noise should suffice, I think). (In reply to comment #15) > Latin1 and user-defined can always be decoded; UTF is pretty easy (just make an > unpaired surrogate), and the general case is probably easy to test with > Shift-JIS (I don't have an example handy, but some random noise should suffice, > I think). > Hum... Now I just need a way to produce an invalid char in an encoding which goes through TextDecoderMac. x-mac-hebrew is the example used in the layout test attached to bug 4195... Perhaps you have a suggestion of an encoding and an invalid code-point I could use for testing? Comment on attachment 20077 [details]
Fix, makes us pass Test 70. Needs test cases!
It looks like this patch will only fix platforms which use ICU.
I think the ErrorCallbackSetter constructor/destructor should return early when m_shouldStopOnEncodingErrorsis false.
Other than that this looks good (though of course we still need tests).
Created attachment 20083 [details]
Fix, makes us pass Test 70, now includes minimal testing
LayoutTests/fast/encoding/invalid-xml.html | 12 ++++
.../encoding/resources/invalid-xml-shift-jis.xml | 5 ++
.../fast/encoding/resources/invalid-xml-utf16.xml | Bin 0 -> 444 bytes
.../fast/encoding/resources/invalid-xml-utf8.xml | 5 ++
.../encoding/resources/invalid-xml-x-mac-thai.xml | 5 ++
LayoutTests/fast/encoding/resources/invalid-xml.js | 43 +++++++++++++
WebCore/WebCore.base.exp | 2 +-
WebCore/dom/XMLTokenizer.cpp | 5 ++
WebCore/loader/CachedFont.cpp | 4 +
WebCore/loader/TextResourceDecoder.cpp | 7 +-
WebCore/loader/TextResourceDecoder.h | 3 +
WebCore/platform/text/TextCodec.h | 13 +++-
WebCore/platform/text/TextCodecICU.cpp | 65 ++++++++++++++++----
WebCore/platform/text/TextCodecICU.h | 5 +-
WebCore/platform/text/TextCodecLatin1.cpp | 2 +-
WebCore/platform/text/TextCodecLatin1.h | 2 +-
WebCore/platform/text/TextCodecUTF16.cpp | 2 +-
WebCore/platform/text/TextCodecUTF16.h | 2 +-
WebCore/platform/text/TextCodecUserDefined.cpp | 2 +-
WebCore/platform/text/TextCodecUserDefined.h | 2 +-
WebCore/platform/text/TextDecoder.cpp | 11 ++-
WebCore/platform/text/TextDecoder.h | 8 +-
WebCore/platform/text/TextEncoding.cpp | 4 +-
WebCore/platform/text/TextEncoding.h | 7 ++-
WebCore/platform/text/mac/TextCodecMac.cpp | 14 +++--
WebCore/platform/text/mac/TextCodecMac.h | 3 +-
26 files changed, 186 insertions(+), 47 deletions(-)
Comment on attachment 20083 [details]
Fix, makes us pass Test 70, now includes minimal testing
I think this might actually be ready for review. All test cases pass. I may need to break out the mac-encoding tests into a separate file, since I'm not sure this test would pass on other platforms.
Comment on attachment 20083 [details]
Fix, makes us pass Test 70, now includes minimal testing
672 if (m_doc->decoder() && m_doc->decoder()->sawError())
673 // If the decoder saw an error, report it as fatal (stops parsing)
674 handleError(fatal, "Encoding error", lineNumber(), columnNumber());
Needs braces. We normally put braces around multiple lines even if one is a comment.
r=me
Thanks. Oops. I think I forgot to fix the if, will do. r31316 Follow-up in <http://trac.webkit.org/changeset/41476> - don't be strict with XMLHttpRequest responses. Note that IE has quite a different model - it decodes responseXML and responseText separately, so they can even use different encodings. Mass moving XML DOM bugs to the "DOM" Component. |