Bug 17079 - XML document sent with wrong encoding does not fail as expected (affects Acid3 test 70)
Summary: XML document sent with wrong encoding does not fail as expected (affects Acid...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM (show other bugs)
Version: 528+ (Nightly build)
Hardware: Macintosh OS X 10.5
: P2 Normal
Assignee: Eric Seidel (no email)
URL: http://acid3.acidtests.org/
Keywords:
Depends on:
Blocks: Acid3
  Show dependency treegraph
 
Reported: 2008-01-29 14:59 PST by Eric Seidel (no email)
Modified: 2019-02-06 09:03 PST (History)
6 users (show)

See Also:


Attachments
Fix, makes us pass Test 70. Needs test cases! (21.29 KB, patch)
2008-03-26 01:04 PDT, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Fix, makes us pass Test 70, now includes minimal testing (25.12 KB, patch)
2008-03-26 09:48 PDT, Eric Seidel (no email)
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Seidel (no email) 2008-01-29 14:59:51 PST
XML document sent with wrong encoding does not fail as expected (Acid3 bug)

    function () {
      // test 71: XML encoding test
      // the third child in kungFuDeathGrip is an ISO-8859-1 document sent as UTF-8.
      // q.v. XML 1.0, section 4.3.3 Character Encoding in Entities
      // this only tests one of a large number of conditions that should cause fatal errors
      var doc = kungFuDeathGrip.childNodes[2].contentDocument;
      if (!doc)
        return 5;
      if (!doc.documentElement.tagName == "root")
        return 5;
      if (!doc.documentElement.getElementsByTagName('test'))
        return 5;
      fail("UTF-8 encoded XML document with invalid character did not have a well-formedness error");
    },
Comment 1 Eric Seidel (no email) 2008-01-30 01:32:28 PST
Bleh.  More XML strictness bugs.  If FF or Opera are not strict about this, then this isn't a bug. :(
Comment 2 Alexey Proskuryakov 2008-01-30 01:45:01 PST
We match Firefox here, and changing this may be a compatibility problem.

Also, I am not convinced that the XML spec has any say in how text decoding should work - it is a layering violation for it to impose any restrictions on UTF-8, for example. What if the XML spec imposed restrictions on Ethernet MTU, for example?

This restriction was probably reasonable when XML 1.0 was first published, because the Unicode standard didn't deal with these issues well, but now it does.

Could having this test in Acid3 be reconsidered?
Comment 3 Ian 'Hixie' Hickson 2008-01-30 16:20:22 PST
If you can get the XML WG to change this requirement, I will immediately change the test. However, the spec is hard to misinterpret right now, and this is a major interop danger with XML. I believe testing this in Acid3 is the right thing to do.
Comment 4 Alexey Proskuryakov 2008-01-31 02:37:50 PST
I suggest that we don't try to be the first to "fix" this - exactly because it's a huge interop danger.
Comment 5 Alexey Proskuryakov 2008-01-31 03:14:50 PST
To reiterate, current shipping or preview versions of Firefox, Opera and Safari do not fail to parse <http://hixie.ch/tests/evil/acid/003/empty.xml>. Changing this would not be an interoperability enhancement, nor would it make the Web a better place in some other way I can see.
Comment 6 Ian 'Hixie' Hickson 2008-01-31 11:28:50 PST
Ironically, IE7 does show an error when you go to empty.xml directly. So...
Comment 7 Alexey Proskuryakov 2008-01-31 13:02:18 PST
I think that the most important case for interoperability is what happens for XHTML. IE is not really a reference point in that case.

Surely, we don't want to the first to start failing on pages that work in all other browsers (as IE would just get text/html MIME type from the server).
Comment 8 Maciej Stachowiak 2008-02-20 18:29:43 PST
Actually, I think the most important interoperability test case for this is XMLHttpRequest - that's likely the most common use of XML on the web. Hixie points out that IE passes <http://www.hixie.ch/tests/adhoc/dom/web-apps/XMLHttpRequest/020.html>. XHTML with an XML MIME type is pretty rare on the web and generally only used by people who care more about strictness than their site not breaking. So I think we should fix this.
Comment 9 Rosyna 2008-03-18 19:00:28 PDT
Sadly, this is not a layering violation any more C10 of the Unicode 5.0 specification Chapter 3 (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf) says:

"When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters. "

It may have been a layering violation when the XML part of this was originally written. But it isn't as of Unicode 5.0, since it explicitly says "error condition". And these XML specs are free to specify exactly what happens in an "error condition".

(Many Text based apps just replace the invalids with U+FFFD (http://www.fileformat.info/info/unicode/char/fffd/index.htm) as it is far worse for these applications to bail when encountering invalid codepoints)
Comment 10 Eric Seidel (no email) 2008-03-25 00:32:44 PDT
So the changes here need to happen to the various TextDecoder classes.  I think for TextDecoderICU we can implement a "strict" mode using error callbacks which are covered by this documentation:
http://www.icu-project.org/apiref/icu4c/ucnv__err_8h.html#1e9f87a69d75288c0f93bd77a6f2c9db
currently we don't use any error callbacks.

We would then create the TextDecoder in a "strict" mode (or set it in a strict mode once we've determined we're handling XML) and then check if it had an error in the write() function in the XMLTokenizer.  If it had an error we would stop tokenizing and display the error (using the existing error reporting functions in XMLTokenizer).
Comment 11 Eric Seidel (no email) 2008-03-25 16:43:16 PDT
It looks like we will need to change TextResourceDecoder to so that when it detects an XML content type it should go into a "strict" mode and set a flag on error that the XMLTokenizer can then detect and react to.  I expect this patch will be a lot of plumbing.
Comment 12 Eric Seidel (no email) 2008-03-25 18:51:10 PDT
I'm testing a patch now.
Comment 13 Eric Seidel (no email) 2008-03-26 01:04:24 PDT
Created attachment 20077 [details]
Fix, makes us pass Test 70.  Needs test cases!

 WebCore/WebCore.base.exp                       |    2 +-
 WebCore/dom/XMLTokenizer.cpp                   |    5 ++
 WebCore/loader/CachedFont.cpp                  |    4 ++
 WebCore/loader/TextResourceDecoder.cpp         |    7 ++-
 WebCore/loader/TextResourceDecoder.h           |    3 +
 WebCore/platform/text/TextCodec.h              |   13 +++--
 WebCore/platform/text/TextCodecICU.cpp         |   65 +++++++++++++++++++-----
 WebCore/platform/text/TextCodecICU.h           |    5 ++-
 WebCore/platform/text/TextCodecLatin1.cpp      |    2 +-
 WebCore/platform/text/TextCodecLatin1.h        |    2 +-
 WebCore/platform/text/TextCodecUTF16.cpp       |    2 +-
 WebCore/platform/text/TextCodecUTF16.h         |    2 +-
 WebCore/platform/text/TextCodecUserDefined.cpp |    2 +-
 WebCore/platform/text/TextCodecUserDefined.h   |    2 +-
 WebCore/platform/text/TextDecoder.cpp          |   11 +++--
 WebCore/platform/text/TextDecoder.h            |    8 ++--
 WebCore/platform/text/TextEncoding.cpp         |    4 +-
 WebCore/platform/text/TextEncoding.h           |    7 ++-
 WebCore/platform/text/mac/TextCodecMac.cpp     |   14 +++--
 WebCore/platform/text/mac/TextCodecMac.h       |    3 +-
 20 files changed, 116 insertions(+), 47 deletions(-)
Comment 14 Eric Seidel (no email) 2008-03-26 01:14:07 PDT
(In reply to comment #13)
> Created an attachment (id=20077) [edit]
> Fix, makes us pass Test 70.  Needs test cases!

Basically I need to test every single encoding we support (to make sure I test all of the various decoders).  Anyone know how to easily create an invalid char in any encoding?  Maybe some encodings don't have invalid codepoints?
Comment 15 Alexey Proskuryakov 2008-03-26 02:05:42 PDT
Latin1 and user-defined can always be decoded; UTF is pretty easy (just make an unpaired surrogate), and the general case is probably easy to test with Shift-JIS (I don't have an example handy, but some random noise should suffice, I think).
Comment 16 Eric Seidel (no email) 2008-03-26 08:56:46 PDT
(In reply to comment #15)
> Latin1 and user-defined can always be decoded; UTF is pretty easy (just make an
> unpaired surrogate), and the general case is probably easy to test with
> Shift-JIS (I don't have an example handy, but some random noise should suffice,
> I think).
> 

Hum... Now I just need a way to produce an invalid char in an encoding which goes through TextDecoderMac.  x-mac-hebrew is the example used in the layout test attached to bug 4195...  Perhaps you have a suggestion of an encoding and an invalid code-point I could use for testing?
Comment 17 Adam Roben (:aroben) 2008-03-26 09:29:30 PDT
Comment on attachment 20077 [details]
Fix, makes us pass Test 70.  Needs test cases!

It looks like this patch will only fix platforms which use ICU.

I think the ErrorCallbackSetter constructor/destructor should return early when m_shouldStopOnEncodingErrorsis false.

Other than that this looks good (though of course we still need tests).
Comment 18 Eric Seidel (no email) 2008-03-26 09:48:12 PDT
Created attachment 20083 [details]
Fix, makes us pass Test 70, now includes minimal testing

 LayoutTests/fast/encoding/invalid-xml.html         |   12 ++++
 .../encoding/resources/invalid-xml-shift-jis.xml   |    5 ++
 .../fast/encoding/resources/invalid-xml-utf16.xml  |  Bin 0 -> 444 bytes
 .../fast/encoding/resources/invalid-xml-utf8.xml   |    5 ++
 .../encoding/resources/invalid-xml-x-mac-thai.xml  |    5 ++
 LayoutTests/fast/encoding/resources/invalid-xml.js |   43 +++++++++++++
 WebCore/WebCore.base.exp                           |    2 +-
 WebCore/dom/XMLTokenizer.cpp                       |    5 ++
 WebCore/loader/CachedFont.cpp                      |    4 +
 WebCore/loader/TextResourceDecoder.cpp             |    7 +-
 WebCore/loader/TextResourceDecoder.h               |    3 +
 WebCore/platform/text/TextCodec.h                  |   13 +++-
 WebCore/platform/text/TextCodecICU.cpp             |   65 ++++++++++++++++----
 WebCore/platform/text/TextCodecICU.h               |    5 +-
 WebCore/platform/text/TextCodecLatin1.cpp          |    2 +-
 WebCore/platform/text/TextCodecLatin1.h            |    2 +-
 WebCore/platform/text/TextCodecUTF16.cpp           |    2 +-
 WebCore/platform/text/TextCodecUTF16.h             |    2 +-
 WebCore/platform/text/TextCodecUserDefined.cpp     |    2 +-
 WebCore/platform/text/TextCodecUserDefined.h       |    2 +-
 WebCore/platform/text/TextDecoder.cpp              |   11 ++-
 WebCore/platform/text/TextDecoder.h                |    8 +-
 WebCore/platform/text/TextEncoding.cpp             |    4 +-
 WebCore/platform/text/TextEncoding.h               |    7 ++-
 WebCore/platform/text/mac/TextCodecMac.cpp         |   14 +++--
 WebCore/platform/text/mac/TextCodecMac.h           |    3 +-
 26 files changed, 186 insertions(+), 47 deletions(-)
Comment 19 Eric Seidel (no email) 2008-03-26 10:03:33 PDT
Comment on attachment 20083 [details]
Fix, makes us pass Test 70, now includes minimal testing

I think this might actually be ready for review.  All test cases pass.  I may need to break out the mac-encoding tests into a separate file, since I'm not sure this test would pass on other platforms.
Comment 20 Darin Adler 2008-03-26 10:08:36 PDT
Comment on attachment 20083 [details]
Fix, makes us pass Test 70, now includes minimal testing

 672     if (m_doc->decoder() && m_doc->decoder()->sawError())
 673         // If the decoder saw an error, report it as fatal (stops parsing)
 674         handleError(fatal, "Encoding error", lineNumber(), columnNumber());

Needs braces. We normally put braces around multiple lines even if one is a comment.

r=me
Comment 21 Eric Seidel (no email) 2008-03-26 10:21:30 PDT
Thanks.  Oops.  I think I forgot to fix the if, will do.

r31316
Comment 22 Alexey Proskuryakov 2009-03-05 23:38:59 PST
Follow-up in <http://trac.webkit.org/changeset/41476> - don't be strict with XMLHttpRequest responses.

Note that IE has quite a different model - it decodes responseXML and responseText separately, so they can even use different encodings.
Comment 23 Lucas Forschler 2019-02-06 09:03:30 PST
Mass moving XML DOM bugs to the "DOM" Component.