Steps to reproduce:
1. Go to http://www.oper.ru.
2. Manually set the encoding to KOI8-U.
Results: empty page, empty "view source", errors in debug log:
=================
ERROR: the ICU Converter won't convert from text encoding 0xA08, error 4
(/Users/ap/WebKit/WebCore/kwq/KWQTextCodec.mm:379 UErrorCode KWQTextDecoder::createICUConverter())
=================
Marking P1, since this is a regression.
Apparently caused by changes from 2005-07-12: "Switched over from TEC to ICU for unicode text conversion." It would be interesting to know why it was decided to use ICU directly, not CFString (which is open source, cross-platform, and well optimized for performance).
Apparently, the conversion fails because ICU 3.2 (the version shipped with 10.4) doesn't support KOI8-U. This encoding was added in ICU 3.4, so it won't be an issue in 10.5. Still, not using CFString looks strange to me... Apple strongly discourages external developers from using ICU directly - eat your own dog food! :)
ICU usage has been discussed elsewhere, e.g. in bug 4821.
We're going to need to fix this one way or another. Someone should check with Deborah Goldsmith on why ICU is not handling this encoding. It could be as simple as fixing the string we pass in to ICU.
Here is the ICU bug: <http://dev.icu-project.org/cgi-bin/icu-bugs/closed?id=1143;expression=koi;user=guest;searchclosed=1>. Support for KOI8-U just isn't present in the ICU version shipped with Tiger.
We'll have to restore the old TEC codepath until ICU has support for all the codesets we need.
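To illustrate the fallback idea, here is a minimal sketch in Python, not WebKit code. A primary back end that knows only some encodings is tried first, and a legacy back end handles the rest. The names `PRIMARY_SUPPORTED`, `decode_primary`, `decode_legacy`, and `decode` are hypothetical, and Python's built-in codecs stand in for both ICU and TEC.

```python
import codecs

# Hypothetical: encodings the "primary" back end (standing in for ICU) handles.
# KOI8-U is deliberately left out to mimic the situation described in this bug.
PRIMARY_SUPPORTED = {"utf-8", "koi8-r", "cp1251"}


def decode_primary(data: bytes, encoding: str) -> str:
    """Decode with the primary back end; raise LookupError if unsupported."""
    # codecs.lookup() normalises aliases, e.g. "KOI8-U" -> "koi8-u".
    name = codecs.lookup(encoding).name
    if name not in PRIMARY_SUPPORTED:
        raise LookupError(f"primary back end has no converter for {name}")
    return data.decode(name)


def decode_legacy(data: bytes, encoding: str) -> str:
    """Decode with the legacy back end (standing in for the old TEC codepath)."""
    return data.decode(encoding)


def decode(data: bytes, encoding: str) -> tuple[str, str]:
    """Return (text, back end used), falling back when the primary cannot help."""
    try:
        return decode_primary(data, encoding), "primary"
    except LookupError:
        return decode_legacy(data, encoding), "legacy"


if __name__ == "__main__":
    print(decode("Київ".encode("koi8-u"), "KOI8-U"))
```

The caller never needs to know which back end produced the text. The second tuple element is returned only to make the fallback visible in this sketch.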
Related to <rdar://problem/3546838>. Some more detail about behavior with different builds:
Behavior observed in 416.12:
1) Load http://www.oper.ru/ - page shows up but encoded wrong
2) Change encoding from default to KOI8-U - page shows up correctly in Ukrainian (it doesn't turn blank)
Behavior observed in TOT:
1) Load http://www.oper.ru/ - page shows up correctly in Ukrainian
2) Change encoding from default to KOI8-U - page turns blank
(In reply to comment #7) ToT renders this site correctly because of a fix in bug 3590.
(In reply to comment #8)
> ToT renders this site correctly because of a fix in bug 3590.
This bug is confirmed fixed in the latest ToT, marking as such.
(In reply to comment #9)
> > ToT renders this site correctly because of a fix in bug 3590.
> This bug is confirmed fixed in the latest ToT, marking as such.
Which has nothing to do with _this_ bug...
Exactly. The site is fixed, but the bug is not, and it's still a P1.
Created attachment 5054 [details]
what is missing from ICU
Lists all aliases that are not known to the Tiger version of ICU. As it turns out, there are quite a few (both aliases and encodings). I haven't analyzed which encodings have actually regressed (even if ICU knows an encoding by some aliases, that doesn't mean WebCore gives it one of these). Of course, all encodings that are not supported by ICU no longer work, and many of these aren't even in the current ICU <http://www-950.ibm.com/software/globalization/icu/demo/converters>. Some possible bugs in WebCore's tables are also highlighted (one of these is filed as bug 4362; I am not sure which is actually correct in the other cases).
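For anyone who wants to repeat this kind of check, the following is a rough Python sketch of the approach: walk a list of encoding names and collect the ones the converter library cannot resolve. It queries Python's codec registry rather than ICU, so its results say nothing about Tiger's ICU. The function name `unknown_aliases` and the sample list are made up for illustration.

```python
import codecs


def unknown_aliases(names):
    """Return the names for which no codec is registered, in input order."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)  # raises LookupError for unknown names
        except LookupError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical sample; the last entry is intentionally not a real encoding.
    candidates = ["koi8-u", "KOI8-R", "windows-1251", "x-hypothetical-encoding"]
    print(unknown_aliases(candidates))
```

Against ICU itself, the same loop would try to open a converter for each name and record the failures.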
Created attachment 5081 [details]
what is missing from ICU
Added mac-cyrillic; corrected x-mac-ukrainian comments. For info on MacUkrainian, see <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/UKRAINE.TXT>.
<rdar://problem/4404312>
Adding Regression keyword.
Taking this.
FYI, what I'd really like to see is an implementation that uses ICU for a lot more, eliminating our own decoding table except for the few exceptions where ICU doesn't do the right thing in either looking up an encoding by name or in handling character sets. So while it doesn't have to be part of this bug fix, I'd like to see mac-encodings.txt and its friends done away with entirely.
Yes, the fact that encoding names have to round-trip via CFStringEncoding, even when handled by ICU, seems unfortunate to me, too. In this patch, I'm taking a conservative route though, probably not even getting rid of DeprecatedStrings (to avoid the need to change all clients).
Created attachment 9453 [details]
proposed fix
At the moment, StreamingTextDecoderMac is a complete implementation, and I have verified that all existing tests pass even with StreamingTextDecoderICU disabled. I haven't attempted to factor out any common code, as the TEC/CFString code path will be simplified in the future; also, Unity may have a largely different implementation with its Qt back-end. WebCoreTextDecoder was dead code (used for WebTextView before it was moved to WebCore).
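Going only by the class names, these decoders consume input as it arrives in chunks. As a rough illustration of what that requires (a Python sketch, not the code in the patch), a streaming decoder has to carry state across chunk boundaries so that a multi-byte character split between two chunks still decodes correctly. The function name `decode_chunks` is hypothetical.

```python
import codecs


def decode_chunks(chunks, encoding):
    """Decode an iterable of byte chunks, keeping state between chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and reports a truncated trailing sequence.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    data = "Київ".encode("utf-8")
    # Split in the middle of the second character (each letter is two bytes).
    print(decode_chunks([data[:3], data[3:]], "utf-8"))
```

Decoding each chunk independently would fail or produce replacement characters at the split, which is why the decoder object is kept alive across calls.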
Comment on attachment 9453 [details]
proposed fix
Looks great! r=me
Committed revision 15449. Added one more test (http/tests/misc/BOM-override-script.html) with additional examples of BOM handling.
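For readers unfamiliar with BOM handling, here is a minimal Python sketch of the general rule, assuming (from the test's name only) that it exercises a byte order mark taking precedence over an encoding declared elsewhere. This is not the WebKit implementation; `BOMS` and `decode_with_bom_override` are hypothetical names, and UTF-32 BOMs are omitted for brevity.

```python
import codecs

# BOMs recognised by this sketch, paired with the encoding each one implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom_override(data: bytes, declared: str) -> str:
    """Decode data, letting a leading BOM override the declared encoding."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not end up in the decoded text.
            return data[len(bom):].decode(encoding)
    return data.decode(declared)


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "Київ".encode("utf-8")
    # The declared charset is wrong on purpose; the BOM wins.
    print(decode_with_bom_override(with_bom, "koi8-u"))
```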