Bug 4195

Summary: REGRESSION: KOI8-U encoding no longer supported
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: Page LoadingAssignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal CC: adele, alice.barraclough, darin, ian, kocienda, nickshanks
Priority: P1 Keywords: InRadar, Regression
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
URL: http://www.oper.ru
Bug Depends on:    
Bug Blocks: 5232    
Attachments:
Description Flags
what is missing from ICU
none
what is missing from ICU
none
proposed fix darin: review+

Alexey Proskuryakov
Reported 2005-07-29 04:55:39 PDT
Steps to reproduce: 1. Go to http://www.oper.ru. 2. Manually set the encoding to KOI8-U Results: empty page, empty "view source", errors in debug log: ================= ERROR: the ICU Converter won't convert from text encoding 0xA08, error 4 (/Users/ap/WebKit/WebCore/kwq/KWQTextCodec.mm:379 UErrorCode KWQTextDecoder::createICUConverter()) ================= Marking P1, since this is a regression.
Attachments
what is missing from ICU (4.84 KB, text/plain)
2005-12-12 14:15 PST, Alexey Proskuryakov
no flags
what is missing from ICU (4.88 KB, text/plain)
2005-12-14 13:36 PST, Alexey Proskuryakov
no flags
proposed fix (82.28 KB, patch)
2006-07-14 13:03 PDT, Alexey Proskuryakov
darin: review+
Alexey Proskuryakov
Comment 1 2005-07-29 05:00:16 PDT
Apparently caused by changes from 2005-07-12: "Switched over from TEC to ICU for unicode text conversion." It would be interesting to know why it was decided to use ICU directly, not CFString (which is open source, cross-platform, and well optimized for performance).
Alexey Proskuryakov
Comment 2 2005-08-15 11:17:47 PDT
Apparently, the reason is that ICU 3.2 (version shipped with 10.4) didn't support KOI8-U. This encoding has been added in ICU 3.4, so it won't be an issue in 10.5. Still, not using CFString looks strange to me... Apple strongly discourages external developers from directly using ICU - eat your own dog food! :)
Alexey Proskuryakov
Comment 3 2005-10-01 11:46:12 PDT
ICU usage has been discussed elsewhere, e.g. in bug 4821.
Darin Adler
Comment 4 2005-10-10 08:45:48 PDT
We're going to need to fix this one way or another. Someone should check with Deborah Goldsmith on why ICU is not handling this encoding. It could be as simple as fixing the string we pass in to ICU.
Alexey Proskuryakov
Comment 5 2005-10-10 09:56:00 PDT
Here is the ICU bug: <http://dev.icu-project.org/cgi-bin/icu-bugs/closed? id=1143;expression=koi;user=guest;searchclosed=1>. Support for KOI8-U just isn't present in the ICU version shipped with Tiger.
Adele Peterson
Comment 6 2005-10-10 10:00:32 PDT
we'll have to restore the old TEC codepath until ICU has support for all the codesets we need.
Alice Liu
Comment 7 2005-11-04 13:59:43 PST
related to <rdar://problem/3546838> some more detail about behavior with different builds: behavior observed in 416.12 1) load http://www.oper.ru/ - page shows up but encoded wrong 2) change encoding to from default to KOI8-U - page shows up correctly in ukrainian (it doesn't turn blank) behavior observed in TOT: 1) load http://www.oper.ru/ - page shows up correctly in ukrainian 2) change encoding from default to KOI8-U - page turns blank
Alexey Proskuryakov
Comment 8 2005-11-04 14:14:15 PST
(In reply to comment #7) ToT renders this site correctly because of a fix in bug 3590.
Daniel Udey
Comment 9 2005-11-09 06:32:53 PST
(In reply to comment #8) > ToT renders this site correctly because of a fix in bug 3590. This bug is confirmed fixed in the latest ToT, marking as such.
Alexey Proskuryakov
Comment 10 2005-11-09 09:50:07 PST
(In reply to comment #9) > > ToT renders this site correctly because of a fix in bug 3590. > This bug is confirmed fixed in the latest ToT, marking as such. Which has nothing to do with _this_ bug...
Darin Adler
Comment 11 2005-11-09 10:55:15 PST
Exactly. The site is fixed, but the bug is not, and it's still a P1.
Alexey Proskuryakov
Comment 12 2005-12-12 14:15:09 PST
Created attachment 5054 [details] what is missing from ICU Lists all aliases that are not known to Tiger version of ICU. As it turns out, there are quite a few (both aliases and encodings). I haven't analyzed which encodings have actually regressed (even if ICU knows an encoding by some aliases, it doesn't mean that WebCore gives it one of these). Of course, all encodings that are not supported by ICU no longer work, and many of these aren't even in the current ICU <http://www-950.ibm.com/software/globalization/icu/demo/converters>. Some possible bugs in WebCore's tables are also highlighted (one of these filed as bug 4362; I am not sure which is actually correct in other cases).
Alexey Proskuryakov
Comment 13 2005-12-14 13:36:45 PST
Created attachment 5081 [details] what is missing from ICU Added mac-cyrillic; corrected x-mac-ukrainian comments. For info on MacUkrainian, see <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/UKRAINE.TXT>.
Alice Liu
Comment 14 2006-01-10 10:05:57 PST
Joost de Valk (AlthA)
Comment 15 2006-01-22 04:33:45 PST
Adding Regression keyword.
Alexey Proskuryakov
Comment 16 2006-07-13 12:07:51 PDT
Taking this.
Darin Adler
Comment 17 2006-07-14 06:06:03 PDT
FYI, what I'd really like to see is an implementation that uses ICU for a lot more, eliminating our own decoding table except for the few exceptions where ICU doesn't do the right thing in either looking up an encoding by name or in handling character sets. So while it doesn't have to be part of this bug fix, I'd like to see mac-encodings.txt and its friends done away with entirely.
Alexey Proskuryakov
Comment 18 2006-07-14 06:23:12 PDT
Yes, the fact that encoding names have to round-trip via CFStringEncoding, even when handled by ICU, seems unfortunate to me, too. In this patch, I'm taking a conservative route though, probably not even getting rid of DeprecatedStrings (to avoid the need to change all clients).
Alexey Proskuryakov
Comment 19 2006-07-14 13:03:08 PDT
Created attachment 9453 [details] proposed fix At the moment, StreamingTextDecoderMac is a complete implementation, and I have verified that all existing tests pass even with StreamingTextDecoderICU disabled. I haven't attempted to factor out any common code, as the TEC/CFString code path will be simplified in the future; also, Unity may have a largely different implementation with its Qt back-end. WebCoreTextDecoder was dead code (used for WebTextView before it was moved to WebCore).
Darin Adler
Comment 20 2006-07-14 18:45:29 PDT
Comment on attachment 9453 [details] proposed fix Looks great! r=me
Alexey Proskuryakov
Comment 21 2006-07-14 23:52:15 PDT
Committed revision 15449. Added one more test (http/tests/misc/BOM-override-script.html) with additional examples of BOM handling.
Note You need to log in before you can comment on or make changes to this bug.