Bug 4195 - REGRESSION: KOI8-U encoding no longer supported
Summary: REGRESSION: KOI8-U encoding no longer supported
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P1 Normal
Assignee: Alexey Proskuryakov
URL: http://www.oper.ru
Keywords: InRadar, Regression
Depends on:
Blocks: 5232
  Show dependency treegraph
 
Reported: 2005-07-29 04:55 PDT by Alexey Proskuryakov
Modified: 2006-07-14 23:52 PDT (History)
6 users (show)

See Also:


Attachments
what is missing from ICU (4.84 KB, text/plain)
2005-12-12 14:15 PST, Alexey Proskuryakov
no flags Details
what is missing from ICU (4.88 KB, text/plain)
2005-12-14 13:36 PST, Alexey Proskuryakov
no flags Details
proposed fix (82.28 KB, patch)
2006-07-14 13:03 PDT, Alexey Proskuryakov
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Proskuryakov 2005-07-29 04:55:39 PDT
Steps to reproduce:
1. Go to http://www.oper.ru.
2. Manually set the encoding to KOI8-U

Results: empty page, empty "view source", errors in debug log:

=================
ERROR: the ICU Converter won't convert from text encoding 0xA08, error 4
(/Users/ap/WebKit/WebCore/kwq/KWQTextCodec.mm:379 UErrorCode 
KWQTextDecoder::createICUConverter())
=================

Marking P1, since this is a regression.
Comment 1 Alexey Proskuryakov 2005-07-29 05:00:16 PDT
Apparently caused by changes from 2005-07-12: "Switched over from TEC to ICU for unicode text 
conversion."

It would be interesting to know why it was decided to use ICU directly, not CFString (which is open source, 
cross-platform, and well optimized for performance).
Comment 2 Alexey Proskuryakov 2005-08-15 11:17:47 PDT
  Apparently, the reason is that ICU 3.2 (version shipped with 10.4) didn't support KOI8-U. This encoding 
has been added in ICU 3.4, so it won't be an issue in 10.5.

  Still, not using CFString looks strange to me... Apple strongly discourages external developers from 
directly using ICU - eat your own dog food! :)
Comment 3 Alexey Proskuryakov 2005-10-01 11:46:12 PDT
ICU usage has been discussed elsewhere, e.g. in bug 4821.
Comment 4 Darin Adler 2005-10-10 08:45:48 PDT
We're going to need to fix this one way or another. Someone should check with Deborah Goldsmith on 
why ICU is not handling this encoding. It could be as simple as fixing the string we pass in to ICU.
Comment 5 Alexey Proskuryakov 2005-10-10 09:56:00 PDT
Here is the ICU bug: <http://dev.icu-project.org/cgi-bin/icu-bugs/closed?
id=1143;expression=koi;user=guest;searchclosed=1>. Support for KOI8-U just isn't present in the ICU 
version shipped with Tiger. 
Comment 6 Adele Peterson 2005-10-10 10:00:32 PDT
we'll have to restore the old TEC codepath until ICU has support for all the codesets we need.
Comment 7 Alice Liu 2005-11-04 13:59:43 PST
related to <rdar://problem/3546838>

some more detail about behavior with different builds: 
behavior observed in 416.12
1) load  http://www.oper.ru/ - page shows up but encoded wrong
2) change encoding to from default to KOI8-U - page shows up correctly in ukrainian (it doesn't turn 
blank)

behavior observed in TOT:
1) load http://www.oper.ru/ - page shows up correctly in ukrainian
2) change encoding from default to KOI8-U - page turns blank

Comment 8 Alexey Proskuryakov 2005-11-04 14:14:15 PST
(In reply to comment #7)
ToT renders this site correctly because of a fix in bug 3590.
Comment 9 Daniel Udey 2005-11-09 06:32:53 PST
(In reply to comment #8)
> ToT renders this site correctly because of a fix in bug 3590.

This bug is confirmed fixed in the latest ToT, marking as such.
Comment 10 Alexey Proskuryakov 2005-11-09 09:50:07 PST
(In reply to comment #9)
> > ToT renders this site correctly because of a fix in bug 3590.
> This bug is confirmed fixed in the latest ToT, marking as such.

Which has nothing to do with _this_ bug...
Comment 11 Darin Adler 2005-11-09 10:55:15 PST
Exactly. The site is fixed, but the bug is not, and it's still a P1.
Comment 12 Alexey Proskuryakov 2005-12-12 14:15:09 PST
Created attachment 5054 [details]
what is missing from ICU

Lists all aliases that are not known to Tiger version of ICU. As it turns out,
there are quite a few (both aliases and encodings).

I haven't analyzed which encodings have actually regressed (even if ICU knows
an encoding by some aliases, it doesn't mean that WebCore gives it one of
these). Of course, all encodings that are not supported by ICU no longer work,
and many of these aren't even in the current ICU
<http://www-950.ibm.com/software/globalization/icu/demo/converters>.

Some possible bugs in WebCore's tables are also highlighted (one of these filed
as bug 4362; I am not sure which is actually correct in other cases).
Comment 13 Alexey Proskuryakov 2005-12-14 13:36:45 PST
Created attachment 5081 [details]
what is missing from ICU

Added mac-cyrillic; corrected x-mac-ukrainian comments. For info on
MacUkrainian, see 
<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/UKRAINE.TXT>.
Comment 14 Alice Liu 2006-01-10 10:05:57 PST
<rdar://problem/4404312>
Comment 15 Joost de Valk (AlthA) 2006-01-22 04:33:45 PST
Adding Regression keyword.
Comment 16 Alexey Proskuryakov 2006-07-13 12:07:51 PDT
Taking this.
Comment 17 Darin Adler 2006-07-14 06:06:03 PDT
FYI, what I'd really like to see is an implementation that uses ICU for a lot more, eliminating our own decoding table except for the few exceptions where ICU doesn't do the right thing in either looking up an encoding by name or in handling character sets.

So while it doesn't have to be part of this bug fix, I'd like to see mac-encodings.txt and its friends done away with entirely.
Comment 18 Alexey Proskuryakov 2006-07-14 06:23:12 PDT
Yes, the fact that encoding names have to round-trip via CFStringEncoding, even when handled by ICU, seems unfortunate to me, too. In this patch, I'm taking a conservative route though, probably not even getting rid of DeprecatedStrings (to avoid the need to change all clients).
Comment 19 Alexey Proskuryakov 2006-07-14 13:03:08 PDT
Created attachment 9453 [details]
proposed fix

At the moment, StreamingTextDecoderMac is a complete implementation, and I have verified that all existing tests pass even with StreamingTextDecoderICU disabled. I haven't attempted to factor out any common code, as the TEC/CFString code path will be simplified in the future; also, Unity may have a largely different implementation with its Qt back-end.

WebCoreTextDecoder was dead code (used for WebTextView before it was moved to WebCore).
Comment 20 Darin Adler 2006-07-14 18:45:29 PDT
Comment on attachment 9453 [details]
proposed fix

Looks great! r=me
Comment 21 Alexey Proskuryakov 2006-07-14 23:52:15 PDT
Committed revision 15449.

Added one more test (http/tests/misc/BOM-override-script.html) with additional examples of BOM handling.