Bug 21990 - When a rare EUC-JP character is present, explicitly (and correctly) labelled EUC-JP document is mistreated as Shift_JIS
Status: RESOLVED FIXED
Product: WebKit
Component: Page Loading
Version: 528+ (Nightly build)
Platform: All All
Importance: P2 Normal
Assigned To:
: http://www.google.com/search?hl=en&in...
: 16482
Reported: 2008-10-30 16:24 PST by
Modified: 2009-09-02 14:27 PST (History)


Attachments
proposed patch (7.69 KB, patch)
2009-09-01 15:36 PST, Alexey Proskuryakov
darin: review+




Description From 2008-10-30 16:24:57 PST
1. Go to 
http://www.google.com/search?hl=en&inlang=ja&ie=EUC-JP&oe=EUC-JP&q=%8F%A2%C3&btnG=Search

(It is explicitly and correctly labelled as EUC-JP in the HTTP Content-Type header field.)

2. You'd see '召テ'  instead of '¦'.

3. The latter is represented as 0x8F 0xA2 0xC3 in EUC-JP (3 bytes).

The Japanese encoding detector in TextResourceDecoder.cpp is fooled by the 0x8F byte and misdetects the document as Shift_JIS.
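
To see why the detector's guess is plausible but wrong, the three bytes can be decoded both ways. This is a minimal sketch using Python's standard codecs, not WebKit code; `RAW` and `decode_as` are names made up for illustration.

```python
# The three bytes from the report (percent-encoded as %8F%A2%C3 in the query).
RAW = bytes([0x8F, 0xA2, 0xC3])


def decode_as(codec: str) -> str:
    """Decode the reported byte sequence with the given Python codec name."""
    return RAW.decode(codec)


# In EUC-JP, 0x8F introduces a three-byte JIS X 0212 character.
as_euc_jp = decode_as("euc_jp")

# In Shift_JIS, 0x8F 0xA2 is a two-byte kanji and 0xC3 is a
# single-byte half-width katakana, so the same bytes are also valid.
as_shift_jis = decode_as("shift_jis")

print(len(as_euc_jp), [hex(ord(c)) for c in as_euc_jp])
print(len(as_shift_jis), [hex(ord(c)) for c in as_shift_jis])
```

Both decodes succeed, so a byte-pattern heuristic has no hard error to rule out Shift_JIS; only the declared label disambiguates.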

I think the condition for invoking the Japanese encoding detector is too liberal:

if (m_source != UserChosenEncoding && m_source != AutoDetectedEncoding && encoding().isJapanese())

No encoding detector is perfect, and I'd rather not invoke any encoding detector (Unicode BOM detection can be an exception) for documents with an explicit charset declaration (HTTP header or meta). After resolving bug 16482 (ICU encoding detector hook-up), I'll revisit this issue.
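
The difference between the existing condition and the stricter one suggested here can be modelled roughly as follows. This is a Python sketch, not the WebKit source: the `EncodingSource` members and both function names are assumptions made for illustration, and the patch that eventually landed may use a different condition.

```python
from enum import Enum, auto


class EncodingSource(Enum):
    """Hypothetical model of where the current encoding came from."""
    DEFAULT = auto()
    AUTO_DETECTED = auto()
    META_TAG = auto()
    HTTP_HEADER = auto()
    USER_CHOSEN = auto()


# Sources that count as an explicit declaration in this sketch.
EXPLICIT_SOURCES = {
    EncodingSource.META_TAG,
    EncodingSource.HTTP_HEADER,
    EncodingSource.USER_CHOSEN,
}


def current_should_detect(source: EncodingSource, is_japanese: bool) -> bool:
    """Mirrors the quoted condition: only user-chosen and already
    auto-detected encodings are exempt from the Japanese detector."""
    return (source is not EncodingSource.USER_CHOSEN
            and source is not EncodingSource.AUTO_DETECTED
            and is_japanese)


def proposed_should_detect(source: EncodingSource, is_japanese: bool) -> bool:
    """Stricter variant: never second-guess an explicit declaration."""
    return (source not in EXPLICIT_SOURCES
            and source is not EncodingSource.AUTO_DETECTED
            and is_japanese)


# The page in this report: EUC-JP declared in the HTTP header.
print(current_should_detect(EncodingSource.HTTP_HEADER, True))
print(proposed_should_detect(EncodingSource.HTTP_HEADER, True))
```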
------- Comment #1 From 2008-10-31 03:26:00 PST -------
Makes sense to me, but I don't know which use cases the encoding detector was supposed to fix by original design. Is there a chance that there is some amount of mislabeled content, correctly rendered by other browsers for whatever reasons? I guess that's unlikely.
------- Comment #2 From 2008-10-31 09:52:18 PST -------
The Japanese encoding detector was originally intended at least in part to make mislabeled pages work correctly. Limiting the automatic detection only to pages that are not labeled with a charset at all will almost certainly break some websites.

I don't know how to make a good decision about this. I'm not an expert on the state of the art in encoding in Japanese-language websites, nor do I know what the other major web browsers currently do about this.
------- Comment #3 From 2008-11-03 15:35:26 PST -------
Ooops. I wrote a long reply last week and thought I submitted it, but apparently moved away before submitting it (I shouldn't open too many tabs :-) )

Let me rewrite what I wrote before:

1. We should never invoke it without an explicit user request, even when it's almost perfect. Currently, WebKit does not offer a way to control it. Bug 16482 adds a settings/preference entry for that, among other things.

2. Until we have a very good quality encoding detector (I'd regard none of the encoding detectors used in web browsers today as clearing the bar, and neither does ICU's), we should NOT invoke it for a page with an explicitly (and more often than not, correctly) specified encoding (meta or HTTP), even if a user turns on the detector. This is what Firefox does and what I implemented in bug 16482. On the other hand, MS IE behaves differently (I'm not sure exactly what it does).

3. When we have a really good detector, we may reconsider #2.  

For this particular bug, I can't get rid of the built-in Japanese detector completely yet because ICU's encoding detector does not detect ISO-2022-JP, but I propose we use the same condition for invoking the built-in encoding detector as I do for ICU's detector in the patch for bug 16482.

How does it sound? 


BTW, this was independently reported for Chrome ( http://code.google.com/p/chromium/issues/detail?id=3799 )
------- Comment #4 From 2008-11-03 15:39:03 PST -------
This sounds like a good declaration of principles.

But how can we figure out what compatibility impact this change will have? Is our current auto-detection useless or useful? How do you know?
------- Comment #5 From 2008-11-03 16:27:41 PST -------
See also: <rdar://6007713>, <rdar://5934750> (which have examples of sites with similar problems).

(In reply to comment #3)
> On the other hand, MS IE behaves differently (I'm not sure exactly what it does)

Is it possible to find out? When I face a weird IE behavior that I cannot figure out myself, I'm often able to find it discussed and thoroughly bisected on the net.

We have 3 or 4 reports of problems caused by overriding an explicitly specified charset accumulated over the years. This is sufficient to strongly consider changing this behavior, but it is likely that we will have to revisit and defend it in the future, so I also would like to gather as much information as possible.
------- Comment #6 From 2009-06-24 15:47:23 PST -------
As mentioned in a WhatWG e-mail [1], IE partly avoids the problem of mislabelled CJK pages by merging 7-bit and 8-bit character sets. In particular, ISO-2022-JP and Shift_JIS are merged, which means that ISO-2022-JP mislabelled as Shift_JIS or vice versa still works correctly.

Implementing this in WebKit should reduce the need for encoding detection for Japanese.

[1] <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-April/019322.html>
------- Comment #7 From 2009-07-18 04:26:16 PST -------
The description in my previous comment was slightly inaccurate. Merging of 7-bit and 8-bit CJK encodings in IE seems to work as follows:

Declared charset -> Actual encoding used, ‘+’ indicating union

HZ -> HZ + GBK
EUC-CN or GBK -> GBK

ISO-2022-JP -> ISO-2022-JP + Windows-31J
Shift_JIS or Windows-31J -> Windows-31J

ISO-2022-KR -> ISO-2022-KR + Windows-949
EUC-KR or Windows-949 -> ISO-2022-KR + Windows-949

In other words:
— 7-bit encodings (HZ, ISO-2022-JP, ISO-2022-KR) are enhanced with the most popular and comprehensive 8-bit encoding for the same locale (GBK, Windows-31J, Windows-949);
— for Korean, the 8-bit encoding (Windows-949) is enhanced with the corresponding 7-bit encoding (ISO-2022-KR) as well; and
— ‘small’ 8-bit encodings (EUC-CN, Shift_JIS, EUC-KR) are treated as their corresponding ‘large’ superset counterparts (GBK, Windows-31J, Windows-949).

Obviously, this makes IE more resilient to encoding declaration errors and might be worth replicating.
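
The table above can be written down as a simple lookup. This is a sketch only: the mapping restates the observations in this comment, the Python codec names (`cp932` for Windows-31J, `cp949` for Windows-949) are stand-ins chosen for illustration, and `effective_encodings` is a hypothetical helper, not an IE or WebKit API.

```python
import codecs

# Declared charset label -> encodings whose union IE appears to accept,
# expressed with Python codec names.
IE_EFFECTIVE_ENCODINGS = {
    "hz": ("hz", "gbk"),
    "euc-cn": ("gbk",),
    "gbk": ("gbk",),
    "iso-2022-jp": ("iso2022_jp", "cp932"),
    "shift_jis": ("cp932",),
    "windows-31j": ("cp932",),
    "iso-2022-kr": ("iso2022_kr", "cp949"),
    "euc-kr": ("iso2022_kr", "cp949"),
    "windows-949": ("iso2022_kr", "cp949"),
}


def effective_encodings(declared: str) -> tuple:
    """Return the encodings to accept for a declared label.

    Labels not covered by the table are returned unchanged.
    """
    key = declared.strip().lower()
    return IE_EFFECTIVE_ENCODINGS.get(key, (declared,))


for label in ("ISO-2022-JP", "Shift_JIS", "EUC-KR"):
    # codecs.lookup confirms each name is a codec Python actually ships.
    names = [codecs.lookup(name).name for name in effective_encodings(label)]
    print(label, "->", " + ".join(names))
```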
------- Comment #8 From 2009-09-01 15:36:35 PST -------
Created an attachment (id=38891): proposed patch
------- Comment #9 From 2009-09-01 17:07:12 PST -------
Committed <http://trac.webkit.org/changeset/47950>.
------- Comment #10 From 2009-09-02 14:27:40 PST -------
Glad that this was finally resolved. Chromium has been carrying a local fork for this.


(In reply to comment #7)
> The description in my previous comment was slightly inaccurate. Merging of
> 7-bit and 8-bit CJK encodings in IE seems to work as follows:
> 
> Declared charset -> Actual encoding used, ‘+’ indicating union

I'm not sure if we want to do this. I suspect that there are not many benefits while I'm afraid there is some risk. 


> — for Korean, the 8-bit encoding (Windows-949) is enhanced with the
> corresponding 7-bit encoding (ISO-2022-KR) as well; and

I don't think this is necessary. Virtually no Korean web pages use ISO-2022-KR.

> — ‘small’ 8-bit encodings (EUC-CN, Shift_JIS, EUC-KR) are treated as their
> corresponding ‘large’ superset counterparts (GBK, Windows-31J, Windows-949).

That's already done by WebKit (and Firefox) and is even listed in the HTML5 spec. There are some other subset => superset mappings done by WebKit (TIS620 < ISO-8859-11 < Windows-874 for Thai and ISO-8859-9 < windows-125? for Turkish).
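
The reason a subset => superset mapping is low-risk can be checked with Python's standard codecs, using `cp949` as a stand-in for Windows-949. This is an illustrative sketch, not browser code; `decodes`, `PLAIN` and `EXTENDED` are names invented here.

```python
def decodes(data: bytes, codec: str) -> bool:
    """Return True if `data` decodes without error under `codec`."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# Text that fits in plain EUC-KR.
PLAIN = "한글".encode("euc_kr")

# A byte pair whose lead byte (0x81) lies outside the EUC-KR
# lead-byte range and is only assigned in the Windows-949 extension.
EXTENDED = b"\x81\x41"

# The superset decodes subset content identically...
print(PLAIN.decode("euc_kr") == PLAIN.decode("cp949"))

# ...and additionally accepts bytes the subset rejects.
print(decodes(EXTENDED, "euc_kr"), decodes(EXTENDED, "cp949"))
```

Treating a page labelled EUC-KR as Windows-949 therefore changes nothing for correctly labelled content and only rescues bytes that would otherwise be errors.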