Summary: | When a rare EUC-JP character is present, an explicitly (and correctly) labelled EUC-JP document is mistreated as Shift_JIS | ||||||
---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Jungshik Shin <jshin> | ||||
Component: | Page Loading | Assignee: | Alexey Proskuryakov <ap> | ||||
Status: | RESOLVED FIXED | ||||||
Severity: | Normal | CC: | ap, darin, emacemac7, pub-webkit | ||||
Priority: | P2 | ||||||
Version: | 528+ (Nightly build) | ||||||
Hardware: | All | ||||||
OS: | All | ||||||
URL: | http://www.google.com/search?hl=en&inlang=ja&ie=EUC-JP&oe=EUC-JP&q=%8F%A2%C3&btnG=Search | ||||||
Bug Depends on: | 16482 | ||||||
Bug Blocks: | |||||||
Attachments: |
|
Description
Jungshik Shin
2008-10-30 16:24:57 PDT
Makes sense to me, but I don't know which use cases the encoding detector was originally designed to fix. Is there a chance that some mislabeled content is rendered correctly by other browsers for whatever reason? I guess that's unlikely.

The Japanese encoding detector was originally intended, at least in part, to make mislabeled pages work correctly. Limiting automatic detection to pages that are not labeled with a charset at all will almost certainly break some websites. I don't know how to make a good decision about this. I'm not an expert on the state of the art in encoding on Japanese-language websites, nor do I know what the other major web browsers currently do about this.

Ooops. I wrote a long reply last week and thought I had submitted it, but apparently I moved away before submitting it (I shouldn't open too many tabs :-) ). Let me rewrite what I wrote before:

1. We should never invoke the detector without an explicit user request, even when it's almost perfect. Currently, WebKit does not offer a way to control it. Bug 16482 adds a settings/preference entry for that, among other things.
2. Until we have a very good encoding detector (I'd regard none of the encoding detectors used in web browsers today as clearing the bar, and ICU's encoding detector doesn't either), we should NOT invoke it for a page with an explicitly (and more often than not, correctly) specified encoding (meta or HTTP), even if a user turns on the detector. This is what Firefox does and what I implemented in bug 16482. On the other hand, MS IE behaves differently (I'm not sure exactly what it does).
3. When we have a really good detector, we may reconsider #2.

For this particular bug, I can't get rid of the built-in Japanese detector completely yet because ICU's encoding detector does not detect ISO-2022-JP, but I propose we use the same condition for invoking the built-in encoding detector as I do for ICU's detector in the patch for bug 16482. How does that sound?
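For illustration, the invocation rule proposed in points 1 and 2 above could be sketched roughly as follows. This is a minimal Python sketch, not WebKit code: `PageEncodingInfo`, `should_run_detector`, and the `user_enabled_detector` flag are hypothetical names chosen for the example, and the actual patch for bug 16482 may structure the check differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageEncodingInfo:
    """Hypothetical container for a page's declared charsets (not a WebKit type)."""
    http_charset: Optional[str] = None  # charset from the HTTP Content-Type header, if any
    meta_charset: Optional[str] = None  # charset from a <meta> declaration, if any


def should_run_detector(page: PageEncodingInfo, user_enabled_detector: bool) -> bool:
    """Return True only when the user opted in and the page declares no charset."""
    if not user_enabled_detector:
        # Point 1: never run detection without an explicit user request.
        return False
    has_explicit_charset = bool(page.http_charset or page.meta_charset)
    # Point 2: an explicit label (HTTP or meta) always takes precedence.
    return not has_explicit_charset
```

Under this sketch, a page labelled EUC-JP is never passed to the detector, so the misdetection as Shift_JIS reported here could not occur for labelled pages.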
BTW, this was independently reported for Chrome ( http://code.google.com/p/chromium/issues/detail?id=3799 )

This sounds like a good declaration of principles. But how can we figure out what compatibility impact this change will have? Is our current auto-detection useless or useful? How do you know?

See also: <rdar://6007713>, <rdar://5934750> (which have examples of sites with similar problems).

(In reply to comment #3)
> On the other hand, MS IE behaves differently (I'm not sure exactly what it does)

Is it possible to find out? When I face a weird IE behavior that I cannot figure out myself, I'm often able to find it discussed and thoroughly bisected on the net.

We have 3 or 4 reports of problems caused by overriding an explicitly specified charset accumulated over the years. This is sufficient to strongly consider changing this behavior, but it is likely that we will have to revisit and defend it in the future, so I also would like to gather as much information as possible.

As mentioned in a WhatWG e-mail [1], IE partly avoids the problem of mislabelled CJK pages by merging 7-bit and 8-bit character sets. In particular, ISO-2022-JP and Shift_JIS are merged, which means that ISO-2022-JP mislabelled as Shift_JIS or vice versa still works correctly. Implementing this in WebKit should reduce the need for encoding detection for Japanese.

[1] <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-April/019322.html>

The description in my previous comment was slightly inaccurate.
Merging of 7-bit and 8-bit CJK encodings in IE seems to work as follows:

Declared charset -> Actual encoding used, ‘+’ indicating union
HZ -> HZ + GBK
EUC-CN or GBK -> GBK
ISO-2022-JP -> ISO-2022-JP + Windows-31J
Shift_JIS or Windows-31J -> Windows-31J
ISO-2022-KR -> ISO-2022-KR + Windows-949
EUC-KR or Windows-949 -> ISO-2022-KR + Windows-949

In other words:
- 7-bit encodings (HZ, ISO-2022-JP, ISO-2022-KR) are enhanced with the most popular and comprehensive 8-bit encoding for the same locale (GBK, Windows-31J, Windows-949);
- for Korean, the 8-bit encoding (Windows-949) is enhanced with the corresponding 7-bit encoding (ISO-2022-KR) as well; and
- ‘small’ 8-bit encodings (EUC-CN, Shift_JIS, EUC-KR) are treated as their corresponding ‘large’ superset counterparts (GBK, Windows-31J, Windows-949).

Obviously, this makes IE more resilient to encoding declaration errors and might be worth replicating.

Created attachment 38891 [details]
proposed patch
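As a reading aid, the IE charset merging described in the comment above could be expressed as a simple lookup table. This is only a Python sketch of the observed mapping as reported in this thread, not IE's or WebKit's implementation; `IE_EFFECTIVE_ENCODINGS` and `effective_encodings` are hypothetical names, and passing unknown labels through unchanged is an assumption made for the example.

```python
# Declared charset -> encodings IE appears to accept (a union when more than one),
# transcribed from the mapping reported in this thread.
IE_EFFECTIVE_ENCODINGS = {
    "hz": ("HZ", "GBK"),
    "euc-cn": ("GBK",),
    "gbk": ("GBK",),
    "iso-2022-jp": ("ISO-2022-JP", "Windows-31J"),
    "shift_jis": ("Windows-31J",),
    "windows-31j": ("Windows-31J",),
    "iso-2022-kr": ("ISO-2022-KR", "Windows-949"),
    "euc-kr": ("ISO-2022-KR", "Windows-949"),
    "windows-949": ("ISO-2022-KR", "Windows-949"),
}


def effective_encodings(declared: str) -> tuple:
    """Return the encodings to try for a declared charset label (case-insensitive).

    Labels outside the table are returned unchanged (an assumption of this sketch).
    """
    label = declared.strip()
    return IE_EFFECTIVE_ENCODINGS.get(label.lower(), (label,))
```

For example, `effective_encodings("ISO-2022-JP")` yields both ISO-2022-JP and Windows-31J, which is why a Shift_JIS page mislabelled as ISO-2022-JP would still decode under this scheme.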
Committed <http://trac.webkit.org/changeset/47950>.

Glad that this was finally resolved. Chromium has been making a local fork for this.

(In reply to comment #7)
> The description in my previous comment was slightly inaccurate. Merging of
> 7-bit and 8-bit CJK encodings in IE seems to work as follows:
>
> Declared charset -> Actual encoding used, ‘+’ indicating union

I'm not sure if we want to do this. I suspect that there are not many benefits, while I'm afraid there is some risk.

> - for Korean, the 8-bit encoding (Windows-949) is enhanced with the
> corresponding 7-bit encoding (ISO-2022-KR) as well; and

I don't think this is necessary. Virtually no Korean web pages use ISO-2022-KR.

> - ‘small’ 8-bit encodings (EUC-CN, Shift_JIS, EUC-KR) are treated as their
> corresponding ‘large’ superset counterparts (GBK, Windows-31J, Windows-949).

That's already done by WebKit (and Firefox) and is even listed in the HTML5 spec. There are some other subset => superset mappings done by WebKit (TIS620 < ISO-8859-11 < Windows-874 for Thai and ISO-8859-9 < windows-125? for Turkish). |