Bug 97054 - Encoding detector doesn't work on a specific euc-kr case.
Summary: Encoding detector doesn't work on a specific euc-kr case.
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 97307 245305 97176
Blocks:
Reported: 2012-09-18 18:10 PDT by Kangil Han
Modified: 2023-01-22 20:14 PST

See Also:


Attachments
A bad test case (245 bytes, text/html)
2012-09-18 18:57 PDT, Kangil Han
kangil.han: review-
kangil.han: commit-queue-
Safari 15.5 differs from other browsers (239.96 KB, image/png)
2022-06-05 04:00 PDT, Ahmad Saleem
no flags

Description Kangil Han 2012-09-18 18:10:56 PDT
The encoding detector fails on an HTML file that has no charset definition and contains fewer than 10 characters of EUC-KR encoded text, because the ICU library always returns a confidence value of '10' for such input.
For this reason, I have uploaded a patch at 'http://bugs.icu-project.org/trac/ticket/9585' and am waiting for review.
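
To illustrate the failure mode, here is a toy model in Python. It is not ICU code: the function names, the fixed floor value, the scaling formula, and the caller-side threshold are all assumptions made up for this sketch, loosely following the description above.

```python
# Toy model only. None of these names or numbers are read out of ICU;
# they are assumptions based on the bug description.
FEW_CHARS_LIMIT = 10    # "fewer than 10 characters" in the report
FLOOR_CONFIDENCE = 10   # the constant confidence the report says is returned


def toy_confidence(double_byte_chars: int, bad_chars: int) -> int:
    """Return a 0-100 confidence for a multi-byte encoding guess."""
    if bad_chars > 0:
        # Any invalid byte sequence rules the encoding out in this toy model.
        return 0
    if double_byte_chars < FEW_CHARS_LIMIT:
        # Short input: the same floor value regardless of content.
        return FLOOR_CONFIDENCE
    # Arbitrary scaling for longer input, capped at 100.
    return min(100, 30 + double_byte_chars)


def accepted(confidence: int, threshold: int = 20) -> bool:
    """Hypothetical caller that ignores low-confidence guesses."""
    return confidence >= threshold


for count in (3, 9, 40):
    conf = toy_confidence(count, bad_chars=0)
    print(count, conf, accepted(conf))
```

In this model every short page gets the same score, so a caller that applies any threshold above the floor rejects all of them, however clearly Korean the text is.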

To cover this case in a layout (regression) test, I tried to drive the detector from JavaScript, but realized it would not be easy because the encoding detector works at the level where the input stream is read.
Therefore, I will ask webkit-dev for advice/opinion to resolve this.
Comment 1 yosin 2012-09-18 18:48:00 PDT
I think 10 characters is too few for encoding detection.
Could you tell me which tests you mentioned?

In my opinion, automatic encoding detection may be a feature of the browser rather than of WebKit. The detector may need to know the user's language preference list, the referrer page's encoding/language, the encoding/language of pages linked from the page, etc.
Comment 2 Kangil Han 2012-09-18 18:57:47 PDT
Created attachment 164643
A bad test case
Comment 3 Kangil Han 2012-09-18 19:00:33 PDT
(In reply to comment #1)
> I think 10 characters is too few for encode detecting.
> Could you tell me the tests you mentioned?
> 
> In my feeling, auto encode detecting may be feature of browser rather than webkit. It may want to know, user's language preference list, referrer page encoding/language, encoding/language in pages in links of the page, etc.

I attached a test case I worked on lately.
I agree that the language setting would be browser stuff.
However, we can test the encoding detector with WebCore alone. :-)
Comment 4 yosin 2012-09-18 19:16:51 PDT
How about adding a method to window.internal to enable/disable auto encoding detection?

We may want to specify an encoding to boost, too. For the EUC-KR case, the detector may return GBK, BIG5, EUC-JP, etc.
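
A small Python sketch of why a boost would help: a short EUC-KR byte string can also be a valid byte sequence in other CJK encodings, so validity alone does not single out one answer. The function names and the hint parameter are hypothetical and are not part of WebKit or ICU; the codecs are the ones shipped with Python.

```python
def decode_candidates(data, encodings=("euc_kr", "gbk", "big5", "euc_jp")):
    """Return {encoding: decoded text} for every encoding that accepts the bytes."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding; drop it as a candidate.
            pass
    return results


def pick_with_hint(candidates, hint):
    """Prefer the hinted ("boosted") encoding when it is among the candidates."""
    if hint in candidates:
        return hint
    # Otherwise fall back to the first candidate, or None if there is none.
    return next(iter(candidates), None)


sample = "안녕하세요".encode("euc_kr")  # 5 characters, 10 bytes
candidates = decode_candidates(sample)
chosen = pick_with_hint(candidates, hint="euc_kr")
print(sorted(candidates), chosen)
```

The hint here stands in for context such as the user's language preferences or the referrer page's encoding.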
Comment 5 yosin 2012-09-18 19:22:51 PDT
We worked on encoding detection in https://bugs.webkit.org/show_bug.cgi?id=75594
However, the patch never landed.

See WebCore::TextResourceDecoder::setUsesEncodingDetector() for how we tried to control auto encoding detection.

Hope this helps.
Comment 6 Kangil Han 2012-09-18 19:39:35 PDT
(In reply to comment #4)
> How about adding method to window.internal to enable/disable auto encoding detection?
> 
> We may want to specify boosting encoding too. For EUC-KR case, detector may return GBK, BIG5, EUC-JP, etc.

window.internal is also a JavaScript-based way to enable the encoding detector.
The problem I've found is that it won't work, because the encoding detector finishes its work at the stage where the input stream is read.
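
The timing problem can be sketched in Python as follows. This is a simplified model, not WebCore code: the class, its methods, and the fallback encoding are hypothetical, and it ignores multi-byte characters split across chunks.

```python
class StreamDecoderSketch:
    """Hypothetical decoder that fixes its encoding when the first chunk arrives."""

    def __init__(self, detect, fallback="latin-1"):
        self._detect = detect        # callable: bytes -> encoding name or None
        self._fallback = fallback
        self._uses_detector = False
        self.encoding = None         # decided once, on the first feed()

    def set_uses_encoding_detector(self, enabled):
        self._uses_detector = enabled

    def feed(self, chunk):
        if self.encoding is None:
            guess = self._detect(chunk) if self._uses_detector else None
            self.encoding = guess or self._fallback
        return chunk.decode(self.encoding, errors="replace")


decoder = StreamDecoderSketch(detect=lambda data: "euc_kr")
decoder.feed(b"<script>")                 # the encoding is locked in here
decoder.set_uses_encoding_detector(True)  # a page script would run only now
print(decoder.encoding)
```

By the time a script in the page could flip the switch, the first bytes have already been decoded, so the switch has to be set before loading starts.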
Comment 7 Kangil Han 2012-09-18 19:46:41 PDT
(In reply to comment #5)
> We worked encoding detection: https://bugs.webkit.org/show_bug.cgi?id=75594
> Although, the patch wasn't landed.
> 
> See WebCore::TextResourceDecoder::setUsesEncodingDetector(), how we tried to control auto encoding detection.
> 
> Hope your help.

So huge.. :P
BTW, doesn't ICU support Kanji code?
Comment 8 yosin 2012-09-18 20:43:06 PDT
(In reply to comment #7)
> (In reply to comment #5)
> > We worked encoding detection: https://bugs.webkit.org/show_bug.cgi?id=75594
> > Although, the patch wasn't landed.
> > 
> > See WebCore::TextResourceDecoder::setUsesEncodingDetector(), how we tried to control auto encoding detection.
> > 
> > Hope your help.
> 
> So huge.. :P
Most of it is refactoring. You can ignore WebKit/* and WebCore/platform/*.

> BTW, doesn't ICU support Kanji code?

Yes, ICU supports Kanji characters and Japanese encodings. For historical reasons, WebKit has a special detector for Japanese encodings.
Comment 9 Kangil Han 2012-09-18 22:17:23 PDT
(In reply to comment #8)
> 
> Yes, ICU supports Kanji characters, Japanese encoding. By historical reasons, WebKit has special detector for Japanese encoding.

Oh, I see!
Comment 10 Ahmad Saleem 2022-06-05 04:00:22 PDT
Created attachment 460037
Safari 15.5 differs from other browsers

I am still able to reproduce this bug in Safari 15.5 on macOS 12.4. As shown in the attached screenshot, all other browsers work correctly. Thanks!