This case, an HTML file with no charset declaration that contains fewer than 10 characters of EUC-KR encoded text, fails in the encoding detector because the ICU library always returns a confidence value of '10'. I have uploaded a patch at 'http://bugs.icu-project.org/trac/ticket/9585' and am waiting for review. To cover this case in a layout (regression) test, I tried manipulating it from JavaScript, but realized that would not be easy because the encoding detector works at the level of reading the input stream. Therefore, I will ask webkit-dev for advice on how to resolve this.
I think 10 characters is too few for encoding detection. Could you tell me which tests you mentioned? In my opinion, automatic encoding detection may be a feature of the browser rather than of WebKit. It may want to know the user's language preference list, the referrer page's encoding/language, the encoding/language of the pages linked from the page, etc.
Created attachment 164643 [details] A bad test case
(In reply to comment #1) > I think 10 characters is too few for encode detecting. > Could you tell me the tests you mentioned? > > In my feeling, auto encode detecting may be feature of browser rather than webkit. It may want to know, user's language preference list, referrer page encoding/language, encoding/language in pages in links of the page, etc. I attached the test case I worked on recently. I agree that the language setting would be the browser's business. However, we can test the encoding detector with WebCore alone. :-)
How about adding a method to window.internal to enable/disable automatic encoding detection? We may also want to specify an encoding to boost. For the EUC-KR case, the detector may return GBK, Big5, EUC-JP, etc.
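To illustrate why a short EUC-KR input is ambiguous, here is a minimal sketch in Python. It uses the standard library codecs rather than ICU, and the helper name `decode_candidates` and the sample word are made up for this example. It only shows that the same few bytes are also valid in other double-byte encodings; it does not reproduce ICU's confidence scoring.

```python
# Sketch only: Python's built-in codecs stand in for a real detector.
# `decode_candidates` is a hypothetical helper, not a WebKit or ICU API.

def decode_candidates(data, encodings):
    """Try to decode `data` with each encoding; None means it is not valid there."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# A two-character Korean word encoded as EUC-KR (four bytes in total).
sample = "한글".encode("euc_kr")

results = decode_candidates(sample, ["euc_kr", "gbk", "big5", "euc_jp"])
for name, text in results.items():
    status = "not valid" if text is None else "valid, decodes to %r" % text
    print("%-7s %s" % (name, status))
```

Every encoding that decodes the bytes without error is a plausible answer for a detector, so with only a handful of characters there is little statistical evidence to prefer EUC-KR over the others.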
We worked on encoding detection in https://bugs.webkit.org/show_bug.cgi?id=75594, although the patch never landed. See WebCore::TextResourceDecoder::setUsesEncodingDetector() for how we tried to control automatic encoding detection. I hope this helps.
(In reply to comment #4) > How about adding method to window.internal to enable/disable auto encoding detection? > > We may want to specify boosting encoding too. For EUC-KR case, detector may return GBK, BIG5, EUC-JP, etc. window.internal is also a JavaScript-based way to enable the encoding detector. The problem I found is that it won't work, because the encoding detector finishes its work at the stage of reading the input stream.
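The timing problem can be sketched with a toy model in Python. Everything here is hypothetical: `ToyDecoder`, its fields, and the fixed "detector guess" are illustrative stand-ins, not WebKit's TextResourceDecoder. The assumption is simply that the encoding is chosen when the first bytes of the stream are decoded, which happens before any script in the page can run.

```python
# Toy model only: this is NOT WebCore::TextResourceDecoder.
# It assumes the encoding is fixed when the first chunk is decoded.

class ToyDecoder:
    def __init__(self, default_encoding="latin-1", detector_guess="euc_kr"):
        self.uses_detector = False            # what a script would like to toggle
        self.default_encoding = default_encoding
        self.detector_guess = detector_guess  # pretend result of a detector
        self.encoding = None                  # decided on the first decode() call

    def decode(self, chunk):
        if self.encoding is None:
            # The decision is made here, while reading the input stream.
            if self.uses_detector:
                self.encoding = self.detector_guess
            else:
                self.encoding = self.default_encoding
        return chunk.decode(self.encoding, errors="replace")


korean = "한글".encode("euc_kr")

# Case 1: the flag is flipped by a "script" after the head was already decoded.
late = ToyDecoder()
late.decode(b"<script>/* enable detector */</script>")
late.uses_detector = True                     # too late, encoding already fixed
late_text = late.decode(korean)

# Case 2: the flag is set before any bytes are read.
early = ToyDecoder()
early.uses_detector = True
early_text = early.decode(korean)

print(late.encoding, repr(late_text))
print(early.encoding, repr(early_text))
```

In this model only the second case decodes the Korean text correctly, which is why a switch that can only be reached from script would need to take effect before the resource is loaded.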
(In reply to comment #5) > We worked encoding detection: https://bugs.webkit.org/show_bug.cgi?id=75594 > Although, the patch wasn't landed. > > See WebCore::TextResourceDecoder::setUsesEncodingDetector(), how we tried to control auto encoding detection. > > Hope your help. So huge.. :P BTW, doesn't ICU support Kanji code?
(In reply to comment #7) > (In reply to comment #5) > > We worked encoding detection: https://bugs.webkit.org/show_bug.cgi?id=75594 > > Although, the patch wasn't landed. > > > > See WebCore::TextResourceDecoder::setUsesEncodingDetector(), how we tried to control auto encoding detection. > > > > Hope your help. > > So huge.. :P Most of the changes are refactoring. You can ignore WebKit/* and WebCore/platform/*. > BTW, doesn't ICU support Kanji code? Yes, ICU supports Kanji characters and Japanese encodings. For historical reasons, WebKit has a special detector for Japanese encodings.
(In reply to comment #8) > > Yes, ICU supports Kanji characters, Japanese encoding. By historical reasons, WebKit has special detector for Japanese encoding. Oh, I see!
Created attachment 460037 [details] Safari 15.5 differs from other browsers
I am still able to reproduce this bug in Safari 15.5 on macOS 12.4. As shown in the attached screenshots, all other browsers work correctly. Thanks!