Bug 16482

Summary: Hook up ICU's encoding detector and add a boolean param to Settings and WebPreferences
Product: WebKit Reporter: Jungshik Shin <jshin>
Component: New BugsAssignee: Jungshik Shin <jshin>
Status: RESOLVED FIXED    
Severity: Enhancement CC: abarth, ap, dglazkov, emacemac7, hsivonen, mal, mike
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on: 20534    
Bug Blocks: 21990    
Attachments:
Description Flags
incompete patch (fails in linking)
none
Updated copy of Jungshik's patch which compiles on ToT
none
Updated copy of Jungshik's patch which compiles and links on ToT
none
patch with some heuristics added to Eric's previous patch
eric: review-
patch with Eric's comments addressed
none
updated patch
ap: review-
updated patch with ap's review comments addressed
none
upated patch
none
updated patch (the same as the previous one except for a 1-line chnage)
ap: review-
patch addressing ap's concerns
none
patch (alternative) with a very minor diff from the previous one
none
tiger/qt build fix
none
Fix Tiger/Qt build and crash in setHintEncoding none

Description Jungshik Shin 2007-12-17 14:34:48 PST
ICU has an character encoding detector. It's not very well tuned. However, it's decent enough to 'resolve'  bugs like bug 14683 and bug 14500 (in the sense that they'd not be visible to end-users any more).

I have a patch that  :

  1. Hooks up FrameLoader with ICU encoding detector 
  2. Adds a field to Settings (to control whether the detector is used or not)

I'll upload it so that others can take a look. 

In addition, WebPreference needs to have the corresponding entry and it needs to have a UI so that the end user controls it as well.
Comment 1 Jungshik Shin 2007-12-20 16:58:24 PST
Created attachment 18020 [details]
incompete patch (fails in linking)

This patch does not work because 1) the version of ICU on Mac OS 10.3.x does not contain the character encoding detector. 2) I don't know Objective-C and my change to those files may be wrong. 

I added ucsdet.h header to icu header directory hoping that ICU on Mac OS 10.3.x is recent enough (3.6 or later??) to have encoding detectors, but at the link time, it fails. (I could have found that searching for ucsdet in icucore.dylib before trying). 

Anyway, I'm uploading the patch in the hope that it can be used on other platforms (with some conditional compilations added) and perhaps on Mac OS 10.4.x with a newer version of ICU.
Comment 2 Alexey Proskuryakov 2007-12-20 21:24:01 PST
Mac OS X 10.4 has ICU 3.2; 10.5 and Windows have ICU 3.6. On Linux distributions, the versions obviously vary.
Comment 3 David Kilzer (:ddkilzer) 2007-12-21 09:06:14 PST
(In reply to comment #1)
> This patch does not work because 1) the version of ICU on Mac OS 10.3.x does
> not contain the character encoding detector. 2) I don't know Objective-C and my
> change to those files may be wrong. 

Are you actually compiling WebKit on Mac OS X 10.3 Panther!?

Comment 4 Jungshik Shin 2007-12-22 16:17:55 PST
(In reply to comment #3)
> (In reply to comment #1)
> > This patch does not work because 1) the version of ICU on Mac OS 10.3.x does
> > not contain the character encoding detector. 2) I don't know Objective-C and my
> > change to those files may be wrong. 
> 
> Are you actually compiling WebKit on Mac OS X 10.3 Panther!?

Sorry my version number was off by one. I meant 10.4.x (tiger).

Comment 5 Eric Seidel (no email) 2008-01-09 18:07:17 PST
http://ads.people.com.cn/html.ng/channel_range=1/&channel_id=1&PagePos=24&size=txt&site=people
is an example of page which is rendered incorrectly due to lack of charset detection.

This is an ad from http://people.com.cn/  They are displayed in iframes in the upper right hand corner of the page.

Loading the ad frame by itself renders incorrectly in:
Safari
FireFox (with charset decoding off)
IE7

Loading the ad frame as part of people.com.cn renders correctly in:
IE7
Opera

The site is only sending text/html, no charset information.
Comment 6 Jungshik Shin 2008-01-09 18:34:34 PST
In case somebody is wondering why the example given works for Chinese (and there are many sites like that in China, Korea ).   Chinese and Koreans set their browser default encodings to the most widely used legacy encoding (GB2312/GBK, EUC-KR/CP949) and without charset specified, those pages just get rendered fine. 

However, with the default encoding set to one (ISO-8859-1) different from what's assumed by the site author, pages (iframes) without charset specified get garbled.  


Comment 7 Eric Seidel (no email) 2008-01-09 21:34:17 PST
(In reply to comment #5)
I should note that the testing I did was with browsers on an two US-english configured laptops.  One Mac OS X 10.4.11 and one Windows XP.
Comment 8 Eric Seidel (no email) 2008-08-02 02:47:53 PDT
Created attachment 22616 [details]
Updated copy of Jungshik's patch which compiles on ToT

 WebCore/icu/unicode/ucsdet.h                  |  350 +++++++++++++++++++++++++
 WebCore/loader/FrameLoader.cpp                |    5 +-
 WebCore/loader/TextResourceDecoder.cpp        |   31 ++-
 WebCore/loader/TextResourceDecoder.h          |    8 +-
 WebCore/page/Settings.cpp                     |    6 +
 WebCore/page/Settings.h                       |    4 +
 WebKit/mac/WebView/WebPreferenceKeysPrivate.h |    1 +
 WebKit/mac/WebView/WebPreferences.mm          |    1 +
 WebKit/mac/WebView/WebPreferencesPrivate.h    |    3 +
 WebKit/mac/WebView/WebView.mm                 |    1 +
 10 files changed, 405 insertions(+), 5 deletions(-)
Comment 9 Eric Seidel (no email) 2008-08-02 02:50:18 PDT
What's left to do here?

1.  Jungshik needs to provide some test cases which show this type of detection in action.
2.  Someone needs to abstract the ICU bits into an ICU-specific file.  I looked at making a TextResourceDecoderICU, but that's a loader file and not in WebCore/platform (where all the suffixed files should go).  So maybe we need to make a TextEncodingDetector.h and TextEncodingDetectorICU.cpp under platform/ and stick this one function in there?
Comment 10 Eric Seidel (no email) 2008-08-02 03:39:37 PDT
Created attachment 22617 [details]
Updated copy of Jungshik's patch which compiles and links on ToT

 WebCore/WebCore.base.exp                      |    3 +-
 WebCore/icu/unicode/ucsdet.h                  |  350 +++++++++++++++++++++++++
 WebCore/loader/FrameLoader.cpp                |    5 +-
 WebCore/loader/TextResourceDecoder.cpp        |   31 ++-
 WebCore/loader/TextResourceDecoder.h          |    8 +-
 WebCore/page/Settings.cpp                     |    6 +
 WebCore/page/Settings.h                       |    4 +
 WebKit/mac/WebView/WebPreferenceKeysPrivate.h |    1 +
 WebKit/mac/WebView/WebPreferences.mm          |   11 +
 WebKit/mac/WebView/WebPreferencesPrivate.h    |    3 +
 WebKit/mac/WebView/WebView.mm                 |    1 +
 11 files changed, 417 insertions(+), 6 deletions(-)
Comment 11 Jungshik Shin 2008-08-29 11:08:19 PDT
Created attachment 23070 [details]
patch with some heuristics  added to  Eric's previous patch 

Eric, thank you for updating my patch. 

I added some heuristics to guard against the mis-detection when a rather short chunk is thrown at the ICU detector.

As for making a test, is it possible to control  the webpreference (usesEncodingDetector, which is off by default) for the layout test?
Comment 12 Eric Seidel (no email) 2008-08-29 12:23:37 PDT
(In reply to comment #11)
> Created an attachment (id=23070) [edit]
> As for making a test, is it possible to control  the webpreference
> (usesEncodingDetector, which is off by default) for the layout test?  

Not yet... but Glenn Wilson has posted a patch to add support for such:
https://bugs.webkit.org/show_bug.cgi?id=20534
Comment 13 Eric Seidel (no email) 2009-01-20 12:17:01 PST
So gwilson has posted a patch for review for toggling preferences in DRT.  I'm supposed to review it, but he and I talked about a revised approach and I was sorta waiting for that, but I really should just r+/r- his current patch.  Will go do that after lunch.
Comment 14 Eric Seidel (no email) 2009-02-03 15:17:40 PST
Comment on attachment 23070 [details]
patch with some heuristics  added to  Eric's previous patch 

This patch looks good in general, however we need to split this ICU-specific logic out into its own file, or we'll break non-ICU ports.  I would suggest creating new files in platform/text named EncodingDectector.h EncodingDectectorNone.cpp EncodingDetectorICU.cpp and adding your logic in the appropriate places, and providing an empty stub in the None case.  qt/gtk builds can use the None, and Mac/Win/Chromium can use the ICU version.  You can also of course call the files whatever would be most correct, I'm not sure if "TextEncodingDetector" makes more sense or not.

Otherwise the patch looks great.  There are a few style issues like foo_bar should be fooBar for variable names.

Thanks for the patch!
Comment 15 Jungshik Shin 2009-02-10 14:20:14 PST
Created attachment 27539 [details]
patch with Eric's comments addressed

I addressed Eric's comment and made sure that it builds on Mac(Leopard) and Windows. TextEncodingDetection.h was added to platform/text with one actual implementation based on ICU (TextEncodingDetectionICU.cpp) and a stub implementation (TextEncodingDetectionNone.cpp). 

To build on Mac OS X, I manually edited WebCore.base.exp to export two symbols (TextResourceDecoder ctor with a new signature and Settings::setUseEncodingDetector). Is it the right way? 


Besides, there's still an issue with Mac OS 10.4. It has ICU 3.2.x so that ICU's encoding detector (added to ICU in 3.6) is not available. So, on Mac OS 10.4, link will fail not being able to find ucsdet. 

I wonder what macro is available to distinguish between different Mac OS versions.[1] When building on Tiger for deployment on Tiger, a stub implementation of detectTextEncoding() should be used in place of the current implementation in platform/text/TextEncodingDetectionICU.cpp (ucsdet.h should not be included). 

[1] Using U_ICU_VERSION_{MAJOR,MINOR}_NUM in uversion.h macros does not work because on Mac OS X, we build with ICU headers in WebKit tree (which has two macros set to 3 and 2, respectively)
Comment 16 Alexey Proskuryakov 2009-02-11 01:42:20 PST
(In reply to comment #15)
> To build on Mac OS X, I manually edited WebCore.base.exp to export two symbols
> (TextResourceDecoder ctor with a new signature and
> Settings::setUseEncodingDetector). Is it the right way? 

Yes, that's correct. We keep this file sorted though.

> I wonder what macro is available to distinguish between different Mac OS
> versions.[1]

#ifdef BUILDING_ON_TIGER

There are multiple coding style violations in this patch, which I'm not calling out, since it wasn't put for review.

The addition of encoding auto-detection has serious security implications, so I'm CC'in some security folks. In particular, using parent frame's encoding as a "hint" for subframes looks wrong - there's a reason for a canAccess() call just below the code you added for setting the hint.
Comment 17 Adam Barth 2009-02-11 09:59:04 PST
(In reply to comment #16)
> The addition of encoding auto-detection has serious security implications, so
> I'm CC'in some security folks. In particular, using parent frame's encoding as
> a "hint" for subframes looks wrong - there's a reason for a canAccess() call
> just below the code you added for setting the hint.

Yes.  Do we have a complete understanding of Firefox's behavior for auto-detecting encodings?  IE's behavior has led to lots of XSS vulnerabilities in sites.

Is encoding auto-detection specified in HTML 5?  If not, I'd recommend matching Firefox's behavior precisely.
Comment 18 Jungshik Shin 2009-02-11 11:01:47 PST
(In reply to comment #17)
> (In reply to comment #16)
> > The addition of encoding auto-detection has serious security implications, so
> > I'm CC'in some security folks. In particular, using parent frame's encoding as
> > a "hint" for subframes looks wrong - there's a reason for a canAccess() call
> > just below the code you added for setting the hint.

Thank you for the comment. I'll look into that. 
 
> Yes.  Do we have a complete understanding of Firefox's behavior for
> auto-detecting encodings?  IE's behavior has led to lots of XSS vulnerabilities
> in sites.

If I understand/remember correctly, those vulnerabilities are due to that IE auto-detects ISO-2022-XX (JP, KR) and UTF-7 (7bit encodings). 

The current implementation does not (Firefox does not auto-detect UTF-7, but I forgot whether it auto-detects ISO-2022-JP or not. I think it does). Perhaps, I should make it explicit. 

Another problem with IE (IIUC) is that autodetection kicks in without an explicit user request, in some cases.
 
> Is encoding auto-detection specified in HTML 5?  If not, I'd recommend matching
> Firefox's behavior precisely.

 

Comment 19 Adam Barth 2009-02-11 11:14:23 PST
(In reply to comment #18)
> If I understand/remember correctly, those vulnerabilities are due to that IE
> auto-detects ISO-2022-XX (JP, KR) and UTF-7 (7bit encodings). 

I'm not nearly as much of an encoding expert as you, but that matches my understanding.

> The current implementation does not (Firefox does not auto-detect UTF-7, but I
> forgot whether it auto-detects ISO-2022-JP or not. I think it does). Perhaps, I
> should make it explicit. 

Please do make this explicit.  We should also test for this explicitly in a security LayoutTest.
Comment 20 Alexey Proskuryakov 2009-02-11 11:34:33 PST
> If I understand/remember correctly, those vulnerabilities are due to that IE
> auto-detects ISO-2022-XX (JP, KR) and UTF-7 (7bit encodings).

Are HZ family encodings in the same category that we should explicitly forbid?

I think that besides forbidding encodings that affect ASCII range, we should make the model sufficiently simple to remember and explain, so that we could reasonably deal with compatibility issues and future security attack vectors.
Comment 21 Jungshik Shin 2009-02-11 16:02:02 PST
Created attachment 27576 [details]
updated patch 

I updated comments to shouldAutoDetect to make it clear when auto-detection is triggered. Here's recap:

// We use the encoding detector in two cases:
//   1. Encoding detector is turned ON and no other encoding source is
//      available (that is, it's DefaultEncoding).
//   2. Encoding detector is turned ON and the encoding is set to
//      the encoding of the parent frame, which is also auto-detected.
//   Note that condition #2 is NOT satisfied unless parent-child frame
//   relationship is compliant to the same-origin policy. If they're from
//   different domains, |m_source| would not be set to EncodingFromParentFrame
//   in the first place. 

I believe the above and *not* auto-detecting UTF-7 (and ISO-2022-XX) make this rather safe. 

In addition to the above change, I tried to take care of style nits as far as I could find.
Comment 22 Jungshik Shin 2009-02-17 13:30:35 PST
(In reply to comment #20)
> > If I understand/remember correctly, those vulnerabilities are due to that IE
> > auto-detects ISO-2022-XX (JP, KR) and UTF-7 (7bit encodings).
> 
> Are HZ family encodings in the same category that we should explicitly forbid?

Ooops. I missed your question before posting the updated the patch. Yes, it is and the current implementation (ICU's encoding detector) does not auto-detect HZ, either. For other implementations, shall I add a promiment comment to that effect (do not auto-detect UTF-7, ISO-2022-xx and HZ)? 

> I think that besides forbidding encodings that affect ASCII range, we should
> make the model sufficiently simple to remember and explain, so that we could
> reasonably deal with compatibility issues and future security attack vectors.

Did my additional comment make it clear enough or do you want some more clarification/simplification(?)? 


Comment 23 Jungshik Shin 2009-02-23 12:50:28 PST
Comment on attachment 27576 [details]
updated patch 

Additional piece of information: IE suffers (more) from XSS using UTF-7 because it's UTF-7 decoder did not validate the input (or couldn't cope with invalid sequence in a safe manner). 

Anyway, can somebody take a look at the patch? Thank you.
Comment 24 Alexey Proskuryakov 2009-02-27 05:43:47 PST
Comment on attachment 27576 [details]
updated patch 

+         - add TextEncodingDetection* to platform/text.

This class should be named TextEncodingDetector, not TextEncodingDetection.

+           Tiger comes with ICU 3.2 with encoding detector.

"which doesn't support encoding detection"

+        * WebCore.vcproj/WebCore.vcproj:
+        * WebCore.xcodeproj/project.pbxproj:

Having per-file and per-function comments makes it easier for reviewers and others who will read a patch to understand it. This is particularly important for interesting patches like this one.

Please also edit makefiles for other build systems (GNUmakefile.am, WebCore.pro, WebCore.scons, WebCoreSources.bkl).

+    , m_hintDecoder(NULL)

We use 0, not NULL in C++ code. There are several uses of NULL in the patch, please fix them all.

+    return m_usesEncodingDetector
+        && (m_source == DefaultEncoding
+        || (m_source == EncodingFromParentFrame && m_hintDecoder
+        && m_hintDecoder->source() == AutoDetectedEncoding)); 

This code is very difficult to read due to indenting. There is no need to make lines that short, and I think that this condition can be made much easier to read if you don't wrap it at 72 characters).

     // Do the auto-detect if our default encoding is one of the Japanese ones.
     // FIXME: It seems wrong to change our encoding downstream after we have already done some decoding.
-    if (m_source != UserChosenEncoding && m_source != AutoDetectedEncoding && encoding().isJapanese())
+    if (m_source != UserChosenEncoding && m_source != AutoDetectedEncoding && encoding().isJapanese()) {
         detectJapaneseEncoding(data, len);
+    } else if (shouldAutoDetect()) {
+       TextEncoding detectedEncoding;
+       if (detectTextEncoding(data, len, hintEncodingName(), &detectedEncoding))
+           setEncoding(detectedEncoding, AutoDetectedEncoding);
+   }

Our coding style says that there should be no braces around single-line blocks. The comment before this code no longer matches what it does (it is no longer only about Japanese).

Indenting on the last line here is incorrect.

I think that we should resolve this FIXME before making encoding detection more common. It should be a separate bug blocking this one.

 String TextResourceDecoder::flush()
 {
+   // For HTML and XML document, if we can not find proper encoding even
+   // document is completely loaded, we need to use automatically detect
+   // the encoding of document if other conditions for autodetection is
+   // satisfied.

This comment should explain why HTML and XML are special. What about CSS, for example?

+    void setHintDecoder(const TextResourceDecoder* hintDecoder) {
+        m_hintDecoder = hintDecoder;
+    }

The opening brace should go on the next line.

+    EncodingSource source() const { return m_source; }

I don't see where this new public method us used outside TextResourceDecoder. Let's not add it if it is not needed.

+    const char* hintEncodingName() const {
+        return m_hintDecoder ? m_hintDecoder->encoding().name() : NULL;
+    }

Same comments about opening brace and NULL.

+    const TextResourceDecoder* m_hintDecoder;
+    //RefPtr<TextResourceDecoder> m_hintDecoder;

We don't commit commented out code. How are you going to manage hint decoder lifetime?

+bool detectTextEncoding(const char* data, size_t len,
+                        const char* hintEncodingName,
+                        TextEncoding* detectedEncoding)
+{
+    *detectedEncoding = TextEncoding();
+#ifdef BUILDING_ON_TIGER
+    // Tiger came with ICU 3.2 and does not have the encoding detector.
+    return false;
+#else

This will cause unused argument warnings when built on Tiger, please add UNUSED_PARAM macros for unused arguments.

Index: WebCore/platform/text/TextEncodingDetectionNone.cpp
===================================================================
--- WebCore/platform/text/TextEncodingDetectionNone.cpp	(revision 0)
+++ WebCore/platform/text/TextEncodingDetectionNone.cpp	(revision 0)
@@ -0,0 +1,11 @@
+namespace WebCore {
+
+bool detectTextEncoding(const char* data, size_t len,
+                        const char* hintEncodingName,
+                        TextEncoding* detectedEncoding)
+{
+    *detectedEncoding = TextEncoding();
+    return false;
+}
+
+}

This file needs copyright comments and includes. Same issue with unused arguments.

         Settings* settings = m_frame->settings();
-        m_decoder = TextResourceDecoder::create(m_responseMIMEType, settings ? settings->defaultTextEncodingName() : String());
+        if (settings) {

You can write this in one line as 
         if (Setting* settings = m_frame->settings()) {

+            if (parentFrame && parentFrame->document())
+                m_decoder->setHintDecoder(parentFrame->document()->decoder());

A different origin parent frame shouldn't affect child frame decoding for security reasons, even through a hint. So, I'm not sure why the hint is needed at all. Am I perhaps missing something?

+	} else {
+            m_decoder = TextResourceDecoder::create(m_responseMIMEType, String());
+        }

There is a tab on "else" line. We don't use braces around single-line blocks.

This looks good for me for the most part. The biggest issues I see are lack of clarity about hint encoding usage (and its decoder lifetime), and the FIXME I asked to fix in a blocking bug.
Comment 25 Jungshik Shin 2009-03-06 01:25:00 PST
Created attachment 28353 [details]
updated patch with ap's review comments addressed

Thank you for the review. Can you take another look? 

* Added/fixed copyright/license in newly added files (I copied it from other chrome-contributed files)
* Fixed style nits (NULL, indentation, tab, etc)
* Added more comments to clarify what it does in ChangeLog and source files.
Comment 26 Jungshik Shin 2009-03-06 01:41:12 PST
> I think that we should resolve this FIXME before making encoding detection more
> common. It should be a separate bug blocking this one.

I filed bug 24420. It's kinda tricky because whether or not to invoke an encoding detection partly depends on what we read (and decode) so far. Would you consider taking this patch first with the premise that I'll follow it up with bug 24420.

In case of Japanese encoding detection, I think it should be removed altoghther (see bug 21990 [1]) or at least should be invoked only when a newly added settings entry (usesEncodingDetector) is ON. 

As for hintEncoding, I added a check for security origin. Previously, I replied that it's not necessary because there's a check in TextResourceDecoder (shouldAutoDetect). It turned out that there are cases the check misses. What is hintEncoding for? TextEncodingDetectorICU.cpp uses it to find a 'compatible' (currently its compatibility check is limited to the name match) encoding among candidates. Other implementations can do more sophisticated things. Also, in the future, hintEncodingName may have to be replaced by encodingHint (that can carry more information like TLD, UI language, parent/referrer encoding etc). 

I added TextEncodingDetectorICU.cpp to various build files if there are other FooICU.cpp files. Otherwise, I added TextEncodingDetectorNone.cpp. 

[1] If you need some hard numbers to justify removing that, I can provide that (not immediately). 
Comment 27 Alexey Proskuryakov 2009-03-06 02:12:06 PST
(In reply to comment #26)
> I filed bug 24420. It's kinda tricky because whether or not to invoke an
> encoding detection partly depends on what we read (and decode) so far. Would
> you consider taking this patch first with the premise that I'll follow it up
> with bug 24420.

I'm just worried that fixing bug 24420 will require re-starting decoding, with all the associated perils of repeated script execution etc. It's very tricky indeed!
Comment 28 Jungshik Shin 2009-03-09 11:03:15 PDT
(In reply to comment #27)
> (In reply to comment #26)

> I'm just worried that fixing bug 24420 will require re-starting decoding, with
> all the associated perils of repeated script execution etc. It's very tricky
> indeed!

You meant "NOT fixing bug 24420 will require re-starting decoding .... etc", didn't you? I agree that there is a problem. 

Don't we have the same problem with 'meta charset' declaration (or do we stop search for meta as soon as we see 'script tag'? I'm being lazy here not checking the current code)? If we do have the same problem with meta charset search, this one should be less severe than that because it's off by default, shouldn't it? 




Comment 29 Alexey Proskuryakov 2009-03-09 11:10:07 PDT
(In reply to comment #28)
> You meant "NOT fixing bug 24420 will require re-starting decoding .... etc",
> didn't you? I agree that there is a problem. 

I guess I meant both at once - I'd like to see at least an initial plan for bug 24420 before giving a green light to more encoding detection.

> Don't we have the same problem with 'meta charset' declaration (or do we stop
> search for meta as soon as we see 'script tag'?

We do not have this problem with meta charset (we pre-scan until document body or 1024 byte boundary, whichever comes last, and we ignore charset declarations that appear after we started tokenizing input).
Comment 30 Jungshik Shin 2009-03-10 13:12:27 PDT
(In reply to comment #29)
> (In reply to comment #28)
> > You meant "NOT fixing bug 24420 will require re-starting decoding .... etc",
> > didn't you? I agree that there is a problem. 
> 
> I guess I meant both at once - I'd like to see at least an initial plan for bug
> 24420 before giving a green light to more encoding detection.
...
> We do not have this problem with meta charset (we pre-scan until document body
> or 1024 byte boundary, whichever comes last, and we ignore charset declarations
> that appear after we started tokenizing input).

How about doing a similar "pre-scanning"(?) with auto-detection? We can ask autodetector to  return "an indicator of undeterministic/unknown) until it reaches a confidence level higher than a certain value (that thresold should be internal to each implementation of autodetector). Then,  we can set a flag (similar to m_checkedForHeadCharset). Would it be a good plan for bug 24420? 




Comment 31 Alexey Proskuryakov 2009-03-11 01:13:10 PDT
Yes, it may we ll be. Are you well familiar with how Gecko encoding detector works? How much data does it normally need to choose an encoding with confidence (and how much data would you expect our detector to need)?

I'd suggest that we discuss bug 24420 there.
Comment 32 Jungshik Shin 2009-03-12 10:06:53 PDT
Comment on attachment 28353 [details]
updated patch with ap's review comments addressed

This got bit-rotten due to a recent change (r41551) that got rid of TextDecoder class. I'll upload a new patch.
Comment 33 Jungshik Shin 2009-03-23 14:51:45 PDT
Created attachment 28869 [details]
upated patch
Comment 34 Jungshik Shin 2009-03-24 12:25:11 PDT
Created attachment 28897 [details]
updated patch (the same as the previous one except for a 1-line chnage)

This is the same as the previous one except that in flush(), I put 'shouldAutoDetect()' before other checks to short-circuit  in the default case where auto-detection is OFF.  This was tested in Chromium trunk on Windows (the previous patches have been in Chromium for over a year).
Comment 35 Alexey Proskuryakov 2009-03-25 06:00:03 PDT
Comment on attachment 28897 [details]
updated patch (the same as the previous one except for a 1-line chnage)

Index: WebCore/GNUmakefile.am
===================================================================

Most of changed project files are not listed in ChangeLog.

+            // Set the hint encoding to the parent frame encoding only if
+            // the parent and the current frames share the security origin.
+            // TODO: We may consider relaxing this later for 'relatively
+            // safe' encodings (that is, encodings other than non-ASCII 7bit
+            // encodings). 

We use FIXME, not TODO. I think that the first part of the comment should say why we are doing this, not just what happens below (in which case, the TODO part could probably be omitted altogether).

-        if (m_encoding.isEmpty()) {
...
+        if (m_encoding.isEmpty() && canReferToParentFrameEncoding(m_frame, parentFrame))
+            m_decoder->setEncoding(parentFrame->document()->inputEncoding(), TextResourceDecoder::EncodingFromParentFrame);
+        else {
             m_decoder->setEncoding(m_encoding,
                 m_encodingWasChosenByUser ? TextResourceDecoder::UserChosenEncoding : TextResourceDecoder::EncodingFromHTTPHeader);
         }

This looks like an unintended change - the else branch can now be taken even if m_encoding is empty.

+    const TextResourceDecoder* m_hintDecoder;

What guarantees that m_hintDecoder will exist long enough? This decoder is never used as a decoder - all we need from it is an encoding name, and source - and the latter can even be checked in setHintDecoder()! So, m_hintDecoder could be replaced with a single "const char* m_hintEncoding" variable.

+#if ENABLE(MAC_ENCODING)

The changes to TextCodecICU.cpp changes don't look like they belong to this patch.

r-, because I'm very suspicious of m_hintDecoder lifetime (especially if someone adds other sources for hints in the future), and it looks quite easy to fix resolutely. But this looks mostly ready to go!
Comment 36 Jungshik Shin 2009-03-25 16:30:43 PDT
Created attachment 28952 [details]
patch addressing ap's concerns

Thank you for the review. Here's an updated patch. Please, take another look.
Comment 37 Jungshik Shin 2009-03-25 16:34:14 PDT
Created attachment 28953 [details]
patch (alternative) with a very minor diff from the previous one

It's almost identical to attachment 28952 [details] except that it adds m_hintEncodingSource and set it (as well as m_hintEncoding) in setHintEncoding and check it against AutoDetectedEncoding in shouldAutoDetect. With it added, I can omit some comments (because the code becomes self-explanatory). 

Anyway, either this or the previous one is fine with me. 
Alexey, please, choose as you like.
Comment 38 Alexey Proskuryakov 2009-03-25 23:09:56 PDT
Comment on attachment 28952 [details]
patch addressing ap's concerns

r=me

I prefer this version, because it uses less memory (this is not critical, because TextResourceDecoder objects are not too numerous, but a good practice anyway).
Comment 39 Alexey Proskuryakov 2009-03-25 23:17:58 PDT
It came to my mind that the const char* hint can be further simplified to a bool - if there is hint that is going to be used, it always matches the decoder's codec, so all we need to know is whether auto-detect is pending, and reset the bool when appropriate. But I do not want to hold landing this.
Comment 40 Dimitri Glazkov (Google) 2009-03-26 09:17:38 PDT
Hurray!!! So, this is good to land?
Comment 41 Eric Seidel (no email) 2009-03-26 11:21:10 PDT
Alexey, I assume you'll want to land this yourself.  If that's not the case, just let one of us chromie's know and we'll be happy to land this on Jungshik's behalf.  I'm confused by the two attachments not knowing which to land. :)
Comment 42 Alexey Proskuryakov 2009-03-26 13:24:41 PDT
Please feel free to land the one with r+, as usual :)
Comment 43 Eric Seidel (no email) 2009-03-26 15:43:35 PDT
I had to add a symbol export to get it to link:

diff --git a/WebCore/WebCore.base.exp b/WebCore/WebCore.base.exp
index b7af344..cae0464 100644
--- a/WebCore/WebCore.base.exp
+++ b/WebCore/WebCore.base.exp
@@ -603,6 +603,7 @@ __ZN7WebCore8Settings22setShowsURLsInToolTipsEb
 __ZN7WebCore8Settings23setDefaultFixedFontSizeEi
 __ZN7WebCore8Settings23setEditableLinkBehaviorENS_20EditableLinkBehaviorE
 __ZN7WebCore8Settings23setNeedsTigerMailQuirksEb
+__ZN7WebCore8Settings23setUsesEncodingDetectorEb
 __ZN7WebCore8Settings24setApplicationChromeModeEb
 __ZN7WebCore8Settings24setTextAreasAreResizableEb
 __ZN7WebCore8Settings25setDeveloperExtrasEnabledEb

But otherwise things seemed to compile fine.  Still running the tests...
Comment 44 Jungshik Shin 2009-03-26 17:09:24 PDT
(In reply to comment #43)
> I had to add a symbol export to get it to link:

>  __ZN7WebCore8Settings23setNeedsTigerMailQuirksEb
> +__ZN7WebCore8Settings23setUsesEncodingDetectorEb

Thanks, Eric, for fixing the patch and landing it. My patch has that symbol added, but it got slightly bit-rotten because __ZN7WebCore8Settings23setNeedsTigerMailQuirksEb was reently added.  
Comment 45 Jungshik Shin 2009-03-26 19:28:35 PDT
Created attachment 29000 [details]
tiger/qt build fix

UnusedParam.h should have been included by TextEncodingDetector{ICU,None}.cpp for Tiger and platforms where ICU is not used.
Comment 46 Jungshik Shin 2009-03-26 19:37:02 PDT
Created attachment 29001 [details]
Fix Tiger/Qt build and crash in setHintEncoding

This patch has a crash fix for setHintEncoding in addition to Tiger/Qt build fix in the previous patch.
Comment 47 Cameron Zwarich (cpst) 2009-03-26 19:45:47 PDT
Comment on attachment 29001 [details]
Fix Tiger/Qt build and crash in setHintEncoding

r=me
Comment 48 Cameron Zwarich (cpst) 2009-03-26 19:52:52 PDT
Comment on attachment 29001 [details]
Fix Tiger/Qt build and crash in setHintEncoding

Clearing review flag because I landed this in r42026. I also made the UnusedParam.h #include unconditional on Mark Rowe's advice.
Comment 49 David Levin 2009-04-08 17:33:45 PDT
Comment on attachment 28952 [details]
patch addressing ap's concerns

Clearing r+ (to move out of commit queue) since this was landed http://trac.webkit.org/changeset/42022
Comment 50 Jungshik Shin 2009-04-23 15:01:22 PDT
fixed. thanks all. 
Comment 51 Henri Sivonen 2013-12-02 04:57:34 PST
Do I understand correctly that in bug 14500, you saw that the page didn't work, because WebKit doesn't support "late meta" and instead of adding "late meta" support, you added heuristic detection support? That seems like an odd choice. Do you run the detector only across the first 1024 bytes in order to avoid introducing the issues that "late meta" support would have introduced?
Comment 52 Alexey Proskuryakov 2013-12-02 08:32:55 PST
> instead of adding "late meta" support, you added heuristic detection support

Many WebKit ports currently don't enable ICU encoding detection (I'm not even sure if any ports do). That was a Chromium thing. We have some old custom built detection for Japanese only though.

An encoding detector solves an additional issue with pages that don't have any encoding declaration. So, it's not just a replacement for re-parsing a page when a meta is seen too late.

Regardless, restarting parsing that already had side effects is not something I want to happen, and I don't think that it's necessary in practice.

> Do you run the detector only across the first 1024 bytes in order to avoid introducing the issues that "late meta" support would have introduced?

Meta pre-parser runs over the whole HEAD or 1024 bytes, whichever is larger.