Bug 17873 - Encoding override should not be persistent
Summary: Encoding override should not be persistent
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading (show other bugs)
Version: 528+ (Nightly build)
Hardware: Mac OS X 10.5
: P2 Normal
Assignee: Jungshik Shin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-03-15 23:46 PDT by johnnyding
Modified: 2017-07-18 08:29 PDT (History)
5 users (show)

See Also:


Attachments
patch v1 for fixing this problem (2.65 KB, patch)
2008-03-16 04:27 PDT, johnnyding
no flags Details | Formatted Diff | Diff
patch v2 for fixing this problem by using different way (3.28 KB, patch)
2008-03-16 05:06 PDT, johnnyding
no flags Details | Formatted Diff | Diff
patch v2 for fixing this problem by using different way (3.26 KB, patch)
2008-03-16 20:19 PDT, johnnyding
ap: review-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description johnnyding 2008-03-15 23:46:56 PDT
Once you override the encoding for the current page by selecting one in View/Encoding menu, it is persistent and don't reset it when you go to a new page which explicitly specifies the encoding via HTTP or meta tag.

1. Go to zh.wikipedia.org (UTF-8 page)
2. Override the encoding to GBK. The page will be garbled. This is expected because it's forced to be interpreted as GBK although the page is in UTF-8.
3. Go to http://news.google.com/news?ned=cn (UTF-8 page)
4. The page should be interpreted as UTF-8, but is still interpreted as GBK.

IE and Firefox will reset the encoding override when visiting new page. 
The only difference between IE and Firefox is if users go to a new page in a sub-frame(in FRAME or IFRAME) instead of main frame, IE still resets the encoding override for the new page opened in sub-frame, but Firefox does not.
Comment 1 johnnyding 2008-03-15 23:59:06 PDT
The encoding override functionality (in WebView::setCustomTextEncoding) calls FrameLoader::reloadAllowingStaleData to override encoding.  Since the FrameLoadType:FrameLoadTypeReloadAllowingStaleData is only used to reload current page by using user-selected encoding. So I think the easy way to fix his problem is

In DocumentLoader::overrideEncoding()
Check wthether current page's load type is reloadAllowingStaleData, If yes, then return right override encoding info, othervise return empty encoding.
In FrameLoader::void FrameLoader::load(const KURL& newURL, const String& referrer, FrameLoadType newLoadType,const String& frameName, Event* event, PassRefPtr<FormState> formState)
Do not change the load type to FrameLoadTypeSame when load URL is same URL, but load type is reloadAllowingStaleData, then the child frame can get correct load type.

If WebKit want to follow FireFox way, do not reset the encoding override when new page is opened in child-frame. just need to check whether the load type of ancestor of the child-frame is FrameLoadTypeReloadAllowingStaleData or not.

A patch is coming soon
Comment 2 Alexey Proskuryakov 2008-03-16 01:50:37 PDT
Something we should not regress is navigating within a site that reports an incorrect encoding in its HTTP headers. For example, I'm sometimes seeing sites that send their text as MacCyrillic when seeing "Mac" in our user agent string, but the headers incorrectly say "macintosh", which is an alias of MacRoman.
Comment 3 johnnyding 2008-03-16 04:14:41 PDT
I agree. That is why we have encoding override mechanism. 
I just think we should give user chance to look th page with original specified encoding when first time visiting a new page. But now in safari, once you specified a encoding override in one tab, then the tab would always use the specified encoding to decode all pages after that. That is incorrect.
Comment 4 johnnyding 2008-03-16 04:27:04 PDT
Created attachment 19792 [details]
patch v1 for fixing this problem

For this patch, I have a link error in my local build in Mac Leopard. The following is detail info. Since I am not familiar with Object C programming. Would anyone please guide me if you know where is the problem. Thanks.

Undefined symbols:
  "__ZNK7WebCore14DocumentLoader16overrideEncodingEv", referenced from:
      -[WebDataSource textEncodingName] in WebDataSource.o
      -[WebFrame(WebInternal) _receivedData:textEncodingName:] in WebFrame.o
      __ZN20WebFrameLoaderClient11createFrameERKN7WebCore4KURLERKNS0_6StringEPNS0_21HTMLFrameOwnerElementES6_bii in WebFrameLoaderClient.o
      -[WebView _mainFrameOverrideEncoding] in WebView.o
ld: symbol(s) not found
Comment 5 johnnyding 2008-03-16 05:06:28 PDT
Created attachment 19793 [details]
patch v2 for fixing this problem by using different way

Also if you think changing behavior of DocumentLoader::overrideEncoding is not good way. How about to call DocumentLoader::overrideEncoding("") in method FrameLoader::load when load type is not FrameLoadTypeReloadAllowingStaleData?  Please see my patch v2. It's different with patch v1.
Comment 6 johnnyding 2008-03-16 20:19:32 PDT
Created attachment 19821 [details]
patch v2 for fixing this problem by using different way

update my pacth v2
Comment 7 Jungshik Shin 2008-03-17 10:49:42 PDT
(In reply to comment #2)
> Something we should not regress is navigating within a site that reports an
> incorrect encoding in its HTTP headers. 

Is there a reliable way to determine that new pages loaded are in the same site and they have the same problem as the page for which a user overrides the encoding?  
I guess the former is possible, but I'm afraid it's hard to determine the latter. 

If what I'm writing is the case, I guess it's better to be simple (to understand/explain) and predictable and to do what other browsers do. 

After all, those misconfigured sites need to correct themselves. And if they went extra length to convert their pages to  MacCyrillic (on the fly) ,   wouldn't they listen to a request for either  not doing it (just use the encoding they use for Windows) or using the correct charset name? 
Comment 8 Alexey Proskuryakov 2008-03-17 13:38:40 PDT
(In reply to comment #7)
> After all, those misconfigured sites need to correct themselves. And if they
> went extra length to convert their pages to  MacCyrillic (on the fly) ,  
> wouldn't they listen to a request for either  not doing it (just use the
> encoding they use for Windows) or using the correct charset name? 

Sadly, they don't listen.

This usually happens not because of conscious choice, but because of using poorly written "localized" Apache configs that people downloaded somewhere, or got with their Linux/FreeBSD distros.
Comment 9 Alexey Proskuryakov 2008-04-28 04:21:50 PDT
Comment on attachment 19792 [details]
patch v1 for fixing this problem

Clearing review flag, since this breaks the build. To fix the build, one could add __ZNK7WebCore14DocumentLoader16overrideEncodingEv to WebCore.exp, but I think that it would be better not to hide the tricky logic behind this innocent-looking function.
Comment 10 Alexey Proskuryakov 2008-04-28 04:46:38 PDT
Comment on attachment 19821 [details]
patch v2 for fixing this problem by using different way

As mentioned above, I do not have enough evidence to agree that the current behavior is broken from the user point of view. I was also going to say that we cannot change the semantics of -[WebView setCustomTextEncodingName:] API, but then I noticed that its own documentation (as well as a comment in WebView.h) says "The text encoding automatically goes back to the default when the top level frame changes to a new location."

The difference seems to be too elaborate to be a simple bug. Perhaps there is a longer story behind this?

This patch as is breaks sites such as <http://www.mdf.ru/>, so I don't think that we want to take it. Firefox seems to have the same problem, although its charset auto-detection is very configurable, and it is possible that its ability to remember previously chosen overrides can remedy that in some cases.

There is certainly an issue to fix here, but I think that it would probably be appropriate to treat this as a documentation bug.
Comment 11 Jungshik Shin 2009-02-09 15:55:44 PST
Firefox does NOT have this issue. In both Firefox and IE, the manual encoding override is not persistent. I can't say about IE for sure, but as for Firefox, I can say for sure that  it's never been persistent but always has been applied ONLY to a single page being shown when a user manually overrides the encoding (needless to day, turning on/off and selecting an auto-encoding detector is persistent in Firefox). I also recently talked to a long-time Firefox I18N QA to make sure that's indeed the case and he confirmed it. 

You can try FF3 at www.mdf.ru (overrides the encoding to MacCyrillic on Mac at the front page and follow the link with the default encoding set to, say, ISO-8859-1 and
with the auto-detection turned off). 

You do have a point about sites like www.mdf.ru. If the site just used windows-1251 without encoding specified(the best would be for them to label the encoding correctly), Russian users even on Mac would be just happy by setting the default encoding to Windows-1251 because that's the encoding most likely for them to come across in the absence of an explicit encoding detection. 

Because www.mdf.ru emits MacCyrillic, it makes things complicated. Setting the default encoding to windows-1251 still breaks the site on Mac. In case of Firefox, turning on 'Russian' encoding detector (even Universal detector works fine for MacCyrillic) kinda solves the issue without the encoding override persistency. ICU encoding detecotor (bug 16482) does not detect MacCyrillic so that fixing bug 16482 does not help in this case. 

Given all these, which would you like most, Alexey?

1. Make the manual encoding override NOT persistent as in Firefox
2a Make it persistent as long as a user stays in the "same" domain regardless of the encoding detection quality (I have a plan to improve or replace ICU encoding detector with a much better one).
2b same as #2a for now, but will take option #1 once we have an encoding detector on par with or better than that of Firefox. 





Comment 12 Alexey Proskuryakov 2009-02-10 00:45:32 PST
> Firefox does NOT have this issue.

I meant, it doesn't work well with this site, in the same way that WebKit would if we took the patch (modulo encoding auto-detection). Sorry for being unclear.

Option 2a sounds reasonable, if we decide that the current behavior is broken (which I don't think we have established already).
Comment 13 Jungshik Shin 2009-02-10 16:33:15 PST
(In reply to comment #12)
> > Firefox does NOT have this issue.
> 
> I meant, it doesn't work well with this site, in the same way that WebKit would
> if we took the patch (modulo encoding auto-detection). Sorry for being unclear.

Thank you for the clarification. 

> Option 2a sounds reasonable, if we decide that the current behavior is broken
> (which I don't think we have established already).

As for whether the current behavior is better than that of what's proposed (or that of Firefox), consider the following scenario:

  1. A user on Mac overrides the encoding to MacCyrillic at www.mdf.ru because the site does not specify the encoding (http, meta) and his default encoding (to use in such a case) is Windows-1251. 
  2. He browses for a while at the site and everything seems fine.
  3. He goes to another site which correctly specifies the encoding (either meta or http) that is not MacCyrillic. (say, news.google.ru in UTF-8)
  4. Despite 3, still MacCyrillic will be used totally garbling the page. 
  5. He has to override once more to UTF-8.
  6. He moves to yet another site that correctly specifies the encoding (this time, ISO-8859-5. I'm on purpose not taking an example of Windows-1251 - the default encoding of the user in this case to make the case clearer). 
  7. It's garbled (because now UTF-8 is in force) and he has to 'overrides' the encoding once more to ISO-8859-5
 

Steps 3/4/5 and 6/7 have to be repeated everytime a user moves to another site with an encoding different from the current overriden encoding. With this huge inconvenience and the unnecessary 'punishment' of standard-compliant web sites, what we get is that a user does not have to change the encoding at some sites  that certainly need  fixing. (they need to specify the encoding explicitly). 

As I mentioned earlier, setting the default encoding to use in the absence of explicit encoding declaration (and UTF BOM) to the most likely one for a given user (Russian speakers would set it to windows-1251) would relieve a user of having to override the encoding while browsing within a site like like www.mdf.ru. That works well for Safari users on Windows (the site uses Windows-1251 [1]), but does not on Mac (the site uses MacCyrillic without a label and a Russian user would not set it to MacCyrillic even on Mac because there are lot more unlabelled sites in windows-1251 than in MacCyrillic among sites he visits). 

To me, that's an evangelism issue for those sites (yeah, needless to say, we should accomodate as many web sites as possible, but ...)

Because the gain we get (mentioned in step 1) is rather Mac-specific, I think we can consider yet another alternative if we agree with me (I hope I'm more convincing this time than before :-)) that the current behavior is not desirable. 

  2c. Do 2a but only on Mac (or any combination of PLATFORM-macros)

I'm taking this bug (I keep forgetting this bug :-)). 



[1] Obviously, for multi-lingual users (e.g. Russian and Japanese), that'd not work well. 
Comment 14 Alexey Proskuryakov 2009-02-23 13:01:10 PST
>  2c. Do 2a but only on Mac (or any combination of PLATFORM-macros)

I think that platform-specific behaviors can be considered for problems with known significant user impact. We're doing this for some editing features, for example. I'd need more evidence to be convinced that encoding overrides fall in this category.
Comment 15 Jungshik Shin 2009-02-24 17:56:29 PST
Alexey, thank you for the reply. 

Implementing option 2a reliably turned out to be rather tricky. How about another alternative of making the behavior dependent on a new settings entry (isEncodingOverridePersistent)? Then, the decision can be made outside WebCore by each 'embedder'(?). 

I realized that Safari has 'Default' in Encoding menu. I had been wondering what it's for and recently found that it's for resetting 'user override encoding'. As long as Safari has that UI, perhaps, it's better to keep the option on (unless it's ok to make that option redundant). 

In the meantime, we may want to give another shot at evangelism for Russian web sites that are consistently mislabelled/unlabelled and use MacCyrillic (throughout sites) (comment #7). Before I heard about them from you, I haven't come across sites that are so consistent in their mislabelling/unlabelling. Could you list some popular web sites with that problem? 
Comment 16 Alexey Proskuryakov 2009-02-24 23:02:34 PST
Unfortunately, I don't have any list of such sites.
Comment 17 Darin Fisher (:fishd, Google) 2009-02-24 23:10:55 PST
Hey Alexey, what are your thoughts about the runtime setting Jungshik proposes?  

Some background:  Chrome currently has modifications to WebKit to make the encoding override not be persistent.  We'd like to eliminate those forks to WebKit in favor of some kind of solution here.  I'd prefer not to take a functionality change to accomplish the unforking, so could a runtime setting be an acceptable option?
Comment 18 Alexey Proskuryakov 2009-02-24 23:17:08 PST
If there were any evidence that the current behavior causes problems for users, I'd be very willing to consider changing it unconditionally.
Comment 19 Jungshik Shin 2009-02-25 11:39:32 PST
(In reply to comment #18)
> If there were any evidence that the current behavior causes problems for users,
> I'd be very willing to consider changing it unconditionally.

What do you think of the scenario outlined in comment #13? In Safari, a user has to select 'Default' when he moves out of a site that requires overriding encoding and navigates to a site that correctly identifies the encoding. Having to choose 'Default' is better than having to know the encoding for the new site and to pick that up. However, in my opinion, it's still  "punishing" sites that do the 'right' thing (in order to make sites that need some 'evangelism' work better).

Comment 20 Ian Fette 2009-03-03 17:21:26 PST
Hi Alexey,

I'm trying to understand the comments thus far. From what I understand, the argument is that making encoding override persistent helps, when you're on a site that is doing "the wrong thing" (specifying an incorrect charset, or not specifying any at all and ending up with an incorrect auto-detection). I don't think any of us are disagreeing that it's a good problem to fix to make sites like mdf.ru work - clearly if I'm browsing around mdf.ru I don't want to have to select the charset on every single page.

However, when I leave a site that's doing "the wrong thing", I think that we (in Chrome at least) would like to be able to clear the persistence at that point. For instance, if I navigate from mdf.ru to nytimes.com, it's not clear why I would want to render nytimes.com using MacCyrillic (or whatever I had to set the override to be to make mdf.ru). As a user, when I override the charset, I'm saying "I can't read this page. I want to be able to read this page - here's some additional information to help make the page reasonable." As a browser, I interpret that as "Hey, I now have good information I can use to make this page reasonable. I'm going to use it." When the user views the next page on the same website (domain), I still have reasonably good information (another page on mdf.ru uses MacCyrillic, this page probably does too), so I would still want to use it. When the user views a page on a totally different website (nytimes.com), I don't really have any good information from the user about how to make nytimes.com display properly, while I do have information from nytimes.com (it specifies a charset). At that point, we would like to be able to use the good information I have. That's a product call we've made from the Chrome side, if other embedders of WebKit want to do something different that's reasonable too, and I'd like to see if there's some way we can arrive at a solution that allows those decisions to be made. I think we've demonstrated a use case that is broken by the current behavior of making encoding override persistent, you've demonstrated a use case where making encoding override persistent helps, and an (optional) compromise has been offered that satisfies both use cases but doesn't force anyone to change if they don't want to. As has been stated earlier, we're really trying hard to unfork here, so I would appreciate any comments you could offer that could help us move forward.
Comment 21 Alexey Proskuryakov 2009-03-04 06:10:20 PST
I still think that the most promising approach is 2a.

Something to keep in mind is that we are not discussing a new feature - to the contrary, this is a behavior users lived happily with for many years. I'm now seeking advice from some people who may have more information about actual user experience with Safari. If you could find complaints about the current behavior (e.g. in blogs or discussion forums) that would help a lot. If you could find comments saying that Safari behavior is better than Firefox one, that would help a lot, too!

Obviously, the lack of a "Default" item in Chrome menu changes a lot, as explained in comment 15. Do you think that is worth considering the fix to just add this item to Chrome menu?

We should also consider that Firefox has a much more complicated behavior. My understanding is that it remembers the last encoding used for a page, and uses it when re-opening it later, even after a browser restart. And it uses the encoding override for the current document when navigating to a document with unspecified encoding in the same frame. I don't think that we should aim to match that exactly.
Comment 22 Jungshik Shin 2009-03-06 02:12:54 PST
Coming from Firefox, I had wondered what 'Default' is for in Safari as I wrote earlier. I now know what that is. However, there's another difference between Safari's encoding menu on the one hand and Chrome/Firefox/IE's on the other hand. Safari does not indicate what the current encoding is unless it's overriden manually by a user (that is, if it gets the encoding from http header or meta charset, or BOM detection, the encoding menu points to 'Default'). In other browsers, no matter how encoding is determined, the current encoding is indicated in the menu. 

As for adding 'Default' ('clear the override') to Chrome, I thought about that before, but I'm afraid it's 'confusing' to most Chrome users (most of them being IE/Firefox users at least for now). 

Can we agree to disagree for now and introduce a settings to control the behavior? Thanks.
Comment 23 Alexey Proskuryakov 2009-03-06 02:57:49 PST
According to a discussion I had, here is what we think would be the best behavior:

1. Encoding override should persist when navigating to a new page from the same domain.
2. The override should be used as a default when navigating to a page in a different domain (so, it will be used if the latter doesn't specify an encoding).
3. These rules should also be followed when opening a page in a different frame (a different tab, a pop-up etc).

What do you think? I surely hope that we can decide on a common behavior.

Long term, the best direction should obviously be to try and make manual overrides less necessary. The ideal would be to remove the whole encoding menu.
Comment 24 Jungshik Shin 2009-03-09 10:37:31 PDT
Thank you for your reply. 

The easiest one first: As for removing the encoding menu, I think it'll be a really lo....ng term goal :-). I agree that it'll be nice/ideal to live with out the encoding menu, but I doubt it'll become possible even 5 years from now :-) (I really hope this will turn out wrong !)

As for your three proposed behaviors, I think #1 is reasonable as a middle point between what the current Webkit does and what patches uploaded here does. In addition, it can help users in some cases if we implement that right. 

I'm afraid #2 and #3 (#3 may be different from #2) are not such a good idea. When a user moves out of a domain that needs overriding (e.g. mdrf.ru which needs  overriding to MacCyrillic on Mac) and navigates to another domain that does NOT specify the encoding, IMHO, it's better for a browser to use the default encoding configured by the user. Let me take an example. 

1. I'm assuming that most of Russian web sites that do NOT explicitly specify the encoding serve windows-1251-encoded pages to all the users (whether a user comes from Mac or not). 
2. As a result, most Russian users would set the default encoding of their browsers to windows-1251
3. A user visits one of a small number of Russian web sites that serve MacCyrillic (e.g. mdrf.ru ) pages to Mac browsers and overrides the encoding to MacCyrillic
4. After navigating for a while, he moves to another web site, 'bar.ru' (whose encoding is NOT specified but is actually windows-1251).
5. If we do your #2, 'bar.ru' would be garbled although it could've been rendered correctly if we just used the default encoding of the user (in this case, windows-1251). 

Is my assumption in step 1 above valid? Then, I don't have to take another example, I guess. 

Perhaps, a bit more compelling(??) case would be a French speaker (who has a rather good command of Russian and visits Russian web sites from time to time). His default encoding is likely to be windows-1252 (or iso-8859-1. they're treated the same in most web browsers including webkit-based ones as you know well). If we replace 'windows-1251' and 'Russian' with 'windows-1252/iso-8859-1' and 'French' in step 1 and 'bar.run' with 'bar.fr' in step 4 and 5, I guess it becomes rather clear what can be an issue with your proposal #2. 


Let's go back to the patch uploaded here (that is bit-rotten) combined with a preference to control the behavior vs your #1 (and my #2a). You're certainly in favor or doing your #1 (my #2a) and I think your #1 is reasonable.  Would you consider a patch doing the former (current patch updated + a preference) if I promise that I will follow that up with implementing your #1 soon (not likely to be within a few weeks, though)? Thank you in advance for your consideration :-)


 
Comment 25 Alexey Proskuryakov 2009-03-09 11:00:10 PDT
I actually think that #2 makes sense. Perhaps it can be amended as "The override should be used as a default when navigating to a page in a different domain _via clicking a link_". So, a French user reading Russian Web sites will stay with windows-1251 encoding. In my experience, reading foreign language sites is by far the most common reason to override the encoding, and the vast majority of sites in each language either use the same encoding, or have explicitly specified encoding.

Are you proposing to add the preference as a short-term solution, with a plan to standardize on a common behavior soon, and to remove the preference? In my opinion, that's fine. Adding the preference permanently is not totally unacceptable, but I don't think that it would be good engineering.

I just had a thought that modifying WebCore might not be even necessary to reset encoding override on each navigation - aren't there enough callbacks made to detect navigations and reset frame encoding from the client? I didn't investigate this idea to make sure that it's implementable.
Comment 26 Jungshik Shin 2009-03-10 12:01:49 PDT
Thank you for your reply. In the previous comment, I meant 'preference' as a short-term measure until we implement a common behavior. Because you said you're ok with that as a temporary measure, I'll make a patch with that. 

As for achieving what's requested in this bug without changing WebCore, I looked into that possibility a few weeks ago when I figured out what 'Default' in Safari's encoding menu is for (resetting the overridden encoding). My thought was that perhaps we can automatically do what 'selecting Default' triggers. Shortly after that, I gave up that idea, but I don't remember why. I'll take a brief look at doing on the client-side before making a patch for the 'temporary' measure.