Bug 18270 - treating x-user-defined different from windows-1252 breaks some Indian web sites
Summary: treating x-user-defined different from windows-1252 breaks some Indian web sites
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading (show other bugs)
Version: 525.x (Safari 3.1)
Hardware: All All
: P2 Normal
Assignee: Darin Adler
URL: http://loksatta.com/
Depends on:
Reported: 2008-04-01 18:07 PDT by Jungshik Shin
Modified: 2008-10-12 15:39 PDT (History)
2 users (show)

See Also:

patch (layout test missing) (981 bytes, patch)
2008-09-08 17:03 PDT, Jungshik Shin
no flags Details | Formatted Diff | Diff
patch with layout test (5.18 KB, patch)
2008-10-10 16:49 PDT, Jungshik Shin
darin: review+
Details | Formatted Diff | Diff
patch with Darin's review comment addressed (5.15 KB, patch)
2008-10-10 18:28 PDT, Jungshik Shin
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jungshik Shin 2008-04-01 18:07:55 PDT
1. Download MillenniumVarun font from http://loksatta.com/MillenniumVarun.zip
2. Go to http://loksatta.com/
3. Half of characters are rendered with empty boxes or question marks.
4. Set the encoding manually to 'Western European'
5. All the characters are rendered correctly. 

The page has 3 meta charset declarations. The first is 'x-user-defined' and Safari (as well as FF) maps [0x80-0xff] to a PUA block ( U+F780-U+F7FF ?), but MillenniumVarun font installed above does not have any glyph in that range. 

It's their fault. They should have done either of the following:

 - use 'Symbol' cmap (rather than Microsoft Windows 1255 cmap) in their font
 - specify their page encoding as windows-1255 (or iso-8859-1) rather than x-user-defined. webkit treats them synonymously. 

Nonetheless, MS IE does not have a problem with the above page (and pages like that). I didn't  
enable webfont for IE so that it's 'thanks to' IE's handling of 'x-user-defined'.   

Perhaps, webkit also has to treat x-user-defined as an alias to windows-1255. That would save some code, too (TextCodecUserDefined.cpp).  

There are still a lot of Indian web pages with this problem, I believe although Unicode becomes more and more widespread.
Comment 1 Alexey Proskuryakov 2008-04-01 23:25:43 PDT
See bug 15555, where we decided to match the Firefox implementation.

I assume you meant windows-1252, not windows-1255?

My understanding is that IE treats x-user-defined differently based on some Registry settings, so matching it fully is not really possible.
Comment 2 Jungshik Shin 2008-04-15 13:10:33 PDT
Yes, I meant windows-1252 :-). Do you know what registry setting controls IE's behavior? 

 It appears that out of the box, IE treats x-user-defined synonymously with Windows-1252.  If that's the case, the majority of Indian users would regard  Safari as not working for them.

Of course, there's a risk of doing this because some web sites may rely on [0x80-0xFF] of x-user-defined being mapped to [U+F780 - U+F7FF] (hope I remember this range correctly). However, given that IE's default appears not to support this, the risk should be rather minimal. 

Comment 3 Alexey Proskuryakov 2008-04-16 02:30:37 PDT
No, I don't really understand how x-user-defined works in IE, or how it is controlled. Sources such as <http://msdn2.microsoft.com/en-us/library/aa752010(VS.85).aspx> and <http://www.ntu.edu.sg/home5/pg03053527/Tips/TipsForIE/Tip_for_IE.htm> suggest that it is not just a synonym for windows-1252 though.

People are using x-user-defined to work around some XMLHttpRequest restrictions, so it is importatnt to keep thing compatible with Firefox.
Comment 4 Jungshik Shin 2008-09-08 17:03:56 PDT
Created attachment 23280 [details]
patch (layout test missing)

We can do something like this. I ran webkit-test and no test broke due to this patch.  

I did spot checks of a few Indian web sites that declare x-user-defined and the fonts used there. Their fonts have glyphs for U+2018 (corresponding to 0x93 in windows-1252) and U+2014 (0x97 in windows-1252) but do not have glyphs for U+0093 or U+0097 (unicode characters for 0x93/0x97 if they're interpreted as ISO-8859-1).

There are several dozens (if not more) of 'font-encoded' fonts used for Indian web sites. So, this spot check is far from complete.  Some sites declare ISO-8859-1 and their fonts cover indeed ISO-8859-1 repertoire rather than windows-1252 repertoire. What this patch does is to alias 'x-user-defined' to 'windows-1252' and does not break those sites with an "iso-8859-1 font". 

It's not very clear what IE does, but judging from the fact that a lot of Indian web sites get away with 'x-user-defined' and 'windows-1252 fonts',  I guess this patch brings webkit closer to IE's behavior without breaking FF compatibility with XHR. 

If this is acceptable, I'll make a layout test for this.
Comment 5 Alexey Proskuryakov 2008-09-09 01:32:30 PDT
I don't really like the idea of different decoding rules depending on where the encoding is specified. I'd be happier if the change was made unconditionally. Yes, this poses a theoretical risk for some XHR requests, but I'd prefer a decision to employ an ugly workaround like this to be made based on knowledge of real life severity.

+    // treat x-user-defined as windows-1252 (bug 18270)

With so many WebKit bug trackers around, it's better to use complete URLs like https://bugs.webkit.org/show_bug.cgi?id=18270.

> I did spot checks of a few Indian web sites

It would be highly desirable to have a list including several sites that are fixed with this patch. Changes like this tend to do both good and harm (and to be incomplete), so having a list of sites that were fixed may help to make an informed decision in the future.
Comment 6 Jungshik Shin 2008-09-18 17:07:47 PDT
40 Indian web sites (mostly newspaper web sites) were tested. Out of 40, 8 use x-user-defined and are 'saved' by this change.  I guess, this stat is not a very strong argument. If you want a more convincing stat, I can generate one, but it'll take a while. 

BTW, I don't like this hack, either. :-)

Comment 7 Alexey Proskuryakov 2008-09-18 23:31:05 PDT
This is very convincing, in my opinion - please do provide the URLs for future reference.
Comment 8 Jungshik Shin 2008-09-23 16:56:52 PDT
Sorry for the delay. Here's the list of sites that are fixed by the change:

Web site name           Language        URL                             Font  

Bartaman Patrika	Bengali		http://www.bartamanpatrika.com/	Aaadurga
Ananda Bazar Patrika	Bengali		http://www.anandabazar.com/	AabpBengali
Gujarat Samachar	Gujarati	http://www.gujaratsamachar.com/	Gopika
Kannada Prabha		Kannada		http://www.kannadaprabha.com	KNW-TTNandi
Bombay Samachar		Gujarati	http://bombaysamachar.com/	KrishnaWeb
Loksatta 		Marathi		http://www.loksatta.com/	MilleniumVarun
Deepika			Malayalam	http://www.deepika.com/		ML-TTKarthika
Arabia365		Malayalam	http://www.arabia365.com/	MLW-TTKarthika

Comment 9 Jungshik Shin 2008-10-10 16:49:51 PDT
Created attachment 24281 [details]
patch with layout test
Comment 10 Darin Adler 2008-10-10 17:19:28 PDT
Comment on attachment 24281 [details]
patch with layout test

+        * ChangeLog:

The ChangeLog itself should not be mentioned in the ChangeLog.

The strcasecmp function is generally not good to use anywhere in WebKit because its behavior depends on the current locale (as in the standard C setlocale function). But I see it used in two other places inside the text encoding machinery, so I suppose I can't reject this use. The correct thing to do would be to make an ASCII version of this function like the functions in <wtf/ASCIICType.h>, which exist for the same reason. It could maybe go in WTF, perhaps <wtf/ASCIICString.h> or maybe in PlatformString.h as an overload of equalIgnoringCase.

+    if (source == EncodingFromMetaTag
+        && strcasecmp(encoding.name(), "x-user-defined") == 0)
+        m_decoder.reset("windows-1252"); 
+    else if (source == EncodingFromMetaTag || source == EncodingFromXMLHeader || source == EncodingFromCSSCharset)        

I think this first if would look better all on one line. It wouldn't be any longer than the next if statement below it. I usually indent the "&&" one extra level to avoid lining it up perfectly with the statement inside the if

Comment 11 Jungshik Shin 2008-10-10 18:28:14 PDT
Created attachment 24286 [details]
patch with Darin's review comment addressed

Thank you for the review. I addressed your review comments. Can you check this in? 

I'll file a separate bug about case-sensitive comparision for ASCII strings. (btw, I wonder why 'ChangeLog' contained 'ChangeLog' line because it's generated by PrepareChangelog script. Perhaps, I misused it somehow? ).
Comment 12 Darin Adler 2008-10-11 15:37:33 PDT
(In reply to comment #11)
> Thank you for the review. I addressed your review comments. Can you check this
> in?

I'm not sure I'll have a chance. I'm working on reviewing first; I'll check patches in if I can get through enough of the review queue. But it's in the "to be checked in queue" along with other reviewed patches. I'm sure that one of the committers will get to it eventually.

> I'll file a separate bug about case-sensitive comparision for ASCII strings.


> (btw, I wonder why 'ChangeLog' contained 'ChangeLog' line because it's
> generated by PrepareChangelog script. Perhaps, I misused it somehow? ). 

Three thoughts on that (all minor things):

    1) You probably modified the ChangeLog file before running the prepare-ChangeLog script.

    2) The prepare-ChangeLog script does include the ChangeLog if you've already modified it before invoking that script. Maybe we should change that, since it's almost never useful.

    3) The prepare-ChangeLog script is intended to help you write a change log. Too many people working on this project treat what it writes out as "sacred", but it's really just a tool to save time; it sometimes generates incorrect things like bad function names and we shouldn't just check them in.
Comment 13 Darin Adler 2008-10-12 15:16:18 PDT
Grabbing this bug now. I'm about to land it.
Comment 14 Darin Adler 2008-10-12 15:39:07 PDT