Bug 18270

Summary: treating x-user-defined different from windows-1252 breaks some Indian web sites
Product: WebKit Reporter: Jungshik Shin <jshin>
Component: Page LoadingAssignee: Darin Adler <darin>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, ojan
Priority: P2    
Version: 525.x (Safari 3.1)   
Hardware: All   
OS: All   
URL: http://loksatta.com/
Attachments:
Description Flags
patch (layout test missing)
none
patch with layout test
darin: review+
patch with Darin's review comment addressed darin: review+

Jungshik Shin
Reported 2008-04-01 18:07:55 PDT
1. Download MillenniumVarun font from http://loksatta.com/MillenniumVarun.zip 2. Go to http://loksatta.com/ 3. Half of characters are rendered with empty boxes or question marks. 4. Set the encoding manually to 'Western European' 5. All the characters are rendered correctly. The page has 3 meta charset declarations. The first is 'x-user-defined' and Safari (as well as FF) maps [0x80-0xff] to a PUA block ( U+F780-U+F7FF ?), but MillenniumVarun font installed above does not have any glyph in that range. It's their fault. They should have done either of the following: - use 'Symbol' cmap (rather than Microsoft Windows 1255 cmap) in their font - specify their page encoding as windows-1255 (or iso-8859-1) rather than x-user-defined. webkit treats them synonymously. Nonetheless, MS IE does not have a problem with the above page (and pages like that). I didn't enable webfont for IE so that it's 'thanks to' IE's handling of 'x-user-defined'. Perhaps, webkit also has to treat x-user-defined as an alias to windows-1255. That would save some code, too (TextCodecUserDefined.cpp). There are still a lot of Indian web pages with this problem, I believe although Unicode becomes more and more widespread.
Attachments
patch (layout test missing) (981 bytes, patch)
2008-09-08 17:03 PDT, Jungshik Shin
no flags
patch with layout test (5.18 KB, patch)
2008-10-10 16:49 PDT, Jungshik Shin
darin: review+
patch with Darin's review comment addressed (5.15 KB, patch)
2008-10-10 18:28 PDT, Jungshik Shin
darin: review+
Alexey Proskuryakov
Comment 1 2008-04-01 23:25:43 PDT
See bug 15555, where we decided to match the Firefox implementation. I assume you meant windows-1252, not windows-1255? My understanding is that IE treats x-user-defined differently based on some Registry settings, so matching it fully is not really possible.
Jungshik Shin
Comment 2 2008-04-15 13:10:33 PDT
Yes, I meant windows-1252 :-). Do you know what registry setting controls IE's behavior? It appears that out of the box, IE treats x-user-defined synonymously with Windows-1252. If that's the case, the majority of Indian users would regard Safari as not working for them. Of course, there's a risk of doing this because some web sites may rely on [0x80-0xFF] of x-user-defined being mapped to [U+F780 - U+F7FF] (hope I remember this range correctly). However, given that IE's default appears not to support this, the risk should be rather minimal.
Alexey Proskuryakov
Comment 3 2008-04-16 02:30:37 PDT
No, I don't really understand how x-user-defined works in IE, or how it is controlled. Sources such as <http://msdn2.microsoft.com/en-us/library/aa752010(VS.85).aspx> and <http://www.ntu.edu.sg/home5/pg03053527/Tips/TipsForIE/Tip_for_IE.htm> suggest that it is not just a synonym for windows-1252 though. People are using x-user-defined to work around some XMLHttpRequest restrictions, so it is importatnt to keep thing compatible with Firefox.
Jungshik Shin
Comment 4 2008-09-08 17:03:56 PDT
Created attachment 23280 [details] patch (layout test missing) We can do something like this. I ran webkit-test and no test broke due to this patch. I did spot checks of a few Indian web sites that declare x-user-defined and the fonts used there. Their fonts have glyphs for U+2018 (corresponding to 0x93 in windows-1252) and U+2014 (0x97 in windows-1252) but do not have glyphs for U+0093 or U+0097 (unicode characters for 0x93/0x97 if they're interpreted as ISO-8859-1). There are several dozens (if not more) of 'font-encoded' fonts used for Indian web sites. So, this spot check is far from complete. Some sites declare ISO-8859-1 and their fonts cover indeed ISO-8859-1 repertoire rather than windows-1252 repertoire. What this patch does is to alias 'x-user-defined' to 'windows-1252' and does not break those sites with an "iso-8859-1 font". It's not very clear what IE does, but judging from the fact that a lot of Indian web sites get away with 'x-user-defined' and 'windows-1252 fonts', I guess this patch brings webkit closer to IE's behavior without breaking FF compatibility with XHR. If this is acceptable, I'll make a layout test for this.
Alexey Proskuryakov
Comment 5 2008-09-09 01:32:30 PDT
I don't really like the idea of different decoding rules depending on where the encoding is specified. I'd be happier if the change was made unconditionally. Yes, this poses a theoretical risk for some XHR requests, but I'd prefer a decision to employ an ugly workaround like this to be made based on knowledge of real life severity. + // treat x-user-defined as windows-1252 (bug 18270) With so many WebKit bug trackers around, it's better to use complete URLs like https://bugs.webkit.org/show_bug.cgi?id=18270. > I did spot checks of a few Indian web sites It would be highly desirable to have a list including several sites that are fixed with this patch. Changes like this tend to do both good and harm (and to be incomplete), so having a list of sites that were fixed may help to make an informed decision in the future.
Jungshik Shin
Comment 6 2008-09-18 17:07:47 PDT
40 Indian web sites (mostly newspaper web sites) were tested. Out of 40, 8 use x-user-defined and are 'saved' by this change. I guess, this stat is not a very strong argument. If you want a more convincing stat, I can generate one, but it'll take a while. BTW, I don't like this hack, either. :-)
Alexey Proskuryakov
Comment 7 2008-09-18 23:31:05 PDT
This is very convincing, in my opinion - please do provide the URLs for future reference.
Jungshik Shin
Comment 8 2008-09-23 16:56:52 PDT
Sorry for the delay. Here's the list of sites that are fixed by the change: Web site name Language URL Font Bartaman Patrika Bengali http://www.bartamanpatrika.com/ Aaadurga Ananda Bazar Patrika Bengali http://www.anandabazar.com/ AabpBengali Gujarat Samachar Gujarati http://www.gujaratsamachar.com/ Gopika Kannada Prabha Kannada http://www.kannadaprabha.com KNW-TTNandi Bombay Samachar Gujarati http://bombaysamachar.com/ KrishnaWeb Loksatta Marathi http://www.loksatta.com/ MilleniumVarun Deepika Malayalam http://www.deepika.com/ ML-TTKarthika Arabia365 Malayalam http://www.arabia365.com/ MLW-TTKarthika
Jungshik Shin
Comment 9 2008-10-10 16:49:51 PDT
Created attachment 24281 [details] patch with layout test
Darin Adler
Comment 10 2008-10-10 17:19:28 PDT
Comment on attachment 24281 [details] patch with layout test + * ChangeLog: The ChangeLog itself should not be mentioned in the ChangeLog. The strcasecmp function is generally not good to use anywhere in WebKit because its behavior depends on the current locale (as in the standard C setlocale function). But I see it used in two other places inside the text encoding machinery, so I suppose I can't reject this use. The correct thing to do would be to make an ASCII version of this function like the functions in <wtf/ASCIICType.h>, which exist for the same reason. It could maybe go in WTF, perhaps <wtf/ASCIICString.h> or maybe in PlatformString.h as an overload of equalIgnoringCase. + if (source == EncodingFromMetaTag + && strcasecmp(encoding.name(), "x-user-defined") == 0) + m_decoder.reset("windows-1252"); + else if (source == EncodingFromMetaTag || source == EncodingFromXMLHeader || source == EncodingFromCSSCharset) I think this first if would look better all on one line. It wouldn't be any longer than the next if statement below it. I usually indent the "&&" one extra level to avoid lining it up perfectly with the statement inside the if r=me
Jungshik Shin
Comment 11 2008-10-10 18:28:14 PDT
Created attachment 24286 [details] patch with Darin's review comment addressed Thank you for the review. I addressed your review comments. Can you check this in? I'll file a separate bug about case-sensitive comparision for ASCII strings. (btw, I wonder why 'ChangeLog' contained 'ChangeLog' line because it's generated by PrepareChangelog script. Perhaps, I misused it somehow? ).
Darin Adler
Comment 12 2008-10-11 15:37:33 PDT
(In reply to comment #11) > Thank you for the review. I addressed your review comments. Can you check this > in? I'm not sure I'll have a chance. I'm working on reviewing first; I'll check patches in if I can get through enough of the review queue. But it's in the "to be checked in queue" along with other reviewed patches. I'm sure that one of the committers will get to it eventually. > I'll file a separate bug about case-sensitive comparision for ASCII strings. Thanks. > (btw, I wonder why 'ChangeLog' contained 'ChangeLog' line because it's > generated by PrepareChangelog script. Perhaps, I misused it somehow? ). Three thoughts on that (all minor things): 1) You probably modified the ChangeLog file before running the prepare-ChangeLog script. 2) The prepare-ChangeLog script does include the ChangeLog if you've already modified it before invoking that script. Maybe we should change that, since it's almost never useful. 3) The prepare-ChangeLog script is intended to help you write a change log. Too many people working on this project treat what it writes out as "sacred", but it's really just a tool to save time; it sometimes generates incorrect things like bad function names and we shouldn't just check them in.
Darin Adler
Comment 13 2008-10-12 15:16:18 PDT
Grabbing this bug now. I'm about to land it.
Darin Adler
Comment 14 2008-10-12 15:39:07 PDT
Note You need to log in before you can comment on or make changes to this bug.