18270 – treating x-user-defined different from windows-1252 breaks some Indian web sites

RESOLVED FIXED 18270

treating x-user-defined different from windows-1252 breaks some Indian web sites

https://bugs.webkit.org/show_bug.cgi?id=18270

Summary treating x-user-defined different from windows-1252 breaks some Indian web sites

Jungshik Shin

Reported 2008-04-01 18:07:55 PDT

1. Download MillenniumVarun font from http://loksatta.com/MillenniumVarun.zip 2. Go to http://loksatta.com/ 3. Half of characters are rendered with empty boxes or question marks. 4. Set the encoding manually to 'Western European' 5. All the characters are rendered correctly. The page has 3 meta charset declarations. The first is 'x-user-defined' and Safari (as well as FF) maps [0x80-0xff] to a PUA block ( U+F780-U+F7FF ?), but MillenniumVarun font installed above does not have any glyph in that range. It's their fault. They should have done either of the following: - use 'Symbol' cmap (rather than Microsoft Windows 1255 cmap) in their font - specify their page encoding as windows-1255 (or iso-8859-1) rather than x-user-defined. webkit treats them synonymously. Nonetheless, MS IE does not have a problem with the above page (and pages like that). I didn't enable webfont for IE so that it's 'thanks to' IE's handling of 'x-user-defined'. Perhaps, webkit also has to treat x-user-defined as an alias to windows-1255. That would save some code, too (TextCodecUserDefined.cpp). There are still a lot of Indian web pages with this problem, I believe although Unicode becomes more and more widespread.

Attachments
patch (layout test missing) (981 bytes, patch) 2008-09-08 17:03 PDT, Jungshik Shin	no flags	Details Formatted Diff Diff
patch with layout test (5.18 KB, patch) 2008-10-10 16:49 PDT, Jungshik Shin	darin: review+	Details Formatted Diff Diff
patch with Darin's review comment addressed (5.15 KB, patch) 2008-10-10 18:28 PDT, Jungshik Shin	darin: review+	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2008-04-01 23:25:43 PDT

See bug 15555, where we decided to match the Firefox implementation. I assume you meant windows-1252, not windows-1255? My understanding is that IE treats x-user-defined differently based on some Registry settings, so matching it fully is not really possible.

Jungshik Shin

Comment 2 2008-04-15 13:10:33 PDT

Yes, I meant windows-1252 :-). Do you know what registry setting controls IE's behavior? It appears that out of the box, IE treats x-user-defined synonymously with Windows-1252. If that's the case, the majority of Indian users would regard Safari as not working for them. Of course, there's a risk of doing this because some web sites may rely on [0x80-0xFF] of x-user-defined being mapped to [U+F780 - U+F7FF] (hope I remember this range correctly). However, given that IE's default appears not to support this, the risk should be rather minimal.

Alexey Proskuryakov

Comment 3 2008-04-16 02:30:37 PDT

No, I don't really understand how x-user-defined works in IE, or how it is controlled. Sources such as <http://msdn2.microsoft.com/en-us/library/aa752010(VS.85).aspx> and <http://www.ntu.edu.sg/home5/pg03053527/Tips/TipsForIE/Tip_for_IE.htm> suggest that it is not just a synonym for windows-1252 though. People are using x-user-defined to work around some XMLHttpRequest restrictions, so it is importatnt to keep thing compatible with Firefox.

Jungshik Shin

Comment 4 2008-09-08 17:03:56 PDT

Created attachment 23280 [details] patch (layout test missing) We can do something like this. I ran webkit-test and no test broke due to this patch. I did spot checks of a few Indian web sites that declare x-user-defined and the fonts used there. Their fonts have glyphs for U+2018 (corresponding to 0x93 in windows-1252) and U+2014 (0x97 in windows-1252) but do not have glyphs for U+0093 or U+0097 (unicode characters for 0x93/0x97 if they're interpreted as ISO-8859-1). There are several dozens (if not more) of 'font-encoded' fonts used for Indian web sites. So, this spot check is far from complete. Some sites declare ISO-8859-1 and their fonts cover indeed ISO-8859-1 repertoire rather than windows-1252 repertoire. What this patch does is to alias 'x-user-defined' to 'windows-1252' and does not break those sites with an "iso-8859-1 font". It's not very clear what IE does, but judging from the fact that a lot of Indian web sites get away with 'x-user-defined' and 'windows-1252 fonts', I guess this patch brings webkit closer to IE's behavior without breaking FF compatibility with XHR. If this is acceptable, I'll make a layout test for this.

Alexey Proskuryakov

Comment 5 2008-09-09 01:32:30 PDT

I don't really like the idea of different decoding rules depending on where the encoding is specified. I'd be happier if the change was made unconditionally. Yes, this poses a theoretical risk for some XHR requests, but I'd prefer a decision to employ an ugly workaround like this to be made based on knowledge of real life severity. + // treat x-user-defined as windows-1252 (bug 18270) With so many WebKit bug trackers around, it's better to use complete URLs like https://bugs.webkit.org/show_bug.cgi?id=18270. > I did spot checks of a few Indian web sites It would be highly desirable to have a list including several sites that are fixed with this patch. Changes like this tend to do both good and harm (and to be incomplete), so having a list of sites that were fixed may help to make an informed decision in the future.

Jungshik Shin

Comment 6 2008-09-18 17:07:47 PDT

40 Indian web sites (mostly newspaper web sites) were tested. Out of 40, 8 use x-user-defined and are 'saved' by this change. I guess, this stat is not a very strong argument. If you want a more convincing stat, I can generate one, but it'll take a while. BTW, I don't like this hack, either. :-)

Alexey Proskuryakov

Comment 7 2008-09-18 23:31:05 PDT

This is very convincing, in my opinion - please do provide the URLs for future reference.

Jungshik Shin

Comment 8 2008-09-23 16:56:52 PDT

Sorry for the delay. Here's the list of sites that are fixed by the change: Web site name Language URL Font Bartaman Patrika Bengali http://www.bartamanpatrika.com/ Aaadurga Ananda Bazar Patrika Bengali http://www.anandabazar.com/ AabpBengali Gujarat Samachar Gujarati http://www.gujaratsamachar.com/ Gopika Kannada Prabha Kannada http://www.kannadaprabha.com KNW-TTNandi Bombay Samachar Gujarati http://bombaysamachar.com/ KrishnaWeb Loksatta Marathi http://www.loksatta.com/ MilleniumVarun Deepika Malayalam http://www.deepika.com/ ML-TTKarthika Arabia365 Malayalam http://www.arabia365.com/ MLW-TTKarthika

Jungshik Shin

Comment 9 2008-10-10 16:49:51 PDT

Created attachment 24281 [details] patch with layout test

Darin Adler

Comment 10 2008-10-10 17:19:28 PDT

Comment on attachment 24281 [details] patch with layout test + * ChangeLog: The ChangeLog itself should not be mentioned in the ChangeLog. The strcasecmp function is generally not good to use anywhere in WebKit because its behavior depends on the current locale (as in the standard C setlocale function). But I see it used in two other places inside the text encoding machinery, so I suppose I can't reject this use. The correct thing to do would be to make an ASCII version of this function like the functions in <wtf/ASCIICType.h>, which exist for the same reason. It could maybe go in WTF, perhaps <wtf/ASCIICString.h> or maybe in PlatformString.h as an overload of equalIgnoringCase. + if (source == EncodingFromMetaTag + && strcasecmp(encoding.name(), "x-user-defined") == 0) + m_decoder.reset("windows-1252"); + else if (source == EncodingFromMetaTag || source == EncodingFromXMLHeader || source == EncodingFromCSSCharset) I think this first if would look better all on one line. It wouldn't be any longer than the next if statement below it. I usually indent the "&&" one extra level to avoid lining it up perfectly with the statement inside the if r=me

Jungshik Shin

Comment 11 2008-10-10 18:28:14 PDT

Created attachment 24286 [details] patch with Darin's review comment addressed Thank you for the review. I addressed your review comments. Can you check this in? I'll file a separate bug about case-sensitive comparision for ASCII strings. (btw, I wonder why 'ChangeLog' contained 'ChangeLog' line because it's generated by PrepareChangelog script. Perhaps, I misused it somehow? ).

Darin Adler

Comment 12 2008-10-11 15:37:33 PDT

(In reply to comment #11) > Thank you for the review. I addressed your review comments. Can you check this > in? I'm not sure I'll have a chance. I'm working on reviewing first; I'll check patches in if I can get through enough of the review queue. But it's in the "to be checked in queue" along with other reviewed patches. I'm sure that one of the committers will get to it eventually. > I'll file a separate bug about case-sensitive comparision for ASCII strings. Thanks. > (btw, I wonder why 'ChangeLog' contained 'ChangeLog' line because it's > generated by PrepareChangelog script. Perhaps, I misused it somehow? ). Three thoughts on that (all minor things): 1) You probably modified the ChangeLog file before running the prepare-ChangeLog script. 2) The prepare-ChangeLog script does include the ChangeLog if you've already modified it before invoking that script. Maybe we should change that, since it's almost never useful. 3) The prepare-ChangeLog script is intended to help you write a change log. Too many people working on this project treat what it writes out as "sacred", but it's really just a tool to save time; it sometimes generates incorrect things like bad function names and we shouldn't just check them in.

Darin Adler

Comment 13 2008-10-12 15:16:18 PDT

Grabbing this bug now. I'm about to land it.

Darin Adler

Comment 14 2008-10-12 15:39:07 PDT

http://trac.webkit.org/changeset/37530

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 525.x (Safari 3.1)

Hardware All

OS All

Product WebKit

Component Page Loading

Assignee

Darin Adler

Reported

2008-04-01 18:07 PDT

Modified

2008-10-12 15:39 PDT History

CC List

2 users Show

URL

http://loksatta.com/

Keywords

Depends on

Blocks