Bug 24473 - RTL: haaretz.co.il - wrong charset in HTTP headers (csISOLatinHebrew instead of Windows-1255) leading to a mirrored rendering
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Evangelism
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Normal
Assignee: Nobody
URL: http://www.haaretz.co.il/captain/page...
Reported: 2009-03-09 16:22 PDT by Jungshik Shin
Modified: 2009-06-04 11:31 PDT
Description Jungshik Shin 2009-03-09 16:22:12 PDT
The page at the URL above is encoded in windows-1255 (and the in-file meta charset says so), but the server emits:

Content-Type: text/html; charset=csISOLatinHebrew

Because csISOLatinHebrew is aliased to ISO-8859-8 (visual) and 'usesVisualOrdering' returns true for ISO-8859-8 (as it should), all the text nodes seem to be reversed before being rendered. 
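The effect described above can be illustrated with a toy model in Python. This is only a sketch, not WebKit's code: the alias table, `canonical_name`, `uses_visual_ordering` and `screen_order` are hypothetical names, and the layout step is reduced to "reverse a pure-Hebrew run". It relies on the fact that Hebrew letters have the same byte values in ISO-8859-8 and windows-1255, so the bytes decode identically and only the assumed ordering differs.

```python
# Toy model only: these tables and helpers are illustrative assumptions,
# not WebKit's actual encoding registry.
ALIASES = {
    "csisolatinhebrew": "iso-8859-8",  # per the report: alias of visual Hebrew
    "iso-8859-8": "iso-8859-8",
    "windows-1255": "windows-1255",
}
PYTHON_CODECS = {"iso-8859-8": "iso8859_8", "windows-1255": "cp1255"}


def canonical_name(label):
    """Map a charset label from a header or meta tag to a canonical name."""
    return ALIASES[label.strip().lower()]


def uses_visual_ordering(name):
    """Visual encodings store text already in left-to-right display order."""
    return name == "iso-8859-8"


def screen_order(raw, label):
    """Return characters left to right as drawn on screen.

    Simplification: assumes `raw` is a single run of Hebrew letters.
    """
    name = canonical_name(label)
    text = raw.decode(PYTHON_CODECS[name])
    if uses_visual_ordering(name):
        return text        # stored order is taken as display order
    return text[::-1]      # logical order: an RTL run is laid out right to left


shalom = "\u05e9\u05dc\u05d5\u05dd"   # Hebrew word, logical order
raw = shalom.encode("cp1255")         # what the page actually contains

correct = screen_order(raw, "windows-1255")
wrong = screen_order(raw, "csISOLatinHebrew")
print(wrong == correct[::-1])
```

Under these assumptions the same bytes come out mirrored when the header label is honored, which matches the symptom in the summary.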

Somehow, Firefox does much less than a complete reversal (here and there, words in Latin letters are reversed but, curiously, Hebrew sentences are not).
IE also recognizes csISOLatinHebrew as 'visual' (View | Encoding has 'Hebrew (ISO Visual)' checked when displaying the page) and reverses the text content for a simple test case, but NOT for real web pages at haaretz.co.il (a popular newspaper site in Israel).

This is obviously an evangelism issue and multiple contacts have been made to ask them to correct the issue, but we haven't heard back. 

One drastic measure would be to alias csISOLatinHebrew to windows-1255, but without knowing how many sites use 'csISOLatinHebrew' with its correct meaning (ISO-8859-8), it's hard to make a decision.

Chromium bug : http://crbug.com/3352
Comment 1 Jeremy Moskovich 2009-03-11 02:03:50 PDT
This is a well-known issue with a long history: numerous people have contacted Haaretz and presented the bug to them over the last few years.

The Chrome bug has additional information, and Mozilla also has documentation on this issue:
https://bugzilla.mozilla.org/show_bug.cgi?id=308187
Comment 2 Jeremy Moskovich 2009-04-23 15:13:11 PDT
Does anyone have anything against adding a site-specific hack in WebKit to ignore the content-type http header for haaretz.co.il?

Considering the significant history of this issue, I think it's safe to say there have been more than enough attempts at outreach.  The site remains broken for WebKit users, which is what ultimately matters.
Comment 3 Darin Adler 2009-04-23 15:37:20 PDT
We could also consider making this work by getting closer to IE behavior. The claim is that IE works because it chokes on the quote marks in the header field and thus ignores the encoding.
Comment 4 Jeremy Moskovich 2009-04-23 15:49:59 PDT
Darin: Please forgive me if I'm misunderstanding, but what makes you think that?

The linked Mozilla & Chrome bugs have some more background on this issue.  As I understand this bug, the issue is that the charset is specified twice:
* Once in the HTTP Content-Type header - bogus value (see Jungshik's analysis).
* Once in the Meta tag - which specifies the charset correctly.

Firefox & WebKit both use the HTTP header rather than the Meta tag and thus the page appears garbled.
IE uses the value in the Meta tag and thus displays the page OK (my understanding is that this is a long-standing IE bug which they can't fix because of sites such as this one).

The Chrome bug has an example of a page that is served from 2 servers, one of which doesn't output the http header and thus looks fine in all browsers.
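The precedence described in this comment (HTTP header first, meta tag only as a fallback) can be sketched in Python as follows. This is a simplified illustration, not any browser's actual algorithm: `header_charset`, `meta_charset`, `pick_charset`, the 1024-byte prescan window and the `windows-1252` default are all assumptions made for the example.

```python
import re

# Simplified meta prescan; real browsers use a more careful algorithm.
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    for param in content_type.split(";")[1:]:
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"')
    return None


def meta_charset(body):
    """Return the charset declared in a <meta> tag near the start, or None."""
    match = META_RE.search(body[:1024])
    return match.group(1).decode("ascii") if match else None


def pick_charset(content_type, body, default="windows-1252"):
    """HTTP header wins; the meta tag is consulted only if the header is silent."""
    from_header = header_charset(content_type) if content_type else None
    return from_header or meta_charset(body) or default


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=windows-1255"></head>')

print(pick_charset("text/html; charset=csISOLatinHebrew", page))
print(pick_charset(None, page))
```

With the header present the bogus label is chosen; without it (as on the second server mentioned above) the meta declaration is used.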
Comment 5 Darin Adler 2009-04-23 15:53:26 PDT
(In reply to comment #4)
> Darin: Please forgive me if I'm misunderstanding, but what makes you think
> that?

This comment <https://bugzilla.mozilla.org/show_bug.cgi?id=308187#c17>.

"MSIE displays Haaretz pages in the correct encoding for the same reason Firefox 1.0 did: it does not support quotes around encoding names in HTTP headers. So it does violate a standard, but not this one."

We could consider making ourselves work with all the same sites IE does by being bug-compatible with them in this respect. And yes, a site-specific quirk for haaretz.co.il is also worth considering.
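A minimal Python sketch of what "bug-compatible" would mean here, assuming the quoted Mozilla comment is accurate: a parser that does not strip quotes sees an unknown label and falls back to other sources. The `KNOWN` set, `charset_param` and the quoted example header are hypothetical; they are not taken from IE or WebKit.

```python
# Hypothetical set of labels the browser recognizes (illustration only).
KNOWN = {"iso-8859-8", "csisolatinhebrew", "windows-1255", "euc-kr", "gbk"}


def charset_param(content_type, unquote):
    """Return a recognized charset label from a Content-Type value, or None.

    unquote=True  : accept charset="label" (quoted-string form)
    unquote=False : keep the quotes, so a quoted label is not recognized
    """
    for param in content_type.split(";")[1:]:
        key, _, value = param.partition("=")
        if key.strip().lower() != "charset":
            continue
        value = value.strip()
        if unquote and len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        label = value.lower()
        return label if label in KNOWN else None
    return None


quoted = 'text/html; charset="csISOLatinHebrew"'   # assumed header form
print(charset_param(quoted, unquote=True))    # label is honored
print(charset_param(quoted, unquote=False))   # label is ignored
```

When the label is ignored, the browser would move on to the meta tag or its default encoding.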
Comment 6 Jeremy Moskovich 2009-04-23 17:07:28 PDT
Thanks Darin. FYI, I used Fiddler to spoof the page request and pass back a version of the Content-Type header without the quotes.

In my test it appears IE7 still picks the meta tag, so while it may be a good idea to match IE's rejection of Content-Type headers containing quotes, it appears that won't fix this issue.
Comment 7 Jeremy Moskovich 2009-04-23 17:10:54 PDT
Please ignore the second paragraph of my previous comment, I hit commit too soon :(
Comment 8 Jungshik Shin 2009-05-07 11:58:47 PDT
First of all, my comment in Mozilla's Bugzilla at https://bugzilla.mozilla.org/show_bug.cgi?id=308187#c20 is wrong. When I wrote that, I thought IE gives meta a higher precedence than http (I had seen so many pages with conflicting http charset and meta charset, with meta being correct, that only worked in IE. I don't know how now). It turned out that it does not. So, Uri's comment at https://bugzilla.mozilla.org/show_bug.cgi?id=308187#c17 is right on.

As for being bug-compatible with IE by choking on quotation marks in the charset param of C-T header fields, I don't feel very comfortable with that. There's a possibility that we may break some web sites that we don't currently break. Sure, they're also broken in IE, but do we want that?  A hypothetical case (perhaps a rare one) is as follows:

1. The default charset of a browser is set to, say, GBK
2. A user visits a page which emits the following HTTP header field:

Content-Type: text/html; charset="EUC-KR"

And, there's no meta charset declaration but the page is encoded in EUC-KR.

3. WebKit and Firefox interpret the page correctly as in EUC-KR

4. IE interprets it as GBK (the default charset). 

IE can get away with this because the majority of visitors to the site are Koreans and their default encoding is set to EUC-KR. 
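The hypothetical case above can be played through in a few lines of Python. This is only a sketch: `decode_page` and its parameters are made-up names, and the "choke on quotes" behavior is modeled simply as discarding a quoted label and using the browser default.

```python
def decode_page(raw, header_label, browser_default, accepts_quoted):
    """Decode page bytes using the header label or the browser default.

    accepts_quoted=False models a browser that ignores charset="..." and
    falls back to its default encoding (assumed behavior).
    """
    label = header_label
    if label.startswith('"') and label.endswith('"'):
        label = label[1:-1] if accepts_quoted else None
    codec = label or browser_default
    return raw.decode(codec, errors="replace")


korean = "\ud55c\uad6d\uc5b4"      # Korean sample text
raw = korean.encode("euc_kr")      # the page is really EUC-KR

# Steps 1-3: default GBK, quoted header, quote-aware browser.
print(decode_page(raw, '"EUC-KR"', "GBK", accepts_quoted=True) == korean)
# Step 4: a browser that chokes on the quotes falls back to GBK.
print(decode_page(raw, '"EUC-KR"', "GBK", accepts_quoted=False) == korean)
# Korean users with an EUC-KR default never notice the problem.
print(decode_page(raw, '"EUC-KR"', "EUC-KR", accepts_quoted=False) == korean)
```

The middle case is the one that would newly break if quote-choking were adopted.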

So, although I'm kinda annoyed by haaretz.co.il's failure/refusal? to fix their bug for so many years, I'm inclined toward special casing it (and hopefully only a few others).


 
Comment 9 Alexey Proskuryakov 2009-05-08 00:51:42 PDT
Special casing seems fine to me in this case, although it's not something we do lightly in general. Being a site-specific hack, it will need to be implemented in a way sensitive to Safari's "Disable Site-specific Hacks" setting.
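One possible shape for such a quirk, sketched in Python rather than WebKit's C++: the host list, the function names and the `site_specific_quirks_enabled` flag are all hypothetical stand-ins for whatever setting and plumbing the real implementation would use. The sketch assumes the quirk simply prefers the meta charset over the HTTP header on the listed hosts.

```python
from urllib.parse import urlsplit

# Hypothetical quirk list; a real one would live next to other site hacks.
QUIRK_HOSTS = ("haaretz.co.il",)


def needs_header_charset_quirk(url, site_specific_quirks_enabled):
    """True if the HTTP charset should be distrusted for this URL."""
    if not site_specific_quirks_enabled:
        return False   # respects the "disable site-specific hacks" setting
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in QUIRK_HOSTS)


def charset_to_use(url, header_charset, meta_charset,
                   site_specific_quirks_enabled=True):
    """Normal rule: header first. Quirk: meta first on listed hosts."""
    if needs_header_charset_quirk(url, site_specific_quirks_enabled):
        return meta_charset or header_charset
    return header_charset or meta_charset


url = "http://www.haaretz.co.il/"
print(charset_to_use(url, "csISOLatinHebrew", "windows-1255"))
print(charset_to_use(url, "csISOLatinHebrew", "windows-1255",
                     site_specific_quirks_enabled=False))
```

Matching on the registered domain and its subdomains keeps the hack from leaking to unrelated hosts that merely share a suffix.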
Comment 10 Jeremy Moskovich 2009-06-01 06:58:15 PDT
Haaretz fixed the issue on their end, closing.
Comment 11 Jeremy Moskovich 2009-06-01 08:46:21 PDT
Reopening, not all pages were fixed, e.g.:
http://themarker.captain.co.il/captain/objects/ResponseDetails.jhtml?resNo=4877847&itemno=1089232&cont=2
Comment 12 Jeremy Moskovich 2009-06-04 11:31:34 PDT
captain.co.il is now fixed as well.