WebKit Bugzilla
RESOLVED INVALID
66185
Sniff UTF-8 instead of defaulting to WINDOWS-1252 (or other locale defaults)
https://bugs.webkit.org/show_bug.cgi?id=66185
Leif Halvard Silli
Reported
2011-08-13 06:30:55 PDT
ISSUE: When an HTML page is UTF-8 encoded, but there is no page-internal encoding declaration, no BOM, and also no accompanying external encoding info in HTTP or MIME, then WebKit will default to WINDOWS-1252 instead of sniffing the encoding to be UTF-8.

BACKGROUND: HTML5's encoding sniffing algorithm, step 7, states:

]] The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream. Such algorithms may use information about the resource other than the resource's contents, including the address of the resource. If autodetection succeeds in determining a character encoding, then return that encoding, with the confidence tentative, and abort these steps. [UNIVCHARDET]

[Note:] The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. User-agents are therefore encouraged to search for this common encoding. [PPUTF8] [UTF8DET] [[

HOW TO REPRODUCE THIS BUG:
1. Verify that WebKit's encoding choice is set to Default (or Automatic)
2. Open the page
http://malform.no/testing/html5/bom/normal-HTML-BOMless-HTTPcharsetLESS
(That HTML page has no BOM, no accompanying external encoding info in the HTTP Content-Type: header, and no internal encoding declaration.)

EXPECTED RESULT: WebKit should sniff the page to be UTF-8 encoded.

ACTUAL RESULT: WebKit instead defaults to WINDOWS-1252 (more correctly: to the default encoding for the current locale).

COMMENTS:
* By default, Chrome, Opera and IE (at least version 8) do *NOT* have this bug
* By default, Firefox *DOES* have this bug
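The "highly detectable bit pattern" the spec quote refers to can be sketched as a small validator. This is an illustrative sketch only, not WebKit code; a production detector would additionally reject overlong sequences, surrogates, and code points above U+10FFFF:

```python
def is_probably_utf8(data: bytes) -> bool:
    """Check data against the UTF-8 bit pattern:
         ASCII         0xxxxxxx
         2-byte lead   110xxxxx, then 1 continuation byte
         3-byte lead   1110xxxx, then 2 continuation bytes
         4-byte lead   11110xxx, then 3 continuation bytes
         continuation  10xxxxxx
    Return True only if at least one multi-byte sequence was seen:
    a pure-ASCII stream decodes identically in UTF-8 and
    Windows-1252, so there is nothing for a sniffer to decide."""
    pending = 0            # continuation bytes still expected
    saw_multibyte = False
    for b in data:
        if pending:
            if b & 0xC0 != 0x80:      # not 10xxxxxx
                return False
            pending -= 1
        elif b < 0x80:                # ASCII
            continue
        elif b & 0xE0 == 0xC0:        # 110xxxxx
            pending, saw_multibyte = 1, True
        elif b & 0xF0 == 0xE0:        # 1110xxxx
            pending, saw_multibyte = 2, True
        elif b & 0xF8 == 0xF0:        # 11110xxx
            pending, saw_multibyte = 3, True
        else:                         # stray continuation / invalid lead
            return False
    return pending == 0 and saw_multibyte
```

For example, a Norwegian word encoded as Latin-1 fails the check immediately, because its single high byte is not followed by the continuation bytes the pattern demands, while the UTF-8 encoding of the same word passes.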
Alexey Proskuryakov
Comment 1
2011-08-13 20:19:51 PDT
WebKit does not default to Latin-1 itself; it uses whatever preference an embedding application has set. For example, you can set UTF-8 as the default encoding in Safari's View->Text Encoding menu.
Alexey Proskuryakov
Comment 2
2011-08-13 20:21:20 PDT
Note that sniffing for UTF-8 is not a requirement in HTML5, it's only a MAY permission. This could be an enhancement request, but if treated as a Major bug, it's INVALID.
Leif Halvard Silli
Comment 3
2011-08-14 03:35:05 PDT
(In reply to
comment #2
)
> Note that sniffing for UTF-8 is not a requirement in HTML5, it's only a MAY permission. This could be an enhancement request, but if treated as a Major bug, it's INVALID.
It must surely be invalid to say that the bug is invalid because of the importance attributed to it; I have never experienced such a thing with any bug I filed anywhere before. If you acknowledge the bug but disagree with the importance level, then the logical thing would be to accept the bug and adjust the importance level.
Alexey Proskuryakov
Comment 4
2011-08-14 11:02:45 PDT
If it were an enhancement request, I'd WONTFIX it, because there is no reason to fix it without a clearly demonstrated benefit.
Leif Halvard Silli
Comment 5
2011-08-14 12:07:37 PDT
The encoding sniffing algorithm of HTML5 is, btw, not created for fun. What kind of benefit are you looking for? How do you want it demonstrated? Some benefits:
1) Better interoperability with XML, whose default is UTF-8.
2) Better interoperability with the other, dominating Web browsers, which do sniff.
3) Better results for the user, as it is difficult to guess UTF-8 unless it is UTF-8.
4) UTF-8 is the trend and the goal.
What more do you need to see demonstrated? What makes you doubt the benefits?
Alexey Proskuryakov
Comment 6
2011-08-14 21:29:20 PDT
Any change should benefit someone. That someone could be users, Web page developers, or browser developers (in order of decreasing importance).

Implementing UTF-8 sniffing in WebKit will not benefit users, because there are no known pages that we display incorrectly because of this. But no sniffing algorithm is perfect, so there is risk of false positive detection, and some real life pages may get broken.

It will not benefit Web developers, because it would make WebKit behavior less predictable. For best compatibility, they will still need to declare charset explicitly, and when they forget to, they risk that WebKit or some other browser won't detect charset. Note that different engines will implement sniffing differently, increasing the burden on Web developers.

It will not benefit browser developers. For us, it's just more code, with its own bugs, including possible security ones. Widely implementing a useless MAY-level feature will also mean that authors will start relying on it (intentionally or not), which further increases the barrier of entry for new browsers, hurting competition, and eventually end users, too.

Without strong evidence of end users getting incorrectly decoded pages because of this, implementing UTF-8 sniffing in WebKit will be a clear loss for every group listed above.
Leif Halvard Silli
Comment 7
2011-08-15 02:35:51 PDT
(In reply to
comment #6
)
> Any change should benefit someone. That someone could be users, Web page developers, or browser developers (in order of decreasing importance).
+1
> Implementing UTF-8 sniffing in WebKit will not benefit users, because there are no known pages that we display incorrectly because of this.
- Do you assume that authors always test their pages in WebKit?
- Do you question one of the most important principles in the design of HTML5 (namely: that UAs should behave the same way, since authors often test in a single browser)?
- Is it only a matter of "display"? What if the UTF-8 page does not display any non-ASCII _letters_ but, for instance, contains directly typed no-break space characters? This is enough for Chrome to sniff it as UTF-8. Chrome and IE will then send the form UTF-8 encoded, while WebKit will use Windows-1252.
- Furthermore, as WebKit seems to reuse its HTML parsing code as much as possible in its XML parser, implementing UTF-8 detection could perhaps also improve the current (not so perfect) handling of UTF-8 in XML pages.
> But no sniffing algorithm is perfect, so there is risk of false positive detection, and some real life pages may get broken.
This sounds more like FUD than a real argument. (But I hope someone who can explain UTF-8 detection better than I can, can step in.)
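One way to see why false positives are unlikely in practice: a legacy-encoded page only validates as UTF-8 if every byte above 0x7F happens to be followed by exactly the right continuation bytes, which natural text almost never supplies. A quick sketch using Python's strict decoder (an illustration, not WebKit's actual code):

```python
def validates_as_utf8(data: bytes) -> bool:
    """True if the byte stream is well-formed UTF-8 under a
    strict decoder, i.e. a sniffer could accept it as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# 'é' in Windows-1252 is the single byte 0xE9. As a UTF-8 lead
# byte, 0xE9 would require two continuation bytes (10xxxxxx),
# but the following byte here is an ASCII space, so strict
# decoding fails and the sniffer correctly rejects the stream.
cp1252_bytes = "café au lait".encode("windows-1252")
utf8_bytes = "café au lait".encode("utf-8")
```

So a Windows-1252 page is essentially only misdetected when its high bytes form accidental UTF-8 sequences, which is why detectors can afford to be strict.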
> It will not benefit Web developers, because it would make WebKit behavior less predictable. For best compatibility, they will still need to declare charset explicitly, and when they forget to, they risk that WebKit or some other browser won't detect charset. Note that different engines will implement sniffing differently, increasing the burden on Web developers.
This does not sound convincing. At least 2 major UAs (Chrome/IE) *do* perform detection. WebKit would become the 3rd, which in itself would be an argument in favour of also implementing UTF-8 detection in the fourth (Firefox). Etc. I fail to see that this would be bad for developers. As for different implementations: if that becomes an issue, then this, too, can be standardized in a spec. Furthermore: Chrome has already implemented it, so you may have access to an open source implementation that you, the developers, can reuse.
> It will not benefit browser developers. For us, it's just more code, with its own bugs, including possible security ones. Widely implementing a useless MAY-level feature will also mean that authors will start relying on it (intentionally or not), which further increases the barrier of entry for new browsers, hurting competition, and eventually end users, too.
Here you admit what I spoke about above: Authors might test a page in Chrome only - or in IE only. I fail to see that it is worse to rely on UTF-8 detection than it is to rely on the Windows-1252 default. (On the contrary: it is better to rely on UTF-8, due to its many benefits.)
> Without strong evidence of end users getting incorrectly decoded pages because of this, implementing UTF-8 sniffing in WebKit will be a clear loss for every group listed above.
If this - "clear loss for every group listed above" - is how you see it, then you should perhaps file a bug against HTML5, to test your arguments? Meanwhile, I have answered all your claims. I hope that someone who can more convincingly argue in favour of UTF-8 detection will also comment on your arguments.
Leif Halvard Silli
Comment 8
2011-08-15 03:17:42 PDT
One rather familiar use case:
1. A UTF-8 encoded Web page is served with encoding info inside the HTTP Content-Type: header only - and not inside the document.
2. A Safari user saves the page as source code to the desktop.
3. The user opens the page from the desktop in Safari.
HOPEFUL RESULT: Safari will use UTF-8 detection and use the same encoding as online.
ACTUAL RESULT: Safari defaults to Windows-1252
Alexey Proskuryakov
Comment 9
2011-08-15 08:52:55 PDT
> If this - "clearl loss for every group listed above" - is how you see it, then you should perhaps file a bug against HTML5, to test your arguments?
It's a MAY, not a MUST, so I don't care. Please respect Bugzilla etiquette, and don't re-open bugs unless you have a good reason to. I have already told you that a good reason would be presenting real life Web pages that misrender or misbehave when UTF-8 sniffing is not performed.
Leif Halvard Silli
Comment 10
2011-08-15 16:46:46 PDT
(In reply to
comment #9
)
> […] a good reason would be presenting real life Web pages that misrender or misbehave when UTF-8 sniffing is not performed.
(1) Pages without internal encoding info will misrender/misbehave when saved to the harddisk. Examples: *
http://store.apple.com/no
(and
http://store.apple.com/dk
and
http://store.apple.com/se
) *
http://ntntv.gov.eg/
NOTES:
- When saving the Apple Store page as source code with Safari and reloading the saved page, it works fine. But if I open the page in another WebKit browser - e.g. iCab - then it does not work fine any more (and the opposite too: if I save with iCab and open in Safari, then it doesn't work). So, clearly, Safari does something else *instead* of the UTF-8 detection algorithm - maybe it is related to features of that page, or maybe Safari stores some metadata somewhere.
- In contrast, page number 2 (
http://ntntv.gov.eg/
) is misrendered as soon as it is saved to the hard disk and reloaded.

(2) HTML5's encoding sniffing algorithm lists 10 locales whose suggested default encoding is UTF-8. One would think that this leads to several UTF-8 encoded pages that need to be sniffed when the user does not use one of those locales. Here are two examples:
a)
http://iranlinkbox.ir
NOTES: The reason why that Iranian page is misrendered in WebKit is that the <meta@charset> element in the DOM is located in the <body> rather than in the <head>. The reason why Firefox nevertheless handles it is that it implements step 3 of the algorithm (which is also a MAY), where it searches for @charset in the first 1024 bytes; if one removes the <meta@charset> from the page, then it fails in Firefox. But in Chrome, the encoding is detected even if the <meta@charset> is removed - thus we can conclude that it is the UTF-8 detection that steps in.
b)
http://www.galenika.rs/index.php?lang=RUS
NOTES: This page has very malformed internal encoding info. Therefore it fails in both Safari and Firefox. But in Chrome, Opera and IE it works.
Leif Halvard Silli
Comment 11
2011-08-15 16:49:06 PDT
(In reply to
comment #10
) Reopening, because I documented real-life web pages that misrender because of the lack of encoding detection.
Alexey Proskuryakov
Comment 12
2011-08-15 17:01:13 PDT
OK, let's track such pages here, and if the list starts looking important, we can consider using sniffing.
Leif Halvard Silli
Comment 13
2011-08-15 17:24:39 PDT
I mentioned two kinds of examples: 1) pages without internal @charset which misrender if you save them to the hard disk, and 2) pages which misrender online. Should we track both?
Alexey Proskuryakov
Comment 14
2011-08-15 17:31:20 PDT
There is no doubt that pages of the first kind are numerous, no need to list those here.
Leif Halvard Silli
Comment 15
2011-08-16 04:20:57 PDT
(In reply to
comment #10
)
> a)
http://iranlinkbox.ir
(For the record, *all* pages of the iranlinkbox.ir Web site seem to be affected.) More affected examples: * Most or all sub domains of
http://javanblog.com
- (From the site's META element data: "JavanBlog.com - Free persian weblog service For Persian Users - The Best & Most Professional Persian's Internet Community - Where Users Can Start a Blog based on Their Diaries & Journals Easily.") Examples of broken sub domains: #
http://tejaratclick.javanblog.com
#
http://urd.javanblog.com/post-18841.htmlداغ
(Wikipedia link to that page found here:
http://az.wikipedia.org/wiki/رزن_بولگهسی#sitat_qeyd-2
) #
http://bavil.javanblog.com/
(Wikipedia links to that page:
http://fa.wikipedia.org/wiki/دهستان_باویل#.D9.85.D9.86.D8.A7.D8.A8.D8.B9
)
Leif Halvard Silli
Comment 16
2011-08-16 05:11:53 PDT
Another example *
http://www.panet.co.il/
(An Arabic language website under the .il domain) Some concrete example pages: #
http://www.panet.co.il/online/index.html
#
http://www.panet.co.il/online/talkback/219335.html
#
http://www.panet.co.il/online/talkback/
Leif Halvard Silli
Comment 17
2011-08-16 05:28:30 PDT
(In reply to
comment #16
)
> Another example > > *
http://www.panet.co.il/
(An Arabic language website under the .il domain) > Some concrete example pages: > #
http://www.panet.co.il/online/index.html
> #
http://www.panet.co.il/online/talkback/219335.html
> #
http://www.panet.co.il/online/talkback/
Actually, it seems primarily/only to be the pages inside the /talkback/ directory that are affected.
Leif Halvard Silli
Comment 18
2011-08-24 04:29:02 PDT
http://www.servicecompaniet.no/
(which contains a subframe at
http://service.pointon.no/ServiceRegistration/ServiceReg.aspx
) Description:
* Technically, the encoding issue of the above page could be solved by adhering to the charset parameter of the HTTP Content-Type: header, which specifies UTF-8.
* Instead, Safari/WebKit (but not Chrome or other browsers) defaults to the default encoding (probably because the "mother frame" defaults to the default encoding).
How to experience the issue:
* Go to this page
http://www.servicecompaniet.no/
The page uses a Western encoding.
* Click one of the links with the text 'registrer på web'. These links are located inside the sub frame.
* You then get a forms page in the subframe. The forms page is UTF-8 encoded, and the encoding is set via the charset parameter inside the HTTP Content-Type: header.
EXPECTED: That Safari picks up the encoding.
ACTUAL RESULT: Safari ignores the HTTP header and defaults to the default encoding instead. Chrome seems to sniff to solve this problem.
Alexey Proskuryakov
Comment 19
2011-08-24 10:34:27 PDT
> ACTUAL RESULT: Safari ignores the HTTP header and defaults to the default encoding instead.
How did you decide that a default encoding is used for this frame? I'm pretty sure that we're correctly using UTF-8 as specified by Content-Type header field.
Leif Halvard Silli
Comment 20
2011-08-24 13:11:35 PDT
(In reply to
comment #19
)
> > ACTUAL RESULT: Safari ignores the HTTP header and defaults to the default encoding instead. > > How did you decide that a default encoding is used for this frame? I'm pretty sure that we're correctly using UTF-8 as specified by Content-Type header field.
Hm. Answer: By testing several times on my Leopard Mac and my Snow Leopard Mac, in iCab and in Safari. I looked at the letters, and they were malformed like when UTF-8 is read as Win-1252. I also had to use the form, and the form was sent as Win-1252 rather than UTF-8. But now, several hours later, things seem to work - except when I manually select the Western encoding, which is expected, given the way Safari works. So either I nevertheless made an error when I tested - e.g. I may, despite being aware of the problem, have manually selected Win-1252 in the current tab - or possibly the page has somehow been updated during the day. Given the uncertainty, I of course withdraw my claims about this page.
Leif Halvard Silli
Comment 21
2011-09-25 11:45:27 PDT
WebKit does not pick up this page as UTF-8:
http://www.kitanosawa.com/hp_php_html/index.php
* contains <meta http-equiv="content-style-type" content="text/css; charset=utf-8" /> (for CSS) instead of a proper charset tag
* Firefox picks it up correctly if "Japanese encoding sniffing" is used (CharDet)
* Chrome picks up the encoding
* Opera picks up the encoding
* IE9 does *not* pick up the encoding
Eric Seidel (no email)
Comment 22
2012-10-27 01:49:45 PDT
I got lost somewhere along this long argument, sorry. If you have specific examples of pages we fail to sniff correctly, but that other browsers do, please open a separate bug, and I or others are happy to take a look. I'm sorry this bug got off the rails as it did, but I'm closing for now. Thanks for the bug!