This site has pages that begin with:

<?xml version="1.0" encoding="UTF-16"?>
<html xmlns:msxsl="urn:schemas-microsoft-com:xslt">
<head>
<meta content="text/html; charset=windows-1251" http-equiv="Content-Type">

The actual encoding is windows-1251. Since an eight-bit document like this cannot be UTF-16, the encoding named in the XML declaration can and should be ignored. Firefox and MacIE render this site correctly.

The proposed patch fixes this problem and provides test cases for it and for rdar://3182977 ("unicode" encoding handled as UTF-16 rather than UTF-8 at www.delcom-eng.com).
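To illustrate the reasoning above, here is a minimal sketch in Python. It is not the code from the patch. The helper name `xml_declared_encoding` and the set of labels treated as 16-bit are assumptions made for illustration. The idea is that if the XML declaration can be read byte-by-byte as ASCII, a UTF-16 label in it cannot be true and is dropped.

```python
import re

# Hypothetical helper, not WebKit code: locate the encoding label in an XML
# declaration by scanning the raw bytes as if they were ASCII.
_XML_DECL = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')

# Assumed set of labels that imply 16-bit code units.
_SIXTEEN_BIT = {"utf-16", "utf-16le", "utf-16be"}


def xml_declared_encoding(data: bytes):
    """Return the usable encoding label from an XML declaration, or None."""
    match = _XML_DECL.match(data)
    if match is None:
        # No declaration readable as 8-bit text (real UTF-16 input lands
        # here too, because of its BOM and interleaved zero bytes).
        return None
    label = match.group(1).decode("ascii").lower()
    # The declaration was readable one byte per character, so the document
    # cannot really be UTF-16; treat such a label as absent.
    if label in _SIXTEEN_BIT:
        return None
    return label
```

With the page quoted above, the UTF-16 label is discarded, so a later step (such as the <meta> charset) is free to supply windows-1251.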
Created attachment 4525 [details] proposed patch
(the test cases go to fast/encoding)
Comment on attachment 4525 [details]
proposed patch

Um, this doesn't look right, clearing the review flag...
Comment on attachment 4525 [details]
proposed patch

This doesn't look quite right to me. Testing with other browsers long ago, I found that pages marked "UTF-16" (not XML pages, but HTML ones) in <meta> tags were treated as UTF-8 by other browsers. Not default encoding (Windows Latin-1 for the "default default"), but specifically UTF-8. This patch changes that behavior to make some XML cases work better; I think that's incorrect.
Perhaps Decoder should know whether it's decoding HTML or XML (in Firefox, the encoding from <meta> tags doesn't seem to be used for XML)... I'll try to figure out the correct behavior.
Created attachment 4584 [details]
allow meta to override encoding from XML declaration

WinIE (but not Firefox) indeed treats HTML pages marked UTF-16 in <meta> tags as UTF-8; thank you for noticing! This new patch includes a regression test for this, too. Allowing <meta> to override the encoding from the XML declaration seems to match what Firefox does for HTML. For XHTML, Firefox ignores <meta>, but Safari doesn't. The patch doesn't change this, although it also allows <meta> to override the XML declaration there.
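The precedence described in this comment can be sketched roughly as follows. This is an illustration in Python, not the patch itself. The function name `choose_encoding`, its parameters, and the handling of the no-<meta> case are assumptions. The windows-1252 fallback stands in for the Windows Latin-1 "default default" mentioned earlier in this thread.

```python
# Assumed set of labels that imply 16-bit code units.
SIXTEEN_BIT = {"utf-16", "utf-16le", "utf-16be"}


def _normalize(label):
    """Lower-case and trim a label; None stays None."""
    return None if label is None else label.strip().lower()


def choose_encoding(xml_decl_label, meta_label, default="windows-1252"):
    """Pick an encoding for an 8-bit document served as text/html.

    Hypothetical sketch: both labels are assumed to have been found by
    scanning the document as 8-bit text, so neither can truly be UTF-16.
    """
    meta = _normalize(meta_label)
    xml = _normalize(xml_decl_label)
    if meta is not None:
        # <meta> overrides the XML declaration. A UTF-16 label in <meta>
        # is treated as UTF-8, the WinIE behavior discussed in this bug.
        return "utf-8" if meta in SIXTEEN_BIT else meta
    if xml is not None and xml not in SIXTEEN_BIT:
        # Without <meta>, a plausible XML declaration label is honored.
        return xml
    # An impossible or missing label falls back to the default.
    return default
```

For the page in the original report, `choose_encoding("UTF-16", "windows-1251")` yields `"windows-1251"`, while an HTML page whose <meta> says UTF-16 gets `"utf-8"`.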
Comment on attachment 4584 [details]
allow meta to override encoding from XML declaration

I don't understand the logic here. You say that Gecko ignores <meta> elements entirely for "real XHTML". And you say that this patch leaves WebKit respecting <meta> elements for "real XHTML" and goes further, allowing such <meta> tags to override the character set specified in the XML declaration. This sounds like the wrong direction to go if we're looking for compatibility with Gecko. Can you clarify why this is a desirable change?
Comment on attachment 4584 [details]
allow meta to override encoding from XML declaration

I think I see what's going on. This site isn't "real XHTML". It's "XHTML being served with a plain HTML MIME type". I guess in that case we want to match what the other major browsers do. Do they look at the character set in the XML header at all in cases like this?
(In reply to comment #8)
> Do they look at the character set in the XML header at all in cases like this?

Yes, Firefox and Opera do look at it; a test is at <http://nypop.com/~ap/webkit/xhtml.html>. MacIE doesn't. I cannot say about WinIE (browsershots.org doesn't work with it at the moment).

Although it's unfortunate that this patch slightly changes the "real XHTML" behavior, making it less similar to Firefox, I think this can only be handled by making Decoder aware of what kind of source it is parsing, which looks like a separate undertaking.
Comment on attachment 4584 [details]
allow meta to override encoding from XML declaration

OK, I'm convinced now. r=me
Filed the <meta> in "real XHTML" issue as bug 5620.
Bumping priority to P1, because the patch also fixes a regression in bug 5823.
ap: landing would be even easier if you provided the test case in patch form... ChangeLog entry as a bonus. :)
nm, I now see that's included in your patch! Thanks, landing now.