Bug 66056 - The XML parser doesn't ignore user's encoding choice for XML files
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: XML
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Major
Assignee: Nobody
URL: http://malform.no/testing/html5/bom/x...
Keywords:
Depends on:
Blocks: 66106
Reported: 2011-08-11 07:16 PDT by Leif Halvard Silli
Modified: 2011-08-13 16:59 PDT

Description Leif Halvard Silli 2011-08-11 07:16:53 PDT
ISSUE: 

   Webkit fails to *ignore* user's choice of encoding for XML files.

BACKGROUND:

   According to section 4.3.3 of the XML 1.0 spec, it is a FATAL ERROR if the page is in an encoding other than the declared (explicitly or implicitly/default) encoding:

]]
   In the absence of information provided by an external transport protocol 
   (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding
   declaration to be presented to the XML processor in an encoding other 
   than that named in the declaration, or for an entity which begins with 
   neither a Byte Order Mark nor an encoding declaration to use an encoding
   other than UTF-8.
[[

THUS: It ought to be impossible (aka "FATAL ERROR") to interpret the page with an encoding other than the declared (explicitly or implicitly/default) encoding.

However, Webkit does not behave that way.
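
A minimal sketch of the rule quoted above, in Python. The function names are hypothetical and this is not WebKit code. It assumes that external (HTTP) information is consulted first, then a BOM, then the encoding declaration, with UTF-8 as the fallback; the relative order of HTTP charset and BOM is an assumption here and is debated separately. The user's menu choice is deliberately not an input at all.

```python
import codecs
import re

# Matches an XML declaration with an encoding pseudo-attribute (ASCII-compatible encodings only).
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def xml_encoding(data, http_charset=None):
    """Pick the encoding of an XML entity; no user preference is consulted."""
    if http_charset:                       # external information (HTTP/MIME), assumed to come first
        return http_charset.lower()
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(data)
    if match:                              # internal encoding declaration
        return match.group(1).decode("ascii").lower()
    return "utf-8"                         # neither BOM nor declaration: UTF-8


def decode_xml(data, http_charset=None):
    """Decode the entity, treating a mismatch as a fatal error."""
    encoding = xml_encoding(data, http_charset)
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError) as exc:
        raise ValueError(f"fatal error: entity is not valid {encoding}") from exc
```

With this sketch, a KOI8-R page that declares KOI8-R decodes normally, while KOI8-R bytes with no declaration and no HTTP charset are rejected instead of being reinterpreted.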


WAYS TO REPRODUCE THIS BUG:

 -- variant 1 --

1. In a browser in the Webkit family (including nightly build), go to the "Text Encodings" submenu of the "View" menu and select "Western (Macintosh)". NOTE: This step changes - for the current window or tab - the default encoding from "Default/Automatic" to the encoding that you selected.

2. Now, within the same window or tab, visit one of these XHTML (application/xhtml+xml) pages:
2.1. http://malform.no/testing/html5/bom/cyrillic-encoding-declaration
2.2. http://malform.no/testing/html5/bom/cyrillic-http-charset
       NOTE:
Page 2.1. includes an internal XML encoding declaration: <?xml version="1.0" encoding="KOI8-R" ?>
Page 2.2. is served with the charset=KOI8-R in the HTTP Content-Type: header

 -- variant 2 -- (the opposite way)

1. With the encoding set to "Default/Automatic", visit one of these XHTML (application/xhtml+xml) pages:
1.1. http://malform.no/testing/html5/bom/cyrillic-encoding-declaration
1.2. http://malform.no/testing/html5/bom/cyrillic-http-charset

2. Now, manually choose the encoding "Western (Macintosh)" from the encoding menu 


EXPECTED RESULTS - FOR BOTH VARIANTS:  Webkit should ignore that the user changed the default encoding to "Western (Macintosh)" and instead, in accordance with section 4.3.3 of XML 1.0, assume that the page is in the declared encoding.
  
ACTUAL RESULTS:  Webkit instead respects the user's choice of default encoding (i.e. it renders the page as 'Western (Macintosh)'). It also does so without displaying a fatal error.
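
To illustrate what the forced rendering amounts to, here is a small Python sketch. The sample word is an assumption chosen for illustration (it is not taken from the test pages), and `force_decode` is a hypothetical helper name. It reinterprets KOI8-R bytes as Mac Roman, which is roughly what a forced "Western (Macintosh)" view does.

```python
def force_decode(raw, forced_encoding):
    """Decode bytes with a user-forced encoding, regardless of what they really are."""
    return raw.decode(forced_encoding)


text = "Привет"                            # sample Cyrillic text (assumed, for illustration)
raw = text.encode("koi8_r")                # bytes as a KOI8-R page would carry them
garbled = force_decode(raw, "mac_roman")   # forced "Western (Macintosh)" interpretation
print(garbled)                             # punctuation and symbols instead of Cyrillic letters
```

Because every byte value has some meaning in a single-byte encoding such as Mac Roman, the forced decode never fails by itself. That is why a wrong choice shows up as unreadable text rather than as an error.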

COMMENTS:

[OTHER PARSERS:] Firefox does not have this bug. Opera *does* have a similar bug. I don't know if IE9 has this bug. I don't think XML parsers in general (e.g. libxml2) have this bug.
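
As one data point for a non-browser parser, the following sketch uses Python's standard-library xml.etree.ElementTree (which is built on Expat, not libxml2). The sample documents and the helper name `root_text` are assumptions made up for illustration. The parser follows the encoding declaration, and it refuses KOI8-R bytes that carry no declaration, since those are not valid UTF-8.

```python
import xml.etree.ElementTree as ET


def root_text(data):
    """Parse an XML document given as bytes and return the root element's text."""
    return ET.fromstring(data).text


# Declared as KOI8-R and actually encoded as KOI8-R: parsed according to the declaration.
declared = '<?xml version="1.0" encoding="KOI8-R" ?><p>Привет</p>'.encode("koi8_r")
print(root_text(declared))

# No declaration and no BOM, but not UTF-8: the parser reports an error.
undeclared = "<p>Привет</p>".encode("koi8_r")
try:
    root_text(undeclared)
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```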
Comment 1 Alexey Proskuryakov 2011-08-11 11:20:05 PDT
What is the difference between this bug and bug 66055?
Comment 2 Leif Halvard Silli 2011-08-11 12:47:36 PDT
(In reply to comment #1)
> What is the difference between this bug and bug 66055?

The XML 1.0 spec distinguishes between an internal encoding declaration (including the UTF-8 default) and an accompanying (a.k.a. external) encoding declaration delivered together with the file over the protocol or file system. Hence, the difference between the two bugs goes like this:

* Bug 66055 is about Webkit's failure to obey UTF-8  as the [internal] default encoding (over the user chosen encoding).
* Bug 66056 - this bug - is about Webkit's failure to let the accompanied external protocol take precedence (over the user chosen encoding).

I can see that from Webkit's point of view, the problem perhaps is one and the same: In both cases, Webkit listens to what the user specifies in the Text Encoding menu, rather than doing what the spec says. 

And: As long as the user does not touch the Text Encoding menu - but leaves it set to 'Default'/'Automatic', then Webkit does select the correct encoding - with one exception: If the page contains the BOM, then it ignores the HTTP charset=foo attribute (See bug 66084 for the XML parser. There is a related bug for the HTML parser as well - bug 66085.)

I think there should be separate bugs, but that bug 66055, bug 66056 as well as bug 66084 and bug 66085 should be seen together - perhaps they should be marked as depending on one another?
Comment 3 Leif Halvard Silli 2011-08-11 13:13:41 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > What is the difference between this bug and bug 66055?

> * Bug 66055 is about Webkit's failure to obey UTF-8  as the [internal] default encoding (over the user chosen encoding).
> * Bug 66056 - this bug - is about Webkit's failure to let the accompanied external protocol take precedence (over the user chosen encoding).

OK. I see that I was not as systematic as I thought ... since this bug, bug 66056, points to one test file with an external encoding declaration as well as to one with an internal encoding declaration ...

As such, there might not be a reason to have two bugs, provided that all developers are in no doubt about the fact that XML files should default to UTF-8 whenever there is no other info.

So, I change my justification: The fact that XML files should default to UTF-8 whenever no other info is present is so important a feature of XML that it warrants its own bug.
Comment 4 Alexey Proskuryakov 2011-08-11 13:37:27 PDT
Thank you for the clarification. There is no way we would ignore explicit user request to force a specific encoding via menu. That would just make no sense.

If you want the encoding menu to be unavailable for XML documents, that's something that can be considered (although personally, I'm not enthusiastic about that). Given the degree of specificity in the bugs you've filed, I think that it would be better to file a separate bug about that.
Comment 5 Leif Halvard Silli 2011-08-11 14:30:37 PDT
(In reply to comment #4)
> Thank you for the clarification. There is no way we would ignore explicit user request to force a specific encoding via menu. That would just make no sense.

* So do you prefer to break with the XML specification? 

* You do not speak to the facts: Webkit *does* already accept that it can be impossible to override the encoding despite an explicit user request: when an XML (or HTML) document includes a BOM, then it *is* impossible. (See bug 66084 and bug 66085, which mention this behaviour.) And the menu is not grayed out when this happens - it simply has no effect.

> If you want the encoding menu to be unavailable for XML documents, that's something that can be considered

Usually, on the Macintosh at least, when a menu is "unavailable", it is grayed out, to signal that it has no effect. Is that what you mean?

For the record, Firefox does, for XML, ignore the explicit user request to force a specific encoding via the menu. It does not gray anything out - it just has no effect. (But I would not mind if it was grayed out, of course.)

> (although personally, I'm not enthusiastic about that).

Why are you not enthusiastic? Is it because you disagree with XML 1.0? Or is there some other reason? Please note that XML is supposed to have "draconian error handling", but that this bug makes the error handling less draconian. Do you want XML to be less draconian? 

>  Given the degree of specificity in the bugs you've filed, I think that it would be better to file a separate bug about that.

Is your issue that when there is a menu that lets the user choose something, then it should have an effect? Is that why you stamp it as INVALID?
Comment 6 Alexey Proskuryakov 2011-08-11 14:48:18 PDT
> Usually, on the Macintosh at least, when a menu is "unavailable", it is grayed out, to signal that it has no effect. Is that what you mean?

Yes. Just like it's greyed out for image documents, for example. I think that making it unavailable would resolve this issue to your preference, so I'm not answering your other questions at this point.
Comment 7 Leif Halvard Silli 2011-08-11 15:03:04 PDT
(In reply to comment #6)
> > Usually, on the Macintosh at least, when a menu is "unavailable", it is grayed out, to signal that it has no effect. Is that what you mean?
> 
> Yes. Just like it's greyed out for image documents, for example. I think that making it unavailable would resolve this issue to your preference, so I'm not answering your other questions at this point.

I am in the process of filing such a bug. However, one thing I noticed, in that regard: The user might change the encoding to e.g. KOI8-R *before* visiting the XML page. Thus, it is not enough to just gray out the menu - Webkit must *also* ignore the user set encoding choice, if it is going to work.
Comment 8 Alexey Proskuryakov 2011-08-11 15:06:06 PDT
That's a good point. Another thing is that an XML document may be in a subframe inside an HTML one.
Comment 9 Leif Halvard Silli 2011-08-11 15:41:14 PDT
(In reply to comment #8)

Filed bug 66106. I filed it against the XML component - you may want to link it to some other component - I don't know which is correct.

Regarding the good point: This issue should also be related to bug 17873 "Encoding override should not be persistent", which asks that Webkit should behave like Firefox: Firefox goes into the default state for every page load. That is: it listens to what the page or the server says *every time it loads the page*, even if the user has selected another encoding. That way, the user choice is always ignored (perhaps with the exception of when the user uses one of the chardet encoding detection choices?), except that the user might force another encoding by choosing an encoding *after* the page has loaded. I think the Firefox behaviour is the better one here - and the Firefox behaviour is also more in line with what I have asked for in these bugs.
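
The difference between the two behaviours described above can be sketched as a toy model in Python. The class, its method names and the `persistent_override` flag are all hypothetical and are not taken from either browser's code; the model only captures when a menu choice is remembered across page loads.

```python
class Tab:
    """Toy model of one tab's text-encoding state (hypothetical names throughout)."""

    def __init__(self, persistent_override):
        self.persistent_override = persistent_override
        self.override = None               # encoding picked from the menu, if any

    def load(self, declared):
        """Load a page and return the encoding used to decode it."""
        if not self.persistent_override:
            self.override = None           # forget the menu choice on every load
        return self.override or declared

    def choose(self, encoding):
        """The user picks an encoding from the menu; it applies to the current page."""
        self.override = encoding
        return encoding


per_load = Tab(persistent_override=False)  # behaviour attributed to Firefox above
per_load.choose("macintosh")
print(per_load.load("koi8-r"))             # the page's own encoding wins on load

sticky = Tab(persistent_override=True)     # behaviour attributed to Webkit above
sticky.choose("macintosh")
print(sticky.load("koi8-r"))               # the earlier menu choice still applies
```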
Comment 10 Leif Halvard Silli 2011-08-13 03:00:42 PDT
(In reply to comment #8)
> That's a good point. Another thing is that an XML document may be in a subframe inside an HTML one.

Since it seems we agree that solving this bug is a prerequisite for solving 66106, I don't believe that it is correct that its status is RESOLVED -> INVALID.  Thus, since it looks to be the most useful thing that I am authorized to do, I placed it in UNCONFIRMED.
Comment 11 Leif Halvard Silli 2011-08-13 03:16:58 PDT
Alexey, in bug 66084 and bug 66085 you said:

]]
Comment #1 From Alexey Proskuryakov 2011-08-12 21:44:00 PST
A BOM is most authoritative indication of encoding, because there are few ways to get it wrong. It's much easier to get an encoding declaration or an HTTP header wrong.

There are some synthetic examples of strings in other encodings that can be mistaken for a BOM, but it hasn't been a practical issue.
[[

Don't you see that the same argument is true for XML files, when it comes to user's manual text encoding choice?

Because the user's encoding choice is much more likely to be incorrect than the encoding specified by the file itself. Thus you are helping the user if you ignore his or her choice.

This is so because of XML files' strict encoding rules - including the FATAL ERROR rules, which for the most part are well understood and supported by the tools and editors that produce XML files.

Even when the HTTP Content-Type: specifies something for an application/xhtml+xml file, the HTTP is more likely to be correct than the user's encoding choice. This is so, once again, because it is a FATAL ERROR if the HTTP specifies something which is incompatible with the file's real encoding. [*]

[*] EXCEPTION: Unfortunately - or fortunately - it is mostly only when a (UTF-8 encoded) file includes a BOM that it is possible to detect that the HTTP header specifies an incorrect encoding.
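
The check mentioned in the footnote can be sketched as follows in Python. The function name is hypothetical, and the sketch only looks at the UTF-8 BOM; it assumes that a body starting with the bytes EF BB BF really is UTF-8, which is a heuristic rather than a proof.

```python
import codecs


def http_charset_contradicts_bom(body, http_charset):
    """Return True if the body starts with a UTF-8 BOM but HTTP names another encoding."""
    if not body.startswith(codecs.BOM_UTF8):
        return False                       # no BOM: nothing to compare against
    try:
        # codecs.lookup() normalises aliases such as "UTF8" or "utf_8" to "utf-8".
        return codecs.lookup(http_charset).name != "utf-8"
    except LookupError:
        return True                        # unknown charset label cannot match the BOM


body = codecs.BOM_UTF8 + "<p>Привет</p>".encode("utf-8")
print(http_charset_contradicts_bom(body, "KOI8-R"))
print(http_charset_contradicts_bom(body, "UTF-8"))
```

Without a BOM there is usually nothing this cheap to compare against, because almost any byte sequence is "valid" in a single-byte encoding such as KOI8-R.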
Comment 12 Alexey Proskuryakov 2011-08-13 09:09:20 PDT
As mentioned before, this may be a prerequisite to fixing the other bug, but I don't really think that we should "fix" either.
Comment 13 Leif Halvard Silli 2011-08-13 12:53:02 PDT
(In reply to comment #8)
> That's a good point. Another thing is that an XML document may be in a subframe inside an HTML one.


Perhaps I misunderstood what you meant by this? At first I interpreted this as support of the solution I suggested - namely, to adhere to XML 1.0 encoding rules. But perhaps not?

So, to test, I produced the kind of page you were talking about: an HTML page labelled as WINDOWS-1252 (although it is actually UTF-8 encoded), with a polyglot XHTML subframe (hence the subframe is UTF-8 encoded, and it also includes the HTML charset declaration):

        http://malform.no/testing/html5/bom/frame

BROWSER RESULTS:

* IE9 and Firefox: they treat the subframe as UTF-8 - thus respecting the encoding default, over both the parent page's encoding and the user's choice.

* Webkit: it lets the subframe inherit the encoding from the HTML page. (At the very least it respects the UTF-8 encoding if there is BOM.)

* Opera: behaves like Webkit (except that it doesn't even respect the BOM)

* IE8 and below: they sniff it as HTML, and also respect the HTML encoding declaration. (If I drop the encoding declaration, then it defaults to WINDOWS-1251.)


If you ask me, this is a big failure for Webkit and Opera, in every way. Completely illogical behaviour.
Comment 14 Leif Halvard Silli 2011-08-13 16:59:22 PDT
This also affects SVG files.

TEST FILE: http://malform.no/testing/html5/bom/frame6
    HTML FILE FEATURES: 
            WIN-1252 encoded w/SVG image embedded in IMG, OBJECT and IFRAME.
    SVG FILE FEATURES: 
            UTF-8 encoded *with* encoding declaration. 
            Text of the SVG file is "Hello world!" in Russian.

By default, this page looks as it should - everything is decoded correctly. *HOWEVER*, the text of the HTML file is not readable, because of my willful mislabeling. Thus, let us imagine that the user manually selects an encoding (any encoding except UTF-8) in order to make it more readable.

EXPECTED RESULT: 
            SVG file rendering to be unaffected for all of IMG, OBJECT and IFRAME

ACTUAL RESULT: 
            The <img> remains unaffected, as expected.
            But for <object> + <iframe>, encoding is overridden (making the text unreadable)