Bug 11011

Summary: External CSS is parsed as iso-8859-1 even though the main document is utf-8
Product: WebKit Reporter: Simon Pieters <zcorpan>
Component: CSSAssignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal CC: robburns1, zcorpan
Priority: P2    
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
URL: http://simon.html5.org/test/css/syndata/representation/001.htm
Attachments:
Description Flags
work in progress
none
proposed fix darin: review+

Description Simon Pieters 2006-09-24 14:12:31 PDT
iso-8859-1 (or perhaps windows-1252?) is assumed for external style sheets, even though charset=utf-8 has been specified in Content-Type HTTP header. See http://www.w3.org/TR/CSS21/syndata.html#q23 for how encoding information should be gathered for CSS.

Steps to reproduce:
1. Use non-ascii characters in a utf-8 stylesheet in 'content'.
2. Specify charset=utf-8 in Content-Type HTTP header.

Expected results:
The style sheet should be parsed with the specified encoding (utf-8).

Actual results:
The style sheet is parsed as iso-8859-1 (or windows-1252?).
Comment 1 Alexey Proskuryakov 2006-09-25 06:31:57 PDT
This stylesheet comes without a charset in Content-Type:

$ curl -I http://simon.html5.org/test/css/syndata/representation/support/001.css
HTTP/1.1 200 OK
<...>
Content-Type: text/css

However, we do have a bug - the charset must be inherited from the main document (rule 4 from the quoted spec). Technically, CSS 2.1 also suggests that UTF-8 should be a default (rule 5), but since CSS files aren't standalone, this is hardly practical.
Comment 2 Robert Burns 2006-09-25 13:37:50 PDT
(In reply to comment #1)
> This stylesheet comes without a charset in Content-Type:
> 
> $ curl -I
> http://simon.html5.org/test/css/syndata/representation/support/001.css
> HTTP/1.1 200 OK
> <...>
> Content-Type: text/css
> 
> However, we do have a bug - the charset must be inherited from the main
> document (rule 4 from the quoted spec). Technically, CSS 2.1 also suggests that
> UTF-8 should be a default (rule 5), but since CSS files aren't standalone, this
> is hardly practical.
> 

I'm not sure what you mean by hardly practical The spec says that If the encoding can't be determined through any of the previous means then UTF-8 should be assumed (rather than assuming 8859-1 or something else). Could you elaborate what you mean by hardly practical?
Comment 3 Alexey Proskuryakov 2006-09-25 13:45:45 PDT
What I meant was that rules 1-4 are always sufficient (the referring resource always has a charset), so rule 5 is never used. And even if viewing standalone CSS files, the browser default is probably a better approximation.
Comment 4 Robert Burns 2006-09-25 13:58:24 PDT
(In reply to comment #3)
> What I meant was that rules 1-4 are always sufficient (the referring resource
> always has a charset), so rule 5 is never used. And even if viewing standalone
> CSS files, the browser default is probably a better approximation.
> 

I see. Also I see the sample stylesheet has no BOM either so rule 2 would not kick in.
Comment 5 Simon Pieters 2006-09-25 15:00:14 PDT
(In reply to comment #1)
> This stylesheet comes without a charset in Content-Type:
> [...]

Sorry, fixed now. The bug still applies.

> However, we do have a bug - the charset must be inherited from the main
> document (rule 4 from the quoted spec). Technically, CSS 2.1 also suggests that
> UTF-8 should be a default (rule 5), but since CSS files aren't standalone, this
> is hardly practical.

Yeah. That broadens the scope of this bug, however. I'll create more test cases on this.
Comment 6 Robert Burns 2006-09-25 16:08:37 PDT
(In reply to comment #5)
> (In reply to comment #1)
> > This stylesheet comes without a charset in Content-Type:
> > [...]
> 
> Sorry, fixed now. The bug still applies.
> 
> > However, we do have a bug - the charset must be inherited from the main
> > document (rule 4 from the quoted spec). Technically, CSS 2.1 also suggests that
> > UTF-8 should be a default (rule 5), but since CSS files aren't standalone, this
> > is hardly practical.
> 
> Yeah. That broadens the scope of this bug, however. I'll create more test cases
> on this.
> 

As far as I can tell this is loading as UTF-8. Even when I set the default in the nightly build to Western (ISO Latiin-1).
Comment 7 Alexey Proskuryakov 2006-09-25 21:23:53 PDT
Yes, charset from Content-Type (and @charset) has been already fixed - you can use nightly builds from <http://nightly.webkit.org> for testing.
Comment 8 Simon Pieters 2006-09-26 03:13:42 PDT
(In reply to comment #7)
> Yes, charset from Content-Type (and @charset) has been already fixed
 
Ok. I've removed charset from Content-Type in the test case again so that it matches this bug report.
Comment 9 Alexey Proskuryakov 2006-09-27 13:30:20 PDT
Created attachment 10805 [details]
work in progress

This implements inheriting the charset from a document, but not from a referring stylesheet yet. Parts of the implementation are common between CSS and XSL, so I wonder if XSL has similar inheritance rules.
Comment 10 Simon Pieters 2006-09-27 15:06:54 PDT
(In reply to comment #9)
> [...] Parts of the implementation are common between CSS
> and XSL, so I wonder if XSL has similar inheritance rules.

XSL doesn't inherit encoding. XSL is XML and so should get XML treatment when it comes to encoding.
Comment 11 Alexey Proskuryakov 2006-09-30 13:19:43 PDT
Created attachment 10851 [details]
proposed fix

This fixes a bunch of cases; one loose end that might remain is handling of dynamically modified stylesheets, e.g.:
s = document.implementation.createCSSStyleSheet(...);
s.insertRule('@charset ...', 0);
s.insertRule('@import...', 1);

However, I don't think the behavior in such cases is rigorously specified, and testing to match other browsers seems infeasible (Firefox 1.5 doesn't even have createCSSStyleSheet()).
Comment 12 Darin Adler 2006-09-30 13:49:25 PDT
Comment on attachment 10851 [details]
proposed fix

+#include "CSSCharsetRule.h"

Why is this additional include needed in CSSStyleSheet.cpp?

+StyleSheet *StyleSheetList::item(unsigned index)

Should move the * to the left too while we're at it.

+            c->setCSSStyleSheet(String(ResponseURL(m_response)), m_decoder->encoding().name(), m_sheet);

Do we really need the explicit String() here?

r=me
Comment 13 Alexey Proskuryakov 2006-09-30 14:19:37 PDT
Committed revision 16689.

> Why is this additional include needed in CSSStyleSheet.cpp?

  I forgot to remove it (originally considered looking at CSSCharsetRule to support dynamic stylesheet manipulation). Fixed. 

> Should move the * to the left too while we're at it.

  Done.

> Do we really need the explicit String() here?

  Apparently not, fixed.
Comment 14 Darin Adler 2006-10-02 11:08:01 PDT
This change broke the loading of the error page's style sheet! Working on that (Radar 4759256).
Comment 15 Robert Burns 2007-02-21 19:08:51 PST
Lookin over the recommenation on this again, I think we sshould interpret it more charitably.

http://www.w3.org/TR/CSS21/syndata.html#q23

Rather than suggesting that it can never get to setp 6, I think the recommendation means, by the charset of the referring document, the actual declared charset. Certainly every document has an imputed charset (by some method), however if the referring document chraset is likewise unkown, I think the recommendation is suggesting to use UTF-8. 

Now, I'm not sure if this is how other browsers would interpret it. Peraps some testing could be done on that. However, I think the more we can assume UTF-8 the better. If we could get away from these other charsets that would be a good thing. Certainly if it's not ISO-8859-1  or unicode it would be more common to include an @harset rule at the beginning of a document. Or to include the charset in the reference link or the http header. So with that all left out, the decision is over whether to assume latin or assume UTF-8 (for the most part).  I would vote for UTF-8 without significant flow of other browsers in the opposite direction.
Comment 16 Alexey Proskuryakov 2007-02-22 01:04:45 PST
If you have examples where WebKit gives different results than Firefox, please open new bugs for them!
Comment 17 Ian 'Hixie' Hickson 2007-02-22 15:41:33 PST
Regarding comment %15 above: We (the CSSWG) meant the assumed charset of the referring page, not the declared charset. The "assume UTF-8" step only applies if there's no referring page (e.g. in a CSS editor opening a CSS file directly).
Comment 18 Robert Burns 2007-02-22 16:40:31 PST
Regarding comment #15 again: Well, I'm not sure that one person can speak for the entire WG. However, I took a look at CSS2 which is much clearer on this. However, CSS2 doesn't include the fallback of the referring document's imputed or declared charset:

"4. charset of referring style sheet or document (if any)"

I guess the "if any" refers to teh word "document" and not "charset of"

It would be better if CSS2.1 made it clear by saying something like:

"4. the declared or imputed charset of referring style sheet or document (if any)"

It's not good when updated recommendations add to the confusion.