Bug 11011

Summary: External CSS is parsed as iso-8859-1 even though the main document is utf-8
Product: WebKit Reporter: Simon Pieters <zcorpan>
Component: CSSAssignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal CC: robburns1, zcorpan
Priority: P2    
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
URL: http://simon.html5.org/test/css/syndata/representation/001.htm
Attachments:
Description Flags
work in progress
none
proposed fix darin: review+

Simon Pieters
Reported 2006-09-24 14:12:31 PDT
iso-8859-1 (or perhaps windows-1252?) is assumed for external style sheets, even though charset=utf-8 has been specified in Content-Type HTTP header. See http://www.w3.org/TR/CSS21/syndata.html#q23 for how encoding information should be gathered for CSS. Steps to reproduce: 1. Use non-ascii characters in a utf-8 stylesheet in 'content'. 2. Specify charset=utf-8 in Content-Type HTTP header. Expected results: The style sheet should be parsed with the specified encoding (utf-8). Actual results: The style sheet is parsed as iso-8859-1 (or windows-1252?).
Attachments
work in progress (3.83 KB, patch)
2006-09-27 13:30 PDT, Alexey Proskuryakov
no flags
proposed fix (35.93 KB, patch)
2006-09-30 13:19 PDT, Alexey Proskuryakov
darin: review+
Alexey Proskuryakov
Comment 1 2006-09-25 06:31:57 PDT
This stylesheet comes without a charset in Content-Type: $ curl -I http://simon.html5.org/test/css/syndata/representation/support/001.css HTTP/1.1 200 OK <...> Content-Type: text/css However, we do have a bug - the charset must be inherited from the main document (rule 4 from the quoted spec). Technically, CSS 2.1 also suggests that UTF-8 should be a default (rule 5), but since CSS files aren't standalone, this is hardly practical.
Robert Burns
Comment 2 2006-09-25 13:37:50 PDT
(In reply to comment #1) > This stylesheet comes without a charset in Content-Type: > > $ curl -I > http://simon.html5.org/test/css/syndata/representation/support/001.css > HTTP/1.1 200 OK > <...> > Content-Type: text/css > > However, we do have a bug - the charset must be inherited from the main > document (rule 4 from the quoted spec). Technically, CSS 2.1 also suggests that > UTF-8 should be a default (rule 5), but since CSS files aren't standalone, this > is hardly practical. > I'm not sure what you mean by hardly practical The spec says that If the encoding can't be determined through any of the previous means then UTF-8 should be assumed (rather than assuming 8859-1 or something else). Could you elaborate what you mean by hardly practical?
Alexey Proskuryakov
Comment 3 2006-09-25 13:45:45 PDT
What I meant was that rules 1-4 are always sufficient (the referring resource always has a charset), so rule 5 is never used. And even if viewing standalone CSS files, the browser default is probably a better approximation.
Robert Burns
Comment 4 2006-09-25 13:58:24 PDT
(In reply to comment #3) > What I meant was that rules 1-4 are always sufficient (the referring resource > always has a charset), so rule 5 is never used. And even if viewing standalone > CSS files, the browser default is probably a better approximation. > I see. Also I see the sample stylesheet has no BOM either so rule 2 would not kick in.
Simon Pieters
Comment 5 2006-09-25 15:00:14 PDT
(In reply to comment #1) > This stylesheet comes without a charset in Content-Type: > [...] Sorry, fixed now. The bug still applies. > However, we do have a bug - the charset must be inherited from the main > document (rule 4 from the quoted spec). Technically, CSS 2.1 also suggests that > UTF-8 should be a default (rule 5), but since CSS files aren't standalone, this > is hardly practical. Yeah. That broadens the scope of this bug, however. I'll create more test cases on this.
Robert Burns
Comment 6 2006-09-25 16:08:37 PDT
(In reply to comment #5) > (In reply to comment #1) > > This stylesheet comes without a charset in Content-Type: > > [...] > > Sorry, fixed now. The bug still applies. > > > However, we do have a bug - the charset must be inherited from the main > > document (rule 4 from the quoted spec). Technically, CSS 2.1 also suggests that > > UTF-8 should be a default (rule 5), but since CSS files aren't standalone, this > > is hardly practical. > > Yeah. That broadens the scope of this bug, however. I'll create more test cases > on this. > As far as I can tell this is loading as UTF-8. Even when I set the default in the nightly build to Western (ISO Latiin-1).
Alexey Proskuryakov
Comment 7 2006-09-25 21:23:53 PDT
Yes, charset from Content-Type (and @charset) has been already fixed - you can use nightly builds from <http://nightly.webkit.org> for testing.
Simon Pieters
Comment 8 2006-09-26 03:13:42 PDT
(In reply to comment #7) > Yes, charset from Content-Type (and @charset) has been already fixed Ok. I've removed charset from Content-Type in the test case again so that it matches this bug report.
Alexey Proskuryakov
Comment 9 2006-09-27 13:30:20 PDT
Created attachment 10805 [details] work in progress This implements inheriting the charset from a document, but not from a referring stylesheet yet. Parts of the implementation are common between CSS and XSL, so I wonder if XSL has similar inheritance rules.
Simon Pieters
Comment 10 2006-09-27 15:06:54 PDT
(In reply to comment #9) > [...] Parts of the implementation are common between CSS > and XSL, so I wonder if XSL has similar inheritance rules. XSL doesn't inherit encoding. XSL is XML and so should get XML treatment when it comes to encoding.
Alexey Proskuryakov
Comment 11 2006-09-30 13:19:43 PDT
Created attachment 10851 [details] proposed fix This fixes a bunch of cases; one loose end that might remain is handling of dynamically modified stylesheets, e.g.: s = document.implementation.createCSSStyleSheet(...); s.insertRule('@charset ...', 0); s.insertRule('@import...', 1); However, I don't think the behavior in such cases is rigorously specified, and testing to match other browsers seems infeasible (Firefox 1.5 doesn't even have createCSSStyleSheet()).
Darin Adler
Comment 12 2006-09-30 13:49:25 PDT
Comment on attachment 10851 [details] proposed fix +#include "CSSCharsetRule.h" Why is this additional include needed in CSSStyleSheet.cpp? +StyleSheet *StyleSheetList::item(unsigned index) Should move the * to the left too while we're at it. + c->setCSSStyleSheet(String(ResponseURL(m_response)), m_decoder->encoding().name(), m_sheet); Do we really need the explicit String() here? r=me
Alexey Proskuryakov
Comment 13 2006-09-30 14:19:37 PDT
Committed revision 16689. > Why is this additional include needed in CSSStyleSheet.cpp? I forgot to remove it (originally considered looking at CSSCharsetRule to support dynamic stylesheet manipulation). Fixed. > Should move the * to the left too while we're at it. Done. > Do we really need the explicit String() here? Apparently not, fixed.
Darin Adler
Comment 14 2006-10-02 11:08:01 PDT
This change broke the loading of the error page's style sheet! Working on that (Radar 4759256).
Robert Burns
Comment 15 2007-02-21 19:08:51 PST
Lookin over the recommenation on this again, I think we sshould interpret it more charitably. http://www.w3.org/TR/CSS21/syndata.html#q23 Rather than suggesting that it can never get to setp 6, I think the recommendation means, by the charset of the referring document, the actual declared charset. Certainly every document has an imputed charset (by some method), however if the referring document chraset is likewise unkown, I think the recommendation is suggesting to use UTF-8. Now, I'm not sure if this is how other browsers would interpret it. Peraps some testing could be done on that. However, I think the more we can assume UTF-8 the better. If we could get away from these other charsets that would be a good thing. Certainly if it's not ISO-8859-1 or unicode it would be more common to include an @harset rule at the beginning of a document. Or to include the charset in the reference link or the http header. So with that all left out, the decision is over whether to assume latin or assume UTF-8 (for the most part). I would vote for UTF-8 without significant flow of other browsers in the opposite direction.
Alexey Proskuryakov
Comment 16 2007-02-22 01:04:45 PST
If you have examples where WebKit gives different results than Firefox, please open new bugs for them!
Ian 'Hixie' Hickson
Comment 17 2007-02-22 15:41:33 PST
Regarding comment %15 above: We (the CSSWG) meant the assumed charset of the referring page, not the declared charset. The "assume UTF-8" step only applies if there's no referring page (e.g. in a CSS editor opening a CSS file directly).
Robert Burns
Comment 18 2007-02-22 16:40:31 PST
Regarding comment #15 again: Well, I'm not sure that one person can speak for the entire WG. However, I took a look at CSS2 which is much clearer on this. However, CSS2 doesn't include the fallback of the referring document's imputed or declared charset: "4. charset of referring style sheet or document (if any)" I guess the "if any" refers to teh word "document" and not "charset of" It would be better if CSS2.1 made it clear by saying something like: "4. the declared or imputed charset of referring style sheet or document (if any)" It's not good when updated recommendations add to the confusion.
Note You need to log in before you can comment on or make changes to this bug.