Bug 74815 - Non-BMP Unicode character codes aren't properly unescaped in CSS
Summary: Non-BMP Unicode character codes aren't properly unescaped in CSS
Status: NEW
Product: WebKit
Component: CSS
Version: 528+ (Nightly build)
Platform: Unspecified Unspecified
Importance: P3 Normal
Assigned To:
: 69083
Reported: 2011-12-18 10:42 PST by
Modified: 2012-04-30 15:46 PST (History)


Attachments
A zip file containing a demonstration of the bug (74.93 KB, application/zip)
2011-12-18 22:34 PST, P.J. Onori
no flags Details
demo of literal character working fine (152 bytes, text/html)
2011-12-20 17:27 PST, Alexey Proskuryakov
no flags Details
reduced test case (297 bytes, text/html)
2012-01-12 11:01 PST, Mathias Bynens
no flags Details



Description From 2011-12-18 10:42:31 PST
Glyphs for what I suspect is any code point above the Basic Multilingual Plane (0x0000-0xffff) are rendered as a diamond with a question mark in the middle, even when a glyph exists for that code point.
------- Comment #1 From 2011-12-18 21:51:56 PST -------
Could you please provide a test case?

Given that you mention the diamond with a question mark, I suspect that you're seeing this problem on Mac, but then I'm confused because Mac WebKit certainly supports non-BMP characters. Are you seeing this in Safari on Mac, some other WebKit based browser on Mac, or some entirely different platform?
------- Comment #2 From 2011-12-18 22:34:40 PST -------
Created an attachment (id=119817) [details]
A zip file containing a demonstration of the bug

The zip file contains an HTML file that displays characters at specific Unicode values. The rightmost column of the table shows the corresponding Unicode value.
------- Comment #3 From 2011-12-18 22:36:02 PST -------
I've checked this in Safari v5.1, Chrome v16 and the latest WebKit build, all on the Mac. Let me know if there's anything else I can provide.
------- Comment #4 From 2011-12-20 17:26:59 PST -------
This is specifically an issue with parsing strings like '\01f3a4'. It will work if you just paste a Unicode character in your CSS (and add a @charset rule to make sure it's decoded correctly).

I don't know if we're matching the spec or not here.
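
For reference, the escape `\01f3a4` names the code point U+1F3A4, which lies above 0xFFFF and therefore needs two UTF-16 code units (a surrogate pair). The sketch below only illustrates that arithmetic; `toSurrogatePair` is a hypothetical helper name, and nothing here is taken from WebKit's parser.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 surrogate pair. Hypothetical helper, for illustration only.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary code point');
  }
  const offset = codePoint - 0x10000;       // 20-bit value
  const high = 0xD800 + (offset >> 10);     // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

const [highUnit, lowUnit] = toSurrogatePair(0x1F3A4);
// The two code units together form the same string as the code point itself.
const microphone = String.fromCharCode(highUnit, lowUnit);
console.log(highUnit.toString(16), lowUnit.toString(16), microphone.length);
```

A string holding this character has length 2 in JavaScript, even though it is a single code point.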
------- Comment #5 From 2011-12-20 17:27:22 PST -------
Created an attachment (id=120122) [details]
reduced test case
------- Comment #6 From 2011-12-20 20:24:45 PST -------
Thanks Alexey. That works for me in Safari 5.1 and WebKit (Chrome doesn't seem to support it).

I poked around as well and couldn't discern whether or not this is appropriate. Pragmatically, it may prove frustrating for people looking at the CSS file with a typeface that doesn't contain those glyphs. But that's not technically your problem. ;)
------- Comment #7 From 2011-12-20 20:26:42 PST -------
My lord, apologies for such a poorly-written comment. It's been a long day...
------- Comment #8 From 2012-01-12 10:46:55 PST -------
(In reply to comment #4)
> This is specifically an issue with parsing strings like '\01f3a4'. It will work if you just paste a Unicode character in your CSS (and add a @charset rule to make sure it's decoded correctly).
> 
> I don't know if we're matching the spec or not here.

FWIW, the spec is here: http://www.w3.org/TR/CSS21/syndata.html#characters / http://www.w3.org/TR/css3-syntax/#characters

It doesn’t mention anything about UTF-16 or surrogate pairs in escapes (which are thus non-standard, although they happen to be supported in WebKit); only Unicode / ISO 10646 code points are allowed in CSS escape sequences. This kind of CSS escape sequence doesn’t work in WebKit for characters outside the BMP, which is what this bug is about. For more info, see this mailing list discussion: http://lists.w3.org/Archives/Public/www-style/2012Jan/thread.html#msg536

For example, `\1d306 ` or `\01d306` are supposed to be CSS escape sequences for the “tetragram for centre” symbol (U+1D306), but they currently don’t work in WebKit.
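
A minimal sketch of how such escapes can be decoded, assuming the CSS 2.1 grammar for escapes (a backslash, one to six hex digits, then one optional whitespace character). The function name `unescapeCss` is hypothetical, and mapping zero, surrogate, and out-of-range values to U+FFFD is a choice made for this sketch rather than something stated in this thread. It is not WebKit's tokenizer.

```javascript
// Decode CSS escape sequences in a string (illustrative sketch only).
function unescapeCss(input) {
  // Either: backslash + 1-6 hex digits + one optional whitespace,
  // or: backslash + any single non-hex, non-newline character.
  const pattern =
    /\\([0-9a-fA-F]{1,6})(?:\r\n|[ \t\r\n\f])?|\\([^\r\n\f0-9a-fA-F])/gu;
  return input.replace(pattern, (match, hex, literal) => {
    if (literal !== undefined) return literal;   // e.g. "\." becomes "."
    const codePoint = parseInt(hex, 16);
    const isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
    if (codePoint === 0 || isSurrogate || codePoint > 0x10FFFF) {
      return '\uFFFD';                            // assumption of this sketch
    }
    // fromCodePoint handles values above 0xFFFF; fromCharCode would not.
    return String.fromCodePoint(codePoint);
  });
}

const viaSpace = unescapeCss('\\1d306 ');     // five digits, space terminator
const viaPadding = unescapeCss('\\01d306');   // six digits, no terminator needed
const inSelector = unescapeCss('ab\\a9 de\\1d306 fg');
console.log(viaSpace === viaPadding, inSelector);
```

The trailing space matters when the next character is itself a hex digit: without it, `\a9de` would be read as the single escape U+A9DE rather than `©` followed by `de`.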

(In reply to comment #5)
> Created an attachment (id=120122) [details]
> reduced test case

I’m not sure how that test case helps, as it doesn’t contain a CSS escape sequence, just the literal character. Am I missing something?

Here’s an appropriate test case: http://jsfiddle.net/mathias/jY7ra/ The first escape sequence (used with `html:before`) is the standard one. WebKit is the only engine this fails in.
------- Comment #9 From 2012-01-12 11:01:14 PST -------
Created an attachment (id=122271) [details]
reduced test case
------- Comment #10 From 2012-01-12 11:46:53 PST -------
(From update of attachment 120122 [details])
> I’m not sure how that test case helps

It was meant as a demonstration that the issue is more limited in scope than originally reported. I chose a poor description for the attachment, sorry for the confusion.
------- Comment #11 From 2012-01-12 21:25:01 PST -------
This is a tokenizer level issue (AP thanks for CC'ing me). Would not be much trouble to fix it in the custom written tokenizer after it is landed, just adding some extra parsing to the escape sequences.
------- Comment #12 From 2012-01-24 08:15:10 PST -------
Note that this also affects `document.querySelector` and `document.querySelectorAll`.

Failing test case:

data:text/html;charset=utf-8,%3C!DOCTYPE%20html%3E%3Ctitle%3EMothereffing%20CSS%20escapes%20example%3C%2Ftitle%3E%3Cstyle%3Epre%7Bbackground%3A%23eee%3Bpadding%3A.5em%7Dp%7Bdisplay%3Anone%7D%23ab%5Ca9%20de%5C1d306%20fg%7Bdisplay%3Ablock%7D%3C%2Fstyle%3E%3Ch1%3E%3Ca%20href%3D%22http%3A%2F%2Fmothereff.in%2Fcss-escapes%23ab%25C2%25A9de%25F0%259D%258C%2586fg%22%3EMothereffing%20CSS%20escapes%3C%2Fa%3E%20example%3C%2Fh1%3E%3Cpre%3E%3Ccode%3Eab%C2%A9de%F0%9D%8C%86fg%3C%2Fcode%3E%3C%2Fpre%3E%3Cp%20id%3D%22ab%C2%A9de%F0%9D%8C%86fg%22%3EIf%20you%20can%20read%20this%2C%20the%20escaped%20CSS%20selector%20worked.%20%3C%2Fp%3E%3Cscript%3Edocument.getElementById('ab%C2%A9de%F0%9D%8C%86fg').innerHTML%20%2B%3D%20'%20%3Ccode%3Edocument.getElementById%3C%2Fcode%3E%20worked.'%3Bdocument.querySelector('%23ab%5C%5Ca9%20de%5C%5C1d306%20fg').innerHTML%2B%3D'%20%3Ccode%3Edocument.querySelector%3C%2Fcode%3E%20worked.'%3C%2Fscript%3E
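
The selector and the element ID are hard to spot inside the percent-encoded data URI. As a reading aid, the sketch below decodes the two relevant fragments, copied from the URI above; the variable names are only illustrative.

```javascript
// Fragments copied from the data URI in the failing test case.
const encodedSelector = '%23ab%5Ca9%20de%5C1d306%20fg';
const encodedId = 'ab%C2%A9de%F0%9D%8C%86fg';

// decodeURIComponent interprets the percent-encoded bytes as UTF-8.
const decodedSelector = decodeURIComponent(encodedSelector);
const decodedId = decodeURIComponent(encodedId);

console.log(decodedSelector);             // the CSS selector with its escapes
console.log(decodedId);                   // the literal ID it should match
console.log(Array.from(decodedId).length, decodedId.length);
```

In the script part of the test case the backslashes are doubled (`%5C%5C`) because the selector sits inside a JavaScript string literal, where `\\` stands for one backslash.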

(In reply to comment #11)
> This is a tokenizer level issue (AP thanks for CC'ing me). Would not be much trouble to fix it in the custom written tokenizer after it is landed, just adding some extra parsing to the escape sequences.

Out of curiosity, when will the custom-written tokenizer land (if it hasn’t already)? Any bug tickets I can subscribe to?
------- Comment #13 From 2012-01-24 12:27:41 PST -------
> Out of curiosity, when will the custom-written tokenizer land (if it hasn’t already)? Any bug tickets I can subscribe to?

https://bugs.webkit.org/show_bug.cgi?id=70107

I just got an r+ to it, but I will land it tomorrow because I want to see the bots.
------- Comment #14 From 2012-01-24 12:43:23 PST -------
FWIW, I’ve just deployed some changes to my CSS escaper tool to make it easier to create test cases for this bug. E.g. click the “example” link on http://mothereff.in/css-escapes#1%F0%9D%8C%86.
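
For readers unfamiliar with what an escaper of this kind produces, here is a deliberately small sketch. It is not the code behind the tool mentioned above; `escapeNonAscii` is a hypothetical name, and the function assumes the rest of the identifier is already valid CSS (it does not handle a leading digit, as in the `1𝌆` example, or special ASCII characters).

```javascript
// Replace every non-ASCII code point with a CSS hex escape followed by
// a terminating space. Illustrative sketch with a narrow scope.
function escapeNonAscii(identifier) {
  let out = '';
  // for...of iterates by code point, so a surrogate pair stays together.
  for (const ch of identifier) {
    const codePoint = ch.codePointAt(0);
    out += codePoint > 0x7F
      ? '\\' + codePoint.toString(16) + ' '
      : ch;
  }
  return out;
}

const escapedId = escapeNonAscii('ab\u00A9de\u{1D306}fg');
console.log('#' + escapedId);
```

Iterating by UTF-16 code unit instead (for example with `charCodeAt` in an index loop) would emit two surrogate escapes for U+1D306, which is the non-standard form discussed in comment #8.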

(In reply to comment #13)
> I just got an r+ to it, but I will land it tomorrow because I want to see the bots.

That’s awesome news!
------- Comment #15 From 2012-01-30 00:32:32 PST -------
https://bugs.webkit.org/show_bug.cgi?id=70107 is now RESOLVED FIXED, landed here: http://trac.webkit.org/changeset/106217
------- Comment #16 From 2012-01-30 00:36:20 PST -------
Better test case that will show a red/lime background depending on success/failure: data:text/html;charset=utf-8,<!DOCTYPE%20html><title>Mothereffing%20CSS%20escapes%20example<%2Ftitle><style>pre%7Bbackground%3A%23eee%3Bpadding%3A.5em%7D.test%7Bdisplay%3Anone%7D%23b%5Ca9%20de%5C1d306%20fg%7Bdisplay%3Ablock%7D.pass%7Bbackground%3Alime%7D.fail%7Bbackground%3Ared%7D<%2Fstyle><h1><a%20href%3D"http%3A%2F%2Fmothereff.in%2Fcss-escapes%231b%25C2%25A9de%25F0%259D%258C%2586fg">Mothereffing%20CSS%20escapes<%2Fa>%20example<%2Fh1><pre><code>b%C2%A9de%F0%9D%8C%86fg<%2Fcode><%2Fpre><p%20id%3D"b%C2%A9de%F0%9D%8C%86fg"%20class%3Dtest>If%20you%20can%20read%20this%2C%20the%20escaped%20CSS%20selector%20worked.%20<%2Fp><p>Standard%20CSS%20character%20escape%20sequences%20for%20supplementary%20Unicode%20characters%20aren%E2%80%99t%20currently%20supported%20in%20WebKit.%20<strong>This%20test%20case%20will%20fail%20in%20those%20browsers.<%2Fstrong>%20It%E2%80%99s%20better%20to%20leave%20these%20characters%20unescaped.<%2Fp><script>var%20el%3Ddocument.getElementsByTagName('p')%5B0%5D%3Btry%7Bdocument.getElementById('b%5Cxa9de%5Cud834%5Cudf06fg').innerHTML%20%2B%3D%20'%20<code>document.getElementById<%2Fcode>%20worked.'%3Bdocument.querySelector('%23b%5C%5Ca9%20de%5C%5C1d306%20fg').innerHTML%2B%3D'%20<code>document.querySelector<%2Fcode>%20worked.'%3Bel.className%3D'pass'%7Dcatch(e)%7Bel.innerHTML%3D'FAIL'%3Bel.className%3D'fail'%7D<%2Fscript>

Short URL: http://mths.be/bel