Bug 74815

Summary: Non-BMP Unicode character codes aren't properly unescaped in CSS
Product: WebKit
Reporter: P.J. Onori <onorpj>
Component: CSS
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED DUPLICATE
Severity: Normal
CC: ap, eoconnor, john.david.dalton, mathias, pjkixx, szledan, zherczeg
Priority: P3
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Bug Depends on:
Bug Blocks: 69083
Attachments:
A zip file containing a demonstration of the bug (flags: none)
demo of literal character working fine (flags: none)
reduced test case (flags: none)

Description P.J. Onori 2011-12-18 10:42:31 PST
Glyphs for what I suspect is any code point above the Basic Multilingual Plane (0x0000-0xffff) render as a diamond with a question mark in the middle, even when a glyph exists for that code point.
Comment 1 Alexey Proskuryakov 2011-12-18 21:51:56 PST
Could you please provide a test case?

Given that you mention the diamond with a question mark, I suspect that you're seeing this problem on Mac, but then I'm confused because Mac WebKit certainly supports non-BMP characters. Are you seeing this in Safari on Mac, some other WebKit based browser on Mac, or some entirely different platform?
Comment 2 P.J. Onori 2011-12-18 22:34:40 PST
Created attachment 119817 [details]
A zip file containing a demonstration of the bug

The zip file contains an HTML file which displays characters at specific Unicode values. The table shows the corresponding Unicode value in the rightmost column.
Comment 3 P.J. Onori 2011-12-18 22:36:02 PST
I've checked this in Safari v5.1, Chrome v16 and the latest WebKit build, all on the Mac. Let me know if there's anything else I can provide.
Comment 4 Alexey Proskuryakov 2011-12-20 17:26:59 PST
This is specifically an issue with parsing strings like '\01f3a4'. It will work if you just paste a Unicode character in your CSS (and add a @charset rule to make sure it's decoded correctly).

I don't know if we're matching the spec or not here.
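
For context on why an escape such as '\01f3a4' needs extra care in an engine that stores text as 16-bit code units: the code point lies above U+FFFF, so it cannot fit in one unit and has to be split into a surrogate pair. A minimal sketch of that split follows. `toSurrogatePair` is a hypothetical helper name used only for illustration; nothing here is taken from WebKit's source.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 surrogate pair. Hypothetical helper, for illustration only.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary code point');
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [highUnit, lowUnit] = toSurrogatePair(0x1F3A4);
console.log(highUnit.toString(16), lowUnit.toString(16)); // d83c dfa4
console.log(String.fromCharCode(highUnit, lowUnit) === String.fromCodePoint(0x1F3A4)); // true
```

A parser that stores the escape's value in a single 16-bit unit would lose the upper bits, which is one plausible way to end up with a replacement glyph; whether that is what happens here is an assumption, not something established in this report.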
Comment 5 Alexey Proskuryakov 2011-12-20 17:27:22 PST
Created attachment 120122 [details]
demo of literal character working fine
Comment 6 P.J. Onori 2011-12-20 20:24:45 PST
Thanks, Alexey. That works for me in Safari 5.1 and WebKit (Chrome doesn't seem to support it).

I poked around as well and couldn't discern whether this is appropriate. Pragmatically, it may prove frustrating for people looking at the CSS file with a typeface that doesn't contain those glyphs. But that's not technically your problem. ;)
Comment 7 P.J. Onori 2011-12-20 20:26:42 PST
My lord, apologies for such a poorly-written comment. It's been a long day...
Comment 8 Mathias Bynens 2012-01-12 10:46:55 PST
(In reply to comment #4)
> This is specifically an issue with parsing strings like '\01f3a4'. It will work if you just paste a Unicode character in your CSS (and add a @charset rule to make sure it's decoded correctly).
> 
> I don't know if we're matching the spec or not here.

FWIW, the spec is here: http://www.w3.org/TR/CSS21/syndata.html#characters / http://www.w3.org/TR/css3-syntax/#characters

It doesn’t mention anything about UTF-16 or surrogate pairs in escapes (which are thus non-standard, although they happen to be supported in WebKit); only Unicode / ISO 10646 code points are allowed in CSS escape sequences. This kind of CSS escape sequence doesn’t work in WebKit for characters outside the BMP, which is what this bug is about. For more info, see this mailing list discussion: http://lists.w3.org/Archives/Public/www-style/2012Jan/thread.html#msg536

For example, `\1d306 ` or `\01d306` are supposed to be CSS escape sequences for the “tetragram for centre” symbol (U+1D306), but they currently don’t work in WebKit.
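
To make the expected behaviour concrete, here is a rough sketch of how these escapes are meant to decode, following my reading of the CSS 2.1 grammar: a backslash, one to six hex digits, then one optional whitespace character that is swallowed as the terminator. `decodeCssEscapes` is a hypothetical name, and substituting U+FFFD for zero or out-of-range values is a choice made for this sketch, not a claim about what any engine does.

```javascript
// Decode CSS backslash escapes in a string (illustrative sketch only).
function decodeCssEscapes(text) {
  // Alternative 1: hex escape, 1-6 digits, optional trailing whitespace.
  // Alternative 2: backslash before any other non-newline character.
  const pattern = /\\([0-9a-fA-F]{1,6})(?:\r\n|[ \t\r\n\f])?|\\([^\r\n\f0-9a-fA-F])/gu;
  return text.replace(pattern, (match, hex, literal) => {
    if (hex === undefined) return literal;
    const codePoint = parseInt(hex, 16);
    // Sketch choice: map zero and values above U+10FFFF to U+FFFD.
    if (codePoint === 0 || codePoint > 0x10FFFF) return '\uFFFD';
    // fromCodePoint handles values above U+FFFF; fromCharCode would not.
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeCssEscapes('\\1d306 ') === '\u{1D306}'); // true
console.log(decodeCssEscapes('\\01d306') === '\u{1D306}'); // true
```

Both spellings should yield the single code point U+1D306. The trailing space in the first form matters when the next character is itself a hex digit.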

(In reply to comment #5)
> Created an attachment (id=120122) [details]
> reduced test case

I’m not sure how that test case helps, as it doesn’t contain a CSS escape sequence, just the literal character. Am I missing something?

Here’s an appropriate test case: http://jsfiddle.net/mathias/jY7ra/ The first escape sequence (used with `html:before`) is the standard one. WebKit is the only engine this fails in.
Comment 9 Mathias Bynens 2012-01-12 11:01:14 PST
Created attachment 122271 [details]
reduced test case
Comment 10 Alexey Proskuryakov 2012-01-12 11:46:53 PST
Comment on attachment 120122 [details]
demo of literal character working fine

> I’m not sure how that test case helps

It was meant as a demonstration that the issue is more limited in scope than originally reported. I chose a poor description for the attachment, sorry for the confusion.
Comment 11 Zoltan Herczeg 2012-01-12 21:25:01 PST
This is a tokenizer-level issue (AP, thanks for CC'ing me). It would not be much trouble to fix in the custom-written tokenizer after it lands; it just needs some extra parsing of the escape sequences.
Comment 12 Mathias Bynens 2012-01-24 08:15:10 PST
Note that this also affects `document.querySelector` and `document.querySelectorAll`.

Failing test case:

data:text/html;charset=utf-8,%3C!DOCTYPE%20html%3E%3Ctitle%3EMothereffing%20CSS%20escapes%20example%3C%2Ftitle%3E%3Cstyle%3Epre%7Bbackground%3A%23eee%3Bpadding%3A.5em%7Dp%7Bdisplay%3Anone%7D%23ab%5Ca9%20de%5C1d306%20fg%7Bdisplay%3Ablock%7D%3C%2Fstyle%3E%3Ch1%3E%3Ca%20href%3D%22http%3A%2F%2Fmothereff.in%2Fcss-escapes%23ab%25C2%25A9de%25F0%259D%258C%2586fg%22%3EMothereffing%20CSS%20escapes%3C%2Fa%3E%20example%3C%2Fh1%3E%3Cpre%3E%3Ccode%3Eab%C2%A9de%F0%9D%8C%86fg%3C%2Fcode%3E%3C%2Fpre%3E%3Cp%20id%3D%22ab%C2%A9de%F0%9D%8C%86fg%22%3EIf%20you%20can%20read%20this%2C%20the%20escaped%20CSS%20selector%20worked.%20%3C%2Fp%3E%3Cscript%3Edocument.getElementById('ab%C2%A9de%F0%9D%8C%86fg').innerHTML%20%2B%3D%20'%20%3Ccode%3Edocument.getElementById%3C%2Fcode%3E%20worked.'%3Bdocument.querySelector('%23ab%5C%5Ca9%20de%5C%5C1d306%20fg').innerHTML%2B%3D'%20%3Ccode%3Edocument.querySelector%3C%2Fcode%3E%20worked.'%3C%2Fscript%3E
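
For anyone building similar test cases by hand, a small sketch of how the escaped selector in that URL can be derived from the raw ID. `escapeNonAscii` is a hypothetical helper: it only hex-escapes non-ASCII code points and does not handle ASCII characters that also need escaping in identifiers (leading digits, punctuation), so it is not a general-purpose escaper.

```javascript
// Hex-escape every non-ASCII code point in an identifier (sketch only).
function escapeNonAscii(identifier) {
  let out = '';
  // for...of iterates by code point, so a surrogate pair stays together.
  for (const ch of identifier) {
    const codePoint = ch.codePointAt(0);
    // The trailing space terminates the escape so that following hex
    // digits (such as the "de" here) are not read as part of it.
    out += codePoint > 0x7F ? '\\' + codePoint.toString(16) + ' ' : ch;
  }
  return out;
}

const rawId = 'ab\u00A9de\u{1D306}fg';
console.log(escapeNonAscii(rawId)); // ab\a9 de\1d306 fg

// In a browser, the result would be used along these lines:
// document.querySelector('#' + escapeNonAscii(rawId));
```

Note that writing the selector as a JavaScript string literal requires doubling each backslash, which is why the test case passes `'#ab\\a9 de\\1d306 fg'` to `document.querySelector`.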

(In reply to comment #11)
> This is a tokenizer-level issue (AP, thanks for CC'ing me). It would not be much trouble to fix in the custom-written tokenizer after it lands; it just needs some extra parsing of the escape sequences.

Out of curiosity, when will the custom-written tokenizer land (if it hasn’t already)? Any bug tickets I can subscribe to?
Comment 13 Zoltan Herczeg 2012-01-24 12:27:41 PST
> Out of curiosity, when will the custom-written tokenizer land (if it hasn’t already)? Any bug tickets I can subscribe to?

https://bugs.webkit.org/show_bug.cgi?id=70107

I just got an r+ to it, but I will land it tomorrow because I want to see the bots.
Comment 14 Mathias Bynens 2012-01-24 12:43:23 PST
FWIW, I’ve just deployed some changes to my CSS escaper tool to make it easier to create test cases for this bug. E.g. click the “example” link on http://mothereff.in/css-escapes#1%F0%9D%8C%86.

(In reply to comment #13)
> I just got an r+ to it, but I will land it tomorrow because I want to see the bots.

That’s awesome news!
Comment 15 Mathias Bynens 2012-01-30 00:32:32 PST
https://bugs.webkit.org/show_bug.cgi?id=70107 is now RESOLVED FIXED, landed here: http://trac.webkit.org/changeset/106217
Comment 16 Mathias Bynens 2012-01-30 00:36:20 PST
Better test case that will show a red/lime background depending on success/failure: data:text/html;charset=utf-8,<!DOCTYPE%20html><title>Mothereffing%20CSS%20escapes%20example<%2Ftitle><style>pre%7Bbackground%3A%23eee%3Bpadding%3A.5em%7D.test%7Bdisplay%3Anone%7D%23b%5Ca9%20de%5C1d306%20fg%7Bdisplay%3Ablock%7D.pass%7Bbackground%3Alime%7D.fail%7Bbackground%3Ared%7D<%2Fstyle><h1><a%20href%3D"http%3A%2F%2Fmothereff.in%2Fcss-escapes%231b%25C2%25A9de%25F0%259D%258C%2586fg">Mothereffing%20CSS%20escapes<%2Fa>%20example<%2Fh1><pre><code>b%C2%A9de%F0%9D%8C%86fg<%2Fcode><%2Fpre><p%20id%3D"b%C2%A9de%F0%9D%8C%86fg"%20class%3Dtest>If%20you%20can%20read%20this%2C%20the%20escaped%20CSS%20selector%20worked.%20<%2Fp><p>Standard%20CSS%20character%20escape%20sequences%20for%20supplementary%20Unicode%20characters%20aren%E2%80%99t%20currently%20supported%20in%20WebKit.%20<strong>This%20test%20case%20will%20fail%20in%20those%20browsers.<%2Fstrong>%20It%E2%80%99s%20better%20to%20leave%20these%20characters%20unescaped.<%2Fp><script>var%20el%3Ddocument.getElementsByTagName('p')%5B0%5D%3Btry%7Bdocument.getElementById('b%5Cxa9de%5Cud834%5Cudf06fg').innerHTML%20%2B%3D%20'%20<code>document.getElementById<%2Fcode>%20worked.'%3Bdocument.querySelector('%23b%5C%5Ca9%20de%5C%5C1d306%20fg').innerHTML%2B%3D'%20<code>document.querySelector<%2Fcode>%20worked.'%3Bel.className%3D'pass'%7Dcatch(e)%7Bel.innerHTML%3D'FAIL'%3Bel.className%3D'fail'%7D<%2Fscript>

Short URL: http://mths.be/bel
Comment 17 Mathias Bynens 2015-04-30 00:26:09 PDT
This seems fixed. Feel free to mark this bug as RESOLVED FIXED.
Comment 18 Alexey Proskuryakov 2015-04-30 09:22:12 PDT

*** This bug has been marked as a duplicate of bug 76152 ***