Bug 4920 - Non-BMP characters in JavaScript identifiers
Summary: Non-BMP characters in JavaScript identifiers
Status: RESOLVED INVALID
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 420+
Hardware: Mac OS X 10.4
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks: 4885
Reported: 2005-09-10 13:06 PDT by Alexey Proskuryakov
Modified: 2011-06-14 12:15 PDT

See Also:


Attachments
test case (668 bytes, text/html)
2005-09-10 13:06 PDT, Alexey Proskuryakov
no flags
A patch to remove KJS::UChar (nearly compiles) (21.06 KB, patch)
2007-11-19 03:41 PST, Eric Seidel (no email)
no flags
Sorry ggaren, I am in your lexer fixing your bugz (4.61 KB, patch)
2007-11-20 02:12 PST, Eric Seidel (no email)
ggaren: review-

Description Alexey Proskuryakov 2005-09-10 13:06:20 PDT
WebKit doesn't permit non-BMP characters (ones with codes above 0xFFFF) in JavaScript identifiers. 
There's no indication that they shouldn't work just like BMP ones.

Firefox also doesn't pass this test case.
Comment 1 Alexey Proskuryakov 2005-09-10 13:06:42 PDT
Created attachment 3848 [details]
test case
Comment 2 Eric Seidel (no email) 2007-11-19 03:30:15 PST
The function in question:

UChar Lexer::convertUnicode(int c1, int c2, int c3, int c4)
{
  // FIXME: This conversion is lossy. See http://bugs.webkit.org/show_bug.cgi?id=4920.
  return UChar((convertHex(c1) << 4) + convertHex(c2), (convertHex(c3) << 4) + convertHex(c4));
}

Calls UChar(unsigned short, unsigned short) which seems to be totally bogus.  You can't make a UTF16 "char" out of a UTF16 surrogate pair (as far as I can tell).

I don't see why the lexer would be combining UTF16 surrogate pairs down into a single 16 bit value.
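For reference, here is a minimal standalone sketch of what that conversion does (assumed helper and test values, not the actual JavaScriptCore code): each \uXXXX escape supplies four hex digits, the first two forming the high byte and the last two the low byte of a single 16-bit code unit.

#include <cassert>
#include <cstdint>

// Assumed equivalent of the lexer's convertHex(): ASCII hex digit -> value.
static int convertHex(int c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  return c - 'A' + 10;
}

// Sketch of convertUnicode(): builds one UTF-16 code unit from four hex digits.
static uint16_t convertUnicodeSketch(int c1, int c2, int c3, int c4)
{
  uint16_t high = (convertHex(c1) << 4) + convertHex(c2);
  uint16_t low = (convertHex(c3) << 4) + convertHex(c4);
  return (high << 8) | low;
}

int main()
{
  assert(convertUnicodeSketch('0', '0', '4', '1') == 0x0041); // "\u0041" -> 'A'
  assert(convertUnicodeSketch('D', '8', '0', '1') == 0xD801); // a lone lead surrogate
  return 0;
}

So for a supplementary character written as "\uD801\uDC00", each escape still becomes its own 16-bit unit; the question in this bug is what the lexer does with the pair afterwards.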
Comment 3 Eric Seidel (no email) 2007-11-19 03:41:31 PST
Created attachment 17395 [details]
A patch to remove KJS::UChar (nearly compiles)

Here is a patch to remove KJS::UChar (which I think is part of the root of this problem).  There are still two compile errors in this patch, one of which touches on this bug (since the method which uses UChar(unsigned short, unsigned short) is totally bogus).  So I'm attaching this patch for Alexey or someone else with massive unicode skillz to complete.
Comment 4 Eric Seidel (no email) 2007-11-19 22:16:27 PST
Well, let me clarify.  KJS::UChar isn't the "root of this problem".  KJS::UChar(unsigned short, unsigned short) is the root of this problem.  It's a busted function which tries to do the impossible: "turn this surrogate pair into a single utf16 codepoint".  The problem is caused by the Lexer not inserting the real surrogate pair into the parsed stream, and instead combining the surrogates into a single broken 16-bit value.  Removing UChar (and thus the broken constructor) is one way of fixing the problem. Another would be to simply remove the constructor from UChar. :)
Comment 5 Eric Seidel (no email) 2007-11-19 23:08:05 PST
Um.  I was wrong.  This has nothing to do with Lexer::convertUnicode().  That function, although strange, is right for its purpose.  Likewise UChar(unsigned short, unsigned short) really should be UChar(unsigned char, unsigned char), but it seems to be used correctly.
Comment 6 Eric Seidel (no email) 2007-11-20 02:12:00 PST
Created attachment 17414 [details]
Sorry ggaren, I am in your lexer fixing your bugz

That was actually quite simple.  About to SunSpider this puppy to make sure we haven't regressed too badly as a result.
Comment 7 Eric Seidel (no email) 2007-11-20 02:44:52 PST
Tested with SunSpider.  It gave me totally wacky results.  Some slower, some faster.  Net change: 0.
Comment 8 Alexey Proskuryakov 2007-11-20 03:20:21 PST
Comment on attachment 17414 [details]
Sorry ggaren, I am in your lexer fixing your bugz

+      } else if (U16_IS_SURROGATE_LEAD(current) && isIdentStart(U16_GET_SUPPLEMENTARY(current, next1))) {

This looks great to me, but I'd like someone with more JS knowledge to review.

One thing I'm not sure about is what will happen if next1 is not available (but JS doesn't use incremental parsing, does it?)
Comment 9 Sam Weinig 2007-11-20 13:59:18 PST
Comment on attachment 17414 [details]
Sorry ggaren, I am in your lexer fixing your bugz

This needs a changelog and a testcase.  r=me
Comment 10 Geoffrey Garen 2007-11-26 11:39:35 PST
Comment on attachment 17414 [details]
Sorry ggaren, I am in your lexer fixing your bugz

If you reach the end of the input stream before reading next1, next1 is set to -1. Calling U16_GET_SUPPLEMENTARY on -1 is not valid because "The result is undefined if the input values are not lead and trail surrogates."

Calling U16_IS_SURROGATE_LEAD on current is also not valid because U16_IS_SURROGATE_LEAD assumes that "c is a surrogate code point."

You need to check U16_IS_LEAD(current) and U16_IS_TRAIL(next1) before calling isIdentStart() or isIdentPart().

You should also add 2 test cases: one for when next1 is -1 and bad things happen, and one for when current seems like a lead surrogate even though it is not a surrogate code point and bad things happen.
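For illustration only, the kind of guarded check being asked for might look roughly like the sketch below. It uses ICU's UTF-16 macros and an assumed stand-in for the lexer's isIdentStart(); it is not the actual patch.

#include <unicode/uchar.h>
#include <unicode/utf16.h>

// Rough stand-in for the lexer's isIdentStart(); JS additionally allows '$' and '_'.
static bool isIdentStartSketch(UChar32 c)
{
  return c == '$' || c == '_' || u_isIDStart(c);
}

// Only combine current/next1 into a supplementary code point when they really
// form a lead/trail surrogate pair; reject a lone surrogate or end of input (-1).
static bool identifierCanStartHere(int current, int next1)
{
  if (U16_IS_LEAD(current) && U16_IS_TRAIL(next1))
    return isIdentStartSketch(U16_GET_SUPPLEMENTARY(current, next1));
  if (current < 0 || U16_IS_SURROGATE(current))
    return false;
  return isIdentStartSketch(current);
}

The isIdentPart() case would need the same U16_IS_LEAD/U16_IS_TRAIL guard.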
Comment 11 Eric Seidel (no email) 2011-05-17 10:45:28 PDT
V8 doesn't pass the test case either.  Are we sure this is still a valid bug?
Comment 12 Gavin Barraclough 2011-06-13 22:22:00 PDT
(In reply to comment #11)
> V8 doesn't pass the test case either.  Are we sure this is still a valid bug?

I don't believe this bug is valid.

Section 6 of ES5 defines the meaning of 'character' and 'Unicode character' within the spec.  'Character' refers to precisely one 16-bit UTF-16 Code Unit.  Only the exact phrase 'Unicode character' should refer to a unicode encoded character, possibly represented by a surrogate pair in UTF-16.

Section 7.6 defines the set of characters permissible in Identifiers in terms of the categories of 'characters', i.e. 16-bit Code Units.  Considered individually (as the designation 'character' requires) the two halves of an abstract character formed by a surrogate pair are of code point category Cs, which is not a permissible code point category for inclusion within an identifier.

As such, 𐐀 is not a valid identifier per the ES5 spec.
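To make the code-unit reading concrete, here is a small illustration (a sketch using ICU, not part of any patch) of how 𐐀 (U+10400) looks when examined one 16-bit code unit at a time:

#include <unicode/uchar.h>
#include <stdio.h>

int main()
{
  // U+10400 is encoded in UTF-16 as the surrogate pair 0xD801 0xDC00.
  // Taken individually, each half has general category Cs (surrogate),
  // which is not in the set of categories allowed in IdentifierName.
  const UChar32 units[] = { 0xD801, 0xDC00 };
  for (int i = 0; i < 2; ++i)
    printf("U+%04X -> category %d (U_SURROGATE == %d)\n",
           (unsigned)units[i], (int)u_charType(units[i]), (int)U_SURROGATE);
  return 0;
}

Both halves report U_SURROGATE (general category Cs), which is why a per-code-unit lexer rejects the identifier.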
Comment 13 Alexey Proskuryakov 2011-06-13 23:08:45 PDT
I think that the spec is just being sloppy here, as it is with everything else Unicode related, but compatibility beats common sense.

> Considered individually (as the designation 'character' requires) the two halves of an abstract character formed by a surrogate pair are of code point category Cs

I do not think that this is how it works. Only a Unicode character can have a category, so the halves in UTF-16 encoding don't have categories at all. In other words, U+D801 has category Cs, but bytes 0xD801 don't have a category.
Comment 14 Gavin Barraclough 2011-06-14 02:38:22 PDT
> I do not think that this is how it works. Only a Unicode character can have a category, so the halves in UTF-16 encoding don't have categories at all. In other words, U+D801 has category Cs, but bytes 0xD801 don't have a category.

I'm afraid I don't have quite the same reading of the Unicode spec as you.  The Unicode spec defines types for Unicode Code Points, which are specified as simply being any integer in the range 0..0x10FFFF.  Is it valid to request a Code Point type for the value 0xD801?  Yes: it has a type, which is Cs.

And I think the clearer point here is that this seems unequivocally to be the intention of the ES5 spec.  ES5 clearly specifies that the source text should be lexed one Unicode code unit at a time.  The specification defines both the terms 'SourceCharacter' and 'character' to correspond to a single UTF-16 Code Unit (deliberately drawing a distinction with "Unicode characters", below).  Arising from this definition, the specification of elements within an IdentifierName only considers the Unicode Code Point type of a single Code Unit at a time.

Unless and until the ECMA spec changes to instead define 'SourceCharacter' and 'character' to be a "Unicode character" (which may be represented by more than one code unit), we shouldn't change our behaviour here - I think it's pretty clear that we are in compliance with the ES5 spec as it currently stands.
Comment 15 Alexey Proskuryakov 2011-06-14 09:05:05 PDT
I didn't reopen this bug, because I don't think that there is enough practical difference for us to care. But tracking an obvious JS spec mistake with a WebKit bug is not unthinkable, even when we're in compliance.

> The Unicode spec defines types for Unicode Code Points

As you said, they are defined for code points, and not for code units. These two-byte sequences of UTF-16 are code units.
Comment 16 Gavin Barraclough 2011-06-14 12:15:52 PDT
> > The Unicode spec defines types for Unicode Code Points
> 
> As you said, they are defined for code points, and not for code units. These two-byte sequences of UTF-16 are code units.

Sure, and the code unit to code point mapping is outside the definition of the Unicode spec.  ECMAScript used to specifically define that the source text encoding was UCS-2, which is a direct mapping from code units to code points.  They no longer call out the name of this encoding, but the spec-defined lexing behaviour still matches UCS-2 decoding.
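The practical difference between the two readings can be shown in a few lines (a sketch, not WebKit code): walking the source one code unit at a time, as in the reading described above, yields two lone surrogates, while UTF-16 decoding of the same units yields the single code point U+10400.

#include <unicode/utf16.h>
#include <stdio.h>

int main()
{
  const UChar source[] = { 0xD801, 0xDC00 }; // "𐐀" as UTF-16 code units
  const int length = 2;

  // Per-code-unit reading (effectively UCS-2): each unit is its own "character".
  for (int i = 0; i < length; ++i)
    printf("per code unit: U+%04X\n", (unsigned)source[i]);

  // UTF-16 reading: the pair decodes to one supplementary code point.
  int i = 0;
  while (i < length) {
    UChar32 c;
    U16_NEXT(source, i, length, c);
    printf("UTF-16 decode: U+%04X\n", (unsigned)c);
  }
  return 0;
}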

> I didn't reopen this bug, because I don't think that there is enough practical difference for us to care. But tracking an obvious JS spec mistake with a WebKit bug is not unthinkable, even when we're in compliance.

That's an understandable and reasonable goal.  I'm just worried that this bug report as it stands is likely to be misconstrued as a bug in our code that should be fixed, and that a contributor might work towards making changes that we would not accept.  Maybe we could try to get a bug filed against the ECMA spec (I believe their bug tracker is only open to ES members :-( ), and then you could reopen this bug & mark it as blocked on the ES one?