Bug 4921 - \u escape sequences in JavaScript identifiers
Summary: \u escape sequences in JavaScript identifiers
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: Geoffrey Garen
URL:
Keywords: HasReduction
: 4918 (view as bug list)
Depends on:
Blocks: 4885
  Show dependency treegraph
 
Reported: 2005-09-10 13:09 PDT by Alexey Proskuryakov
Modified: 2006-03-11 07:38 PST (History)
2 users (show)

See Also:


Attachments
test case (669 bytes, text/html)
2005-09-10 13:09 PDT, Alexey Proskuryakov
no flags Details
Darin's patch extracted from 4885. Fixes unicode escapes (9.55 KB, patch)
2005-10-11 07:44 PDT, Kimmo Kinnunen
darin: review-
Details | Formatted Diff | Diff
New version of the unicode escape fix patch (9.59 KB, patch)
2005-10-11 09:58 PDT, Kimmo Kinnunen
darin: review+
Details | Formatted Diff | Diff
Testcase to demonstrate keyword clash with unicode escape sequences (701 bytes, text/html)
2005-10-11 10:09 PDT, Kimmo Kinnunen
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Proskuryakov 2005-09-10 13:09:45 PDT
WebKit doesn't permit escaped sequences in JavaScript identifiers. As I understand ECMA-262, these are 
supposed to be handled exactly like the corresponding characters.

Firefox passes this test.
Comment 1 Alexey Proskuryakov 2005-09-10 13:09:59 PDT
Created attachment 3849 [details]
test case
Comment 2 Kimmo Kinnunen 2005-10-11 07:44:43 PDT
Created attachment 4309 [details]
Darin's patch extracted from 4885. Fixes unicode escapes

Darin's patch extracted from bug 4885. Fixes unicode escapes.

Note: in 4885, there's no handling of the case where there is an unicode escape
in an identifier which maps to a token after the escape is augmented.
Comment 3 Darin Adler 2005-10-11 08:40:01 PDT
Comment on attachment 4309 [details]
Darin's patch extracted from 4885. Fixes unicode escapes

Looks great (not sure if I can review my own patch).

Where does the specification say that if you use \u sequences, then an
identifier is not a keyword? I re-read it and that was not clear to me. Did you
test other implementations to see if that was so? (Part of the reason I ask is
that I know in C++ it does not work this way; you can use the sequences in
keywords.)

The boolean identifierIsNotToken is a great way to implement that rule, but I
recommend naming it identifierIsNotKeyword instead. Another way to implement it
would be to create a separate state. The old one could be renamed
IdentifierOrKeyword and the new state would be just Identifier. The
IdentifierOrKeyword case could fall into the Identifier case after checking for
and handling keywords.

Besides allowing \u sequences in identifiers, this patch also implements the
full rule for what characters are allowed in identifiers. So the attached test
case is insufficient. We need test cases that cover a variety of valid and
invalid identifiers, as well as covering the case of an identifer that uses a
\u sequence to avoid collision with a keyword.

The minimum that would be needed on this patch would be renaming
identifierIsNotToken to identifierIsNotKeyword and adding more test cases.
Comment 4 Kimmo Kinnunen 2005-10-11 09:58:57 PDT
Created attachment 4313 [details]
New version of the unicode escape fix patch

Unicode escape fix. 

Changes the bool variable identifierIsNotToken to few extra states.
Comment 5 Kimmo Kinnunen 2005-10-11 10:06:52 PDT
(In reply to comment #3)
> Where does the specification say that if you use \u sequences, then an
> identifier is not a keyword? I re-read it and that was not clear to me. Did you
> test other implementations to see if that was so? (Part of the reason I ask is
> that I know in C++ it does not work this way; you can use the sequences in
> keywords.)

The grammar says that 'Keyword' can be one of the set of keywords and that 'IdentifierStart' can have 
'UnicodeEscape', so I assumed that that means that 'Keyword' can not have 'UnicodeEscape'. As far as I 
can see, the spec says that you shouldn't have unicode escapes in keywords. It's not explicitly written 
out, so I could be missing something..

Yes, Firefox behaves in same way. There's also mozilla test to test this - ecma_3/Unicode/uc-003.js is 
fixed with the patch..

> The boolean identifierIsNotToken is a great way to implement that rule, but I
> recommend naming it identifierIsNotKeyword instead. Another way to implement it
> would be to create a separate state. The old one could be renamed
> IdentifierOrKeyword and the new state would be just Identifier. The
> IdentifierOrKeyword case could fall into the Identifier case after checking for
> and handling keywords.

This is certainly more explicit and clearer. I added this to the new patch. 

> Besides allowing \u sequences in identifiers, this patch also implements the
> full rule for what characters are allowed in identifiers. So the attached test
> case is insufficient. We need test cases that cover a variety of valid and
> invalid identifiers, as well as covering the case of an identifer that uses a
> \u sequence to avoid collision with a keyword.

There's atleast that  ecma_3/Unicode/uc-003.js which is fixed. It tests the collision with the keyword.
 
Comment 6 Kimmo Kinnunen 2005-10-11 10:09:56 PDT
Created attachment 4314 [details]
Testcase to demonstrate keyword clash with unicode escape sequences

Testcase to demonstrate keyword clash with unicode escape sequences. Duplicates
the test of ecma_3/Unicode/uc-003.js
Comment 7 Kimmo Kinnunen 2005-10-11 10:20:56 PDT
BTW: here's the list of tests this fixes. Unfortunately there are one unrelated test case which is fixed..

ecma_3/Function/regress-193555.js
I don't know why this is fixed. I thought it has something to do with additional checks for whitespace 
characters (ie. the test may have unicode whitespace chars which were previously not accepted as 
whitespace). With fast inspection, I couldn't find such a characters, though.

ecma_3/Function/regress-58274.js
Unicode escape test for escaped chars.

ecma_3/Unicode/uc-003.js

Tests identifiers which have escaped unicode chars. Also tests the keyword clash

ecma_3/Unicode/uc-005.js
Unicode escape test for escaped chars.


BTW. There might be problems with following constructs:
function id\1234() { };
if (id\1234.toString() == somestringvar)

because 'id\1234.toString()' doesn't return string with escaped text, but string 'id?' where ? is the 
augmented unicode character. So some information of the source is lost.

In Firefox situation is similar but slightly better: for 'id\1234' FF returns 'id\1234', but for 'id\abcd' FF 
returns 'id\ABCD'.


Comment 8 Kimmo Kinnunen 2005-10-11 10:24:22 PDT
(In reply to comment #7)
> function id\1234() { };
> if (id\1234.toString() == somestringvar)

And of course there were tons of typos in the comment #7.
The \1234 should've been \u1234. For example the above snippet should have been something like

function id\u1234() { };
if (id\u1234.toString() == somestringvar)
Comment 9 Maks Orlovich 2005-12-30 12:32:39 PST
This also fixes kjs/parse.js, via permitting unicode identifiers. I can spot a 
bug though --- it needs to check that decoded chars are valid identifier 
start/continuation. From 7.6: 
"A UnicodeEscapeSequence 
cannot be used to put a character into an identifier that would otherwise be 
illegal. In other words, if a \ 
UnicodeEscapeSequence sequence were replaced by its UnicodeEscapeSequence's 
CV, the result must still 
be a valid Identifier that has the exact same sequence of characters as the 
original Identifier." 
 
 
Comment 10 Geoffrey Garen 2005-12-30 13:20:37 PST
Comment on attachment 4313 [details]
New version of the unicode escape fix patch

The bug Maks mentions is probably worthy of a separate bug report. This patch,
however, seems to have addressed Darin's previous comments. Darin?
Comment 11 Darin Adler 2005-12-30 13:29:18 PST
Comment on attachment 4313 [details]
New version of the unicode escape fix patch

Yes, seems fine to land this as the next step in improving things. I'd like to
see the problem Maks reports fixed.
Comment 12 Darin Adler 2005-12-30 13:29:38 PST
Comment on attachment 4313 [details]
New version of the unicode escape fix patch

The patch contains some tab characters. It would be best to land it without
those.
Comment 13 Geoffrey Garen 2006-01-04 13:43:27 PST
Landed. Didn't land layout test because comments indicate it's a duplicate of the mozilla test.
Comment 14 Alexey Proskuryakov 2006-03-11 07:38:13 PST
*** Bug 4918 has been marked as a duplicate of this bug. ***