Bug 4921

Summary:

\u escape sequences in JavaScript identifiers

Product:

WebKit

Reporter:

Alexey Proskuryakov <ap>

Component:

JavaScriptCore

Assignee:

Geoffrey Garen <ggaren>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

darin, maksim

Priority:

Keywords:

HasReduction

Version:

420+

Hardware:

Mac

OS:

OS X 10.4

Bug Depends on:

Bug Blocks:

4885

Attachments:

Description	Flags
test case	none
Darin's patch extracted from 4885. Fixes unicode escapes	darin: review-
New version of the unicode escape fix patch	darin: review+
Testcase to demonstrate keyword clash with unicode escape sequences	none

Alexey Proskuryakov

Reported 2005-09-10 13:09:45 PDT

WebKit doesn't permit escaped sequences in JavaScript identifiers. As I understand ECMA-262, these are supposed to be handled exactly like the corresponding characters. Firefox passes this test.

Attachments
test case (669 bytes, text/html) 2005-09-10 13:09 PDT, Alexey Proskuryakov	no flags	Details
Darin's patch extracted from 4885. Fixes unicode escapes (9.55 KB, patch) 2005-10-11 07:44 PDT, Kimmo Kinnunen	darin: review-	Details Formatted Diff Diff
New version of the unicode escape fix patch (9.59 KB, patch) 2005-10-11 09:58 PDT, Kimmo Kinnunen	darin: review+	Details Formatted Diff Diff
Testcase to demonstrate keyword clash with unicode escape sequences (701 bytes, text/html) 2005-10-11 10:09 PDT, Kimmo Kinnunen	no flags	Details
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2005-09-10 13:09:59 PDT

Created attachment 3849 [details] test case

Kimmo Kinnunen

Comment 2 2005-10-11 07:44:43 PDT

Created attachment 4309 [details] Darin's patch extracted from 4885. Fixes unicode escapes Darin's patch extracted from bug 4885. Fixes unicode escapes. Note: in 4885, there's no handling of the case where there is an unicode escape in an identifier which maps to a token after the escape is augmented.

Darin Adler

Comment 3 2005-10-11 08:40:01 PDT

Comment on attachment 4309 [details] Darin's patch extracted from 4885. Fixes unicode escapes Looks great (not sure if I can review my own patch). Where does the specification say that if you use \u sequences, then an identifier is not a keyword? I re-read it and that was not clear to me. Did you test other implementations to see if that was so? (Part of the reason I ask is that I know in C++ it does not work this way; you can use the sequences in keywords.) The boolean identifierIsNotToken is a great way to implement that rule, but I recommend naming it identifierIsNotKeyword instead. Another way to implement it would be to create a separate state. The old one could be renamed IdentifierOrKeyword and the new state would be just Identifier. The IdentifierOrKeyword case could fall into the Identifier case after checking for and handling keywords. Besides allowing \u sequences in identifiers, this patch also implements the full rule for what characters are allowed in identifiers. So the attached test case is insufficient. We need test cases that cover a variety of valid and invalid identifiers, as well as covering the case of an identifer that uses a \u sequence to avoid collision with a keyword. The minimum that would be needed on this patch would be renaming identifierIsNotToken to identifierIsNotKeyword and adding more test cases.

Kimmo Kinnunen

Comment 4 2005-10-11 09:58:57 PDT

Created attachment 4313 [details] New version of the unicode escape fix patch Unicode escape fix. Changes the bool variable identifierIsNotToken to few extra states.

Kimmo Kinnunen

Comment 5 2005-10-11 10:06:52 PDT

(In reply to comment #3) > Where does the specification say that if you use \u sequences, then an > identifier is not a keyword? I re-read it and that was not clear to me. Did you > test other implementations to see if that was so? (Part of the reason I ask is > that I know in C++ it does not work this way; you can use the sequences in > keywords.) The grammar says that 'Keyword' can be one of the set of keywords and that 'IdentifierStart' can have 'UnicodeEscape', so I assumed that that means that 'Keyword' can not have 'UnicodeEscape'. As far as I can see, the spec says that you shouldn't have unicode escapes in keywords. It's not explicitly written out, so I could be missing something.. Yes, Firefox behaves in same way. There's also mozilla test to test this - ecma_3/Unicode/uc-003.js is fixed with the patch.. > The boolean identifierIsNotToken is a great way to implement that rule, but I > recommend naming it identifierIsNotKeyword instead. Another way to implement it > would be to create a separate state. The old one could be renamed > IdentifierOrKeyword and the new state would be just Identifier. The > IdentifierOrKeyword case could fall into the Identifier case after checking for > and handling keywords. This is certainly more explicit and clearer. I added this to the new patch. > Besides allowing \u sequences in identifiers, this patch also implements the > full rule for what characters are allowed in identifiers. So the attached test > case is insufficient. We need test cases that cover a variety of valid and > invalid identifiers, as well as covering the case of an identifer that uses a > \u sequence to avoid collision with a keyword. There's atleast that ecma_3/Unicode/uc-003.js which is fixed. It tests the collision with the keyword.

Kimmo Kinnunen

Comment 6 2005-10-11 10:09:56 PDT

Created attachment 4314 [details] Testcase to demonstrate keyword clash with unicode escape sequences Testcase to demonstrate keyword clash with unicode escape sequences. Duplicates the test of ecma_3/Unicode/uc-003.js

Kimmo Kinnunen

Comment 7 2005-10-11 10:20:56 PDT

BTW: here's the list of tests this fixes. Unfortunately there are one unrelated test case which is fixed.. ecma_3/Function/regress-193555.js I don't know why this is fixed. I thought it has something to do with additional checks for whitespace characters (ie. the test may have unicode whitespace chars which were previously not accepted as whitespace). With fast inspection, I couldn't find such a characters, though. ecma_3/Function/regress-58274.js Unicode escape test for escaped chars. ecma_3/Unicode/uc-003.js Tests identifiers which have escaped unicode chars. Also tests the keyword clash ecma_3/Unicode/uc-005.js Unicode escape test for escaped chars. BTW. There might be problems with following constructs: function id\1234() { }; if (id\1234.toString() == somestringvar) because 'id\1234.toString()' doesn't return string with escaped text, but string 'id?' where ? is the augmented unicode character. So some information of the source is lost. In Firefox situation is similar but slightly better: for 'id\1234' FF returns 'id\1234', but for 'id\abcd' FF returns 'id\ABCD'.

Kimmo Kinnunen

Comment 8 2005-10-11 10:24:22 PDT

(In reply to comment #7) > function id\1234() { }; > if (id\1234.toString() == somestringvar) And of course there were tons of typos in the comment #7. The \1234 should've been \u1234. For example the above snippet should have been something like function id\u1234() { }; if (id\u1234.toString() == somestringvar)

Maks Orlovich

Comment 9 2005-12-30 12:32:39 PST

This also fixes kjs/parse.js, via permitting unicode identifiers. I can spot a bug though --- it needs to check that decoded chars are valid identifier start/continuation. From 7.6: "A UnicodeEscapeSequence cannot be used to put a character into an identifier that would otherwise be illegal. In other words, if a \ UnicodeEscapeSequence sequence were replaced by its UnicodeEscapeSequence's CV, the result must still be a valid Identifier that has the exact same sequence of characters as the original Identifier."

Geoffrey Garen

Comment 10 2005-12-30 13:20:37 PST

Comment on attachment 4313 [details] New version of the unicode escape fix patch The bug Maks mentions is probably worthy of a separate bug report. This patch, however, seems to have addressed Darin's previous comments. Darin?

Darin Adler

Comment 11 2005-12-30 13:29:18 PST

Comment on attachment 4313 [details] New version of the unicode escape fix patch Yes, seems fine to land this as the next step in improving things. I'd like to see the problem Maks reports fixed.

Darin Adler

Comment 12 2005-12-30 13:29:38 PST

Comment on attachment 4313 [details] New version of the unicode escape fix patch The patch contains some tab characters. It would be best to land it without those.

Geoffrey Garen

Comment 13 2006-01-04 13:43:27 PST

Landed. Didn't land layout test because comments indicate it's a duplicate of the mozilla test.

Alexey Proskuryakov

Comment 14 2006-03-11 07:38:13 PST

*** Bug 4918 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.