Bug 4921

Summary: \u escape sequences in JavaScript identifiers
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: JavaScriptCoreAssignee: Geoffrey Garen <ggaren>
Status: RESOLVED FIXED    
Severity: Normal CC: darin, maksim
Priority: P2 Keywords: HasReduction
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
Bug Depends on:    
Bug Blocks: 4885    
Attachments:
Description Flags
test case
none
Darin's patch extracted from 4885. Fixes unicode escapes
darin: review-
New version of the unicode escape fix patch
darin: review+
Testcase to demonstrate keyword clash with unicode escape sequences none

Alexey Proskuryakov
Reported 2005-09-10 13:09:45 PDT
WebKit doesn't permit escaped sequences in JavaScript identifiers. As I understand ECMA-262, these are supposed to be handled exactly like the corresponding characters. Firefox passes this test.
Attachments
test case (669 bytes, text/html)
2005-09-10 13:09 PDT, Alexey Proskuryakov
no flags
Darin's patch extracted from 4885. Fixes unicode escapes (9.55 KB, patch)
2005-10-11 07:44 PDT, Kimmo Kinnunen
darin: review-
New version of the unicode escape fix patch (9.59 KB, patch)
2005-10-11 09:58 PDT, Kimmo Kinnunen
darin: review+
Testcase to demonstrate keyword clash with unicode escape sequences (701 bytes, text/html)
2005-10-11 10:09 PDT, Kimmo Kinnunen
no flags
Alexey Proskuryakov
Comment 1 2005-09-10 13:09:59 PDT
Created attachment 3849 [details] test case
Kimmo Kinnunen
Comment 2 2005-10-11 07:44:43 PDT
Created attachment 4309 [details] Darin's patch extracted from 4885. Fixes unicode escapes Darin's patch extracted from bug 4885. Fixes unicode escapes. Note: in 4885, there's no handling of the case where there is an unicode escape in an identifier which maps to a token after the escape is augmented.
Darin Adler
Comment 3 2005-10-11 08:40:01 PDT
Comment on attachment 4309 [details] Darin's patch extracted from 4885. Fixes unicode escapes Looks great (not sure if I can review my own patch). Where does the specification say that if you use \u sequences, then an identifier is not a keyword? I re-read it and that was not clear to me. Did you test other implementations to see if that was so? (Part of the reason I ask is that I know in C++ it does not work this way; you can use the sequences in keywords.) The boolean identifierIsNotToken is a great way to implement that rule, but I recommend naming it identifierIsNotKeyword instead. Another way to implement it would be to create a separate state. The old one could be renamed IdentifierOrKeyword and the new state would be just Identifier. The IdentifierOrKeyword case could fall into the Identifier case after checking for and handling keywords. Besides allowing \u sequences in identifiers, this patch also implements the full rule for what characters are allowed in identifiers. So the attached test case is insufficient. We need test cases that cover a variety of valid and invalid identifiers, as well as covering the case of an identifer that uses a \u sequence to avoid collision with a keyword. The minimum that would be needed on this patch would be renaming identifierIsNotToken to identifierIsNotKeyword and adding more test cases.
Kimmo Kinnunen
Comment 4 2005-10-11 09:58:57 PDT
Created attachment 4313 [details] New version of the unicode escape fix patch Unicode escape fix. Changes the bool variable identifierIsNotToken to few extra states.
Kimmo Kinnunen
Comment 5 2005-10-11 10:06:52 PDT
(In reply to comment #3) > Where does the specification say that if you use \u sequences, then an > identifier is not a keyword? I re-read it and that was not clear to me. Did you > test other implementations to see if that was so? (Part of the reason I ask is > that I know in C++ it does not work this way; you can use the sequences in > keywords.) The grammar says that 'Keyword' can be one of the set of keywords and that 'IdentifierStart' can have 'UnicodeEscape', so I assumed that that means that 'Keyword' can not have 'UnicodeEscape'. As far as I can see, the spec says that you shouldn't have unicode escapes in keywords. It's not explicitly written out, so I could be missing something.. Yes, Firefox behaves in same way. There's also mozilla test to test this - ecma_3/Unicode/uc-003.js is fixed with the patch.. > The boolean identifierIsNotToken is a great way to implement that rule, but I > recommend naming it identifierIsNotKeyword instead. Another way to implement it > would be to create a separate state. The old one could be renamed > IdentifierOrKeyword and the new state would be just Identifier. The > IdentifierOrKeyword case could fall into the Identifier case after checking for > and handling keywords. This is certainly more explicit and clearer. I added this to the new patch. > Besides allowing \u sequences in identifiers, this patch also implements the > full rule for what characters are allowed in identifiers. So the attached test > case is insufficient. We need test cases that cover a variety of valid and > invalid identifiers, as well as covering the case of an identifer that uses a > \u sequence to avoid collision with a keyword. There's atleast that ecma_3/Unicode/uc-003.js which is fixed. It tests the collision with the keyword.
Kimmo Kinnunen
Comment 6 2005-10-11 10:09:56 PDT
Created attachment 4314 [details] Testcase to demonstrate keyword clash with unicode escape sequences Testcase to demonstrate keyword clash with unicode escape sequences. Duplicates the test of ecma_3/Unicode/uc-003.js
Kimmo Kinnunen
Comment 7 2005-10-11 10:20:56 PDT
BTW: here's the list of tests this fixes. Unfortunately there are one unrelated test case which is fixed.. ecma_3/Function/regress-193555.js I don't know why this is fixed. I thought it has something to do with additional checks for whitespace characters (ie. the test may have unicode whitespace chars which were previously not accepted as whitespace). With fast inspection, I couldn't find such a characters, though. ecma_3/Function/regress-58274.js Unicode escape test for escaped chars. ecma_3/Unicode/uc-003.js Tests identifiers which have escaped unicode chars. Also tests the keyword clash ecma_3/Unicode/uc-005.js Unicode escape test for escaped chars. BTW. There might be problems with following constructs: function id\1234() { }; if (id\1234.toString() == somestringvar) because 'id\1234.toString()' doesn't return string with escaped text, but string 'id?' where ? is the augmented unicode character. So some information of the source is lost. In Firefox situation is similar but slightly better: for 'id\1234' FF returns 'id\1234', but for 'id\abcd' FF returns 'id\ABCD'.
Kimmo Kinnunen
Comment 8 2005-10-11 10:24:22 PDT
(In reply to comment #7) > function id\1234() { }; > if (id\1234.toString() == somestringvar) And of course there were tons of typos in the comment #7. The \1234 should've been \u1234. For example the above snippet should have been something like function id\u1234() { }; if (id\u1234.toString() == somestringvar)
Maks Orlovich
Comment 9 2005-12-30 12:32:39 PST
This also fixes kjs/parse.js, via permitting unicode identifiers. I can spot a bug though --- it needs to check that decoded chars are valid identifier start/continuation. From 7.6: "A UnicodeEscapeSequence cannot be used to put a character into an identifier that would otherwise be illegal. In other words, if a \ UnicodeEscapeSequence sequence were replaced by its UnicodeEscapeSequence's CV, the result must still be a valid Identifier that has the exact same sequence of characters as the original Identifier."
Geoffrey Garen
Comment 10 2005-12-30 13:20:37 PST
Comment on attachment 4313 [details] New version of the unicode escape fix patch The bug Maks mentions is probably worthy of a separate bug report. This patch, however, seems to have addressed Darin's previous comments. Darin?
Darin Adler
Comment 11 2005-12-30 13:29:18 PST
Comment on attachment 4313 [details] New version of the unicode escape fix patch Yes, seems fine to land this as the next step in improving things. I'd like to see the problem Maks reports fixed.
Darin Adler
Comment 12 2005-12-30 13:29:38 PST
Comment on attachment 4313 [details] New version of the unicode escape fix patch The patch contains some tab characters. It would be best to land it without those.
Geoffrey Garen
Comment 13 2006-01-04 13:43:27 PST
Landed. Didn't land layout test because comments indicate it's a duplicate of the mozilla test.
Alexey Proskuryakov
Comment 14 2006-03-11 07:38:13 PST
*** Bug 4918 has been marked as a duplicate of this bug. ***
Note You need to log in before you can comment on or make changes to this bug.