208998 – JavaScript identifier grammar supports unescaped astral symbols, but JSC doesn’t

RESOLVED FIXED208998

JavaScript identifier grammar supports unescaped astral symbols, but JSC doesn’t

https://bugs.webkit.org/show_bug.cgi?id=208998

Summary JavaScript identifier grammar supports unescaped astral symbols, but JSC doesn’t

Mathias Bynens

Reported 2020-03-12 08:58:42 PDT

Since ES2015, ID_Start or ID_Continue symbols _outside of the BMP_ can be used in identifiers. Example using U+102A7 CARIAN LETTER A2 (taken from https://mathiasbynens.be/notes/javascript-identifiers-es6): var 𐊧; // 1 var \u{102A7}; // 2 Expected result: The above program evaluates without throwing an exception. Actual result: Although JavaScriptCore handles line 2 correctly, it throws on line 1: SyntaxError: Invalid character '\ud800' Note that JSC is the only engine with this bug: $ eshost -sx 'var 𐊧' #### Chakra, SpiderMonkey, V8, V8 --harmony, XS #### JavaScriptCore SyntaxError: Invalid character '\ud800'

Attachments
WIP (31.43 KB, patch) 2020-03-15 15:57 PDT, Keith Miller	no flags	Details Formatted Diff Diff
Patch (34.10 KB, patch) 2020-03-16 15:11 PDT, Keith Miller	no flags	Details Formatted Diff Diff
Patch (40.20 KB, patch) 2020-03-16 15:42 PDT, Keith Miller	no flags	Details Formatted Diff Diff
Patch for landing (40.19 KB, patch) 2020-03-16 16:28 PDT, Keith Miller	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Radar WebKit Bug Importer

Comment 1 2020-03-12 12:22:25 PDT

<rdar://problem/60381814>

Keith Miller

Comment 2 2020-03-13 11:35:31 PDT

Interesting our implementation seems semantically the same as v8's: static inline bool isIdentStart(UChar32 c) { return isLatin1(c) ? isIdentStart(static_cast<LChar>(c)) : isNonLatin1IdentStart(c); } static NEVER_INLINE bool isNonLatin1IdentStart(UChar c) { return u_hasBinaryProperty(c, UCHAR_ID_START); } and similarly for non-start: static ALWAYS_INLINE bool isIdentPart(UChar32 c) { return isLatin1(c) ? isIdentPart(static_cast<LChar>(c)) : isNonLatin1IdentPart(c); } static NEVER_INLINE bool isNonLatin1IdentPart(UChar32 c) { return u_hasBinaryProperty(c, UCHAR_ID_CONTINUE) || c == 0x200C || c == 0x200D; } My guess is this is a bug in the system ICU. Does that analysis seem correct to you? My Unicode/ICU knowledge is very limited...

Yusuke Suzuki

Comment 3 2020-03-13 13:04:58 PDT

I think we should have a fallback path for surrogate pairs. Something like this. if (!isIdentStart(code)) { // Check whether the code is leading surrogate. if (!isLeadSurrogate(code)) goto fail; codeNext = getOneCharacter(); if (!isTrailSurrogate(codeNext)) goto fail; codePoint = codePointFromSurrogatePair(code, codeNext); if (!isIdentStart(codePoint)) goto fail; }

Mathias Bynens

Comment 4 2020-03-13 16:28:32 PDT

The error message mentions U+D800 which is the high surrogate half of the character in my example ('\u{102A7}' === '\uD800\uDEA7'), so I think the code paths Keith mentions are hit for each surrogate half (and surrogate code points are never valid identifier characters, hence the exception) as opposed to considering surrogate pairs where applicable.

Keith Miller

Comment 5 2020-03-14 00:27:00 PDT

(In reply to Mathias Bynens from comment #4) > The error message mentions U+D800 which is the high surrogate half of the > character in my example ('\u{102A7}' === '\uD800\uDEA7'), so I think the > code paths Keith mentions are hit for each surrogate half (and surrogate > code points are never valid identifier characters, hence the exception) as > opposed to considering surrogate pairs where applicable. Yeah, I think you're right. I just didn't know what was going on... >.>

Keith Miller

Comment 6 2020-03-15 15:57:01 PDT

Created attachment 393627 [details] WIP

Keith Miller

Comment 7 2020-03-16 15:11:00 PDT

Created attachment 393691 [details] Patch

Keith Miller

Comment 8 2020-03-16 15:42:09 PDT

Created attachment 393695 [details] Patch

Michael Saboff

Comment 9 2020-03-16 16:17:36 PDT

Comment on attachment 393695 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=393695&action=review r=me > Source/JavaScriptCore/parser/Lexer.cpp:2625 > + // FIXME: This should probably not be a lex error but dealing with surrogate pairs here is annoying and it's going to be an error anyway... Add a bug for this FIXME.

Keith Miller

Comment 10 2020-03-16 16:19:07 PDT

Comment on attachment 393695 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=393695&action=review >> Source/JavaScriptCore/parser/Lexer.cpp:2625 >> + // FIXME: This should probably not be a lex error but dealing with surrogate pairs here is annoying and it's going to be an error anyway... > > Add a bug for this FIXME. Ehh, This should probably not be a FIXME. I don't think we're actually going to fix it.

Keith Miller

Comment 11 2020-03-16 16:28:14 PDT

Created attachment 393704 [details] Patch for landing

WebKit Commit Bot

Comment 12 2020-03-16 17:12:22 PDT

Comment on attachment 393704 [details] Patch for landing Clearing flags on attachment: 393704 Committed r258531: <https://trac.webkit.org/changeset/258531>

WebKit Commit Bot

Comment 13 2020-03-16 17:12:24 PDT

All reviewed patches have been landed. Closing bug.

Alexey Shvayka

Comment 14 2020-03-23 10:19:48 PDT

*** Bug 167940 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Keith Miller

Reported

2020-03-12 08:58 PDT

Modified

2020-03-23 10:19 PDT History

CC List

15 users Show

URL

Keywords InRadar

Duplicates (1)

167940 View as bug list

Depends on

Blocks