178174 – Add support in named capture group identifiers for direct surrogate pairs

RESOLVED FIXED178174

Add support in named capture group identifiers for direct surrogate pairs

https://bugs.webkit.org/show_bug.cgi?id=178174

Summary Add support in named capture group identifiers for direct surrogate pairs

Michael Saboff

Reported 2017-10-11 11:24:28 PDT

Currently, a named capture group can have a backslash escaped RegExpUnicodeEscapeSequence (e.g. \{12345}). It also looks like it should take approriate unicode characters directly, including those made up of surrogate pairs. Failing Test262 - JSTests/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js Exception: SyntaxError: Invalid regular expression: invalid group specifier name at JSTests/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js:16 This test is a UTF-8 source file with embedded non-BMP Unicode characters as both named group identifiers and property names. I made a JS source file with a couple of the Unicode properties from the original test file, and the jsc command throws SyntaxError: Invalid character '\ud801' See attached test source file. I think we need the jsc command to handle UTF-8 source files before we can fix this bug.

Attachments
Patch (8.48 KB, patch) 2020-03-23 09:33 PDT, Alexey Shvayka	no flags	Details Formatted Diff Diff
Patch (24.44 KB, patch) 2020-03-26 20:28 PDT, Alexey Shvayka	no flags	Details Formatted Diff Diff
Patch (24.52 KB, patch) 2020-03-27 16:52 PDT, Alexey Shvayka	no flags	Details Formatted Diff Diff
Patch (24.80 KB, patch) 2020-03-30 17:31 PDT, Alexey Shvayka	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Michael Saboff

Comment 1 2017-10-11 11:25:09 PDT

<rdar://problem/34843961>

Alexey Shvayka

Comment 2 2020-03-23 09:33:09 PDT

Created attachment 394267 [details] Patch

Alexey Shvayka

Comment 3 2020-03-26 20:28:24 PDT

Created attachment 394700 [details] Patch

Alexey Shvayka

Comment 4 2020-03-26 20:41:35 PDT

While this patch is functionally the same, I've: a) Moved isIdentityEscapeAnError() out of tryConsumeUnicodeEscape() as RegExpUnicodeEscapeSequence [1] does not contain IdentityEscape [2] production. b) Tweaked error messages to report invalid \u escapes instead of identity escapes (which are Annex B), matching other error messages in other engines. c) Added a few error messages tests. I am kindly asking for re-review. Thank you. [1]: https://tc39.es/ecma262/#prod-RegExpUnicodeEscapeSequence [2]: https://tc39.es/ecma262/#prod-IdentityEscape

Darin Adler

Comment 5 2020-03-27 09:59:59 PDT

Comment on attachment 394700 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review This looks fine to me. Michael may also want to review. The key is really test coverage including performance test results, not necessarily exactly how we write the code > Source/JavaScriptCore/yarr/YarrParser.h:493 > + int escapeChar = tryConsumeUnicodeEscape(); Normally we don’t use abbreviations like "char" in WebKit code, preferring words like "character". So this local variable would be either named "escape" or "escapeCharacter" or "escaped" or "escapedCharacter" or "codePoint". > Source/JavaScriptCore/yarr/YarrParser.h:-1044 > - if (u == -1) > - return -1; What’s the rationale for removing this? I don’t think it’s great to rely on what U16_IS_LEAD will return when passed something that is not a 16-bit code unit. So I would have left it in. > Source/JavaScriptCore/yarr/YarrParser.h:1031 > + int escapeChar = tryConsumeUnicodeEscape(); Same comment about variable naming here. Names like "ch" and "u" are definitely frowned upon in our general coding style, but in a way "escapeChar" is half-way there. Our underlying concept is that people are better at reading words than abbreviations.

Darin Adler

Comment 6 2020-03-27 10:02:30 PDT

Comment on attachment 394700 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review >> Source/JavaScriptCore/yarr/YarrParser.h:-1044 >> - return -1; > > What’s the rationale for removing this? I don’t think it’s great to rely on what U16_IS_LEAD will return when passed something that is not a 16-bit code unit. So I would have left it in. Or we could remove the code, but add this: // <... some amazing comment explaining why this assertion is relevant ...> static_assert(!U16_IS_LEAD(-1));

Michael Saboff

Comment 7 2020-03-27 13:14:24 PDT

Comment on attachment 394700 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review >> Source/JavaScriptCore/yarr/YarrParser.h:493 >> + int escapeChar = tryConsumeUnicodeEscape(); > > Normally we don’t use abbreviations like "char" in WebKit code, preferring words like "character". So this local variable would be either named "escape" or "escapeCharacter" or "escaped" or "escapedCharacter" or "codePoint". I would prefer codePoint as that is what a Unicode escape represents. > Source/JavaScriptCore/yarr/YarrParser.h:999 > + ASSERT(!hasError(m_errorCode)); Isn't this a redundant ASSERT() given that we have the same assert at the beginning of the function and everywhere we set m_errorCode here we return? > Source/JavaScriptCore/yarr/YarrParser.h:1032 > + if (escapeChar == -1 && !hasError(m_errorCode)) Seems a little odd that we are using different combinations of a -1 return value from tryConsumeUnicodeEscape() and m_errorCode to signal an error depending on the call site. Maybe you could make tryConsumeUnicodeEscape() a templatized function with the template parameter changing in what cases tryConsumeUnicodeEscape() sets m_errorCode.

Alexey Shvayka

Comment 8 2020-03-27 16:52:52 PDT

Created attachment 394770 [details] Patch Set reviewers, rename vars, bring back -1 check, drop redundant ASSERT, make tryConsumeUnicodeEscape() a template<>.

Alexey Shvayka

Comment 9 2020-03-27 17:16:54 PDT

(In reply to Michael Saboff from comment #7) > Seems a little odd that we are using different combinations of a -1 return > value from tryConsumeUnicodeEscape() and m_errorCode to signal an error > depending on the call site. Maybe you could make tryConsumeUnicodeEscape() > a templatized function with the template parameter changing in what cases > tryConsumeUnicodeEscape() sets m_errorCode. This is great improvement, thank you. If it is not too much to ask, I would appreciate your feedback on https://bugs.webkit.org/show_bug.cgi?id=178175.

Michael Saboff

Comment 10 2020-03-30 10:02:37 PDT

Comment on attachment 394770 [details] Patch r-me This looks great. The only change I recommend is changing the template parameter for tryConsumeUnicodeEscape() from a boolean to an enum. That way the template parameter at the call site describes the behavior. Just a suggestion, but use enum values something like AsCodePoint and AsIdentifierCharacter.

Michael Saboff

Comment 11 2020-03-30 10:16:11 PDT

Comment on attachment 394770 [details] Patch That is "r=me".

Michael Saboff

Comment 12 2020-03-30 10:30:09 PDT

(In reply to Alexey Shvayka from comment #9) > (In reply to Michael Saboff from comment #7) > ... If it is not too much to ask, I would > appreciate your feedback on https://bugs.webkit.org/show_bug.cgi?id=178175. I looked at it earlier and didn't have any comments to add.

Alexey Shvayka

Comment 13 2020-03-30 17:31:19 PDT

Created attachment 394992 [details] Patch Introduce UnicodeEscapeContext.

Alexey Shvayka

Comment 14 2020-03-30 17:41:31 PDT

(In reply to Michael Saboff from comment #10) > This looks great. The only change I recommend is changing the template > parameter for tryConsumeUnicodeEscape() from a boolean to an enum. That way > the template parameter at the call site describes the behavior. Just a > suggestion, but use enum values something like AsCodePoint and > AsIdentifierCharacter. I went with UnicodeEscapeContext::{CharacterEscape,IdentifierName} because: a) As* usually points to a return type, while what we are passing to template<> is merely a parsing parameter that is used only in non-Unicode patterns; b) it was quite hard to pick a name that matched with As*, while UnicodeEscapeContext was a good fit.

Michael Saboff

Comment 15 2020-03-30 18:12:16 PDT

Comment on attachment 394992 [details] Patch r=me. I like it.

EWS

Comment 16 2020-03-30 18:27:17 PDT

Committed r259262: <https://trac.webkit.org/changeset/259262> All reviewed patches have been landed. Closing bug and clearing flags on attachment 394992 [details].

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Minor

Classification Unclassified

Version WebKit Nightly Build

Hardware All

OS All

Product WebKit

Component JavaScriptCore

Assignee

Alexey Shvayka

Reported

2017-10-11 11:24 PDT

Modified

2020-03-30 18:27 PDT History

CC List

10 users Show

URL

Keywords InRadar

Depends on

176435

Blocks

Dependencies

tree graph