Bug 178174 - Add support in named capture group identifiers for direct surrogate pairs
Summary: Add support in named capture group identifiers for direct surrogate pairs
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: All All
: P2 Minor
Assignee: Alexey Shvayka
URL:
Keywords: InRadar
Depends on: 176435
Blocks:
  Show dependency treegraph
 
Reported: 2017-10-11 11:24 PDT by Michael Saboff
Modified: 2020-03-30 18:27 PDT (History)
10 users (show)

See Also:


Attachments
Patch (8.48 KB, patch)
2020-03-23 09:33 PDT, Alexey Shvayka
no flags Details | Formatted Diff | Diff
Patch (24.44 KB, patch)
2020-03-26 20:28 PDT, Alexey Shvayka
no flags Details | Formatted Diff | Diff
Patch (24.52 KB, patch)
2020-03-27 16:52 PDT, Alexey Shvayka
no flags Details | Formatted Diff | Diff
Patch (24.80 KB, patch)
2020-03-30 17:31 PDT, Alexey Shvayka
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Saboff 2017-10-11 11:24:28 PDT
Currently, a named capture group can have a backslash escaped RegExpUnicodeEscapeSequence (e.g. \{12345}).  It also looks like it should take approriate unicode characters directly, including those made up of surrogate pairs.

Failing Test262 -
JSTests/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js
Exception: SyntaxError: Invalid regular expression: invalid group specifier name
at JSTests/test262/test/built-ins/RegExp/named-groups/unicode-property-names.js:16

This test is a UTF-8 source file with embedded non-BMP Unicode characters as both named group identifiers and property names.  I made a JS source file with a couple of the Unicode properties from the original test file, and the jsc command throws 
    SyntaxError: Invalid character '\ud801'
See attached test source file.

I think we need the jsc command to handle UTF-8 source files before we can fix this bug.
Comment 1 Michael Saboff 2017-10-11 11:25:09 PDT
<rdar://problem/34843961>
Comment 2 Alexey Shvayka 2020-03-23 09:33:09 PDT
Created attachment 394267 [details]
Patch
Comment 3 Alexey Shvayka 2020-03-26 20:28:24 PDT
Created attachment 394700 [details]
Patch
Comment 4 Alexey Shvayka 2020-03-26 20:41:35 PDT
While this patch is functionally the same, I've:

a) Moved isIdentityEscapeAnError() out of tryConsumeUnicodeEscape() as RegExpUnicodeEscapeSequence [1] does not contain IdentityEscape [2] production.
b) Tweaked error messages to report invalid \u escapes instead of identity escapes (which are Annex B), matching other error messages in other engines.
c) Added a few error messages tests.

I am kindly asking for re-review. Thank you.

[1]: https://tc39.es/ecma262/#prod-RegExpUnicodeEscapeSequence
[2]: https://tc39.es/ecma262/#prod-IdentityEscape
Comment 5 Darin Adler 2020-03-27 09:59:59 PDT
Comment on attachment 394700 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review

This looks fine to me. Michael may also want to review. The key is really test coverage including performance test results, not necessarily exactly how we write the code

> Source/JavaScriptCore/yarr/YarrParser.h:493
> +            int escapeChar = tryConsumeUnicodeEscape();

Normally we don’t use abbreviations like "char" in WebKit code, preferring words like "character". So this local variable would be either named "escape" or "escapeCharacter" or "escaped" or "escapedCharacter" or "codePoint".

> Source/JavaScriptCore/yarr/YarrParser.h:-1044
> -        if (u == -1)
> -            return -1;

What’s the rationale for removing this? I don’t think it’s great to rely on what U16_IS_LEAD will return when passed something that is not a 16-bit code unit. So I would have left it in.

> Source/JavaScriptCore/yarr/YarrParser.h:1031
> +            int escapeChar = tryConsumeUnicodeEscape();

Same comment about variable naming here. Names like "ch" and "u" are definitely frowned upon in our general coding style, but in a way "escapeChar" is half-way there. Our underlying concept is that people are better at reading words than abbreviations.
Comment 6 Darin Adler 2020-03-27 10:02:30 PDT
Comment on attachment 394700 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review

>> Source/JavaScriptCore/yarr/YarrParser.h:-1044
>> -            return -1;
> 
> What’s the rationale for removing this? I don’t think it’s great to rely on what U16_IS_LEAD will return when passed something that is not a 16-bit code unit. So I would have left it in.

Or we could remove the code, but add this:

    // <... some amazing comment explaining why this assertion is relevant ...>
    static_assert(!U16_IS_LEAD(-1));
Comment 7 Michael Saboff 2020-03-27 13:14:24 PDT
Comment on attachment 394700 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=394700&action=review

>> Source/JavaScriptCore/yarr/YarrParser.h:493
>> +            int escapeChar = tryConsumeUnicodeEscape();
> 
> Normally we don’t use abbreviations like "char" in WebKit code, preferring words like "character". So this local variable would be either named "escape" or "escapeCharacter" or "escaped" or "escapedCharacter" or "codePoint".

I would prefer codePoint as that is what a Unicode escape represents.

> Source/JavaScriptCore/yarr/YarrParser.h:999
> +            ASSERT(!hasError(m_errorCode));

Isn't this a redundant ASSERT() given that we have the same assert at the beginning of the function and everywhere we set m_errorCode here we return?

> Source/JavaScriptCore/yarr/YarrParser.h:1032
> +            if (escapeChar == -1 && !hasError(m_errorCode))

Seems a little odd that we are using different combinations of a -1 return value from tryConsumeUnicodeEscape() and m_errorCode to signal an error depending on the call site.  Maybe you could make tryConsumeUnicodeEscape() a templatized function with the template parameter changing in what cases tryConsumeUnicodeEscape()  sets m_errorCode.
Comment 8 Alexey Shvayka 2020-03-27 16:52:52 PDT
Created attachment 394770 [details]
Patch

Set reviewers, rename vars, bring back -1 check, drop redundant ASSERT, make tryConsumeUnicodeEscape() a template<>.
Comment 9 Alexey Shvayka 2020-03-27 17:16:54 PDT
(In reply to Michael Saboff from comment #7)
> Seems a little odd that we are using different combinations of a -1 return
> value from tryConsumeUnicodeEscape() and m_errorCode to signal an error
> depending on the call site.  Maybe you could make tryConsumeUnicodeEscape()
> a templatized function with the template parameter changing in what cases
> tryConsumeUnicodeEscape()  sets m_errorCode.

This is great improvement, thank you. If it is not too much to ask, I would appreciate your feedback on https://bugs.webkit.org/show_bug.cgi?id=178175.
Comment 10 Michael Saboff 2020-03-30 10:02:37 PDT
Comment on attachment 394770 [details]
Patch

r-me

This looks great.  The only change I recommend is changing the template parameter for tryConsumeUnicodeEscape() from a boolean to an enum.  That way the template parameter at the call site describes the behavior.  Just a suggestion, but use enum values something like AsCodePoint and AsIdentifierCharacter.
Comment 11 Michael Saboff 2020-03-30 10:16:11 PDT
Comment on attachment 394770 [details]
Patch

That is "r=me".
Comment 12 Michael Saboff 2020-03-30 10:30:09 PDT
(In reply to Alexey Shvayka from comment #9)
> (In reply to Michael Saboff from comment #7)

> ... If it is not too much to ask, I would
> appreciate your feedback on https://bugs.webkit.org/show_bug.cgi?id=178175.

I looked at it earlier and didn't have any comments to add.
Comment 13 Alexey Shvayka 2020-03-30 17:31:19 PDT
Created attachment 394992 [details]
Patch

Introduce UnicodeEscapeContext.
Comment 14 Alexey Shvayka 2020-03-30 17:41:31 PDT
(In reply to Michael Saboff from comment #10)
> This looks great.  The only change I recommend is changing the template
> parameter for tryConsumeUnicodeEscape() from a boolean to an enum.  That way
> the template parameter at the call site describes the behavior.  Just a
> suggestion, but use enum values something like AsCodePoint and
> AsIdentifierCharacter.

I went with UnicodeEscapeContext::{CharacterEscape,IdentifierName} because:
a) As* usually points to a return type, while what we are passing to template<> is merely a parsing parameter that is used only in non-Unicode patterns;
b) it was quite hard to pick a name that matched with As*, while UnicodeEscapeContext was a good fit.
Comment 15 Michael Saboff 2020-03-30 18:12:16 PDT
Comment on attachment 394992 [details]
Patch

r=me.

I like it.
Comment 16 EWS 2020-03-30 18:27:17 PDT
Committed r259262: <https://trac.webkit.org/changeset/259262>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 394992 [details].