Bug 210309 - [YARR] Allow for Unicode named capture group identifiers in non-Unicode regular expressions
Summary: [YARR] Allow for Unicode named capture group identifiers in non-Unicode regul...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Michael Saboff
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-04-09 16:09 PDT by Michael Saboff
Modified: 2020-04-13 12:19 PDT (History)
8 users (show)

See Also:


Attachments
Patch (12.95 KB, patch)
2020-04-09 16:30 PDT, Michael Saboff
no flags Details | Formatted Diff | Diff
Patch with updates from review (14.84 KB, patch)
2020-04-10 10:42 PDT, Michael Saboff
no flags Details | Formatted Diff | Diff
Patch with suggested naming changes and additional named back reference testing (16.54 KB, patch)
2020-04-10 16:42 PDT, Michael Saboff
ross.kirsling: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Saboff 2020-04-09 16:09:42 PDT
During the March/April 2020 TC-39 meeting, it was agreed that named capture group identifiers can contain unicode escape characters even for non-unicode flagged regular expressions.  This change is part of the EcmaScript 2020 draft standard that was approved by TC-39 at the same meeting and should be ratified by the Ecma General Assembly in June 2020.

The current 2019 standard allows the following constructs for named capture group identifiers with non-BMP codepoints
    let regex1a = /(?<𝒜>A)/u;
    let regex1b = /(?<\u{1d49c}>A)/u;
    let regex1c = /(?<\ud835\udc9c>A)/u;
    let regex2a = /(?<𝒜>A)/;  // no u flag 
But didn’t allow non-BMP unicode escapes in named capture group identifiers for non unicode regex’s
    let regex2b = /(?<\u{1d49c}>A)/;
    let regex2c = /(?<\ud835\udc9c>A)/;

JavaScriptCore has a bug where it doesn’t even allow regex2a.

This is to track fixing the JSC bug for regex2a, and adding the two other non-unicode forms to bring JSC into compliance with the 2020 standard.
Comment 1 Michael Saboff 2020-04-09 16:10:01 PDT
<rdar://problem/61547384>
Comment 2 Michael Saboff 2020-04-09 16:30:40 PDT
Created attachment 396024 [details]
Patch
Comment 3 Ross Kirsling 2020-04-09 21:52:47 PDT
Seems like a good informal review opportunity for Alexey. ;)
Comment 4 Alexey Shvayka 2020-04-10 04:18:09 PDT
(In reply to Ross Kirsling from comment #3)
> Seems like a good informal review opportunity for Alexey. ;)

Thank you, Ross, I appreciate it.

(In reply to Michael Saboff from comment #0)
> JavaScriptCore has a bug where it doesn’t even allow regex2a.

It is also part of recent change: please see https://github.com/tc39/ecma262/pull/1869#issuecomment-607742063.
V8 doesn’t allow regex2a as well, which is covered by JSTests/test262/test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js.
Comment 5 Alexey Shvayka 2020-04-10 04:36:46 PDT
Comment on attachment 396024 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=396024&action=review

LGTM, a pure joy to informally review.

> JSTests/ChangeLog:8
> +        New test.

We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change:
- test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js
- test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js
- test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js
- test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js
- test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js

> JSTests/stress/regexp-named-capture-groups.js:16
> +    shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name");

I really like error message check, and the whole test is crafted with love.
Michael, would you object if I export this file to test262 with your & Apple Inc. credentials?

> Source/JavaScriptCore/ChangeLog:8
> +        Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups.

Should we include spec PR link? https://github.com/tc39/ecma262/pull/1869

> Source/JavaScriptCore/yarr/YarrParser.h:543
> +        if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) {

Since we are changing this line, we can use `!atEndOfPattern()` equivalent instead of `(patternRemaining() > 0)`.
Comment 6 Alexey Shvayka 2020-04-10 05:49:34 PDT
Comment on attachment 396024 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=396024&action=review

> Source/JavaScriptCore/yarr/YarrParser.h:537
> +    template<UnicodeEscapeContext context>

consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me.
Surrogate pairs are consumed by many productions, not only character escapes & identifiers.
`bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`).
Comment 7 Michael Saboff 2020-04-10 10:28:26 PDT
(In reply to Alexey Shvayka from comment #5)
> Comment on attachment 396024 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=396024&action=review
> 
> LGTM, a pure joy to informally review.

Thank you for taking a look.

> > JSTests/ChangeLog:8
> > +        New test.
> 
> We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change:
> - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js
> - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js
> - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js
> - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js
> - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js

I updated JSTests/test262/expectations.yaml for the now failing tests.

> > JSTests/stress/regexp-named-capture-groups.js:16
> > +    shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name");
> 
> I really like error message check, and the whole test is crafted with love.
> Michael, would you object if I export this file to test262 with your & Apple Inc. credentials?

Fine by me.

> > Source/JavaScriptCore/ChangeLog:8
> > +        Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups.
> 
> Should we include spec PR link? https://github.com/tc39/ecma262/pull/1869
> 
> > Source/JavaScriptCore/yarr/YarrParser.h:543
> > +        if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) {
> 
> Since we are changing this line, we can use `!atEndOfPattern()` equivalent
> instead of `(patternRemaining() > 0)`.

Done.

> > Source/JavaScriptCore/yarr/YarrParser.h:537
> > +    template<UnicodeEscapeContext context>
> 
> 
> consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me.
> Surrogate pairs are consumed by many productions, not only character escapes & identifiers.
> `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`).

In the two other places that consumePossibleSurrogatePair() is used, one is for individual characters and the other as part of a character class parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape.  The character class parsing uses characters as ends of a range or as individual to include in the class.
Comment 8 Michael Saboff 2020-04-10 10:42:46 PDT
Created attachment 396102 [details]
Patch with updates from review
Comment 9 Alexey Shvayka 2020-04-10 10:55:43 PDT
(In reply to Michael Saboff from comment #7)
> In the two other places that consumePossibleSurrogatePair() is used, one is
> for individual characters and the other as part of a character class
> parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. 

I might be putting a little different meaning into UnicodeEscapeContext: for me, it reads as "parent production of Unicode escape we are currently parsing", making it a bit awkward to use in `default` cases of parseTokens() and parseCharacterClass() as they don't parse any escape.
Comment 10 Alexey Shvayka 2020-04-10 11:02:38 PDT
If I'd seen just the signature of consumePossibleSurrogatePair():

  template<UnicodeEscapeContext context>
  UChar32 consumePossibleSurrogatePair()

I would have assume that it parses \uLEAD\uTRAIL escapes since it has UnicodeEscapeContext as a parameter, and not literal astral pair.
Comment 11 Michael Saboff 2020-04-10 11:16:14 PDT
(In reply to Alexey Shvayka from comment #9)
> (In reply to Michael Saboff from comment #7)
> > In the two other places that consumePossibleSurrogatePair() is used, one is
> > for individual characters and the other as part of a character class
> > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. 
> 
> I might be putting a little different meaning into UnicodeEscapeContext: for
> me, it reads as "parent production of Unicode escape we are currently
> parsing", making it a bit awkward to use in `default` cases of parseTokens()
> and parseCharacterClass() as they don't parse any escape.

If we renamed consumePossibleSurrogatePair() to consumePossibleSurrogatePairOrCharacter() would that make things a little clearer?
Comment 12 Alexey Shvayka 2020-04-10 11:35:49 PDT
(In reply to Michael Saboff from comment #11)
> If we renamed consumePossibleSurrogatePair() to
> consumePossibleSurrogatePairOrCharacter() would that make things a little
> clearer?

Sorry, please disregard comment #10, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context.

I am not a huge fan of boolean parameters (https://ariya.io/2011/08/hall-of-api-shame-boolean-trap) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those?
Comment 13 Michael Saboff 2020-04-10 16:34:24 PDT
(In reply to Alexey Shvayka from comment #12)
> (In reply to Michael Saboff from comment #11)
> > If we renamed consumePossibleSurrogatePair() to
> > consumePossibleSurrogatePairOrCharacter() would that make things a little
> > clearer?
> 
> Sorry, please disregard comment #10, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context.
> 
> I am not a huge fan of boolean parameters (https://ariya.io/2011/08/hall-of-api-shame-boolean-trap) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those?

How about we rename the enum UnicodeEscapeContext to UnicodeParseContext with the two values PatternCodePoint and GroupName?  Then we use UnicodeParseContext::PatternCodePoint and UnicodeParseContext::GroupName.  Both of these values make sense in the current and new usage and are closer to the ECMAScript spec.
Comment 14 Michael Saboff 2020-04-10 16:42:53 PDT
Created attachment 396135 [details]
Patch with suggested naming changes and additional named back reference testing
Comment 15 Alexey Shvayka 2020-04-11 09:53:28 PDT
(In reply to Michael Saboff from comment #14)
> Created attachment 396135 [details]
> Patch with suggested naming changes and additional named back reference
> testing

Looks perfect, thank you for taking time to address the feedback.
Comment 16 Michael Saboff 2020-04-13 12:19:50 PDT
Committed r260033: <https://trac.webkit.org/changeset/260033>