WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
210309
[YARR] Allow for Unicode named capture group identifiers in non-Unicode regular expressions
https://bugs.webkit.org/show_bug.cgi?id=210309
Summary
[YARR] Allow for Unicode named capture group identifiers in non-Unicode regul...
Michael Saboff
Reported
2020-04-09 16:09:42 PDT
During the March/April 2020 TC-39 meeting, it was agreed that named capture group identifiers can contain unicode escape characters even for non-unicode flagged regular expressions. This change is part of the EcmaScript 2020 draft standard that was approved by TC-39 at the same meeting and should be ratified by the Ecma General Assembly in June 2020. The current 2019 standard allows the following constructs for named capture group identifiers with non-BMP codepoints let regex1a = /(?<𝒜>A)/u; let regex1b = /(?<\u{1d49c}>A)/u; let regex1c = /(?<\ud835\udc9c>A)/u; let regex2a = /(?<𝒜>A)/; // no u flag But didn’t allow non-BMP unicode escapes in named capture group identifiers for non unicode regex’s let regex2b = /(?<\u{1d49c}>A)/; let regex2c = /(?<\ud835\udc9c>A)/; JavaScriptCore has a bug where it doesn’t even allow regex2a. This is to track fixing the JSC bug for regex2a, and adding the two other non-unicode forms to bring JSC into compliance with the 2020 standard.
Attachments
Patch
(12.95 KB, patch)
2020-04-09 16:30 PDT
,
Michael Saboff
no flags
Details
Formatted Diff
Diff
Patch with updates from review
(14.84 KB, patch)
2020-04-10 10:42 PDT
,
Michael Saboff
no flags
Details
Formatted Diff
Diff
Patch with suggested naming changes and additional named back reference testing
(16.54 KB, patch)
2020-04-10 16:42 PDT
,
Michael Saboff
ross.kirsling
: review+
Details
Formatted Diff
Diff
Show Obsolete
(2)
View All
Add attachment
proposed patch, testcase, etc.
Michael Saboff
Comment 1
2020-04-09 16:10:01 PDT
<
rdar://problem/61547384
>
Michael Saboff
Comment 2
2020-04-09 16:30:40 PDT
Created
attachment 396024
[details]
Patch
Ross Kirsling
Comment 3
2020-04-09 21:52:47 PDT
Seems like a good informal review opportunity for Alexey. ;)
Alexey Shvayka
Comment 4
2020-04-10 04:18:09 PDT
(In reply to Ross Kirsling from
comment #3
)
> Seems like a good informal review opportunity for Alexey. ;)
Thank you, Ross, I appreciate it. (In reply to Michael Saboff from
comment #0
)
> JavaScriptCore has a bug where it doesn’t even allow regex2a.
It is also part of recent change: please see
https://github.com/tc39/ecma262/pull/1869#issuecomment-607742063
. V8 doesn’t allow regex2a as well, which is covered by JSTests/test262/test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js.
Alexey Shvayka
Comment 5
2020-04-10 04:36:46 PDT
Comment on
attachment 396024
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=396024&action=review
LGTM, a pure joy to informally review.
> JSTests/ChangeLog:8 > + New test.
We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change: - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js
> JSTests/stress/regexp-named-capture-groups.js:16 > + shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name");
I really like error message check, and the whole test is crafted with love. Michael, would you object if I export this file to test262 with your & Apple Inc. credentials?
> Source/JavaScriptCore/ChangeLog:8 > + Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups.
Should we include spec PR link?
https://github.com/tc39/ecma262/pull/1869
> Source/JavaScriptCore/yarr/YarrParser.h:543 > + if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) {
Since we are changing this line, we can use `!atEndOfPattern()` equivalent instead of `(patternRemaining() > 0)`.
Alexey Shvayka
Comment 6
2020-04-10 05:49:34 PDT
Comment on
attachment 396024
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=396024&action=review
> Source/JavaScriptCore/yarr/YarrParser.h:537 > + template<UnicodeEscapeContext context>
consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me. Surrogate pairs are consumed by many productions, not only character escapes & identifiers. `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`).
Michael Saboff
Comment 7
2020-04-10 10:28:26 PDT
(In reply to Alexey Shvayka from
comment #5
)
> Comment on
attachment 396024
[details]
> Patch > > View in context: >
https://bugs.webkit.org/attachment.cgi?id=396024&action=review
> > LGTM, a pure joy to informally review.
Thank you for taking a look.
> > JSTests/ChangeLog:8 > > + New test. > > We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change: > - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js > - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js > - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js > - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js > - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js
I updated JSTests/test262/expectations.yaml for the now failing tests.
> > JSTests/stress/regexp-named-capture-groups.js:16 > > + shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name"); > > I really like error message check, and the whole test is crafted with love. > Michael, would you object if I export this file to test262 with your & Apple Inc. credentials?
Fine by me.
> > Source/JavaScriptCore/ChangeLog:8 > > + Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups. > > Should we include spec PR link?
https://github.com/tc39/ecma262/pull/1869
> > > Source/JavaScriptCore/yarr/YarrParser.h:543 > > + if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) { > > Since we are changing this line, we can use `!atEndOfPattern()` equivalent > instead of `(patternRemaining() > 0)`.
Done.
> > Source/JavaScriptCore/yarr/YarrParser.h:537 > > + template<UnicodeEscapeContext context> > > > consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me. > Surrogate pairs are consumed by many productions, not only character escapes & identifiers. > `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`).
In the two other places that consumePossibleSurrogatePair() is used, one is for individual characters and the other as part of a character class parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. The character class parsing uses characters as ends of a range or as individual to include in the class.
Michael Saboff
Comment 8
2020-04-10 10:42:46 PDT
Created
attachment 396102
[details]
Patch with updates from review
Alexey Shvayka
Comment 9
2020-04-10 10:55:43 PDT
(In reply to Michael Saboff from
comment #7
)
> In the two other places that consumePossibleSurrogatePair() is used, one is > for individual characters and the other as part of a character class > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape.
I might be putting a little different meaning into UnicodeEscapeContext: for me, it reads as "parent production of Unicode escape we are currently parsing", making it a bit awkward to use in `default` cases of parseTokens() and parseCharacterClass() as they don't parse any escape.
Alexey Shvayka
Comment 10
2020-04-10 11:02:38 PDT
If I'd seen just the signature of consumePossibleSurrogatePair(): template<UnicodeEscapeContext context> UChar32 consumePossibleSurrogatePair() I would have assume that it parses \uLEAD\uTRAIL escapes since it has UnicodeEscapeContext as a parameter, and not literal astral pair.
Michael Saboff
Comment 11
2020-04-10 11:16:14 PDT
(In reply to Alexey Shvayka from
comment #9
)
> (In reply to Michael Saboff from
comment #7
) > > In the two other places that consumePossibleSurrogatePair() is used, one is > > for individual characters and the other as part of a character class > > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. > > I might be putting a little different meaning into UnicodeEscapeContext: for > me, it reads as "parent production of Unicode escape we are currently > parsing", making it a bit awkward to use in `default` cases of parseTokens() > and parseCharacterClass() as they don't parse any escape.
If we renamed consumePossibleSurrogatePair() to consumePossibleSurrogatePairOrCharacter() would that make things a little clearer?
Alexey Shvayka
Comment 12
2020-04-10 11:35:49 PDT
(In reply to Michael Saboff from
comment #11
)
> If we renamed consumePossibleSurrogatePair() to > consumePossibleSurrogatePairOrCharacter() would that make things a little > clearer?
Sorry, please disregard
comment #10
, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context. I am not a huge fan of boolean parameters (
https://ariya.io/2011/08/hall-of-api-shame-boolean-trap
) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those?
Michael Saboff
Comment 13
2020-04-10 16:34:24 PDT
(In reply to Alexey Shvayka from
comment #12
)
> (In reply to Michael Saboff from
comment #11
) > > If we renamed consumePossibleSurrogatePair() to > > consumePossibleSurrogatePairOrCharacter() would that make things a little > > clearer? > > Sorry, please disregard
comment #10
, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context. > > I am not a huge fan of boolean parameters (
https://ariya.io/2011/08/hall-of-api-shame-boolean-trap
) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those?
How about we rename the enum UnicodeEscapeContext to UnicodeParseContext with the two values PatternCodePoint and GroupName? Then we use UnicodeParseContext::PatternCodePoint and UnicodeParseContext::GroupName. Both of these values make sense in the current and new usage and are closer to the ECMAScript spec.
Michael Saboff
Comment 14
2020-04-10 16:42:53 PDT
Created
attachment 396135
[details]
Patch with suggested naming changes and additional named back reference testing
Alexey Shvayka
Comment 15
2020-04-11 09:53:28 PDT
(In reply to Michael Saboff from
comment #14
)
> Created
attachment 396135
[details]
> Patch with suggested naming changes and additional named back reference > testing
Looks perfect, thank you for taking time to address the feedback.
Michael Saboff
Comment 16
2020-04-13 12:19:50 PDT
Committed
r260033
: <
https://trac.webkit.org/changeset/260033
>
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug