210309 – [YARR] Allow for Unicode named capture group identifiers in non-Unicode regular expressions

RESOLVED FIXED 210309

[YARR] Allow for Unicode named capture group identifiers in non-Unicode regular expressions

https://bugs.webkit.org/show_bug.cgi?id=210309

Summary [YARR] Allow for Unicode named capture group identifiers in non-Unicode regul...

Michael Saboff

Reported 2020-04-09 16:09:42 PDT

During the March/April 2020 TC-39 meeting, it was agreed that named capture group identifiers can contain unicode escape characters even for non-unicode flagged regular expressions. This change is part of the EcmaScript 2020 draft standard that was approved by TC-39 at the same meeting and should be ratified by the Ecma General Assembly in June 2020. The current 2019 standard allows the following constructs for named capture group identifiers with non-BMP codepoints let regex1a = /(?<𝒜>A)/u; let regex1b = /(?<\u{1d49c}>A)/u; let regex1c = /(?<\ud835\udc9c>A)/u; let regex2a = /(?<𝒜>A)/; // no u flag But didn’t allow non-BMP unicode escapes in named capture group identifiers for non unicode regex’s let regex2b = /(?<\u{1d49c}>A)/; let regex2c = /(?<\ud835\udc9c>A)/; JavaScriptCore has a bug where it doesn’t even allow regex2a. This is to track fixing the JSC bug for regex2a, and adding the two other non-unicode forms to bring JSC into compliance with the 2020 standard.

Attachments
Patch (12.95 KB, patch) 2020-04-09 16:30 PDT, Michael Saboff	no flags	Details Formatted Diff Diff
Patch with updates from review (14.84 KB, patch) 2020-04-10 10:42 PDT, Michael Saboff	no flags	Details Formatted Diff Diff
Patch with suggested naming changes and additional named back reference testing (16.54 KB, patch) 2020-04-10 16:42 PDT, Michael Saboff	ross.kirsling: review+	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Michael Saboff

Comment 1 2020-04-09 16:10:01 PDT

<rdar://problem/61547384>

Michael Saboff

Comment 2 2020-04-09 16:30:40 PDT

Created attachment 396024 [details] Patch

Ross Kirsling

Comment 3 2020-04-09 21:52:47 PDT

Seems like a good informal review opportunity for Alexey. ;)

Alexey Shvayka

Comment 4 2020-04-10 04:18:09 PDT

(In reply to Ross Kirsling from comment #3) > Seems like a good informal review opportunity for Alexey. ;) Thank you, Ross, I appreciate it. (In reply to Michael Saboff from comment #0) > JavaScriptCore has a bug where it doesn’t even allow regex2a. It is also part of recent change: please see https://github.com/tc39/ecma262/pull/1869#issuecomment-607742063. V8 doesn’t allow regex2a as well, which is covered by JSTests/test262/test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js.

Alexey Shvayka

Comment 5 2020-04-10 04:36:46 PDT

Comment on attachment 396024 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=396024&action=review LGTM, a pure joy to informally review. > JSTests/ChangeLog:8 > + New test. We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change: - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js > JSTests/stress/regexp-named-capture-groups.js:16 > + shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name"); I really like error message check, and the whole test is crafted with love. Michael, would you object if I export this file to test262 with your & Apple Inc. credentials? > Source/JavaScriptCore/ChangeLog:8 > + Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups. Should we include spec PR link? https://github.com/tc39/ecma262/pull/1869 > Source/JavaScriptCore/yarr/YarrParser.h:543 > + if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) { Since we are changing this line, we can use `!atEndOfPattern()` equivalent instead of `(patternRemaining() > 0)`.

Alexey Shvayka

Comment 6 2020-04-10 05:49:34 PDT

Comment on attachment 396024 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=396024&action=review > Source/JavaScriptCore/yarr/YarrParser.h:537 > + template<UnicodeEscapeContext context> consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me. Surrogate pairs are consumed by many productions, not only character escapes & identifiers. `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`).

Michael Saboff

Comment 7 2020-04-10 10:28:26 PDT

(In reply to Alexey Shvayka from comment #5) > Comment on attachment 396024 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=396024&action=review > > LGTM, a pure joy to informally review. Thank you for taking a look. > > JSTests/ChangeLog:8 > > + New test. > > We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change: > - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js > - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js > - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js > - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js > - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js I updated JSTests/test262/expectations.yaml for the now failing tests. > > JSTests/stress/regexp-named-capture-groups.js:16 > > + shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name"); > > I really like error message check, and the whole test is crafted with love. > Michael, would you object if I export this file to test262 with your & Apple Inc. credentials? Fine by me. > > Source/JavaScriptCore/ChangeLog:8 > > + Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups. > > Should we include spec PR link? https://github.com/tc39/ecma262/pull/1869 > > > Source/JavaScriptCore/yarr/YarrParser.h:543 > > + if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) { > > Since we are changing this line, we can use `!atEndOfPattern()` equivalent > instead of `(patternRemaining() > 0)`. Done. > > Source/JavaScriptCore/yarr/YarrParser.h:537 > > + template<UnicodeEscapeContext context> > > > consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me. > Surrogate pairs are consumed by many productions, not only character escapes & identifiers. > `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`). In the two other places that consumePossibleSurrogatePair() is used, one is for individual characters and the other as part of a character class parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. The character class parsing uses characters as ends of a range or as individual to include in the class.

Michael Saboff

Comment 8 2020-04-10 10:42:46 PDT

Created attachment 396102 [details] Patch with updates from review

Alexey Shvayka

Comment 9 2020-04-10 10:55:43 PDT

(In reply to Michael Saboff from comment #7) > In the two other places that consumePossibleSurrogatePair() is used, one is > for individual characters and the other as part of a character class > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. I might be putting a little different meaning into UnicodeEscapeContext: for me, it reads as "parent production of Unicode escape we are currently parsing", making it a bit awkward to use in `default` cases of parseTokens() and parseCharacterClass() as they don't parse any escape.

Alexey Shvayka

Comment 10 2020-04-10 11:02:38 PDT

If I'd seen just the signature of consumePossibleSurrogatePair(): template<UnicodeEscapeContext context> UChar32 consumePossibleSurrogatePair() I would have assume that it parses \uLEAD\uTRAIL escapes since it has UnicodeEscapeContext as a parameter, and not literal astral pair.

Michael Saboff

Comment 11 2020-04-10 11:16:14 PDT

(In reply to Alexey Shvayka from comment #9) > (In reply to Michael Saboff from comment #7) > > In the two other places that consumePossibleSurrogatePair() is used, one is > > for individual characters and the other as part of a character class > > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. > > I might be putting a little different meaning into UnicodeEscapeContext: for > me, it reads as "parent production of Unicode escape we are currently > parsing", making it a bit awkward to use in `default` cases of parseTokens() > and parseCharacterClass() as they don't parse any escape. If we renamed consumePossibleSurrogatePair() to consumePossibleSurrogatePairOrCharacter() would that make things a little clearer?

Alexey Shvayka

Comment 12 2020-04-10 11:35:49 PDT

(In reply to Michael Saboff from comment #11) > If we renamed consumePossibleSurrogatePair() to > consumePossibleSurrogatePairOrCharacter() would that make things a little > clearer? Sorry, please disregard comment #10, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context. I am not a huge fan of boolean parameters (https://ariya.io/2011/08/hall-of-api-shame-boolean-trap) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those?

Michael Saboff

Comment 13 2020-04-10 16:34:24 PDT

(In reply to Alexey Shvayka from comment #12) > (In reply to Michael Saboff from comment #11) > > If we renamed consumePossibleSurrogatePair() to > > consumePossibleSurrogatePairOrCharacter() would that make things a little > > clearer? > > Sorry, please disregard comment #10, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context. > > I am not a huge fan of boolean parameters (https://ariya.io/2011/08/hall-of-api-shame-boolean-trap) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those? How about we rename the enum UnicodeEscapeContext to UnicodeParseContext with the two values PatternCodePoint and GroupName? Then we use UnicodeParseContext::PatternCodePoint and UnicodeParseContext::GroupName. Both of these values make sense in the current and new usage and are closer to the ECMAScript spec.

Michael Saboff

Comment 14 2020-04-10 16:42:53 PDT

Created attachment 396135 [details] Patch with suggested naming changes and additional named back reference testing

Alexey Shvayka

Comment 15 2020-04-11 09:53:28 PDT

(In reply to Michael Saboff from comment #14) > Created attachment 396135 [details] > Patch with suggested naming changes and additional named back reference > testing Looks perfect, thank you for taking time to address the feedback.

Michael Saboff

Comment 16 2020-04-13 12:19:50 PDT

Committed r260033: <https://trac.webkit.org/changeset/260033>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Michael Saboff

Reported

2020-04-09 16:09 PDT

Modified

2020-04-13 12:19 PDT History

CC List

8 users Show

URL

Keywords InRadar

Depends on

Blocks