RESOLVED FIXED 210309
[YARR] Allow for Unicode named capture group identifiers in non-Unicode regular expressions
https://bugs.webkit.org/show_bug.cgi?id=210309
Summary [YARR] Allow for Unicode named capture group identifiers in non-Unicode regul...
Michael Saboff
Reported 2020-04-09 16:09:42 PDT
During the March/April 2020 TC-39 meeting, it was agreed that named capture group identifiers can contain unicode escape characters even for non-unicode flagged regular expressions. This change is part of the EcmaScript 2020 draft standard that was approved by TC-39 at the same meeting and should be ratified by the Ecma General Assembly in June 2020. The current 2019 standard allows the following constructs for named capture group identifiers with non-BMP codepoints let regex1a = /(?<𝒜>A)/u; let regex1b = /(?<\u{1d49c}>A)/u; let regex1c = /(?<\ud835\udc9c>A)/u; let regex2a = /(?<𝒜>A)/; // no u flag But didn’t allow non-BMP unicode escapes in named capture group identifiers for non unicode regex’s let regex2b = /(?<\u{1d49c}>A)/; let regex2c = /(?<\ud835\udc9c>A)/; JavaScriptCore has a bug where it doesn’t even allow regex2a. This is to track fixing the JSC bug for regex2a, and adding the two other non-unicode forms to bring JSC into compliance with the 2020 standard.
Attachments
Patch (12.95 KB, patch)
2020-04-09 16:30 PDT, Michael Saboff
no flags
Patch with updates from review (14.84 KB, patch)
2020-04-10 10:42 PDT, Michael Saboff
no flags
Patch with suggested naming changes and additional named back reference testing (16.54 KB, patch)
2020-04-10 16:42 PDT, Michael Saboff
ross.kirsling: review+
Michael Saboff
Comment 1 2020-04-09 16:10:01 PDT
Michael Saboff
Comment 2 2020-04-09 16:30:40 PDT
Ross Kirsling
Comment 3 2020-04-09 21:52:47 PDT
Seems like a good informal review opportunity for Alexey. ;)
Alexey Shvayka
Comment 4 2020-04-10 04:18:09 PDT
(In reply to Ross Kirsling from comment #3) > Seems like a good informal review opportunity for Alexey. ;) Thank you, Ross, I appreciate it. (In reply to Michael Saboff from comment #0) > JavaScriptCore has a bug where it doesn’t even allow regex2a. It is also part of recent change: please see https://github.com/tc39/ecma262/pull/1869#issuecomment-607742063. V8 doesn’t allow regex2a as well, which is covered by JSTests/test262/test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js.
Alexey Shvayka
Comment 5 2020-04-10 04:36:46 PDT
Comment on attachment 396024 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=396024&action=review LGTM, a pure joy to informally review. > JSTests/ChangeLog:8 > + New test. We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change: - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js > JSTests/stress/regexp-named-capture-groups.js:16 > + shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name"); I really like error message check, and the whole test is crafted with love. Michael, would you object if I export this file to test262 with your & Apple Inc. credentials? > Source/JavaScriptCore/ChangeLog:8 > + Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups. Should we include spec PR link? https://github.com/tc39/ecma262/pull/1869 > Source/JavaScriptCore/yarr/YarrParser.h:543 > + if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) { Since we are changing this line, we can use `!atEndOfPattern()` equivalent instead of `(patternRemaining() > 0)`.
Alexey Shvayka
Comment 6 2020-04-10 05:49:34 PDT
Comment on attachment 396024 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=396024&action=review > Source/JavaScriptCore/yarr/YarrParser.h:537 > + template<UnicodeEscapeContext context> consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me. Surrogate pairs are consumed by many productions, not only character escapes & identifiers. `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`).
Michael Saboff
Comment 7 2020-04-10 10:28:26 PDT
(In reply to Alexey Shvayka from comment #5) > Comment on attachment 396024 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=396024&action=review > > LGTM, a pure joy to informally review. Thank you for taking a look. > > JSTests/ChangeLog:8 > > + New test. > > We might want to skip failing & no longer correct tests as test262 wasn't updated for recent spec change: > - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier-2.js > - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-6.js > - test/language/literals/regexp/named-groups/invalid-non-id-start-groupspecifier-3.js > - test/language/literals/regexp/named-groups/invalid-u-escape-in-groupspecifier.js > - test/language/literals/regexp/named-groups/invalid-non-id-continue-groupspecifier.js I updated JSTests/test262/expectations.yaml for the now failing tests. > > JSTests/stress/regexp-named-capture-groups.js:16 > > + shouldBe(String(error), "SyntaxError: Invalid regular expression: invalid group specifier name"); > > I really like error message check, and the whole test is crafted with love. > Michael, would you object if I export this file to test262 with your & Apple Inc. credentials? Fine by me. > > Source/JavaScriptCore/ChangeLog:8 > > + Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups. > > Should we include spec PR link? https://github.com/tc39/ecma262/pull/1869 > > > Source/JavaScriptCore/yarr/YarrParser.h:543 > > + if (U16_IS_LEAD(ch) && unicodePatternOrIdentifierName && (patternRemaining() > 0)) { > > Since we are changing this line, we can use `!atEndOfPattern()` equivalent > instead of `(patternRemaining() > 0)`. Done. > > Source/JavaScriptCore/yarr/YarrParser.h:537 > > + template<UnicodeEscapeContext context> > > > consumePossibleSurrogatePair() having notion of UnicodeEscapeContext does't feel right to me. > Surrogate pairs are consumed by many productions, not only character escapes & identifiers. > `bool inIdentifierName` seems to be a better fit (like we have `bool inCharacterClass`). In the two other places that consumePossibleSurrogatePair() is used, one is for individual characters and the other as part of a character class parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. The character class parsing uses characters as ends of a range or as individual to include in the class.
Michael Saboff
Comment 8 2020-04-10 10:42:46 PDT
Created attachment 396102 [details] Patch with updates from review
Alexey Shvayka
Comment 9 2020-04-10 10:55:43 PDT
(In reply to Michael Saboff from comment #7) > In the two other places that consumePossibleSurrogatePair() is used, one is > for individual characters and the other as part of a character class > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. I might be putting a little different meaning into UnicodeEscapeContext: for me, it reads as "parent production of Unicode escape we are currently parsing", making it a bit awkward to use in `default` cases of parseTokens() and parseCharacterClass() as they don't parse any escape.
Alexey Shvayka
Comment 10 2020-04-10 11:02:38 PDT
If I'd seen just the signature of consumePossibleSurrogatePair(): template<UnicodeEscapeContext context> UChar32 consumePossibleSurrogatePair() I would have assume that it parses \uLEAD\uTRAIL escapes since it has UnicodeEscapeContext as a parameter, and not literal astral pair.
Michael Saboff
Comment 11 2020-04-10 11:16:14 PDT
(In reply to Alexey Shvayka from comment #9) > (In reply to Michael Saboff from comment #7) > > In the two other places that consumePossibleSurrogatePair() is used, one is > > for individual characters and the other as part of a character class > > parsing, it seems appropriate to use UnicodeEscapeContext::CharacterEscape. > > I might be putting a little different meaning into UnicodeEscapeContext: for > me, it reads as "parent production of Unicode escape we are currently > parsing", making it a bit awkward to use in `default` cases of parseTokens() > and parseCharacterClass() as they don't parse any escape. If we renamed consumePossibleSurrogatePair() to consumePossibleSurrogatePairOrCharacter() would that make things a little clearer?
Alexey Shvayka
Comment 12 2020-04-10 11:35:49 PDT
(In reply to Michael Saboff from comment #11) > If we renamed consumePossibleSurrogatePair() to > consumePossibleSurrogatePairOrCharacter() would that make things a little > clearer? Sorry, please disregard comment #10, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context. I am not a huge fan of boolean parameters (https://ariya.io/2011/08/hall-of-api-shame-boolean-trap) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those?
Michael Saboff
Comment 13 2020-04-10 16:34:24 PDT
(In reply to Alexey Shvayka from comment #12) > (In reply to Michael Saboff from comment #11) > > If we renamed consumePossibleSurrogatePair() to > > consumePossibleSurrogatePairOrCharacter() would that make things a little > > clearer? > > Sorry, please disregard comment #10, consumePossibleSurrogatePair() is a good name and it surely needs Unicode escape context. > > I am not a huge fan of boolean parameters (https://ariya.io/2011/08/hall-of-api-shame-boolean-trap) myself, yet passing UnicodeEscapeContext::CharacterEscape in 2 call sites that are irrelevant to \unicode escapes is a bit misleading IMO. What if we pass UnicodeEscapeContext::None from those? How about we rename the enum UnicodeEscapeContext to UnicodeParseContext with the two values PatternCodePoint and GroupName? Then we use UnicodeParseContext::PatternCodePoint and UnicodeParseContext::GroupName. Both of these values make sense in the current and new usage and are closer to the ECMAScript spec.
Michael Saboff
Comment 14 2020-04-10 16:42:53 PDT
Created attachment 396135 [details] Patch with suggested naming changes and additional named back reference testing
Alexey Shvayka
Comment 15 2020-04-11 09:53:28 PDT
(In reply to Michael Saboff from comment #14) > Created attachment 396135 [details] > Patch with suggested naming changes and additional named back reference > testing Looks perfect, thank you for taking time to address the feedback.
Michael Saboff
Comment 16 2020-04-13 12:19:50 PDT
Note You need to log in before you can comment on or make changes to this bug.