Bug 204490 - [JSC] GetSubstitution is performed incorrectly via RegExp.prototype[@@replace]
Summary: [JSC] GetSubstitution is performed incorrectly via RegExp.prototype[@@replace]
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Ross Kirsling
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2019-11-21 22:19 PST by Ross Kirsling
Modified: 2019-11-23 15:24 PST (History)
10 users (show)

See Also:


Attachments
Patch (5.17 KB, patch)
2019-11-21 22:35 PST, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch (8.39 KB, patch)
2019-11-22 11:31 PST, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch for landing (8.82 KB, patch)
2019-11-23 14:37 PST, Ross Kirsling
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ross Kirsling 2019-11-21 22:19:59 PST
[JSC] GetSubstitution is performed incorrectly via RegExp.prototype[@@replace]
Comment 1 Ross Kirsling 2019-11-21 22:35:40 PST
Created attachment 384124 [details]
Patch
Comment 2 Mark Lam 2019-11-22 00:50:56 PST
Comment on attachment 384124 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384124&action=review

> Source/JavaScriptCore/builtins/RegExpPrototype.js:189
> +        let m = captures.length - 1;

This is wrong.  The spec says "Let m be the number of elements in captures".  Otherwise, the tests below that checks against m (relying on m being the number of elements, not the last index) would be wrong.

> Source/JavaScriptCore/builtins/RegExpPrototype.js:251
>                          if (captures[n] != @undefined)
>                              result = result + captures[n];

Instead of decrementing m by 1 above, this code needs to be fixed as follows:
    let capture = captures[n - 1];
    if (capture != @undefined)
        result = result + capture;

Since this is JS code, it's also beneficial to not read from the array twice.

Can you add some test cases to confirm if I'm right or now about this?  I suspect test262 does not have sufficient coverage.  Thanks.
Comment 3 Ross Kirsling 2019-11-22 08:40:06 PST
(In reply to Mark Lam from comment #2)
> Comment on attachment 384124 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=384124&action=review
> 
> > Source/JavaScriptCore/builtins/RegExpPrototype.js:189
> > +        let m = captures.length - 1;
> 
> This is wrong.  The spec says "Let m be the number of elements in captures".
> Otherwise, the tests below that checks against m (relying on m being the
> number of elements, not the last index) would be wrong.

Apologies for my lack of explanation.

`m` is meant to be the number of captures, but `captures` is including the full match as captures[0]. I wanted to change the name of `m` to numSubpatterns or numberOfCaptures (which each have precedent in our code), but alas the name is the spec is just `m`.

The existing behavior is what's wrong here -- the n > m comparison is *always* off by one, but since we're guarding for n == 0, the effect is simply an unnecessary lookup that always returns undefined and hence doesn't produce an observable difference.

> Since this is JS code, it's also beneficial to not read from the array twice.

Sure. Still learning the best practices for our built-ins JS here. :)

> Can you add some test cases to confirm if I'm right or now about this?  I
> suspect test262 does not have sufficient coverage.  Thanks.

Yep, can do.
Comment 4 Mark Lam 2019-11-22 09:23:03 PST
Comment on attachment 384124 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384124&action=review

>>> Source/JavaScriptCore/builtins/RegExpPrototype.js:189
>>> +        let m = captures.length - 1;
>> 
>> This is wrong.  The spec says "Let m be the number of elements in captures".  Otherwise, the tests below that checks against m (relying on m being the number of elements, not the last index) would be wrong.
> 
> Apologies for my lack of explanation.
> 
> `m` is meant to be the number of captures, but `captures` is including the full match as captures[0]. I wanted to change the name of `m` to numSubpatterns or numberOfCaptures (which each have precedent in our code), but alas the name is the spec is just `m`.
> 
> The existing behavior is what's wrong here -- the n > m comparison is *always* off by one, but since we're guarding for n == 0, the effect is simply an unnecessary lookup that always returns undefined and hence doesn't produce an observable difference.

That makes sense.  Let's add a comment here to explain why we should use captures.length - 1.  Please also give a bit more detail to explain the bug(s) and how you're fixing them in the ChangeLog.  Either I'm dense, or others may also fail to grasp what the issue is.

Some concise JSC test cases would also help ensure that this doesn't regress.  Thanks.

>> Source/JavaScriptCore/builtins/RegExpPrototype.js:251
>>                              result = result + captures[n];
> 
> Instead of decrementing m by 1 above, this code needs to be fixed as follows:
>     let capture = captures[n - 1];
>     if (capture != @undefined)
>         result = result + capture;
> 
> Since this is JS code, it's also beneficial to not read from the array twice.
> 
> Can you add some test cases to confirm if I'm right or now about this?  I suspect test262 does not have sufficient coverage.  Thanks.

Please also add a comment here to indicate that captures[0] contains the full match.  The real captures start at index 1.  Hence, we can use n without subtracting 1 from it first.
Comment 5 Ross Kirsling 2019-11-22 10:47:04 PST
Rereading the part of the spec prior to the GetSubstitution call (https://tc39.es/ecma262/#sec-regexp.prototype-@@replace), it seems that we're really populating `captures` incorrectly. I'll fix this at the callsite instead and that should remove the confusion about `m`.
Comment 6 Ross Kirsling 2019-11-22 11:31:33 PST
Created attachment 384180 [details]
Patch
Comment 7 Mark Lam 2019-11-22 21:27:25 PST
Comment on attachment 384180 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384180&action=review

Nice tests.  r=me with fixes.

> Source/JavaScriptCore/builtins/RegExpPrototype.js:245
> +                        if (n == 0) {

Ahh, this is the bug here.  This check should be "if (n == 0 || n > m) {".  I only saw this bug because your test cases below made it obvious.  So, the test cases have been proven useful already.

> JSTests/stress/replace-indexed-backreferences.js:13
> +shouldBe(string.replace(search, '$10$21$32$43'), 'abc0def1$32$43');

This looks wrong.  According to the spec, this is looking for $10, $21, $32, and $43.  The spec for the $nn case says "If nn is 00 or nn > m, no replacement is done."   Hence, the result should be '$10$21$32$43'.

This tells me that we have a bug in the code.  Going back above to look for the bug ...

> JSTests/stress/replace-indexed-backreferences.js:19
> +shouldBe(search[Symbol.replace](string, '$10$21$32$43'), 'abc0def1$32$43');

Ditto.
Comment 8 Ross Kirsling 2019-11-22 23:32:31 PST
(In reply to Mark Lam from comment #7)
> > JSTests/stress/replace-indexed-backreferences.js:13
> > +shouldBe(string.replace(search, '$10$21$32$43'), 'abc0def1$32$43');
> 
> This looks wrong.  According to the spec, this is looking for $10, $21, $32,
> and $43.  The spec for the $nn case says "If nn is 00 or nn > m, no
> replacement is done."   Hence, the result should be '$10$21$32$43'.

After a bit investigation, it turns out this is actually a spec bug in the midst of being corrected!

What I knew for certain is that this behavior (of treating $10 as $1 followed by a 0 when we have 1 <= n < 10 captures) is consistent across all engines, and our other version of this algorithm calls out exactly what it's doing in a comment:
https://github.com/WebKit/webkit/blob/master/Source/JavaScriptCore/runtime/StringPrototype.cpp#L264-L265

So it's definitely web reality, but as you pointed out, the spec is not spelling things out properly at all. Just as I was about to file a GitHub issue about this though, I found that a PR to fix it has already been created as of last month:
https://github.com/tc39/ecma262/pull/1732

How convenient!
Comment 9 Mark Lam 2019-11-23 13:51:18 PST
Comment on attachment 384180 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384180&action=review

> Source/JavaScriptCore/ChangeLog:9
> +        of $-backreferences (called GetSubstitution in the spec: https://tc39.es/ecma262/#sec-getsubstitution).

Given the new info you've provided in https://bugs.webkit.org/show_bug.cgi?id=204490#c8, I think this patch is ready to land.  But please add a short blurb here stating that:
1. The implementation diverges from what's stated in the spec with the treatment of nn > m.
2. The difference in behavior is compliant with the https://github.com/tc39/ecma262/pull/1732.
3. Other browsers (name them if you know) also behave this way.
Comment 10 Ross Kirsling 2019-11-23 14:37:03 PST
Created attachment 384244 [details]
Patch for landing
Comment 11 Ross Kirsling 2019-11-23 14:37:22 PST
Thanks for the thorough review!
Comment 12 WebKit Commit Bot 2019-11-23 15:23:38 PST
Comment on attachment 384244 [details]
Patch for landing

Clearing flags on attachment: 384244

Committed r252836: <https://trac.webkit.org/changeset/252836>
Comment 13 WebKit Commit Bot 2019-11-23 15:23:40 PST
All reviewed patches have been landed.  Closing bug.
Comment 14 Radar WebKit Bug Importer 2019-11-23 15:24:18 PST
<rdar://problem/57455247>