Bug 202041 - [YARR] Properly handle surrogates when matching back references
Summary: [YARR] Properly handle surrogates when matching back references
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Michael Saboff
URL:
Keywords: InRadar
Depends on:
Blocks: 202042 202310
  Show dependency treegraph
 
Reported: 2019-09-20 02:30 PDT by Paulo Matos
Modified: 2019-10-01 11:24 PDT (History)
10 users (show)

See Also:


Attachments
WIP - Patch (4.02 KB, patch)
2019-09-25 07:35 PDT, Paulo Matos
msaboff: review-
ews-watchlist: commit-queue-
Details | Formatted Diff | Diff
Patch (8.65 KB, patch)
2019-09-25 17:24 PDT, Michael Saboff
no flags Details | Formatted Diff | Diff
Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. (8.90 KB, patch)
2019-09-25 17:33 PDT, Michael Saboff
keith_miller: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Paulo Matos 2019-09-20 02:30:10 PDT
Since the fix for https://bugs.webkit.org/show_bug.cgi?id=201647 (whose report I don't have permissions to read), that we noticed the test stress/regexp-unicode-surrogate-pair-increment-should-involve-length-check.js (added with the mentioned fix) started failing on 32bits ARM.

Turns out this is because we don't have YARR JIT enabled and it's not specific to 32bits. If I disable YARR JIT on 64bits, I can reproduce the same failure in other archs.

$ WebKitBuild/Debug/bin/jsc --useRegExpJIT=0 JSTests/stress/regexp-unicode-surrogate-pair-increment-should-involve-length-check.js
null
Exception: Expected /([^x]+)[^]*\1([^])/u to match "�𐀀" but it didn't
Comment 1 Paulo Matos 2019-09-20 02:31:22 PDT
Adding to CC msaboff and ysuzuki who were involved with https://bugs.webkit.org/show_bug.cgi?id=201647
Comment 2 Paulo Matos 2019-09-20 05:22:37 PDT
Related patch to add failed run to set of tests: https://bugs.webkit.org/show_bug.cgi?id=202042
Comment 3 Paulo Matos 2019-09-23 06:32:52 PDT
I am investigating this at the moment.
It seems interpretation of backreferences are broken:
>>> /([^x]+)\1/u.exec("\ud800\ud800\udc00")
null

when starting jsc with $ WebKitBuild/Debug/bin/jsc --useRegExpJIT=0
Comment 4 Paulo Matos 2019-09-24 01:50:53 PDT
The current understanding is slightly different than the initial one.

The RegExp JIT is broken, not the interpreter. All other engines seem to agree that:
/([^x]+)[^]*\1([^])/u.exec("\ud800\ud800\udc00")

should _not_ match.

There are three surrogates, one unpaired (the first \ud800) and two paired with each other (\ud800\udc00). The [^x]+ should match and capture the unpaired surrogate \ud800 and the backreference should not possibly capture part of the paired surrogate with \1. This should therefore fail.
Comment 5 Paulo Matos 2019-09-25 07:35:55 PDT
Created attachment 379547 [details]
WIP - Patch
Comment 6 EWS Watchlist 2019-09-25 09:48:12 PDT
Comment on attachment 379547 [details]
WIP - Patch

Attachment 379547 [details] did not pass jsc-ews (mac):
Output: https://webkit-queues.webkit.org/results/13067037

New failing tests:
jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-dfg-eager-no-cjit
jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-ftl
jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-ftl-no-cjit
jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-ftl-eager-no-cjit
jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-ftl-eager-no-cjit
jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-ftl
jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-llint
jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout
jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-cjit
jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-dfg-eager-no-cjit
jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-ftl-no-cjit
jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-cjit
jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout
jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-llint
Comment 7 Michael Saboff 2019-09-25 16:30:10 PDT
Comment on attachment 379547 [details]
WIP - Patch

r-
This patch is mostly right.  It needs 2 changes.
1) We need to properly increment the patternIndex and index when matching non-BMP characters.
2) There is a register clobbering issue on X86_64 since tryReadUnicodeCharImpl() uses r10 as a temp and patternChar is also r10.

Posting a patch shortly that builds on this patch and addresses these two issues.
Comment 8 Michael Saboff 2019-09-25 17:24:16 PDT
Created attachment 379595 [details]
Patch
Comment 9 Michael Saboff 2019-09-25 17:33:03 PDT
Created attachment 379597 [details]
Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution.
Comment 10 Paulo Matos 2019-09-26 08:19:37 PDT
Thanks Michael, I was still looking at this issue and just noticed you fixed it.

I had a little testsuite with the most relevant tests to run:
let match1 = /([^x])\1./u.exec('\ud800\ud800\udc00');

if (match1) {
    print('error 1');
}

let match2 = /(\u{10123})x\1/u.exec('\u{10123}x\u{10123}');

if (!match2 || match2[1] !== '\u{10123}') {
    print('error 2');
}

let match3 = /(\ud800\udd23)x\1/u.exec('\u{10123}x\u{10123}');

if (!match3 || match3[1] !== '\u{10123}') {
    print('error 3');
}

let match4 = /(\u{10123})x\ud800\udd23/u.exec('\u{10123}x\u{10123}');

if (!match4 || match4[1] !== '\u{10123}') {
    print('error 4');
}

let match5 = /(\ud800)\udd23x\1/u.exec('\u{10123}x\u{10123}');

if (match5) {
    print('error 5');
}

let match6 = /([^x])\ud800./u.exec('\ud800\ud800\udc00');

if (match6) {
    print('error 6');
}

print('DONE');


All bots pass and the above tests pass too so I think it's fixed.
I didn't recognize the problems you mentioned and debugging this for the first time turned out to be a long exercise in C++ and x86_64 assembly reading. Once again, thanks for the help on this.
Comment 11 Keith Miller 2019-10-01 10:48:35 PDT
Comment on attachment 379597 [details]
Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution.

r=me. I'll take your word on some of the character encodings because I don't know them very well.
Comment 12 Keith Miller 2019-10-01 10:49:27 PDT
Comment on attachment 379597 [details]
Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution.

Can you also add Paulos' test?
Comment 13 Michael Saboff 2019-10-01 11:01:43 PDT
(In reply to Keith Miller from comment #12)
> Comment on attachment 379597 [details]
> Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original
> contribution.
> 
> Can you also add Paulos' test?

His test is a collection of subtests that where failing with his patch.  They are already part of our existing tests.
Comment 14 Michael Saboff 2019-10-01 11:23:56 PDT
Committed r250568: <https://trac.webkit.org/changeset/250568>
Comment 15 Radar WebKit Bug Importer 2019-10-01 11:24:24 PDT
<rdar://problem/55883253>