RESOLVED FIXED Bug 202041
[YARR] Properly handle surrogates when matching back references
https://bugs.webkit.org/show_bug.cgi?id=202041
Summary [YARR] Properly handle surrogates when matching back references
Paulo Matos
Reported 2019-09-20 02:30:10 PDT
Since the fix for https://bugs.webkit.org/show_bug.cgi?id=201647 (whose report I don't have permissions to read), that we noticed the test stress/regexp-unicode-surrogate-pair-increment-should-involve-length-check.js (added with the mentioned fix) started failing on 32bits ARM. Turns out this is because we don't have YARR JIT enabled and it's not specific to 32bits. If I disable YARR JIT on 64bits, I can reproduce the same failure in other archs. $ WebKitBuild/Debug/bin/jsc --useRegExpJIT=0 JSTests/stress/regexp-unicode-surrogate-pair-increment-should-involve-length-check.js null Exception: Expected /([^x]+)[^]*\1([^])/u to match "�𐀀" but it didn't
Attachments
WIP - Patch (4.02 KB, patch)
2019-09-25 07:35 PDT, Paulo Matos
msaboff: review-
ews-watchlist: commit-queue-
Patch (8.65 KB, patch)
2019-09-25 17:24 PDT, Michael Saboff
no flags
Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. (8.90 KB, patch)
2019-09-25 17:33 PDT, Michael Saboff
keith_miller: review+
Paulo Matos
Comment 1 2019-09-20 02:31:22 PDT
Adding to CC msaboff and ysuzuki who were involved with https://bugs.webkit.org/show_bug.cgi?id=201647
Paulo Matos
Comment 2 2019-09-20 05:22:37 PDT
Related patch to add failed run to set of tests: https://bugs.webkit.org/show_bug.cgi?id=202042
Paulo Matos
Comment 3 2019-09-23 06:32:52 PDT
I am investigating this at the moment. It seems interpretation of backreferences are broken: >>> /([^x]+)\1/u.exec("\ud800\ud800\udc00") null when starting jsc with $ WebKitBuild/Debug/bin/jsc --useRegExpJIT=0
Paulo Matos
Comment 4 2019-09-24 01:50:53 PDT
The current understanding is slightly different than the initial one. The RegExp JIT is broken, not the interpreter. All other engines seem to agree that: /([^x]+)[^]*\1([^])/u.exec("\ud800\ud800\udc00") should _not_ match. There are three surrogates, one unpaired (the first \ud800) and two paired with each other (\ud800\udc00). The [^x]+ should match and capture the unpaired surrogate \ud800 and the backreference should not possibly capture part of the paired surrogate with \1. This should therefore fail.
Paulo Matos
Comment 5 2019-09-25 07:35:55 PDT
Created attachment 379547 [details] WIP - Patch
EWS Watchlist
Comment 6 2019-09-25 09:48:12 PDT
Comment on attachment 379547 [details] WIP - Patch Attachment 379547 [details] did not pass jsc-ews (mac): Output: https://webkit-queues.webkit.org/results/13067037 New failing tests: jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-dfg-eager-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-ftl jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-ftl-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-ftl-eager-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-ftl-eager-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-ftl jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-llint jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-dfg-eager-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-ftl-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-llint
Michael Saboff
Comment 7 2019-09-25 16:30:10 PDT
Comment on attachment 379547 [details] WIP - Patch r- This patch is mostly right. It needs 2 changes. 1) We need to properly increment the patternIndex and index when matching non-BMP characters. 2) There is a register clobbering issue on X86_64 since tryReadUnicodeCharImpl() uses r10 as a temp and patternChar is also r10. Posting a patch shortly that builds on this patch and addresses these two issues.
Michael Saboff
Comment 8 2019-09-25 17:24:16 PDT
Michael Saboff
Comment 9 2019-09-25 17:33:03 PDT
Created attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution.
Paulo Matos
Comment 10 2019-09-26 08:19:37 PDT
Thanks Michael, I was still looking at this issue and just noticed you fixed it. I had a little testsuite with the most relevant tests to run: let match1 = /([^x])\1./u.exec('\ud800\ud800\udc00'); if (match1) { print('error 1'); } let match2 = /(\u{10123})x\1/u.exec('\u{10123}x\u{10123}'); if (!match2 || match2[1] !== '\u{10123}') { print('error 2'); } let match3 = /(\ud800\udd23)x\1/u.exec('\u{10123}x\u{10123}'); if (!match3 || match3[1] !== '\u{10123}') { print('error 3'); } let match4 = /(\u{10123})x\ud800\udd23/u.exec('\u{10123}x\u{10123}'); if (!match4 || match4[1] !== '\u{10123}') { print('error 4'); } let match5 = /(\ud800)\udd23x\1/u.exec('\u{10123}x\u{10123}'); if (match5) { print('error 5'); } let match6 = /([^x])\ud800./u.exec('\ud800\ud800\udc00'); if (match6) { print('error 6'); } print('DONE'); All bots pass and the above tests pass too so I think it's fixed. I didn't recognize the problems you mentioned and debugging this for the first time turned out to be a long exercise in C++ and x86_64 assembly reading. Once again, thanks for the help on this.
Keith Miller
Comment 11 2019-10-01 10:48:35 PDT
Comment on attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. r=me. I'll take your word on some of the character encodings because I don't know them very well.
Keith Miller
Comment 12 2019-10-01 10:49:27 PDT
Comment on attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. Can you also add Paulos' test?
Michael Saboff
Comment 13 2019-10-01 11:01:43 PDT
(In reply to Keith Miller from comment #12) > Comment on attachment 379597 [details] > Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original > contribution. > > Can you also add Paulos' test? His test is a collection of subtests that where failing with his patch. They are already part of our existing tests.
Michael Saboff
Comment 14 2019-10-01 11:23:56 PDT
Radar WebKit Bug Importer
Comment 15 2019-10-01 11:24:24 PDT
Note You need to log in before you can comment on or make changes to this bug.