Since the fix for https://bugs.webkit.org/show_bug.cgi?id=201647 (whose report I don't have permissions to read), that we noticed the test stress/regexp-unicode-surrogate-pair-increment-should-involve-length-check.js (added with the mentioned fix) started failing on 32bits ARM. Turns out this is because we don't have YARR JIT enabled and it's not specific to 32bits. If I disable YARR JIT on 64bits, I can reproduce the same failure in other archs. $ WebKitBuild/Debug/bin/jsc --useRegExpJIT=0 JSTests/stress/regexp-unicode-surrogate-pair-increment-should-involve-length-check.js null Exception: Expected /([^x]+)[^]*\1([^])/u to match "�𐀀" but it didn't
Adding to CC msaboff and ysuzuki who were involved with https://bugs.webkit.org/show_bug.cgi?id=201647
Related patch to add failed run to set of tests: https://bugs.webkit.org/show_bug.cgi?id=202042
I am investigating this at the moment. It seems interpretation of backreferences are broken: >>> /([^x]+)\1/u.exec("\ud800\ud800\udc00") null when starting jsc with $ WebKitBuild/Debug/bin/jsc --useRegExpJIT=0
The current understanding is slightly different than the initial one. The RegExp JIT is broken, not the interpreter. All other engines seem to agree that: /([^x]+)[^]*\1([^])/u.exec("\ud800\ud800\udc00") should _not_ match. There are three surrogates, one unpaired (the first \ud800) and two paired with each other (\ud800\udc00). The [^x]+ should match and capture the unpaired surrogate \ud800 and the backreference should not possibly capture part of the paired surrogate with \1. This should therefore fail.
Created attachment 379547 [details] WIP - Patch
Comment on attachment 379547 [details] WIP - Patch Attachment 379547 [details] did not pass jsc-ews (mac): Output: https://webkit-queues.webkit.org/results/13067037 New failing tests: jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-dfg-eager-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-ftl jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-ftl-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-ftl-eager-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-ftl-eager-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-ftl jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-llint jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-dfg-eager-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-ftl-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-llint
Comment on attachment 379547 [details] WIP - Patch r- This patch is mostly right. It needs 2 changes. 1) We need to properly increment the patternIndex and index when matching non-BMP characters. 2) There is a register clobbering issue on X86_64 since tryReadUnicodeCharImpl() uses r10 as a temp and patternChar is also r10. Posting a patch shortly that builds on this patch and addresses these two issues.
Created attachment 379595 [details] Patch
Created attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution.
Thanks Michael, I was still looking at this issue and just noticed you fixed it. I had a little testsuite with the most relevant tests to run: let match1 = /([^x])\1./u.exec('\ud800\ud800\udc00'); if (match1) { print('error 1'); } let match2 = /(\u{10123})x\1/u.exec('\u{10123}x\u{10123}'); if (!match2 || match2[1] !== '\u{10123}') { print('error 2'); } let match3 = /(\ud800\udd23)x\1/u.exec('\u{10123}x\u{10123}'); if (!match3 || match3[1] !== '\u{10123}') { print('error 3'); } let match4 = /(\u{10123})x\ud800\udd23/u.exec('\u{10123}x\u{10123}'); if (!match4 || match4[1] !== '\u{10123}') { print('error 4'); } let match5 = /(\ud800)\udd23x\1/u.exec('\u{10123}x\u{10123}'); if (match5) { print('error 5'); } let match6 = /([^x])\ud800./u.exec('\ud800\ud800\udc00'); if (match6) { print('error 6'); } print('DONE'); All bots pass and the above tests pass too so I think it's fixed. I didn't recognize the problems you mentioned and debugging this for the first time turned out to be a long exercise in C++ and x86_64 assembly reading. Once again, thanks for the help on this.
Comment on attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. r=me. I'll take your word on some of the character encodings because I don't know them very well.
Comment on attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. Can you also add Paulos' test?
(In reply to Keith Miller from comment #12) > Comment on attachment 379597 [details] > Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original > contribution. > > Can you also add Paulos' test? His test is a collection of subtests that where failing with his patch. They are already part of our existing tests.
Committed r250568: <https://trac.webkit.org/changeset/250568>
<rdar://problem/55883253>