202041 – [YARR] Properly handle surrogates when matching back references

RESOLVED FIXED 202041

[YARR] Properly handle surrogates when matching back references

https://bugs.webkit.org/show_bug.cgi?id=202041

Summary [YARR] Properly handle surrogates when matching back references

Paulo Matos

Reported 2019-09-20 02:30:10 PDT

Since the fix for https://bugs.webkit.org/show_bug.cgi?id=201647 (whose report I don't have permissions to read), that we noticed the test stress/regexp-unicode-surrogate-pair-increment-should-involve-length-check.js (added with the mentioned fix) started failing on 32bits ARM. Turns out this is because we don't have YARR JIT enabled and it's not specific to 32bits. If I disable YARR JIT on 64bits, I can reproduce the same failure in other archs. $ WebKitBuild/Debug/bin/jsc --useRegExpJIT=0 JSTests/stress/regexp-unicode-surrogate-pair-increment-should-involve-length-check.js null Exception: Expected /([^x]+)[^]*\1([^])/u to match "�𐀀" but it didn't

Attachments
WIP - Patch (4.02 KB, patch) 2019-09-25 07:35 PDT, Paulo Matos	msaboff: review- ews-watchlist: commit-queue-	Details Formatted Diff Diff
Patch (8.65 KB, patch) 2019-09-25 17:24 PDT, Michael Saboff	no flags	Details Formatted Diff Diff
Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. (8.90 KB, patch) 2019-09-25 17:33 PDT, Michael Saboff	keith_miller: review+	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Paulo Matos

Comment 1 2019-09-20 02:31:22 PDT

Adding to CC msaboff and ysuzuki who were involved with https://bugs.webkit.org/show_bug.cgi?id=201647

Paulo Matos

Comment 2 2019-09-20 05:22:37 PDT

Related patch to add failed run to set of tests: https://bugs.webkit.org/show_bug.cgi?id=202042

Paulo Matos

Comment 3 2019-09-23 06:32:52 PDT

I am investigating this at the moment. It seems interpretation of backreferences are broken: >>> /([^x]+)\1/u.exec("\ud800\ud800\udc00") null when starting jsc with $ WebKitBuild/Debug/bin/jsc --useRegExpJIT=0

Paulo Matos

Comment 4 2019-09-24 01:50:53 PDT

The current understanding is slightly different than the initial one. The RegExp JIT is broken, not the interpreter. All other engines seem to agree that: /([^x]+)[^]*\1([^])/u.exec("\ud800\ud800\udc00") should _not_ match. There are three surrogates, one unpaired (the first \ud800) and two paired with each other (\ud800\udc00). The [^x]+ should match and capture the unpaired surrogate \ud800 and the backreference should not possibly capture part of the paired surrogate with \1. This should therefore fail.

Paulo Matos

Comment 5 2019-09-25 07:35:55 PDT

Created attachment 379547 [details] WIP - Patch

EWS Watchlist

Comment 6 2019-09-25 09:48:12 PDT

Comment on attachment 379547 [details] WIP - Patch Attachment 379547 [details] did not pass jsc-ews (mac): Output: https://webkit-queues.webkit.org/results/13067037 New failing tests: jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-dfg-eager-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-ftl jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-ftl-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-ftl-eager-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-ftl-eager-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-ftl jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-llint jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-dfg-eager-no-cjit jsc-layout-tests.yaml/fast/regex/script-tests/backreferences.js.layout-ftl-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-cjit jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout jsc-layout-tests.yaml/js/script-tests/regexp-unicode.js.layout-no-llint

Michael Saboff

Comment 7 2019-09-25 16:30:10 PDT

Comment on attachment 379547 [details] WIP - Patch r- This patch is mostly right. It needs 2 changes. 1) We need to properly increment the patternIndex and index when matching non-BMP characters. 2) There is a register clobbering issue on X86_64 since tryReadUnicodeCharImpl() uses r10 as a temp and patternChar is also r10. Posting a patch shortly that builds on this patch and addresses these two issues.

Michael Saboff

Comment 8 2019-09-25 17:24:16 PDT

Created attachment 379595 [details] Patch

Michael Saboff

Comment 9 2019-09-25 17:33:03 PDT

Created attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution.

Paulo Matos

Comment 10 2019-09-26 08:19:37 PDT

Thanks Michael, I was still looking at this issue and just noticed you fixed it. I had a little testsuite with the most relevant tests to run: let match1 = /([^x])\1./u.exec('\ud800\ud800\udc00'); if (match1) { print('error 1'); } let match2 = /(\u{10123})x\1/u.exec('\u{10123}x\u{10123}'); if (!match2 || match2[1] !== '\u{10123}') { print('error 2'); } let match3 = /(\ud800\udd23)x\1/u.exec('\u{10123}x\u{10123}'); if (!match3 || match3[1] !== '\u{10123}') { print('error 3'); } let match4 = /(\u{10123})x\ud800\udd23/u.exec('\u{10123}x\u{10123}'); if (!match4 || match4[1] !== '\u{10123}') { print('error 4'); } let match5 = /(\ud800)\udd23x\1/u.exec('\u{10123}x\u{10123}'); if (match5) { print('error 5'); } let match6 = /([^x])\ud800./u.exec('\ud800\ud800\udc00'); if (match6) { print('error 6'); } print('DONE'); All bots pass and the above tests pass too so I think it's fixed. I didn't recognize the problems you mentioned and debugging this for the first time turned out to be a long exercise in C++ and x86_64 assembly reading. Once again, thanks for the help on this.

Keith Miller

Comment 11 2019-10-01 10:48:35 PDT

Comment on attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. r=me. I'll take your word on some of the character encodings because I don't know them very well.

Keith Miller

Comment 12 2019-10-01 10:49:27 PDT

Comment on attachment 379597 [details] Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original contribution. Can you also add Paulos' test?

Michael Saboff

Comment 13 2019-10-01 11:01:43 PDT

(In reply to Keith Miller from comment #12) > Comment on attachment 379597 [details] > Updated ChangeLogs citing Paulo Matos <pmatos@igalia.com> original > contribution. > > Can you also add Paulos' test? His test is a collection of subtests that where failing with his patch. They are already part of our existing tests.

Michael Saboff

Comment 14 2019-10-01 11:23:56 PDT

Committed r250568: <https://trac.webkit.org/changeset/250568>

Radar WebKit Bug Importer

Comment 15 2019-10-01 11:24:24 PDT

<rdar://problem/55883253>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Michael Saboff

Reported

2019-09-20 02:30 PDT

Modified

2019-10-01 11:24 PDT History

CC List

10 users Show

URL

Keywords InRadar

Depends on

Blocks

202042 202310

Dependencies

tree graph