RESOLVED FIXED 290567
RegExp Unicode JIT treats escaped surrogate followed by literal surrogate as surrogate pair
https://bugs.webkit.org/show_bug.cgi?id=290567
Ben Grant
Reported 2025-03-27 14:36:02 PDT
To reproduce, evaluate either of the following examples:

> new RegExp("\\ud800\udc00+", "u").exec("\u{10000}\u{10000}")
> new RegExp("\\uD83D\uDC38", "u").exec("\u{1F438}")

(the first is from https://github.com/oven-sh/bun/issues/18540, and the second is from https://github.com/tc39/test262/blob/ce7e72d2107f99d165f4259571f10aa75753d997/test/staging/sm/RegExp/unicode-raw.js#L56)

These should both return null. The reason is that, in Unicode mode, \u-escaped surrogates followed by literal surrogates should not form a pair. So these regular expressions are trying to match an unpaired high surrogate followed by an unpaired low surrogate, which is impossible as those code units would form a pair.

But in JavaScriptCore by default, these code samples do match the first codepoint of the input string:

> >>> new RegExp("\\ud800\udc00+", "u").exec("\u{10000}\u{10000}")
> [𐀀]
> >>> new RegExp("\\uD83D\uDC38", "u").exec("\u{1F438}")
> [🐸]

I'm using a local build from 292785@main. The correct behavior is observed in SpiderMonkey and V8, and in JavaScriptCore with --useRegExpJIT=0.

This was originally reported to Deno at https://github.com/denoland/deno/issues/28587, but the Deno team believes (and I agree given the test262 coverage) that V8 has the more correct behavior here.
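To make the distinction concrete, here is a minimal sketch that contrasts the three ways the same two code units can be spelled in a Unicode-mode pattern. The variable names are illustrative only, and the expected results in the comments assume an engine with the behavior described above as correct (V8, SpiderMonkey, or JavaScriptCore with --useRegExpJIT=0).

```javascript
const input = "\u{10000}\u{10000}";

// Pattern source: the escape \ud800 followed by a literal U+DC00 code unit.
// In Unicode mode these stay two separate lone surrogates, so nothing in a
// well-formed input can match them.
const escapedThenLiteral = new RegExp("\\ud800\udc00+", "u");

// Pattern source: the escapes \ud800\udc00. Two adjacent \u escapes do form
// a surrogate pair in Unicode mode, so this is U+10000 followed by "+".
const bothEscaped = new RegExp("\\ud800\\udc00+", "u");

// Pattern source: two literal code units, which are read as the single
// code point U+10000 followed by "+".
const bothLiteral = new RegExp("\ud800\udc00+", "u");

const results = {
  escapedThenLiteral: escapedThenLiteral.exec(input), // expected: null
  bothEscaped: bothEscaped.exec(input),               // expected: match of the whole input
  bothLiteral: bothLiteral.exec(input),               // expected: match of the whole input
};

for (const [name, match] of Object.entries(results)) {
  console.log(name, match === null ? "null" : `matched ${match[0].length} code units`);
}
```

On an affected JavaScriptCore build with the RegExp JIT enabled, the first entry is the one that would differ, since it reports a match instead of null.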
Radar WebKit Bug Importer
Comment 1 2025-04-03 14:36:22 PDT
Michael Saboff
Comment 2 2025-04-23 17:42:23 PDT
EWS
Comment 3 2025-04-24 09:14:30 PDT
Committed 294066@main (ab6288d351f1): <https://commits.webkit.org/294066@main> Reviewed commits have been landed. Closing PR #44450 and removing active labels.