WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
312690
Input Position Corruption in Yarr Backreference Backward Matching via `tryReadBackward` Surrogate Pair Rewind
https://bugs.webkit.org/show_bug.cgi?id=312690
Summary
Input Position Corruption in Yarr Backreference Backward Matching via `tryRea...
parkjuny
Reported
2026-04-18 14:49:02 PDT
## Summary A position corruption bug exists in the Yarr regex interpreter's `tryConsumeBackReference` function when processing greedy backreferences inside lookbehind assertions with supplementary (non-BMP) Unicode characters. The `tryReadBackward` function calls `rewind(1)` when it encounters a UTF-16 trail surrogate preceded by a lead surrogate, but when the backreference match subsequently fails, the Greedy quantifier path in `matchBackReference` does not restore the corrupted input position. This causes subsequent regex term matching to use a wrong position, producing incorrect match results that violate the ECMAScript specification. ## Bug ### Summary In `YarrInterpreter.cpp`, the function `tryConsumeBackReference()` does not restore the input position (`pos`) when returning `false` in the backward (lookbehind) matching direction. The function `tryReadBackward()` has a side effect: when it encounters a UTF-16 trail surrogate preceded by a lead surrogate (a supplementary character), it calls `rewind(1)` to decrement `pos` by 1 to account for the two-code-unit width of the surrogate pair. If the backreference comparison then fails, `tryConsumeBackReference` returns `false` without restoring `pos`, leaving it corrupted (decremented by 1). The Greedy case in `matchBackReference` unconditionally returns `true` after the while loop exits (when `tryConsumeBackReference` returns `false`), propagating the corrupted position to subsequent term matching. ### Detail The bug involves three interacting code paths: **1. `tryReadBackward()` — the position-modifying side effect:** ```cpp // Source/JavaScriptCore/yarr/YarrInterpreter.cpp:348-360 char32_t NODELETE tryReadBackward(unsigned negativePositionOffest) { if (pos < negativePositionOffest) return errorCodePoint; unsigned p = pos - negativePositionOffest; ASSERT(p < length); auto result = input[p]; if (U16_IS_TRAIL(result) && decodeSurrogatePairs && p > 0 && U16_IS_LEAD(input[p - 1])) { rewind(1); // <-- Modifies pos! Decrements by 1 return U16_GET_SUPPLEMENTARY(input[p - 1], result); } return result; } ``` When reading backward and encountering a trail surrogate preceded by a lead surrogate, `rewind(1)` decrements `pos` by 1 to account for the full surrogate pair. This is correct for successful matches (the extra code unit is tracked via the position decrement), but becomes a bug when the match fails and `pos` is not restored. **2. `tryConsumeBackReference()` — missing position restoration on backward failure:** ```cpp // Source/JavaScriptCore/yarr/YarrInterpreter.cpp:641-690 bool NODELETE tryConsumeBackReference(int matchBegin, int matchEnd, ByteTerm& term) { unsigned matchSize = (unsigned)(matchEnd - matchBegin); if (term.matchDirection() == Forward) { if (!input.checkInput(matchSize)) return false; } for (unsigned i = 0; i < matchSize; ++i) { unsigned negativeInputOffset = term.inputPosition + matchSize - i; if (term.matchDirection() == Backward && negativeInputOffset > input.getPos()) return false; // <-- pos not yet corrupted, but no restoration either char32_t oldCh = input.reread(matchBegin + i); char32_t ch; if (!U_IS_BMP(oldCh)) { ch = input.readSurrogatePairChecked(negativeInputOffset); ++i; } else ch = term.matchDirection() == Forward ? input.readCheckedDontAdvance(negativeInputOffset) : input.tryReadBackward(negativeInputOffset); // ^^^ tryReadBackward may call rewind(1), modifying pos if (oldCh == errorCodePoint || ch == errorCodePoint) return false; // <-- Returns false WITHOUT restoring pos for Backward direction if (oldCh == ch) continue; // ... case-insensitive checks ... if (term.matchDirection() == Forward) input.uncheckInput(matchSize); // Only restores for Forward! return false; // <-- Returns false WITHOUT restoring pos for Backward direction } if (term.matchDirection() == Backward) input.uncheckInput(matchSize); // Only called on SUCCESS return true; } ``` Lines 680–681 handle Forward direction position restoration on failure via `uncheckInput`. There is no equivalent restoration for the Backward direction. Lines 686–687 restore position for the Backward direction, but only on **success** (return true path). **3. `matchBackReference()` Greedy case — no position save/restore:** ```cpp // Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1049-1055 case QuantifierType::Greedy: { unsigned matchAmount = 0; while ((matchAmount < term.atom.quantityMaxCount) && tryConsumeBackReference(matchBegin, matchEnd, term)) ++matchAmount; backTrack->matchAmount = matchAmount; return true; // <-- Always returns true, even with corrupted pos } ``` When `tryConsumeBackReference` returns `false` (match failure), the while loop exits and the function unconditionally returns `true`. The position corruption from `tryReadBackward`'s `rewind(1)` persists, and the caller sees a position that is off by 1. Compare with FixedCount (lines 1039–1046), which properly restores position via `input.setPos(backTrack->begin)` on failure: ```cpp case QuantifierType::FixedCount: { for (unsigned matchAmount = 0; matchAmount < term.atom.quantityMaxCount; ++matchAmount) { if (!tryConsumeBackReference(matchBegin, matchEnd, term)) { input.setPos(backTrack->begin); // <-- Properly restores position return false; } } return true; } ``` Also note that at initialization time (lines 1005–1007), the Greedy case does not save `backTrack->begin`: ```cpp case QuantifierType::Greedy: backTrack->matchAmount = 0; break; // <-- Does NOT save backTrack->begin = input.getPos() ``` While FixedCount does save it (lines 1001–1003): ```cpp case QuantifierType::FixedCount: backTrack->begin = input.getPos(); // <-- Saves position for restoration break; ``` ### Trigger Conditions 1. A lookbehind assertion (`(?<=...)`) that forces backward matching direction (lookbehinds always use the Yarr interpreter, not JIT). 2. A greedy backreference quantifier (`\N*` or `\N+`) inside the lookbehind. 3. The backward read position lands on a trail surrogate of a supplementary (non-BMP) Unicode character. 4. The backreference match fails (the captured text does not match the supplementary character). 5. The Unicode flag (`u` or `v`) must be set for surrogate pair decoding. ## Version ### Reproduced Version - `main` branch latest commit (2026/04/18): `a4390137a4038c07661f58e45ee205b5a623f9dd` ## Reproduction Case ### Release Build ```bash WebKitBuild/JSCOnly/Release/bin/jsc /tmp/poc.js ``` Result: ``` null b,a b,a b,a ``` Line 1 (`null`) is the bug: `/(?<=\u{10000}\1*(a))b/u` should match at index 3. Lines 2–4 are the BMP-only, non-greedy, and fixed-count controls — all match correctly, isolating the bug to greedy backreference + non-BMP character. Debug build result identical to release. No ASSERT fires — the position corruption produces a silent wrong result; `rewind(1)` leaves `pos` at a valid in-bounds index so no internal sanity check triggers. ### V8 (Chrome) ``` b,a b,a b,a b,a ``` All four cases produce `b,a` — confirming the first case is a JSC-only bug. ### PoC Code ```js // Yarr: greedy backref in lookbehind with non-BMP char corrupts pos (tryConsumeBackReference) print(/(?<=\u{10000}\1*(a))b/u.exec("\u{10000}ab")); // null (bug; expected: ["b","a",index:3]) print(/(?<=Z\1*(a))b/u.exec("Zab")); // ["b","a",...] — BMP-only, works print(/(?<=\u{10000}\1*?(a))b/u.exec("\u{10000}ab")); // ["b","a",...] — non-greedy, works print(/(?<=\u{10000}\1{0}(a))b/u.exec("\u{10000}ab")); // ["b","a",...] — fixed-count, works ``` ## Suggested Patch ### File: `Source/JavaScriptCore/yarr/YarrInterpreter.cpp` The root cause is that `tryConsumeBackReference` does not save and restore `input.pos` on failure paths when `matchDirection() == Backward`. The fix saves `pos` at function entry and restores it on every early-return `false` path in the backward case. ```diff --- a/Source/JavaScriptCore/yarr/YarrInterpreter.cpp +++ b/Source/JavaScriptCore/yarr/YarrInterpreter.cpp @@ -641,50 +641,59 @@ public: bool NODELETE tryConsumeBackReference(int matchBegin, int matchEnd, ByteTerm& term) { unsigned matchSize = (unsigned)(matchEnd - matchBegin); if (term.matchDirection() == Forward) { if (!input.checkInput(matchSize)) return false; } + unsigned savedPos = input.getPos(); + for (unsigned i = 0; i < matchSize; ++i) { unsigned negativeInputOffset = term.inputPosition + matchSize - i; - if (term.matchDirection() == Backward && negativeInputOffset > input.getPos()) + if (term.matchDirection() == Backward && negativeInputOffset > input.getPos()) { + input.setPos(savedPos); return false; + } char32_t oldCh = input.reread(matchBegin + i); char32_t ch; if (!U_IS_BMP(oldCh)) { ch = input.readSurrogatePairChecked(negativeInputOffset); ++i; } else ch = term.matchDirection() == Forward ? input.readCheckedDontAdvance(negativeInputOffset) : input.tryReadBackward(negativeInputOffset); - if (oldCh == errorCodePoint || ch == errorCodePoint) + if (oldCh == errorCodePoint || ch == errorCodePoint) { + if (term.matchDirection() == Backward) + input.setPos(savedPos); return false; + } if (oldCh == ch) continue; if (term.ignoreCase()) { // See ES 6.0, 21.2.2.8.2 for the definition of Canonicalize(). For non-Unicode // patterns, Unicode values are never allowed to match against ASCII ones. // For Unicode, we need to check all canonical equivalents of a character. if (isLegacyCompilation() && (isASCII(oldCh) || isASCII(ch))) { if (toASCIIUpper(oldCh) == toASCIIUpper(ch)) continue; } else if (areCanonicallyEquivalent(oldCh, ch, isEitherUnicodeCompilation() ? CanonicalMode::Unicode : CanonicalMode::UCS2)) continue; } if (term.matchDirection() == Forward) input.uncheckInput(matchSize); + else + input.setPos(savedPos); return false; } if (term.matchDirection() == Backward) input.uncheckInput(matchSize); return true; } ``` #### Explanation The fix saves `input.pos` at the start of `tryConsumeBackReference` and restores it on all failure paths when `matchDirection() == Backward`. This ensures that the `rewind(1)` side effect from `tryReadBackward`'s surrogate pair handling does not leak out of a failed backreference match attempt. The fix addresses all three `return false` paths in the function: 1. When `negativeInputOffset > input.getPos()` (boundary check, defensive). 2. When `oldCh == errorCodePoint || ch == errorCodePoint` (read error — the critical path where `tryReadBackward` has already called `rewind(1)`). 3. When characters do not match and case-insensitive comparison also fails. This is consistent with how the Forward direction handles failure via `input.uncheckInput(matchSize)`, and how the FixedCount case in `matchBackReference` handles failure via `input.setPos(backTrack->begin)`. An alternative fix (Option B) would save/restore position in the Greedy case of `matchBackReference`, but fixing it at the source in `tryConsumeBackReference` is preferable as it also protects any future callers. ### Credit Information Reporter credit: Junyoung Park (@candymate) of KAIST Hacking Lab
Attachments
Add attachment
proposed patch, testcase, etc.
Radar WebKit Bug Importer
Comment 1
2026-04-19 13:22:10 PDT
<
rdar://problem/175122467
>
Kai Tamkun
Comment 2
2026-04-27 11:13:46 PDT
Pull request:
https://github.com/WebKit/WebKit/pull/63722
EWS
Comment 3
2026-05-11 14:06:43 PDT
Committed
313026@main
(f44dcfd3c922): <
https://commits.webkit.org/313026@main
> Reviewed commits have been landed. Closing PR #63722 and removing active labels.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug