RESOLVED FIXED312690
Input Position Corruption in Yarr Backreference Backward Matching via `tryReadBackward` Surrogate Pair Rewind
https://bugs.webkit.org/show_bug.cgi?id=312690
Summary Input Position Corruption in Yarr Backreference Backward Matching via `tryRea...
parkjuny
Reported 2026-04-18 14:49:02 PDT
## Summary A position corruption bug exists in the Yarr regex interpreter's `tryConsumeBackReference` function when processing greedy backreferences inside lookbehind assertions with supplementary (non-BMP) Unicode characters. The `tryReadBackward` function calls `rewind(1)` when it encounters a UTF-16 trail surrogate preceded by a lead surrogate, but when the backreference match subsequently fails, the Greedy quantifier path in `matchBackReference` does not restore the corrupted input position. This causes subsequent regex term matching to use a wrong position, producing incorrect match results that violate the ECMAScript specification. ## Bug ### Summary In `YarrInterpreter.cpp`, the function `tryConsumeBackReference()` does not restore the input position (`pos`) when returning `false` in the backward (lookbehind) matching direction. The function `tryReadBackward()` has a side effect: when it encounters a UTF-16 trail surrogate preceded by a lead surrogate (a supplementary character), it calls `rewind(1)` to decrement `pos` by 1 to account for the two-code-unit width of the surrogate pair. If the backreference comparison then fails, `tryConsumeBackReference` returns `false` without restoring `pos`, leaving it corrupted (decremented by 1). The Greedy case in `matchBackReference` unconditionally returns `true` after the while loop exits (when `tryConsumeBackReference` returns `false`), propagating the corrupted position to subsequent term matching. ### Detail The bug involves three interacting code paths: **1. `tryReadBackward()` — the position-modifying side effect:** ```cpp // Source/JavaScriptCore/yarr/YarrInterpreter.cpp:348-360 char32_t NODELETE tryReadBackward(unsigned negativePositionOffest) { if (pos < negativePositionOffest) return errorCodePoint; unsigned p = pos - negativePositionOffest; ASSERT(p < length); auto result = input[p]; if (U16_IS_TRAIL(result) && decodeSurrogatePairs && p > 0 && U16_IS_LEAD(input[p - 1])) { rewind(1); // <-- Modifies pos! Decrements by 1 return U16_GET_SUPPLEMENTARY(input[p - 1], result); } return result; } ``` When reading backward and encountering a trail surrogate preceded by a lead surrogate, `rewind(1)` decrements `pos` by 1 to account for the full surrogate pair. This is correct for successful matches (the extra code unit is tracked via the position decrement), but becomes a bug when the match fails and `pos` is not restored. **2. `tryConsumeBackReference()` — missing position restoration on backward failure:** ```cpp // Source/JavaScriptCore/yarr/YarrInterpreter.cpp:641-690 bool NODELETE tryConsumeBackReference(int matchBegin, int matchEnd, ByteTerm& term) { unsigned matchSize = (unsigned)(matchEnd - matchBegin); if (term.matchDirection() == Forward) { if (!input.checkInput(matchSize)) return false; } for (unsigned i = 0; i < matchSize; ++i) { unsigned negativeInputOffset = term.inputPosition + matchSize - i; if (term.matchDirection() == Backward && negativeInputOffset > input.getPos()) return false; // <-- pos not yet corrupted, but no restoration either char32_t oldCh = input.reread(matchBegin + i); char32_t ch; if (!U_IS_BMP(oldCh)) { ch = input.readSurrogatePairChecked(negativeInputOffset); ++i; } else ch = term.matchDirection() == Forward ? input.readCheckedDontAdvance(negativeInputOffset) : input.tryReadBackward(negativeInputOffset); // ^^^ tryReadBackward may call rewind(1), modifying pos if (oldCh == errorCodePoint || ch == errorCodePoint) return false; // <-- Returns false WITHOUT restoring pos for Backward direction if (oldCh == ch) continue; // ... case-insensitive checks ... if (term.matchDirection() == Forward) input.uncheckInput(matchSize); // Only restores for Forward! return false; // <-- Returns false WITHOUT restoring pos for Backward direction } if (term.matchDirection() == Backward) input.uncheckInput(matchSize); // Only called on SUCCESS return true; } ``` Lines 680–681 handle Forward direction position restoration on failure via `uncheckInput`. There is no equivalent restoration for the Backward direction. Lines 686–687 restore position for the Backward direction, but only on **success** (return true path). **3. `matchBackReference()` Greedy case — no position save/restore:** ```cpp // Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1049-1055 case QuantifierType::Greedy: { unsigned matchAmount = 0; while ((matchAmount < term.atom.quantityMaxCount) && tryConsumeBackReference(matchBegin, matchEnd, term)) ++matchAmount; backTrack->matchAmount = matchAmount; return true; // <-- Always returns true, even with corrupted pos } ``` When `tryConsumeBackReference` returns `false` (match failure), the while loop exits and the function unconditionally returns `true`. The position corruption from `tryReadBackward`'s `rewind(1)` persists, and the caller sees a position that is off by 1. Compare with FixedCount (lines 1039–1046), which properly restores position via `input.setPos(backTrack->begin)` on failure: ```cpp case QuantifierType::FixedCount: { for (unsigned matchAmount = 0; matchAmount < term.atom.quantityMaxCount; ++matchAmount) { if (!tryConsumeBackReference(matchBegin, matchEnd, term)) { input.setPos(backTrack->begin); // <-- Properly restores position return false; } } return true; } ``` Also note that at initialization time (lines 1005–1007), the Greedy case does not save `backTrack->begin`: ```cpp case QuantifierType::Greedy: backTrack->matchAmount = 0; break; // <-- Does NOT save backTrack->begin = input.getPos() ``` While FixedCount does save it (lines 1001–1003): ```cpp case QuantifierType::FixedCount: backTrack->begin = input.getPos(); // <-- Saves position for restoration break; ``` ### Trigger Conditions 1. A lookbehind assertion (`(?<=...)`) that forces backward matching direction (lookbehinds always use the Yarr interpreter, not JIT). 2. A greedy backreference quantifier (`\N*` or `\N+`) inside the lookbehind. 3. The backward read position lands on a trail surrogate of a supplementary (non-BMP) Unicode character. 4. The backreference match fails (the captured text does not match the supplementary character). 5. The Unicode flag (`u` or `v`) must be set for surrogate pair decoding. ## Version ### Reproduced Version - `main` branch latest commit (2026/04/18): `a4390137a4038c07661f58e45ee205b5a623f9dd` ## Reproduction Case ### Release Build ```bash WebKitBuild/JSCOnly/Release/bin/jsc /tmp/poc.js ``` Result: ``` null b,a b,a b,a ``` Line 1 (`null`) is the bug: `/(?<=\u{10000}\1*(a))b/u` should match at index 3. Lines 2–4 are the BMP-only, non-greedy, and fixed-count controls — all match correctly, isolating the bug to greedy backreference + non-BMP character. Debug build result identical to release. No ASSERT fires — the position corruption produces a silent wrong result; `rewind(1)` leaves `pos` at a valid in-bounds index so no internal sanity check triggers. ### V8 (Chrome) ``` b,a b,a b,a b,a ``` All four cases produce `b,a` — confirming the first case is a JSC-only bug. ### PoC Code ```js // Yarr: greedy backref in lookbehind with non-BMP char corrupts pos (tryConsumeBackReference) print(/(?<=\u{10000}\1*(a))b/u.exec("\u{10000}ab")); // null (bug; expected: ["b","a",index:3]) print(/(?<=Z\1*(a))b/u.exec("Zab")); // ["b","a",...] — BMP-only, works print(/(?<=\u{10000}\1*?(a))b/u.exec("\u{10000}ab")); // ["b","a",...] — non-greedy, works print(/(?<=\u{10000}\1{0}(a))b/u.exec("\u{10000}ab")); // ["b","a",...] — fixed-count, works ``` ## Suggested Patch ### File: `Source/JavaScriptCore/yarr/YarrInterpreter.cpp` The root cause is that `tryConsumeBackReference` does not save and restore `input.pos` on failure paths when `matchDirection() == Backward`. The fix saves `pos` at function entry and restores it on every early-return `false` path in the backward case. ```diff --- a/Source/JavaScriptCore/yarr/YarrInterpreter.cpp +++ b/Source/JavaScriptCore/yarr/YarrInterpreter.cpp @@ -641,50 +641,59 @@ public: bool NODELETE tryConsumeBackReference(int matchBegin, int matchEnd, ByteTerm& term) { unsigned matchSize = (unsigned)(matchEnd - matchBegin); if (term.matchDirection() == Forward) { if (!input.checkInput(matchSize)) return false; } + unsigned savedPos = input.getPos(); + for (unsigned i = 0; i < matchSize; ++i) { unsigned negativeInputOffset = term.inputPosition + matchSize - i; - if (term.matchDirection() == Backward && negativeInputOffset > input.getPos()) + if (term.matchDirection() == Backward && negativeInputOffset > input.getPos()) { + input.setPos(savedPos); return false; + } char32_t oldCh = input.reread(matchBegin + i); char32_t ch; if (!U_IS_BMP(oldCh)) { ch = input.readSurrogatePairChecked(negativeInputOffset); ++i; } else ch = term.matchDirection() == Forward ? input.readCheckedDontAdvance(negativeInputOffset) : input.tryReadBackward(negativeInputOffset); - if (oldCh == errorCodePoint || ch == errorCodePoint) + if (oldCh == errorCodePoint || ch == errorCodePoint) { + if (term.matchDirection() == Backward) + input.setPos(savedPos); return false; + } if (oldCh == ch) continue; if (term.ignoreCase()) { // See ES 6.0, 21.2.2.8.2 for the definition of Canonicalize(). For non-Unicode // patterns, Unicode values are never allowed to match against ASCII ones. // For Unicode, we need to check all canonical equivalents of a character. if (isLegacyCompilation() && (isASCII(oldCh) || isASCII(ch))) { if (toASCIIUpper(oldCh) == toASCIIUpper(ch)) continue; } else if (areCanonicallyEquivalent(oldCh, ch, isEitherUnicodeCompilation() ? CanonicalMode::Unicode : CanonicalMode::UCS2)) continue; } if (term.matchDirection() == Forward) input.uncheckInput(matchSize); + else + input.setPos(savedPos); return false; } if (term.matchDirection() == Backward) input.uncheckInput(matchSize); return true; } ``` #### Explanation The fix saves `input.pos` at the start of `tryConsumeBackReference` and restores it on all failure paths when `matchDirection() == Backward`. This ensures that the `rewind(1)` side effect from `tryReadBackward`'s surrogate pair handling does not leak out of a failed backreference match attempt. The fix addresses all three `return false` paths in the function: 1. When `negativeInputOffset > input.getPos()` (boundary check, defensive). 2. When `oldCh == errorCodePoint || ch == errorCodePoint` (read error — the critical path where `tryReadBackward` has already called `rewind(1)`). 3. When characters do not match and case-insensitive comparison also fails. This is consistent with how the Forward direction handles failure via `input.uncheckInput(matchSize)`, and how the FixedCount case in `matchBackReference` handles failure via `input.setPos(backTrack->begin)`. An alternative fix (Option B) would save/restore position in the Greedy case of `matchBackReference`, but fixing it at the source in `tryConsumeBackReference` is preferable as it also protects any future callers. ### Credit Information Reporter credit: Junyoung Park (@candymate) of KAIST Hacking Lab
Attachments
Radar WebKit Bug Importer
Comment 1 2026-04-19 13:22:10 PDT
Kai Tamkun
Comment 2 2026-04-27 11:13:46 PDT
EWS
Comment 3 2026-05-11 14:06:43 PDT
Committed 313026@main (f44dcfd3c922): <https://commits.webkit.org/313026@main> Reviewed commits have been landed. Closing PR #63722 and removing active labels.
Note You need to log in before you can comment on or make changes to this bug.