Bug 312690
| Summary: | Input Position Corruption in Yarr Backreference Backward Matching via `tryReadBackward` Surrogate Pair Rewind | ||
|---|---|---|---|
| Product: | WebKit | Reporter: | parkjuny |
| Component: | JavaScriptCore | Assignee: | Nobody <webkit-unassigned> |
| Status: | RESOLVED FIXED | ||
| Severity: | Minor | CC: | karlcow, webkit-bug-importer |
| Priority: | P2 | Keywords: | InRadar |
| Version: | WebKit Nightly Build | ||
| Hardware: | PC | ||
| OS: | Linux | ||
parkjuny
## Summary
A position corruption bug exists in the Yarr regex interpreter's `tryConsumeBackReference` function when processing greedy backreferences inside lookbehind assertions with supplementary (non-BMP) Unicode characters. The `tryReadBackward` function calls `rewind(1)` when it encounters a UTF-16 trail surrogate preceded by a lead surrogate, but when the backreference match subsequently fails, the Greedy quantifier path in `matchBackReference` does not restore the corrupted input position. This causes subsequent regex term matching to use a wrong position, producing incorrect match results that violate the ECMAScript specification.
## Bug
### Summary
In `YarrInterpreter.cpp`, the function `tryConsumeBackReference()` does not restore the input position (`pos`) when returning `false` in the backward (lookbehind) matching direction. The function `tryReadBackward()` has a side effect: when it encounters a UTF-16 trail surrogate preceded by a lead surrogate (a supplementary character), it calls `rewind(1)` to decrement `pos` by 1 to account for the two-code-unit width of the surrogate pair. If the backreference comparison then fails, `tryConsumeBackReference` returns `false` without restoring `pos`, leaving it corrupted (decremented by 1). The Greedy case in `matchBackReference` unconditionally returns `true` after the while loop exits (when `tryConsumeBackReference` returns `false`), propagating the corrupted position to subsequent term matching.
### Detail
The bug involves three interacting code paths:
**1. `tryReadBackward()` — the position-modifying side effect:**
```cpp
// Source/JavaScriptCore/yarr/YarrInterpreter.cpp:348-360
char32_t NODELETE tryReadBackward(unsigned negativePositionOffest)
{
if (pos < negativePositionOffest)
return errorCodePoint;
unsigned p = pos - negativePositionOffest;
ASSERT(p < length);
auto result = input[p];
if (U16_IS_TRAIL(result) && decodeSurrogatePairs && p > 0 && U16_IS_LEAD(input[p - 1])) {
rewind(1); // <-- Modifies pos! Decrements by 1
return U16_GET_SUPPLEMENTARY(input[p - 1], result);
}
return result;
}
```
When reading backward and encountering a trail surrogate preceded by a lead surrogate, `rewind(1)` decrements `pos` by 1 to account for the full surrogate pair. This is correct for successful matches (the extra code unit is tracked via the position decrement), but becomes a bug when the match fails and `pos` is not restored.
**2. `tryConsumeBackReference()` — missing position restoration on backward failure:**
```cpp
// Source/JavaScriptCore/yarr/YarrInterpreter.cpp:641-690
bool NODELETE tryConsumeBackReference(int matchBegin, int matchEnd, ByteTerm& term)
{
unsigned matchSize = (unsigned)(matchEnd - matchBegin);
if (term.matchDirection() == Forward) {
if (!input.checkInput(matchSize))
return false;
}
for (unsigned i = 0; i < matchSize; ++i) {
unsigned negativeInputOffset = term.inputPosition + matchSize - i;
if (term.matchDirection() == Backward && negativeInputOffset > input.getPos())
return false; // <-- pos not yet corrupted, but no restoration either
char32_t oldCh = input.reread(matchBegin + i);
char32_t ch;
if (!U_IS_BMP(oldCh)) {
ch = input.readSurrogatePairChecked(negativeInputOffset);
++i;
} else
ch = term.matchDirection() == Forward ? input.readCheckedDontAdvance(negativeInputOffset) : input.tryReadBackward(negativeInputOffset);
// ^^^ tryReadBackward may call rewind(1), modifying pos
if (oldCh == errorCodePoint || ch == errorCodePoint)
return false; // <-- Returns false WITHOUT restoring pos for Backward direction
if (oldCh == ch)
continue;
// ... case-insensitive checks ...
if (term.matchDirection() == Forward)
input.uncheckInput(matchSize); // Only restores for Forward!
return false; // <-- Returns false WITHOUT restoring pos for Backward direction
}
if (term.matchDirection() == Backward)
input.uncheckInput(matchSize); // Only called on SUCCESS
return true;
}
```
Lines 680–681 handle Forward direction position restoration on failure via `uncheckInput`. There is no equivalent restoration for the Backward direction. Lines 686–687 restore position for the Backward direction, but only on **success** (return true path).
**3. `matchBackReference()` Greedy case — no position save/restore:**
```cpp
// Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1049-1055
case QuantifierType::Greedy: {
unsigned matchAmount = 0;
while ((matchAmount < term.atom.quantityMaxCount) && tryConsumeBackReference(matchBegin, matchEnd, term))
++matchAmount;
backTrack->matchAmount = matchAmount;
return true; // <-- Always returns true, even with corrupted pos
}
```
When `tryConsumeBackReference` returns `false` (match failure), the while loop exits and the function unconditionally returns `true`. The position corruption from `tryReadBackward`'s `rewind(1)` persists, and the caller sees a position that is off by 1.
Compare with FixedCount (lines 1039–1046), which properly restores position via `input.setPos(backTrack->begin)` on failure:
```cpp
case QuantifierType::FixedCount: {
for (unsigned matchAmount = 0; matchAmount < term.atom.quantityMaxCount; ++matchAmount) {
if (!tryConsumeBackReference(matchBegin, matchEnd, term)) {
input.setPos(backTrack->begin); // <-- Properly restores position
return false;
}
}
return true;
}
```
Also note that at initialization time (lines 1005–1007), the Greedy case does not save `backTrack->begin`:
```cpp
case QuantifierType::Greedy:
backTrack->matchAmount = 0;
break; // <-- Does NOT save backTrack->begin = input.getPos()
```
While FixedCount does save it (lines 1001–1003):
```cpp
case QuantifierType::FixedCount:
backTrack->begin = input.getPos(); // <-- Saves position for restoration
break;
```
### Trigger Conditions
1. A lookbehind assertion (`(?<=...)`) that forces backward matching direction (lookbehinds always use the Yarr interpreter, not JIT).
2. A greedy backreference quantifier (`\N*` or `\N+`) inside the lookbehind.
3. The backward read position lands on a trail surrogate of a supplementary (non-BMP) Unicode character.
4. The backreference match fails (the captured text does not match the supplementary character).
5. The Unicode flag (`u` or `v`) must be set for surrogate pair decoding.
## Version
### Reproduced Version
- `main` branch latest commit (2026/04/18): `a4390137a4038c07661f58e45ee205b5a623f9dd`
## Reproduction Case
### Release Build
```bash
WebKitBuild/JSCOnly/Release/bin/jsc /tmp/poc.js
```
Result:
```
null
b,a
b,a
b,a
```
Line 1 (`null`) is the bug: `/(?<=\u{10000}\1*(a))b/u` should match at index 3. Lines 2–4 are the BMP-only, non-greedy, and fixed-count controls — all match correctly, isolating the bug to greedy backreference + non-BMP character.
Debug build result identical to release. No ASSERT fires — the position corruption produces a silent wrong result; `rewind(1)` leaves `pos` at a valid in-bounds index so no internal sanity check triggers.
### V8 (Chrome)
```
b,a
b,a
b,a
b,a
```
All four cases produce `b,a` — confirming the first case is a JSC-only bug.
### PoC Code
```js
// Yarr: greedy backref in lookbehind with non-BMP char corrupts pos (tryConsumeBackReference)
print(/(?<=\u{10000}\1*(a))b/u.exec("\u{10000}ab")); // null (bug; expected: ["b","a",index:3])
print(/(?<=Z\1*(a))b/u.exec("Zab")); // ["b","a",...] — BMP-only, works
print(/(?<=\u{10000}\1*?(a))b/u.exec("\u{10000}ab")); // ["b","a",...] — non-greedy, works
print(/(?<=\u{10000}\1{0}(a))b/u.exec("\u{10000}ab")); // ["b","a",...] — fixed-count, works
```
## Suggested Patch
### File: `Source/JavaScriptCore/yarr/YarrInterpreter.cpp`
The root cause is that `tryConsumeBackReference` does not save and restore `input.pos` on failure paths when `matchDirection() == Backward`. The fix saves `pos` at function entry and restores it on every early-return `false` path in the backward case.
```diff
--- a/Source/JavaScriptCore/yarr/YarrInterpreter.cpp
+++ b/Source/JavaScriptCore/yarr/YarrInterpreter.cpp
@@ -641,50 +641,59 @@ public:
bool NODELETE tryConsumeBackReference(int matchBegin, int matchEnd, ByteTerm& term)
{
unsigned matchSize = (unsigned)(matchEnd - matchBegin);
if (term.matchDirection() == Forward) {
if (!input.checkInput(matchSize))
return false;
}
+ unsigned savedPos = input.getPos();
+
for (unsigned i = 0; i < matchSize; ++i) {
unsigned negativeInputOffset = term.inputPosition + matchSize - i;
- if (term.matchDirection() == Backward && negativeInputOffset > input.getPos())
+ if (term.matchDirection() == Backward && negativeInputOffset > input.getPos()) {
+ input.setPos(savedPos);
return false;
+ }
char32_t oldCh = input.reread(matchBegin + i);
char32_t ch;
if (!U_IS_BMP(oldCh)) {
ch = input.readSurrogatePairChecked(negativeInputOffset);
++i;
} else
ch = term.matchDirection() == Forward ? input.readCheckedDontAdvance(negativeInputOffset) : input.tryReadBackward(negativeInputOffset);
- if (oldCh == errorCodePoint || ch == errorCodePoint)
+ if (oldCh == errorCodePoint || ch == errorCodePoint) {
+ if (term.matchDirection() == Backward)
+ input.setPos(savedPos);
return false;
+ }
if (oldCh == ch)
continue;
if (term.ignoreCase()) {
// See ES 6.0, 21.2.2.8.2 for the definition of Canonicalize(). For non-Unicode
// patterns, Unicode values are never allowed to match against ASCII ones.
// For Unicode, we need to check all canonical equivalents of a character.
if (isLegacyCompilation() && (isASCII(oldCh) || isASCII(ch))) {
if (toASCIIUpper(oldCh) == toASCIIUpper(ch))
continue;
} else if (areCanonicallyEquivalent(oldCh, ch, isEitherUnicodeCompilation() ? CanonicalMode::Unicode : CanonicalMode::UCS2))
continue;
}
if (term.matchDirection() == Forward)
input.uncheckInput(matchSize);
+ else
+ input.setPos(savedPos);
return false;
}
if (term.matchDirection() == Backward)
input.uncheckInput(matchSize);
return true;
}
```
#### Explanation
The fix saves `input.pos` at the start of `tryConsumeBackReference` and restores it on all failure paths when `matchDirection() == Backward`. This ensures that the `rewind(1)` side effect from `tryReadBackward`'s surrogate pair handling does not leak out of a failed backreference match attempt.
The fix addresses all three `return false` paths in the function:
1. When `negativeInputOffset > input.getPos()` (boundary check, defensive).
2. When `oldCh == errorCodePoint || ch == errorCodePoint` (read error — the critical path where `tryReadBackward` has already called `rewind(1)`).
3. When characters do not match and case-insensitive comparison also fails.
This is consistent with how the Forward direction handles failure via `input.uncheckInput(matchSize)`, and how the FixedCount case in `matchBackReference` handles failure via `input.setPos(backTrack->begin)`. An alternative fix (Option B) would save/restore position in the Greedy case of `matchBackReference`, but fixing it at the source in `tryConsumeBackReference` is preferable as it also protects any future callers.
### Credit Information
Reporter credit: Junyoung Park (@candymate) of KAIST Hacking Lab
| Attachments | ||
|---|---|---|
| Add attachment proposed patch, testcase, etc. |
Radar WebKit Bug Importer
<rdar://problem/175122467>
Kai Tamkun
Pull request: https://github.com/WebKit/WebKit/pull/63722
EWS
Committed 313026@main (f44dcfd3c922): <https://commits.webkit.org/313026@main>
Reviewed commits have been landed. Closing PR #63722 and removing active labels.