Bug 56082 - Look-ahead assertions with back references don’t work as expected
Summary: Look-ahead assertions with back references don’t work as expected
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: Mac (Intel) OS X 10.6
: P2 Normal
Assignee: Michael Saboff
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-10 01:57 PST by Markus Wulftange
Modified: 2011-03-14 17:36 PDT (History)
3 users (show)

See Also:


Attachments
Patch that tests assertion pre-checked character length independent from the remaining expression (10.38 KB, patch)
2011-03-14 15:25 PDT, Michael Saboff
barraclough: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Markus Wulftange 2011-03-10 01:57:21 PST
There seems to be a bug with look-ahead assertions that use a back reference:

     "P11".match(/^P([1-6])(?=\1)([1-6])$/)

This obviously should match as there is another 1 following the first 1 (matched string of the first group). In opposite to this, using a negated look-ahead assertion instead does match although there is another 1 following:

     "P11".match(/^P([1-6])(?!\1)([1-6])$/)
Comment 1 Alexey Proskuryakov 2011-03-10 10:55:56 PST
Is this a regression from Safari 5 that you're seeing in nightly builds?
Comment 2 Alexey Proskuryakov 2011-03-10 10:56:24 PST
Is this a regression from Safari 5 that you're seeing in nightly builds?
Comment 3 Gavin Barraclough 2011-03-10 11:20:47 PST
Reduction to /(x)(?=\1)x/.exec("xx") also fails.
Comment 4 Gavin Barraclough 2011-03-10 14:46:43 PST
The bug relates to the pre-checking of input.
In the reduced case of /(x)(?=\1)x/.exec("xx") , we:

(1) check 2 characters are available (increments input position by 2, to end of string)
(2) check the initial capturing 'x' (at input offset -2)
(3) enter the parenthetical assertion
(4) try to check the back reference at offset -1.

The first step of trying to mach the back-ref is to check that matchSize (1) bytes are available to check.  However they are not, since they have been pre-checked for allow for the part of the pattern after the parenthetical assertion.  (this can be demonstrated by testing with input "xxy" - the presence of the 'y' allows the matchSize check to pass, and then the back ref test works correctly).

A simple fix would appear to be to make the back reference only check matchSize bytes less the input offset from the end, however this would not be correct.  (E.g. suppose the back ref were quantified such as '\1{2}' - given the above example the first match of the back ref would need check 0 bytes but the second would need to check 1).


The same problem can be demonstrated with the following test:
    /(?=((?:ab)*))a/.exec("ab")
This should have results ["a","ab"], but we currently produce ["a",""].  (Again, an additional character of input fixes the issue, e.g. "abc" gets the right result).  This demonstrates that the issue is not specific to back-references, but can effect anything in a parenthetical assertion.


The correct fix here is most likely to prevent pre-checking of input from spanning over parenthetical assertions.  I.e., given an expression /abc(?=$FOO)bar/, rather than pre-checking 6 at the start of the expression we should just pre-check 3, then run the parenthetical assertion, then pre-check the next three after the parenthetic assertion has matched.

An alternative that might be slightly more performant would be to fully pre-check at at the head of the outer alternative, then unwind the input pointer around the alternative.  (E.g. in the case of /abc(?=$FOO)bar/ we could pre-check 6, when we enter the assertion decrement input position by 3, on assertion match increment again by 3, with matching adjustments when backtracking through the assertion).  If we made the change in this fashion we'd also need modify the way that the input offsets within the assertion are calculated, since they would no longer be correct.  This is probably not an important case to optimize, and this sounds fragile, so I'd suggest just breaking the pre-checking from spanning over the assertion.
Comment 5 Peter Varga 2011-03-11 06:47:14 PST
I guess I have already reported a very similar problem at https://bugs.webkit.org/show_bug.cgi?id=55589

The test case at that bug (pcre-test-1) isn't a fallback case therefore the YARR JIT passes on it. The YARR JIT handles the look-ahead assertions correctly thus the YARR Interpreter needs a fix.
Comment 6 Gavin Barraclough 2011-03-11 15:45:25 PST
Peter, ah yes- that looks like a related bug, cheers! - hopefully should be fixed by this.

Michael, here are all the test regexes we were looking at yesterday, along with a list of test inputs we were using - should all be good for a regression test.

cheers,
G.


/^P([1-6])(?!\1)([1-6])$/
"P11"

/(x)(?=\1)x/
"x", "xx", "xxy"

/(x)zzz(?=\1)x/
"xzzzx", "xzzzxy"

/(a)\1(?=(b*c))bc/
/(a)a(?=(b*c))bc/
"aabc", "aabcx"

/a(?=(b*c))bc/
/(?=((?:ab)*))a/
"ab", "abc"

/(?=((?:xx)*))x/
/(?=((xx)*))x/
/(?=(xx))+x/
"x", "xx", "xxx"

/(?=a+b)aab/
"aab"
Comment 7 Michael Saboff 2011-03-14 15:25:21 PDT
Created attachment 85729 [details]
Patch that tests assertion pre-checked character length independent from the remaining expression
Comment 8 Gavin Barraclough 2011-03-14 16:50:05 PDT
Comment on attachment 85729 [details]
Patch that tests assertion pre-checked character length independent from the remaining expression

looks great Michael.
Comment 9 Michael Saboff 2011-03-14 17:36:41 PDT
Committed r81085: <http://trac.webkit.org/changeset/81085>