RESOLVED WORKSFORME Bug 38117
Differences between subpattern matching in use of pcre and Yarr Intrepreter
https://bugs.webkit.org/show_bug.cgi?id=38117
Summary Differences between subpattern matching in use of pcre and Yarr Intrepreter
Peter Varga
Reported 2010-04-26 05:36:11 PDT
I have found some cases when the Yarr JIT has done fallback to pcre, but the result of matching of subpatterns is different between the pcre and Yarr Interpreter cases. example 1: var pat = /(a(b)*)*/; var str = "aba"; print(str.match(pat)); pcre's result: aba,a,b yarr's result: aba,a, In this case the pcre doesn't match the subpattern of 'b' at the second iteration, because the matching algorithm reached the end of the input string. Thus the first iteration's result remains in the output vector. But the yarr's algorithm matches the subpattern of 'b' at the end of the input string again. I think according to the yarr algorithm the outern subpattern should match in the third iteration but in this case the yarr's result should be "aba,," because the subpattern of 'a' doesn't match at the third iteration either. example 2: var pat = /(a*)*/; var str = "ab"; print(str.match(pat)); pcre's result: a, yarr's result: a,a In this case the situation is similar to the first example, but here a character matching blocks the matching of the subpattern instead of reaching the end of the input string. Yarr stores the first matching of subpattern but pcre doesn't. example 3: var pat = /([ab]*)*/; var str = "abab"; print(str.match(pat)); pcre's result = abab, yarr's result = abab,abab IMHO the yarr's way is correct in this case, because pcre tries to match the subpattern character by character instead of one iteration. Which is the correct behaviour in each example? Which regex engine needs a fix?
Attachments
Gavin Barraclough
Comment 1 2010-05-19 11:46:14 PDT
Hi Peter, In all three cases here, I believe Yarr is correct, and its results match those of FireFox. Where the spec appeared ambiguous I based Yarr's behaviour on an average of the behaviours of other browsers, and what seemed to make sense. I seem to recall I generally found IE to have the most spec compliant and sensible (to my opinion) results. In all the cases that I'm aware of differences between PCRE & Yarr, we believe Yarr to be correct – though of course there may well be bugs in Yarr that we're not aware of (e.g. I spotted a bug you raise the other day re ?? where Yarr clearly is wrong). The rule here, I think, is that if you have a quantified set of capturing parentheses, the capture should be the value from the last successful match. For a match to count as successful the pattern obviously has to match, but also for optional matches (matches after the minimum extent of the quantification has been reached) the result of the match must not be the empty string. E.g.: /(a?){3}(b)/.exec("aab") This should produce the result aab,,b since the first capture match three times, the third time matching an empty string. This is not optional, so is recorded as the first subpattern despite being empty. /(a?){2,3}(b)/.exec("aab") This should produce the result aab,a,b since the first capture only successfully matches twice (a third match is optional, and since it would match the empty string is considered a failed match, and isn't recorded). Firefox gets these examples right, PCRE gets the latter wrong – I haven't had a chance to build Yarr interpreter – hopefully that should get it right though! (And I think the relevant quote from the spec, if you want to try to find this, is "Step 1 of theRepeatMatcher's closure d states that, once the minimum number of repetitions has been satisfied, any more expansions of Atom that match the empty string are not considered for further repetitions. This prevents the regular expression engine from falling into an infinite loop on patterns such as:") The other rule is that the results of any captures nested in an outer set of parens should should reflect the value from the last successful match of the outer parens. e.g.: /(?:(a)|(b))*/.exec("ab") Should result in ab,b,,b since in the last iteration of the outermost capture the first nested capture (a) does not match but the second nested capture (b) does. PCRE does not reset its nested matches between iterations, and again it gets this wrong. Firefox gets this right, and again I've not had time to test Yarr interpreter, but believe we should get this right! Our general approach towards fixing PCRE right now is that we'd like to just remove it altogether! – we rather not have multiple RE engines in the codebase, so our plan is that when we have time it to just working on optimizing YARR interpreter, with a view that once this is fast enough we'll be able to switch all builds to use this instead. That said, patches to fix PCRE are always welcome too! cheers, G.
Peter Varga
Comment 2 2010-05-20 02:22:37 PDT
(In reply to comment #1) Thanks for your detailed answer. I'm working on the implementation of the cases which are missing from the YARR JIT and cause fallback to pcre. I'm planning to apply some optimizations on YARR Interpreter too, which I noticed during the work on JIT. Most of my previous bug reports (like this) were raised during this work. Now I'm working on the JIT implementation of the quantified (iterative) group matching. The current conception is to avoid to allocate stack frames for terms in advance and store information about terms in every iteration on stack. This information is usually the match count of each term and the start and end position of the previous matching in case of capturing groups. A part of this solution can be found at this bug report: https://bugs.webkit.org/show_bug.cgi?id=38988 This patch is motivated by the "do I really want this on the stack" TODO. Thanks, Peter
Gavin Barraclough
Comment 3 2011-06-18 00:09:10 PDT
These tests were all determined to be correct in YARR. These test cases are now landed in in r89167.
Note You need to log in before you can comment on or make changes to this bug.