Bug 42159

Summary: Matching regular expressions which contain iterative parentheses with YARR JIT
Product: WebKit Reporter: Peter Varga <pvarga>
Component: JavaScriptCoreAssignee: Peter Varga <pvarga>
Status: NEW ---    
Severity: Normal CC: abecsi, barraclough, ggaren, zherczeg
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on: 42264    
Bug Blocks:    
Attachments:
Description Flags
work in progress patch
none
performance testcase
none
measurement script none

Description Peter Varga 2010-07-13 05:50:15 PDT
The current YARR JIT implementation doesn't support iterative parentheses matching so
it does fallback to pcre in these cases which prevents us from removing pcre from JSC.
Comment 1 Peter Varga 2010-07-14 08:23:56 PDT
Created attachment 61521 [details]
work in progress patch

This patch adds support for matching regular expressions which contain iterative
parentheses to YARR JIT. For applying this patch you need a JSC patched by the
"Prepare YARR JIT for matching regexps with iterative parentheses" 
(https://bugs.webkit.org/show_bug.cgi?id=42264) bug's patch because this solution's logic
is based on that.

The JIT generated code does match by the following simple logic:
Every time when matching the iterative parentheses the previous iteration's
match is stored on the stack marked by the subpattern ID. Therefore in
case of backtrack the result of the last successful matching can be restored.

The implementation isn't complete, the following cases which are within
iterative parentheses aren't handled therefore these still do fallback to
pcre:

- disjunction of parentheses contain more than one alternative
- the number of iterations is fixed (generateParenthesesFixed)
- in case of non-greedy matching (generateParenthesesNonGreedy)

However this solution passes all the regression tests in case of greedy matching with an upper limit 
the result can be wrong and the algorithm doesn't fallback to pcre, so this is also needs further work. These critical 
points are marked in the code by TODOs.

All the fast/js layout tests pass as well, except one: regexp-overflow.html
This test doesn't throw an expected exception in case of large number of iteration, but
YARR Interpreter doesn't throw an exception neither. 
However two layouts are fixed in the fast/js:

- sputnik/Conformance/15_Native_Objects/15.10_RegExp/15.10.2/15.10.2.5_Term/S15.10.2.5_A1_T4.html
- sputnik/Conformance/15_Native_Objects/15.10_RegExp/15.10.6/15.10.6.2_RegExp.prototype.exec/S15.10.6.2_A1_T6.html

NOTE: This patch is just a suggestion, not the final solution. I don't offer
this for a review now, I just would like to get some feedback before finishing
the whole work on iterative parentheses matching.

Please don't hesitate to comment!
Comment 2 Peter Varga 2010-07-14 08:26:04 PDT
Created attachment 61522 [details]
performance testcase

I wrote a performance test for benchmarking greedy matching of parentheses.
Results of measurements on Linux:
reference:		~3050ms
with this patch: 	~1150ms
Comment 3 Peter Varga 2010-07-14 08:32:29 PDT
Created attachment 61524 [details]
measurement script

A simple script for measuring the performance of the testcase. Simply run it with the jsc binary.
Comment 4 Geoffrey Garen 2010-07-14 10:09:33 PDT
Is there a way to avoid stack overflow in cases of substantial backtracking?
Comment 5 Peter Varga 2010-07-15 09:05:51 PDT
The dependency of this patch was changed and this caused performance improvements.

These are the performance results:

				  reference		modified
regexp-dna:			22.2ms +/- 1.4%    22.0ms +/- 1.5%
v8-regexp:	1.032x as fast	435.1ms +/- 1.2%    421.6ms +/- 1.2%
Comment 6 Peter Varga 2010-07-15 09:08:36 PDT
(In reply to comment #4)
> Is there a way to avoid stack overflow in cases of substantial backtracking?

Hi Geoffrey, there was some discussion about stack guards on the preparing patch.
You can see: https://bugs.webkit.org/show_bug.cgi?id=42264