Bug 64202

Summary: Enh: Improve handling of RegExp in the form of /.*blah.*/
Product: WebKit Reporter: Michael Saboff <msaboff>
Component: JavaScriptCoreAssignee: Michael Saboff <msaboff>
Status: RESOLVED FIXED    
Severity: Normal    
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Proposed patch
none
Updated patch to fix win compilation error.
barraclough: review-
Patch with suggested updates
none
Previous patch with regex JIT turned back on. barraclough: review+

Description Michael Saboff 2011-07-08 14:12:41 PDT
The current regular expression code handles REs in the form of /.*blah.*/ by greedily matching all characters with the first ".*" and then back tracking to find the "blah". Without capturing parens anywhere in the RE, we can look for the "blah" and then process the leading and trailing ".*".  This should provide a big win for these types of REs.
Comment 1 Michael Saboff 2011-07-11 09:29:21 PDT
Created attachment 100311 [details]
Proposed patch

Proposed enhancement to both the Yarr interpreter and JIT to handle expressions in the form of /[^].*[?]<sub-expr>.*[$]/[m].  The terms in between the leading and trailing .*'s cannot capture and also this enhancement is limited to single alternative expressions.  Process the inner terms and then look for the beginning of the string and end of the string.
Comment 2 Michael Saboff 2011-07-11 15:52:16 PDT
Created attachment 100374 [details]
Updated patch to fix win compilation error.
Comment 3 Gavin Barraclough 2011-07-12 17:35:26 PDT
Comment on attachment 100374 [details]
Updated patch to fix win compilation error.

View in context: https://bugs.webkit.org/attachment.cgi?id=100374&action=review

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1074
> +        if (pattern->m_multiline) {

These two if statements could be merged & save a repeating the match assign & return true.

if (!pattern->m_multiline && ((term.anchors.m_bol && matchBegin) || (term.anchors.m_eol && matchEnd != input.end())))
    return false;

> Source/JavaScriptCore/yarr/YarrInterpreter.cpp:1379
> +            BACKTRACK();

I don't think we should ever be able to backtrack into here; I think this should be ASSERT_NOT_REACHED()?

> Source/JavaScriptCore/yarr/YarrJIT.cpp:1011
> +        saveStartIndex.append(branch32(Equal, matchPos, Imm32(m_checked)));

Please modify per our discussion, /.*\n\d+.*/ "abc\n123".

> Source/JavaScriptCore/yarr/YarrPattern.cpp:716
> +        for (size_t alt = 0; alt < alternatives.size(); ++alt) {

We shouldn't need a for-loop here, given the above guard!   possibly should switch the check to "if (alternatives.size() != 1) return;" (I don't think we ever have 0 alternatives, but just to be safe!), then drop the loop & just work over alternatives[0].
Comment 4 Michael Saboff 2011-07-13 11:20:16 PDT
Created attachment 100690 [details]
Patch with suggested updates
Comment 5 Michael Saboff 2011-07-13 11:30:23 PDT
Created attachment 100694 [details]
Previous patch with regex JIT turned back on.
Comment 6 Michael Saboff 2011-07-13 16:05:45 PDT
Committed r90962: <http://trac.webkit.org/changeset/90962>