45787 – Yarr JIT code checks BOL (^) each time through loop when in subexpression

RESOLVED FIXED45787

Yarr JIT code checks BOL (^) each time through loop when in subexpression

https://bugs.webkit.org/show_bug.cgi?id=45787

Summary Yarr JIT code checks BOL (^) each time through loop when in subexpression

Michael Saboff

Reported 2010-09-14 15:57:13 PDT

The current Yarr JIT code will check for beginning of line every time through a loop for expression like (^|X)abc. This generates inefficient code. For non-multiline regular expressions, the beginning of line check can be done once for the first iteration and after that only the remaining parenthetical subexpressions need to be checked.

Attachments
Patch to unroll regular expression containing leading (^\|...) subexpressions. (13.11 KB, patch) 2010-09-14 16:39 PDT, Michael Saboff	no flags	Details Formatted Diff Diff
Patch that handles broader class of ^ anchored expressions. (16.77 KB, patch) 2010-09-16 17:51 PDT, Michael Saboff	barraclough: review- barraclough: commit-queue-	Details Formatted Diff Diff
Patch with changes recommended by reviewer (22.45 KB, patch) 2010-09-17 11:18 PDT, Michael Saboff	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Michael Saboff

Comment 1 2010-09-14 16:01:20 PDT

This is related to <rdar://problem/8401178>.

Michael Saboff

Comment 2 2010-09-14 16:39:31 PDT

Created attachment 67619 [details] Patch to unroll regular expression containing leading (^|...) subexpressions. The addition of this patch improves the v8-regexp execution time by ~15%.

Michael Saboff

Comment 3 2010-09-14 17:02:41 PDT

Actually this is related to <rdar://problem/8383005> v8 regexp codegen improvements

Joseph Pecoraro

Comment 4 2010-09-15 10:08:39 PDT

Was it deemed impractical to optimize ^ in a non-first alteration of the first group. For example: "aced\nabc".match( /(X|^)abc/m ) Or, multiple alterations: (here it looks like your patch would optimize the first, but not the second) "sire\nstring".match( /(^si|^st|X)r/m ) It could be that it is far less likely that developers take these approaches compared to the one you are optimizing. The goal is probably to improve the performance of real world JavaScript, not necessarily all edge cases.

Michael Saboff

Comment 5 2010-09-15 10:31:47 PDT

First of all, multiline regular expressions are excluded from this optimization as this optimization factors out the (^)blah from (^|...)blah and executes it once at the beginning of the line. Looking for ^ in some location other than the first position could be done, but there are some troubling corner cases. Consider "abab".match( /(ab|^)ab/ ). This should match the first alternative including capturing. The suggestion of handling expressions like "sireblah".match( /(^si|^st|X)r/ ) could be done with more detection and unrolling logic. It wasn't seen as the likely form for ^ anchored expression "in the real world".

Joseph Pecoraro

Comment 6 2010-09-15 10:37:24 PDT

(In reply to comment #5) > First of all, multiline regular expressions are excluded from this optimization ... Whoops, yes you're right. > Looking for ^ in some location other than the first position could be done, but there are some troubling corner cases. Consider > > "abab".match( /(ab|^)ab/ ). > > This should match the first alternative including capturing. Very good point. > The suggestion of handling expressions like > > "sireblah".match( /(^si|^st|X)r/ ) > > could be done with more detection and unrolling logic. It wasn't seen as the likely form for ^ anchored expression "in the real world". Sounds good to me. Thanks for the explanations!

Gavin Barraclough

Comment 7 2010-09-16 13:43:05 PDT

Comment on attachment 67619 [details] Patch to unroll regular expression containing leading (^|...) subexpressions. Clearing review flag & look forward to a new patch as per our discussion.

Michael Saboff

Comment 8 2010-09-16 17:51:48 PDT

Created attachment 67868 [details] Patch that handles broader class of ^ anchored expressions. This patch unrolls expressions with any alternative that contains ^ anchoring. Compared to the prior patch, this patch uses the parse to record expressions that are BOL anchored. The routine optimizeBOL() then re-factors the expression, adding alternatives that don't contain BOL anchoring. The original expression is executed once and the alternatives without BOL anchoring are looped over.

Gavin Barraclough

Comment 9 2010-09-16 23:51:33 PDT

Comment on attachment 67868 [details] Patch that handles broader class of ^ anchored expressions. This patch looks really great. Two small issues. > + // Bubble up BOL flags > + for (unsigned i = 0; i < numParenAlternatives; i++) > + if (parenthesisDisjunction->m_alternatives[i]->m_startsWithBOL) > + numBOLAnchoredAlts++; (1) This for-loop needs braces. (Our style guidelines require braces for any nested multiline statement, even just nesting a comment & one line of code.) > + if ((!m_pattern.m_containsBOL) || (m_pattern.m_multiline)) (2) Please lose the two nested sets of parentheses. There is one other thing I'd like to see, but not necessarily as a part of this patch. Unless I'm missing something this patch won't yet optimize /^a/ matched against a huge string of 'b's. It looks to me like you'll fail the 'if (loopDisjunction) {' test, and safely not modify the pattern - which means I think we'll still search all the way through the input looking for BOL. In !'if (loopDisjunction) {' cases (cases where m_startsWithBOL is true for the topmost disjunction) we should be able to avoid looping at all – if testing the first position fails then we have a failed match. (I don't think that you're optimizing this yet, apologies if I've missed this). I think it would be really great if we could add this optimization now since it is something that PCRE has but YARR does not, but I don't think you need to do so as a part of this patch, it should be a nice simple addition to follow up with building on this work. Really awesome job, would be an r+ bar the two little style things. G.

Michael Saboff

Comment 10 2010-09-17 11:18:19 PDT

Created attachment 67924 [details] Patch with changes recommended by reviewer Cleaned up both style issues pointed out. Added code to properly handle expressions like /^a/ to only execute once and never loop. Found and fixed issue for expressions like /(?!^)a/ which should not be marked as starting with a BOL, since they start with NOT BOL!

WebKit Commit Bot

Comment 11 2010-09-18 14:04:29 PDT

Comment on attachment 67924 [details] Patch with changes recommended by reviewer Clearing flags on attachment: 67924 Committed r67790: <http://trac.webkit.org/changeset/67790>

WebKit Commit Bot

Comment 12 2010-09-18 14:04:34 PDT

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Enhancement

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component JavaScriptCore

Assignee

Michael Saboff

Reported

2010-09-14 15:57 PDT

Modified

2010-09-18 14:04 PDT History

CC List

2 users Show

URL

Keywords

Depends on

Blocks