RESOLVED FIXED Bug 45787
Yarr JIT code checks BOL (^) each time through loop when in subexpression
https://bugs.webkit.org/show_bug.cgi?id=45787
Summary Yarr JIT code checks BOL (^) each time through loop when in subexpression
Michael Saboff
Reported 2010-09-14 15:57:13 PDT
The current Yarr JIT code will check for beginning of line every time through a loop for expression like (^|X)abc. This generates inefficient code. For non-multiline regular expressions, the beginning of line check can be done once for the first iteration and after that only the remaining parenthetical subexpressions need to be checked.
Attachments
Patch to unroll regular expression containing leading (^|...) subexpressions. (13.11 KB, patch)
2010-09-14 16:39 PDT, Michael Saboff
no flags
Patch that handles broader class of ^ anchored expressions. (16.77 KB, patch)
2010-09-16 17:51 PDT, Michael Saboff
barraclough: review-
barraclough: commit-queue-
Patch with changes recommended by reviewer (22.45 KB, patch)
2010-09-17 11:18 PDT, Michael Saboff
no flags
Michael Saboff
Comment 1 2010-09-14 16:01:20 PDT
This is related to <rdar://problem/8401178>.
Michael Saboff
Comment 2 2010-09-14 16:39:31 PDT
Created attachment 67619 [details] Patch to unroll regular expression containing leading (^|...) subexpressions. The addition of this patch improves the v8-regexp execution time by ~15%.
Michael Saboff
Comment 3 2010-09-14 17:02:41 PDT
Actually this is related to <rdar://problem/8383005> v8 regexp codegen improvements
Joseph Pecoraro
Comment 4 2010-09-15 10:08:39 PDT
Was it deemed impractical to optimize ^ in a non-first alteration of the first group. For example: "aced\nabc".match( /(X|^)abc/m ) Or, multiple alterations: (here it looks like your patch would optimize the first, but not the second) "sire\nstring".match( /(^si|^st|X)r/m ) It could be that it is far less likely that developers take these approaches compared to the one you are optimizing. The goal is probably to improve the performance of real world JavaScript, not necessarily all edge cases.
Michael Saboff
Comment 5 2010-09-15 10:31:47 PDT
First of all, multiline regular expressions are excluded from this optimization as this optimization factors out the (^)blah from (^|...)blah and executes it once at the beginning of the line. Looking for ^ in some location other than the first position could be done, but there are some troubling corner cases. Consider "abab".match( /(ab|^)ab/ ). This should match the first alternative including capturing. The suggestion of handling expressions like "sireblah".match( /(^si|^st|X)r/ ) could be done with more detection and unrolling logic. It wasn't seen as the likely form for ^ anchored expression "in the real world".
Joseph Pecoraro
Comment 6 2010-09-15 10:37:24 PDT
(In reply to comment #5) > First of all, multiline regular expressions are excluded from this optimization ... Whoops, yes you're right. > Looking for ^ in some location other than the first position could be done, but there are some troubling corner cases. Consider > > "abab".match( /(ab|^)ab/ ). > > This should match the first alternative including capturing. Very good point. > The suggestion of handling expressions like > > "sireblah".match( /(^si|^st|X)r/ ) > > could be done with more detection and unrolling logic. It wasn't seen as the likely form for ^ anchored expression "in the real world". Sounds good to me. Thanks for the explanations!
Gavin Barraclough
Comment 7 2010-09-16 13:43:05 PDT
Comment on attachment 67619 [details] Patch to unroll regular expression containing leading (^|...) subexpressions. Clearing review flag & look forward to a new patch as per our discussion.
Michael Saboff
Comment 8 2010-09-16 17:51:48 PDT
Created attachment 67868 [details] Patch that handles broader class of ^ anchored expressions. This patch unrolls expressions with any alternative that contains ^ anchoring. Compared to the prior patch, this patch uses the parse to record expressions that are BOL anchored. The routine optimizeBOL() then re-factors the expression, adding alternatives that don't contain BOL anchoring. The original expression is executed once and the alternatives without BOL anchoring are looped over.
Gavin Barraclough
Comment 9 2010-09-16 23:51:33 PDT
Comment on attachment 67868 [details] Patch that handles broader class of ^ anchored expressions. This patch looks really great. Two small issues. > + // Bubble up BOL flags > + for (unsigned i = 0; i < numParenAlternatives; i++) > + if (parenthesisDisjunction->m_alternatives[i]->m_startsWithBOL) > + numBOLAnchoredAlts++; (1) This for-loop needs braces. (Our style guidelines require braces for any nested multiline statement, even just nesting a comment & one line of code.) > + if ((!m_pattern.m_containsBOL) || (m_pattern.m_multiline)) (2) Please lose the two nested sets of parentheses. There is one other thing I'd like to see, but not necessarily as a part of this patch. Unless I'm missing something this patch won't yet optimize /^a/ matched against a huge string of 'b's. It looks to me like you'll fail the 'if (loopDisjunction) {' test, and safely not modify the pattern - which means I think we'll still search all the way through the input looking for BOL. In !'if (loopDisjunction) {' cases (cases where m_startsWithBOL is true for the topmost disjunction) we should be able to avoid looping at all – if testing the first position fails then we have a failed match. (I don't think that you're optimizing this yet, apologies if I've missed this). I think it would be really great if we could add this optimization now since it is something that PCRE has but YARR does not, but I don't think you need to do so as a part of this patch, it should be a nice simple addition to follow up with building on this work. Really awesome job, would be an r+ bar the two little style things. G.
Michael Saboff
Comment 10 2010-09-17 11:18:19 PDT
Created attachment 67924 [details] Patch with changes recommended by reviewer Cleaned up both style issues pointed out. Added code to properly handle expressions like /^a/ to only execute once and never loop. Found and fixed issue for expressions like /(?!^)a/ which should not be marked as starting with a BOL, since they start with NOT BOL!
WebKit Commit Bot
Comment 11 2010-09-18 14:04:29 PDT
Comment on attachment 67924 [details] Patch with changes recommended by reviewer Clearing flags on attachment: 67924 Committed r67790: <http://trac.webkit.org/changeset/67790>
WebKit Commit Bot
Comment 12 2010-09-18 14:04:34 PDT
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.