The current Yarr JIT code will check for beginning of line every time through a loop for expression like (^|X)abc. This generates inefficient code. For non-multiline regular expressions, the beginning of line check can be done once for the first iteration and after that only the remaining parenthetical subexpressions need to be checked.
This is related to <rdar://problem/8401178>.
Created attachment 67619 [details] Patch to unroll regular expression containing leading (^|...) subexpressions. The addition of this patch improves the v8-regexp execution time by ~15%.
Actually this is related to <rdar://problem/8383005> v8 regexp codegen improvements
Was it deemed impractical to optimize ^ in a non-first alteration of the first group. For example: "aced\nabc".match( /(X|^)abc/m ) Or, multiple alterations: (here it looks like your patch would optimize the first, but not the second) "sire\nstring".match( /(^si|^st|X)r/m ) It could be that it is far less likely that developers take these approaches compared to the one you are optimizing. The goal is probably to improve the performance of real world JavaScript, not necessarily all edge cases.
First of all, multiline regular expressions are excluded from this optimization as this optimization factors out the (^)blah from (^|...)blah and executes it once at the beginning of the line. Looking for ^ in some location other than the first position could be done, but there are some troubling corner cases. Consider "abab".match( /(ab|^)ab/ ). This should match the first alternative including capturing. The suggestion of handling expressions like "sireblah".match( /(^si|^st|X)r/ ) could be done with more detection and unrolling logic. It wasn't seen as the likely form for ^ anchored expression "in the real world".
(In reply to comment #5) > First of all, multiline regular expressions are excluded from this optimization ... Whoops, yes you're right. > Looking for ^ in some location other than the first position could be done, but there are some troubling corner cases. Consider > > "abab".match( /(ab|^)ab/ ). > > This should match the first alternative including capturing. Very good point. > The suggestion of handling expressions like > > "sireblah".match( /(^si|^st|X)r/ ) > > could be done with more detection and unrolling logic. It wasn't seen as the likely form for ^ anchored expression "in the real world". Sounds good to me. Thanks for the explanations!
Comment on attachment 67619 [details] Patch to unroll regular expression containing leading (^|...) subexpressions. Clearing review flag & look forward to a new patch as per our discussion.
Created attachment 67868 [details] Patch that handles broader class of ^ anchored expressions. This patch unrolls expressions with any alternative that contains ^ anchoring. Compared to the prior patch, this patch uses the parse to record expressions that are BOL anchored. The routine optimizeBOL() then re-factors the expression, adding alternatives that don't contain BOL anchoring. The original expression is executed once and the alternatives without BOL anchoring are looped over.
Comment on attachment 67868 [details] Patch that handles broader class of ^ anchored expressions. This patch looks really great. Two small issues. > + // Bubble up BOL flags > + for (unsigned i = 0; i < numParenAlternatives; i++) > + if (parenthesisDisjunction->m_alternatives[i]->m_startsWithBOL) > + numBOLAnchoredAlts++; (1) This for-loop needs braces. (Our style guidelines require braces for any nested multiline statement, even just nesting a comment & one line of code.) > + if ((!m_pattern.m_containsBOL) || (m_pattern.m_multiline)) (2) Please lose the two nested sets of parentheses. There is one other thing I'd like to see, but not necessarily as a part of this patch. Unless I'm missing something this patch won't yet optimize /^a/ matched against a huge string of 'b's. It looks to me like you'll fail the 'if (loopDisjunction) {' test, and safely not modify the pattern - which means I think we'll still search all the way through the input looking for BOL. In !'if (loopDisjunction) {' cases (cases where m_startsWithBOL is true for the topmost disjunction) we should be able to avoid looping at all – if testing the first position fails then we have a failed match. (I don't think that you're optimizing this yet, apologies if I've missed this). I think it would be really great if we could add this optimization now since it is something that PCRE has but YARR does not, but I don't think you need to do so as a part of this patch, it should be a nice simple addition to follow up with building on this work. Really awesome job, would be an r+ bar the two little style things. G.
Created attachment 67924 [details] Patch with changes recommended by reviewer Cleaned up both style issues pointed out. Added code to properly handle expressions like /^a/ to only execute once and never loop. Found and fixed issue for expressions like /(?!^)a/ which should not be marked as starting with a BOL, since they start with NOT BOL!
Comment on attachment 67924 [details] Patch with changes recommended by reviewer Clearing flags on attachment: 67924 Committed r67790: <http://trac.webkit.org/changeset/67790>
All reviewed patches have been landed. Closing bug.