Bug 45748

Summary: Collect the beginning characters in a RegExp pattern for look-up optimization
Product: WebKit Reporter: Peter Varga <pvarga>
Component: JavaScriptCoreAssignee: Peter Varga <pvarga>
Status: RESOLVED FIXED    
Severity: Normal CC: abecsi, barraclough, commit-queue, ggaren, msaboff, zherczeg
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 45751    
Attachments:
Description Flags
proposed patch
none
proposed patch v2
none
proposed patch v3
none
proposed patch v4
none
proposed patch v5
ggaren: review-
proposed patch v6 none

Peter Varga
Reported 2010-09-14 07:24:50 PDT
The RegExp pattern matching can be faster if we try to match an array of possible characters and increment the character index of the input in case of mismatch instead of starting the whole pattern matching process at the beginning. For this optimization the YARR's parser needs an algorithm which creates a list of beginning characters from the parsed tree.
Attachments
proposed patch (16.75 KB, patch)
2010-09-14 07:29 PDT, Peter Varga
no flags
proposed patch v2 (16.86 KB, patch)
2010-09-20 04:10 PDT, Peter Varga
no flags
proposed patch v3 (14.10 KB, patch)
2010-09-20 09:18 PDT, Peter Varga
no flags
proposed patch v4 (14.27 KB, patch)
2010-10-01 05:30 PDT, Peter Varga
no flags
proposed patch v5 (14.28 KB, patch)
2010-10-04 04:40 PDT, Peter Varga
ggaren: review-
proposed patch v6 (15.44 KB, patch)
2010-10-21 07:40 PDT, Peter Varga
no flags
Peter Varga
Comment 1 2010-09-14 07:29:40 PDT
Created attachment 67546 [details] proposed patch
Peter Varga
Comment 2 2010-09-20 04:10:18 PDT
Created attachment 68069 [details] proposed patch v2 updated patch for top of trunk
Peter Varga
Comment 3 2010-09-20 09:18:45 PDT
Created attachment 68094 [details] proposed patch v3 I have attached a new patch for collecting the beginning characters in the parsing phase. This modification doesn't collect the character classes therefore the overhead of the collecting is even less but the look-up optimization is less efficient as well. I separately show the new performance results at JIT (https://bugs.webkit.org/show_bug.cgi?id=45749) and Interpreter (https://bugs.webkit.org/show_bug.cgi?id=45751) bugreports.
Peter Varga
Comment 4 2010-10-01 05:30:39 PDT
Created attachment 69455 [details] proposed patch v4 Restrict the rules of the collecting of beginning characters, in this way some cases where the beginning character look-up optimization is less effective the collecting can be avoided.
Michael Saboff
Comment 5 2010-10-01 14:08:38 PDT
Note that there are a couple of changes needed to the 45748-4 patch to compile on Mac having to do with unsigned / signed comparisons. There are three changes In JavaScriptCore/yarr/RegexCompiler.cpp near line 686, in setupAlternativeBeginTerms() change numTerms to unsigned. In JavaScriptCore/yarr/RegexCompiler.cpp near line 743, insetupBeginChars() change the local size to unsigned. JavaScriptCore/yarr/RegexPattern.h near line 359, in BeginCharHelper struct definition change the return type of the size() method to unsigned.
Peter Varga
Comment 6 2010-10-04 04:40:00 PDT
Created attachment 69616 [details] proposed patch v5 Mac build fixed.
Michael Saboff
Comment 7 2010-10-15 15:54:08 PDT
Here are some comments/questions about patch V5 (https://bugs.webkit.org/attachment.cgi?id=69616). I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure that two UChars fit. I have some questions about the BeginCharHelper methods sort(), uniq() and merge(). Is there a reason why you implemented your own sort() function? It seems that qsort(3) would work. Will uniq() handle three or more values in a row? I don't think so, but think that changing the if() to a while() will fix this. Why doesn't uniq() check the mask for equality as well? Do you sort in descending order to make the check in merge() work or is there another reason? Won't merging create mask values which make the JIT code more complex? The collection of begin characters seems to be unbounded (I didn't see any threshold in the code). Is there a point where this optimization actually doesn't provide any benefit? Or there is always a benefit since the optimization complexity is proportional to the original expression complexity. Given these questions and that the uniq() and merge() functions are both O(N^2) would a set type structure that uses hashing be more appropriate to collect a unique list of character pairs? It seems to me that we might want to keep the begin char checks in expression order to honor programmer intent / knowledge that their first alternative is more likely then the others.
Geoffrey Garen
Comment 8 2010-10-15 15:58:49 PDT
Comment on attachment 69616 [details] proposed patch v5 I think Michael raised enough significant issues here for an r-. >I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure that two UChars fit. Typo: "should be *unsigned*..."
Gavin Barraclough
Comment 9 2010-10-18 16:31:48 PDT
One small issue, the bits of code calling isASCIIAlpha look unicode-unsafe, should be handling unicode case-insensitivity too. My only concerns with this patch would overlap with Michael's - in the implementation of the uniqued set of characters. Another major sticking point here would be the fact the sort algorithm is recursive – we strongly prefer to steer clear of recursion on the stack bounded by input, since this can lead to security vulnerabilities. Fundamentally we already a class to gather a set of characters in YARR – in CharacterClass. If at all possible I think the ideal solution here would be to try to share the implementation with this, since the requirements seem identical (both being a sorted, uniqued set of unicode characters). If the current implementation of CharacterClass is not a perfect fit then you're free to look at generalizing it to make it more suitable. I hope something along these lines will be workable.
Peter Varga
Comment 10 2010-10-21 07:40:02 PDT
Created attachment 71434 [details] proposed patch v6 I was refactoring the patch to solve the mentioned issues: > I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure > that two UChars fit. Done. >Is there a reason why you implemented your own sort() function? It seems that qsort(3) would work. I removed the sort() function. Now I use binary chop in case of adding a BeginChar like in CharacterClassConstructor's addSorted() function. > Will uniq() handle three or more values in a row? I don't think so, but think that changing the if() to a while() > will fix this. Why doesn't uniq() check the mask for equality as well? You were right I removed the uniq() function as well. By the way the checking of mask equality is unnecessary because the characters only get different masks in case of case insensitivity before the merging. > Do you sort in descending order to make the check in merge() work or is there another reason? The sort is important to be able to merge easily and only load a "character pair" from the input string once in an iteration during matching. Else we need to load single characters too. Look at firstSingleCharFound boolean in JIT and Interpreter. > Won't merging create mask values which make the JIT code more complex? The merging reduced the number of branches in the generated JIT code. > The collection of begin characters seems to be unbounded (I didn't see any threshold in the code). > Is there a point where this optimization actually doesn't provide any benefit? > Or there is always a benefit since the optimization complexity is proportional to the original expression complexity. The number of begin characters don't have a treshold. The overhead of the optimization complexity is less than the overhead of the potential backtracks. > Given these questions and that the uniq() and merge() functions are both O(N^2) would a set type structure > that uses hashing be more appropriate to collect a unique list of character pairs? > It seems to me that we might want to keep the begin char checks in expression order to honor programmer intent / knowledge > that their first alternative is more likely then the others. Using a set type structure seems to be too complex for this optimization. Keeping the expression order is unnecessary because this optimization has no effect on the result of matching. > One small issue, the bits of code calling isASCIIAlpha look unicode-unsafe, should be handling unicode case-insensitivity too. The unicode case-insensitivity check is fixed. The solution is based on the CharacterClassConstructor's unicode case-insensitivity check.
Peter Varga
Comment 11 2010-11-04 04:29:29 PDT
ping
Gavin Barraclough
Comment 12 2010-11-16 12:14:00 PST
Comment on attachment 71434 [details] proposed patch v6 Looks fine, cheers Peter!
WebKit Commit Bot
Comment 13 2010-11-16 15:47:42 PST
The commit-queue encountered the following flaky tests while processing attachment 71434 [details]: http/tests/appcache/reload.html java/lc3/JSObject/ToObject-001.html Please file bugs against the tests. These tests were authored by ap@webkit.org. The commit-queue is continuing to process your patch.
WebKit Commit Bot
Comment 14 2010-11-17 01:43:00 PST
Comment on attachment 71434 [details] proposed patch v6 Clearing flags on attachment: 71434 Committed r72180: <http://trac.webkit.org/changeset/72180>
WebKit Commit Bot
Comment 15 2010-11-17 01:43:08 PST
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.