Bug 45748

Summary:

Collect the beginning characters in a RegExp pattern for look-up optimization

Product:

WebKit

Reporter:

Peter Varga <pvarga>

Component:

JavaScriptCore

Assignee:

Peter Varga <pvarga>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abecsi, barraclough, commit-queue, ggaren, msaboff, zherczeg

Priority:

Version:

528+ (Nightly build)

Hardware:

All

OS:

All

Bug Depends on:

Bug Blocks:

45751

Attachments:

Description	Flags
proposed patch	none
proposed patch v2	none
proposed patch v3	none
proposed patch v4	none
proposed patch v5	ggaren: review-
proposed patch v6	none

Peter Varga

Reported 2010-09-14 07:24:50 PDT

The RegExp pattern matching can be faster if we try to match an array of possible characters and increment the character index of the input in case of mismatch instead of starting the whole pattern matching process at the beginning. For this optimization the YARR's parser needs an algorithm which creates a list of beginning characters from the parsed tree.

Attachments
proposed patch (16.75 KB, patch) 2010-09-14 07:29 PDT, Peter Varga	no flags	Details Formatted Diff Diff
proposed patch v2 (16.86 KB, patch) 2010-09-20 04:10 PDT, Peter Varga	no flags	Details Formatted Diff Diff
proposed patch v3 (14.10 KB, patch) 2010-09-20 09:18 PDT, Peter Varga	no flags	Details Formatted Diff Diff
proposed patch v4 (14.27 KB, patch) 2010-10-01 05:30 PDT, Peter Varga	no flags	Details Formatted Diff Diff
proposed patch v5 (14.28 KB, patch) 2010-10-04 04:40 PDT, Peter Varga	ggaren: review-	Details Formatted Diff Diff
proposed patch v6 (15.44 KB, patch) 2010-10-21 07:40 PDT, Peter Varga	no flags	Details Formatted Diff Diff
Show Obsolete (5) View All Add attachment proposed patch, testcase, etc.

Peter Varga

Comment 1 2010-09-14 07:29:40 PDT

Created attachment 67546 [details] proposed patch

Peter Varga

Comment 2 2010-09-20 04:10:18 PDT

Created attachment 68069 [details] proposed patch v2 updated patch for top of trunk

Peter Varga

Comment 3 2010-09-20 09:18:45 PDT

Created attachment 68094 [details] proposed patch v3 I have attached a new patch for collecting the beginning characters in the parsing phase. This modification doesn't collect the character classes therefore the overhead of the collecting is even less but the look-up optimization is less efficient as well. I separately show the new performance results at JIT (https://bugs.webkit.org/show_bug.cgi?id=45749) and Interpreter (https://bugs.webkit.org/show_bug.cgi?id=45751) bugreports.

Peter Varga

Comment 4 2010-10-01 05:30:39 PDT

Created attachment 69455 [details] proposed patch v4 Restrict the rules of the collecting of beginning characters, in this way some cases where the beginning character look-up optimization is less effective the collecting can be avoided.

Michael Saboff

Comment 5 2010-10-01 14:08:38 PDT

Note that there are a couple of changes needed to the 45748-4 patch to compile on Mac having to do with unsigned / signed comparisons. There are three changes In JavaScriptCore/yarr/RegexCompiler.cpp near line 686, in setupAlternativeBeginTerms() change numTerms to unsigned. In JavaScriptCore/yarr/RegexCompiler.cpp near line 743, insetupBeginChars() change the local size to unsigned. JavaScriptCore/yarr/RegexPattern.h near line 359, in BeginCharHelper struct definition change the return type of the size() method to unsigned.

Peter Varga

Comment 6 2010-10-04 04:40:00 PDT

Created attachment 69616 [details] proposed patch v5 Mac build fixed.

Michael Saboff

Comment 7 2010-10-15 15:54:08 PDT

Here are some comments/questions about patch V5 (https://bugs.webkit.org/attachment.cgi?id=69616). I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure that two UChars fit. I have some questions about the BeginCharHelper methods sort(), uniq() and merge(). Is there a reason why you implemented your own sort() function? It seems that qsort(3) would work. Will uniq() handle three or more values in a row? I don't think so, but think that changing the if() to a while() will fix this. Why doesn't uniq() check the mask for equality as well? Do you sort in descending order to make the check in merge() work or is there another reason? Won't merging create mask values which make the JIT code more complex? The collection of begin characters seems to be unbounded (I didn't see any threshold in the code). Is there a point where this optimization actually doesn't provide any benefit? Or there is always a benefit since the optimization complexity is proportional to the original expression complexity. Given these questions and that the uniq() and merge() functions are both O(N^2) would a set type structure that uses hashing be more appropriate to collect a unique list of character pairs? It seems to me that we might want to keep the begin char checks in expression order to honor programmer intent / knowledge that their first alternative is more likely then the others.

Geoffrey Garen

Comment 8 2010-10-15 15:58:49 PDT

Comment on attachment 69616 [details] proposed patch v5 I think Michael raised enough significant issues here for an r-. >I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure that two UChars fit. Typo: "should be *unsigned*..."

Gavin Barraclough

Comment 9 2010-10-18 16:31:48 PDT

One small issue, the bits of code calling isASCIIAlpha look unicode-unsafe, should be handling unicode case-insensitivity too. My only concerns with this patch would overlap with Michael's - in the implementation of the uniqued set of characters. Another major sticking point here would be the fact the sort algorithm is recursive – we strongly prefer to steer clear of recursion on the stack bounded by input, since this can lead to security vulnerabilities. Fundamentally we already a class to gather a set of characters in YARR – in CharacterClass. If at all possible I think the ideal solution here would be to try to share the implementation with this, since the requirements seem identical (both being a sorted, uniqued set of unicode characters). If the current implementation of CharacterClass is not a perfect fit then you're free to look at generalizing it to make it more suitable. I hope something along these lines will be workable.

Peter Varga

Comment 10 2010-10-21 07:40:02 PDT

Created attachment 71434 [details] proposed patch v6 I was refactoring the patch to solve the mentioned issues: > I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure > that two UChars fit. Done. >Is there a reason why you implemented your own sort() function? It seems that qsort(3) would work. I removed the sort() function. Now I use binary chop in case of adding a BeginChar like in CharacterClassConstructor's addSorted() function. > Will uniq() handle three or more values in a row? I don't think so, but think that changing the if() to a while() > will fix this. Why doesn't uniq() check the mask for equality as well? You were right I removed the uniq() function as well. By the way the checking of mask equality is unnecessary because the characters only get different masks in case of case insensitivity before the merging. > Do you sort in descending order to make the check in merge() work or is there another reason? The sort is important to be able to merge easily and only load a "character pair" from the input string once in an iteration during matching. Else we need to load single characters too. Look at firstSingleCharFound boolean in JIT and Interpreter. > Won't merging create mask values which make the JIT code more complex? The merging reduced the number of branches in the generated JIT code. > The collection of begin characters seems to be unbounded (I didn't see any threshold in the code). > Is there a point where this optimization actually doesn't provide any benefit? > Or there is always a benefit since the optimization complexity is proportional to the original expression complexity. The number of begin characters don't have a treshold. The overhead of the optimization complexity is less than the overhead of the potential backtracks. > Given these questions and that the uniq() and merge() functions are both O(N^2) would a set type structure > that uses hashing be more appropriate to collect a unique list of character pairs? > It seems to me that we might want to keep the begin char checks in expression order to honor programmer intent / knowledge > that their first alternative is more likely then the others. Using a set type structure seems to be too complex for this optimization. Keeping the expression order is unnecessary because this optimization has no effect on the result of matching. > One small issue, the bits of code calling isASCIIAlpha look unicode-unsafe, should be handling unicode case-insensitivity too. The unicode case-insensitivity check is fixed. The solution is based on the CharacterClassConstructor's unicode case-insensitivity check.

Peter Varga

Comment 11 2010-11-04 04:29:29 PDT

ping

Gavin Barraclough

Comment 12 2010-11-16 12:14:00 PST

Comment on attachment 71434 [details] proposed patch v6 Looks fine, cheers Peter!

WebKit Commit Bot

Comment 13 2010-11-16 15:47:42 PST

The commit-queue encountered the following flaky tests while processing attachment 71434 [details]: http/tests/appcache/reload.html java/lc3/JSObject/ToObject-001.html Please file bugs against the tests. These tests were authored by ap@webkit.org. The commit-queue is continuing to process your patch.

WebKit Commit Bot

Comment 14 2010-11-17 01:43:00 PST

Comment on attachment 71434 [details] proposed patch v6 Clearing flags on attachment: 71434 Committed r72180: <http://trac.webkit.org/changeset/72180>

WebKit Commit Bot

Comment 15 2010-11-17 01:43:08 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.