45748 – Collect the beginning characters in a RegExp pattern for look-up optimization

Bug 45748 - Collect the beginning characters in a RegExp pattern for look-up optimization

Summary: Collect the beginning characters in a RegExp pattern for look-up optimization

Status:	RESOLVED FIXED

Alias:	None

Product:	WebKit
Classification:	Unclassified
Component:	JavaScriptCore (show other bugs)
Version:	528+ (Nightly build)
Hardware:	All All

Importance:	P2 Normal
Assignee:	Peter Varga

URL:
Keywords:

Depends on:
Blocks:	45751
	Show dependency tree / graph

Reported:	2010-09-14 07:24 PDT by Peter Varga
Modified:	2011-05-23 14:23 PDT (History)
CC List:	6 users (show)

See Also:

Attachments
proposed patch (16.75 KB, patch) 2010-09-14 07:29 PDT, Peter Varga	no flags	Details \| Formatted Diff \| Diff
proposed patch v2 (16.86 KB, patch) 2010-09-20 04:10 PDT, Peter Varga	no flags	Details \| Formatted Diff \| Diff
proposed patch v3 (14.10 KB, patch) 2010-09-20 09:18 PDT, Peter Varga	no flags	Details \| Formatted Diff \| Diff
proposed patch v4 (14.27 KB, patch) 2010-10-01 05:30 PDT, Peter Varga	no flags	Details \| Formatted Diff \| Diff
proposed patch v5 (14.28 KB, patch) 2010-10-04 04:40 PDT, Peter Varga	ggaren: review-	Details \| Formatted Diff \| Diff
proposed patch v6 (15.44 KB, patch) 2010-10-21 07:40 PDT, Peter Varga	no flags	Details \| Formatted Diff \| Diff
Show Obsolete (5) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Peter Varga 2010-09-14 07:24:50 PDT

The RegExp pattern matching can be faster if we try to match an array of
possible characters and increment the character index of the input in case of
mismatch instead of starting the whole pattern matching process at the beginning.
For this optimization the YARR's parser needs an algorithm which creates a
list of beginning characters from the parsed tree.

Comment 1 Peter Varga 2010-09-14 07:29:40 PDT

Created attachment 67546 [details]
proposed patch

Comment 2 Peter Varga 2010-09-20 04:10:18 PDT

Created attachment 68069 [details]
proposed patch v2

updated patch for top of trunk

Comment 3 Peter Varga 2010-09-20 09:18:45 PDT

Created attachment 68094 [details]
proposed patch v3

I have attached a new patch for collecting the beginning characters in the parsing phase.
This modification doesn't collect the character classes therefore the overhead of the
collecting is even less but the look-up optimization is less efficient as well.

I separately show the new performance results at JIT (https://bugs.webkit.org/show_bug.cgi?id=45749) and 
Interpreter (https://bugs.webkit.org/show_bug.cgi?id=45751) bugreports.

Comment 4 Peter Varga 2010-10-01 05:30:39 PDT

Created attachment 69455 [details]
proposed patch v4

Restrict the rules of the collecting of beginning characters, in this way some cases where the beginning character
look-up optimization is less effective the collecting can be avoided.

Comment 5 Michael Saboff 2010-10-01 14:08:38 PDT

Note that there are a couple of changes needed to the 45748-4 patch to compile on Mac having to do with unsigned / signed comparisons.  There are three changes
 In JavaScriptCore/yarr/RegexCompiler.cpp near line 686, in setupAlternativeBeginTerms() change numTerms to unsigned.

 In JavaScriptCore/yarr/RegexCompiler.cpp near line 743, insetupBeginChars() change the local size to unsigned.

 JavaScriptCore/yarr/RegexPattern.h near line 359, in BeginCharHelper struct definition change  the return type of the size() method to unsigned.

Comment 6 Peter Varga 2010-10-04 04:40:00 PDT

Created attachment 69616 [details]
proposed patch v5

Mac build fixed.

Comment 7 Michael Saboff 2010-10-15 15:54:08 PDT

Here are some comments/questions about patch V5 (https://bugs.webkit.org/attachment.cgi?id=69616).

I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure that two UChars fit.

I have some questions about the BeginCharHelper methods sort(), uniq() and merge().

Is there a reason why you implemented your own sort() function?  It seems that qsort(3) would work.

Will uniq() handle three or more values in a row?  I don't think so, but think that changing the if() to a while() will fix this.  Why doesn't uniq() check the mask for equality as well? 

Do you sort in descending order to make the check in merge() work or is there another reason?

Won't merging create mask values which make the JIT code more complex?

The collection of begin characters seems to be unbounded (I didn't see any threshold in the code).  Is there a point where this optimization actually doesn't provide any benefit?  Or there is always a benefit since the optimization complexity is proportional to the original expression complexity. 

Given these questions and that the uniq() and merge() functions are both O(N^2) would a set type structure that uses hashing be more appropriate to collect a unique list of character pairs?  It seems to me that we might want to keep the begin char checks in expression order to honor programmer intent / knowledge that their first alternative is more likely then the others.

Comment 8 Geoffrey Garen 2010-10-15 15:58:49 PDT

Comment on attachment 69616 [details]
proposed patch v5

I think Michael raised enough significant issues here for an r-.

>I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure that two UChars fit.

Typo: "should be *unsigned*..."

Comment 9 Gavin Barraclough 2010-10-18 16:31:48 PDT

One small issue, the bits of code calling isASCIIAlpha look unicode-unsafe, should be handling unicode case-insensitivity too.

My only concerns with this patch would overlap with Michael's - in the implementation of the uniqued set of characters.  Another major sticking point here would be the fact the sort algorithm is recursive – we strongly prefer to steer clear of recursion on the stack bounded by input, since this can lead to security vulnerabilities.

Fundamentally we already a class to gather a set of characters in YARR – in CharacterClass.  If at all possible I think the ideal solution here would be to try to share the implementation with this, since the requirements seem identical (both being a sorted, uniqued set of unicode characters).  If the current implementation of CharacterClass is not a perfect fit then you're free to look at generalizing it to make it more suitable.  I hope something along these lines will be workable.

Comment 10 Peter Varga 2010-10-21 07:40:02 PDT

Created attachment 71434 [details]
proposed patch v6

I was refactoring the patch to solve the mentioned issues:

> I think the types for value and mask in BeginChar() should be signed 32 bit values (maybe uint32_t) to ensure 
> that two UChars fit.

Done.

>Is there a reason why you implemented your own sort() function?  It seems that qsort(3) would work.

I removed the sort() function. Now I use binary chop in case of adding a BeginChar like in CharacterClassConstructor's
addSorted() function.

> Will uniq() handle three or more values in a row?  I don't think so, but think that changing the if() to a while() 
> will fix this.  Why doesn't uniq() check the mask for equality as well?

You were right I removed the uniq() function as well.
By the way the checking of mask equality is unnecessary because the characters only get different masks in case of 
case insensitivity before the merging.

> Do you sort in descending order to make the check in merge() work or is there another reason?

The sort is important to be able to merge easily and only load a "character pair" from the input string once in an iteration
during matching. Else we need to load single characters too. Look at firstSingleCharFound boolean in JIT and Interpreter.

> Won't merging create mask values which make the JIT code more complex?

The merging reduced the number of branches in the generated JIT code.

> The collection of begin characters seems to be unbounded (I didn't see any threshold in the code).  
> Is there a point where this optimization actually doesn't provide any benefit?  
> Or there is always a benefit since the optimization complexity is proportional to the original expression complexity. 

The number of begin characters don't have a treshold. The overhead of the optimization complexity is less than 
the overhead of the potential backtracks.

> Given these questions and that the uniq() and merge() functions are both O(N^2) would a set type structure 
> that uses hashing be more appropriate to collect a unique list of character pairs?  
> It seems to me that we might want to keep the begin char checks in expression order to honor programmer intent / knowledge 
> that their first alternative is more likely then the others.

Using a set type structure seems to be too complex for this optimization. Keeping the expression order is unnecessary
because this optimization has no effect on the result of matching.

> One small issue, the bits of code calling isASCIIAlpha look unicode-unsafe, should be handling unicode case-insensitivity too.

The unicode case-insensitivity check is fixed. The solution is based on the CharacterClassConstructor's
unicode case-insensitivity check.

Comment 11 Peter Varga 2010-11-04 04:29:29 PDT

ping

Comment 12 Gavin Barraclough 2010-11-16 12:14:00 PST

Comment on attachment 71434 [details]
proposed patch v6

Looks fine, cheers Peter!

Comment 13 WebKit Commit Bot 2010-11-16 15:47:42 PST

The commit-queue encountered the following flaky tests while processing attachment 71434 [details]:

http/tests/appcache/reload.html
java/lc3/JSObject/ToObject-001.html

Please file bugs against the tests.  These tests were authored by ap@webkit.org.  The commit-queue is continuing to process your patch.

Comment 14 WebKit Commit Bot 2010-11-17 01:43:00 PST

Comment on attachment 71434 [details]
proposed patch v6

Clearing flags on attachment: 71434

Committed r72180: <http://trac.webkit.org/changeset/72180>

Comment 15 WebKit Commit Bot 2010-11-17 01:43:08 PST

All reviewed patches have been landed.  Closing bug.