Bug 41238 - RegExp performance slow on Dromaeo benchmark
Summary: RegExp performance slow on Dromaeo benchmark
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: Mac OS X 10.6
: P3 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-06-25 17:21 PDT by Michael Saboff
Modified: 2010-06-29 15:01 PDT (History)
2 users (show)

See Also:


Attachments
The patch (4.98 KB, patch)
2010-06-28 15:04 PDT, Michael Saboff
darin: review-
Details | Formatted Diff | Diff
Updated patch. (8.31 KB, patch)
2010-06-28 18:54 PDT, Michael Saboff
darin: review-
Details | Formatted Diff | Diff
Latest revised patch (7.31 KB, patch)
2010-06-29 11:32 PDT, Michael Saboff
darin: review+
darin: commit-queue-
Details | Formatted Diff | Diff
Patch with updated ChangeLog and minor fix. (5.82 KB, patch)
2010-06-29 14:27 PDT, Michael Saboff
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Saboff 2010-06-25 17:21:21 PDT
Other javascript engines appear to cache prior results of regular expression operations.

Suggest adding some sort of caching mechanism to regular expression processing.
Comment 1 Alexey Proskuryakov 2010-06-27 14:40:28 PDT
See also: bug 37908, bug 38828 (I don't know to what degree they are related, if any).
Comment 2 Michael Saboff 2010-06-28 15:04:31 PDT
Created attachment 59949 [details]
The patch
Comment 3 WebKit Review Bot 2010-06-28 15:09:00 PDT
Attachment 59949 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/runtime/UString.h:163:  One space before end of line comments  [whitespace/comments] [5]
JavaScriptCore/ChangeLog:7:  Line contains tab character.  [whitespace/tab] [5]
JavaScriptCore/runtime/RegExp.cpp:129:  One line control clauses should not use braces.  [whitespace/braces] [4]
JavaScriptCore/runtime/RegExp.cpp:131:  An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement.  [readability/control_flow] [4]
Total errors found: 4 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 4 Darin Adler 2010-06-28 15:20:38 PDT
Comment on attachment 59949 [details]
The patch

This looks great!

> +    , m_lastMatchStr(UString::null())

This shouldn't be needed. A UString will start out null.

> +    , m_lastMatchStart(-1)
>  {
> +    m_lastOVector.resize(0);

This shouldn't be needed. The vector should start out empty.

> +    , m_lastMatchStr(UString::null())

This shouldn't be needed. A UString will start out null.

> +    , m_lastMatchStart(-1)
>  {
> +    m_lastOVector.resize(0);

This shouldn't be needed. The vector should start out empty.

> +        m_lastMatchStr = UString::null();

I think this should just be:

    m_lastMatchStr = UString();

You should check which is more efficient, but I think my form is.

> +        m_lastOVector.resize(0);

It's more efficient to call shrink(0) than resize(0).

> +    // Check to see if this match call is the same as the last match invocation

We normally use sentence style for comments, with a period.

> +    if ((startOffset == m_lastMatchStart) && (s.rep() == m_lastMatchStr.rep())) {

We normally

> +        if (ovector) {
> +            (Vector<int, 32>&)*ovector = (Vector<int, 32>&)m_lastOVector;
> +        }

I don't understand why you need these typecasts. They should not be needed.

Also, we don't use braces around single line if statement bodies. It should be:

    if (ovector)
        *ovector = m_lastOVector;

> +        if (m_lastOVector.isEmpty())
> +            return -1;
> +        else
> +            return m_lastOVector.at(0);

We normally don't do else after return, so the else should be removed.

> +        if (ovector)
> +            (Vector<int, 32>&)m_lastOVector = (Vector<int, 32>&)*ovector;
> +        else
> +            (Vector<int, 32>&)m_lastOVector = (Vector<int, 32>&)nonReturnedOvector;

Same comments about typecasts.

>          unsigned m_numSubpatterns;
> +        UString m_lastMatchStr;

We normally try to avoid abbreviations. So I suggest you either call this m_lastMatchString or m_lastMatch and leave out the "str".

>      ALWAYS_INLINE bool operator==(const UString& s1, const UString& s2)
>      {
>          unsigned size = s1.size();
> +        
> +        if (size != s2.size())      // Sizes not the same, we're done.
> +            return false;
> +            
> +        if (s1.data() == s2.data()) // Quick check of data pointers being the same.
> +            return true;

Another way to do this optimization is to check if s1.rep() is == s2.rep(). That would be more efficient than the repeated calls to data() this function has. And I think it should be relatively rare to have the two data pointers to match if they are not the exact same rep. You could put the rep check before the size check, in fact.

review- because I think you should do at least one of the things I suggested above
Comment 5 Gavin Barraclough 2010-06-28 16:04:59 PDT
Comment on attachment 59949 [details]
The patch

Hey Michael,

As discussed:

* There is a tab in the changelog,
* Please remove the initialization of m_lastMatchStr & the call to resize on m_lastOVector from the RegExp constructor.
* Please remove redundant braces & else in the code checking the cache.
* Re the UString change, instrumenting & checking if a comparison of the rep()s rather than a compare of data() sounded like a good plan.

As discussed, this code can be made much cleaner by both removing the output param & never resizing the array to zero – but I agree with you that this would be better done as a separate patch.

Also, one other thing I missed,

> Are all the (Vector<int, 32>&) casts necessary? - could you check if these are still needed? – if the coding style requires a static_cast<> here instead of a C-style case (which could silently be doing nasties like casting away constness).

Oh, a second last thought! - I'm tempted to suggest ASSERTs in the switch statements, where you have removed the checks of size – these may just make it more obvious to someone else coming to the code why the lack of check is safe.  What do you think?

r-ing this for now, please attach a new patch with the fixes above.
cheers, G.
Comment 6 Michael Saboff 2010-06-28 18:54:37 PDT
Created attachment 59974 [details]
Updated patch.

Patch with updates in response to review comments.
Comment 7 WebKit Review Bot 2010-06-29 10:47:45 PDT
Attachment 59974 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/runtime/UString.h:166:  One space before end of line comments  [whitespace/comments] [5]
JavaScriptCore/runtime/UString.h:175:  One space before end of line comments  [whitespace/comments] [5]
JavaScriptCore/runtime/UString.h:178:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
JavaScriptCore/runtime/UString.h:197:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 4 in 5 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 8 Darin Adler 2010-06-29 10:51:32 PDT
Comment on attachment 59974 [details]
Updated patch.

Looks great.

> +        Refactored Vector<>::operator=() method into a separate copy() method 
> +        that it calls.  The copy() method is explicitly used for deep copying.

I don't understand this change. Why can't we use the assignment operator directly?

One small problem I see with adding a copy function is that in WebKit functions with copy in their name are typically functions that return a copy of an object or some sort of new object. So using the name copy for a function with no return value that copies data into an object is a bit non-standard and thus possibly confusing.

> +    m_lastMatchString = UString::null();
> +    m_lastMatchStart = -1;
> +    m_lastOVector.resize(0);

This site needs the UString() and shrink(0) changes we made earlier in the same file.

>      template<typename T, size_t inlineCapacity>
>      Vector<T, inlineCapacity>& Vector<T, inlineCapacity>::operator=(const Vector<T, inlineCapacity>& other)
>      {
> +        if (&other != this)
> +            copy(other);
> +        
> +        return *this;
> +    }

I think this should be marked inline. We don't want to add another level of function call here.

> +    template<typename T, size_t inlineCapacity> template<size_t otherCapacity>
> +    void Vector<T, inlineCapacity>::copy(const Vector<T, otherCapacity>& other)
> +    {
>          if (&other == this)
> -            return *this;
> +            return;

Both the copy function and the assignment operator are doing this check. That seems like overkill to me unless there's a reason I'm missing this needs to be done.

review- because I don't want to add this copy function unless there's a good reason to do so, and the patch doesn't give a reason.
Comment 9 Michael Saboff 2010-06-29 11:32:06 PDT
Created attachment 60039 [details]
Latest revised patch

Eliminated Vector.h changes.  Cleaned up two review bot issues.
Comment 10 WebKit Review Bot 2010-06-29 11:33:50 PDT
Attachment 60039 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/runtime/UString.h:178:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
JavaScriptCore/runtime/UString.h:197:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 2 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 11 Darin Adler 2010-06-29 11:40:53 PDT
Comment on attachment 60039 [details]
Latest revised patch

ChangeLog was mismerged here. This removed some change log entries. Because of that, I set commit-queue- on this patch.

> +        if (size1 == 0)
> +            return true;

One strange thing about WebKit style is that we write such this as:

    if (!size1)
        return true;

And the style bot is telling you about that.

> -            return s2.size() == size && memcmp(s1.data(), s2.data(), size * sizeof(UChar)) == 0;
> +            return memcmp(d1, d2, size1 * sizeof(UChar)) == 0;

The style bot has the same complaint on the existing code on this line (not something you added). I personally think memcmp is an exception. I find it easier to read calls to memcmp and strcmp if there is an "== 0" to mean "==" or "!= 0" to mean "!=" or even "< 0" to mean "<".

r=me as is if we fix the change log before committing, but you could consider using !size1 to make the style bot happy
Comment 12 Michael Saboff 2010-06-29 14:27:49 PDT
Created attachment 60054 [details]
Patch with updated ChangeLog and minor fix.

ChangeLog and style bot fixes.
Comment 13 WebKit Review Bot 2010-06-29 14:33:46 PDT
Attachment 60054 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/runtime/UString.h:197:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 1 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 14 WebKit Commit Bot 2010-06-29 15:01:39 PDT
Comment on attachment 60054 [details]
Patch with updated ChangeLog and minor fix.

Clearing flags on attachment: 60054

Committed r62148: <http://trac.webkit.org/changeset/62148>
Comment 15 WebKit Commit Bot 2010-06-29 15:01:44 PDT
All reviewed patches have been landed.  Closing bug.