Bug 38142 - RegExp caching
Summary: RegExp caching
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC Linux
: P2 Enhancement
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-04-26 14:35 PDT by Renata Hodovan
Modified: 2010-06-22 12:53 PDT (History)
8 users (show)

See Also:


Attachments
RegExp caching (13.22 KB, patch)
2010-04-26 14:41 PDT, Renata Hodovan
eric: review-
Details | Formatted Diff | Diff
RegExp caching (16.36 KB, patch)
2010-05-04 04:30 PDT, Renata Hodovan
no flags Details | Formatted Diff | Diff
RegExp caching (16.37 KB, patch)
2010-05-04 06:12 PDT, Renata Hodovan
no flags Details | Formatted Diff | Diff
RegExp caching (21.19 KB, patch)
2010-05-16 06:31 PDT, Renata Hodovan
no flags Details | Formatted Diff | Diff
RegExp caching (22.12 KB, patch)
2010-05-17 06:08 PDT, Renata Hodovan
no flags Details | Formatted Diff | Diff
RegExp caching (22.31 KB, patch)
2010-05-18 04:59 PDT, Renata Hodovan
ggaren: review-
Details | Formatted Diff | Diff
RegExp caching (22.47 KB, patch)
2010-06-02 03:33 PDT, Renata Hodovan
no flags Details | Formatted Diff | Diff
RegExp caching (22.33 KB, patch)
2010-06-14 01:03 PDT, Renata Hodovan
ggaren: review-
Details | Formatted Diff | Diff
RegExp caching (22.41 KB, patch)
2010-06-15 06:52 PDT, Renata Hodovan
ggaren: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Renata Hodovan 2010-04-26 14:35:58 PDT
Now if we create a new RegExp object in jsc - even if with the same patternString - jsc calls the create and compile methods again and again.
We could avoid that with regexp-caching. I have a patch to solve it. I used round robin regex-policy. 
With this extension the ss runtime results didn't changed significant (8 ** TOTAL **: - 630.7ms +/- 0.4%   628.9ms +/- 0.3%). However by the Windscorpion's ws-email performance test run 2.61x fast (2658ms +/- 2% -> 1016ms +/- 2%). (130128 hit and 6 miss event).
No new regression.
Comment 1 Renata Hodovan 2010-04-26 14:41:46 PDT
Created attachment 54334 [details]
RegExp caching
Comment 2 WebKit Review Bot 2010-04-26 14:43:55 PDT
Attachment 54334 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/runtime/RegExpCache.h:30:  #ifndef header guard has wrong style, please use: RegExpCache_h  [build/header_guard] [5]
JavaScriptCore/runtime/RegExpCache.h:28:  Alphabetical sorting problem.  [build/include_order] [4]
JavaScriptCore/runtime/RegExpCache.h:35:  Code inside a namespace should not be indented.  [whitespace/indent] [4]
JavaScriptCore/runtime/RegExpCache.h:67:  Place brace on its own line for function definitions.  [whitespace/braces] [4]
JavaScriptCore/runtime/RegExpCache.h:113:  One space before end of line comments  [whitespace/comments] [5]
JavaScriptCore/runtime/RegExpCache.h:115:  Should have a space between // and comment  [whitespace/comments] [4]
JavaScriptCore/runtime/RegExpCache.cpp:27:  Found header this file implements before WebCore config.h. Should be: config.h, primary header, blank line, and then alphabetically sorted.  [build/include_order] [4]
Total errors found: 7 in 9 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 Geoffrey Garen 2010-04-26 14:59:00 PDT
Do you have data on where ws-email creates regular expressions?
Comment 4 WebKit Review Bot 2010-04-26 18:08:47 PDT
Attachment 54334 [details] did not build on gtk:
Build output: http://webkit-commit-queue.appspot.com/results/1853135
Comment 5 Renata Hodovan 2010-04-26 23:53:16 PDT
(In reply to comment #3)
> Do you have data on where ws-email creates regular expressions?

ws-email tests random strings whether they are valid e-mail addresses. The neccessery regular expressions are too long, hence they are created from substrings. One such example follows:

 var specialChars="\\(\\)><@,;:\\\\\\\"\\.\\[\\]";
 var validChars="\[^\\s" + specialChars + "\]";
 var quotedUser="(\"[^\"]*\")";
 var atom=validChars + '+';
 var word="(" + atom + "|" + quotedUser + ")";
 var userPat=new RegExp("^" + word + "(\\." + word + ")*$");

Every time when the function is called these regular expressions are recreated. This takes a lot of time because regular expressions parsing is expensive.
Furthermore there are many fallbacks to PCRE.
Comment 6 Eric Seidel (no email) 2010-05-02 18:57:35 PDT
Comment on attachment 54334 [details]
RegExp caching

I would have called "get" "lookupOrCreate" to match the Create pattern of the rest of the code.

I assume there is no gotchas to making JSGlobalData larger?

+    RefPtr<RegExp> cacheRegExp;
should be cachedRegExp, not "cache"

I think it would be cleaner to move all the creation logic (under the if (!cacheRegExp) block) into its own function.  Then the function reads:

if (!cachedRegExp)
     return createAndCache(flags, patternString);
return cacheRegExp.release()

Should be m_
+    , ptr(-1) 
+    , isFull(false) 

This code fails for regexps over maxCacheablePatternLength in length.  It should early return creating a new one each time.

I would have made a countFlags(flags) method for:
+    if (flags.find('g') != UString::NotFound)
+        flagsCnt += 4;
+    if (flags.find('i') != UString::NotFound)
+        flagsCnt += 2;
+    if (flags.find('m') != UString::NotFound)
+        flagsCnt += 1;

Also, countFlags probably goes on RegExpKeyType.

Why RegExpKeyType instead of just RegExpKey?

I *really* like this idea.  I just think the code needs some refinement.
Comment 7 Renata Hodovan 2010-05-04 04:30:47 PDT
Created attachment 55008 [details]
RegExp caching
Comment 8 Renata Hodovan 2010-05-04 04:31:25 PDT
> I assume there is no gotchas to making JSGlobalData larger?
No, there isn't.

> Why RegExpKeyType instead of just RegExpKey?
You're right. I modified.
 
> I *really* like this idea.
Thanks :)
Comment 9 WebKit Review Bot 2010-05-04 04:36:39 PDT
Attachment 55008 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/runtime/RegExpKey.h:67:  Place brace on its own line for function definitions.  [whitespace/braces] [4]
Total errors found: 1 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 10 Renata Hodovan 2010-05-04 06:12:02 PDT
Created attachment 55013 [details]
RegExp caching
Comment 11 WebKit Review Bot 2010-05-09 01:12:22 PDT
Attachment 55013 [details] did not build on win:
Build output: http://webkit-commit-queue.appspot.com/results/2214087
Comment 12 Renata Hodovan 2010-05-16 06:31:36 PDT
Created attachment 56186 [details]
RegExp caching
Comment 13 Renata Hodovan 2010-05-17 06:08:15 PDT
Created attachment 56236 [details]
RegExp caching
Comment 14 Renata Hodovan 2010-05-18 04:59:43 PDT
Created attachment 56362 [details]
RegExp caching
Comment 15 Geoffrey Garen 2010-05-18 20:31:32 PDT
Comment on attachment 56362 [details]
RegExp caching

+    RefPtr<RegExp> cachedRegExp;
+
+    if (!flags.isNull())
+        cachedRegExp = RegExp::create(m_globalData, patternString, flags);
+    else
+        cachedRegExp = RegExp::create(m_globalData, patternString);

This code would be much simpler if you changed RegExp::create to always take a flags argument, but do something efficient when it is null. That might be a good thing to do in a future patch.


+    if (patternString.size() < maxCacheablePatternLength) {
+        ++m_ptr;
+        if (m_ptr >= maxCacheableEntries) {
+            m_ptr = 0;
+            m_isFull = true;
+        }
+        if (m_isFull)
+            m_cacheMap.remove(RegExpKey(patternKeyArray[m_ptr].flagsValue, patternKeyArray[m_ptr].pattern));
+
+        RegExpKey key = RegExpKey(flags, patternString);
+        m_cacheMap.set(key, cachedRegExp);
+        patternKeyArray[m_ptr].flagsValue = key.flagsValue;
+        patternKeyArray[m_ptr].pattern = patternString.rep();
+    }
+    return cachedRegExp;

The WebKit style for code like this is to check for the failure case and return early. One advantage of this style is that less code ends up indented. So:

if (patternString.size() >= maxCacheablePatternLength)
    return cachedRegExp;

I wouldn't call the regexp 'cachedRegExp', nor would I call the function 'createAndCache', since the regexp might not be cached. How about "regeExp" and "create" instead?

You define m_isFull as a data member, but only use it locally in createAndCache. I think you can get rid of it entirely, and just rely on "if (m_ptr >= maxCacheableEntries)" in createAndCache.

lookupOrCreate and createAndCache combine to do more hash lookups and RegExpKey construction than necessary. Here's what I'd recommend:

1. lookupOrCreate should say "pair <iterator, bool> result = m_cacheMap.add(RegExpKey(flags, patternString), 0)"
2. if result.second is false, there was a pre-existing entry in the map. You can return it.
3. if result.second is true, you've added a 0 to the map, and result.first is an iterator pointing at that zero. You can now use that iterator to add a real value to the map.

I think that patternKeyArray is unnecessary complexity. I guess your goal is LRU eviction. However, that's not what you'll get. When the cache is full, you'll evict the oldest item, and whatever item happens to collide with the item you're trying to add. There's no guarantee that evicting the oldest item will solve your collision problem.

You could go to greater lengths to implement a strict LRU cache. But what I would recommend is just to evict upon collision. All this would require is changing the call to add() that I suggested into a call to set() in the case where "m_cacheMap.size() > maxCacheableEntries".

r- because I think there are still good improvements to be made here.
Comment 16 Renata Hodovan 2010-06-02 03:31:48 PDT
> This code would be much simpler if you changed RegExp::create to always take a flags argument, but do something efficient when it is null. That might be a good thing to do in a future patch.
Okay, I'm going to do it.

> +    if (patternString.size() < maxCacheablePatternLength) {
> +        ++m_ptr;
> +        if (m_ptr >= maxCacheableEntries) {
> +            m_ptr = 0;
> +            m_isFull = true;
> +        }
> +        if (m_isFull)
> +            m_cacheMap.remove(RegExpKey(patternKeyArray[m_ptr].flagsValue, patternKeyArray[m_ptr].pattern));
> +
> +        RegExpKey key = RegExpKey(flags, patternString);
> +        m_cacheMap.set(key, cachedRegExp);
> +        patternKeyArray[m_ptr].flagsValue = key.flagsValue;
> +        patternKeyArray[m_ptr].pattern = patternString.rep();
> +    }
> +    return cachedRegExp;
> 
> The WebKit style for code like this is to check for the failure case and return early. One advantage of this style is that less code ends up indented. So:
> 
> if (patternString.size() >= maxCacheablePatternLength)
>     return cachedRegExp;
> 
> I wouldn't call the regexp 'cachedRegExp', nor would I call the function 'createAndCache', since the regexp might not be cached. How about "regeExp" and "create" instead?
Thanks, I've changed it.
 
> You define m_isFull as a data member, but only use it locally in createAndCache. I think you can get rid of it entirely, and just rely on "if (m_ptr >= maxCacheableEntries)" in createAndCache.
m_isFull couldn't be local, because we have to know if the map became full and it isn't local inforamtion. But I think the name of the variable was abiguous, so I changed it to m_isFirstIteration.
 
> lookupOrCreate and createAndCache combine to do more hash lookups and RegExpKey construction than necessary. Here's what I'd recommend:
> 
> 1. lookupOrCreate should say "pair <iterator, bool> result = m_cacheMap.add(RegExpKey(flags, patternString), 0)"
> 2. if result.second is false, there was a pre-existing entry in the map. You can return it.
> 3. if result.second is true, you've added a 0 to the map, and result.first is an iterator pointing at that zero. You can now use that iterator to add a real value to the map.
That's right. Modified.

> I think that patternKeyArray is unnecessary complexity. I guess your goal is LRU eviction.
My implementation is Round Robin, not LRU. I have tried LRU before, but there was no perf. gain, and the code was more complex. However, if you prefer LRU model, I can submit that to the bugzilla.
Comment 17 Renata Hodovan 2010-06-02 03:33:17 PDT
Created attachment 57641 [details]
RegExp caching
Comment 18 Renata Hodovan 2010-06-14 01:03:40 PDT
Created attachment 58626 [details]
RegExp caching
Comment 19 Oliver Hunt 2010-06-14 10:56:39 PDT
Comment on attachment 58626 [details]
RegExp caching


> +RegExpCache::RegExpCache(JSGlobalData* globalData)
> +    : m_globalData(globalData)
> +    , m_ptr(-1)
> +    , m_isFirstIteration(true)
> +    {
> +    }
These braces should not be indented.

Any particular reason for a round robin cache vs. lru?
Comment 20 Geoffrey Garen 2010-06-14 11:32:10 PDT
+PassRefPtr<RegExp> RegExpCache::lookupOrCreate(const UString& patternString, const UString& flags)
+{
+    pair<HashMap<RegExpKey, RefPtr<RegExp> >::iterator, bool> result = m_cacheMap.add(RegExpKey(flags, patternString), 0);
+    if (patternString.size() < maxCacheablePatternLength && !result.second)
+        return (*result.first).second;

No need to check "patternString.size() < maxCacheablePatternLength" here. If an item is in the cache, it must, by logical deduction, satisfy the constraints of the cache.

However, you *should* check "patternString.size() < maxCacheablePatternLength" prior to calling HashMap::add, and avoid calling HashMap::add if the string is too long. One goal of maxCacheablePatternLength, I assume, is to avoid hashing very long strings. 

The C++ way to write "(*x).y" is "x->y".

+    ++m_ptr;

Let's give this a more descriptive name. How about "m_nextKeyToEvict"?

+    if (m_ptr >= maxCacheableEntries) {

Should be "m_ptr == maxCacheableEntries". m_ptr should never be > maxCacheableEntries.

+        // After this point the value of m_isFirstIteration will be false.
+        m_isFirstIteration = false;

I think the line of code is sufficient -- no need to add a comment with a duplicate meaning.

I guess what this flag really means is, "The cache is full, so a new addition must first evict something old." How about calling this variable "m_isFull", and reversing its value?

I think you misunderstood my comment about LRU. I'm not asking you to switch to LRU vs Round Robin. I don't have a strong opinion on an eviction strategy here.

r- because of the maxCacheablePatternLength bug, but I think the other suggestions here, and Oliver's suggestion, are important too.
Comment 21 Zoltan Herczeg 2010-06-14 12:40:59 PDT
> I guess what this flag really means is, "The cache is full, so a new addition must first evict something old." How about calling this variable "m_isFull", and reversing its value?

:D Geoff, you was the reason why we changed the name to clarify its purpose:

"You define m_isFull as a data member, but only use it locally in createAndCache."

So, back to the original?
Comment 22 Geoffrey Garen 2010-06-14 12:48:03 PDT
> > I guess what this flag really means is, "The cache is full, so a new addition must first evict something old." How about calling this variable "m_isFull", and reversing its value?
> :D Geoff, you was the reason why we changed the name to clarify its purpose:
> 
> "You define m_isFull as a data member, but only use it locally in createAndCache."

Indeed, I mistook m_isFull as being used only locally. My bad.

But I don't think I suggested renaming it to m_isFirstIteration.
Comment 23 Zoltan Herczeg 2010-06-14 13:10:43 PDT
> But I don't think I suggested renaming it to m_isFirstIteration.

I didn't said that. I said we thought m_isFull is misleading (not show its global nature) and changed it to something different. Perhaps that is even more misleading :D
Comment 24 Renata Hodovan 2010-06-15 06:51:07 PDT
Oliver:
RR is far less complex than LRU and the gain for LRU is minimal over RR, but if you prefer LRU there is still time to change.

Garen:
I think, you are right. Your suggestions are done.
Comment 25 Renata Hodovan 2010-06-15 06:52:34 PDT
Created attachment 58772 [details]
RegExp caching
Comment 26 Oliver Hunt 2010-06-15 11:12:59 PDT
(In reply to comment #24)
> Oliver:
> RR is far less complex than LRU and the gain for LRU is minimal over RR, but if you prefer LRU there is still time to change.
No need to change, I was just interested in whether it was a conscious decision :D
Comment 27 Geoffrey Garen 2010-06-22 11:41:35 PDT
Comment on attachment 58772 [details]
RegExp caching

JavaScriptCore/runtime/RegExpCache.cpp:65
 +      m_cacheMap.set(key, regExp);
You could eliminate this hash lookup by passing an iterator from lookupOrCreate. I think that would be a good thing to do in a follow-up patch.

r=me
Comment 28 Zoltan Herczeg 2010-06-22 12:18:05 PDT
Landed in http://trac.webkit.org/changeset/61623
If the bots will not complain, I will close the bug soon.
Comment 29 Zoltan Herczeg 2010-06-22 12:53:54 PDT
Buildbots like the patch, even the layout ones.

Closing bug.