Bug 38142

Summary: RegExp caching
Product: WebKit Reporter: Renata Hodovan <rhodovan.u-szeged>
Component: JavaScriptCoreAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Enhancement CC: barraclough, ggaren, gustavo, laszlo.gombos, oliver, webkit.review.bot, xan.lopez, zherczeg
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: Linux   
Attachments:
Description Flags
RegExp caching
eric: review-
RegExp caching
none
RegExp caching
none
RegExp caching
none
RegExp caching
none
RegExp caching
ggaren: review-
RegExp caching
none
RegExp caching
ggaren: review-
RegExp caching ggaren: review+

Renata Hodovan
Reported 2010-04-26 14:35:58 PDT
Now if we create a new RegExp object in jsc - even if with the same patternString - jsc calls the create and compile methods again and again. We could avoid that with regexp-caching. I have a patch to solve it. I used round robin regex-policy. With this extension the ss runtime results didn't changed significant (8 ** TOTAL **: - 630.7ms +/- 0.4% 628.9ms +/- 0.3%). However by the Windscorpion's ws-email performance test run 2.61x fast (2658ms +/- 2% -> 1016ms +/- 2%). (130128 hit and 6 miss event). No new regression.
Attachments
RegExp caching (13.22 KB, patch)
2010-04-26 14:41 PDT, Renata Hodovan
eric: review-
RegExp caching (16.36 KB, patch)
2010-05-04 04:30 PDT, Renata Hodovan
no flags
RegExp caching (16.37 KB, patch)
2010-05-04 06:12 PDT, Renata Hodovan
no flags
RegExp caching (21.19 KB, patch)
2010-05-16 06:31 PDT, Renata Hodovan
no flags
RegExp caching (22.12 KB, patch)
2010-05-17 06:08 PDT, Renata Hodovan
no flags
RegExp caching (22.31 KB, patch)
2010-05-18 04:59 PDT, Renata Hodovan
ggaren: review-
RegExp caching (22.47 KB, patch)
2010-06-02 03:33 PDT, Renata Hodovan
no flags
RegExp caching (22.33 KB, patch)
2010-06-14 01:03 PDT, Renata Hodovan
ggaren: review-
RegExp caching (22.41 KB, patch)
2010-06-15 06:52 PDT, Renata Hodovan
ggaren: review+
Renata Hodovan
Comment 1 2010-04-26 14:41:46 PDT
Created attachment 54334 [details] RegExp caching
WebKit Review Bot
Comment 2 2010-04-26 14:43:55 PDT
Attachment 54334 [details] did not pass style-queue: Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1 JavaScriptCore/runtime/RegExpCache.h:30: #ifndef header guard has wrong style, please use: RegExpCache_h [build/header_guard] [5] JavaScriptCore/runtime/RegExpCache.h:28: Alphabetical sorting problem. [build/include_order] [4] JavaScriptCore/runtime/RegExpCache.h:35: Code inside a namespace should not be indented. [whitespace/indent] [4] JavaScriptCore/runtime/RegExpCache.h:67: Place brace on its own line for function definitions. [whitespace/braces] [4] JavaScriptCore/runtime/RegExpCache.h:113: One space before end of line comments [whitespace/comments] [5] JavaScriptCore/runtime/RegExpCache.h:115: Should have a space between // and comment [whitespace/comments] [4] JavaScriptCore/runtime/RegExpCache.cpp:27: Found header this file implements before WebCore config.h. Should be: config.h, primary header, blank line, and then alphabetically sorted. [build/include_order] [4] Total errors found: 7 in 9 files If any of these errors are false positives, please file a bug against check-webkit-style.
Geoffrey Garen
Comment 3 2010-04-26 14:59:00 PDT
Do you have data on where ws-email creates regular expressions?
WebKit Review Bot
Comment 4 2010-04-26 18:08:47 PDT
Renata Hodovan
Comment 5 2010-04-26 23:53:16 PDT
(In reply to comment #3) > Do you have data on where ws-email creates regular expressions? ws-email tests random strings whether they are valid e-mail addresses. The neccessery regular expressions are too long, hence they are created from substrings. One such example follows: var specialChars="\\(\\)><@,;:\\\\\\\"\\.\\[\\]"; var validChars="\[^\\s" + specialChars + "\]"; var quotedUser="(\"[^\"]*\")"; var atom=validChars + '+'; var word="(" + atom + "|" + quotedUser + ")"; var userPat=new RegExp("^" + word + "(\\." + word + ")*$"); Every time when the function is called these regular expressions are recreated. This takes a lot of time because regular expressions parsing is expensive. Furthermore there are many fallbacks to PCRE.
Eric Seidel (no email)
Comment 6 2010-05-02 18:57:35 PDT
Comment on attachment 54334 [details] RegExp caching I would have called "get" "lookupOrCreate" to match the Create pattern of the rest of the code. I assume there is no gotchas to making JSGlobalData larger? + RefPtr<RegExp> cacheRegExp; should be cachedRegExp, not "cache" I think it would be cleaner to move all the creation logic (under the if (!cacheRegExp) block) into its own function. Then the function reads: if (!cachedRegExp) return createAndCache(flags, patternString); return cacheRegExp.release() Should be m_ + , ptr(-1) + , isFull(false) This code fails for regexps over maxCacheablePatternLength in length. It should early return creating a new one each time. I would have made a countFlags(flags) method for: + if (flags.find('g') != UString::NotFound) + flagsCnt += 4; + if (flags.find('i') != UString::NotFound) + flagsCnt += 2; + if (flags.find('m') != UString::NotFound) + flagsCnt += 1; Also, countFlags probably goes on RegExpKeyType. Why RegExpKeyType instead of just RegExpKey? I *really* like this idea. I just think the code needs some refinement.
Renata Hodovan
Comment 7 2010-05-04 04:30:47 PDT
Created attachment 55008 [details] RegExp caching
Renata Hodovan
Comment 8 2010-05-04 04:31:25 PDT
> I assume there is no gotchas to making JSGlobalData larger? No, there isn't. > Why RegExpKeyType instead of just RegExpKey? You're right. I modified. > I *really* like this idea. Thanks :)
WebKit Review Bot
Comment 9 2010-05-04 04:36:39 PDT
Attachment 55008 [details] did not pass style-queue: Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1 JavaScriptCore/runtime/RegExpKey.h:67: Place brace on its own line for function definitions. [whitespace/braces] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
Renata Hodovan
Comment 10 2010-05-04 06:12:02 PDT
Created attachment 55013 [details] RegExp caching
WebKit Review Bot
Comment 11 2010-05-09 01:12:22 PDT
Renata Hodovan
Comment 12 2010-05-16 06:31:36 PDT
Created attachment 56186 [details] RegExp caching
Renata Hodovan
Comment 13 2010-05-17 06:08:15 PDT
Created attachment 56236 [details] RegExp caching
Renata Hodovan
Comment 14 2010-05-18 04:59:43 PDT
Created attachment 56362 [details] RegExp caching
Geoffrey Garen
Comment 15 2010-05-18 20:31:32 PDT
Comment on attachment 56362 [details] RegExp caching + RefPtr<RegExp> cachedRegExp; + + if (!flags.isNull()) + cachedRegExp = RegExp::create(m_globalData, patternString, flags); + else + cachedRegExp = RegExp::create(m_globalData, patternString); This code would be much simpler if you changed RegExp::create to always take a flags argument, but do something efficient when it is null. That might be a good thing to do in a future patch. + if (patternString.size() < maxCacheablePatternLength) { + ++m_ptr; + if (m_ptr >= maxCacheableEntries) { + m_ptr = 0; + m_isFull = true; + } + if (m_isFull) + m_cacheMap.remove(RegExpKey(patternKeyArray[m_ptr].flagsValue, patternKeyArray[m_ptr].pattern)); + + RegExpKey key = RegExpKey(flags, patternString); + m_cacheMap.set(key, cachedRegExp); + patternKeyArray[m_ptr].flagsValue = key.flagsValue; + patternKeyArray[m_ptr].pattern = patternString.rep(); + } + return cachedRegExp; The WebKit style for code like this is to check for the failure case and return early. One advantage of this style is that less code ends up indented. So: if (patternString.size() >= maxCacheablePatternLength) return cachedRegExp; I wouldn't call the regexp 'cachedRegExp', nor would I call the function 'createAndCache', since the regexp might not be cached. How about "regeExp" and "create" instead? You define m_isFull as a data member, but only use it locally in createAndCache. I think you can get rid of it entirely, and just rely on "if (m_ptr >= maxCacheableEntries)" in createAndCache. lookupOrCreate and createAndCache combine to do more hash lookups and RegExpKey construction than necessary. Here's what I'd recommend: 1. lookupOrCreate should say "pair <iterator, bool> result = m_cacheMap.add(RegExpKey(flags, patternString), 0)" 2. if result.second is false, there was a pre-existing entry in the map. You can return it. 3. if result.second is true, you've added a 0 to the map, and result.first is an iterator pointing at that zero. You can now use that iterator to add a real value to the map. I think that patternKeyArray is unnecessary complexity. I guess your goal is LRU eviction. However, that's not what you'll get. When the cache is full, you'll evict the oldest item, and whatever item happens to collide with the item you're trying to add. There's no guarantee that evicting the oldest item will solve your collision problem. You could go to greater lengths to implement a strict LRU cache. But what I would recommend is just to evict upon collision. All this would require is changing the call to add() that I suggested into a call to set() in the case where "m_cacheMap.size() > maxCacheableEntries". r- because I think there are still good improvements to be made here.
Renata Hodovan
Comment 16 2010-06-02 03:31:48 PDT
> This code would be much simpler if you changed RegExp::create to always take a flags argument, but do something efficient when it is null. That might be a good thing to do in a future patch. Okay, I'm going to do it. > + if (patternString.size() < maxCacheablePatternLength) { > + ++m_ptr; > + if (m_ptr >= maxCacheableEntries) { > + m_ptr = 0; > + m_isFull = true; > + } > + if (m_isFull) > + m_cacheMap.remove(RegExpKey(patternKeyArray[m_ptr].flagsValue, patternKeyArray[m_ptr].pattern)); > + > + RegExpKey key = RegExpKey(flags, patternString); > + m_cacheMap.set(key, cachedRegExp); > + patternKeyArray[m_ptr].flagsValue = key.flagsValue; > + patternKeyArray[m_ptr].pattern = patternString.rep(); > + } > + return cachedRegExp; > > The WebKit style for code like this is to check for the failure case and return early. One advantage of this style is that less code ends up indented. So: > > if (patternString.size() >= maxCacheablePatternLength) > return cachedRegExp; > > I wouldn't call the regexp 'cachedRegExp', nor would I call the function 'createAndCache', since the regexp might not be cached. How about "regeExp" and "create" instead? Thanks, I've changed it. > You define m_isFull as a data member, but only use it locally in createAndCache. I think you can get rid of it entirely, and just rely on "if (m_ptr >= maxCacheableEntries)" in createAndCache. m_isFull couldn't be local, because we have to know if the map became full and it isn't local inforamtion. But I think the name of the variable was abiguous, so I changed it to m_isFirstIteration. > lookupOrCreate and createAndCache combine to do more hash lookups and RegExpKey construction than necessary. Here's what I'd recommend: > > 1. lookupOrCreate should say "pair <iterator, bool> result = m_cacheMap.add(RegExpKey(flags, patternString), 0)" > 2. if result.second is false, there was a pre-existing entry in the map. You can return it. > 3. if result.second is true, you've added a 0 to the map, and result.first is an iterator pointing at that zero. You can now use that iterator to add a real value to the map. That's right. Modified. > I think that patternKeyArray is unnecessary complexity. I guess your goal is LRU eviction. My implementation is Round Robin, not LRU. I have tried LRU before, but there was no perf. gain, and the code was more complex. However, if you prefer LRU model, I can submit that to the bugzilla.
Renata Hodovan
Comment 17 2010-06-02 03:33:17 PDT
Created attachment 57641 [details] RegExp caching
Renata Hodovan
Comment 18 2010-06-14 01:03:40 PDT
Created attachment 58626 [details] RegExp caching
Oliver Hunt
Comment 19 2010-06-14 10:56:39 PDT
Comment on attachment 58626 [details] RegExp caching > +RegExpCache::RegExpCache(JSGlobalData* globalData) > + : m_globalData(globalData) > + , m_ptr(-1) > + , m_isFirstIteration(true) > + { > + } These braces should not be indented. Any particular reason for a round robin cache vs. lru?
Geoffrey Garen
Comment 20 2010-06-14 11:32:10 PDT
+PassRefPtr<RegExp> RegExpCache::lookupOrCreate(const UString& patternString, const UString& flags) +{ + pair<HashMap<RegExpKey, RefPtr<RegExp> >::iterator, bool> result = m_cacheMap.add(RegExpKey(flags, patternString), 0); + if (patternString.size() < maxCacheablePatternLength && !result.second) + return (*result.first).second; No need to check "patternString.size() < maxCacheablePatternLength" here. If an item is in the cache, it must, by logical deduction, satisfy the constraints of the cache. However, you *should* check "patternString.size() < maxCacheablePatternLength" prior to calling HashMap::add, and avoid calling HashMap::add if the string is too long. One goal of maxCacheablePatternLength, I assume, is to avoid hashing very long strings. The C++ way to write "(*x).y" is "x->y". + ++m_ptr; Let's give this a more descriptive name. How about "m_nextKeyToEvict"? + if (m_ptr >= maxCacheableEntries) { Should be "m_ptr == maxCacheableEntries". m_ptr should never be > maxCacheableEntries. + // After this point the value of m_isFirstIteration will be false. + m_isFirstIteration = false; I think the line of code is sufficient -- no need to add a comment with a duplicate meaning. I guess what this flag really means is, "The cache is full, so a new addition must first evict something old." How about calling this variable "m_isFull", and reversing its value? I think you misunderstood my comment about LRU. I'm not asking you to switch to LRU vs Round Robin. I don't have a strong opinion on an eviction strategy here. r- because of the maxCacheablePatternLength bug, but I think the other suggestions here, and Oliver's suggestion, are important too.
Zoltan Herczeg
Comment 21 2010-06-14 12:40:59 PDT
> I guess what this flag really means is, "The cache is full, so a new addition must first evict something old." How about calling this variable "m_isFull", and reversing its value? :D Geoff, you was the reason why we changed the name to clarify its purpose: "You define m_isFull as a data member, but only use it locally in createAndCache." So, back to the original?
Geoffrey Garen
Comment 22 2010-06-14 12:48:03 PDT
> > I guess what this flag really means is, "The cache is full, so a new addition must first evict something old." How about calling this variable "m_isFull", and reversing its value? > :D Geoff, you was the reason why we changed the name to clarify its purpose: > > "You define m_isFull as a data member, but only use it locally in createAndCache." Indeed, I mistook m_isFull as being used only locally. My bad. But I don't think I suggested renaming it to m_isFirstIteration.
Zoltan Herczeg
Comment 23 2010-06-14 13:10:43 PDT
> But I don't think I suggested renaming it to m_isFirstIteration. I didn't said that. I said we thought m_isFull is misleading (not show its global nature) and changed it to something different. Perhaps that is even more misleading :D
Renata Hodovan
Comment 24 2010-06-15 06:51:07 PDT
Oliver: RR is far less complex than LRU and the gain for LRU is minimal over RR, but if you prefer LRU there is still time to change. Garen: I think, you are right. Your suggestions are done.
Renata Hodovan
Comment 25 2010-06-15 06:52:34 PDT
Created attachment 58772 [details] RegExp caching
Oliver Hunt
Comment 26 2010-06-15 11:12:59 PDT
(In reply to comment #24) > Oliver: > RR is far less complex than LRU and the gain for LRU is minimal over RR, but if you prefer LRU there is still time to change. No need to change, I was just interested in whether it was a conscious decision :D
Geoffrey Garen
Comment 27 2010-06-22 11:41:35 PDT
Comment on attachment 58772 [details] RegExp caching JavaScriptCore/runtime/RegExpCache.cpp:65 + m_cacheMap.set(key, regExp); You could eliminate this hash lookup by passing an iterator from lookupOrCreate. I think that would be a good thing to do in a follow-up patch. r=me
Zoltan Herczeg
Comment 28 2010-06-22 12:18:05 PDT
Landed in http://trac.webkit.org/changeset/61623 If the bots will not complain, I will close the bug soon.
Zoltan Herczeg
Comment 29 2010-06-22 12:53:54 PDT
Buildbots like the patch, even the layout ones. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.