152691 – [JSC] Iterating over a Set/Map is too slow

Nikita Vasilyev

Reported 2016-01-04 09:44:21 PST

http://jsperf.com/linked-list-vs-arraw-iteration-speed/3 Iterating over a Set/Map is approximately 12 times slower than iterating over a linked list and over 40 times slower than an array. Can somebody explain to me why is that the case? Can it be improved? In Web Inspector we need to use ordered maps for Bug 152422 "Web Inspector: WebInspector.Object.addEventListener is O(n), make it O(1)"

Yusuke Suzuki

Comment 1 2016-01-04 15:37:31 PST

I guess we can improve this by making Set.prototype.forEach written in JS. Currently, it is implemented in C++, it avoid inlining the given callback.

Blaze Burg

Comment 2 2016-01-04 15:38:30 PST

(In reply to comment #1) > I guess we can improve this by making Set.prototype.forEach written in JS. > Currently, it is implemented in C++, it avoid inlining the given callback. If it's not too much work, it sounds worth a try!

Saam Barati

Comment 3 2016-01-04 16:48:22 PST

(In reply to comment #1) > I guess we can improve this by making Set.prototype.forEach written in JS. > Currently, it is implemented in C++, it avoid inlining the given callback. But if you have many calls to Set.prototype.forEach, we're not going to inline the call. Is there a fast way to get an array of the elements in the set? That way we could just iterate over that array?

Blaze Burg

Comment 4 2016-01-04 16:56:08 PST

(In reply to comment #3) > (In reply to comment #1) > > I guess we can improve this by making Set.prototype.forEach written in JS. > > Currently, it is implemented in C++, it avoid inlining the given callback. > > But if you have many calls to Set.prototype.forEach, we're not going > to inline the call. Aww shucks, we happen to do that ;-) Can't DFG create multiple inlined versions of the same function? That would be super useful for lots of megamorphic library methods.

Yusuke Suzuki

Comment 5 2016-01-04 18:13:52 PST

(In reply to comment #3) > (In reply to comment #1) > > I guess we can improve this by making Set.prototype.forEach written in JS. > > Currently, it is implemented in C++, it avoid inlining the given callback. > > But if you have many calls to Set.prototype.forEach, we're not going > to inline the call. > > Is there a fast way to get an array of the elements in the set? > That way we could just iterate over that array? I created the small patch and measure the bottlenecks with Linux perf. The test script, https://gist.github.com/Constellation/b747d897066ebbc567c8 And I executed the code with `jsc lodash.min.js benchmark.js above-code.js`. And collect the perf result with `perf record jsc lodash.min.js benchmark.js above-code.js`. Set forEach x 13,138 ops/sec ±0.21% (66 runs sampled) And the perf report shows the following. Samples: 23K of event 'cycles', Event count (approx.): 21446074385 34.04% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Interpreter::execute(JSC::CallFrameClosure&) 20.48% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] vmEntryToJavaScript 9.80% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::JITCode::execute(JSC::VM*, JSC::ProtoCallFrame*) 7.95% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::setProtoFuncForEach(JSC::ExecState*) 5.65% jsc perf-22854.map [.] 0x00007f5d2c204a6f 2.76% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::JSLock::currentThreadIsHoldingLock() 0.69% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::FTL::JITCode::executableAddressAtOffset(unsigned long) 0.68% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] pthread_self@plt 0.65% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Interpreter::startSampling() 0.61% jsc libpthread-2.19.so [.] pthread_self 0.60% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Interpreter::stopSampling() 0.60% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC11Interpreter12stopSamplingEv@plt 0.54% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC7JITCode7executeEPNS_2VMEPNS_14ProtoCallFrameE@plt 0.52% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC6JSLock26currentThreadIsHoldingLockEv@plt 0.38% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC11Interpreter7executeERNS_16CallFrameClosureE@plt 0.29% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC11Interpreter13startSamplingEv@plt 0.25% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Lexer<unsigned char>::lex(JSC::JSToken*, unsigned int, bool) We can see that C++ -> JS call becomes large performance overhead. So, I've created the small change, adding builtin's Set.prototype.forEach. It results the following, Set forEach x 20,827 ops/sec ±3.69% (40 runs sampled) Samples: 23K of event 'cycles', Event count (approx.): 21255691651 62.91% jsc perf-22890.map [.] 0x00007fd117c0a3b9 24.89% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::privateFuncSetIteratorNext(JSC::ExecState*) 0.29% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::CodeBlock::updateAllPredictionsAndCountLiveness(unsigned int&, unsigned int&) 0.24% jsc [vdso] [.] 0x00000000000008e8 0.22% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::CodeBlock::predictedMachineCodeSize() 0.16% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] WTF::MetaAllocator::currentStatistics() 0.15% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Lexer<unsigned char>::lex(JSC::JSToken*, unsigned int, bool) The one of the bottlenecks was C++ -> JS calls. It can be eliminated and now privateFuncSetIteratorNext (SetIterator's next op) becomes the main contribution. It's still slow. But at least, dropping C++ -> JS calls improve the performance; 1.6x. $ WebKitBuild/set-map-old/Release/bin/jsc tmp/lodash.min.js tmp/benchmark.js tmp/hello.js Linked List x 212,160 ops/sec ±0.39% (66 runs sampled) Array x 376,348 ops/sec ±0.21% (66 runs sampled) Array forEach x 20,054 ops/sec ±4.26% (49 runs sampled) Set forEach x 13,294 ops/sec ±0.20% (67 runs sampled) Fastest is Array $ WebKitBuild/set-map/Release/bin/jsc tmp/lodash.min.js tmp/benchmark.js tmp/hello.js Linked List x 210,151 ops/sec ±0.65% (66 runs sampled) Array x 370,408 ops/sec ±0.51% (66 runs sampled) Array forEach x 20,345 ops/sec ±4.35% (46 runs sampled) Set forEach x 21,007 ops/sec ±3.51% (40 runs sampled) Fastest is Array

Yusuke Suzuki

Comment 6 2016-01-04 18:17:08 PST

Created attachment 268254 [details] Patch First attempt, this patch just intends to drop C++ -> JS calls

Yusuke Suzuki

Comment 7 2016-01-04 18:21:34 PST

Comment on attachment 268254 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=268254&action=review > Source/JavaScriptCore/builtins/MapPrototype.js:38 > + var value = [ undefined, undefined ]; This is very trickly part. When using the usual generator's interface (next() returns an object), since the implementaiton is currently written in C++, we cannot eliminate the object allocation and it incurs large overhead. So for now, I took this trickly hack. Create the placeholder and pass it.

Yusuke Suzuki

Comment 8 2016-01-04 18:23:46 PST

Comment on attachment 268254 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=268254&action=review > Source/JavaScriptCore/runtime/SetPrototype.cpp:191 > + return JSValue::encode(createIteratorResultObject(exec, jsUndefined(), true)); oops. drop this.

Yusuke Suzuki

Comment 9 2016-01-04 18:33:18 PST

Seeing the dump, since the iteration is not enough, it seems that forEach is not DFG compiled. So Array.prototype.forEach and Set.prototype.forEach becomes comparable.

Nikita Vasilyev

Comment 10 2016-01-04 18:38:27 PST

(In reply to comment #3) > ... > Is there a fast way to get an array of the elements in the set? > That way we could just iterate over that array? Array.from(set) works but it's 3 times slower than Set.prototype.forEach. I updated the jsperf test: http://jsperf.com/linked-list-vs-arraw-iteration-speed/4 On a related note, `for (item of set)` is also slower than set.forEach: http://jsperf.com/iterating-over-a-set-vs-iterating-over-an-array/2

Yusuke Suzuki

Comment 11 2016-01-04 19:01:41 PST

(In reply to comment #10) > (In reply to comment #3) > > ... > > Is there a fast way to get an array of the elements in the set? > > That way we could just iterate over that array? > > Array.from(set) works but it's 3 times slower than Set.prototype.forEach. > I updated the jsperf test: > http://jsperf.com/linked-list-vs-arraw-iteration-speed/4 > > On a related note, `for (item of set)` is also slower than set.forEach: > http://jsperf.com/iterating-over-a-set-vs-iterating-over-an-array/2 https://gist.github.com/Constellation/b747d897066ebbc567c8 [X / _ / X:gpgpu]$ WebKitBuild/set-map/Release/bin/jsc tmp/lodash.min.js tmp/benchmark.js tmp/hello.js Linked List x 211,788 ops/sec ±0.24% (161 runs sampled) Array x 376,235 ops/sec ±0.17% (162 runs sampled) Array forEach x 20,667 ops/sec ±2.42% (149 runs sampled) Array for-of x 16,512 ops/sec ±0.48% (160 runs sampled) Set forEach x 20,365 ops/sec ±1.18% (136 runs sampled) Set for-of x 4,705 ops/sec ±0.47% (122 runs sampled) Fastest is Array It seems that the patched Set.prototype.forEach becomes comparable to Array.prototype.forEach. And Set & for-of is significantly slower than the others. I guess it is due to the object allocation for the iterator result, but anyway, we need to measure the bottleneck.

Yusuke Suzuki

Comment 12 2016-01-04 19:04:03 PST

(In reply to comment #11) > And Set & for-of is significantly slower than the others. I guess it is due > to the object allocation for the iterator result, but anyway, we need to > measure the bottleneck. Yeah, perf report shows that createIteratorResultObject is the bottleneck for Set & for-of. Samples: 108K of event 'cycles', Event count (approx.): 95529273748 18.02% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::createIteratorResultObject(JSC::ExecState*, JSC::JSValue, bool) ◆ 15.68% jsc jsc [.] JSC::JSObject::putDirect(JSC::VM&, JSC::PropertyName, JSC::JSValue, unsigned int) ▒ 14.18% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::PrototypeMap::emptyObjectStructureForPrototype(JSC::JSObject*, unsigned int) ▒ 13.40% jsc perf-25420.map [.] 0x00007fce158006a1 ▒ 6.79% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::StructureTransitionTable::get(WTF::UniquedStringImpl*, unsigned int) const ▒ 6.64% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] WTF::HashMap<std::pair<WTF::UniquedStringImpl*, unsigned int>, JSC::Weak<JSC::Structure>, JSC::StructureTransitionTable::Hash, WTF::HashTraits<std::pair<WTF::UniquedStringIm▒ 4.94% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::MarkedBlock::sweep(JSC::MarkedBlock::SweepMode) ▒ 4.83% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::SetIteratorPrototypeFuncNext(JSC::ExecState*) ▒ 2.12% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Structure::addPropertyTransitionToExistingStructureImpl(JSC::Structure*, WTF::UniquedStringImpl*, unsigned int, int&) ▒ 1.69% jsc jsc [.] JSC::Structure::outOfLineCapacity() const ▒ 0.56% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZNK3JSC24StructureTransitionTable3getEPN3WTF17UniquedStringImplEj@plt ▒ 0.52% jsc jsc [.] _ZN3JSC9Structure40addPropertyTransitionToExistingStructureEPS0_NS_12PropertyNameEjRi@plt ▒ 0.52% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::MarkedAllocator::allocateSlowCase(unsigned long)

Nikita Vasilyev

Comment 13 2016-01-04 19:18:13 PST

(In reply to comment #11) > (In reply to comment #10) > > (In reply to comment #3) > > > ... > > > Is there a fast way to get an array of the elements in the set? > > > That way we could just iterate over that array? > > > > Array.from(set) works but it's 3 times slower than Set.prototype.forEach. > > I updated the jsperf test: > > http://jsperf.com/linked-list-vs-arraw-iteration-speed/4 > > > > On a related note, `for (item of set)` is also slower than set.forEach: > > http://jsperf.com/iterating-over-a-set-vs-iterating-over-an-array/2 > > https://gist.github.com/Constellation/b747d897066ebbc567c8 > > [X / _ / X:gpgpu]$ WebKitBuild/set-map/Release/bin/jsc tmp/lodash.min.js > tmp/benchmark.js tmp/hello.js > Linked List x 211,788 ops/sec ±0.24% (161 runs sampled) > Array x 376,235 ops/sec ±0.17% (162 runs sampled) > Array forEach x 20,667 ops/sec ±2.42% (149 runs sampled) > Array for-of x 16,512 ops/sec ±0.48% (160 runs sampled) > Set forEach x 20,365 ops/sec ±1.18% (136 runs sampled) > Set for-of x 4,705 ops/sec ±0.47% (122 runs sampled) > Fastest is Array > > It seems that the patched Set.prototype.forEach becomes comparable to > Array.prototype.forEach. > And Set & for-of is significantly slower than the others. I guess it is due > to the object allocation for the iterator result, but anyway, we need to > measure the bottleneck. First of all, thanks a lot for looking into this issue! Set forEach is still 10 times slower than Linked List. While an improvement, I'm afraid for my particular task it is still not fast enough. I'm going to use a custom made ordered set that is backed by a Linked List to preserve the order: https://bugs.webkit.org/show_bug.cgi?id=152422#c33

Yusuke Suzuki

Comment 14 2016-01-04 19:47:31 PST

(In reply to comment #13) > First of all, thanks a lot for looking into this issue! > > Set forEach is still 10 times slower than Linked List. While > an improvement, I'm afraid for my particular task it is still > not fast enough. I'm going to use a custom made ordered set > that is backed by a Linked List to preserve the order: > https://bugs.webkit.org/show_bug.cgi?id=152422#c33 OK, and thank you for filing this! We should make it as fast as user-implemented one :D

Yusuke Suzuki

Comment 15 2016-01-04 19:52:46 PST

With pre-baked Structure for the iterator result object, the performance is significantly improved. Linked List x 204,093 ops/sec ±0.43% (160 runs sampled) Array x 366,205 ops/sec ±0.43% (160 runs sampled) Array forEach x 20,080 ops/sec ±2.27% (147 runs sampled) Array for-of x 16,234 ops/sec ±1.07% (158 runs sampled) Set forEach x 20,026 ops/sec ±1.18% (136 runs sampled) Set for-of x 12,200 ops/sec ±0.41% (159 runs sampled) Fastest is Array In the long term, we should emit the iterator result in JS to allow object allocation elimination. But optimizing createIteratorResultObject is still useful. After adding the tests, I'll upload the patch.

Yusuke Suzuki

Comment 16 2016-01-05 08:06:51 PST

Created attachment 268280 [details] Patch Without xcodeproj change since now I don't have OSX laptop. In weekend, I'll add it.

Saam Barati

Comment 17 2016-01-05 09:28:56 PST

(In reply to comment #4) > (In reply to comment #3) > > (In reply to comment #1) > > > I guess we can improve this by making Set.prototype.forEach written in JS. > > > Currently, it is implemented in C++, it avoid inlining the given callback. > > > > But if you have many calls to Set.prototype.forEach, we're not going > > to inline the call. > > Aww shucks, we happen to do that ;-) Can't DFG create multiple inlined > versions of the same function? That would be super useful for lots of > megamorphic library methods. We would definitely inline "forEach" into the call site if it's not too big. But I don't think we'll inline the function passed as an argument into forEach (with sufficiently many calls to forEach with different functions). Hopefully I'm wrong here, but some simple testing could tell us the answer. This would be a useful code pattern to detect inside the DFG's inliner if we don't already.

Saam Barati

Comment 18 2016-01-05 09:37:13 PST

Created attachment 268288 [details] test Test program demonstrating that we probably won't inline all calls into the inlined forEach.

Yusuke Suzuki

Comment 19 2016-01-05 09:47:10 PST

(In reply to comment #18) > Created attachment 268288 [details] > test > > Test program demonstrating that we probably won't inline all calls > into the inlined forEach. Yeah. Nice catch! Seeing the perf result, the bottleneck in Set#forEach/Set+for-of compared to Array#forEach/Set+for-of was 1. C++ -> JS calls 2. iterator result object allocations The uploaded patch improves the performance by fixing the above 2 problems and the result of Set#forEach becomes comparable to Array#forEach. But it is still slower than for-iteration. It seems that callback in forEach is not inlined in both (Array#forEach/Set#forEach) cases.

Saam Barati

Comment 20 2016-01-05 10:45:01 PST

(In reply to comment #18) > Created attachment 268288 [details] > test > > Test program demonstrating that we probably won't inline all calls > into the inlined forEach. So the FTL will do this form of inlining. But it relies on the IC from the third tier compilation to give it information about inlining the argument to forEach. It does this because forEach gets inlined into the call site in the third tier, then, after that inlining, we can tell the function passed into forEach is monomorphic for that particular compilation.

Blaze Burg

Comment 21 2016-01-05 10:47:24 PST

(In reply to comment #20) > (In reply to comment #18) > > Created attachment 268288 [details] > > test > > > > Test program demonstrating that we probably won't inline all calls > > into the inlined forEach. > > So the FTL will do this form of inlining. But it relies on the IC > from the third tier compilation to give it information about inlining > the argument to forEach. It does this because forEach gets inlined into > the call site in the third tier, then, after that inlining, we can tell > the function passed into forEach is monomorphic for that particular > compilation. Now if only we had a way to visualize compilation levels and depots directly in the Inspector.. :)

Saam Barati

Comment 22 2016-01-05 11:01:27 PST

(In reply to comment #21) > (In reply to comment #20) > > (In reply to comment #18) > > > Created attachment 268288 [details] > > > test > > > > > > Test program demonstrating that we probably won't inline all calls > > > into the inlined forEach. > > > > So the FTL will do this form of inlining. But it relies on the IC > > from the third tier compilation to give it information about inlining > > the argument to forEach. It does this because forEach gets inlined into > > the call site in the third tier, then, after that inlining, we can tell > > the function passed into forEach is monomorphic for that particular > > compilation. > > Now if only we had a way to visualize compilation levels and depots directly > in the Inspector.. :) I think the slowness here is still mostly because the ES6 iteration protocol.

Yusuke Suzuki

Comment 23 2016-01-07 19:37:53 PST

In this weekend, I'll return home and can touch my OSX laptop. So, I'll add files to xcodeproj and upload the patch.

Yusuke Suzuki

Comment 24 2016-01-09 10:14:32 PST

Created attachment 268623 [details] Patch

Saam Barati

Comment 25 2016-01-10 15:30:44 PST

Comment on attachment 268623 [details] Patch r=me

WebKit Commit Bot

Comment 26 2016-01-10 21:54:19 PST

Comment on attachment 268623 [details] Patch Clearing flags on attachment: 268623 Committed r194838: <http://trac.webkit.org/changeset/194838>

WebKit Commit Bot

Comment 27 2016-01-10 21:54:25 PST

All reviewed patches have been landed. Closing bug.

Attachments
Patch (24.38 KB, patch) 2016-01-04 18:17 PST, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (49.65 KB, patch) 2016-01-05 08:06 PST, Yusuke Suzuki	no flags	Details Formatted Diff Diff
test (10.20 KB, application/x-javascript) 2016-01-05 09:37 PST, Saam Barati	no flags	Details
Patch (53.75 KB, patch) 2016-01-09 10:14 PST, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.