WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
152691
[JSC] Iterating over a Set/Map is too slow
https://bugs.webkit.org/show_bug.cgi?id=152691
Summary
[JSC] Iterating over a Set/Map is too slow
Nikita Vasilyev
Reported
2016-01-04 09:44:21 PST
http://jsperf.com/linked-list-vs-arraw-iteration-speed/3
Iterating over a Set/Map is approximately 12 times slower than iterating over a linked list and over 40 times slower than an array. Can somebody explain to me why is that the case? Can it be improved? In Web Inspector we need to use ordered maps for
Bug 152422
"Web Inspector: WebInspector.Object.addEventListener is O(n), make it O(1)"
Attachments
Patch
(24.38 KB, patch)
2016-01-04 18:17 PST
,
Yusuke Suzuki
no flags
Details
Formatted Diff
Diff
Patch
(49.65 KB, patch)
2016-01-05 08:06 PST
,
Yusuke Suzuki
no flags
Details
Formatted Diff
Diff
test
(10.20 KB, application/x-javascript)
2016-01-05 09:37 PST
,
Saam Barati
no flags
Details
Patch
(53.75 KB, patch)
2016-01-09 10:14 PST
,
Yusuke Suzuki
no flags
Details
Formatted Diff
Diff
Show Obsolete
(2)
View All
Add attachment
proposed patch, testcase, etc.
Yusuke Suzuki
Comment 1
2016-01-04 15:37:31 PST
I guess we can improve this by making Set.prototype.forEach written in JS. Currently, it is implemented in C++, it avoid inlining the given callback.
Blaze Burg
Comment 2
2016-01-04 15:38:30 PST
(In reply to
comment #1
)
> I guess we can improve this by making Set.prototype.forEach written in JS. > Currently, it is implemented in C++, it avoid inlining the given callback.
If it's not too much work, it sounds worth a try!
Saam Barati
Comment 3
2016-01-04 16:48:22 PST
(In reply to
comment #1
)
> I guess we can improve this by making Set.prototype.forEach written in JS. > Currently, it is implemented in C++, it avoid inlining the given callback.
But if you have many calls to Set.prototype.forEach, we're not going to inline the call. Is there a fast way to get an array of the elements in the set? That way we could just iterate over that array?
Blaze Burg
Comment 4
2016-01-04 16:56:08 PST
(In reply to
comment #3
)
> (In reply to
comment #1
) > > I guess we can improve this by making Set.prototype.forEach written in JS. > > Currently, it is implemented in C++, it avoid inlining the given callback. > > But if you have many calls to Set.prototype.forEach, we're not going > to inline the call.
Aww shucks, we happen to do that ;-) Can't DFG create multiple inlined versions of the same function? That would be super useful for lots of megamorphic library methods.
Yusuke Suzuki
Comment 5
2016-01-04 18:13:52 PST
(In reply to
comment #3
)
> (In reply to
comment #1
) > > I guess we can improve this by making Set.prototype.forEach written in JS. > > Currently, it is implemented in C++, it avoid inlining the given callback. > > But if you have many calls to Set.prototype.forEach, we're not going > to inline the call. > > Is there a fast way to get an array of the elements in the set? > That way we could just iterate over that array?
I created the small patch and measure the bottlenecks with Linux perf. The test script,
https://gist.github.com/Constellation/b747d897066ebbc567c8
And I executed the code with `jsc lodash.min.js benchmark.js above-code.js`. And collect the perf result with `perf record jsc lodash.min.js benchmark.js above-code.js`. Set forEach x 13,138 ops/sec ±0.21% (66 runs sampled) And the perf report shows the following. Samples: 23K of event 'cycles', Event count (approx.): 21446074385 34.04% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Interpreter::execute(JSC::CallFrameClosure&) 20.48% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] vmEntryToJavaScript 9.80% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::JITCode::execute(JSC::VM*, JSC::ProtoCallFrame*) 7.95% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::setProtoFuncForEach(JSC::ExecState*) 5.65% jsc perf-22854.map [.] 0x00007f5d2c204a6f 2.76% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::JSLock::currentThreadIsHoldingLock() 0.69% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::FTL::JITCode::executableAddressAtOffset(unsigned long) 0.68% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] pthread_self@plt 0.65% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Interpreter::startSampling() 0.61% jsc libpthread-2.19.so [.] pthread_self 0.60% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Interpreter::stopSampling() 0.60% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC11Interpreter12stopSamplingEv@plt 0.54% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC7JITCode7executeEPNS_2VMEPNS_14ProtoCallFrameE@plt 0.52% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC6JSLock26currentThreadIsHoldingLockEv@plt 0.38% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC11Interpreter7executeERNS_16CallFrameClosureE@plt 0.29% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZN3JSC11Interpreter13startSamplingEv@plt 0.25% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Lexer<unsigned char>::lex(JSC::JSToken*, unsigned int, bool) We can see that C++ -> JS call becomes large performance overhead. So, I've created the small change, adding builtin's Set.prototype.forEach. It results the following, Set forEach x 20,827 ops/sec ±3.69% (40 runs sampled) Samples: 23K of event 'cycles', Event count (approx.): 21255691651 62.91% jsc perf-22890.map [.] 0x00007fd117c0a3b9 24.89% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::privateFuncSetIteratorNext(JSC::ExecState*) 0.29% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::CodeBlock::updateAllPredictionsAndCountLiveness(unsigned int&, unsigned int&) 0.24% jsc [vdso] [.] 0x00000000000008e8 0.22% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::CodeBlock::predictedMachineCodeSize() 0.16% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] WTF::MetaAllocator::currentStatistics() 0.15% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Lexer<unsigned char>::lex(JSC::JSToken*, unsigned int, bool) The one of the bottlenecks was C++ -> JS calls. It can be eliminated and now privateFuncSetIteratorNext (SetIterator's next op) becomes the main contribution. It's still slow. But at least, dropping C++ -> JS calls improve the performance; 1.6x. $ WebKitBuild/set-map-old/Release/bin/jsc tmp/lodash.min.js tmp/benchmark.js tmp/hello.js Linked List x 212,160 ops/sec ±0.39% (66 runs sampled) Array x 376,348 ops/sec ±0.21% (66 runs sampled) Array forEach x 20,054 ops/sec ±4.26% (49 runs sampled) Set forEach x 13,294 ops/sec ±0.20% (67 runs sampled) Fastest is Array $ WebKitBuild/set-map/Release/bin/jsc tmp/lodash.min.js tmp/benchmark.js tmp/hello.js Linked List x 210,151 ops/sec ±0.65% (66 runs sampled) Array x 370,408 ops/sec ±0.51% (66 runs sampled) Array forEach x 20,345 ops/sec ±4.35% (46 runs sampled) Set forEach x 21,007 ops/sec ±3.51% (40 runs sampled) Fastest is Array
Yusuke Suzuki
Comment 6
2016-01-04 18:17:08 PST
Created
attachment 268254
[details]
Patch First attempt, this patch just intends to drop C++ -> JS calls
Yusuke Suzuki
Comment 7
2016-01-04 18:21:34 PST
Comment on
attachment 268254
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=268254&action=review
> Source/JavaScriptCore/builtins/MapPrototype.js:38 > + var value = [ undefined, undefined ];
This is very trickly part. When using the usual generator's interface (next() returns an object), since the implementaiton is currently written in C++, we cannot eliminate the object allocation and it incurs large overhead. So for now, I took this trickly hack. Create the placeholder and pass it.
Yusuke Suzuki
Comment 8
2016-01-04 18:23:46 PST
Comment on
attachment 268254
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=268254&action=review
> Source/JavaScriptCore/runtime/SetPrototype.cpp:191 > + return JSValue::encode(createIteratorResultObject(exec, jsUndefined(), true));
oops. drop this.
Yusuke Suzuki
Comment 9
2016-01-04 18:33:18 PST
Seeing the dump, since the iteration is not enough, it seems that forEach is not DFG compiled. So Array.prototype.forEach and Set.prototype.forEach becomes comparable.
Nikita Vasilyev
Comment 10
2016-01-04 18:38:27 PST
(In reply to
comment #3
)
> ... > Is there a fast way to get an array of the elements in the set? > That way we could just iterate over that array?
Array.from(set) works but it's 3 times slower than Set.prototype.forEach. I updated the jsperf test:
http://jsperf.com/linked-list-vs-arraw-iteration-speed/4
On a related note, `for (item of set)` is also slower than set.forEach:
http://jsperf.com/iterating-over-a-set-vs-iterating-over-an-array/2
Yusuke Suzuki
Comment 11
2016-01-04 19:01:41 PST
(In reply to
comment #10
)
> (In reply to
comment #3
) > > ... > > Is there a fast way to get an array of the elements in the set? > > That way we could just iterate over that array? > > Array.from(set) works but it's 3 times slower than Set.prototype.forEach. > I updated the jsperf test: >
http://jsperf.com/linked-list-vs-arraw-iteration-speed/4
> > On a related note, `for (item of set)` is also slower than set.forEach: >
http://jsperf.com/iterating-over-a-set-vs-iterating-over-an-array/2
https://gist.github.com/Constellation/b747d897066ebbc567c8
[X / _ / X:gpgpu]$ WebKitBuild/set-map/Release/bin/jsc tmp/lodash.min.js tmp/benchmark.js tmp/hello.js Linked List x 211,788 ops/sec ±0.24% (161 runs sampled) Array x 376,235 ops/sec ±0.17% (162 runs sampled) Array forEach x 20,667 ops/sec ±2.42% (149 runs sampled) Array for-of x 16,512 ops/sec ±0.48% (160 runs sampled) Set forEach x 20,365 ops/sec ±1.18% (136 runs sampled) Set for-of x 4,705 ops/sec ±0.47% (122 runs sampled) Fastest is Array It seems that the patched Set.prototype.forEach becomes comparable to Array.prototype.forEach. And Set & for-of is significantly slower than the others. I guess it is due to the object allocation for the iterator result, but anyway, we need to measure the bottleneck.
Yusuke Suzuki
Comment 12
2016-01-04 19:04:03 PST
(In reply to
comment #11
)
> And Set & for-of is significantly slower than the others. I guess it is due > to the object allocation for the iterator result, but anyway, we need to > measure the bottleneck.
Yeah, perf report shows that createIteratorResultObject is the bottleneck for Set & for-of. Samples: 108K of event 'cycles', Event count (approx.): 95529273748 18.02% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::createIteratorResultObject(JSC::ExecState*, JSC::JSValue, bool) ◆ 15.68% jsc jsc [.] JSC::JSObject::putDirect(JSC::VM&, JSC::PropertyName, JSC::JSValue, unsigned int) ▒ 14.18% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::PrototypeMap::emptyObjectStructureForPrototype(JSC::JSObject*, unsigned int) ▒ 13.40% jsc perf-25420.map [.] 0x00007fce158006a1 ▒ 6.79% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::StructureTransitionTable::get(WTF::UniquedStringImpl*, unsigned int) const ▒ 6.64% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] WTF::HashMap<std::pair<WTF::UniquedStringImpl*, unsigned int>, JSC::Weak<JSC::Structure>, JSC::StructureTransitionTable::Hash, WTF::HashTraits<std::pair<WTF::UniquedStringIm▒ 4.94% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::MarkedBlock::sweep(JSC::MarkedBlock::SweepMode) ▒ 4.83% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::SetIteratorPrototypeFuncNext(JSC::ExecState*) ▒ 2.12% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::Structure::addPropertyTransitionToExistingStructureImpl(JSC::Structure*, WTF::UniquedStringImpl*, unsigned int, int&) ▒ 1.69% jsc jsc [.] JSC::Structure::outOfLineCapacity() const ▒ 0.56% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] _ZNK3JSC24StructureTransitionTable3getEPN3WTF17UniquedStringImplEj@plt ▒ 0.52% jsc jsc [.] _ZN3JSC9Structure40addPropertyTransitionToExistingStructureEPS0_NS_12PropertyNameEjRi@plt ▒ 0.52% jsc libjavascriptcoregtk-4.0.so.18.3.1 [.] JSC::MarkedAllocator::allocateSlowCase(unsigned long)
Nikita Vasilyev
Comment 13
2016-01-04 19:18:13 PST
(In reply to
comment #11
)
> (In reply to
comment #10
) > > (In reply to
comment #3
) > > > ... > > > Is there a fast way to get an array of the elements in the set? > > > That way we could just iterate over that array? > > > > Array.from(set) works but it's 3 times slower than Set.prototype.forEach. > > I updated the jsperf test: > >
http://jsperf.com/linked-list-vs-arraw-iteration-speed/4
> > > > On a related note, `for (item of set)` is also slower than set.forEach: > >
http://jsperf.com/iterating-over-a-set-vs-iterating-over-an-array/2
> >
https://gist.github.com/Constellation/b747d897066ebbc567c8
> > [X / _ / X:gpgpu]$ WebKitBuild/set-map/Release/bin/jsc tmp/lodash.min.js > tmp/benchmark.js tmp/hello.js > Linked List x 211,788 ops/sec ±0.24% (161 runs sampled) > Array x 376,235 ops/sec ±0.17% (162 runs sampled) > Array forEach x 20,667 ops/sec ±2.42% (149 runs sampled) > Array for-of x 16,512 ops/sec ±0.48% (160 runs sampled) > Set forEach x 20,365 ops/sec ±1.18% (136 runs sampled) > Set for-of x 4,705 ops/sec ±0.47% (122 runs sampled) > Fastest is Array > > It seems that the patched Set.prototype.forEach becomes comparable to > Array.prototype.forEach. > And Set & for-of is significantly slower than the others. I guess it is due > to the object allocation for the iterator result, but anyway, we need to > measure the bottleneck.
First of all, thanks a lot for looking into this issue! Set forEach is still 10 times slower than Linked List. While an improvement, I'm afraid for my particular task it is still not fast enough. I'm going to use a custom made ordered set that is backed by a Linked List to preserve the order:
https://bugs.webkit.org/show_bug.cgi?id=152422#c33
Yusuke Suzuki
Comment 14
2016-01-04 19:47:31 PST
(In reply to
comment #13
)
> First of all, thanks a lot for looking into this issue! > > Set forEach is still 10 times slower than Linked List. While > an improvement, I'm afraid for my particular task it is still > not fast enough. I'm going to use a custom made ordered set > that is backed by a Linked List to preserve the order: >
https://bugs.webkit.org/show_bug.cgi?id=152422#c33
OK, and thank you for filing this! We should make it as fast as user-implemented one :D
Yusuke Suzuki
Comment 15
2016-01-04 19:52:46 PST
With pre-baked Structure for the iterator result object, the performance is significantly improved. Linked List x 204,093 ops/sec ±0.43% (160 runs sampled) Array x 366,205 ops/sec ±0.43% (160 runs sampled) Array forEach x 20,080 ops/sec ±2.27% (147 runs sampled) Array for-of x 16,234 ops/sec ±1.07% (158 runs sampled) Set forEach x 20,026 ops/sec ±1.18% (136 runs sampled) Set for-of x 12,200 ops/sec ±0.41% (159 runs sampled) Fastest is Array In the long term, we should emit the iterator result in JS to allow object allocation elimination. But optimizing createIteratorResultObject is still useful. After adding the tests, I'll upload the patch.
Yusuke Suzuki
Comment 16
2016-01-05 08:06:51 PST
Created
attachment 268280
[details]
Patch Without xcodeproj change since now I don't have OSX laptop. In weekend, I'll add it.
Saam Barati
Comment 17
2016-01-05 09:28:56 PST
(In reply to
comment #4
)
> (In reply to
comment #3
) > > (In reply to
comment #1
) > > > I guess we can improve this by making Set.prototype.forEach written in JS. > > > Currently, it is implemented in C++, it avoid inlining the given callback. > > > > But if you have many calls to Set.prototype.forEach, we're not going > > to inline the call. > > Aww shucks, we happen to do that ;-) Can't DFG create multiple inlined > versions of the same function? That would be super useful for lots of > megamorphic library methods.
We would definitely inline "forEach" into the call site if it's not too big. But I don't think we'll inline the function passed as an argument into forEach (with sufficiently many calls to forEach with different functions). Hopefully I'm wrong here, but some simple testing could tell us the answer. This would be a useful code pattern to detect inside the DFG's inliner if we don't already.
Saam Barati
Comment 18
2016-01-05 09:37:13 PST
Created
attachment 268288
[details]
test Test program demonstrating that we probably won't inline all calls into the inlined forEach.
Yusuke Suzuki
Comment 19
2016-01-05 09:47:10 PST
(In reply to
comment #18
)
> Created
attachment 268288
[details]
> test > > Test program demonstrating that we probably won't inline all calls > into the inlined forEach.
Yeah. Nice catch! Seeing the perf result, the bottleneck in Set#forEach/Set+for-of compared to Array#forEach/Set+for-of was 1. C++ -> JS calls 2. iterator result object allocations The uploaded patch improves the performance by fixing the above 2 problems and the result of Set#forEach becomes comparable to Array#forEach. But it is still slower than for-iteration. It seems that callback in forEach is not inlined in both (Array#forEach/Set#forEach) cases.
Saam Barati
Comment 20
2016-01-05 10:45:01 PST
(In reply to
comment #18
)
> Created
attachment 268288
[details]
> test > > Test program demonstrating that we probably won't inline all calls > into the inlined forEach.
So the FTL will do this form of inlining. But it relies on the IC from the third tier compilation to give it information about inlining the argument to forEach. It does this because forEach gets inlined into the call site in the third tier, then, after that inlining, we can tell the function passed into forEach is monomorphic for that particular compilation.
Blaze Burg
Comment 21
2016-01-05 10:47:24 PST
(In reply to
comment #20
)
> (In reply to
comment #18
) > > Created
attachment 268288
[details]
> > test > > > > Test program demonstrating that we probably won't inline all calls > > into the inlined forEach. > > So the FTL will do this form of inlining. But it relies on the IC > from the third tier compilation to give it information about inlining > the argument to forEach. It does this because forEach gets inlined into > the call site in the third tier, then, after that inlining, we can tell > the function passed into forEach is monomorphic for that particular > compilation.
Now if only we had a way to visualize compilation levels and depots directly in the Inspector.. :)
Saam Barati
Comment 22
2016-01-05 11:01:27 PST
(In reply to
comment #21
)
> (In reply to
comment #20
) > > (In reply to
comment #18
) > > > Created
attachment 268288
[details]
> > > test > > > > > > Test program demonstrating that we probably won't inline all calls > > > into the inlined forEach. > > > > So the FTL will do this form of inlining. But it relies on the IC > > from the third tier compilation to give it information about inlining > > the argument to forEach. It does this because forEach gets inlined into > > the call site in the third tier, then, after that inlining, we can tell > > the function passed into forEach is monomorphic for that particular > > compilation. > > Now if only we had a way to visualize compilation levels and depots directly > in the Inspector.. :)
I think the slowness here is still mostly because the ES6 iteration protocol.
Yusuke Suzuki
Comment 23
2016-01-07 19:37:53 PST
In this weekend, I'll return home and can touch my OSX laptop. So, I'll add files to xcodeproj and upload the patch.
Yusuke Suzuki
Comment 24
2016-01-09 10:14:32 PST
Created
attachment 268623
[details]
Patch
Saam Barati
Comment 25
2016-01-10 15:30:44 PST
Comment on
attachment 268623
[details]
Patch r=me
WebKit Commit Bot
Comment 26
2016-01-10 21:54:19 PST
Comment on
attachment 268623
[details]
Patch Clearing flags on attachment: 268623 Committed
r194838
: <
http://trac.webkit.org/changeset/194838
>
WebKit Commit Bot
Comment 27
2016-01-10 21:54:25 PST
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug