176222 – [DFG] Consider increasing the number of DFG worklist threads

RESOLVED FIXED 176222

https://bugs.webkit.org/show_bug.cgi?id=176222

Yusuke Suzuki

Reported 2017-09-01 04:03:42 PDT

I found that the DFG worklist is always configured with a single thread in Options.h. Since DFG compilation takes somewhat longer than Baseline, it is common for DFG work to get stuck in the queue. What do you think of increasing the number of threads for the DFG worklist? Since the DFG and FTL worklists use AutomaticThread, these threads are destroyed when they are not in use.

For example, I found that Octane zlib spends too much time in a8#Baseline. This is purely because the DFG compilation queue is stuck. Adding one more thread to the DFG worklist significantly improves the zlib result.

                 baseline                  patched
zlib  x2   482.32825+-6.07640   ^   408.66072+-14.03856   ^ definitely 1.1803x faster

Sampling result.

With 1 thread:

Sampling rate: 1000 microseconds
Hottest bytecodes as <numSamples 'functionName#hash:JITType:bytecodeIndex'>
    46 'a8#C9umOn:Baseline:1838'
    23 'ArrayBuffer#<nil>:None:<nil>'
    20 'a6#A8KgFt:FTL:786'
    18 'eval#<nil>:None:<nil>'
    16 'a6#A8KgFt:FTL:1984'
    10 'a1#CutRMq:Baseline:52645'
     9 'a1#CutRMq:Baseline:49419'
     9 'a2#BZ2PSk:Baseline:719'
     8 'a1#CutRMq:Baseline:50767'
     8 'a6#A8KgFt:FTL:1994'

With 2 threads:

Sampling rate: 1000 microseconds
Hottest bytecodes as <numSamples 'functionName#hash:JITType:bytecodeIndex'>
    29 'a1#CutRMq:Baseline:52645'
    25 'ArrayBuffer#<nil>:None:<nil>'
    20 'a6#A8KgFt:FTL:786'
    18 'eval#<nil>:None:<nil>'
    17 'a1#CutRMq:Baseline:49134'
    15 'a6#A8KgFt:FTL:1984'
    14 'a1#CutRMq:Baseline:50767'
    14 'a1#CutRMq:Baseline:49419'
     9 'a1#CutRMq:Baseline:52644'
     9 'a2#BZ2PSk:Baseline:719'
     5 'a6#A8KgFt:FTL:756'
     5 'tearDownZlib#A5Gpvi:LLInt:14'

Compile time report. DFG for a1# takes too much time and prevents a8# from being compiled.

Optimized l#ESs9VR:[0x7f69f4e8e7f0->0x7f69f4e729e0, LLIntFunctionCall, 2409] with Baseline JIT into 45152 bytes in 0.796875 ms.
Optimized a2#BZ2PSk:[0x7f69f4e4a5a0->0x7f69f4e673e0, LLIntFunctionCall, 808] with Baseline JIT into 15136 bytes in 0.340088 ms.
Optimized l#ESs9VR:[0x7f69f4e48790->0x7f69f4e8e7f0->0x7f69f4e729e0, NoneFunctionCall, 2409] using DFGMode with DFG into 13792 bytes in 7.436035 ms.
Optimized a2#BZ2PSk:[0x7f69f4e4a7f0->0x7f69f4e4a5a0->0x7f69f4e673e0, NoneFunctionCall, 808] using DFGMode with DFG into 2176 bytes in 2.496094 ms.
Optimized bm#DOYUng:[0x7f69f4e4aee0->0x7f69f4e66760, LLIntFunctionCall, 296] with Baseline JIT into 8416 bytes in 0.138916 ms.
Optimized bm#DOYUng:[0x7f69f4e4b130->0x7f69f4e4aee0->0x7f69f4e66760, NoneFunctionCall, 296] using DFGMode with DFG into 1440 bytes in 0.972168 ms.
Optimized bh#BlYBBE:[0x7f69f4e4bcc0->0x7f69f4e66a80, LLIntFunctionCall, 2636] with Baseline JIT into 56704 bytes in 0.601074 ms.
Optimized bn#ELS1xv:[0x7f69f4e4b5d0->0x7f69f4e666c0, LLIntFunctionCall, 364] with Baseline JIT into 9568 bytes in 0.124756 ms.
Optimized bn#ELS1xv:[0x7f69f4e240a0->0x7f69f4e4b5d0->0x7f69f4e666c0, NoneFunctionCall, 364] using DFGMode with DFG into 1600 bytes in 1.081055 ms.
Optimized a6#A8KgFt:[0x7f69f4e242f0->0x7f69f4e67160, LLIntFunctionCall, 2201] with Baseline JIT into 48512 bytes in 0.547852 ms.
Optimized a8#C9umOn:[0x7f69f4e4b820->0x7f69f4e67020, LLIntFunctionCall, 8048] with Baseline JIT into 174368 bytes in 0.941162 ms.
Optimized a6#A8KgFt:[0x7f69f4e24540->0x7f69f4e242f0->0x7f69f4e67160, NoneFunctionCall, 2201] using DFGMode with DFG into 5760 bytes in 6.226074 ms.
Optimized a8#C9umOn:[0x7f69f4e24790->0x7f69f4e4b820->0x7f69f4e67020, NoneFunctionCall, 8048] using DFGMode with DFG into 12192 bytes in 13.553955 ms.
Optimized a3#Cd4San:[0x7f69f4e4ba70->0x7f69f4e67340, LLIntFunctionCall, 2826] with Baseline JIT into 62016 bytes in 0.406006 ms.
Optimized a3#Cd4San:[0x7f69f4e249e0->0x7f69f4e4ba70->0x7f69f4e67340, NoneFunctionCall, 2826] using DFGMode with DFG into 6976 bytes in 5.264893 ms.
Optimized bb#B2qBXo:[0x7f69f4e250d0->0x7f69f4e66e40, LLIntFunctionCall, 8884] with Baseline JIT into 197824 bytes in 1.996094 ms.
Optimized be#AUXYYm:[0x7f69f4e25320->0x7f69f4e66c60, LLIntFunctionCall, 829] with Baseline JIT into 19712 bytes in 0.245117 ms.
Optimized bd#B8xz5T:[0x7f69f4e25570->0x7f69f4e66d00, LLIntFunctionCall, 5218] with Baseline JIT into 117024 bytes in 1.013916 ms.
Optimized bc#EH8dff:[0x7f69f4e257c0->0x7f69f4e66da0, LLIntFunctionCall, 4558] with Baseline JIT into 102112 bytes in 0.533203 ms.
Optimized bh#BlYBBE:[0x7f69f4e24c30->0x7f69f4e4bcc0->0x7f69f4e66a80, NoneFunctionCall, 2636] using DFGMode with DFG into 5984 bytes in 6.409180 ms.
Optimized a7#EU5CMD:[0x7f69f4e4b380->0x7f69f4e670c0, LLIntFunctionCall, 1645] with Baseline JIT into 39008 bytes in 0.395020 ms.
Optimized bc#EH8dff:[0x7f69f4e25a10->0x7f69f4e257c0->0x7f69f4e66da0, NoneFunctionCall, 4558] using DFGMode with DFG into 9664 bytes in 6.846924 ms.
Optimized bj#DiUrAc:[0x7f69f4e26350->0x7f69f4e66940, LLIntFunctionCall, 4466] with Baseline JIT into 104096 bytes in 0.672119 ms.
Optimized a1#CutRMq:[0x7f69f4e4aa40->0x7f69f4e67480, LLIntFunctionCall, 61380] with Baseline JIT into 1300832 bytes in 14.934082 ms.
Optimized bb#B2qBXo:[0x7f69f4e25c60->0x7f69f4e250d0->0x7f69f4e66e40, NoneFunctionCall, 8884] using DFGMode with DFG into 20320 bytes in 21.968994 ms.
Optimized bj#DiUrAc:[0x7f69f4e26ee0->0x7f69f4e26350->0x7f69f4e66940, NoneFunctionCall, 4466] using DFGMode with DFG into 13120 bytes in 11.560059 ms.
Optimized bk#AB4XcH:[0x7f69f4e489e0->0x7f69f4e668a0, LLIntFunctionCall, 21639] with Baseline JIT into 502080 bytes in 4.144043 ms.
Optimized a6#A8KgFt:[0x7f69f4e26c90->0x7f69f4e242f0->0x7f69f4e67160, NoneFunctionCall, 2201] using FTLMode with FTL into 3008 bytes in 36.634033 ms (DFG: 22.869141, B3: 13.764893).
Optimized bn#ELS1xv:[0x7f69f4e27a70->0x7f69f4e4b5d0->0x7f69f4e666c0, NoneFunctionCall, 364] using FTLMode with FTL into 928 bytes in 5.938965 ms (DFG: 3.736084, B3: 2.202881).
Optimized a3#Cd4San:[0x7f69f4e27cc0->0x7f69f4e4ba70->0x7f69f4e67340, NoneFunctionCall, 2826] using FTLMode with FTL into 4224 bytes in 42.736084 ms (DFG: 25.821045, B3: 16.915039).
Optimized bf#AVXdLH:[0x7f69f4e4ac90->0x7f69f4e66bc0, LLIntFunctionCall, 88] with Baseline JIT into 1824 bytes in 0.065186 ms.
Optimized bg#BK1nBq:[0x7f69f4e25eb0->0x7f69f4e66b20, LLIntFunctionCall, 9043] with Baseline JIT into 201536 bytes in 1.428955 ms.
Optimized bm#DOYUng:[0x7f69d88e4540->0x7f69f4e4aee0->0x7f69f4e66760, NoneFunctionCall, 296] using FTLMode with FTL into 928 bytes in 5.255859 ms (DFG: 3.124023, B3: 2.131836).
Optimized ba#DFv2Af:[0x7f69f4e24e80->0x7f69f4e66ee0, LLIntFunctionCall, 5455] with Baseline JIT into 118848 bytes in 1.045898 ms.
Optimized a1#CutRMq:[0x7f69f4e27130->0x7f69f4e4aa40->0x7f69f4e67480, NoneFunctionCall, 61380] using DFGMode with DFG into 124480 bytes in 474.189209 ms.
Optimized bc#EH8dff:[0x7f69f4e27380->0x7f69f4e257c0->0x7f69f4e66da0, NoneFunctionCall, 4558 (DidTryToEnterInLoop)] using DFGMode with DFG into 9696 bytes in 4.393066 ms.
Optimized a8#C9umOn:[0x7f69f4e275d0->0x7f69f4e4b820->0x7f69f4e67020, NoneFunctionCall, 8048 (DidTryToEnterInLoop)] using DFGMode with DFG into 16864 bytes in 7.958984 ms.
Optimized be#AUXYYm:[0x7f69f4e27820->0x7f69f4e25320->0x7f69f4e66c60, NoneFunctionCall, 829] using DFGMode with DFG into 2400 bytes in 1.468994 ms.
Optimized bd#B8xz5T:[0x7f69d88e40a0->0x7f69f4e25570->0x7f69f4e66d00, NoneFunctionCall, 5218] using DFGMode with DFG into 11520 bytes in 5.637695 ms.
Optimized a7#EU5CMD:[0x7f69d88e42f0->0x7f69f4e4b380->0x7f69f4e670c0, NoneFunctionCall, 1645] using DFGMode with DFG into 4928 bytes in 1.091064 ms.
Optimized bf#AVXdLH:[0x7f69d88e4790->0x7f69f4e4ac90->0x7f69f4e66bc0, NoneFunctionCall, 88] using DFGMode with DFG into 864 bytes in 0.181885 ms.
Optimized _llvm_bswap_i32#BhzUu3:[0x7f69f4e265a0->0x7f69f4e73980, LLIntFunctionCall, 66] with Baseline JIT into 2496 bytes in 0.056152 ms.
Optimized bh#BlYBBE:[0x7f69d88e4c30->0x7f69f4e4bcc0->0x7f69f4e66a80, NoneFunctionCall, 2636] using FTLMode with FTL into 2784 bytes in 20.094971 ms (DFG: 11.422119, B3: 8.672852).
Optimized a8#C9umOn:[0x7f69d88e49e0->0x7f69f4e4b820->0x7f69f4e67020, NoneFunctionCall, 8048 (DidTryToEnterInLoop)] using FTLMode with FTL into 8544 bytes in 94.714844 ms (DFG: 68.500000, B3: 26.214844).
Optimized bc#EH8dff:[0x7f69d88e50d0->0x7f69f4e257c0->0x7f69f4e66da0, NoneFunctionCall, 4558 (DidTryToEnterInLoop)] using FTLMode with FTL into 3936 bytes in 20.987061 ms (DFG: 12.393311, B3: 8.593750).
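
The queueing effect described in this report can be illustrated with a toy model. This is only a sketch and not JavaScriptCore code: it assumes a FIFO worklist whose plans are handed to whichever worker frees up first. The three compile durations are taken from the log above (the a1# DFG compile and the two DidTryToEnterInLoop DFG compiles that complete right after it); the enqueue times are hypothetical, since the log records durations only.

```python
import heapq

def simulate_fifo(plans, num_threads):
    """Return {name: finish_ms} for plans served FIFO by num_threads workers.

    plans: list of (name, enqueue_ms, compile_ms), sorted by enqueue_ms.
    """
    # Min-heap of the times at which each worker next becomes free.
    free_at = [0.0] * num_threads
    heapq.heapify(free_at)
    finish = {}
    for name, enqueue_ms, compile_ms in plans:
        worker_free = heapq.heappop(free_at)
        # A plan cannot start before it is enqueued or before a worker is free.
        start = max(worker_free, enqueue_ms)
        end = start + compile_ms
        finish[name] = end
        heapq.heappush(free_at, end)
    return finish

# Durations are from the compile time report; enqueue times are made up.
plans = [
    ("a1", 0.0, 474.189209),  # super heavy DFG plan
    ("bc", 1.0, 4.393066),
    ("a8", 2.0, 7.958984),
]

one_thread = simulate_fifo(plans, 1)
two_threads = simulate_fifo(plans, 2)
print(f"a8 DFG code ready, 1 thread:  {one_thread['a8']:.2f} ms")
print(f"a8 DFG code ready, 2 threads: {two_threads['a8']:.2f} ms")
```

Under these assumptions, a8# waits behind the whole a1# compile when there is one worker, and only behind bc# when there are two. The model ignores everything else the engine does (tier-up thresholds, FTL plans, GC), so it shows the shape of the problem rather than predicting benchmark numbers.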

Attachments
Patch (3.01 KB, patch) 2017-09-01 04:42 PDT, Yusuke Suzuki, no flags

Yusuke Suzuki

Comment 1 2017-09-01 04:13:59 PDT

BTW, a1# takes too long to compile in DFG. The log follows.

DFG(Plan) compiling a1#CutRMq:[0x7fd966b37820->0x7fd966b4ac90->0x7fd966b67480, NoneFunctionCall, 61380] with DFGMode, number of instructions = 61380
Phase live catch variable preservation phase took 0.0000 ms
Phase CPS rethreading took 51.5908 ms
Phase unification took 14.2761 ms
Phase prediction injection took 0.0378 ms
Phase static execution count estimation took 1.9519 ms
Phase backwards propagation took 2.3188 ms
Phase prediction propagation took 4.6748 ms
Phase fixup took 5.0251 ms
Phase invalidation point injection took 1.4810 ms
Phase structure check hoisting took 2.5803 ms
Phase strength reduction took 1.1443 ms
Phase CPS rethreading took 0.0000 ms
Phase control flow analysis took 83.9871 ms
Phase constant folding took 27.8821 ms
Phase CFG simplification took 2.1240 ms
Phase local common subexpression elimination took 8.1282 ms
Phase CPS rethreading took 55.1277 ms
Phase varargs forwarding took 0.2300 ms
Phase control flow analysis took 63.8518 ms
Phase constant folding took 0.0420 ms
Phase tier-up check injection took 0.0020 ms
Phase fast store barrier insertion took 4.7820 ms
Phase store barrier fencing took 1.4021 ms
Phase clean up took 0.5210 ms
Phase CPS rethreading took 0.0000 ms
Phase dead code elimination took 22.2881 ms
Phase phantom insertion took 6.8411 ms
Phase stack layout took 0.5432 ms
Phase virtual register allocation took 0.4551 ms
Phase watchpoint collection took 0.1611 ms
Optimized a1#CutRMq:[0x7fd966b37820->0x7fd966b4ac90->0x7fd966b67480, NoneFunctionCall, 61380] using DFGMode with DFG into 124576 bytes in 447.346924 ms.
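
To see which phases dominate the a1# compile, the per-phase timings in this comment can be summed by phase name. The following is a small sketch for reading the log, not part of the patch; the regular expression and the helper name `total_by_phase` are assumptions made for this illustration, and the timings are copied from the log above.

```python
import re
from collections import defaultdict

# Per-phase timings copied from the DFG log for a1#CutRMq.
PHASE_LOG = """\
Phase live catch variable preservation phase took 0.0000 ms
Phase CPS rethreading took 51.5908 ms
Phase unification took 14.2761 ms
Phase prediction injection took 0.0378 ms
Phase static execution count estimation took 1.9519 ms
Phase backwards propagation took 2.3188 ms
Phase prediction propagation took 4.6748 ms
Phase fixup took 5.0251 ms
Phase invalidation point injection took 1.4810 ms
Phase structure check hoisting took 2.5803 ms
Phase strength reduction took 1.1443 ms
Phase CPS rethreading took 0.0000 ms
Phase control flow analysis took 83.9871 ms
Phase constant folding took 27.8821 ms
Phase CFG simplification took 2.1240 ms
Phase local common subexpression elimination took 8.1282 ms
Phase CPS rethreading took 55.1277 ms
Phase varargs forwarding took 0.2300 ms
Phase control flow analysis took 63.8518 ms
Phase constant folding took 0.0420 ms
Phase tier-up check injection took 0.0020 ms
Phase fast store barrier insertion took 4.7820 ms
Phase store barrier fencing took 1.4021 ms
Phase clean up took 0.5210 ms
Phase CPS rethreading took 0.0000 ms
Phase dead code elimination took 22.2881 ms
Phase phantom insertion took 6.8411 ms
Phase stack layout took 0.5432 ms
Phase virtual register allocation took 0.4551 ms
Phase watchpoint collection took 0.1611 ms
"""

PHASE_RE = re.compile(r"^Phase (.+) took ([0-9.]+) ms$", re.MULTILINE)

def total_by_phase(log):
    """Sum the reported time of every phase, merging repeated runs by name."""
    totals = defaultdict(float)
    for name, ms in PHASE_RE.findall(log):
        totals[name] += float(ms)
    return dict(totals)

totals = total_by_phase(PHASE_LOG)
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
listed_total = sum(totals.values())

for name, ms in ranked[:5]:
    print(f"{name:40s} {ms:9.4f} ms")
print(f"{'sum of listed phases':40s} {listed_total:9.4f} ms")
```

Merged this way, the two runs of control flow analysis and the runs of CPS rethreading together account for roughly 254.6 ms of the 447.3 ms reported for the plan. The listed phases add up to less than the reported total; the log does not itemize the remainder.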

Yusuke Suzuki

Comment 2 2017-09-01 04:42:30 PDT

Created attachment 319592 [details] Patch

Yusuke Suzuki

Comment 3 2017-09-01 21:17:51 PDT

Comment on attachment 319592 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=319592&action=review

> Source/JavaScriptCore/ChangeLog:23
> +        change significantly improves Octane/zlib performance.

Let me note some more rationale for this choice.

One design worth considering to alleviate this situation is making DFG compilation interruptible. For example, once a compilation takes longer than 30 ms, DFG stops the plan and enqueues it at the end of the compilation queue, to be resumed later. But this does not solve the problem well: the compilation queue still gets stuck repeatedly on the super heavy DFG plan described above.

I think adding one more thread is the reasonable solution here. While FTL gets many threads (~8), DFG has only 1 thread. This makes DFG the bottleneck in getting hot JS compiled through the FTL pipeline. Of course, even with 2 threads, DFG may get stuck. But that becomes less likely, since it requires two super heavy DFG compilations at the same time. I'm not sure whether 2 is enough here, but in any case one thread is too few for DFG.
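
The interruptible design mentioned in this comment can be sketched as a single worker that pauses any plan after a fixed quantum and puts it back at the end of the queue. This is a toy model only, not how DFG plans work or could be made to work: the function name, the plan tuples, and the enqueue times are hypothetical, and it assumes a plan can be paused and resumed at no cost. The durations are DFG compile times from the compile time report in this bug.

```python
from collections import deque

def simulate_time_sliced(plans, quantum_ms):
    """Single worker; a plan that exceeds quantum_ms is paused and re-enqueued.

    plans: list of (name, enqueue_ms, compile_ms), sorted by enqueue_ms.
    Returns {name: finish_ms}.
    """
    pending = deque(plans)  # plans that have not been enqueued yet
    queue = deque()         # (name, remaining_ms) waiting for the worker
    now = 0.0
    finish = {}

    def admit_arrivals():
        # Move every plan whose enqueue time has passed into the queue.
        while pending and pending[0][1] <= now:
            name, _, cost = pending.popleft()
            queue.append((name, cost))

    while pending or queue:
        admit_arrivals()
        if not queue:
            now = pending[0][1]  # worker is idle until the next arrival
            continue
        name, remaining = queue.popleft()
        run = min(remaining, quantum_ms)
        now += run
        # Plans that arrived during this slice go ahead of the paused plan.
        admit_arrivals()
        if remaining > run:
            queue.append((name, remaining - run))
        else:
            finish[name] = now
    return finish

# Durations are from the compile time report; enqueue times are made up.
plans = [
    ("a1", 0.0, 474.189209),  # super heavy DFG plan
    ("bc", 1.0, 4.393066),
    ("a8", 2.0, 7.958984),
]

sliced = simulate_time_sliced(plans, 30.0)
unsliced = simulate_time_sliced(plans, float("inf"))
print(f"a8 done with 30 ms quantum: {sliced['a8']:.2f} ms")
print(f"a8 done without preemption: {unsliced['a8']:.2f} ms")
```

In this model, a plan that arrives while the heavy plan holds the thread still waits for the rest of the current 30 ms slice, and the same wait recurs on every later slice until the heavy plan finishes. A second worker avoids that wait as long as only one heavy plan is in flight.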

Saam Barati

Comment 4 2017-09-04 14:58:45 PDT

Comment on attachment 319592 [details]
Patch

Let's give it a try. I'll keep an eye on benchmarks that may be sensitive to the increased compilation load.

Yusuke Suzuki

Comment 5 2017-09-04 16:42:48 PDT

Comment on attachment 319592 [details]
Patch

OK, let's check it!

WebKit Commit Bot

Comment 6 2017-09-04 17:11:53 PDT

Comment on attachment 319592 [details]
Patch

Clearing flags on attachment: 319592

Committed r221597: <http://trac.webkit.org/changeset/221597>

WebKit Commit Bot

Comment 7 2017-09-04 17:11:54 PDT

All reviewed patches have been landed. Closing bug.

Yusuke Suzuki

Comment 8 2017-09-04 20:04:35 PDT

https://arewefastyet.com/#machine=29&view=single&suite=octane&subtest=zlib

Octane/zlib has improved on arewefastyet.

Saam Barati

Comment 9 2017-09-14 15:49:39 PDT

This bug or https://bugs.webkit.org/show_bug.cgi?id=170007 may have regressed ARES-6 by 3%. Can you look into it?

Radar WebKit Bug Importer

Comment 10 2017-09-27 13:00:42 PDT

<rdar://problem/34694474>


Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Yusuke Suzuki

Reported

2017-09-01 04:03 PDT

Modified

2017-09-27 13:00 PDT

CC List

12 users

Keywords InRadar
