Bug 176222 - [DFG] Consider increasing the number of DFG worklist threads
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2017-09-01 04:03 PDT by Yusuke Suzuki
Modified: 2017-09-27 13:00 PDT

See Also:


Attachments
Patch (3.01 KB, patch)
2017-09-01 04:42 PDT, Yusuke Suzuki

Description Yusuke Suzuki 2017-09-01 04:03:42 PDT
I found that the DFG worklist always has exactly one thread under the Options.h configuration.
Since DFG compilation takes somewhat longer than Baseline compilation, DFG work often gets stuck in the queue.

What do you think about increasing the number of threads for the DFG worklist?
Since the DFG and FTL worklists use AutomaticThread, these threads are destroyed when they are not in use.

For example, I found that Octane zlib spends too much time in the Baseline code for a8#. This is purely because the DFG compilation queue is stuck.
Adding one more thread to the DFG worklist significantly improves the zlib result.

                            baseline                  patched                                      

zlib           x2     482.32825+-6.07640    ^   408.66072+-14.03856      ^ definitely 1.1803x faster
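
The queue-stall effect described above can be illustrated with a toy model. This is only a sketch, not JavaScriptCore code: the `simulate` helper and the plan durations are hypothetical, chosen to resemble one very long DFG plan queued ahead of short ones, and all plans are assumed to be enqueued at time 0.

```python
import heapq

def simulate(plans, num_threads):
    """Return {name: finish_time_ms} for FIFO plans run on num_threads workers.

    plans: list of (name, compile_ms), all enqueued at time 0 in this order.
    Toy model only: a real worklist receives plans over time.
    """
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * num_threads
    heapq.heapify(free_at)
    finish = {}
    for name, cost in plans:
        start = heapq.heappop(free_at)  # earliest available worker takes the plan
        end = start + cost
        finish[name] = end
        heapq.heappush(free_at, end)
    return finish

# Hypothetical durations: one huge plan ahead of two small ones.
plans = [("big", 450.0), ("small1", 8.0), ("small2", 5.0)]
one = simulate(plans, 1)
two = simulate(plans, 2)
print("1 thread :", one)
print("2 threads:", two)
```

With one worker the small plans cannot finish before the big one does; with two workers they complete on the second worker while the big plan is still compiling.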

Sampling results:

with 1 thread
Sampling rate: 1000 microseconds
Hottest bytecodes as <numSamples   'functionName#hash:JITType:bytecodeIndex'>
    46    'a8#C9umOn:Baseline:1838'
    23    'ArrayBuffer#<nil>:None:<nil>'
    20    'a6#A8KgFt:FTL:786'
    18    'eval#<nil>:None:<nil>'
    16    'a6#A8KgFt:FTL:1984'
    10    'a1#CutRMq:Baseline:52645'
     9    'a1#CutRMq:Baseline:49419'
     9    'a2#BZ2PSk:Baseline:719'
     8    'a1#CutRMq:Baseline:50767'
     8    'a6#A8KgFt:FTL:1994'

with 2 threads
Sampling rate: 1000 microseconds
Hottest bytecodes as <numSamples   'functionName#hash:JITType:bytecodeIndex'>
    29    'a1#CutRMq:Baseline:52645'
    25    'ArrayBuffer#<nil>:None:<nil>'
    20    'a6#A8KgFt:FTL:786'
    18    'eval#<nil>:None:<nil>'
    17    'a1#CutRMq:Baseline:49134'
    15    'a6#A8KgFt:FTL:1984'
    14    'a1#CutRMq:Baseline:50767'
    14    'a1#CutRMq:Baseline:49419'
     9    'a1#CutRMq:Baseline:52644'
     9    'a2#BZ2PSk:Baseline:719'
     5    'a6#A8KgFt:FTL:756'
     5    'tearDownZlib#A5Gpvi:LLInt:14'


Compile time report. DFG compilation of a1# takes too long and prevents a8# from being compiled.
Optimized l#ESs9VR:[0x7f69f4e8e7f0->0x7f69f4e729e0, LLIntFunctionCall, 2409] with Baseline JIT into 45152 bytes in 0.796875 ms.
Optimized a2#BZ2PSk:[0x7f69f4e4a5a0->0x7f69f4e673e0, LLIntFunctionCall, 808] with Baseline JIT into 15136 bytes in 0.340088 ms.
Optimized l#ESs9VR:[0x7f69f4e48790->0x7f69f4e8e7f0->0x7f69f4e729e0, NoneFunctionCall, 2409] using DFGMode with DFG into 13792 bytes in 7.436035 ms.
Optimized a2#BZ2PSk:[0x7f69f4e4a7f0->0x7f69f4e4a5a0->0x7f69f4e673e0, NoneFunctionCall, 808] using DFGMode with DFG into 2176 bytes in 2.496094 ms.
Optimized bm#DOYUng:[0x7f69f4e4aee0->0x7f69f4e66760, LLIntFunctionCall, 296] with Baseline JIT into 8416 bytes in 0.138916 ms.
Optimized bm#DOYUng:[0x7f69f4e4b130->0x7f69f4e4aee0->0x7f69f4e66760, NoneFunctionCall, 296] using DFGMode with DFG into 1440 bytes in 0.972168 ms.
Optimized bh#BlYBBE:[0x7f69f4e4bcc0->0x7f69f4e66a80, LLIntFunctionCall, 2636] with Baseline JIT into 56704 bytes in 0.601074 ms.
Optimized bn#ELS1xv:[0x7f69f4e4b5d0->0x7f69f4e666c0, LLIntFunctionCall, 364] with Baseline JIT into 9568 bytes in 0.124756 ms.
Optimized bn#ELS1xv:[0x7f69f4e240a0->0x7f69f4e4b5d0->0x7f69f4e666c0, NoneFunctionCall, 364] using DFGMode with DFG into 1600 bytes in 1.081055 ms.
Optimized a6#A8KgFt:[0x7f69f4e242f0->0x7f69f4e67160, LLIntFunctionCall, 2201] with Baseline JIT into 48512 bytes in 0.547852 ms.
Optimized a8#C9umOn:[0x7f69f4e4b820->0x7f69f4e67020, LLIntFunctionCall, 8048] with Baseline JIT into 174368 bytes in 0.941162 ms.
Optimized a6#A8KgFt:[0x7f69f4e24540->0x7f69f4e242f0->0x7f69f4e67160, NoneFunctionCall, 2201] using DFGMode with DFG into 5760 bytes in 6.226074 ms.
Optimized a8#C9umOn:[0x7f69f4e24790->0x7f69f4e4b820->0x7f69f4e67020, NoneFunctionCall, 8048] using DFGMode with DFG into 12192 bytes in 13.553955 ms.
Optimized a3#Cd4San:[0x7f69f4e4ba70->0x7f69f4e67340, LLIntFunctionCall, 2826] with Baseline JIT into 62016 bytes in 0.406006 ms.
Optimized a3#Cd4San:[0x7f69f4e249e0->0x7f69f4e4ba70->0x7f69f4e67340, NoneFunctionCall, 2826] using DFGMode with DFG into 6976 bytes in 5.264893 ms.
Optimized bb#B2qBXo:[0x7f69f4e250d0->0x7f69f4e66e40, LLIntFunctionCall, 8884] with Baseline JIT into 197824 bytes in 1.996094 ms.
Optimized be#AUXYYm:[0x7f69f4e25320->0x7f69f4e66c60, LLIntFunctionCall, 829] with Baseline JIT into 19712 bytes in 0.245117 ms.
Optimized bd#B8xz5T:[0x7f69f4e25570->0x7f69f4e66d00, LLIntFunctionCall, 5218] with Baseline JIT into 117024 bytes in 1.013916 ms.
Optimized bc#EH8dff:[0x7f69f4e257c0->0x7f69f4e66da0, LLIntFunctionCall, 4558] with Baseline JIT into 102112 bytes in 0.533203 ms.
Optimized bh#BlYBBE:[0x7f69f4e24c30->0x7f69f4e4bcc0->0x7f69f4e66a80, NoneFunctionCall, 2636] using DFGMode with DFG into 5984 bytes in 6.409180 ms.
Optimized a7#EU5CMD:[0x7f69f4e4b380->0x7f69f4e670c0, LLIntFunctionCall, 1645] with Baseline JIT into 39008 bytes in 0.395020 ms.
Optimized bc#EH8dff:[0x7f69f4e25a10->0x7f69f4e257c0->0x7f69f4e66da0, NoneFunctionCall, 4558] using DFGMode with DFG into 9664 bytes in 6.846924 ms.
Optimized bj#DiUrAc:[0x7f69f4e26350->0x7f69f4e66940, LLIntFunctionCall, 4466] with Baseline JIT into 104096 bytes in 0.672119 ms.
Optimized a1#CutRMq:[0x7f69f4e4aa40->0x7f69f4e67480, LLIntFunctionCall, 61380] with Baseline JIT into 1300832 bytes in 14.934082 ms.
Optimized bb#B2qBXo:[0x7f69f4e25c60->0x7f69f4e250d0->0x7f69f4e66e40, NoneFunctionCall, 8884] using DFGMode with DFG into 20320 bytes in 21.968994 ms.
Optimized bj#DiUrAc:[0x7f69f4e26ee0->0x7f69f4e26350->0x7f69f4e66940, NoneFunctionCall, 4466] using DFGMode with DFG into 13120 bytes in 11.560059 ms.
Optimized bk#AB4XcH:[0x7f69f4e489e0->0x7f69f4e668a0, LLIntFunctionCall, 21639] with Baseline JIT into 502080 bytes in 4.144043 ms.
Optimized a6#A8KgFt:[0x7f69f4e26c90->0x7f69f4e242f0->0x7f69f4e67160, NoneFunctionCall, 2201] using FTLMode with FTL into 3008 bytes in 36.634033 ms (DFG: 22.869141, B3: 13.764893).
Optimized bn#ELS1xv:[0x7f69f4e27a70->0x7f69f4e4b5d0->0x7f69f4e666c0, NoneFunctionCall, 364] using FTLMode with FTL into 928 bytes in 5.938965 ms (DFG: 3.736084, B3: 2.202881).
Optimized a3#Cd4San:[0x7f69f4e27cc0->0x7f69f4e4ba70->0x7f69f4e67340, NoneFunctionCall, 2826] using FTLMode with FTL into 4224 bytes in 42.736084 ms (DFG: 25.821045, B3: 16.915039).
Optimized bf#AVXdLH:[0x7f69f4e4ac90->0x7f69f4e66bc0, LLIntFunctionCall, 88] with Baseline JIT into 1824 bytes in 0.065186 ms.
Optimized bg#BK1nBq:[0x7f69f4e25eb0->0x7f69f4e66b20, LLIntFunctionCall, 9043] with Baseline JIT into 201536 bytes in 1.428955 ms.
Optimized bm#DOYUng:[0x7f69d88e4540->0x7f69f4e4aee0->0x7f69f4e66760, NoneFunctionCall, 296] using FTLMode with FTL into 928 bytes in 5.255859 ms (DFG: 3.124023, B3: 2.131836).
Optimized ba#DFv2Af:[0x7f69f4e24e80->0x7f69f4e66ee0, LLIntFunctionCall, 5455] with Baseline JIT into 118848 bytes in 1.045898 ms.
Optimized a1#CutRMq:[0x7f69f4e27130->0x7f69f4e4aa40->0x7f69f4e67480, NoneFunctionCall, 61380] using DFGMode with DFG into 124480 bytes in 474.189209 ms.
Optimized bc#EH8dff:[0x7f69f4e27380->0x7f69f4e257c0->0x7f69f4e66da0, NoneFunctionCall, 4558 (DidTryToEnterInLoop)] using DFGMode with DFG into 9696 bytes in 4.393066 ms.
Optimized a8#C9umOn:[0x7f69f4e275d0->0x7f69f4e4b820->0x7f69f4e67020, NoneFunctionCall, 8048 (DidTryToEnterInLoop)] using DFGMode with DFG into 16864 bytes in 7.958984 ms.
Optimized be#AUXYYm:[0x7f69f4e27820->0x7f69f4e25320->0x7f69f4e66c60, NoneFunctionCall, 829] using DFGMode with DFG into 2400 bytes in 1.468994 ms.
Optimized bd#B8xz5T:[0x7f69d88e40a0->0x7f69f4e25570->0x7f69f4e66d00, NoneFunctionCall, 5218] using DFGMode with DFG into 11520 bytes in 5.637695 ms.
Optimized a7#EU5CMD:[0x7f69d88e42f0->0x7f69f4e4b380->0x7f69f4e670c0, NoneFunctionCall, 1645] using DFGMode with DFG into 4928 bytes in 1.091064 ms.
Optimized bf#AVXdLH:[0x7f69d88e4790->0x7f69f4e4ac90->0x7f69f4e66bc0, NoneFunctionCall, 88] using DFGMode with DFG into 864 bytes in 0.181885 ms.
Optimized _llvm_bswap_i32#BhzUu3:[0x7f69f4e265a0->0x7f69f4e73980, LLIntFunctionCall, 66] with Baseline JIT into 2496 bytes in 0.056152 ms.
Optimized bh#BlYBBE:[0x7f69d88e4c30->0x7f69f4e4bcc0->0x7f69f4e66a80, NoneFunctionCall, 2636] using FTLMode with FTL into 2784 bytes in 20.094971 ms (DFG: 11.422119, B3: 8.672852).
Optimized a8#C9umOn:[0x7f69d88e49e0->0x7f69f4e4b820->0x7f69f4e67020, NoneFunctionCall, 8048 (DidTryToEnterInLoop)] using FTLMode with FTL into 8544 bytes in 94.714844 ms (DFG: 68.500000, B3: 26.214844).
Optimized bc#EH8dff:[0x7f69d88e50d0->0x7f69f4e257c0->0x7f69f4e66da0, NoneFunctionCall, 4558 (DidTryToEnterInLoop)] using FTLMode with FTL into 3936 bytes in 20.987061 ms (DFG: 12.393311, B3: 8.593750).
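
A report like the one above can be summarized per tier with a small script. This is a sketch under assumptions: the regular expression is inferred from the line shapes in this report rather than from any documented format, and `parse`, `totals`, and `slowest` are names made up here. The sample holds four lines copied from the report.

```python
import re
from collections import defaultdict

# Pattern inferred from the report lines above (an assumption, not a spec).
LINE_RE = re.compile(
    r"^Optimized (\S+?):\[.*?\].*? with (Baseline JIT|DFG|FTL) "
    r"into (\d+) bytes in ([\d.]+) ms"
)

def parse(lines):
    """Yield (function, tier, code_bytes, ms) for each line that matches."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            yield m.group(1), m.group(2), int(m.group(3)), float(m.group(4))

sample = [
    "Optimized a8#C9umOn:[0x7f69f4e4b820->0x7f69f4e67020, LLIntFunctionCall, 8048] with Baseline JIT into 174368 bytes in 0.941162 ms.",
    "Optimized a8#C9umOn:[0x7f69f4e24790->0x7f69f4e4b820->0x7f69f4e67020, NoneFunctionCall, 8048] using DFGMode with DFG into 12192 bytes in 13.553955 ms.",
    "Optimized a1#CutRMq:[0x7f69f4e27130->0x7f69f4e4aa40->0x7f69f4e67480, NoneFunctionCall, 61380] using DFGMode with DFG into 124480 bytes in 474.189209 ms.",
    "Optimized a8#C9umOn:[0x7f69d88e49e0->0x7f69f4e4b820->0x7f69f4e67020, NoneFunctionCall, 8048 (DidTryToEnterInLoop)] using FTLMode with FTL into 8544 bytes in 94.714844 ms (DFG: 68.500000, B3: 26.214844).",
]

records = list(parse(sample))

# Total compile time per tier.
totals = defaultdict(float)
for _, tier, _, ms in records:
    totals[tier] += ms

# The single most expensive compilation in the sample.
slowest = max(records, key=lambda r: r[3])

for tier, ms in sorted(totals.items()):
    print(f"{tier:12s} {ms:10.3f} ms")
print("slowest:", slowest[0], slowest[1], slowest[3], "ms")
```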
Comment 1 Yusuke Suzuki 2017-09-01 04:13:59 PDT
BTW, a1# takes too long to compile in DFG. The log follows.

DFG(Plan) compiling a1#CutRMq:[0x7fd966b37820->0x7fd966b4ac90->0x7fd966b67480, NoneFunctionCall, 61380] with DFGMode, number of instructions = 61380
Phase live catch variable preservation phase took 0.0000 ms
Phase CPS rethreading took 51.5908 ms
Phase unification took 14.2761 ms
Phase prediction injection took 0.0378 ms
Phase static execution count estimation took 1.9519 ms
Phase backwards propagation took 2.3188 ms
Phase prediction propagation took 4.6748 ms
Phase fixup took 5.0251 ms
Phase invalidation point injection took 1.4810 ms
Phase structure check hoisting took 2.5803 ms
Phase strength reduction took 1.1443 ms
Phase CPS rethreading took 0.0000 ms
Phase control flow analysis took 83.9871 ms
Phase constant folding took 27.8821 ms
Phase CFG simplification took 2.1240 ms
Phase local common subexpression elimination took 8.1282 ms
Phase CPS rethreading took 55.1277 ms
Phase varargs forwarding took 0.2300 ms
Phase control flow analysis took 63.8518 ms
Phase constant folding took 0.0420 ms
Phase tier-up check injection took 0.0020 ms
Phase fast store barrier insertion took 4.7820 ms
Phase store barrier fencing took 1.4021 ms
Phase clean up took 0.5210 ms
Phase CPS rethreading took 0.0000 ms
Phase dead code elimination took 22.2881 ms
Phase phantom insertion took 6.8411 ms
Phase stack layout took 0.5432 ms
Phase virtual register allocation took 0.4551 ms
Phase watchpoint collection took 0.1611 ms
Optimized a1#CutRMq:[0x7fd966b37820->0x7fd966b4ac90->0x7fd966b67480, NoneFunctionCall, 61380] using DFGMode with DFG into 124576 bytes in 447.346924 ms.
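
Since several phases run more than once, it can help to sum the log per phase name. The sketch below does that for a subset of the lines above (the repeated phases plus two others); the `Phase ... took ... ms` pattern is inferred from this log, and `phase_totals` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pattern inferred from the phase log above (an assumption, not a spec).
PHASE_RE = re.compile(r"^Phase (.+) took ([\d.]+) ms$")

def phase_totals(lines):
    """Sum the time of each phase over all of its runs."""
    totals = defaultdict(float)
    for line in lines:
        m = PHASE_RE.match(line.strip())
        if m:
            totals[m.group(1)] += float(m.group(2))
    return dict(totals)

# Subset of the log above, copied verbatim.
log = """\
Phase CPS rethreading took 51.5908 ms
Phase unification took 14.2761 ms
Phase CPS rethreading took 0.0000 ms
Phase control flow analysis took 83.9871 ms
Phase constant folding took 27.8821 ms
Phase CPS rethreading took 55.1277 ms
Phase control flow analysis took 63.8518 ms
Phase constant folding took 0.0420 ms
Phase CPS rethreading took 0.0000 ms
Phase dead code elimination took 22.2881 ms
""".splitlines()

totals = phase_totals(log)
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
for name, ms in ranked:
    print(f"{ms:9.4f} ms  {name}")
```

In this subset, control flow analysis and CPS rethreading are the two largest contributors to the a1# compile time.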
Comment 2 Yusuke Suzuki 2017-09-01 04:42:30 PDT
Created attachment 319592 [details]
Patch
Comment 3 Yusuke Suzuki 2017-09-01 21:17:51 PDT
Comment on attachment 319592 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=319592&action=review

> Source/JavaScriptCore/ChangeLog:23
> +        change significantly improves Octane/zlib performance.

Some more rationale for this choice. One design worth considering to alleviate this situation is making DFG compilation interruptible.
For example, once a compilation takes longer than 30 ms, DFG stops the plan and enqueues it at the end of the compilation queue.
The plan is resumed later.
But this does not solve the problem well: the compilation queue gets stuck repeatedly on the same super heavy DFG plan.

I think adding one more thread is the reasonable solution here. While FTL gets many threads (~8), DFG has only 1 thread.
This makes DFG the bottleneck in getting hot JS compiled through to the FTL pipeline. Of course, even with 2 threads, DFG may get stuck.
But that becomes less likely, since it requires two super heavy DFG compilations at the same time.
I'm not sure whether 2 is enough, but in any case one thread for DFG is too few.
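
The interruptible design mentioned above (stop a plan after a time budget and requeue it at the tail) can be sketched as a toy single-worker scheduler. Everything here is hypothetical: `time_sliced`, the arrival times, and the costs are made up for illustration, and the model ignores the cost of making a DFG plan resumable.

```python
from collections import deque

def time_sliced(plans, budget_ms):
    """Run plans on one worker; a plan that exceeds budget_ms is requeued at the tail.

    plans: list of (name, arrival_ms, cost_ms), sorted by arrival time.
    Returns {name: finish_ms}. Use float("inf") as budget for plain FIFO.
    """
    pending = deque(plans)  # plans that have not arrived yet
    queue = deque()         # (name, remaining_ms) waiting to run
    now, finish = 0.0, {}
    while pending or queue:
        if not queue:       # worker is idle until the next arrival
            now = max(now, pending[0][1])
        while pending and pending[0][1] <= now:
            name, _, cost = pending.popleft()
            queue.append((name, cost))
        name, remaining = queue.popleft()
        run = min(remaining, budget_ms)
        now += run
        # Plans that arrived during this slice are queued first, so the
        # interrupted plan really goes to the end of the queue.
        while pending and pending[0][1] <= now:
            n, _, c = pending.popleft()
            queue.append((n, c))
        if remaining > run:
            queue.append((name, remaining - run))
        else:
            finish[name] = now
    return finish

# Hypothetical workload: one heavy plan, then two small plans arriving later.
plans = [("heavy", 0.0, 450.0), ("s1", 10.0, 8.0), ("s2", 100.0, 8.0)]
sliced = time_sliced(plans, 30.0)
fifo = time_sliced(plans, float("inf"))
print("30 ms budget:", sliced)
print("no budget   :", fifo)
```

In this model every small plan that arrives while the heavy plan holds the worker still waits for the current slice to end, and the heavy plan itself finishes later than it would uninterrupted.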
Comment 4 Saam Barati 2017-09-04 14:58:45 PDT
Comment on attachment 319592 [details]
Patch

Let's give it a try. I'll keep an eye on benchmarks that may be sensitive to the increased compilation load.
Comment 5 Yusuke Suzuki 2017-09-04 16:42:48 PDT
Comment on attachment 319592 [details]
Patch

OK, let's check it!
Comment 6 WebKit Commit Bot 2017-09-04 17:11:53 PDT
Comment on attachment 319592 [details]
Patch

Clearing flags on attachment: 319592

Committed r221597: <http://trac.webkit.org/changeset/221597>
Comment 7 WebKit Commit Bot 2017-09-04 17:11:54 PDT
All reviewed patches have been landed.  Closing bug.
Comment 8 Yusuke Suzuki 2017-09-04 20:04:35 PDT
https://arewefastyet.com/#machine=29&view=single&suite=octane&subtest=zlib Octane/zlib is improved in arewefastyet.
Comment 9 Saam Barati 2017-09-14 15:49:39 PDT
This bug or https://bugs.webkit.org/show_bug.cgi?id=170007 may have regressed ARES-6 by 3%. Can you look into it?
Comment 10 Radar WebKit Bug Importer 2017-09-27 13:00:42 PDT
<rdar://problem/34694474>